updated

Commit ecf5c25b authored 4 years ago by Laura Cook
--- a/README.md
+++ b/README.md
@@ -171,59 +171,85 @@ chip/

 ### 1. FastQC on raw reads

-```
-fastqc
-```
-
 ### 2. Alignment

-```
-bowtie2-build
-
-```
-
 ### 3. Filtering

-```
-SAMtools sort
-SAMtools view
+### 4. Alignment QC & Library Complexity

-picard MarkDuplicates
-```
+### 5. deepTools

-### 4. Alignment QC & Library Complexity
+### 7. phantomPeakQuals

-```
-SAMtool
+### 8. Call narrow peaks (MACS2)
+
+__Effective Genome Length__

-picard
+The effective genome size for a given read length can be approximated with `unique-kmers.py` from the khmer package. The script estimates the number of unique k-mers of a specified length, which approximates the uniquely mappable portion of the genome (i.e. it excludes highly repetitive regions). See https://khmer.readthedocs.io/en/v2.1.1/user/scripts.html

-preseq
+This approach is suggested in the deepTools documentation: https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html

+Install khmer:
+```{bash eval=FALSE}
+pip3 install khmer
 ```

-### 5. deepTools
+Run `unique-kmers.py` on the dunnart genome for a read length of 150 bp:

-```
-deepTools
+```{bash eval=FALSE}
+/usr/local/bin/unique-kmers.py -k 150 dunnart_pseudochr_vs_mSarHar1.11_v1.fa
 ```

-### 7. phantomPeakQuals
+<details><summary>__Total estimated number of unique 150-mers: 3074798085__</summary>
+<p>

-```
-spp
-```
+| 150-mers                                             |                     |                   |                       |
+|------------------------------------------------------|---------------------|-------------------|-----------------------|
+| 3074798085                                           |                     |                   |                       |
+|                                                      |                     |                   |                       |
+| number of unique k-mers:                             | 3074798085          |                   |                       |
+| false positive rate:                                 | 0.010               |                   |                       |
+|                                                      |                     |                   |                       |
+| If you have expected false positive rate to achieve: |                     |                   |                       |
+| expected_fp                                          | number_hashtable(Z) | size_hashtable(H) | expected_memory_usage |
+| 0.100                                                | 3                   | 4.928212e+09      | 1.478464e+10          |
+| 0.200                                                | 2                   | 5.187050e+09      | 1.037410e+10          |
+| 0.300                                                | 1                   | 8.620729e+09      | 8.620729e+09          |
+| 0.400                                                | 1                   | 6.019271e+09      | 6.019271e+09          |
+| 0.500                                                | 1                   | 4.435996e+09      | 4.435996e+09          |
+| 0.600                                                | 1                   | 3.355701e+09      | 3.355701e+09          |
+| 0.700                                                | 1                   | 2.553877e+09      | 2.553877e+09          |
+| 0.800                                                | 1                   | 1.910479e+09      | 1.910479e+09          |
+| 0.900                                                | 1                   | 1.335368e+09      | 1.335368e+09          |
+|                                                      |                     |                   |                       |
+| If you have expected memory to use:                  |                     |                   |                       |
+| expected_memory_usage                                | number_hashtable(Z) | size_hashtable(H) | expected_fp           |
+| 1.000000e+09                                         | 1                   | 1.000000e+09      | 0.954                 |
+| 5.000000e+09                                         | 1                   | 5.000000e+09      | 0.459                 |
+| 1.000000e+10                                         | 2                   | 5.000000e+09      | 0.211                 |
+| 2.000000e+10                                         | 4                   | 5.000000e+09      | 0.045                 |
+| 5.000000e+10                                         | 11                  | 4.545455e+09      | 0.000                 |
+| 1.000000e+11                                         | 22                  | 4.545455e+09      | 0.000                 |
+| 2.000000e+11                                         | 45                  | 4.444444e+09      | 0.000                 |
+| 3.000000e+11                                         | 67                  | 4.477612e+09      | 0.000                 |
+| 4.000000e+11                                         | 90                  | 4.444444e+09      | 0.000                 |
+| 5.000000e+11                                         | 112                 | 4.464286e+09      | 0.000                 |
+| 1.000000e+12                                         | 225                 | 4.444444e+09      | 0.000                 |
+| 2.000000e+12                                         | 450                 | 4.444444e+09      | 0.000                 |
+| 5.000000e+12                                         | 1127                | 4.436557e+09      | 0.000                 |

-### 8. Call narrow peaks (MACS2)

-```
-macs2 callpeaks
-```
+</p>
+</details>
+
+
+I will use the total estimated number of unique 150-mers (3074798085) as my estimate of the effective genome size.
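
As a rough sketch of how this estimate might be passed to downstream tools, the snippet below extracts the count from the `unique-kmers.py` summary line quoted above and prints (does not run) example MACS2 and deepTools commands. The BAM file names, the output names, and the `-n` label are hypothetical placeholders, not files from this repository. `-g` (MACS2) and `--effectiveGenomeSize` (deepTools `bamCoverage`) are the options that accept a genome size.

```bash
set -eu

# Summary line as reported by unique-kmers.py for the run above.
summary='Total estimated number of unique 150-mers: 3074798085'

# The count is the last whitespace-separated field of the summary line.
effective_genome_size=$(printf '%s\n' "$summary" | awk '{ print $NF }')

# Hypothetical input names; replace with the real ChIP and input BAM files.
chip_bam='chip.bam'
input_bam='input.bam'

# Print the commands instead of running them, so the sketch has no side effects.
echo macs2 callpeak -t "$chip_bam" -c "$input_bam" \
    -g "$effective_genome_size" -n example_peaks

echo bamCoverage -b "$chip_bam" -o chip.bw \
    --normalizeUsing RPGC --effectiveGenomeSize "$effective_genome_size"
```

Removing the leading `echo` from each command would run it for real, assuming MACS2 and deepTools are installed and the placeholder file names are replaced.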

 ### 9. Create consensus peaksets
+
 ### 10. Annotate peaks relative to gene features (HOMER)
-### 11. Present QC for raw read, alignment, peak-calling in MultiQC

+### 11. Present QC for raw read, alignment, peak-calling in MultiQC

 ### 12. Plot DAG