We can approximate the effective genome size for a given read length using `unique-kmers.py` from the khmer package. The script estimates the number of unique k-mers of a specified length k, and this count can be used to infer the total uniquely mappable genome size (i.e. it does not include highly repetitive regions). See https://khmer.readthedocs.io/en/v2.1.1/user/scripts.html
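
To make the idea of "unique k-mers" concrete, here is a minimal Python sketch that counts distinct k-mers exactly on a toy sequence. It is only an illustration: `count_unique_kmers` and the toy sequence are hypothetical, and it is not how `unique-kmers.py` works internally (khmer uses a probabilistic estimator so that whole genomes fit in memory). Skipping k-mers that contain `N` is an assumption made here for the sketch.

```python
def count_unique_kmers(sequence: str, k: int) -> int:
    """Exact count of distinct k-mers in one sequence (toy illustration only)."""
    sequence = sequence.upper()
    seen = set()
    # Slide a window of width k along the sequence.
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption for this sketch: ignore k-mers containing ambiguous bases.
        if "N" not in kmer:
            seen.add(kmer)
    return len(seen)


if __name__ == "__main__":
    toy = "ACGTACGTACGTTTGACCA"  # hypothetical sequence with a repeated ACGT motif
    print(count_unique_kmers(toy, 4))
```

The repeated `ACGT` motif contributes each of its k-mers only once, which is why the unique k-mer count is smaller than the total number of windows in a repetitive sequence.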
...
...
@@ -277,48 +272,24 @@ Run `unique-kmers.py` on dunnart genome for read length of 150bp:
Estimated number of unique 150-mers in /Users/lauracook/../../Volumes/macOS/genomes/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr.fasta: 2740338543
Total estimated number of unique 150-mers: 2740338543
```
<details><summary>_Total estimated number of unique 150-mers: 3074798085_</summary>
Coverage is normalised to reads per genomic content (RPGC), i.e. scaled to 1x coverage.
This produces a coverage file.
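
As a rough sketch of what 1x normalisation means, the scaling can be written as the inverse of the current average coverage, where coverage is (mapped reads × fragment length) / effective genome size. The function name `rpgc_scale_factor`, the read count, and the fragment length below are hypothetical values chosen for illustration; the effective genome size is the 150-mer estimate reported above. The actual tool applies additional options (bin size, read extension, filtering), so treat this as an approximation rather than its implementation.

```python
def rpgc_scale_factor(mapped_reads: int, fragment_length: int,
                      effective_genome_size: int) -> float:
    """Factor that rescales per-bin read counts to an average of 1x coverage."""
    # Average depth if the reads were spread evenly over the mappable genome.
    current_coverage = mapped_reads * fragment_length / effective_genome_size
    return 1.0 / current_coverage


if __name__ == "__main__":
    factor = rpgc_scale_factor(
        mapped_reads=20_000_000,              # hypothetical library size
        fragment_length=200,                  # hypothetical mean fragment length
        effective_genome_size=2_740_338_543,  # unique 150-mer estimate from above
    )
    print(f"scale factor: {factor:.3f}")
```
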
### rule deeptools_fingerprint:
...
...
@@ -382,6 +355,8 @@ Cross-correlation analysis is done on a filtered (but not-deduped) and subsample
### rule phantomPeakQuals:
# 7. Call peaks (MACS2)
...
...
@@ -428,7 +403,13 @@ therefore if amount that overlaps between each replicate divided by the length o
### rule overlap_peaks_H3K27ac:
ENCODE files:
| File format | Information contained in file | File description | Notes |
|-|-|-|-|
| bigWig | fold change over control, signal p-value | Two versions of nucleotide resolution signal coverage tracks. | The signal is expressed in two ways: as fold-over control at each position, and as a p-value to reject the null hypothesis that the signal at that location is present in the control. |
| bed and bigBed (narrowPeak) | peaks | Relaxed peak calls for each replicate individually and for both replicates' reads pooled together. | These peaks are thresholded to sample enough noise in the experiment for efficient statistical comparison of replicates in subsequent steps; as such, many false positives are expected to be present. They are not meant to be interpreted as definitive binding events, but are rather intended to be used as input for subsequent statistical comparison of replicates. |
| bed and bigBed (narrowPeak) | replicated peaks | The set of peak calls from the pooled replicates. | These peaks are either observed in both replicates, or are observed in two pseudoreplicates. Pseudoreplicates are peak sets called on half of the pooled reads, chosen at random without replacement. |
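
The pseudoreplicate definition in the table can be sketched as a random split of the pooled reads into two halves without replacement. This is a toy, in-memory illustration only: `split_pseudoreplicates` is a hypothetical helper, the reads are stand-in IDs, and a real pipeline would shuffle and split alignment files on disk instead.

```python
import random


def split_pseudoreplicates(pooled_reads, seed=0):
    """Randomly split pooled reads into two disjoint halves (no replacement)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(pooled_reads)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # Every read lands in exactly one pseudoreplicate.
    return shuffled[:half], shuffled[half:]


if __name__ == "__main__":
    pooled = [f"read_{i}" for i in range(10)]  # hypothetical read IDs
    pr1, pr2 = split_pseudoreplicates(pooled)
    print(len(pr1), len(pr2))
```

Peaks would then be called on each half separately, and peaks observed in both halves counted as replicated.
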
# Plot DAG
...
...
@@ -436,3 +417,8 @@ therefore if amount that overlaps between each replicate divided by the length o