From e897205f27d3a88095f6c3580d34370df93bdd23 Mon Sep 17 00:00:00 2001
From: Laura Cook <l.cook2@student.unimelb.edu.au>
Date: Fri, 24 Jul 2020 17:07:40 +1000
Subject: [PATCH] minor formatting changes

---
 dunnart/README.md | 74 +++++++++++++++++++----------------------------
 1 file changed, 30 insertions(+), 44 deletions(-)

diff --git a/dunnart/README.md b/dunnart/README.md
index bbfdd6c..15beabd 100644
--- a/dunnart/README.md
+++ b/dunnart/README.md
@@ -162,10 +162,8 @@ chipseq/
 ├── genomes/
 ├── results/
 │ ├── bowtie2/
-│ └── fastQC/
-│ └── deepTools/
+│ └── qc/
 │ └── macs2/
-│ └── phantomPeaks/
 ├── envs/
 └── configs/
 ```
@@ -253,9 +251,6 @@ __Parameters__
 - `-2`: pair 2
 
-
-
-
 __Effective Genome Length__
 
 We can approximate effective genome size for various read lengths using the khmer program and `unique-kmers.py`. This will estimate the number of unique kmers (for a specified length kmer) which can be used to infer the total uniquely mappable genome. (I.e it doesn't include highly repetitive regions).
 https://khmer.readthedocs.io/en/v2.1.1/user/scripts.html
@@ -277,48 +272,24 @@ Run `unique-kmers.py` on dunnart genome for read length of 150bp:
 Estimated number of unique 150-mers in /Users/lauracook/../../Volumes/macOS/genomes/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr.fasta: 2740338543
 Total estimated number of unique 150-mers: 2740338543
 ```
-<details><summary>_Total estimated number of unique 150-mers: 3074798085_</summary>
-<p>
 
-| 150-mers | | | |
-|------------------------------------------------------|---------------------|-------------------|-----------------------|
-| 3074798085 | | | |
-| | | | |
-| number of unique k-mers: | 3074798085 | | |
-| false positive rate: | 0.010 | | |
-| | | | |
-| If you have expected false positive rate to achieve: | | | |
-| expected_fp | number_hashtable(Z) | size_hashtable(H) | expected_memory_usage |
-| 0.100 | 3 | 4.928212e+09 | 1.478464e+10 |
-| 0.200 | 2 | 5.187050e+09 | 1.037410e+10 |
-| 0.300 | 1 | 8.620729e+09 | 8.620729e+09 |
-| 0.400 | 1 | 6.019271e+09 | 6.019271e+09 |
-| 0.500 | 1 | 4.435996e+09 | 4.435996e+09 |
-| 0.600 | 1 | 3.355701e+09 | 3.355701e+09 |
-| 0.700 | 1 | 2.553877e+09 | 2.553877e+09 |
-| 0.800 | 1 | 1.910479e+09 | 1.910479e+09 |
-| 0.900 | 1 | 1.335368e+09 | 1.335368e+09 |
-| | | | |
-| If you have expected memory to use: | | | |
-| expected_memory_usage | number_hashtable(Z) | size_hashtable(H) | expected_fp |
-| 1.000000e+09 | 1 | 1.000000e+09 | 0.954 |
-| 5.000000e+09 | 1 | 5.000000e+09 | 0.459 |
-| 1.000000e+10 | 2 | 5.000000e+09 | 0.211 |
-| 2.000000e+10 | 4 | 5.000000e+09 | 0.045 |
-| 5.000000e+10 | 11 | 4.545455e+09 | 0.000 |
-| 1.000000e+11 | 22 | 4.545455e+09 | 0.000 |
-| 2.000000e+11 | 45 | 4.444444e+09 | 0.000 |
-| 3.000000e+11 | 67 | 4.477612e+09 | 0.000 |
-| 4.000000e+11 | 90 | 4.444444e+09 | 0.000 |
-| 5.000000e+11 | 112 | 4.464286e+09 | 0.000 |
-| 1.000000e+12 | 225 | 4.444444e+09 | 0.000 |
-| 2.000000e+12 | 450 | 4.444444e+09 | 0.000 |
-| 5.000000e+12 | 1127 | 4.436557e+09 | 0.000 |
+__Indexing genome file__
 
-</p>
-</details>
+Build Index
+
+__Load modules:__
+
+```{bash eval=FALSE}
+module load gcc/8.3.0
+module load bowtie2/2.3.5.1
+```
 
+__Build index__
+
+```{bash eval=FALSE}
+bowtie2-build /data/projects/punim0586/lecook/chip/reference_data/bowtie2/dunnart_pseudochr_vs_mSarHar1.11_v1.fasta
+```
 
 
 # 3. FILTERING
 
@@ -359,6 +330,8 @@ ChIP-seq Standards:
 
 ### rule deeptools_coverage:
 
+Normalised to the reads per genomic content (normalized to 1x coverage)
+Produces a coverage file
 
 ### rule deeptools_fingerprint:
 
@@ -382,6 +355,8 @@ Cross-correlation analysis is done on a filtered (but not-deduped) and subsample
 
 ### rule phantomPeakQuals:
 
+
+
 
 # 7. Call peaks (MACS2)
 
@@ -428,7 +403,13 @@ therefore if amount that overlaps between each replicate divided by the length o
 
 ### rule overlap_peaks_H3K27ac:
 
+ENCODE files:
 
+| File format | Information contained in file | File description | Notes |
+|-|-|-|-|
+| bigWig | fold change over control, signal p-value | Two versions of nucleotide resolution signal coverage tracks. | The signal is expressed in two ways: as fold-over control at each position, and as a p-value to reject the null hypothesis that the signal at that location is present in the control. |
+| bed and bigBed (narrowPeak) | peaks | Relaxed peak calls for each replicate individually and for both replicates' reads pooled together. | These peaks are thresholded to sample enough noise in the experiment for efficient statistical comparison of replicates in subsequent steps; as such, many false positives are expected to be present. They are not meant to be interpreted as definitive binding events, but are rather intended to be used as input for subsequent statistical comparison of replicates. |
+| bed and bigBed (narrowPeak) | replicated peaks | The set of peak calls from the pooled replicates. | These peaks are either observed in both replicates, or are observed in two pseudoreplicates. Pseudoreplicates are peak sets called on half of the pooled reads, chosen at random without replacement. |
 
 # Plot DAG
 
@@ -436,3 +417,8 @@ therefore if amount that overlaps between each replicate divided by the length o
 ```
 snakemake --dag | dot -Tsvg > dag.svg
 ```
+
+
+# Annotate peaks
+
+Create TxDb for use with Bioconductor packages
-- 
GitLab