updated notes

c5960ccf · Laura Cook · da44d6c7 · c5960ccf
Commit c5960ccf authored 4 years ago by Laura Cook
--- a/dunnart/README.md
+++ b/dunnart/README.md
@@ -333,6 +333,14 @@ ChIP-seq Standards:
 Normalised to the reads per genomic content (normalized to 1x coverage)
 Produces a coverage file

+The bigWig format is an indexed binary format useful for dense, continuous data that will be displayed in a genome browser as a graph/track, but also is used as input for some of the visualization commands in deepTools.
+
+- `normalizeUsing`: Possible choices: RPKM, CPM, BPM, RPGC. We will use BPM (Bins Per Million), which is similar to TPM in RNA-seq. BPM (per bin) = number of reads per bin / sum of all reads per bin (in millions).
+- `binSize`: size of bins in bases
+- `smoothLength`: defines a window, larger than the binSize, to average the number of reads over. This helps produce a more continuous plot.
+- `centerReads`: reads are centered with respect to the fragment length as specified by extendReads. This option is useful to get a sharper signal around enriched regions.
+
+

 ### rule deeptools_fingerprint:

@@ -356,12 +364,83 @@ Cross-correlation analysis is done on a filtered (but not-deduped) and subsample
 ### rule phantomPeakQuals:


+| col. | abbreviation     | description                                                                                         |
+|------|------------------|-----------------------------------------------------------------------------------------------------|
+| 1    | Filename         | tagAlign/BAM filename                                                                               |
+| 2    | numReads         | effective sequencing depth i.e. total number of mapped reads in input file                          |
+| 3    | estFragLen       | comma separated strand cross-correlation peak(s) in decreasing order of correlation.                |
+| 4    | corr_estFragLen  | comma separated strand cross-correlation value(s) in decreasing order (COL2 follows the same order) |
+| 5    | phantomPeak      | Read length/phantom peak strand shift                                                               |
+| 6    | corr_phantomPeak | Correlation value at phantom peak                                                                   |
+| 7    | argmin_corr      | strand shift at which cross-correlation is lowest                                                   |
+| 8    | min_corr         | minimum value of cross-correlation                                                                  |
+| 9    | NSC              | Normalized strand cross-correlation coefficient (NSC) = COL4 / COL8                                 |
+| 10   | RSC              | Relative strand cross-correlation coefficient (RSC) = (COL4 - COL8) / (COL6 - COL8)                 |
+| 11   | QualityTag       | Quality tag based on thresholded RSC (codes= -2:veryLow, -1:Low, 0:Medium, 1:High, 2:veryHigh)      |
+
+
+NSC; NSC>1.1 (higher values indicate more enrichment; 1 = no enrichment)
+
+RSC; RSC>0.8 (0 = no signal; <1 low quality ChIP; >1 high enrichment
+
+Quality tag based on thresholded RSC (codes: -2:veryLow,-1:Low,0:Medium,1:High; 2:veryHigh)
+
+| 1  | filename         | H3K4me3_repA | H3K4me3_repB | H3K27ac_repA | H3K27ac_repB |
+|----|------------------|--------------|--------------|--------------|--------------|
+| 2  | numReads         | 42648566     | 51061536     | 53601754     | 52186536     |
+| 3  | estFragLen       | 220          | 220          | 220          | 225          |
+| 4  | corr_estFragLen  | 0.582720253  | 0.655264803  | 0.572938445  | 0.546343334  |
+| 5  | phantomPeak      | 155          | 155          | 155          | 155          |
+| 6  | corr_phantomPeak | 0.5635504    | 0.6347016    | 0.556278     | 0.5294049    |
+| 7  | argmin_corr      | 1500         | 1500         | 1500         | 1500         |
+| 8  | min_corr         | 0.3157568    | 0.3268253    | 0.3713508    | 0.3778125    |
+| 9  | NSC              | 1.845472     | 2.004939     | 1.54285      | 1.44607      |
+| 10 | RSC              | 1.077362     | 1.066791     | 1.090092     | 1.111736     |
+| 11 | QualityTag       | 1            | 1            | 1            | 1            |
+
+> Take home from phantomPeakquals is that all replicates appear to be high quality. The NSC and RSC scores meet the requirements of the ENCODE guidelines. Suggests that there is good enrichment in the ChIP replicates.
+
+

 #  7. Call peaks (MACS2)

+__Input file options__
+
+- `-t`: The IP data file (this is the only REQUIRED parameter for MACS)
+- `-c`: The control or mock data file
+- `-f`: format of input file; Default is “AUTO” which will allow MACS to decide the format automatically.
+- `-f BAMPE`: Here, the fragments are defined by the paired alignments' ends, and there is no modelling or artificial extension. Singleton alignments are ignored. This is the preferred option for using only properly paired alignments.
+- `-g`: mappable genome size which is defined as the genome size which can be sequenced; some precompiled values provided.
+
+__Output arguments__
+
+- `--outdir`: MACS2 will save all output files into speficied folder for this option
+- `-n`:The prefix string for output files
+
+__Shifting model arguments__
+
+- `-s`: size of sequencing tags. Default, MACS will use the first 10 sequences from your input treatment file to determine it
+- `--bw`: The bandwidth which is used to scan the genome ONLY for model building. Can be set to the expected sonication fragment size.
+- `--mfold`: upper and lower limit for model building
+
+
+__Peak calling arguments__
+
+- `-q`: q-value (minimum FDR) cutoff, default 0.05
+- `--nolambda`: do not consider the local bias/lambda at peak candidate regions
+- `--broad`: broad peak calling
+
+- `--keep-dup all`: PCR duplicates have been removed by another program, such as Picard's MarkDuplicates. So need to specify to keep all duplicates. Default is to remove all potential PCR duplicates. Picard's MarkDuplicates is better.
+
+I've left all the shifting model and peak calling arguments as default
+

 ### rule call_peaks_macs2:

+- `peaks.narrowPeak`: BED6+4 format file which contains the peak locations together with peak summit, pvalue and qvalue
+- `peaks.xls`: a tabular file which contains information about called peaks. Additional information includes pileup and fold enrichment
+- `summits.bed`: peak summits locations for every peak. To find the motifs at the binding sites, this file is recommended
+

 #  8. Peak QC

@@ -403,13 +482,68 @@ therefore if amount that overlaps between each replicate divided by the length o
 ### rule overlap_peaks_H3K27ac:


-ENCODE files:
+# Consensus peaks and grouping into putative enhancers and promoters
+
+Separate overlapped replicate peak files:
+```
+less H3K27ac_overlap.narrowPeak| grep "A-3" > H3K27ac_overlap_repA.narrowPeak
+
+less H3K27ac_overlap.narrowPeak| grep "B-3" > H3K27ac_overlap_repB.narrowPeak
+
+less H3K4me3_overlap.narrowPeak| grep "A-2" > H3K4me3_overlap_repA.narrowPeak
+
+less H3K4me3_overlap.narrowPeak| grep "B-2" > H3K4me3_overlap_repB.narrowPeak
+```
+
+Keep only the intersected region:
+```
+bedtools intersect -a H3K27ac_overlap_repA.narrowPeak -b H3K27ac_overlap_repB.narrowPeak > H3K27ac_consensus.narrowPeak
+
+bedtools intersect -a H3K4me3_overlap_repA.narrowPeak -b H3K4me3_overlap_repB.narrowPeak > H3K4me3_consensus.narrowPeak
+```
+
+Find uniquely H3K4me3 sites (i.e. peaks that don't overlap with H3K27ac):
+
+```
+bedtools intersect -v -a macs2/H3K4me3_consensus.narrowPeak -b macs2/H3K27ac_consensus.narrowPeak > H3K4me3_only.narrowPeak
+```
+
+Find unique H3K27ac sites (i.e. peaks that don't overlap with H3K4me3):
+```
+bedtools intersect -v macs2/H3K27ac_consensus.narrowPeak -b macs2/H3K4me3_consensus.narrowPeak > H3K27ac_only.narrowPeak
+```
+
+Find peaks common between H3K27ac & H3K4me3 with a reciprocal overlap of at least 50%.
+```
+bedtools intersect -f 0.5 -r -a macs2/H3K4me3_consensus.narrowPeak -b macs2/H3K27ac_consensus.narrowPeak > H3K4me3_and_H3K27ac.narrowPeak
+```
+
+
+# Find overlap with TWARs
+
+## Get dunnart bed coordinates for TWARs
+
+Use BLASTn to blast Tasmanian devil TWAR sequences against dunnart genome to find coordinates for TWARs.
+
+```
+blastn -task blastn -num_threads 4 -db dunnart_pseudochr_vs_mSarHar1.11_v1.fasta -query sarHar1_twars.fa -out dunnart_TWARs.bed -outfmt "6 sseqid sstart send qseqid length"
+```
+
+## Intersect with H3K4me3 and H3K27ac
+
+```
+intersectBed -a H3K4me3_consensus.narrowPeak -b dunnart_TWARs.bed > twar_H3K4me3_overlap.bed
+```
+
+```
+intersectBed -a H3K27ac_consensus.narrowPeak -b dunnart_TWARs.bed > twar_H3K27ac_overlap.bed
+```
+
+
+
+
+

-| File format | Information contained in file | File description | Notes |
-|-|-|-|-|
-| bigWig | fold change over control, signal p-value | Two versions of nucleotide resolution signal coverage tracks.  | The signal is expressed in two ways: as fold-over control at each position, and as a p-value to reject the null hypothesis that the signal at that location is present in the control. |
-| bed and bigBed (narrowPeak) | peaks | Relaxed peak calls for each replicate individually and for both replicates' reads pooled together. | These peaks are thresholded to sample enough noise in the experiment for efficient statistical comparison of replicates in subsequent steps; as such, many false positives are expected to be present. They are not meant to be interpreted as definitive binding events, but are rather intended to be used as input for subsequent statistical comparison of replicates. |
-| bed and bigBed (narrowPeak) | replicated peaks | The set of peak calls from the pooled replicates. | These peaks are either observed in both replicates, or are observed in two pseudoreplicates. Pseudoreplicates are peak sets called on half of the pooled reads, chosen at random without replacement. |


 # Plot DAG
@@ -420,5 +554,3 @@ snakemake --dag | dot -Tsvg > dag.svg


 # Annotate peaks
-
-Create Tbxdb for use with Bioconducter packages