Skip to content
Snippets Groups Projects
Commit cd229abf authored by Laura Cook's avatar Laura Cook
Browse files

updated

parent 3c21d7fc
No related branches found
No related tags found
No related merge requests found
...@@ -122,18 +122,22 @@ __mm10.2bit__ - contains the complete mouse/mm10 genome sequence in the 2bit fil ...@@ -122,18 +122,22 @@ __mm10.2bit__ - contains the complete mouse/mm10 genome sequence in the 2bit fil
### lastZ ### lastZ
To align placental mammals, we used previously determined lastz parameters (K = 2400, L = 3000, Y = 9400, H = 2000, and the lastz default scor- ing matrix) that have a sufficient sensitivity to capture orthol- ogous exons To align placental mammals, we used previously determined lastz parameters (K = 2400, L = 3000, Y = 9400, H = 2000, and the lastz default scoring matrix) that have a sufficient sensitivity to capture orthologous exons
To align placental mammals, we used the lastz alignment parameters K = 2400, L = 3000, Y = 9400, H = 2000 and the lastz default scoring matrix, correspond- ing to parameter set 2 in Table 1. To align non-placental vertebrates, we used K = 2400, L = 3000, Y = 3400, H = 2000 and the HoxD55 scoring matrix. Citation: Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res. 2017;45(14):8369–77. To align placental mammals, we used the lastz alignment parameters K = 2400, L = 3000, Y = 9400, H = 2000 and the lastz default scoring matrix, correspond- ing to parameter set 2 in Table 1. To align vertebrates, we used K = 2400, L = 3000, Y = 3400, H = 2000 and the HoxD55 scoring matrix. Citation: Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res. 2017;45(14):8369–77.
Create commands for running lastZ for all scaffolds: `lastz.sh` Create commands for running lastZ for all scaffolds: `lastz.sh`
Run as an array on slurm: `array_wrapper.slurm` Run as an array on slurm: `array_wrapper.slurm`
#### Convert maf to axt-format #### Convert maf to axt-format
http://last.cbrc.jp/doc/maf-convert.html
``` ```
maf-convert axt my-alignments.maf > my-alignments.axt module load last/last/1066
maf-convert axt ${tr}.maf > ${tr}.axt
``` ```
...@@ -145,13 +149,62 @@ We use axtChain (http://www.soe.ucsc.edu/~kent; default parameters) to build co- ...@@ -145,13 +149,62 @@ We use axtChain (http://www.soe.ucsc.edu/~kent; default parameters) to build co-
axtChain -linearGap=loose mm10_smiCra1.axt mm10.2bit smiCra1.2bit mm10_smiCra1.chain axtChain -linearGap=loose mm10_smiCra1.axt mm10.2bit smiCra1.2bit mm10_smiCra1.chain
``` ```
### chainMergeSort
Merge short chains into longer ones, concatenate chains and sort
```
chainMergeSort *.chain > smiCra1_mm10.chain
```
Installing Genomic Alignment Tools (Hiller group)
```
git clone https://github.com/hillerlab/GenomeAlignmentTools.git
cd GenomeAlignmentTools/kent/src
make
# Kent binaries
PATH=$PATH:/data/projects/punim0586/lecook/chipseq-pipeline/cross_species/bin/GenomeAlignmentTools/kent/bin;export PATH
export KENTSRC_DIR=/data/projects/punim0586/lecook/chipseq-pipeline/cross_species/bin/GenomeAlignmentTools/kent/src/
cd /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/bin/GenomeAlignmentTools/src
export MACHTYPE=x86_64
make
PATH=$PATH:/data/projects/punim0586/lecook/chipseq-pipeline/cross_species/bin/GenomeAlignmentTools/src
```
### RepeatFiller ### RepeatFiller
https://github.com/hillerlab/GenomeAlignmentTools https://github.com/hillerlab/GenomeAlignmentTools
RepeatFiller [5] is a tool to incorporate newly-detected repeat-overlapping alignments into pairwise alignment chains [4]. Its runtime adds little to the computationally more expensive step of generating chains in pairwise whole-genome alignments. RepeatFiller circumvents the problem that considering all repeat-overlapping alignment seeds during whole genome alignment is computationally not feasible. Therefore, RepeatFiller only aligns local genomic regions that are bounded by colinear aligning blocks, as provided in the chains, which makes it feasible to consider all seeds including those that overlap repetitive regions. RepeatFiller application to mammalian genome alignment chains can add between 22 and 84 Mb of previously-undetected alignments that mostly originate from transposable elements [5]. This helps to comprehensively align repetitive regions and improves the annotation of conserved non-coding elements.
``` ```
python3 RepeatFiller.py -c mm10_smiCra1.chain -T2 mm10.2bit -Q2 smiCra1.2bit python3 RepeatFiller.py -c smiCra1_mm10.chain -T2 mm10.2bit -Q2 smiCra1.2bit
``` ```
### patchChain
patchChain.perl performs a highly sensitive local pairwise alignment for loci flanked by aligning blocks [1,3]. Given an alignment chain [4], it considers all chains that pass the score and span filters (optional parameters), extracts all the unaligning loci and creates local alignment jobs. After executing these alignment jobs, the newly found and the original local alignments are combined and used to produce a new set of improved chains.
This procedure is recommended for comparisons between species that are separated by >0.75 substitutions per neutral site [1].
```
patchChain.perl /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/wholeGenomeAlignment/chainCleaner/smiCra1_mm10_repFill_chainCl.chain /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/mm10.2bit /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/smiCra1.2bit /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/mm10.chrom.sizes /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/smiCra1.chrom.sizes -chainMinScore 5000 -gapMaxSizeT 500000 -gapMaxSizeQ 500000 -gapMinSizeT 30 -gapMinSizeQ 30 -numJobs 10 -jobDir jobs -jobList jobList -outputDir pslOutput -minEntropy 1.8 -windowSize 30 -minIdentity 60 -lastzParameters "--format=axt K=1500 L=2500 M=0 T=0 W=5 Q=/data/projects/punim0586/lecook/chipseq-pipeline/cross_species/bin/GenomeAlignmentTools/example/HoxD55.q"
```
This create a folder with the jobs and a jobList which can be called in an array slurm script. Then you can run the jobs in parallel.
Need to work out how to combine these and do I then run chainCleaner again?
### chainCleaner ### chainCleaner
https://github.com/hillerlab/GenomeAlignmentTools https://github.com/hillerlab/GenomeAlignmentTools
After RepeatFiller, we applied chainCleaner with parameters -LRfoldThreshold = 2.5 -doPairs -LRfoldThresholdPairs = 10 -maxPairDistance = 10000 -maxSuspectScore = 100000 -minBrokenChainScore = 75000 to improve alignment specificity. After RepeatFiller, we applied chainCleaner with parameters -LRfoldThreshold = 2.5 -doPairs -LRfoldThresholdPairs = 10 -maxPairDistance = 10000 -maxSuspectScore = 100000 -minBrokenChainScore = 75000 to improve alignment specificity.
...@@ -160,32 +213,75 @@ chainCleaner improves the specificity in genome alignment chains by detecting an ...@@ -160,32 +213,75 @@ chainCleaner improves the specificity in genome alignment chains by detecting an
``` ```
chainCleaner smiCra1_mm10_repFil.chain mm10.2bit smiCra1.2bit smiCra1_mm10_repFill_chainCl.chain smiCra1_mm10_repFill_chainCl.bed -tSizes=mm10.chrom.sizes -qSizes=smiCra1.chrom.sizes -LRfoldThreshold=2.5 -doPairs -LRfoldThresholdPairs=10 -maxPairDistance=10000 -maxSuspectScore=100000 -minBrokenChainScore=75000 -linearGap=loose
``` ```
`-LRfoldThreshold` = threshold for removing local alignment blocks if the score of the left and right fill of brokenChain. Default is 2.5
`-doPairs` = flag: if set, do test if pairs of chain breaking alignments can be removed
`-LRfoldThresholdPairs` = threshold for removing local alignment blocks if the score of the left and right fill of broken chains (for pairs). Default = 10
`-maxPairDistance` = only consider pairs of chain breaking alignments where the distance between the end of the upstream CBA and the start of the downstream CBA is at most that many bp (default 10000)
`-maxSuspectScore` = threshold for score of suspect subChain. If higher, do not remove suspect.
`-linearGap`=loose
### chainNet ### chainNet
Given a set of alignment chains, chainNet produces alignment nets, which is a hierarchical collection of chains or parts of chains that attempt to capture only orthologous alignments [4]. The original chainNet implementation approximates the score of "sub-nets" (nets that come from a part of a chain and fill a gap in a higher-level net) by the fraction of aligning bases. This can lead to a bias in case the aligning blocks of a chain are not equally distributed. We implemented a new parameter "-rescore" in chainNet that computes the real score of each subnet [2]. Given a set of alignment chains, chainNet produces alignment nets, which is a hierarchical collection of chains or parts of chains that attempt to capture only orthologous alignments [4]. The original chainNet implementation approximates the score of "sub-nets" (nets that come from a part of a chain and fill a gap in a higher-level net) by the fraction of aligning bases. This can lead to a bias in case the aligning blocks of a chain are not equally distributed. We implemented a new parameter "-rescore" in chainNet that computes the real score of each subnet [2].
Make the alignment nets:
``` ```
chainNet in.chain target.sizes query.sizes target.net query.net chainNet /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/wholeGenomeAlignment/chainCleaner/smiCra1_mm10_repFill_chainCl.chain /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/mm10.chrom.sizes /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/smiCra1.chrom.sizes /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/wholeGenomeAlignment/chainNet/mm10.net /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/wholeGenomeAlignment/chainNet/smiCra1.net
``` ```
### netClass
Add classification info to net.
usage:
netClass [options] in.net tDb qDb out.net
tDb - database to fetch target repeat masker table information
qDb - database to fetch query repeat masker table information
```
netClass chainNet/mm10.net ../data/genomes/mm10.db ../data/genomes/smiCra1.db mm10_class.net
can't find database ../data/genomes/mm10.db in hg.conf, should have a default named "db"
```
ERROR with netClass, not sure where to get the "database" from??
### NetFilterNonNested ### NetFilterNonNested
Before building a multiple alignment from the pairwise alignment nets, it is recommended to remove low-scoring alignment nets that are unlikely to represent real homologies. While the netFilter program [4] removes nested nets in case their parent net does not satisfy the specified score and size criteria, NetFilterNonNested.perl applies a non-nested filtering procedure that considers and filters each net individually [1,3]. This avoids removing nested nets that would satisfy the specified criteria, even if a parent net is removed.
Keeping nets that score higher than 10000 and keeping all nested nets that align to the same locus if they score higher than 3000 for non-placental mammals NetFilterNonNested.perl -doScoreFilter -keepSynNetsWithScore 3000 -keepInvNetsWithScore 3000 -minScore1 10000 ref.query.net.gz > ref.query.filtered.net
NetFilterNonNested.perl -doScoreFilter -keepSynNetsWithScore 3000 -keepInvNetsWithScore 3000 -minScore1 10000 mm10.smiCra1.net > mm10.smiCra1.filtered.net
NetFilterNonNested.perl -doScoreFilter -minScore1 10000 -keepSynNetsWithScore 3000 -keepInvNetsWithScore 3000
### netChainSubset ### netChainSubset
Create liftOver chain Create liftOver chain
``` ```
netChainSubset ../net/chr.net chr.chain ../over/chr.chain netChainSubset ../net/chr.net chr.chain ../over/netchr.chain
``` ```
### LiftOver ### LiftOver
dunnart_to_mouse dunnart_to_mouse
liftOver -bedPlus=4 ../../mouse/results/macs2/H3K27ac_E11.5_overlap.narrowPeak netChainSubset/smiCra1_mm10.chain test.bed unmapped.bed
### Quality control
FluentDNA looks pretty good for comparing the quality of the alignment. Gives a table of statistics but also provides visualisation tools which looks cool!
https://github.com/josiahseaman/FluentDNA
```
./fluentdna --fasta=data/hg38.fa --chainfile=data/hg38ToPanTro5.over.chain --extrafastas data/panTro5.fa --chromosomes chr19 --outname="Human vs Chimpanzee"
```
References References
......
##!/usr/bin/env bash ##!/usr/bin/env bash
## This script loops through all scaffolds in the dunnart genome and ## This script loops through all scaffolds in the dunnart genome and
## generates a separate lastz command for each scaffold ## generates a separate command for each scaffold
## These commands are then used in a slurm array script to run jobs in parallel ## These commands are then used in a slurm array script to run jobs in parallel
TRA=($(for file in *.fa; do echo $file |cut -d "." -f 1;done)) TRA=($(for file in *.axt; do echo $file |cut -d "." -f 1-2;done))
echo ${TRA[@]} echo ${TRA[@]}
...@@ -12,6 +12,14 @@ for tr in ${TRA[@]}; ...@@ -12,6 +12,14 @@ for tr in ${TRA[@]};
do do
echo 'lastz_32 /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/mm10.fa[multiple] /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/smiCra1_RM/'${tr}.fa 'H=2000 K=2400 L=3000 Y=9400 --format=maf > /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/maf/'${tr}_mm10.smiCra1.maf # generate lastz commands
#echo 'lastz_32 /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/mm10.fa[multiple] /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/smiCra1_RM/'${tr}.fa 'H=2000 K=2400 L=3000 Y=9400 --format=maf > /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/maf/'${tr}_mm10.smiCra1.maf
# generate convert maf to axt format commands
#echo 'maf-convert axt '${tr}.maf' > '${tr}.axt
## build co-linear alignment chains
echo 'axtChain -linearGap=loose /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/axt/'${tr}'.axt /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/mm10.2bit /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/smiCra1.2bit '${tr}'.chain'
done done
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment