From cd229abfa3fcd962c0f798653c4f65f1af4850c0 Mon Sep 17 00:00:00 2001
From: Laura Cook <l.cook2@student.unimelb.edu.au>
Date: Fri, 11 Dec 2020 12:26:00 +1100
Subject: [PATCH] updated

---
 cross_species_comparison/README.md        | 108 ++++++++++++++++++++--
 cross_species_comparison/scripts/lastz.sh |  14 ++-
 2 files changed, 113 insertions(+), 9 deletions(-)

diff --git a/cross_species_comparison/README.md b/cross_species_comparison/README.md
index 73bb4e7..31d70ab 100644
--- a/cross_species_comparison/README.md
+++ b/cross_species_comparison/README.md
@@ -122,18 +122,22 @@ __mm10.2bit__ - contains the complete mouse/mm10 genome sequence in the 2bit fil
 
 ### lastZ
 
-To align placental mammals, we used previously determined lastz parameters (K = 2400, L = 3000, Y = 9400, H = 2000, and the lastz default scor- ing matrix) that have a sufficient sensitivity to capture orthol- ogous exons
+To align placental mammals, we used previously determined lastz parameters (K = 2400, L = 3000, Y = 9400, H = 2000, and the lastz default scoring matrix) that have a sufficient sensitivity to capture orthologous exons
 
-To align placental mammals, we used the lastz alignment parameters K = 2400, L = 3000, Y = 9400, H = 2000 and the lastz default scoring matrix, correspond- ing to parameter set 2 in Table 1. To align non-placental vertebrates, we used K = 2400, L = 3000, Y = 3400, H = 2000 and the HoxD55 scoring matrix. Citation: Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res. 2017;45(14):8369â€“77.
+To align placental mammals, we used the lastz alignment parameters K = 2400, L = 3000, Y = 9400, H = 2000 and the lastz default scoring matrix, correspond- ing to parameter set 2 in Table 1. To align vertebrates, we used K = 2400, L = 3000, Y = 3400, H = 2000 and the HoxD55 scoring matrix. Citation: Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res. 2017;45(14):8369â€“77.
 
 Create commands for running lastZ for all scaffolds: `lastz.sh`
+
+
 Run as an array on slurm: `array_wrapper.slurm`
 
 
 #### Convert maf to axt-format
+http://last.cbrc.jp/doc/maf-convert.html
 
 ```
-maf-convert axt my-alignments.maf > my-alignments.axt
+module load last/last/1066
+maf-convert axt ${tr}.maf > ${tr}.axt
 ```
 
 
@@ -145,13 +149,62 @@ We use axtChain (http://www.soe.ucsc.edu/~kent; default parameters) to build co-
 axtChain -linearGap=loose mm10_smiCra1.axt mm10.2bit smiCra1.2bit mm10_smiCra1.chain
 
 ```
+
+### chainMergeSort
+Merge short chains into longer ones, concatenate chains and sort
+
+```
+chainMergeSort *.chain > smiCra1_mm10.chain
+```
+
+Installing Genomic Alignment Tools (Hiller group)
+
+```
+git clone https://github.com/hillerlab/GenomeAlignmentTools.git
+
+cd GenomeAlignmentTools/kent/src
+
+make
+
+# Kent binaries
+PATH=$PATH:/data/projects/punim0586/lecook/chipseq-pipeline/cross_species/bin/GenomeAlignmentTools/kent/bin;export PATH
+
+export KENTSRC_DIR=/data/projects/punim0586/lecook/chipseq-pipeline/cross_species/bin/GenomeAlignmentTools/kent/src/
+
+cd /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/bin/GenomeAlignmentTools/src
+export MACHTYPE=x86_64
+make 
+
+PATH=$PATH:/data/projects/punim0586/lecook/chipseq-pipeline/cross_species/bin/GenomeAlignmentTools/src
+
+```
+
+
+
 ### RepeatFiller
 https://github.com/hillerlab/GenomeAlignmentTools
+RepeatFiller [5] is a tool to incorporate newly-detected repeat-overlapping alignments into pairwise alignment chains [4]. Its runtime adds little to the computationally more expensive step of generating chains in pairwise whole-genome alignments. RepeatFiller circumvents the problem that considering all repeat-overlapping alignment seeds during whole genome alignment is computationally not feasible. Therefore, RepeatFiller only aligns local genomic regions that are bounded by colinear aligning blocks, as provided in the chains, which makes it feasible to consider all seeds including those that overlap repetitive regions. RepeatFiller application to mammalian genome alignment chains can add between 22 and 84 Mb of previously-undetected alignments that mostly originate from transposable elements [5]. This helps to comprehensively align repetitive regions and improves the annotation of conserved non-coding elements.
 
 ```
-python3 RepeatFiller.py -c mm10_smiCra1.chain -T2 mm10.2bit -Q2 smiCra1.2bit
+python3 RepeatFiller.py -c smiCra1_mm10.chain -T2 mm10.2bit -Q2 smiCra1.2bit
 ```
 
+### patchChain
+patchChain.perl performs a highly sensitive local pairwise alignment for loci flanked by aligning blocks [1,3]. Given an alignment chain [4], it considers all chains that pass the score and span filters (optional parameters), extracts all the unaligning loci and creates local alignment jobs. After executing these alignment jobs, the newly found and the original local alignments are combined and used to produce a new set of improved chains.
+
+This procedure is recommended for comparisons between species that are separated by >0.75 substitutions per neutral site [1].
+
+
+```
+patchChain.perl /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/wholeGenomeAlignment/chainCleaner/smiCra1_mm10_repFill_chainCl.chain /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/mm10.2bit /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/smiCra1.2bit /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/mm10.chrom.sizes /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/smiCra1.chrom.sizes -chainMinScore 5000 -gapMaxSizeT 500000 -gapMaxSizeQ 500000 -gapMinSizeT 30 -gapMinSizeQ 30 -numJobs 10 -jobDir jobs -jobList jobList -outputDir pslOutput -minEntropy 1.8 -windowSize 30 -minIdentity 60 -lastzParameters "--format=axt K=1500 L=2500 M=0 T=0 W=5 Q=/data/projects/punim0586/lecook/chipseq-pipeline/cross_species/bin/GenomeAlignmentTools/example/HoxD55.q"
+
+```
+
+This create a folder with the jobs and a jobList which can be called in an array slurm script. Then you can run the jobs in parallel.
+Need to work out how to combine these and do I then run chainCleaner again?
+
+
+
 ### chainCleaner
 https://github.com/hillerlab/GenomeAlignmentTools
 After RepeatFiller, we applied chainCleaner with parameters -LRfoldThreshold = 2.5 -doPairs -LRfoldThresholdPairs = 10 -maxPairDistance = 10000 -maxSuspectScore = 100000 -minBrokenChainScore = 75000 to improve alignment specificity.
@@ -160,32 +213,75 @@ chainCleaner improves the specificity in genome alignment chains by detecting an
 
 
 ```
+chainCleaner smiCra1_mm10_repFil.chain mm10.2bit smiCra1.2bit smiCra1_mm10_repFill_chainCl.chain smiCra1_mm10_repFill_chainCl.bed -tSizes=mm10.chrom.sizes -qSizes=smiCra1.chrom.sizes -LRfoldThreshold=2.5 -doPairs -LRfoldThresholdPairs=10 -maxPairDistance=10000 -maxSuspectScore=100000 -minBrokenChainScore=75000 -linearGap=loose
 
 ```
 
+`-LRfoldThreshold` = threshold for removing local alignment blocks if the score of the left and right fill of brokenChain. Default is 2.5
+`-doPairs` = flag: if set, do test if pairs of chain breaking alignments can be removed
+`-LRfoldThresholdPairs` = threshold for removing local alignment blocks if the score of the left and right fill of broken chains (for pairs). Default = 10
+`-maxPairDistance` = only consider pairs of chain breaking alignments where the distance between the end of the upstream CBA and the start of the downstream CBA is at most that many bp (default 10000)
+`-maxSuspectScore` = threshold for score of suspect subChain. If higher, do not remove suspect.
+
+`-linearGap`=loose
+
 ### chainNet
 Given a set of alignment chains, chainNet produces alignment nets, which is a hierarchical collection of chains or parts of chains that attempt to capture only orthologous alignments [4]. The original chainNet implementation approximates the score of "sub-nets" (nets that come from a part of a chain and fill a gap in a higher-level net) by the fraction of aligning bases. This can lead to a bias in case the aligning blocks of a chain are not equally distributed. We implemented a new parameter "-rescore" in chainNet that computes the real score of each subnet [2].
 
+Make the alignment nets:
+
 ```
-chainNet in.chain target.sizes query.sizes target.net query.net
+chainNet /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/wholeGenomeAlignment/chainCleaner/smiCra1_mm10_repFill_chainCl.chain /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/mm10.chrom.sizes /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/smiCra1.chrom.sizes /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/wholeGenomeAlignment/chainNet/mm10.net /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/wholeGenomeAlignment/chainNet/smiCra1.net 
 ```
 
+### netClass
+Add classification info to net.
+usage:
+   netClass [options] in.net tDb qDb out.net
+       tDb - database to fetch target repeat masker table information
+       qDb - database to fetch query repeat masker table information
+
+```
+netClass chainNet/mm10.net ../data/genomes/mm10.db ../data/genomes/smiCra1.db mm10_class.net
+can't find database ../data/genomes/mm10.db in hg.conf, should have a default named "db"
+```
+ERROR with netClass, not sure where to get the "database" from??
+
+
+
 ### NetFilterNonNested
+Before building a multiple alignment from the pairwise alignment nets, it is recommended to remove low-scoring alignment nets that are unlikely to represent real homologies. While the netFilter program [4] removes nested nets in case their parent net does not satisfy the specified score and size criteria, NetFilterNonNested.perl applies a non-nested filtering procedure that considers and filters each net individually [1,3]. This avoids removing nested nets that would satisfy the specified criteria, even if a parent net is removed.
+Keeping nets that score higher than 10000 and keeping all nested nets that align to the same locus if they score higher than 3000 for non-placental mammals NetFilterNonNested.perl -doScoreFilter -keepSynNetsWithScore 3000 -keepInvNetsWithScore 3000 -minScore1 10000 ref.query.net.gz > ref.query.filtered.net
 
+NetFilterNonNested.perl -doScoreFilter -keepSynNetsWithScore 3000 -keepInvNetsWithScore 3000 -minScore1 10000 mm10.smiCra1.net > mm10.smiCra1.filtered.net
 
+NetFilterNonNested.perl -doScoreFilter -minScore1 10000 -keepSynNetsWithScore 3000 -keepInvNetsWithScore 3000
 
 
 ### netChainSubset
 Create liftOver chain
 
 ```
-netChainSubset ../net/chr.net chr.chain ../over/chr.chain
+netChainSubset ../net/chr.net chr.chain ../over/netchr.chain
 ```
 
 ### LiftOver
 
 dunnart_to_mouse
 
+liftOver -bedPlus=4 ../../mouse/results/macs2/H3K27ac_E11.5_overlap.narrowPeak netChainSubset/smiCra1_mm10.chain test.bed unmapped.bed
+
+
+### Quality control 
+FluentDNA looks pretty good for comparing the quality of the alignment. Gives a table of statistics but also provides visualisation tools which looks cool!
+
+https://github.com/josiahseaman/FluentDNA
+
+```
+./fluentdna --fasta=data/hg38.fa --chainfile=data/hg38ToPanTro5.over.chain --extrafastas data/panTro5.fa --chromosomes chr19 --outname="Human vs Chimpanzee"
+```
+
+
 
 
 References
diff --git a/cross_species_comparison/scripts/lastz.sh b/cross_species_comparison/scripts/lastz.sh
index 41baef6..480d56f 100644
--- a/cross_species_comparison/scripts/lastz.sh
+++ b/cross_species_comparison/scripts/lastz.sh
@@ -1,10 +1,10 @@
 ##!/usr/bin/env bash
 
 ## This script loops through all scaffolds in the dunnart genome and
-## generates a separate lastz command for each scaffold
+## generates a separate command for each scaffold
 ## These commands are then used in a slurm array script to run jobs in parallel
 
-TRA=($(for file in *.fa; do echo $file |cut -d "." -f 1;done))
+TRA=($(for file in *.axt; do echo $file |cut -d "." -f 1-2;done))
 
 echo ${TRA[@]}
 
@@ -12,6 +12,14 @@ for tr in ${TRA[@]};
 
 do
 
-echo 'lastz_32 /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/mm10.fa[multiple] /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/smiCra1_RM/'${tr}.fa 'H=2000 K=2400 L=3000 Y=9400 --format=maf > /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/maf/'${tr}_mm10.smiCra1.maf
+# generate lastz commands
+#echo 'lastz_32 /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/mm10.fa[multiple] /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/smiCra1_RM/'${tr}.fa 'H=2000 K=2400 L=3000 Y=9400 --format=maf > /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/maf/'${tr}_mm10.smiCra1.maf
+
+# generate convert maf to axt format commands
+#echo 'maf-convert axt '${tr}.maf' > '${tr}.axt
+
+## build co-linear alignment chains
+echo 'axtChain -linearGap=loose /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/axt/'${tr}'.axt /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/mm10.2bit /data/projects/punim0586/lecook/chipseq-pipeline/cross_species/data/genomes/smiCra1.2bit '${tr}'.chain'
+
 
 done
-- 
GitLab