From 8661c846291d1469ac5f8fb9d7e8f3774b611d56 Mon Sep 17 00:00:00 2001
From: Laura Cook <l.cook2@student.unimelb.edu.au>
Date: Fri, 28 Aug 2020 14:46:14 +1000
Subject: [PATCH] added info on converting to 2bit format

---
 cross_species_comparison/README.md | 42 +++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 6 deletions(-)

diff --git a/cross_species_comparison/README.md b/cross_species_comparison/README.md
index 21f3f4b..c46dad1 100644
--- a/cross_species_comparison/README.md
+++ b/cross_species_comparison/README.md
@@ -18,11 +18,6 @@ http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto
 https://www.bioconductor.org/packages/release/bioc/vignettes/CNEr/inst/doc/PairwiseWholeGenomeAlignment.html
 https://github.com/hillerlab/GenomeAlignmentTools (more fine-tuned and advanced)
 
-If you do not have access to sufficient resources: I would advise against using Blastn on its own because you will often obtain several hits for every query sequence, and it may be difficult to find appropriate parameters that identify true orthologs over the entire genome or even across a large region at this evolutionary distance.
-I would then recommend to identify orthologous regions beforehand as you mentionned (e.g. 100kb around orthologous genes) and use a local aligner to align those two regions (Lastz, MUMmer4, or another). This will drastically reduce the computational power required and should run on a desktop machine. However, your alignments will not be exhaustive, and it is likely that you will not be able to assess conservation for a sizeable fraction on your ChIPseq peaks.
-
-I hope this is helpful, don't hesitate to get back to me if you have questions.
-
 __Methods from Hecker & Hiller 2020 Whole Genome Alignment:__
 Going to try this method.
 
@@ -32,8 +27,43 @@ After building chains, we applied RepeatFiller (RRID:SCR_017414), a method that
 
 The main difference between this 120-mammal alignment and our previous 144-vertebrate alignment [16] is that the former focuses entirely on mammals and includes many new species (120 vs 74 mammals, see Supplementary Table 1). In addition, we updated genome assemblies of 12 species that were already included in the previous alignment (species are marked in Supplementary Table 1). Finally, the 120-mammal alignment used RepeatFiller to improve the completeness of alignments between repetitive regions.
 
+__To align non-placental mammals, we used K = 2400, L = 3000, Y = 3400, H = 2000 and the HoxD55 scoring matrix.__
+
+- MASKING: Both genomes must first be repeat-masked with RepeatMasker and Tandem Repeats Finder (trf) (thanks to Hiram for pointing this out).
+- ALIGNING: The two genomes are aligned with BLASTZ (we don't use blastz's own chaining; see the discussion (angie)). This generates lav files, which have to be converted to psl (lavToPsl).
+- CHAINING: Two matching alignments next to each other are joined into one fragment if they are close enough (axtChain). Because every genomic fragment can match several others, we keep only the longest chains: first run axtSort, then filter with axtBest (more info on the mailing list).
+- NETTING: Group blocks of chained alignments into longer stretches of synteny (netChain).
+- MAF'ING: From the synteny files (positions), get the sequences and re-create the alignments.
+- PhastCons: Using the maf files, calculate the strength of conservation for every base, similar to a Vista or protein conservation plot, but applicable to multiple alignments.
+
+
+### Preparation
+
+#### Repeat mask dunnart genome
+
+????
+
+#### Create .2bit and .sizes files
+
+```
+faToTwoBit ../../dunnart/genomes/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr.fasta smiCra1.2bit
+```
+
+```
+twoBitInfo smiCra1.2bit stdout | sort -k2rn > smiCra1.chrom.sizes
+```
+
+#### mm10 target genome
+http://hgdownload.cse.ucsc.edu/goldenpath/mm10/bigZips/mm10.2bit
+
+__mm10.2bit__ - contains the complete mouse/mm10 genome sequence in the 2bit file format.  Repeats from __RepeatMasker__ and __Tandem Repeats Finder__ (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.  
+
+
 ### lastZ
-We used previously determined lastz parameters (K = 2400, L = 3000, Y = 9400, H = 2000, and the lastz default scoring matrix) - Sharma V, Hiller M. Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res. 2017;45(14):8369–77.
+
+To align placental mammals, we used previously determined lastz parameters (K = 2400, L = 3000, Y = 9400, H = 2000, and the lastz default scoring matrix) that have a sufficient sensitivity to capture orthologous exons.
+
+To align placental mammals, we used the lastz alignment parameters K = 2400, L = 3000, Y = 9400, H = 2000 and the lastz default scoring matrix, corresponding to parameter set 2 in Table 1. To align non-placental vertebrates, we used K = 2400, L = 3000, Y = 3400, H = 2000 and the HoxD55 scoring matrix. Citation: Sharma V, Hiller M. Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res. 2017;45(14):8369–77.
 
 ### axtChain
 ### RepeatFiller
-- 
GitLab