first commit

9f47e352 · Laura Cook · 306bbbae · 9f47e352 · 9f47e352 · 9f47e352
Commit 9f47e352 authored 4 years ago by Laura Cook
--- a/cross_species_comparison/README.md
+++ b/cross_species_comparison/README.md
+Ideas from Anshul Kundaje
+
+Rather than BLAST, compute the gapped k-mer similarity between regions across the species. Do the following
+
+1. Find orthologous genes between the two species (once I have dunnart genome):
+- JustOrthologs - https://academic.oup.com/bioinformatics/article/35/4/546/5063405
+This seems to be a new fast method, metrics are comparable/slightly better than OrthoMCL OMA and OrthoFinder, doesn't require all-versus-all BLAST comparisons, which are time-consuming and memory intensive which are is used by other methods. Also creates just a single step command for generating orthologs instead of multiple steps that take awhile to run like in other programs. Compares dinucleotide percentages to determine the level of sequence identity between two CDS regions. Specific parameter for distantly related species. Tested in paper on species that are 312 million years divergent and works well.
+
+```
+ARGUMENT OPTIONS:
+
+ -q	Query Fasta File					A fasta file that has been divided into CDS regions and sorted based on the number of CDS regions.
+ -s	Subject Fasta File					A fasta file that has been divided into CDS regions and sorted based on the number of CDS regions.
+ -o	Output File							A path to an output file that will be created from this program.
+ -t	Number of Cores						The number of Cores available
+ -d	For More Distantly Related Species	A flag that implements the -d algorithm, which gives better accuracy for more distantly related species.
+ -c	Combine Both Algorithms				Combines both the normal and -d algorithms for the best overall accuracy.
+ -r Correlation value					An optional value for changing the required correlation value between orthologs.
+```
+
+Fasta file of CDS regions that have been sorted based on the number of CDS regions for both the dunnart and the mouse
+
+- mouse = `wget ftp://ftp.ensembl.org/pub/release-100/fasta/mus_musculus/cds/Mus_musculus.GRCm38.cds.all.fa.gz`
+- dunnart =
+
+# dunnart as query and mouse are query - see how different they are
+
+```
+python justOrthologs.py -q orthologTest/mm10_cds.fa -s orthologTest/dunnart_cds.fa -o dunnart_mm10_orthologs -c -d
+```
+
+2. Now for gene x in species A mapped to gene y in species B, identify approximate syntenic regulatory domains (e.g. if it were human vs mouse you could use +/-100 Kb around the genes)
+- Don’t really get this step, so for two genes that are orthologs take -/+ 100kb either side of the gene for both species and I guess save that region with the orthologous gene name, species and coordinate
+
+For dunnart - look at distance to the nearest gene for all peaks see how many fit into the 100kb range either side.
+
+3. Now ur goal is map elements between these two domains
+
+4. Map elements between these two domains (how should I do this?)
+
+5. For all pairs of elements across the species, compute their gapped k-mer similarity. You can use the gkmSVM kernel for this.
+gkmSVM kernel - http://www.beerlab.org/gkmsvm/
+tutorial: http://www.beerlab.org/gkmsvm/gkmsvm-tutorial.htm
+
+so the pair is the 100kb upstream in the mouse and dunnart or the downstream in the mouse and dunnart.
+
+R package
+
+```
+install.packages('gkmSVM')
+```
+
+Sample input files: ctcfpos.bed nr10mers.fa ref.fa alt.fa
+The first step is to build the kernel matrix (the pairwise similarity scores for all the sequences in the positive and negative sets).
+
+Positive set:
+Negative set:
+
+
+
+6. Do this for pairs of elements for all matched domains across species. Lets call these syntenic pairs
+
+7. As a control compute the gkm similarity for pairs of elements in non matched domains. This gives you the null distribution.
+- pairs of elements in non-matched domains?
+
+8. Compare the observed similarities of syntenic pairs against the null distribution to obtain statistical significance
+
+9. Do multiple testing correction
+
+10. Pairs that pass, are your cross-species matched enhancers
--- a/cross_species_comparison/find_orthologs.sh
+++ b/cross_species_comparison/find_orthologs.sh
--- a/dunnart/scripts/trimfastq.py
+++ b/dunnart/scripts/trimfastq.py
+##################################
+#                                #
+# Last modified 2017/11/08       #
+#                                #
+# Georgi Marinov                 #
+#                                #
+##################################
+
+import sys
+import os
+
+READS_LOG_INTERVAL = 5000000
+MILLION = 1000000
+
+
+def run():
+
+    if len(sys.argv) < 2:
+        print(
+            'usage: python %s <inputfilename> <bpToKeep | max> [-trim5 bp] '
+            '[-flowcellID flowcell] [-addEnd 1 | 2] [-replace string '
+            'newstring | blank] [-renameIDs prefix] [-stdout]' % sys.argv[0])
+        print(
+            '\tthe -trim5 option will trim additional bp from the 5 end, '
+            'i.e. if you want the middle 36bp of 38bp reads, use 36 as bp '
+            'to keep and 1 as the trim5 argument')
+        print(
+            '\tUse - to specify standard input, the script will print'
+            '(to standard output by default')
+        print(
+            '\tThe script can read compressed files as long as they '
+            'have the correct suffix - .bz2 or .gz')
+        sys.exit(1)
+
+    inputfilename = sys.argv[1]
+    doMax = False
+    if sys.argv[2] == 'max':
+        doMax = True
+        trim = 'max'
+    else:
+        trim = int(sys.argv[2])
+    outputfilename = inputfilename.split(
+        '/')[-1].split('.fastq')[0] + '.' + str(trim)+'mers.fastq'
+    doFlowcellID = False
+
+    doStdOut = True
+
+    if '-flowcellID' in sys.argv:
+        doFlowcellID = True
+        flowcellID = sys.argv[sys.argv.index('-flowcellID')+1]
+        if doStdOut:
+            pass
+        else:
+            print('will include flowcell ID', flowcellID, 'in reads headers')
+
+    doRenameIDs = False
+    if '-renameIDs' in sys.argv:
+        doRenameIDs = True
+        RID = '@' + sys.argv[sys.argv.index('-renameIDs') + 1]
+
+    dotrim5 = False
+    if '-trim5' in sys.argv:
+        dotrim5 = True
+        trim5 = int(sys.argv[sys.argv.index('-trim5')+1])
+        if doStdOut:
+            pass
+        else:
+            print('will trim ', trim5, 'bp from the 5-end')
+        outputfilename = inputfilename.split(
+            '.fastq')[0] + '.' + str(trim)+'bp-5prim-trim.fastq'
+
+    doAddEnd = False
+    if '-addEnd' in sys.argv:
+        doAddEnd = True
+        END = sys.argv[sys.argv.index('-addEnd')+1]
+        if doStdOut:
+            pass
+        else:
+            print('will add',  '/'+END, 'to read IDs')
+
+    doReplace = False
+    if '-replace' in sys.argv:
+        doReplace = True
+        oldstring = sys.argv[sys.argv.index('-replace')+1]
+        newstring = sys.argv[sys.argv.index('-replace')+2]
+        if newstring == 'blank':
+            newstring = ''
+        if doStdOut:
+            pass
+        else:
+            print('will replace',  oldstring, 'with', newstring, 'in read IDs')
+
+    if doStdOut:
+        pass
+    else:
+        outfile = open(outputfilename, 'w')
+
+    doStdIn = False
+    if inputfilename != '-':
+        if inputfilename.endswith('.bz2'):
+            cmd = 'bzip2 -cd ' + inputfilename
+        elif inputfilename.endswith('.gz'):
+            cmd = 'gunzip -c ' + inputfilename
+        else:
+            cmd = 'cat ' + inputfilename
+        p = os.popen(cmd, "r")
+    else:
+        doStdIn = True
+
+    line = 'line'
+
+    i = 0
+    shorter = 0
+
+    if dotrim5:
+        i = 1
+        j = 0
+        while line != '':
+            if doStdIn:
+                line = sys.stdin.readline()
+            else:
+                line = p.readline()
+            if line == '':
+                break
+            if i == 1 and line[0] == '@':
+                if doFlowcellID and flowcellID not in line:
+                    ID = '@'+flowcellID+'_'+line.replace(' ', '_')[1:-1]+'\n'
+                else:
+                    ID = line.replace(' ', '_')
+                if doReplace:
+                    ID = ID.replace(oldstring, newstring)
+                if doRenameIDs:
+                    ID = RID + str(j)
+                if doAddEnd:
+                    ID = ID.strip()+'/'+END+'\n'
+                i = 2
+                continue
+            if i == 2:
+                i = 3
+                sequence = line[trim5:len(line)].strip()
+                continue
+            if i == 3 and line[0] == '+':
+                plus = '+\n'
+                i = 4
+                continue
+            if i == 4:
+                scores = line
+                i = 1
+                scores = line[trim5:len(line)].strip()
+                scores = scores[0:trim]
+                j = j+1
+                if j % READS_LOG_INTERVAL == 0:
+                    if doStdOut:
+                        pass
+                    else:
+                        print(str(j/MILLION) + 'M reads processed')
+                if doMax:
+                    sequence = sequence.replace('.', 'N')
+                else:
+                    sequence = sequence[0:trim].replace('.', 'N')+'\n'
+                if doStdOut:
+                    print(ID.strip())
+                    print(sequence.strip())
+                    print(plus.strip())
+                    print(scores)
+                else:
+                    outfile.write(ID.strip()+'\n')
+                    outfile.write(sequence.strip()+'\n')
+                    outfile.write(plus.strip()+'\n')
+                    outfile.write(scores + '\n')
+                continue
+    else:
+        i = 1
+        j = 0
+        while line != '':
+            if doStdIn:
+                line = sys.stdin.readline()
+            else:
+                line = p.readline()
+            if line == '':
+                break
+            if i == 1 and line[0] == '@':
+                if doFlowcellID and flowcellID not in line:
+                    ID = '@'+flowcellID+'_'+line.replace(' ', '_')[1:-1]+'\n'
+                else:
+                    ID = line.replace(' ', '_')
+                if doReplace:
+                    ID = ID.replace(oldstring, newstring)
+                if doRenameIDs:
+                    ID = RID + str(j)
+                if doAddEnd:
+                    ID = ID.strip()+'/'+END+'\n'
+                i = 2
+                continue
+            if i == 2:
+                i = 3
+                j = j+1
+                if j % READS_LOG_INTERVAL == 0:
+                    if doStdOut:
+                        pass
+                    else:
+                        print(str(j/MILLION) + 'M reads processed')
+                if doMax:
+                    sequence = line
+                else:
+                    if len(line.strip()) < trim:
+                        shorter += 1
+                        sequence = line.strip().replace('.', 'N')+'\n'
+                    else:
+                        sequence = line[0:trim].replace('.', 'N')+'\n'
+                continue
+            if i == 3 and line[0] == '+':
+                plus = '+\n'
+                i = 4
+                continue
+            if i == 4:
+                i = 1
+                if doMax:
+                    scores = line
+                    if doStdOut:
+                        print(ID.strip())
+                        print(sequence.strip())
+                        print(plus.strip())
+                        print(line.strip())
+                    else:
+                        outfile.write(ID)
+                        outfile.write(sequence)
+                        outfile.write(plus)
+                        outfile.write(line)
+                else:
+                    if len(line.strip()) < trim:
+                        continue
+                    scores = line[0:trim]+'\n'
+                    if doStdOut:
+                        print(ID.strip())
+                        print(sequence.strip())
+                        print(plus.strip())
+                        print(scores.strip())
+                    else:
+                        outfile.write(ID)
+                        outfile.write(sequence)
+                        outfile.write(plus)
+                        outfile.write(scores)
+                continue
+
+    if doStdOut:
+        pass
+    else:
+        outfile.close()
+
+    if shorter > 0:
+        print(shorter, 'sequences shorter than desired length')
+
+
+run()