Skip to content
Snippets Groups Projects
Commit 9f47e352 authored by Laura Cook's avatar Laura Cook
Browse files

first commit

parent 306bbbae
Branches
No related tags found
No related merge requests found
Ideas from Anshul Kundaje
Rather than BLAST, compute the gapped k-mer similarity between regions across the species. Do the following
1. Find orthologous genes between the two species (once I have dunnart genome):
- JustOrthologs - https://academic.oup.com/bioinformatics/article/35/4/546/5063405
This seems to be a new fast method, metrics are comparable/slightly better than OrthoMCL OMA and OrthoFinder, doesn't require all-versus-all BLAST comparisons, which are time-consuming and memory intensive which are is used by other methods. Also creates just a single step command for generating orthologs instead of multiple steps that take awhile to run like in other programs. Compares dinucleotide percentages to determine the level of sequence identity between two CDS regions. Specific parameter for distantly related species. Tested in paper on species that are 312 million years divergent and works well.
```
ARGUMENT OPTIONS:
-q Query Fasta File A fasta file that has been divided into CDS regions and sorted based on the number of CDS regions.
-s Subject Fasta File A fasta file that has been divided into CDS regions and sorted based on the number of CDS regions.
-o Output File A path to an output file that will be created from this program.
-t Number of Cores The number of Cores available
-d For More Distantly Related Species A flag that implements the -d algorithm, which gives better accuracy for more distantly related species.
-c Combine Both Algorithms Combines both the normal and -d algorithms for the best overall accuracy.
-r Correlation value An optional value for changing the required correlation value between orthologs.
```
Fasta file of CDS regions that have been sorted based on the number of CDS regions for both the dunnart and the mouse
- mouse = `wget ftp://ftp.ensembl.org/pub/release-100/fasta/mus_musculus/cds/Mus_musculus.GRCm38.cds.all.fa.gz`
- dunnart =
# dunnart as query and mouse are query - see how different they are
```
python justOrthologs.py -q orthologTest/mm10_cds.fa -s orthologTest/dunnart_cds.fa -o dunnart_mm10_orthologs -c -d
```
2. Now for gene x in species A mapped to gene y in species B, identify approximate syntenic regulatory domains (e.g. if it were human vs mouse you could use +/-100 Kb around the genes)
- Don’t really get this step, so for two genes that are orthologs take -/+ 100kb either side of the gene for both species and I guess save that region with the orthologous gene name, species and coordinate
For dunnart - look at distance to the nearest gene for all peaks see how many fit into the 100kb range either side.
3. Now ur goal is map elements between these two domains
4. Map elements between these two domains (how should I do this?)
5. For all pairs of elements across the species, compute their gapped k-mer similarity. You can use the gkmSVM kernel for this.
gkmSVM kernel - http://www.beerlab.org/gkmsvm/
tutorial: http://www.beerlab.org/gkmsvm/gkmsvm-tutorial.htm
so the pair is the 100kb upstream in the mouse and dunnart or the downstream in the mouse and dunnart.
R package
```
install.packages('gkmSVM')
```
Sample input files: ctcfpos.bed nr10mers.fa ref.fa alt.fa
The first step is to build the kernel matrix (the pairwise similarity scores for all the sequences in the positive and negative sets).
Positive set:
Negative set:
6. Do this for pairs of elements for all matched domains across species. Lets call these syntenic pairs
7. As a control compute the gkm similarity for pairs of elements in non matched domains. This gives you the null distribution.
- pairs of elements in non-matched domains?
8. Compare the observed similarities of syntenic pairs against the null distribution to obtain statistical significance
9. Do multiple testing correction
10. Pairs that pass, are your cross-species matched enhancers
##################################
# #
# Last modified 2017/11/08 #
# #
# Georgi Marinov #
# #
##################################
import sys
import os
READS_LOG_INTERVAL = 5000000
MILLION = 1000000
def run():
if len(sys.argv) < 2:
print(
'usage: python %s <inputfilename> <bpToKeep | max> [-trim5 bp] '
'[-flowcellID flowcell] [-addEnd 1 | 2] [-replace string '
'newstring | blank] [-renameIDs prefix] [-stdout]' % sys.argv[0])
print(
'\tthe -trim5 option will trim additional bp from the 5 end, '
'i.e. if you want the middle 36bp of 38bp reads, use 36 as bp '
'to keep and 1 as the trim5 argument')
print(
'\tUse - to specify standard input, the script will print'
'(to standard output by default')
print(
'\tThe script can read compressed files as long as they '
'have the correct suffix - .bz2 or .gz')
sys.exit(1)
inputfilename = sys.argv[1]
doMax = False
if sys.argv[2] == 'max':
doMax = True
trim = 'max'
else:
trim = int(sys.argv[2])
outputfilename = inputfilename.split(
'/')[-1].split('.fastq')[0] + '.' + str(trim)+'mers.fastq'
doFlowcellID = False
doStdOut = True
if '-flowcellID' in sys.argv:
doFlowcellID = True
flowcellID = sys.argv[sys.argv.index('-flowcellID')+1]
if doStdOut:
pass
else:
print('will include flowcell ID', flowcellID, 'in reads headers')
doRenameIDs = False
if '-renameIDs' in sys.argv:
doRenameIDs = True
RID = '@' + sys.argv[sys.argv.index('-renameIDs') + 1]
dotrim5 = False
if '-trim5' in sys.argv:
dotrim5 = True
trim5 = int(sys.argv[sys.argv.index('-trim5')+1])
if doStdOut:
pass
else:
print('will trim ', trim5, 'bp from the 5-end')
outputfilename = inputfilename.split(
'.fastq')[0] + '.' + str(trim)+'bp-5prim-trim.fastq'
doAddEnd = False
if '-addEnd' in sys.argv:
doAddEnd = True
END = sys.argv[sys.argv.index('-addEnd')+1]
if doStdOut:
pass
else:
print('will add', '/'+END, 'to read IDs')
doReplace = False
if '-replace' in sys.argv:
doReplace = True
oldstring = sys.argv[sys.argv.index('-replace')+1]
newstring = sys.argv[sys.argv.index('-replace')+2]
if newstring == 'blank':
newstring = ''
if doStdOut:
pass
else:
print('will replace', oldstring, 'with', newstring, 'in read IDs')
if doStdOut:
pass
else:
outfile = open(outputfilename, 'w')
doStdIn = False
if inputfilename != '-':
if inputfilename.endswith('.bz2'):
cmd = 'bzip2 -cd ' + inputfilename
elif inputfilename.endswith('.gz'):
cmd = 'gunzip -c ' + inputfilename
else:
cmd = 'cat ' + inputfilename
p = os.popen(cmd, "r")
else:
doStdIn = True
line = 'line'
i = 0
shorter = 0
if dotrim5:
i = 1
j = 0
while line != '':
if doStdIn:
line = sys.stdin.readline()
else:
line = p.readline()
if line == '':
break
if i == 1 and line[0] == '@':
if doFlowcellID and flowcellID not in line:
ID = '@'+flowcellID+'_'+line.replace(' ', '_')[1:-1]+'\n'
else:
ID = line.replace(' ', '_')
if doReplace:
ID = ID.replace(oldstring, newstring)
if doRenameIDs:
ID = RID + str(j)
if doAddEnd:
ID = ID.strip()+'/'+END+'\n'
i = 2
continue
if i == 2:
i = 3
sequence = line[trim5:len(line)].strip()
continue
if i == 3 and line[0] == '+':
plus = '+\n'
i = 4
continue
if i == 4:
scores = line
i = 1
scores = line[trim5:len(line)].strip()
scores = scores[0:trim]
j = j+1
if j % READS_LOG_INTERVAL == 0:
if doStdOut:
pass
else:
print(str(j/MILLION) + 'M reads processed')
if doMax:
sequence = sequence.replace('.', 'N')
else:
sequence = sequence[0:trim].replace('.', 'N')+'\n'
if doStdOut:
print(ID.strip())
print(sequence.strip())
print(plus.strip())
print(scores)
else:
outfile.write(ID.strip()+'\n')
outfile.write(sequence.strip()+'\n')
outfile.write(plus.strip()+'\n')
outfile.write(scores + '\n')
continue
else:
i = 1
j = 0
while line != '':
if doStdIn:
line = sys.stdin.readline()
else:
line = p.readline()
if line == '':
break
if i == 1 and line[0] == '@':
if doFlowcellID and flowcellID not in line:
ID = '@'+flowcellID+'_'+line.replace(' ', '_')[1:-1]+'\n'
else:
ID = line.replace(' ', '_')
if doReplace:
ID = ID.replace(oldstring, newstring)
if doRenameIDs:
ID = RID + str(j)
if doAddEnd:
ID = ID.strip()+'/'+END+'\n'
i = 2
continue
if i == 2:
i = 3
j = j+1
if j % READS_LOG_INTERVAL == 0:
if doStdOut:
pass
else:
print(str(j/MILLION) + 'M reads processed')
if doMax:
sequence = line
else:
if len(line.strip()) < trim:
shorter += 1
sequence = line.strip().replace('.', 'N')+'\n'
else:
sequence = line[0:trim].replace('.', 'N')+'\n'
continue
if i == 3 and line[0] == '+':
plus = '+\n'
i = 4
continue
if i == 4:
i = 1
if doMax:
scores = line
if doStdOut:
print(ID.strip())
print(sequence.strip())
print(plus.strip())
print(line.strip())
else:
outfile.write(ID)
outfile.write(sequence)
outfile.write(plus)
outfile.write(line)
else:
if len(line.strip()) < trim:
continue
scores = line[0:trim]+'\n'
if doStdOut:
print(ID.strip())
print(sequence.strip())
print(plus.strip())
print(scores.strip())
else:
outfile.write(ID)
outfile.write(sequence)
outfile.write(plus)
outfile.write(scores)
continue
if doStdOut:
pass
else:
outfile.close()
if shorter > 0:
print(shorter, 'sequences shorter than desired length')
run()
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment