@@ -7,6 +7,7 @@ Delft3D/*
!Delft3D/run_all_examples.sh
!Delft3D/run_all_examples_sp.sh
!Delft3D/sed_in_file.tcl
eazy-photoz/inputs
FreeSurfer/buckner_data
FSL/intro
FSL/fmri
@@ -20,6 +21,7 @@ Trimmomatic/.backup
*.fastq
*.fastq.gz
*.fasta
*.fasta.gz
*.faa
*.tar
*.tar.gz
@@ -40,3 +42,4 @@ Trimmomatic/.backup
*.JPEG
*.odb
*.csv
*.data
#!/bin/bash
# To give your job a name, replace "MyJob" with an appropriate name
#SBATCH --job-name=Barrnap-test.slurm
# Run on single CPU
#SBATCH --ntasks=1
# set your minimum acceptable walltime=days-hours:minutes:seconds
#SBATCH -t 0:15:00
# Specify your email address to be notified of progress.
# SBATCH --mail-user=youremailaddress@unimelb.edu
# SBATCH --mail-type=ALL
# Load the environment variables
module purge
module load foss/2019b
module load barrnap/0.9
# Search the draft genome for rRNA genes.
barrnap -o bin2_rrna.fa bin.2.fa
# Check the file bin2_rrna.fa for results
Example derived from the Swedish University of Agricultural Sciences.
https://www.hadriengourle.com/tutorials/
Barrnap (BAsic Rapid Ribosomal RNA Predictor) predicts the location of ribosomal RNA genes in genomes.
A draft genome is provided and searched for rRNA genes. Since these genes are usually quite conserved across
species/genera, the hits can give us a broad idea of the organism.
The bin, bin.2.fa, is also used for the diamond example.
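Barrnap reports its predictions as GFF3, with the rRNA name in the attributes column (column 9). A minimal sketch of tallying the hit types with standard tools — the file name `bin2_rrna.gff` and the sample record are illustrative stand-ins (barrnap prints GFF3 to stdout; only the FASTA goes to `-o`):

```shell
# Sketch: tally rRNA types from barrnap GFF3 output.
# The record below mirrors the 5S hit in the log; the file name
# bin2_rrna.gff is hypothetical.
printf '##gff-version 3\nk141_4428\tbarrnap:0.9\trRNA\t240173\t240282\t4.8e-15\t-\t.\tName=5S_rRNA;product=5S ribosomal RNA\n' > bin2_rrna.gff

# the Name= attribute in column 9 holds the rRNA type; count each type
awk -F'\t' '!/^#/ { split($9, a, /[=;]/); print a[2] }' bin2_rrna.gff | sort | uniq -c
```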
[barrnap] This is barrnap 0.9
[barrnap] Written by Torsten Seemann
[barrnap] Obtained from https://github.com/tseemann/barrnap
[barrnap] Detected operating system: linux
[barrnap] Adding /usr/local/easybuild-2019/easybuild/software/mpi/gcc/8.3.0/openmpi/3.1.4/barrnap/0.9/bin/../binaries/linux to end of PATH
[barrnap] Checking for dependencies:
[barrnap] Found nhmmer - /usr/local/easybuild-2019/easybuild/software/mpi/gcc/8.3.0/openmpi/3.1.4/hmmer/3.2.1/bin/nhmmer
[barrnap] Found bedtools - /usr/local/easybuild-2019/easybuild/software/mpi/gcc/8.3.0/openmpi/3.1.4/bedtools/2.27.1/bin/bedtools
[barrnap] Will use 1 threads
[barrnap] Setting evalue cutoff to 1e-06
[barrnap] Will tag genes < 0.8 of expected length.
[barrnap] Will reject genes < 0.25 of expected length.
[barrnap] Using database: /usr/local/easybuild-2019/easybuild/software/mpi/gcc/8.3.0/openmpi/3.1.4/barrnap/0.9/bin/../db/bac.hmm
[barrnap] Scanning bin.2.fa for bac rRNA genes... please wait
[barrnap] Command: nhmmer --cpu 1 -E 1e-06 --w_length 3878 -o /dev/null --tblout /dev/stdout '/usr/local/easybuild-2019/easybuild/software/mpi/gcc/8.3.0/openmpi/3.1.4/barrnap/0.9/bin/../db/bac.hmm' 'bin.2.fa'
[barrnap] Rejecting short 68 nt predicted 16S_rRNA. Adjust via --reject option.
[barrnap] Rejecting short 205 nt predicted 23S_rRNA. Adjust via --reject option.
[barrnap] Found: 5S_rRNA k141_4428 L=110/119 240173..240282 - 5S ribosomal RNA
[barrnap] Found 1 ribosomal RNA features.
[barrnap] Sorting features and outputting GFF3...
[barrnap] Writing hit sequences to: bin2_rrna.fa
[barrnap] Running: bedtools getfasta -s -name+ -fo 'bin2_rrna.fa' -fi 'bin.2.fa' -bed '/tmp/AZo86tmMHj'
##gff-version 3
k141_4428 barrnap:0.9 rRNA 240173 240282 4.8e-15 - . Name=5S_rRNA;product=5S ribosomal RNA
index file bin.2.fa.fai not found, generating...
[barrnap] Done.
Derived from the Swedish University of Agricultural Sciences
https://www.hadriengourle.com/tutorials/
This file contains the sequence of the pO157 plasmid from the Sakai outbreak strain of E. coli O157.
Available from: curl -O -J -L https://osf.io/rnzbe/download
......
@@ -2,6 +2,8 @@ NOTA BENE: TUTORIAL IS INCOMPLETE. RUNS BUT WITH ERRORS.
Some content derived from the Swedish University of Agricultural Sciences
https://www.hadriengourle.com/tutorials/
BUSCO (Benchmarking Universal Single-Copy Orthologs) can be used to find marker genes in an assembly. Marker genes are conserved
across a range of species, and finding intact conserved genes in an assembly is a good indication of its quality.
......
#!/bin/bash
# To give your job a name, replace "MyJob" with an appropriate name
#SBATCH --job-name=Diamond-test.slurm
# Run with four threads
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
# set your minimum acceptable walltime=days-hours:minutes:seconds
#SBATCH -t 0:15:00
# Specify your email address to be notified of progress.
# SBATCH --mail-user=youremailaddress@unimelb.edu
# SBATCH --mail-type=ALL
# Load the environment variables
module purge
module load diamond/0.9.30
# Run diamond
diamond makedb --in uniprot_sprot.fasta.gz --db uniprot_sprot -p 4
diamond blastx -p 4 -q bin.2.fa -f 6 -d uniprot_sprot.dmnd -o bin2_diamond.txt
Diamond is a sequence aligner for protein and translated DNA searches that functions as a drop-in replacement for the NCBI BLAST
software tools. It is suitable for protein-protein as well as DNA-protein searches on short reads and longer sequences,
including contigs and assemblies, and provides a speedup over BLAST of up to 20,000×.
Here Diamond is run against the SwissProt database to quickly assign taxonomy to our contigs.
It is quite possible that the SwissProt database is too small to yield meaningful hits for undersequenced or poorly known organisms.
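With `-f 6`, diamond emits BLAST-style tabular output (12 columns, ending in e-value and bit score). A common follow-up is keeping only the best hit per contig; a sketch on fabricated rows (the real file from the job above is `bin2_diamond.txt`):

```shell
# Sketch: keep the best hit per query from diamond tabular output.
# Columns (-f 6 default): qseqid sseqid pident length mismatch gapopen
#   qstart qend sstart send evalue bitscore. These rows are made up.
printf 'k141_1\tsp|P11111|A\t91.2\t100\t8\t0\t1\t300\t1\t100\t1e-50\t200\nk141_1\tsp|P22222|B\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-30\t150\nk141_2\tsp|P33333|C\t95.0\t120\t6\t0\t1\t360\t1\t120\t1e-60\t240\n' > demo_diamond.txt

# sort by query id, then bit score descending; keep the first row per query
sort -k1,1 -k12,12gr demo_diamond.txt | awk '!seen[$1]++' > best_hits.txt
cut -f2 best_hits.txt
```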
#!/bin/bash
# To give your job a name, replace "MyJob" with an appropriate name
#SBATCH --job-name=Fastp-test.slurm
# Run this with multiple cores
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
# set your minimum acceptable walltime=days-hours:minutes:seconds
#SBATCH -t 0:15:00
# Specify your email address to be notified of progress.
# SBATCH --mail-user=youremailaddress@unimelb.edu
# SBATCH --mail-type=ALL
# Load the environment variables
module purge
module load fastqc/0.11.9-java-11.0.2
module load fastp/0.20.0
module load bowtie2/2.3.5.1
# Check read quality
cd dolphin/results/
fastqc Dol1_*.fastq.gz
# Removing the adapters and trim by quality with Fastp
fastp -i Dol1_S19_L001_R1_001.fastq.gz -o Dol1_trimmed_R1.fastq \
-I Dol1_S19_L001_R2_001.fastq.gz -O Dol1_trimmed_R2.fastq \
--detect_adapter_for_pe --length_required 30 \
--cut_front --cut_tail --cut_mean_quality 10
# View the html report produced.
# Use the precomputed index files. Extract the bowtie indexes of the dolphin genome into the results directory
tar -xzvf host_genome.tar.gz
# Map sequencing reads on the dolphin genome. How many reads mapped on the dolphin genome?
bowtie2 -x host_genome/Tursiops_truncatus \
-1 Dol1_trimmed_R1.fastq -2 Dol1_trimmed_R2.fastq \
-S dol_map.sam --un-conc Dol_reads_unmapped.fastq --threads 4
Derived from content by the Swedish University of Agricultural Sciences.
https://www.hadriengourle.com/tutorials/
This example investigates metagenomic data and retrieves a draft genome from an assembled metagenome.
It uses a dataset published in 2017 from a study of dolphins, in which fecal samples were prepared for a viral metagenomics study. The
dolphin had a self-limiting gastroenteritis of suspected viral origin.
FastQC is used to check the quality of the data, and fastp trims the poor-quality parts of the reads.
Bowtie2 is used to remove the host sequences by mapping/aligning the reads against the dolphin genome.
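Two quick checks answer "how many reads survived trimming?" and "how many reads mapped?". A minimal sketch on toy data — the demo file names are fabricated; the real files are Dol1_trimmed_R1.fastq and dol_map.sam:

```shell
# Sketch: count reads in a FASTQ and mapped records in a SAM file.
# The sample data below is fabricated for illustration.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > demo_R1.fastq
# a FASTQ record is exactly 4 lines, so reads = lines / 4
echo $(( $(wc -l < demo_R1.fastq) / 4 ))

printf 'r1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n' > demo.sam
# FLAG bit 0x4 means "unmapped"; count alignment records without it
awk '!/^@/ && int($2 / 4) % 2 == 0' demo.sam | wc -l
```

With the samtools module loaded, `samtools flagstat dol_map.sam` reports the same mapping summary directly.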
Started analysis of Dol1_S19_L001_R1_001.fastq.gz
Approx 5% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 10% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 15% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 20% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 25% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 30% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 35% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 40% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 45% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 50% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 55% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 60% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 65% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 70% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 75% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 80% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 85% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 90% complete for Dol1_S19_L001_R1_001.fastq.gz
Approx 95% complete for Dol1_S19_L001_R1_001.fastq.gz
Analysis complete for Dol1_S19_L001_R1_001.fastq.gz
Started analysis of Dol1_S19_L001_R2_001.fastq.gz
Approx 5% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 10% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 15% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 20% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 25% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 30% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 35% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 40% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 45% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 50% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 55% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 60% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 65% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 70% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 75% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 80% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 85% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 90% complete for Dol1_S19_L001_R2_001.fastq.gz
Approx 95% complete for Dol1_S19_L001_R2_001.fastq.gz
Analysis complete for Dol1_S19_L001_R2_001.fastq.gz
Detecting adapter sequence for read1...
No adapter detected for read1
Detecting adapter sequence for read2...
No adapter detected for read2
Read1 before filtering:
total reads: 257617
total bases: 68087857
Q20 bases: 64868031(95.2711%)
Q30 bases: 59552985(87.4649%)
Read2 before filtering:
total reads: 257617
total bases: 68349702
Q20 bases: 62509996(91.4561%)
Q30 bases: 55246319(80.8289%)
Read1 after filtering:
total reads: 256037
total bases: 67566252
Q20 bases: 64538483(95.5188%)
Q30 bases: 59320846(87.7966%)
Read2 aftering filtering:
total reads: 256037
total bases: 67448539
Q20 bases: 62176728(92.184%)
Q30 bases: 55096219(81.6863%)
Filtering result:
reads passed filter: 512074
reads failed due to low quality: 3006
reads failed due to too many N: 0
reads failed due to too short: 154
reads with adapter trimmed: 27072
bases trimmed due to adapters: 472337
Duplication rate: 8.51956%
Insert size peak (evaluated by paired-end reads): 240
JSON report: fastp.json
HTML report: fastp.html
fastp -i Dol1_S19_L001_R1_001.fastq.gz -o Dol1_trimmed_R1.fastq -I Dol1_S19_L001_R2_001.fastq.gz -O Dol1_trimmed_R2.fastq --detect_adapter_for_pe --length_required 30 --cut_front --cut_tail --cut_mean_quality 10
fastp v0.20.0, time used: 17 seconds
host_genome/
host_genome/Tursiops_truncatus.rev.1.bt2
host_genome/Tursiops_truncatus.1.bt2
host_genome/Tursiops_truncatus.rev.2.bt2
host_genome/Tursiops_truncatus.3.bt2
host_genome/Tursiops_truncatus.2.bt2
host_genome/Tursiops_truncatus.4.bt2
256037 reads; of these:
256037 (100.00%) were paired; of these:
255804 (99.91%) aligned concordantly 0 times
151 (0.06%) aligned concordantly exactly 1 time
82 (0.03%) aligned concordantly >1 times
----
255804 pairs aligned concordantly 0 times; of these:
2 (0.00%) aligned discordantly 1 time
----
255802 pairs aligned 0 times concordantly or discordantly; of these:
511604 mates make up the pairs; of these:
511598 (100.00%) aligned 0 times
5 (0.00%) aligned exactly 1 time
1 (0.00%) aligned >1 times
0.09% overall alignment rate
#!/bin/bash
# To give your job a name, replace "MyJob" with an appropriate name
#SBATCH --job-name=KRAKEN-test.slurm
# Run multicore
#SBATCH --ntasks-per-node=1
# Set your minimum acceptable walltime=days-hours:minutes:seconds
#SBATCH -t 0:15:00
# Specify your email address to be notified of progress.
# SBATCH --mail-user=youremailaddress@unimelb.edu
# SBATCH --mail-type=ALL
# Load the environment variables
module purge
module load fastqc/0.11.9-java-11.0.2
module load kraken/1.1-perl-5.30.0
# Use FastQC to check the quality of our data
cd wms/data/sub_100000
fastqc *.fastq
# Assign the database
KRAKEN_DB=/data/gpfs/datasets/KRAKEN/minikraken_20171013_4GB
# Running the KRAKEN loop
# Please change the directory path to suit your location
for i in *_1.fastq
do
prefix=$(basename $i _1.fastq)
# print which sample is being processed
echo $prefix
kraken --db $KRAKEN_DB \
${prefix}_1.fastq ${prefix}_2.fastq > ~/KRAKEN/wms/results/${prefix}.tab
kraken-report --db $KRAKEN_DB \
~/KRAKEN/wms/results/${prefix}.tab > ~/KRAKEN/wms/results/${prefix}_tax.txt
done
Derived from a tutorial from the Swedish University of Agricultural Sciences.
https://www.hadriengourle.com/tutorials/
Kraken is a system for assigning taxonomic labels to short DNA sequences (i.e. reads). Kraken aims to achieve high sensitivity and
high speed by using exact alignments of k-mers and a novel classification algorithm, which makes it extremely fast compared to
traditional approaches (e.g. BLAST).
In this example, samples from the pig gut microbiome are compared with samples from the human gut microbiome.
Whole metagenome sequencing (WMS), or "shotgun metagenome sequencing", is used: a relatively new and powerful sequencing approach that
provides insight into community biodiversity and function. Unlike metabarcoding, where only a specific region of the
bacterial genome (the 16S rRNA gene) is sequenced, WMS aims to sequence all the genomic material present in the environment.
The dataset (subset_wms.tar.gz) contains samples from 3 pigs and 3 humans.
FastQC is used to check the quality of the data.
Kraken includes a script called kraken-report that transforms this file into a "tree" view with the percentage of reads assigned to
each taxon. Once the job has run, take a look at the _tax.txt files in the results directory.
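The `_tax.txt` files are plain tab-separated tables (percentage, clade reads, direct reads, rank code, NCBI taxid, indented name), so the most abundant species can be pulled out with standard tools. A sketch on fabricated rows:

```shell
# Sketch: list species-level (rank code S) entries from a kraken-report
# file, highest percentage first. The rows below are fabricated.
printf '100.00\t100000\t500\tR\t1\troot\n45.20\t45200\t45200\tS\t562\t  Escherichia coli\n30.10\t30100\t30100\tS\t1351\t  Enterococcus faecium\n' > demo_tax.txt

awk -F'\t' '$4 == "S"' demo_tax.txt | sort -t$'\t' -k1,1gr | head -5
```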
sample organism
ERR1135369 pig
ERR1135370 pig
ERR1135375 pig
ERR321573 human
ERR321574 human
ERR321578 human
@@ -2,6 +2,8 @@
Derived from content from the Swedish University of Agricultural Sciences
https://www.hadriengourle.com/tutorials/
## Data collection
M. genitalium was sequenced using the MiSeq platform (2 * 150bp). The reads were deposited in the ENA Short Read Archive under the
......
#!/bin/bash
# To give your job a name, replace "MyJob" with an appropriate name
#SBATCH --job-name=Metagenome-test.slurm
# Run multicore
#SBATCH --ntasks-per-node=4
# Set your minimum acceptable walltime=days-hours:minutes:seconds
#SBATCH -t 0:45:00
# Specify your email address to be notified of progress.
# SBATCH --mail-user=youremailaddress@unimelb.edu
# SBATCH --mail-type=ALL
# Load the environment variables
module purge
module load foss/2019b
module load fastqc/0.11.9-java-11.0.2
module load sickle/1.33
module load megahit/1.1.4-python-3.7.4
module load bowtie2/2.3.5.1
module load samtools/1.9
module load bcftools/1.9
module load metabat/2.12.1-python-2.7.16
module load checkm/1.1.2-python-3.7.4
# Use FastQC to check the quality of our data
cd results/
ln -s ../data/tara_reads_* .
fastqc tara_reads_*.fastq.gz
# Trim the reads using sickle
sickle pe -f tara_reads_R1.fastq.gz -r tara_reads_R2.fastq.gz -t sanger \
-o tara_trimmed_R1.fastq -p tara_trimmed_R2.fastq -s /dev/null
# Assemble with MEGAHIT
megahit -1 tara_trimmed_R1.fastq -2 tara_trimmed_R2.fastq -o tara_assembly
# Map the reads back against the assembly to get coverage information
ln -s tara_assembly/final.contigs.fa .
bowtie2-build final.contigs.fa final.contigs
bowtie2 -x final.contigs -1 tara_reads_R1.fastq.gz -2 tara_reads_R2.fastq.gz | \
samtools view -bS -o tara_to_sort.bam
samtools sort tara_to_sort.bam -o tara.bam
samtools index tara.bam
# Get the bins from metabat
runMetaBat.sh -m 1500 final.contigs.fa tara.bam
mv final.contigs.fa.metabat-bins1500 metabat
# Check the quality of the bins
checkm lineage_wf -x fa metabat checkm/
## Below has been deprecated ##
# Plot the completeness
# checkm bin_qa_plot -x fa checkm metabat plots
# Take a look at plots/bin_qa_plot.png
Material derived from the Swedish University of Agricultural Sciences.
https://www.hadriengourle.com/tutorials/
In this example several applications are used. The example shows how to inspect and assemble metagenomic data and retrieve draft
genomes from assembled metagenomes.
The dataset consists of 20 bacteria selected from the Tara Oceans study, which recovered 957 distinct metagenome-assembled genomes
(MAGs) that were previously unknown.
http://ocean-microbiome.embl.de/companion.html
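MetaBAT writes each bin as its own FASTA file, so a quick sanity check on the result is counting contigs per bin. A sketch on stand-in files (the real bins live under `metabat/` after the job above):

```shell
# Sketch: count contigs per bin; each '>' header line is one contig.
# The directory and bin files below are fabricated stand-ins.
mkdir -p demo_bins
printf '>c1\nACGT\n>c2\nGGCC\n' > demo_bins/bin.1.fa
printf '>c3\nTTAA\n' > demo_bins/bin.2.fa

grep -c '>' demo_bins/*.fa
```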
Content derived from the Swedish University of Agricultural Sciences
https://www.hadriengourle.com/tutorials/
Pilon is a software tool which can be used to automatically improve draft assemblies. It attempts to make improvements to the input
genome, including:
......
Derived from content by the Swedish University of Agricultural Sciences
https://www.hadriengourle.com/tutorials/
Once the Prokka job has finished, examine each of its output files.
The GFF and GBK files contain all of the information about the features annotated (in different formats.)
......
Content derived from the Swedish University of Agricultural Sciences
https://www.hadriengourle.com/tutorials/
QUAST is a software tool that evaluates the quality of genome assemblies by computing various metrics.
The file `m_genitalium.fasta` is produced by the de novo assembly in the MEGAHIT example.
......
# R README file. Last update LL20210127
# Basic Job Submission
This folder contains files for a sample, single-processor R job, which reads in the file tutorial.R, which itself calls the
datasets w1.dat and trees91.csv. It is for leaf biomass of a pine tree in a high CO2 environment. With the `vanilla` option in the
R command, it will create a PDF of the output. The job can be submitted with:
sbatch Rsample.slurm
# There is also an R-parallel script. Note this script does not work (yet). The library Rmpi has a few issues with OpenMPI at the moment which we're working on. It does provide a good example of an error output for MPI :)
Try modifying Rsample.slurm for these simple R scripts
mandy.r is a short R script that calculates the Mandelbrot set through the first 20 iterations of the equation z = z² + c, plotted for different complex
constants c. This example demonstrates:
* use of community-developed external libraries (called packages), in this case caTools package
* handling of complex numbers
* multidimensional arrays of numbers used as basic data type, see variables C, Z and X.
# More Sample R Scripts
Some more sample R scripts from Prof. Andrew Turpin https://people.eng.unimelb.edu.au/aturpin/R/
log.r includes the following functions
* read.table() is used to load data from a text file.
* plot() is used to create axes and labels.
* lines() is used to plot single curves.
* t.test() is used to test if the means of the two data sets differ.
* legend() is used to manually create a legend.
* print() is used to output the graph.
In useawk.r, R commands are used to filter the data and print an average every tenth point instead of printing every point.
Note some of the features on this graph.
* Two plots are added to the same set of axes.
* R is used to process the data prior to plotting.
twoxes.r is useful for plotting two curves that have the same domain but different ranges.
* plot() is used to draw plots of the graph with desired mark type and line style.
use type ='n' to suppress the drawing of axes, then
use axis to draw each axis with desired style.
* since the two y-axes are of different ranges, 2 separate plots are used to draw plots for the 1st and 2nd y-axis.
par(new=TRUE) is used so that the two plots end up in a single graph.
* each axis needs major and minor ticks, axis() is called twice to draw minor and major ticks.
* use 'pos' in y-axis (x-axis) to specify the location of y-axis (x-axis) relative to x-axis (y-axis).
* use legend() to draw legend manually.
* mtext() is used for drawing the label of the 2nd y-axis.
strings.r uses the text function to place text on a graph. The most basic form of the function is
* text(x, y, "text")
* This places the word "text" at the point (x, y).
* x, y and text can all be vectors. If vectors of different lengths are used, the shorter vectors are recycled.
ebars.r includes
* Shifting of points to left and right to separate out the points shown at each integer level.
* Use of a function (with variable number of arguments)
* Use of the arrows() function to draw error bars
bargraph.r is a simple bar graph that makes use of the transpose function t()
# Tips 'n' tricks
## "join"ing two files
f1 <- read.table("data1",col.names=c("word","freq"))
f2 <- read.table("data2",col.names=c("word","length"))
d <- merge(f1,f2,by="word")
d now has three columns word, freq, length
# Parallel R
There is an example script for running R with MPI, RMpi.
# Extensions
sbatch r-parallel.slurm
less xvalidate.out
There is also a LIBRARY file which describes how to access R extensions.
......@@ -5,7 +5,7 @@
#SBATCH --time=1
module purge
-/usr/local/module/spartan_old.sh
+source /usr/local/module/spartan_old.sh
module load R/3.5.0-spartan_gcc-6.2.0
module load Rmpi
......
Host: spartan-rcg001 Rank(ID): 1 of Size: 5 on comm 1
[1] "Done"
Host: spartan-rcg001 Rank(ID): 2 of Size: 5 on comm 1
[1] "Done"
Host: spartan-rcg001 Rank(ID): 3 of Size: 5 on comm 1
[1] "Done"
Host: spartan-rcg002 Rank(ID): 4 of Size: 5 on comm 1
[1] "Done"
#postscript("log.ps")
#pdf("bargraph.pdf")
data = read.table("bargraph.data")
# Note: need to take transpose of data [ t(data) ] because of source
# table format
barplot(t(data),
beside=T,
legend.text=c("Method A", "Method B"),
ylab="Elapsed time (millisec)",
xlab="Data set")
#dev.off() # This is only needed if you use pdf/postscript in interactive mode
#
# Example of plotting R graphs with error bars
# Modelled precisely on Justin Zobel's example in jgraph
#
# Author: andrew.turpin@rmit.edu.au
# Date: Fri Apr 20 16:38:28 EST 2007
#
# uncomment this for pdf output to file
#pdf("ebars.pdf", width=7, height=5, family="Times", paper="a4")
#postscript("ebars.ps", width=7, height=5, family="Times", paper="a4")
# A little function I wrote to draw a point at (x,y)
# and a vertical error bar from (x, y-len) to (x, y+len)
# assumes plotting axis already set up
# Note the use of ... to pass on unspecified arguments
plotPandB <- function(x, y, len, sfrac=0.01, ...) {
points(x, y, ...)
arrows(x, y, x, y+len, length=par("fin")[1] * sfrac, angle=90)
arrows(x, y, x, pmax(0, y-len), length=par("fin")[1] * sfrac, angle=90)
}
# Read in the 4 data sets into four list elements
# Only read the first 12 rows as in jz's original eg
d <- list(1:4)
for (i in 1:4)
d[[i]] <- read.table(
paste("ebars-d",i,".data",sep=""),
nrows = 12
)
#set up the axis
plot(0,0,
xlim=c(0, 12),
ylim=c(0, 0.045),
type="n",
xlab = "Level",
ylab = "Av. probability at level"
)
# draw points (x offsets as per jz's original jgraph)
# square x circle triangle
markType <- c(22, 120, 19, 24)
backGs <- c("black", "white", "gray", "white")
cols <- c("black", "gray" , "gray", "gray")
offsets <- c(-0.24, -0.08, +0.08, +0.24)
for(i in 1:4)
plotPandB(1:length(d[[i]][,1]) + offsets[i], d[[i]][,1], d[[i]][,2],
pch = markType[i],
bg = backGs[i],
col = cols[i],
sfrac = 0.005
)
# plonk on a legend
legend(8.5, 0.025,
c("Major", "Upper", "Lower", "Minor"),
pch=markType,
col = cols,
pt.bg = backGs
)
#dev.off() # This is only needed if you use pdf/postscript in interactive mode
#postscript("log.ps")
#pdf("log.pdf")
# Load second data table.
data_table_2 <- read.table("log-d2.data", col.names=c("DocLen2","NumDoc2"))
plot(data_table_2,
type="l",
xlab="Document length",
ylab="Number of documents",
log="y",
lwd=3)
# Load first data table.
data_table_1 <- read.table("log-d1.data", col.names=c("DocLen1","NumDoc1"))
lines(data_table_1, lwd="1")
legend(50,50,legend=c("Data set 1", "Data set 2"), lwd=c(1,3))
# a t.test just for fun!
t <- t.test(data_table_2$DocLen2, data_table_2$NumDoc2, paired=TRUE)
print(paste("t-test pvalue = ",t$p.value))
#dev.off() # This is only needed if you use pdf/postscript in interactive mode
library(caTools) # external package providing write.gif function
jet.colors <- colorRampPalette(c("red", "blue", "#007FFF", "cyan", "#7FFF7F",
"yellow", "#FF7F00", "red", "#7F0000"))
dx <- 1500 # define width
dy <- 1400 # define height
C <- complex(real = rep(seq(-2.2, 1.0, length.out = dx), each = dy),
imag = rep(seq(-1.2, 1.2, length.out = dy), dx))
C <- matrix(C, dy, dx) # reshape as square matrix of complex numbers
Z <- 0 # initialize Z to zero
X <- array(0, c(dy, dx, 20)) # initialize output 3D array
for (k in 1:20) { # loop with 20 iterations
Z <- Z^2 + C # the central difference equation
X[, , k] <- exp(-abs(Z)) # capture results
}
write.gif(X, "Mandelbrot.gif", col = jet.colors, delay = 100)
#postscript(family="Times", 'multi.ps', width=40)
#pdf(family="Times", 'multi.pdf', width=40)
# draw points (offsets is for color)
offsets <- c(0.0, 0.3, 0.3, 0.6, 0.6)
linethick <- c(0.5, 0.5, 0.5, 1.5, 1.5)
leg.txt <- c(
"actual p(novel)",
"p'(novel), mehtod X",
"p'(novel), method X+stdev",
"p'(novel), method X-fac",
"p'(novel), method X-fac+stdev ")
#read source file
data_table <- read.table("multi.data",
col.names=c("vx1","vy1","vy2","vy3","vy4","vy5","vy6","vy7"))
#cbind() forms matrices by binding together matrices horizontally,
#or column-wise, and rbind() vertically, or row-wise
line1node <- cbind(data_table$vx1, data_table$vy1)
#draw line1, xaxt="n" means not draw x axis now;
# lty=1 means solid line, lwd means line width
plot(line1node,
xaxt="n", yaxt="n",
xlab="Probability of novel symbol",
ylab="Number of symbol occurrences",
log = "x",
main="Figure multiple lines and custom ticks",
ylim=c(0.4,1),
type = "l",
col=gray(offsets[1]),
lty=1, lwd=linethick[1])
# 1 means draw ticks on the bottom. At determines place making a tick
axis(1, at=c(1,2,3,4,5,6,7,8,9,10,20,30,40,50), font.axis=6 )
# 2 means draw ticks on the left.
axis(2, at=c(0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0) , font.axis=6 )
#draw line 2. lty=3 means dotted
line2node <- cbind(data_table$vx1, data_table$vy2)
lines(line2node, type="l", col=gray(offsets[2]), lty=3, lwd=linethick[2]);
# remove those rows if vy3 values above 1 and assign those
# satisfied rows to line3smallnode
line3smallnode <- subset( data_table, !(vy3 > 1) )
# draw line3. lty=2 means dashed
line3node <- cbind(line3smallnode$vx1, line3smallnode$vy3)
lines(line3node, type="l", col=gray(offsets[3]), lty=2, lwd=linethick[3]);
# draw line4. lty=2 means dashed
line4node <- cbind(data_table$vx1, data_table$vy5)
lines(line4node, type="l", col=gray(offsets[4]), lty=1, lwd=linethick[4]);
# draw line5. (lty=2 == lty="dashed")
lines(data_table$vx1, data_table$vy6, type="l",
col=gray(offsets[5]), lty="dashed", lwd=linethick[5]);
# 10,0.7 is the position of legend. leg.txt is each line title.
# bty="n" remove the box around legend
# cex=0.9 means legend is 0.9 time of current font size
legend(10,0.7, leg.txt, lty = c(1,3,2,1,2),
col = gray(offsets),lwd=linethick, bty="n", cex = 0.9)
#dev.off() # This is only needed if you use pdf/postscript in interactive mode
......@@ -4,7 +4,7 @@
#SBATCH --time=00:10:00
module purge
-/usr/local/module/spartan_old.sh
+source /usr/local/module/spartan_old.sh
module load R/3.5.0-spartan_gcc-6.2.0
module load Rmpi
......
#
# Example of placing text on R graphs
# Modelled on Justin Zobel's example in jgraph
#
# Author: Michael C. Harris <miharris@cs.rmit.edu.au>
# Date: Tue May 1 13:56:01 EST 2007
postscript("strings.ps", family="Times")
#pdf("strings.pdf", family="serif")
# Read data into vectors
space <- sort(c(44.7, 97.8, 158.1, 173.7, 31.7, 1, 300.0))
time <- sort(c(458.4, 71.8, 18.9, 1.45, 895.6, 7564.5, 0.95), TRUE)
# set the box type "L"
par(bty="l")
# Plot the data
plot(space, time, log="y",
pch=20,
xlab="Space overhead (%)",
ylab="Time (ms)")
# Draw labels in font 3 (italics) to the left of the points.
# The first point is too close to the Y axis, so
# draw the text below the point by passing a vector to pos.
text(space, time, LETTERS[1:7], pos=c(1,rep(2,6)), font=3)
# Add line segments
# Magic numbers are the edge of the plot
# Vertical segments
segments(space, time, space, 10000, lty=2)
# Horizontal segments
segments(space, time, 320, time, lty=2)
dev.off() # This is only needed if you use pdf/postscript in interactive mode
# A graph with 2 y axes.
#postscript("twoaxes.ps")
#pdf("twoaxes.pdf")
# Load data table.
data_table <- read.table("twoaxes.data", col.names=c("X","Y1","Y2"))
# prepare params of the three axes, including labels and ticks.
xMin <- 25
xMax <- 50
xMajorTick = 5
xMinorTick = 1
y1Min <- 0
y1Max <- 1800
y1MajorTick = 500
y1MinorTick = 100
y2Min <-0
y2Max <-100
y2MajorTick = 20
y2MinorTick = 10
# axes' major ticks and labels
xMajorLab = seq(xMin, xMax, xMajorTick);
y1MajorLab = seq(y1Min, y1Max, y1MajorTick);
y2MajorLab = seq(y2Min, y2Max, y2MajorTick);
# minor ticks and no labels; num ticks == num intervals + 1
xMinorLab<- rep("",((xMax - xMin)/xMinorTick) +1)
y1MinorLab<- rep("",((y1Max - y1Min)/y1MinorTick) +1)
y2MinorLab<- rep("",((y2Max - y2Min)/y2MinorTick) +1)
# axes title strings
xText <- "List Length"
y1Text <- "Total size (megabytes)"
y2Text <- "Space wastage (%)"
# mark types
y1Mark= 21 #circle
y2Mark= 15 #filled box
# legend strings
y1Legend<- "Total size"
y2Legend<- "Space wastage"
# increase margin to make sure axis titles are displayed.
par(mar = c(4, 4, 2, 4))
plot(data_table$X, data_table$Y1, type="o", axes=FALSE,
xlim = c(xMin,xMax) , ylim =c(y1Min, y1Max),
xlab=xText, ylab=y1Text, pch=y1Mark)
# cex = 1.5
# X axis
axis(1, at=seq(xMin,xMax,xMinorTick), labels=xMinorLab,
tick=T, tck=-0.01, pos=y1Min)
axis(1, at=xMajorLab,labels=xMajorLab, tick=T, tck=-0.03, pos=y1Min)
#cex.axis = 1.2
# Y1 axis , las -- label direction.
axis(2, at=seq(y1Min,y1Max,y1MinorTick), labels=y1MinorLab,
tick=T, tck=-0.01, pos=xMin)
axis(2, at=y1MajorLab, labels=y1MajorLab, tick=T, tck=-0.03, pos=xMin,
las=2)
# make sure next plot is on same graph
par(new = T)
plot(data_table$X, data_table$Y2, type="o", axes=FALSE,
xlim = c(xMin,xMax) , ylim =c(y2Min, y2Max),
xlab="", ylab="", pch=y2Mark)
mtext(y2Text, 4, 2)
# Y2 axis
axis(4, at=seq(y2Min,y2Max,y2MinorTick), labels=y2MinorLab,
tick=T, tck=-0.01, pos=xMax)
axis(4, at= y2MajorLab, labels=y2MajorLab, tick=T, tck=-0.03, pos=xMax, las=2)
# Legend, bty="n" -- no frame,
legend(38, 20, c(y1Legend, y2Legend), bty="n",
lty=c(1,1), pch= c(y1Mark, y2Mark))
#dev.off() # This is only needed if you use pdf/postscript in interactive mode
# specify output file and width
#postscript(file="useawk.ps", width=40)
#pdf (file="useawk.pdf",width=40)
# read in two files
mtx1 <- read.table("useawk-d1.data")
mtx2 <- read.table("useawk-d2.data")
#init the arrays to hold the mean values
#we will need 200 slots at most
meanArr1<-rep(NA, 200)
meanArr2<-rep(NA, 200)
# the x label values
x<-rep(NA, 200)
# leave the first spot for 0
x[1] <- 0
for(i in 1:200){
meanArr1[i] <- mean(mtx1[((i-1)*10):(i*10),5])
# i+1 because we left the first slot for 0
x[i+1] = mtx1[i,1]
}
for(i in 1:200){
meanArr2[i] <- mean(mtx2[((i-1)*10):(i*10),5])
}
plot(meanArr1,col="gray", #line colour
xlim=c(0,160),
ylim=c(0,10000),
t="l", # we want a line graph
lwd=1.5, # how thick the line should be
ylab="Number of new terms",
xlab="Number of documents",
xaxt="n", yaxt="n")
# plot the x axis using the label values in x
axis(1, 1:201, x)
# plot the y axis
axis(2)
# make sure the two graphs appear on the same axis
par(new=TRUE)
plot(meanArr2,col="black",
xlim=c(0,160),
ylim=c(0,10000), t="l",
xlab="", # we left these empty because the previous graph has
ylab="", # done them
lwd=1.5,
xaxt="n",
yaxt="n")
axis(1, at=NULL, labels=NA)
axis(2)
# draw the legend
legend("topright", c("Without stemming","With stemming"),
       col=c("black", "gray"), lty=c(1,1), lwd=2)
#dev.off() # This is only needed if you use pdf/postscript in interactive mode
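The block-mean loop above averages each consecutive run of 10 rows into a single value. The same idea can be sketched in Python (a hypothetical illustration with made-up data, assuming the input length is a multiple of the block size):

```python
def block_means(values, block_size=10):
    """Return the mean of each consecutive, non-overlapping block of
    `block_size` values, mirroring the R loop above."""
    return [
        sum(values[i:i + block_size]) / block_size
        for i in range(0, len(values), block_size)
    ]

data = list(range(1, 31))   # 30 made-up values: 1..30
print(block_means(data))    # -> [5.5, 15.5, 25.5]
```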
#!/bin/bash
# To give your job a name, replace "MyJob" with an appropriate name
#SBATCH --job-name=RBio-test.slurm
# Run multicore
#SBATCH --ntasks-per-node=4
# Set your minimum acceptable walltime=days-hours:minutes:seconds
#SBATCH -t 5:00:00
# Specify your email address to be notified of progress.
# SBATCH --mail-user=youreamiladdress@unimelb.edu
# SBATCH --mail-type=ALL
# Load the environment variables
module purge
module load foss/2019b
module load r-bundle-bioconductor/3.10-r-3.6.2
module load web_proxy
# Running the R script
R --vanilla < Metabarcoding.R
# Load packages
library(dada2)
library(ggplot2)
library(phyloseq)
library(phangorn)
library(DECIPHER)
packageVersion('dada2')
# Check the dataset
path <- 'MiSeq_SOP'
list.files(path)
# Filter and Trim
raw_forward <- sort(list.files(path, pattern="_R1_001.fastq",
full.names=TRUE))
raw_reverse <- sort(list.files(path, pattern="_R2_001.fastq",
full.names=TRUE))
# we also need the sample names
sample_names <- sapply(strsplit(basename(raw_forward), "_"),
`[`, # extracts the first element of a subset
1)
# Plot initial read quality. The forward reads are good quality, while the reverse reads are considerably worse.
plotQualityProfile(raw_forward[1:2])
plotQualityProfile(raw_reverse[1:2])
# DADA2 requires us to define the names of our output files
# place filtered files in filtered/ subdirectory
filtered_path <- file.path(path, "filtered")
filtered_forward <- file.path(filtered_path,
paste0(sample_names, "_R1_trimmed.fastq.gz"))
filtered_reverse <- file.path(filtered_path,
paste0(sample_names, "_R2_trimmed.fastq.gz"))
# Use standard filtering parameters: maxN=0 (DADA2 allows no Ns), truncQ=2, rm.phix=TRUE and maxEE=2. The maxEE parameter sets the maximum
# number of “expected errors” allowed in a read, which according to the USEARCH authors is a better filter than simply averaging quality
# scores.
out <- filterAndTrim(raw_forward, filtered_forward, raw_reverse,
filtered_reverse, truncLen=c(240,160), maxN=0,
maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE,
multithread=TRUE)
head(out)
# Learn and plot the Error Rates
# The DADA2 algorithm depends on a parametric error model, and every amplicon dataset has a slightly different error rate. The learnErrors
# function learns the error model from the data, allowing DADA2 to fit its method to your dataset.
errors_forward <- learnErrors(filtered_forward, multithread=TRUE)
errors_reverse <- learnErrors(filtered_reverse, multithread=TRUE)
plotErrors(errors_forward, nominalQ=TRUE) +
theme_minimal()
# Dereplication
# From the DADA2 documentation: dereplication combines all identical sequencing reads into “unique sequences” with a corresponding
# “abundance”: the number of reads with that unique sequence. Dereplication substantially reduces computation time by eliminating redundant
# comparisons.
derep_forward <- derepFastq(filtered_forward, verbose=TRUE)
derep_reverse <- derepFastq(filtered_reverse, verbose=TRUE)
# name the derep-class objects by the sample names
names(derep_forward) <- sample_names
names(derep_reverse) <- sample_names
# Sample inference
# We are now ready to apply the core sequence-variant inference algorithm to the dereplicated data.
dada_forward <- dada(derep_forward, err=errors_forward, multithread=TRUE)
dada_reverse <- dada(derep_reverse, err=errors_reverse, multithread=TRUE)
# inspect the dada-class object
dada_forward[[1]]
# Merge Paired-end Reads
# Now that the reads are trimmed, dereplicated and error-corrected we can merge them together
merged_reads <- mergePairs(dada_forward, derep_forward, dada_reverse,
derep_reverse, verbose=TRUE)
# inspect the merger data.frame from the first sample
head(merged_reads[[1]])
# Construct Sequence Table
# Construct a sequence table of our mouse samples, a higher-resolution version of the OTU table produced by traditional methods.
seq_table <- makeSequenceTable(merged_reads)
dim(seq_table)
# inspect distribution of sequence lengths
table(nchar(getSequences(seq_table)))
# Remove Chimeras
# The dada method used earlier removes substitutions and indel errors but chimeras remain.
seq_table_nochim <- removeBimeraDenovo(seq_table, method='consensus',
multithread=TRUE, verbose=TRUE)
dim(seq_table_nochim)
# what percentage of our reads did we keep?
sum(seq_table_nochim) / sum(seq_table)
# Check the number of reads that made it through each step in the pipeline
get_n <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dada_forward, get_n), sapply(merged_reads, get_n),
rowSums(seq_table), rowSums(seq_table_nochim))
colnames(track) <- c('input', 'filtered', 'denoised', 'merged', 'tabled',
'nonchim')
rownames(track) <- sample_names
head(track)
# Assign Taxonomy
# Assign taxonomy to the sequences using the SILVA database
taxa <- assignTaxonomy(seq_table_nochim,
'MiSeq_SOP/silva_nr_v128_train_set.fa.gz',
multithread=TRUE)
taxa <- addSpecies(taxa, 'MiSeq_SOP/silva_species_assignment_v128.fa.gz')
# for inspecting the classification
taxa_print <- taxa # removing sequence rownames for display only
rownames(taxa_print) <- NULL
head(taxa_print)
# Phylogenetic Tree
# DADA2 is reference-free so we have to build the tree ourselves
# Align our sequences
sequences <- getSequences(seq_table)
names(sequences) <- sequences # this propagates to the tip labels of the tree
alignment <- AlignSeqs(DNAStringSet(sequences), anchor=NA)
# Build a neighbour-joining tree then fit a maximum likelihood tree using the neighbour-joining tree as a starting point.
phang_align <- phyDat(as(alignment, 'matrix'), type='DNA')
dm <- dist.ml(phang_align)
treeNJ <- NJ(dm) # note, tip order != sequence order
fit = pml(treeNJ, data=phang_align)
## negative edges length changed to 0!
fitGTR <- update(fit, k=4, inv=0.2)
fitGTR <- optim.pml(fitGTR, model='GTR', optInv=TRUE, optGamma=TRUE,
rearrangement = 'stochastic',
control = pml.control(trace = 0))
detach('package:phangorn', unload=TRUE)
# Phyloseq. First load the metadata
sample_data <- read.table(
'https://hadrieng.github.io/tutorials/data/16S_metadata.txt',
header=TRUE, row.names="sample_name")
# Construct a phyloseq object from our output and newly created metadata
physeq <- phyloseq(otu_table(seq_table_nochim, taxa_are_rows=FALSE),
sample_data(sample_data),
tax_table(taxa),
phy_tree(fitGTR$tree))
# remove mock sample
physeq <- prune_samples(sample_names(physeq) != 'Mock', physeq)
physeq
# Look at the alpha diversity of our samples
plot_richness(physeq, x='day', measures=c('Shannon', 'Fisher'), color='when') +
theme_minimal()
# Ordination methods (beta diversity)
# Perform an MDS with euclidean distance (mathematically equivalent to a PCA)
ord <- ordinate(physeq, 'MDS', 'euclidean')
plot_ordination(physeq, ord, type='samples', color='when',
title='PCA of the samples from the MiSeq SOP') +
theme_minimal()
# With the Bray-Curtis distance (an NMDS rather than a PCA)
ord <- ordinate(physeq, 'NMDS', 'bray')
plot_ordination(physeq, ord, type='samples', color='when',
                title='NMDS of the samples from the MiSeq SOP') +
    theme_minimal()
# Distribution of the most abundant families
top20 <- names(sort(taxa_sums(physeq), decreasing=TRUE))[1:20]
physeq_top20 <- transform_sample_counts(physeq, function(OTU) OTU/sum(OTU))
physeq_top20 <- prune_taxa(top20, physeq_top20)
plot_bar(physeq_top20, x='day', fill='Family') +
facet_wrap(~when, scales='free_x') +
theme_minimal()
# Place them in a tree
bacteroidetes <- subset_taxa(physeq, Phylum %in% c('Bacteroidetes'))
plot_tree(bacteroidetes, ladderize='left', size='abundance',
color='when', label.tips='Family')
group dpw
F3D0 0
F3D1 1
F3D141 141
F3D142 142
F3D143 143
F3D144 144
F3D145 145
F3D146 146
F3D147 147
F3D148 148
F3D149 149
F3D150 150
F3D2 2
F3D3 3
F3D5 5
F3D6 6
F3D7 7
F3D8 8
F3D9 9
group time
F3D0 Early
F3D1 Early
F3D141 Late
F3D142 Late
F3D143 Late
F3D144 Late
F3D145 Late
F3D146 Late
F3D147 Late
F3D148 Late
F3D149 Late
F3D150 Late
F3D2 Early
F3D3 Early
F3D5 Early
F3D6 Early
F3D7 Early
F3D8 Early
F3D9 Early
pcr.seqs(fasta=silva.bacteria.fasta, start=11894, end=25319, keepdots=F)
system(mv silva.bacteria.pcr.fasta silva.v4.fasta)
#change the name of the file from stability.files to whatever suits your study
make.contigs(file=stability.files, processors=8)
screen.seqs(fasta=current, group=current, maxambig=0, maxlength=275)
unique.seqs()
count.seqs(name=current, group=current)
align.seqs(fasta=current, reference=silva.v4.fasta)
screen.seqs(fasta=current, count=current, start=1968, end=11550, maxhomop=8)
filter.seqs(fasta=current, vertical=T, trump=.)
unique.seqs(fasta=current, count=current)
pre.cluster(fasta=current, count=current, diffs=2)
chimera.uchime(fasta=current, count=current, dereplicate=t)
remove.seqs(fasta=current, accnos=current)
classify.seqs(fasta=current, count=current, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80)
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
remove.groups(count=current, fasta=current, taxonomy=current, groups=Mock)
cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15)
make.shared(list=current, count=current, label=0.03)
classify.otu(list=current, count=current, taxonomy=current, label=0.03)
phylotype(taxonomy=current)
make.shared(list=current, count=current, label=1)
classify.otu(list=current, count=current, taxonomy=current, label=1)
F3D0 F3D0_S188_L001_R1_001.fastq F3D0_S188_L001_R2_001.fastq
F3D141 F3D141_S207_L001_R1_001.fastq F3D141_S207_L001_R2_001.fastq
F3D142 F3D142_S208_L001_R1_001.fastq F3D142_S208_L001_R2_001.fastq
F3D143 F3D143_S209_L001_R1_001.fastq F3D143_S209_L001_R2_001.fastq
F3D144 F3D144_S210_L001_R1_001.fastq F3D144_S210_L001_R2_001.fastq
F3D145 F3D145_S211_L001_R1_001.fastq F3D145_S211_L001_R2_001.fastq
F3D146 F3D146_S212_L001_R1_001.fastq F3D146_S212_L001_R2_001.fastq
F3D147 F3D147_S213_L001_R1_001.fastq F3D147_S213_L001_R2_001.fastq
F3D148 F3D148_S214_L001_R1_001.fastq F3D148_S214_L001_R2_001.fastq
F3D149 F3D149_S215_L001_R1_001.fastq F3D149_S215_L001_R2_001.fastq
F3D150 F3D150_S216_L001_R1_001.fastq F3D150_S216_L001_R2_001.fastq
F3D1 F3D1_S189_L001_R1_001.fastq F3D1_S189_L001_R2_001.fastq
F3D2 F3D2_S190_L001_R1_001.fastq F3D2_S190_L001_R2_001.fastq
F3D3 F3D3_S191_L001_R1_001.fastq F3D3_S191_L001_R2_001.fastq
F3D5 F3D5_S193_L001_R1_001.fastq F3D5_S193_L001_R2_001.fastq
F3D6 F3D6_S194_L001_R1_001.fastq F3D6_S194_L001_R2_001.fastq
F3D7 F3D7_S195_L001_R1_001.fastq F3D7_S195_L001_R2_001.fastq
F3D8 F3D8_S196_L001_R1_001.fastq F3D8_S196_L001_R2_001.fastq
F3D9 F3D9_S197_L001_R1_001.fastq F3D9_S197_L001_R2_001.fastq
Mock Mock_S280_L001_R1_001.fastq Mock_S280_L001_R2_001.fastq
Derived from an example from the Swedish University of Agricultural Sciences.
https://www.hadriengourle.com/tutorials/
The example job performs metabarcoding using a DADA2 pipeline. It uses the MiSeq SOP dataset but analyses it with DADA2.
The MiSeq SOP data was acquired with the following commands:
wget http://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip
unzip MiSeqSOPData.zip
rm -r __MACOSX/
cd MiSeq_SOP
wget https://zenodo.org/record/824551/files/silva_nr_v128_train_set.fa.gz
wget https://zenodo.org/record/824551/files/silva_species_assignment_v128.fa.gz
DADA2 analyses amplicon data using exact sequence variants instead of OTUs. The advantages of the DADA2 method are described in the following paper:
http://dx.doi.org/10.1038/ismej.2017.119
The job submission script loads the R Bioconductor module, which provides a matching version of R. It then runs the R script in vanilla mode (producing a PDF
of the plots).
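The maxEE filter used by filterAndTrim in the script above is based on the sum of per-base error probabilities implied by the Phred quality scores (EE = sum of 10^(-Q/10) over the read). A minimal sketch of that calculation; the quality string here is made up for illustration:

```python
def expected_errors(quality_string, offset=33):
    """Sum of per-base error probabilities from a Phred+33 quality string.
    Each character encodes Q = ord(c) - offset, with error probability 10^(-Q/10)."""
    return sum(10 ** (-(ord(c) - offset) / 10) for c in quality_string)

# A read whose bases are all Q20 (ASCII '5' = 53, 53 - 33 = 20)
# has an error probability of 0.01 per base.
quals = "5" * 100
print(round(expected_errors(quals), 3))  # -> 1.0, so this read passes maxEE=2
```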
#!/bin/bash
# To give your job a name, replace "MyJob" with an appropriate name
#SBATCH --job-name=Roary-test.slurm
# Allocate four CPUs
#SBATCH --ntasks-per-node=4
# Set your minimum acceptable walltime=days-hours:minutes:seconds
# Yes, this is a fair-sized job
#SBATCH -t 10:00:00
# Specify your email address to be notified of progress.
# SBATCH --mail-user=youreamiladdress@unimelb.edu
# SBATCH --mail-type=ALL
# Load the environment variables
module purge
module load foss/2019b
module load fastqc/0.11.9-java-11.0.2
module load megahit/1.1.4-python-3.7.4
module load prokka/1.14.5
module load enabrowsertool/1.5.4-python-3.7.4
module load roary/3.12.0-perl-5.30.0
module load web_proxy
# Get some random strains from the dataset
cut -d',' -f 1 journal.pcbi.1006258.s010.csv | tail -n +2 | shuf | head -10 > strains.txt
# Download reads for each strain from ENA
cat strains.txt | parallel enaGroupGet -f fastq {}
mkdir reads
mv ERS*/*/*.fastq.gz reads/
# rm -r ERS*
# Assemble with Megahit
mkdir assemblies
for r1 in reads/*_1.fastq.gz
do
prefix=$(basename $r1 _1.fastq.gz)
r2=reads/${prefix}_2.fastq.gz
megahit -1 $r1 -2 $r2 -o ${prefix} --out-prefix ${prefix}
mv ${prefix}/${prefix}.contigs.fa assemblies/
rm -r ${prefix}
done
# Prokka to annotate
mkdir annotation
for assembly in assemblies/*.fa
do
prefix=$(basename $assembly .contigs.fa)
prokka --usegenus --genus Escherichia --species coli --strain ${prefix} \
--outdir ${prefix} --prefix ${prefix} ${assembly}
mv ${prefix}/${prefix}.gff annotation/
# rm -r ${prefix}
done
# Run roary for pan-genomic analysis
roary -f roary -e -n -v annotation/*.gff
# Prepare for visualising plots
cd roary
FastTree -nt -gtr core_gene_alignment.aln > my_tree.newick
This example is derived partially from the Swedish University of Agricultural Sciences and from the Genome Annotation and Pangenome Analysis tutorial from the CBIB in Santiago, Chile. It illustrates how to determine a pan-genome from a collection of isolate genomes.
https://www.hadriengourle.com/tutorials/
It starts with a dataset of Escherichia coli from the article "Prediction of antibiotic resistance in Escherichia coli from large-scale
pan-genome data", available at https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006258
The CSV file has been obtained from https://doi.org/10.1371/journal.pcbi.1006258.s010
Random strains are selected from the file, their reads are downloaded in FASTQ format with enaGroupGet, and the reads are placed in their own directory (`reads`).
The strains are assembled with MEGAHIT and then annotated with Prokka.
All the .gff files are placed in the same folder and roary is run against them, getting the coding sequences, converting them into protein,
creating pre-clusters, and checking for paralogs. Every isolate is taken and ordered by presence or absence of orthologs. The summary is in
the file summary_statistics.txt, along with gene_presence_absence.csv, which includes the gene name and annotation, and whether the gene is
present in each genome. Finally, a tree file is created with FastTree.
#!/usr/bin/env python
# Copyright (C) <2015> EMBL-European Bioinformatics Institute
# This program is free software: you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation, either version 3 of
# the License, or (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# Neither the institution name nor the name roary_plots
# can be used to endorse or promote products derived from
# this software without prior written permission.
# For written permission, please contact <marco@ebi.ac.uk>.
# Products derived from this software may not be called roary_plots
# nor may roary_plots appear in their names without prior written
# permission of the developers. You should have received a copy
# of the GNU General Public License along with this program.
# If not, see <http://www.gnu.org/licenses/>.
__author__ = "Marco Galardini"
__version__ = '0.1.0'
def get_options():
import argparse
# create the top-level parser
description = "Create plots from roary outputs"
parser = argparse.ArgumentParser(description = description,
prog = 'roary_plots.py')
parser.add_argument('tree', action='store',
help='Newick Tree file', default='accessory_binary_genes.fa.newick')
parser.add_argument('spreadsheet', action='store',
help='Roary gene presence/absence spreadsheet', default='gene_presence_absence.csv')
parser.add_argument('--labels', action='store_true',
default=False,
help='Add node labels to the tree (up to 10 chars)')
parser.add_argument('--format',
choices=('png',
'tiff',
'pdf',
'svg'),
default='png',
help='Output format [Default: png]')
parser.add_argument('-N', '--skipped-columns', action='store',
type=int,
default=14,
help='First N columns of Roary\'s output to exclude [Default: 14]')
parser.add_argument('--version', action='version',
version='%(prog)s '+__version__)
return parser.parse_args()
if __name__ == "__main__":
options = get_options()
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
import os
import pandas as pd
import numpy as np
from Bio import Phylo
t = Phylo.read(options.tree, 'newick')
# Max distance to create better plots
mdist = max([t.distance(t.root, x) for x in t.get_terminals()])
# Load roary
roary = pd.read_csv(options.spreadsheet, low_memory=False)
# Set index (group name)
roary.set_index('Gene', inplace=True)
# Drop the other info columns
roary.drop(list(roary.columns[:options.skipped_columns-1]), axis=1, inplace=True)
# Transform it in a presence/absence matrix (1/0)
roary.replace('.{2,100}', 1, regex=True, inplace=True)
roary.replace(np.nan, 0, regex=True, inplace=True)
# Sort the matrix by the sum of strains presence
idx = roary.sum(axis=1).sort_values(ascending=False).index
roary_sorted = roary.loc[idx]
# Pangenome frequency plot
plt.figure(figsize=(7, 5))
plt.hist(roary.sum(axis=1), roary.shape[1],
histtype="stepfilled", alpha=.7)
plt.xlabel('No. of genomes')
plt.ylabel('No. of genes')
sns.despine(left=True,
bottom=True)
plt.savefig('pangenome_frequency.%s'%options.format, dpi=300)
plt.clf()
# Sort the matrix according to tip labels in the tree
roary_sorted = roary_sorted[[x.name for x in t.get_terminals()]]
# Plot presence/absence matrix against the tree
with sns.axes_style('whitegrid'):
fig = plt.figure(figsize=(17, 10))
ax1=plt.subplot2grid((1,40), (0, 10), colspan=30)
a=ax1.matshow(roary_sorted.T, cmap=plt.cm.Blues,
vmin=0, vmax=1,
aspect='auto',
interpolation='none',
)
ax1.set_yticks([])
ax1.set_xticks([])
ax1.axis('off')
ax = fig.add_subplot(1,2,1)
# matplotlib v1/2 workaround
try:
ax=plt.subplot2grid((1,40), (0, 0), colspan=10, facecolor='white')
except AttributeError:
ax=plt.subplot2grid((1,40), (0, 0), colspan=10, axisbg='white')
fig.subplots_adjust(wspace=0, hspace=0)
ax1.set_title('Roary matrix\n(%d gene clusters)'%roary.shape[0])
if options.labels:
fsize = 12 - 0.1*roary.shape[1]
if fsize < 7:
fsize = 7
with plt.rc_context({'font.size': fsize}):
Phylo.draw(t, axes=ax,
show_confidence=False,
label_func=lambda x: str(x)[:10],
xticks=([],), yticks=([],),
ylabel=('',), xlabel=('',),
xlim=(-mdist*0.1,mdist+mdist*0.45-mdist*roary.shape[1]*0.001),
axis=('off',),
title=('Tree\n(%d strains)'%roary.shape[1],),
do_show=False,
)
else:
Phylo.draw(t, axes=ax,
show_confidence=False,
label_func=lambda x: None,
xticks=([],), yticks=([],),
ylabel=('',), xlabel=('',),
xlim=(-mdist*0.1,mdist+mdist*0.1),
axis=('off',),
title=('Tree\n(%d strains)'%roary.shape[1],),
do_show=False,
)
plt.savefig('pangenome_matrix.%s'%options.format, dpi=300)
plt.clf()
# Plot the pangenome pie chart
plt.figure(figsize=(10, 10))
core = roary[(roary.sum(axis=1) >= roary.shape[1]*0.99) & (roary.sum(axis=1) <= roary.shape[1] )].shape[0]
softcore = roary[(roary.sum(axis=1) >= roary.shape[1]*0.95) & (roary.sum(axis=1) < roary.shape[1]*0.99)].shape[0]
shell = roary[(roary.sum(axis=1) >= roary.shape[1]*0.15) & (roary.sum(axis=1) < roary.shape[1]*0.95)].shape[0]
cloud = roary[roary.sum(axis=1) < roary.shape[1]*0.15].shape[0]
total = roary.shape[0]
def my_autopct(pct):
val=int(round(pct*total/100.0))
return '{v:d}'.format(v=val)
a=plt.pie([core, softcore, shell, cloud],
labels=['core\n(%d <= strains <= %d)'%(roary.shape[1]*.99,roary.shape[1]),
'soft-core\n(%d <= strains < %d)'%(roary.shape[1]*.95,roary.shape[1]*.99),
'shell\n(%d <= strains < %d)'%(roary.shape[1]*.15,roary.shape[1]*.95),
'cloud\n(strains < %d)'%(roary.shape[1]*.15)],
explode=[0.1, 0.05, 0.02, 0], radius=0.9,
colors=[(0, 0, 1, float(x)/total) for x in (core, softcore, shell, cloud)],
autopct=my_autopct)
plt.savefig('pangenome_pie.%s'%options.format, dpi=300)
plt.clf()
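The pie chart at the end of roary_plots.py classifies each gene cluster by the fraction of strains it appears in (core at 99% or more, soft-core at 95% or more, shell at 15% or more, cloud below that). A minimal sketch of that classification with made-up counts:

```python
def classify_cluster(n_present, n_strains):
    """Classify a gene cluster by the fraction of strains carrying it,
    using the same thresholds as the pie chart in roary_plots.py."""
    frac = n_present / n_strains
    if frac >= 0.99:
        return "core"
    if frac >= 0.95:
        return "soft-core"
    if frac >= 0.15:
        return "shell"
    return "cloud"

# Made-up example with 100 strains:
print(classify_cluster(100, 100))  # -> core
print(classify_cluster(97, 100))   # -> soft-core
print(classify_cluster(40, 100))   # -> shell
print(classify_cluster(3, 100))    # -> cloud
```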
#!/bin/bash
# To give your job a name, replace "MyJob" with an appropriate name
#SBATCH --job-name=Salmon-test.slurm
# Run this with single core
#SBATCH --ntasks=1
# set your minimum acceptable walltime=days-hours:minutes:seconds
#SBATCH -t 0:45:00
# Specify your email address to be notified of progress.
# SBATCH --mail-user=youreamiladdress@unimelb.edu
# SBATCH --mail-type=ALL
# Load the environment variables
module purge
module load foss/2019b
module load salmon/1.0.0
# Indexing transcriptome
tar xvf toy_rna.tar.gz
cd toy_rna
salmon index -t chr22_transcripts.fa -i chr22_index
# Quantify reads
for i in *_R1.fastq.gz
do
prefix=$(basename $i _R1.fastq.gz)
salmon quant -i chr22_index --libType A \
-1 ${prefix}_R1.fastq.gz -2 ${prefix}_R2.fastq.gz -o quant/${prefix};
done
Salmon is a fast program that produces highly accurate transcript-level quantification estimates from RNA-seq data.
https://www.hadriengourle.com/tutorials/
This example is derived from the Swedish University of Agricultural Sciences.
For this tutorial we will use the test data from this paper:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393
The test data consists of two commercially available RNA samples: Universal Human Reference (UHR) and Human Brain Reference (HBR).
The UHR is total RNA isolated from a diverse set of 10 cancer cell lines.
The script loops through each sample and invokes salmon with fairly basic options:
The -i argument tells salmon where to find the index.
--libType A tells salmon to automatically determine the library type of the sequencing reads (e.g. stranded vs. unstranded).
The -1 and -2 arguments tell salmon where to find the left and right reads for this sample (note that salmon accepts gzipped FASTQ files directly).
The -o argument specifies the directory where salmon's quantification results should be written.
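Each per-sample output directory contains a tab-separated quant.sf file whose standard columns are Name, Length, EffectiveLength, TPM and NumReads. A minimal sketch of pulling out the most abundant transcripts; the rows and transcript names here are made up for illustration:

```python
import csv
import io

# Made-up quant.sf content for illustration; real files come from salmon quant.
quant_sf = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "ENST0001\t1500\t1300.0\t250.5\t1200\n"
    "ENST0002\t900\t700.0\t10.2\t40\n"
    "ENST0003\t2100\t1900.0\t980.0\t5100\n"
)

# Parse the table and sort transcripts by TPM, highest first.
reader = csv.DictReader(io.StringIO(quant_sf), delimiter="\t")
rows = sorted(reader, key=lambda r: float(r["TPM"]), reverse=True)
for r in rows[:2]:
    print(r["Name"], r["TPM"])
```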
#!/bin/bash
# To give your job a name, replace "MyJob" with an appropriate name
#SBATCH --job-name=eazy-photoz-test.slurm
# Run on single CPU
#SBATCH --ntasks=1
# set your minimum acceptable walltime=days-hours:minutes:seconds
#SBATCH -t 0:15:00
# Specify your email address to be notified of progress.
# SBATCH --mail-user=youreamiladdress@unimelb.edu
# SBATCH --mail-type=ALL
# Load the environment variables
module purge
module load eazy-photoz/20201002
cd inputs
mkdir OUTPUT
eazy # generates param file
eazy -p zphot.param.default
The standard implementation of this software has hard-coded symlinks. These have been removed for use here.
Instead, the script and inputs here should be used with the example HDF-N catalog of Fernandez-Soto et al. (1999):
https://ui.adsabs.harvard.edu/abs/1999ApJ...513...34F/abstract
Copy this entire directory into your project (or home for testing), change into the directory and run the example script.