Commit 2e6fbc31 authored by root's avatar root

Update for 6/1/2021

parent 3ce0c5c1
ABAQUS/Door.odb
AdvLinux/NAMD
Beast/Dengue4.env.trees
BLAST/dbs
BLAST/rat-ests
Cufflinks/sample.bam
Delft3D/*
!Delft3D/Delft3D-test.slurm
!Delft3D/run_all_examples.sh
FSL/intro
FSL/fmri
Gaussian/g16
Gaussian/tests
Genomics
HPCshells/NAMD
NAMD/apoa1
NAMD/NAMD_BENCHMARKS_SPARTAN
NAMD/stmv
Python/minitwitter.csv
Trimmomatic/.backup
*.fastq
*.fastq.gz
*.tar.gz
*.sam
*.sam.gz
*.bam
*.simg
*.gz
*.fa
*.JPG
*.PNG
*.JPEG
*.odb
*.csv
=====
About
=====
The content here makes use of the Data Carpentry datasets and courses for Genomics, modified for the Spartan HPC system, and includes the example files from the Intro directory.
The Intro directory files include sample scripts for basic job submission, the gattaca.txt file, and quota and job checking scripts.
The DC curriculum uses data from a long term evolution experiment published in 2016: Tempo and mode of genome evolution in a 50,000-generation experiment (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/) by Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, and Lenski RE. (doi: 10.1038/nature18959). All sequencing data sets are available in the NCBI BioProject database under accession number PRJNA294072 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA294072).
backup.tar.gz: contains original fastq files, reference genome, and subsampled fastq files.
shell_data.tar.gz: contains the files used as input to the Introduction to the Command Line for Genomics lesson.
sub.tar.gz: contains subsampled fastq files that are used as input to the Data Wrangling and Processing for Genomics lesson.
solutions: contains the output files of the Shell Genomics and Wrangling Genomics lessons, including fastqc output, sam, bam, bcf, and vcf files.
vcf_clean_script.R: converts vcf output in .solutions/wrangling_solutions/variant_calling_auto to a single tidy data frame.
===========
Sample Data
===========
We are going to use a long-term sequencing dataset from a population of Escherichia coli.
The data we are going to use is part of a long-term evolution experiment led by Richard Lenski.
The experiment was designed to assess adaptation in E. coli. A population was propagated for more than 40,000 generations in a glucose-limited minimal medium (in most conditions glucose is the best carbon source for E. coli, providing faster growth than other sugars).
Blount et al. 2008: Historical contingency and the evolution of a key innovation in an experimental population of Escherichia coli.
http://www.pnas.org/content/105/23/7899
We will be working with three sample events from the Ara-3 strain of this experiment, one from 5,000 generations, one from 15,000 generations, and one from 50,000 generations. The population changed substantially during the course of the experiment, and we will be exploring how (the evolution of a Cit+ mutant and hypermutability) with our variant calling workflow.
=======================
Bioinformatics Workflow
=======================
When working with high-throughput sequencing data, the raw reads you get off of the sequencer will need to pass through a number of different tools in order to generate your final desired output. The execution of this set of tools in a specified order is commonly referred to as a workflow or a pipeline.
The example workflow is the following:
1. Obtain sequence reads in FASTQ format
2. Quality control and trimming with FastQC and Trimmomatic
3. Alignment to a reference genome, producing SAM/BAM files
4. Alignment cleanup (conversion and sorting of BAM files)
5. Variant calling, producing VCF files
========
Tarballs
========
Whilst we could download the data from the European Nucleotide Archive, it has already been done; the data is in the file backup.tar.gz. Check the table of contents with `tar tf`.
[lev@spartan-login1 ~]$ cd Genomics/
[lev@spartan-login1 Genomics]$ tar tf backup.tar.gz
..
Extract the tarball, and copy the untrimmed files to the working directory.
The files are in gzip format; extract one for viewing.
[lev@spartan-login1 Genomics]$ tar xvf backup.tar.gz
..
[lev@spartan-login1 Genomics]$ cp .backup/untrimmed_fastq/*fastq.gz .
[lev@spartan-login1 Genomics]$ gunzip SRR2584863_1.fastq.gz
===============
Assess The Read
===============
The fastq format includes the following:
Line Description
1 Always begins with '@' and then information about the read
2 The actual DNA sequence
3 Always begins with a '+' and sometimes the same info in line 1
4 Has a string of characters which represent the quality scores; must have same number of characters as line 2
[lev@spartan-login1 Genomics]$ head SRR2584863_1.fastq
Line 4 shows the quality for each nucleotide in the read. Quality is interpreted as the probability of an incorrect base call (e.g. 1 in 10) or, equivalently, the base call accuracy (e.g. 90%). To make it possible to line up each individual nucleotide with its quality score, the numerical score is converted into a code where each individual character represents the numerical quality score for an individual nucleotide.
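To make the encoding concrete, the following sketch (not part of the lesson scripts) converts one quality character into its numeric score; it assumes the common Phred+33 (Sanger/Illumina 1.8+) offset:

```shell
# Decode a single Phred+33 quality character to its numeric score.
# 'I' is a typical high-quality character in Illumina 1.8+ data.
qchar="I"
ascii=$(printf "%d" "'$qchar")   # ASCII code of the character (73 for 'I')
phred=$((ascii - 33))            # subtract the Phred+33 offset
echo "${qchar} -> Q${phred}"
```

A score of Q40 corresponds to a 1-in-10,000 probability of an incorrect base call.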
In real life, one won't assess the quality of reads by visually inspecting FASTQ files. Rather, one uses a software program, such as fastqc, to assess read quality and filter out poor quality reads.
A sample script, fastqc.slurm, has been written to run this check. It *could* be done locally, but if one is running checks on hundreds of files, having a job submission script is much easier!
[lev@spartan-login1 Genomics]$ sbatch fastqc.slurm
For each input FASTQ file, FastQC has created a .zip file and a .html file.
The .html files should be copied to a local system for viewing. e.g.,
llafayette@unimelb.edu.au@9770l-133895-l:~/Desktop$ scp spartan:Genomics/*.html .
llafayette@unimelb.edu.au@9770l-133895-l:~/Desktop$ firefox *.html
A short for loop can be used to unzip the files, and the summary files can be concatenated and viewed.
[lev@spartan-login1 Genomics]$ for zipfile in ./*.zip; do unzip "$zipfile"; done
[lev@spartan-login1 Genomics]$ cat SRR*/summary.txt > fastqc_summaries.txt
[lev@spartan-login1 Genomics]$ less fastqc_summaries.txt
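Each line of the concatenated summary is a tab-separated status (PASS, WARN, or FAIL), module name, and file name, so a quick tally is possible with standard tools. A sketch (the `if` guard simply makes it a no-op where the summary file does not exist):

```shell
# Tally the FastQC module statuses (PASS/WARN/FAIL) across all samples.
# fastqc_summaries.txt is the concatenated summary created above.
if [ -f fastqc_summaries.txt ]; then
    cut -f1 fastqc_summaries.txt | sort | uniq -c
fi
```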
==============
Cleaning Reads
==============
A program called Trimmomatic can be used to filter poor quality reads and trim poor quality bases from samples. Trimmomatic has a variety of options to trim reads; one must specify whether paired end (PE) or single end (SE) reads are used.
Trimmomatic can be run on the paired-end samples. FastQC indicated that Nextera adapters were present in the samples, so the adapter sequences have to be copied into the current directory.
[lev@spartan-login1 Genomics]$ cp .backup/untrimmed_fastq/NexteraPE-PE.fa .
Trimmomatic also has multi-threaded options, which are used in the example job submission script. There is also a loop version.
Ensure you re-compress the file that was gunzipped for the FastQC step.
[lev@spartan-login1 Genomics]$ gzip SRR2584863_1.fastq
[lev@spartan-login1 Genomics]$ sbatch trim.slurm
Wait until this job has finished!
[lev@spartan-login1 Genomics]$ sbatch trimloop.slurm
Two challenges
a) Convert trimloop into an array
b) Launch trimloop.slurm as a dependency on trim.slurm
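One hypothetical starting point for challenge (a) is sketched below; the job name is made up, and the trimmomatic command itself would go where the echo is. Each array task selects one input file by its array index:

```shell
#!/bin/bash
# Hypothetical sketch of challenge (a): trimloop.slurm rewritten as a job array.
#SBATCH --job-name="trimm_array"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=0:15:00
#SBATCH --array=0-2
# Collect the input files, then pick the one matching this task's array index.
files=(./*_1.fastq.gz)
infile=${files[${SLURM_ARRAY_TASK_ID:-0}]}
base=$(basename "${infile}" _1.fastq.gz)
echo "Array task ${SLURM_ARRAY_TASK_ID:-0} would trim sample ${base}"
```

For challenge (b), the jobs could be chained with, e.g., `jobid=$(sbatch --parsable trim.slurm)` followed by `sbatch --dependency=afterok:${jobid} trimloop.slurm`; `afterok` runs the second job only if the first exits successfully.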
===============================
Alignment to a Reference Genome
===============================
The following steps are all included in the single script, align.slurm.
Now that the data is of high quality, one can perform variant calling to see how the population changed over time, through comparison to the original population, E. coli strain REL606. Therefore, one needs to align each of the samples to the E. coli REL606 reference genome, and see what differences exist in the reads versus the genome.
The steps involved are to download the reference genome for E. coli REL606, then use the Burrows-Wheeler Aligner (BWA) to map low-divergence sequences against a large reference genome.
Download a set of trimmed FASTQ files to work with. These are small subsets of the real trimmed data, which will enable one to run the variant calling workflow quite quickly.
Create directories for the results that will be generated as part of this workflow.
The first step is to index the reference genome for use by BWA. Indexing allows the aligner to quickly find potential alignment sites for query sequences in a genome, which saves time during alignment.
The alignment process consists of choosing an appropriate reference genome to map the reads against and then deciding on an aligner. Here, the BWA-MEM algorithm is used; it is the latest and is generally recommended for high-quality queries as it is faster and more accurate.
Review the following options, and iterate over the sample files.
http://bio-bwa.sourceforge.net/bwa.shtml
The SAM file is a tab-delimited text file that contains information for each individual read and its alignment to the genome. The compressed binary version of SAM is called a BAM file; it reduces size and allows for indexing, which enables efficient random access of the data contained in the file.
The file begins with a header, which is optional. The header is used to describe the source of the data, reference sequence, method of alignment, etc.; this will change depending on the aligner being used. Following the header is the alignment section; each line corresponds to alignment information for a single read. Each alignment line has 11 mandatory fields for essential mapping information and a variable number of other fields for aligner-specific information.
The SAM file can be converted to BAM format using the samtools program with the view command, telling the command that the input is in SAM format (-S) and to output BAM format (-b).
Next sort the BAM file using the sort command from samtools. -o tells the command where to write the output.
SAM/BAM files can be sorted in multiple ways, e.g. by location of alignment on the chromosome, by read name, etc. It is important to be aware that different alignment tools will output differently sorted SAM/BAM, and different downstream tools require differently sorted alignment files as input.
===============
Variant Calling
===============
A variant call is a conclusion that there is a nucleotide difference vs. some reference at a given position in an individual genome or transcriptome, often referred to as a Single Nucleotide Polymorphism (SNP). The call is usually accompanied by an estimate of variant frequency and some measure of confidence. Similar to other steps in this workflow, there are a number of tools available for variant calling.
The script, variant.slurm, covers the steps involved.
First, calculate the read coverage of positions in the genome.
Do the first pass on variant calling by counting read coverage with bcftools, using the command mpileup. The flag -O b tells bcftools to generate a BCF format output file, -o specifies where to write the output file, and -f gives the path to the reference genome. This will generate a file with coverage information for every base.
Next, detect the single nucleotide polymorphisms (SNPs) using bcftools `call`. In this case, specify ploidy with the flag --ploidy, which is one for the haploid E. coli. -m allows for multiallelic and rare-variant calling, -v tells the program to output variant sites only (not every site in the genome), and -o specifies where to write the output file.
After that, filter and report the SNP variants in variant calling format (VCF).
Following this, one can view the VCF format. The file includes a header, which describes the format, the time and date the file was created, the version of bcftools that was used, the command line parameters used, etc., followed by information on each of the variations observed.
less -S results/vcf/SRR2584866_final_variants.vcf
The first few columns represent the information we have about a predicted variation.
column info
CHROM contig location where the variation occurs
POS position within the contig where the variation occurs
ID a . until we add annotation information
REF reference genotype (forward strand)
ALT sample genotype (forward strand)
QUAL Phred-scaled probability that the observed variant exists at this site (higher is better)
FILTER a . if no quality filters have been applied, PASS if a filter is passed, or the name of the filters this variant failed
In an ideal world, the information in the QUAL column would be all we needed to filter out bad variant calls. In reality, one will need to filter on multiple other metrics.
The last two columns contain the genotypes and can be tricky to decode.
column info
FORMAT lists in order the metrics presented in the final column
results lists the values associated with those metrics in order
For the sample file, the metrics presented are GT:PL:GQ.
metric definition
GT the genotype of this sample. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, and so on; 0/0 means homozygous reference, 0/1 is heterozygous, and 1/1 is homozygous for the alternate allele.
PL the likelihoods of the given genotypes
GQ the Phred-scaled confidence for the genotype
AD, DP the depth per allele by sample and coverage
The Broad Institute provides an overview of the VCF format.
https://www.broadinstitute.org/gatk/guide/article?id=1268
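Because the data lines of a VCF are tab-separated, the columns described above can be pulled out with standard tools. A sketch using the lesson's output file (the `if` guard simply makes it a no-op where the file does not exist):

```shell
# List position, reference allele, alternate allele, and quality for each
# variant, skipping header lines (which begin with '#').
if [ -f results/vcf/SRR2584866_final_variants.vcf ]; then
    grep -v "^#" results/vcf/SRR2584866_final_variants.vcf | awk '{print $2, $4, $5, $6}'
fi
```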
#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --ntasks=1
mkdir -p data/ref_genome
curl -L -o data/ref_genome/ecoli_rel606.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz
sleep 30
gunzip data/ref_genome/ecoli_rel606.fasta.gz
curl -L -o sub.tar.gz https://downloader.figshare.com/files/14418248
sleep 60
tar xvf sub.tar.gz
mv sub/ data/trimmed_fastq_small
mkdir -p results/sam results/bam results/bcf results/vcf
module load BWA/0.7.17-intel-2017.u2
bwa index data/ref_genome/ecoli_rel606.fasta
bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq > results/sam/SRR2584866.aligned.sam
samtools view -S -b results/sam/SRR2584866.aligned.sam > results/bam/SRR2584866.aligned.bam
samtools sort -o results/bam/SRR2584866.aligned.sorted.bam results/bam/SRR2584866.aligned.bam
#!/bin/bash
who -u >> who.txt
#!/bin/bash
#SBATCH --job-name="fastqc_sample"
#SBATCH --ntasks=1
# Set minimum acceptable walltime=hours:minutes:seconds
#SBATCH --time=0:15:00
#SBATCH --output=fastqc-%j.out
#SBATCH --error=fastqc-%j.err
# Load the environment variables for R
module load fastqc/0.11.5
# The command to actually run the job
fastqc *.fastq*
# When creating filenames any character can be used except the `/`, because that is the directory delimiter.
# However, just because (almost) any character can be used doesn't mean any character should be used. Spaces in filenames, for example, are bad practice.
touch "This is a long file name"
for item in $(ls *); do echo ${item}; done # note how the filename with spaces is split into separate words
# Filenames with wildcards are not a particularly good idea either.
touch * # What are you thinking?!
rm * # Really?! You want to remove all files in your directory?
rm '*' # Safer, but shouldn't have been created in the first place.
# Best to keep to plain, old fashioned, alphanumerics. CamelCase is helpful.
DDWEIPDGQI TVGQRIGSGS FGTVYKGKWH GDVAVKMLNV TAPTPQQLQA
FKNEVGVLRK TRHVNILLFM GYSTKPQLAI VTQWCEGSSL YHHLHIIETK
FEMIKLIDIA RQTAQGMDYL HAKSIIHRDL KSNNIFLHED LTVKIGDFGL
ATVKSRWSGS HQFEQLSGSI LWMAPEVIRM QDKNPYSFQS DVYAFGIVLY
ELMTGQLPYS NINNRDQIIF MVGRGYLSPD LSKVRSNCPK AMKRLMAECL
KKKRDERPLF PQILASIELL ARSLPK
For those who are more familiar with PBS there is a handy script which can do a lot of the conversion from PBS to Slurm.
https://github.com/bjpop/pbs2slurm
The following is a summary of common PBS commands and their Slurm equivalents, in part taken from `https://genomedk.fogbugz.com/?W6`
User commands          PBS/Torque               Slurm
Job submission         qsub <job script>        sbatch <job script>
Job deletion           qdel <job_id>            scancel <job_id>
Job deletion (all)     qdel ALL                 scancel -u <user>
List jobs              qstat [-u user]          squeue [-u user] [-l for long format]
Job status             qstat -f <job_id>        jobinfo <job_id>
Job hold               qhold <job_id>           scontrol hold <job_id>
Job release            qrls <job_id>            scontrol release <job_id>
Environment            PBS/Torque               Slurm
Job ID                 $PBS_JOBID               $SLURM_JOBID
Node list              $PBS_NODEFILE            $SLURM_JOB_NODELIST (new format)
Submit dir             $PBS_O_WORKDIR           $SLURM_SUBMIT_DIR
Job specification      PBS/Torque               Slurm
Script directive       #PBS                     #SBATCH
Queue                  -q <queue>               -p <partition>
Node count             -l nodes=<number>        -N <min[-max]>
Cores (cpus) per node  -l ppn=<number>          -c <number>
Memory size            -l mem=16384             --mem=16g OR --mem-per-cpu=2g
Wall time              -l walltime=<hh:mm:ss>   -t <days-hh:mm:ss>
Standard output file   -o <file_name>           -o <file_name>
Standard error file    -e <file_name>           -e <file_name>
Output directory       -o <directory>           -o "directory/slurm-%j.out"
Event notification     -m abe                   --mail-type=[BEGIN, END, FAIL, REQUEUE, or ALL]
Email address          -M <address>             --mail-user=<address>
Job name               -N <name>                --job-name=<name>
Job dependency         -W depend=               --dependency=afterok:<jobid>
Account to charge      -W group_list=<account>  --account=<account>
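As a worked example of the table above, here is a minimal (hypothetical) PBS script header and its line-by-line Slurm translation; the job name, queue/partition name, and email address are placeholders:

```shell
# PBS/Torque original (shown as comments):
#   #PBS -N myjob
#   #PBS -q batch
#   #PBS -l nodes=1:ppn=4
#   #PBS -l walltime=01:00:00
#   #PBS -M user@example.edu
#
# Slurm translation of the same directives:
#SBATCH --job-name=myjob
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=0-01:00:00
#SBATCH --mail-user=user@example.edu
```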
BioSample_s InsertSize_l LibraryLayout_s Library_Name_s LoadDate_s MBases_l MBytes_l ReleaseDate_s Run_s SRA_Sample_s Sample_Name_s Assay_Type_s AssemblyName_s BioProject_s Center_Name_s Consent_s Organism_s Platform_s SRA_Study_s g1k_analysis_group_s g1k_pop_code_s source_s strain_s
SAMN00205533 0 SINGLE CZB152 29-May-14 149 100 25-Mar-11 SRR097977 SRS167141 CZB152 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205558 0 SINGLE CZB154 29-May-14 627 444 25-Mar-11 SRR098026 SRS167166 CZB154 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205559 0 SINGLE CZB199 29-May-14 157 118 25-Mar-11 SRR098027 SRS167167 CZB199 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205560 0 SINGLE REL1166A 29-May-14 627 440 25-Mar-11 SRR098028 SRS167168 REL1166A WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205561 0 SINGLE REL10979 29-May-14 140 94 25-Mar-11 SRR098029 SRS167169 REL10979 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205562 0 SINGLE REL10988 29-May-14 145 110 25-Mar-11 SRR098030 SRS167170 REL10988 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205563 0 SINGLE ZDB16 29-May-14 606 424 25-Mar-11 SRR098031 SRS167171 ZDB16 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205564 0 SINGLE <not provided> 29-May-14 311 257 25-Mar-11 SRR098032 SRS167172 ZDB30 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205564 2834 PAIRED ZDB30 29-May-14 1695 679 25-Mar-11 SRR098033 SRS167172 ZDB30 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205565 0 SINGLE ZDB83 29-May-14 260 162 25-Mar-11 SRR098034 SRS167173 ZDB83 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205566 0 SINGLE ZDB87 29-May-14 255 161 25-Mar-11 SRR098035 SRS167174 ZDB87 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205567 0 SINGLE ZDB96 29-May-14 126 90 25-Mar-11 SRR098036 SRS167175 ZDB96 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205568 0 SINGLE ZDB99 29-May-14 98 68 25-Mar-11 SRR098037 SRS167176 ZDB99 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205569 0 SINGLE ZDB107 29-May-14 241 142 25-Mar-11 SRR098038 SRS167177 ZDB107 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205570 0 SINGLE ZDB111 29-May-14 281 193 25-Mar-11 SRR098039 SRS167178 ZDB111 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205571 0 SINGLE ZDB143 29-May-14 653 466 25-Mar-11 SRR098040 SRS167179 ZDB143 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205572 0 SINGLE ZDB158 29-May-14 546 388 25-Mar-11 SRR098041 SRS167180 ZDB158 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205573 0 SINGLE ZDB172-SE 29-May-14 59 48 25-Mar-11 SRR098042 SRS167181 ZDB172 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205573 2729 PAIRED ZDB172-PE 29-May-14 1620 635 25-Mar-11 SRR098043 SRS167181 ZDB172 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205574 0 SINGLE ZDB199 29-May-14 646 454 25-Mar-11 SRR098044 SRS167182 ZDB199 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205586 0 SINGLE ZDB200 29-May-14 551 390 25-Mar-11 SRR098279 SRS167194 ZDB200 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205587 0 SINGLE ZDB357 29-May-14 571 407 25-Mar-11 SRR098280 SRS167195 ZDB357 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205588 0 SINGLE ZDB409 29-May-14 733 518 25-Mar-11 SRR098281 SRS167196 ZDB409 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205589 0 SINGLE ZDB429 29-May-14 443 309 25-Mar-11 SRR098282 SRS167197 ZDB429 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205590 0 SINGLE ZDB446 29-May-14 719 513 25-Mar-11 SRR098283 SRS167198 ZDB446 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205591 0 SINGLE ZDB458 29-May-14 633 447 25-Mar-11 SRR098284 SRS167199 ZDB458 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205592 0 SINGLE ZDB464 29-May-14 140 97 25-Mar-11 SRR098285 SRS167200 ZDB464 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205593 0 SINGLE ZDB467 29-May-14 714 322 25-Mar-11 SRR098286 SRS167201 ZDB467 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205594 0 SINGLE ZDB477 29-May-14 691 487 25-Mar-11 SRR098287 SRS167202 ZDB477 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205595 0 SINGLE ZDB483 29-May-14 829 593 25-Mar-11 SRR098288 SRS167203 ZDB483 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN00205596 0 SINGLE ZDB564 29-May-14 265 171 25-Mar-11 SRR098289 SRS167204 ZDB564 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN01095545 0 SINGLE ZDB285 25-Jul-12 150 106 26-Jul-12 SRR527252 SRS351858 ZDB285 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN01095546 0 SINGLE ZDB290 25-Jul-12 151 112 26-Jul-12 SRR527253 SRS351860 ZDB290 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN01095547 0 SINGLE ZDB165 25-Jul-12 155 106 26-Jul-12 SRR527254 SRS351861 ZDB165 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN01095548 0 SINGLE ZDB283 25-Jul-12 153 113 26-Jul-12 SRR527255 SRS351862 ZDB283 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN01095549 0 SINGLE ZDB294 25-Jul-12 158 112 26-Jul-12 SRR527256 SRS351863 ZDB294 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
SAMN01095550 0 SINGLE ZDB281 25-Jul-12 157 115 26-Jul-12 SRR527257 SRS351864 ZDB281 WGS <not provided> PRJNA188723 MSU public Escherichia coli B str. REL606 ILLUMINA SRP004752 <not provided> <not provided> <not provided> REL606
The following are some sinfo examples that you might find useful on Spartan.
`sinfo -s`
Provides summary information about the system's partitions: the partition name, whether the partition is available, walltime limits, node information (allocated, idle, out, total), and the nodelist.
`sinfo -p $partition`
Provides information about the particular partition specified, breaking sinfo up for that partition into node states (drain, drng, mix, alloc, idle) and the nodes in each state. `Drain` means that the node is marked for maintenance; whilst existing jobs will run, it will not accept new jobs.
`sinfo -a`
Similar to `sinfo -p` but for all partitions.
`sinfo -n $nodes -p $partition`
Print information only for specified nodes in specified partition; can use comma-separated values or range expression e.g., `sinfo -n spartan-bm[001-010]`.
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodelist=spartan-bm005
# Alternative to exclude specific nodes.
# SBATCH --exclude=spartan-bm005
echo $(hostname) $SLURM_JOB_NAME running $SLURM_JOBID >> hostname.txt
The Slurm command `squeue` provides information about jobs in the queue. The following are some basic and useful commands.
squeue -a
This displays information on all jobs and job steps in all partitions.
squeue -A $account
This displays information for jobs according to account (i.e., group). Accepts a comma separated list.
squeue -j $jobids
Displays information for jobs according to job id. Accepts a comma separated list.
squeue -p $partition
Displays information for jobs according to partition, e.g., squeue -p gpgpu.
squeue -u $users
Displays information for jobs according to users. Accepts a comma separated list.
#!/bin/bash
#SBATCH --job-name="trimm_sample"
# A multithreaded application
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=0:15:00
module load Trimmomatic/0.36-Java-1.8.0_152
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.jar PE -threads 4 SRR2589044_1.fastq.gz SRR2589044_2.fastq.gz \
SRR2589044_1.trim.fastq.gz SRR2589044_1un.trim.fastq.gz \
SRR2589044_2.trim.fastq.gz SRR2589044_2un.trim.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
#!/bin/bash
#SBATCH --job-name="trimmloop_sample"
# A multithreaded application
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=0:15:00
module load Trimmomatic/0.36-Java-1.8.0_152
for infile in ./*_1.fastq.gz
do
base=$(basename ${infile} _1.fastq.gz)
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.jar PE -threads 4 ${infile} ${base}_2.fastq.gz \
${base}_1.trim.fastq.gz ${base}_1un.trim.fastq.gz \
${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
done
#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --ntasks=1
module load BCFtools/1.6-intel-2017.u2
bcftools mpileup -O b -o results/bcf/SRR2584866_raw.bcf \
-f data/ref_genome/ecoli_rel606.fasta results/bam/SRR2584866.aligned.sorted.bam
bcftools call --ploidy 1 -m -v -o results/bcf/SRR2584866_variants.vcf results/bcf/SRR2584866_raw.bcf
module load VCFtools/0.1.15-intel-2017.u2-Perl-5.24.1
vcfutils.pl varFilter results/bcf/SRR2584866_variants.vcf > results/vcf/SRR2584866_final_variants.vcf
# Script to import, tidy, and combine multiple vcf files into a tidy data frame
# and save to csv
# Naupaka Zimmerman
# nzimmerman@usfca.edu
# February 27, 2019
# load required packages
library("vcfR")
library("plyr")
# set the path to vcf files (output from the previous pipeline)
path_to_vcf_dir <- "~/.solutions/wrangling-solutions/variant_calling_auto/results/vcf/"
# list all files in the vcf directory
vcf_file_names <- list.files(path_to_vcf_dir)
# extract sample IDs from VCF file names assuming names like
# 'SRR2584863_final_variants.vcf' where the characters before the
# first '_' are the sample ID
sample_ids <- gsub(pattern = "_.*",
replacement = "",
x = vcf_file_names)
# read in all vcf files in the directory
vcf_objects <- sapply(paste0(path_to_vcf_dir,
vcf_file_names),
read.vcfR)
# tidy all vcf files, combining data where possible
tidied_vcf_objects <- lapply(vcf_objects,
vcfR2tidy,
single_frame = TRUE,
info_types = TRUE,
format_types = TRUE)
# shorten names of list items to be just sample IDs instead of full paths
names(tidied_vcf_objects) <- sample_ids
# extract out only the first element of each list item, since the second
# element is metadata and can't easily be combined with the rest into a single
# data frame
just_vcf_data <- lapply(tidied_vcf_objects, `[[`, 1)
# combine the three list elements into a single data frame using plyr, and
# add an id column called 'sample_id', a factor vector that encodes the id of
# each sample, extracted from the shortened list item names, assigned above
combined_vcf_data <- ldply(just_vcf_data,
.id = "sample_id")
# write out the csv of this single combined data frame
write.csv(combined_vcf_data,
file = "~/r_data/combined_tidy_vcf.csv",
row.names = FALSE)