The steps for generating taxon-annotated ‘blob’ plots are as follows:
Full instructions for running Blobology and more advanced usage can be found at:
https://github.com/sujaikumar/assemblage
Also see the paper:
Sujai Kumar, Martin Jones, Georgios Koutsovoulos, Michael Clarke and Mark Blaxter Blobology: exploring raw genome data for contaminants, symbionts, and parasites using taxon-annotated GC-coverage plots doi: 10.3389/fgene.2013.00237 http://www.frontiersin.org/Journal/10.3389/fgene.2013.00237/abstract
Although this tutorial uses the assemblage github repository, the repository for this paper is at https://github.com/blaxterlab/blobology The scripts at blaxterlab/blobology can be easier to use if you already have bam or coverage files.
https://github.com/sujaikumar/assemblage also has other scripts/workflows relevant to genome assembly, annotation, finding conserved non-coding elements, etc.
Also please see Sujai Kumar’s recent presentation at Beatles and Bioinformatics:
http://www.youtube.com/watch?v=tSul_qDwvN4&t=3h10m50s
BLAST+ for taxonomic assignment
sudo apt-get install ncbi-blast+
Install R and required packages (requires R >=2.14).
wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-base-core-dbg_3.0.2-1_amd64.deb
wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-base-core_3.0.2-1_amd64.deb
wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-base-dev_3.0.2-1_all.deb
wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-base-html_3.0.2-1_all.deb
wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-base_3.0.2-1_all.deb
wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-doc-html_3.0.2-1_all.deb
wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-doc-info_3.0.2-1_all.deb
wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-doc-pdf_3.0.2-1_all.deb
wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-mathlib_3.0.2-1_amd64.deb
wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-recommended_3.0.2-1_all.deb
dpkg -i r-base* r-base-core_* r-recommended* r-mathlib*
R
install.packages(c("codetools", "MASS", "ggplot2"))
After ths point, the following dependencies have already been installed on AMI public snapshot (in us-east): snap-78cf1764 so if you mount this snapshot as a volume you can skip the following steps, but you do need to add the programs to your $PATH:
cd /mynewmount
source env.sh
We will use seqtk for subsampling contigs (can also be used for reads)
git clone https://github.com/lh3/seqtk.git
cd seqtk; make; cd ..
BWA for read mapping
sudo apt-get install bwa
Sujai Kumar’s Assemblage for the scripts we need:
git clone https://github.com/sujaikumar/assemblage
The NCBI nt files, plus taxonomy information
wget ftp://ftp.ncbi.nih.gov/blast/db/nt.??.tar.gz
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
Assume you have a contig file from your assembly called ‘contigs.fasta’
First, subsample 10,000 contigs from the file:
seqtk sample contigs.fasta 10000 > contigs_10000.fasta
BLAST the reads
blastn -task megablast -query contigs_10000.fasta -db blast/nt -max_target_seqs 1 -outfmt 6 > contigs_10000.megablast.nt
Produce a taxonomy file
blast_taxonomy_report.pl \
-b contigs_10000.megablast.nt \
-nodes tax/nodes.dmp \
-names tax/names.dmp \
-gi_taxid_file tax/gi_taxid_nucl.dmp.gz \
-t genus=1 -t order=1 -t family=1 -t superfamily=1 -t kingdom=1 \
>contigs_10000.megablast.nt.taxon
Create a database of your contigs for BWA
bwa index contigs.fasta
Align the reads you used to make the assembly to the contigs database (for paired-end reads)
bwa mem contigs.fasta pair1.fasta.gz pair2.fasta.gz > samfile
If you have interleaved reads, or single-end, just omit the second FASTQ file
bwa mem contigs.fasta reads.fasta.gz > samfile
Create a file with coverage and GC information, this will be named according to your samfile name with .lencovgc.txt as the suffix.
Ensure you change the name of the files below before running the command.
sam_len_cov_gc_insert.pl -s samfile -f contigs.fasta
cat contigs_10000.megablast.nt.taxon samfile.lencovgc.txt |
perl -anF"\t" -e '
chomp;
/^(\S+).*\torder\t([^\t]+)/ and $o{$1} = $2 and next;
if ($F[2] =~ /^\d+$/ ) { print "$_\t" . (exists $o{$F[1]} ? $o{$F[1]} : "Not annotated") . "\n" }
' >> lencovgc.taxon
If you want a different taxonomic level, e.g. species, download this script:
wget http://static.xbase.ac.uk/files/results/nick/make_blobology_file.py
python make_blobology_file.py contigs_10000.megablast.nt.taxon samfile.lencovgc.txt order > lencovgc.taxon
Just replace ‘order’ with another taxonomic level when you run make_blobology_file.py, e.g.: species, genus, family, superfamily, kingdom
Create the blobology plot
Rscript --vanilla ../bin/assemblage/blobology.R lencovgc.taxon 0.005 1 2
The final file is called lencovgc.taxon.png - you will need to download this to view it (e.g. with sftp, scp run on your local machine).
scp -i identity.pem root@server:/path/to/lencovgc.taxon.png .
Development and posting of this material, and the associated workshop, were supported by Grant Number R25HG006243 from the National Human Genome Research Institute and an NSF OCI supplement to NSF DBI-0939454.
This file can be edited directly through the Web. Anyone can update and fix errors in this document with few clicks -- no downloads needed.
For an introduction to the documentation format please see the reST primer.