This is a sitemap over all available pages ordered by namespaces.
For our intro to genome assembly, we're going to be using a public E. coli data set (substrain MG1655). There are two available data sets, one from a 200bp insert library and another from a 500bp insert library. Assembling these data sets with Velvet takes only a few minutes on an Atmosphere VM, although it requires more than the 4GB of memory available on tiny instances. Make sure to request an instance with at least 8GB of memory for this exercise.
iget
command). Note that these are interleaved data, so instead of having a *-1.fq
and *-2.fq
file for each data set, both reads are contained in a single file./iplant/home/standage/CGS/SRR001665-int.fq.gz
/iplant/home/standage/CGS/SRR001666-int.fq.gz
velveth
and velvetg
commands to assemble the genome. The Velvet assembler is pre-installed on your VMs and on Mason (module load velvet), and instructions for running Velvet are provided below. Please refer to the Velvet manual for a complete listing of settings and options. One in particular you may want to adjust is the k-mer length, which you can change using the hash_length
parameter, set to 31 in the example below.asmbleval.pl
script below to generate some summary statistics for the assembly. Take note of these statistics, and document if/how they change when you run velvet with different parameters.velveth Ecoli 31 -fastq -shortPaired SRR001665-int.fq.gz -shortPaired2 SRR001666-int.fq.gz velvetg Ecoli -unused_reads yes -ins_length 200 -ins_length2 500 -min_contig_lgth 100 -cov_cutoff auto
For our example, the downloaded data sets were meant as test sets to evaluate the quality of an assembly generated by NGS (Illumina) sequencing. For this purpose, we need to compare the assemblies you generated with the “reference genome”, available at NCBI (Escherichia coli str. K-12 substr. MG1655).
A suitable program for sequence-level comparisons is MUMmer. Please download and install the program, then map the scaffolds from your assembly onto the reference genome. Discuss how you can use such alignment results to assess the quality (coverage and error rate) of a particular assembly, relative to the reference genome.
#!/usr/bin/env perl use strict; $/ = ">"; <STDIN>; # Discard "junk", if any, at beginning of the file. my @sequencelengths; my $gccontent = 0; my $atcontent = 0; my $combinedlength = 0; while(my $record = <STDIN>) { chomp($record); my ($defline, @seqlines) = split(/\n/, $record); my $sequence = join("", @seqlines); my $gc = $sequence =~ tr/GCSgcs//; my $at = $sequence =~ tr/ATWatw//; $gccontent += $gc; $atcontent += $at; $combinedlength += length($sequence); push(@sequencelengths, length($sequence)); } my $n50 = 0; my $n90 = 0; my $templength = $combinedlength; my $combinedlength50perc = $combinedlength * 0.5; my $combinedlength10perc = $combinedlength * 0.1; @sequencelengths = sort {$b<=>$a} @sequencelengths; foreach my $len(@sequencelengths) { if($n50 == 0 and $templength - $len < $combinedlength50perc) { $n50 = $len; } if($templength - $len < $combinedlength10perc) { $n90 = $len; last; } $templength -= $len; } report("%-25s %d bp\n", "Combined length:", $combinedlength); report("%-25s %.2lf%%\n", "Percent GC content:", $gccontent / ($gccontent+$atcontent) * 100); report("%-25s %d\n", "Number of sequences:", scalar(@sequencelengths)); report("%-25s %.2lf bp\n", "Mean sequence length:", $combinedlength / scalar(@sequencelengths)); report("%-25s %d bp\n", "Sequence n50:", $n50); report("%-25s %d bp\n", "Sequence n90:", $n90); report("%-25s %d bp\n", "Longest sequence:", $sequencelengths[0]); sub report { my($format, @args) = @_; my $string = sprintf($format, @args); $string =~ s{(\: +)}{ my $per = "." x length($1); qq{ $per }}e; print($string); }
You can run the script on the command line like this. Keep in mind your numbers will look very different.
$ perl asmbleval.pl < genome-sequence.fa Combined length ........... 208026220 bp Percent GC content ........ 29.66% Number of sequences ....... 1483 Mean sequence length ...... 140273.92 bp Sequence n50 .............. 1625592 bp Sequence n90 .............. 274443 bp Longest sequence .......... 7126315 bp $
The PDF manual contained in the CGAL code distribution describes the following workflow for computing an assembly likelihood.
bowtie2convert algns.sam X align contigs.fa Y Z cgal contigs.fa
algns.sam
in this example) and extracts the unmapped reads. Replace X
with the maximum read length. -a
and –no-mixed
flags.Y
with the number of reads to align and Z
with the number of available processors to speed up alignments.I have installed CGAL on a new VM image called CGS-IU-v1.2. However, Atmosphere seems to be having issues with this image, so if you want to use CGAL in the mean time you can download and install it yourself from the CGAL website.
MUMmer is also on the new CGS-IU-v1.2 image that should hopefully be available soon. Until then, it's not too difficult to download and install.
If you have two assemblies that you want to align and plot against each other, try the following commands.
# Align contigs/scaffolds from your two assemblies nucmer --mum asmbl-1.fa asmbl-2.fa # Nucmer creates a file called `out.delta`. Use mummerplot to visualize. # See what happens when you leave out the -f flag or the -l flag or both. mummerplot -f -l -t png out.delta # This creates `out.png`, which you can view with any image viewer (if you're connected on VNC). sudo apt-get install eog eog out.png