Exercise: prokaryotic genome assembly

For our intro to genome assembly, we're going to be using a public E. coli data set (substrain MG1655). There are two available data sets, one from a 200bp insert library and another from a 500bp insert library. Assembling these data sets with Velvet takes only a few minutes on an Atmosphere VM, although it requires more than the 4GB of memory available on tiny instances. Make sure to request an instance with at least 8GB of memory for this exercise.

  1. Log in to your virtual machine and download the E. coli data sets from the iPlant data store (using the iget command). Note that these are interleaved data, so instead of having a *-1.fq and *-2.fq file for each data set, both reads are contained in a single file.
    • /iplant/home/standage/CGS/SRR001665-int.fq.gz
    • /iplant/home/standage/CGS/SRR001666-int.fq.gz
  2. Assess the quality of the data and determine whether any quality control is needed.
  3. Use the velveth and velvetg commands to assemble the genome. The Velvet assembler is pre-installed on your VMs and on Mason (module load velvet), and instructions for running Velvet are provided below. Please refer to the Velvet manual for a complete listing of settings and options. One in particular you may want to adjust is the k-mer length, which you can change using the hash_length parameter, set to 31 in the example below.
  4. Use the asmbleval.pl script below to generate some summary statistics for the assembly. Take note of these statistics, and document if/how they change when you run velvet with different parameters.
  5. Try re-running the assembly again with a different k-mer length, and compare the results from this run with your previous runs. How did this affect the assembly?

Running Velvet

velveth Ecoli 31 -fastq -shortPaired SRR001665-int.fq.gz -shortPaired2 SRR001666-int.fq.gz
velvetg Ecoli -unused_reads yes -ins_length 200 -ins_length2 500 -min_contig_lgth 100 -cov_cutoff auto

Assessment

For our example, the downloaded data sets were meant as test sets to evaluate the quality of an assembly generated by NGS (Illumina) sequencing. For this purpose, we need to compare the assemblies you generated with the “reference genome”, available at NCBI (Escherichia coli str. K-12 substr. MG1655).

A suitable program for sequence-level comparisons is MUMmer. Please download and install the program, then map the scaffolds from your assembly onto the reference genome. Discuss how you can use such alignment results to assess the quality (coverage and error rate) of a particular assembly, relative to the reference genome.

Assembly evaluation script

asmbleval.pl
#!/usr/bin/env perl
use strict;
 
$/ = ">";
<STDIN>; # Discard "junk", if any, at beginning of the file.
 
my @sequencelengths;
my $gccontent = 0;
my $atcontent = 0;
my $combinedlength = 0;
while(my $record = <STDIN>)
{
  chomp($record);
  my ($defline, @seqlines) = split(/\n/, $record);
  my $sequence = join("", @seqlines);
 
  my $gc = $sequence =~ tr/GCSgcs//;
  my $at = $sequence =~ tr/ATWatw//;
  $gccontent += $gc;
  $atcontent += $at;
 
  $combinedlength += length($sequence);
  push(@sequencelengths, length($sequence));
}
 
my $n50 = 0;
my $n90 = 0;
my $templength = $combinedlength;
my $combinedlength50perc = $combinedlength * 0.5;
my $combinedlength10perc = $combinedlength * 0.1;
 
@sequencelengths = sort {$b<=>$a} @sequencelengths;
foreach my $len(@sequencelengths)
{
  if($n50 == 0 and $templength - $len < $combinedlength50perc)
  {
    $n50 = $len;
  }
  if($templength - $len < $combinedlength10perc)
  {
    $n90 = $len;
    last;
  }
  $templength -= $len;
}
 
report("%-25s %d bp\n", "Combined length:", $combinedlength);
report("%-25s %.2lf%%\n", "Percent GC content:", $gccontent / ($gccontent+$atcontent) * 100);
report("%-25s %d\n", "Number of sequences:", scalar(@sequencelengths));
report("%-25s %.2lf bp\n", "Mean sequence length:", $combinedlength / scalar(@sequencelengths));
report("%-25s %d bp\n", "Sequence n50:", $n50);
report("%-25s %d bp\n", "Sequence n90:", $n90);
report("%-25s %d bp\n", "Longest sequence:", $sequencelengths[0]);
 
sub report
{
  my($format, @args) = @_;
  my $string = sprintf($format, @args);
  $string =~ s{(\: +)}{ my $per = "." x length($1); qq{ $per }}e;
  print($string);
}

You can run the script on the command line like this. Keep in mind your numbers will look very different.

$ perl asmbleval.pl < genome-sequence.fa 
Combined length ........... 208026220 bp
Percent GC content ........ 29.66%
Number of sequences ....... 1483
Mean sequence length ...... 140273.92 bp
Sequence n50 .............. 1625592 bp
Sequence n90 .............. 274443 bp
Longest sequence .......... 7126315 bp
$

Computing assembly likelihood with CGAL

The PDF manual contained in the CGAL code distribution describes the following workflow for computing an assembly likelihood.

bowtie2convert algns.sam X
align contigs.fa Y Z
cgal contigs.fa
  • The first command takes a SAM file generated by bowtie2 (algns.sam in this example) and extracts the unmapped reads. Replace X with the maximum read length. :!: Note the following regarding bowtie2 alignments. :!:
    • Make sure to run bowtie2 with the -a and –no-mixed flags.
    • If you have multiple SAM/BAM files from multiple libraries, you need to use samtools to merge and sort the data into a single file.
  • The align program will use a modified Smith-Waterman approach to try to align reads that could not be mapped with bowtie2. This step is more time intensive, so only a subset of the reads are aligned. Replace Y with the number of reads to align and Z with the number of available processors to speed up alignments.
  • The cgal program does the actual likelihood calculation.

I have installed CGAL on a new VM image called CGS-IU-v1.2. However, Atmosphere seems to be having issues with this image, so if you want to use CGAL in the mean time you can download and install it yourself from the CGAL website.

Using MUMmer to align and plot 2 assemblies

MUMmer is also on the new CGS-IU-v1.2 image that should hopefully be available soon. Until then, it's not too difficult to download and install.

If you have two assemblies that you want to align and plot against each other, try the following commands.

# Align contigs/scaffolds from your two assemblies
nucmer --mum asmbl-1.fa asmbl-2.fa
 
# Nucmer creates a file called `out.delta`. Use mummerplot to visualize.
# See what happens when you leave out the -f flag or the -l flag or both.
mummerplot -f -l -t png out.delta
 
# This creates `out.png`, which you can view with any image viewer (if you're connected on VNC).
sudo apt-get install eog
eog out.png
cgss15/genome-assembly/exercise-prok.txt · Last modified: 2015/02/26 14:01 by standage
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International
Recent changes RSS feed Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki