Assignment: SNP calling

The first exercise for this unit introduced you to some common tools and basic concepts for read mapping. This assignment will look at one of the most common applications of NGS read mapping: detecting genomic variants. The assignment description is provided below.

  • Select a data set that you would like to study. Links to an example data set are provided below, but if your lab has done some sequencing (and if a reference genome is available) then we would encourage you to use those data.
  • Align the reads to the genome. The exercise introduced you to BWA and Bowtie2, but there are plenty of others. We want you to branch out with these assignments: you're welcome to use BWA or Bowtie2 for this step, but if you do, make sure to try a new aligner as well.
    • Novoalign (which is installed on the VMs) is a bit slower than BWA and Bowtie2, but is generally accepted by the community as the most accurate aligner.
    • For dozens of additional suggestions, try this page (from the reading assignment).
  • Use samtools mpileup to analyze the alignments and identify genome variants. Hints are provided below.
  • Examine your VCF (variant call format) file and choose several variants to evaluate manually. Note their position in the genome, and then view the alignments in samtools tview and visit that position (in tview, to go to position 15000 of sequence chr1 you type 'g' and then 'chr1:15000'). Do the reads aligned at that position provide good support for the variant call?
  • Explore the issue in a bit more depth. Here are some suggestions.
    • We haven't discussed any of the parameter settings for these mapping programs, but each of them has a manual that is free and easy to find online. It's always a good idea to look at the settings to determine if the defaults are appropriate for your particular analysis.
    • If you only ran a single aligner, try doing error correction on the reads first, then re-run the alignment and variant calling procedures, and then compare results.
    • If you ran multiple aligners, explore how well the variant calls agree between the two sets of results.
    • Try a different variant calling pipeline such as GATK.

Be sure to take good notes in your wiki, and follow the guidelines discussed in the course orientation material.

Appendix: example data set

If you're looking for inspiration, consider the following data from Richard Lenski's long-term evolution experiments in E. coli.

# Illumina reads, single end data
# Genome sequence
gunzip REL606.fa.gz

Source: ANGUS 2014.

Appendix: variant calling with samtools

There are several popular software packages and workflows for detecting single nucleotide polymorphisms (SNPs) and other genomic variants. You're welcome to try any of these, but here we provide instructions for variant calling with samtools.

First, use the mpileup command to convert the data from BAM to BCF format and calculate the likelihood of the data.

samtools mpileup -uD -f genome.fa reads.sorted.bam > var-raw.bcf

Next, use the bcftools view command to assess the likelihoods and make the variant calls.

bcftools view -bvcg var-raw.bcf > var.bcf

Finally, convert the variant calls from the binary BCF format to the plain text VCF format.

bcftools view var.bcf > var.vcf

Now we can use tools like less, head, and cat to evaluate the variant calls. And of course, we can put this all together into single pipeline command to eliminate the intermediate data files.

samtools mpileup -uD -f genome.fa reads.sorted.bam \
    | bcftools view -bvcg - \
    | bcftools view - \
    > var.vcf
cgss15/mapping/assignment-snps.txt · Last modified: 2015/01/28 13:54 by standage
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International
Recent changes RSS feed Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki