Exercise: mapping basics

0. Setup

Let's go ahead and create a working environment like we did for the first exercise.

mkdir mapping
cd mapping
touch 0README

The mapping programs we will be using for this exercise are already pre-installed on the CGS-IU-v1.1 Atmosphere image. However, the exercise will also rely on the samtools library, so go ahead and install that with the following command (it will prompt you for your password).

sudo apt-get install samtools

Also, if you have not added the following command to your .bashrc file, go ahead and execute it now.

source /usr/local/src/CGS-IU/0README

1. Data access

Download the following data set from the iPlant Data Store. This data is from a plant genome of particular interest to me (kudos to anyone who can figure out the species). The Fasta file is a randomly selected 150kb region of this plant genome, and the Fastq files are simulated Illumina reads.

Use the following commands to download the data.

# Download the genomic region
iget -V /iplant/home/standage/CGS/tcac-150kb.fa
# First set of simulated reads
iget -V /iplant/home/standage/CGS/tcac-sampleA-1.fq
iget -V /iplant/home/standage/CGS/tcac-sampleA-2.fq
# Another set of simulated reads
iget -V /iplant/home/standage/CGS/tcac-sampleB-1.fq
iget -V /iplant/home/standage/CGS/tcac-sampleB-2.fq
# A third set of simulated reads
iget -V /iplant/home/standage/CGS/tcac-sampleC-1.fq
iget -V /iplant/home/standage/CGS/tcac-sampleC-2.fq

3. Map reads with BWA


Before we can use BWA to align any reads to the genome, we must first index the genome. This creates a few extra index files that allow BWA to search the genome very quickly and efficiently. The command to index the genome is quite simple.

bwa index tcac-150kb.fa

This should only take a second or two for this small demo data set, although you should expect it to take a few minutes for a full-sized eukaryotic genome.


Once the genome is indexed, the command to align the reads is also pretty simple. BWA has several alignment modes, but we're going to use the maximal exact match (MEM) mode.

bwa mem tcac-150kb.fa tcac-sampleA-1.fq tcac-sampleA-2.fq > tcac-sampleA-unsorted.sam

This command stores the alignments to the file tcac-sampleA-unsorted.sam in SAM format.


What's in this SAM file? You can use the less program to browse this file or the head program to look at the first 10 lines. However, it's pretty clear that the SAM format was designed more for computers than for humans: it's a bit complicated to read. So how to we examine the file?

Mapping efficiency

There are a couple of easy sanity checks we can do to make sure things didn't go horribly wrong with the alignment. One is to look at the mapping efficiency, or the percentage of reads that mapped successfully to the genome. Some alignment programs will report this number, but BWA does not.

First, let's look at the number of reads in our data set.

wc -l tcac-sampleA-1.fq

This command counts the number of lines in the file: it should report 300000. Since each record takes up 4 lines in the Fastq file, that means we have 300000 / 4 = 75000 reads in this file (and 750000 read pairs in the data set).

Now we can use the samtools view command to count the number of reads that mapped to the genome.

# The '-c' flag indicates to count the alignments instead of printing them out.
# The '-F 4' setting tells samtools not to report unmapped reads.
samtools view -c -F 4 tcac-sampleA-unsorted.sam

This command should report 150000, which means that all 75000 read pairs aligned to the genome. Not surprising since this is simulated data with a low error rate, but it's good to know that our alignment program seems to be working correctly.

CIGAR strings

We can also look at the type of alignments in the file. The 6th column of a SAM file is a CIGAR string that provides a compact description of an alignment (see the SAM specification for more details). So we can use the command cut to extract the 6th column from each line in the file.

# There are many lines in the SAM file, so let's just look at the first 10
cut -f 6 tcac-sampleA-unsorted.sam | head -n 10

For this example, you should see 10 lines each containing 100M, signifying a perfect match (100M stands for 100 nucleotides match). So now let's look at the rest of the CIGAR values and use UNIX commands to sort and count them.

# Here's the pipeline.
#  - cut: extract the 6th column
#  - sort: the values
#  - uniq: report each value once and count how many times it occurred
#  - sort: sort the values again, this time using a reverse numerical sort
#  - head: print the first 10 lines
cut -f 6 tcac-sampleA-unsorted.sam | sort | uniq -c | sort -rn | head -n 10

The output of this command will show the 10 most common CIGAR strings, and the number of time each one occurred. The 100M CIGAR string is the biggest number by far, suggesting that the vast majority of our reads aligned perfectly to the genome! This gives us confidence that at least we didn't make any huge mistakes in the alignments.


The samtools package has a program called tview that allows use to browse and visually examine the alignments. This should be a bit easier to digest than the raw SAM output. However, to open the file in tview we must first convert it to BAM format, sort it, and then index it.

# SAM -> BAM conversion
samtools view -bS tcac-sampleA-unsorted.sam > tcac-sampleA-unsorted.bam
# Sort the alignments
samtools sort tcac-sampleA-unsorted.bam tcac-sampleA
# Index the alignments for quick lookup
samtools index tcac-sampleA.bam

Now we are ready to examine the alignments with tview.

samtools tview tcac-sampleA.bam tcac-150kb.fa

In tview you can scroll up, down, left, and right using the arrow keys, or you can type the ? character for a list of additional commands.

Put it all together

When using UNIX to process data, it's common to connect multiple commands together into a single pipeline rather than to run each command separately and store lots of intermediate files. BWA and samtools were written with the UNIX philosophy and can be run as part of a pipeline, so we could actually have combined some of those commands into a single pipeline like so.

bwa mem tcac-150kb.fa tcac-sampleA-1.fq tcac-sampleA-2.fq | samtools view -bS - | samtools sort - tcac-sampleA

In UNIX, the bar (|) character takes the output of the program on the left and sends it as input to the program on the right. Using the dash character as a filename is a common UNIX convention to indicate that the program should read input from the “standard input” (pipeline) rather than from a file on the hard disk.

The pipeline above sends the BWA output directly to the samtools SAM → BAM conversion, whose output in turn is sent to the samtools sort process. The benefit of doing it this way (rather than running the 3 commands separately as shown above) is that we didn't have to create intermediate files. For a complete data set, these temporary intermediate files can get very large, so unless you run into issues and need to troubleshoot your analysis workflow, it's resourceful to use UNIX pipelines.

Working with BAM

BAM is a binary format, which means it can't be processed by normal UNIX commands or viewed with a normal text viewer. If we want to (for example) look at the CIGAR strings in a BAM file, we would first have to convert it back to SAM format. Luckily we can easily integrate this into a pipeline too!

samtools view tcac-sampleA.bam | cut -f 6 | sort | uniq -c | sort -rn | head

4. Map reads with Bowtie2

Alignment with Bowtie2 is a very similar process to alignment with BWA. I will forgo the description and just provide the commands you need to do the alignments.

Note: in the terminal you can use the backslash (\) character to spread a single command over multiple lines. I will often do this to make my commands more readable.

# Index the genome for Bowtie2
bowtie2-build tcac-150kb.fa tcac-150kb
# The alignment pipeline
#  - alignment with bowtie2
#  - conversion to BAM with samtools view
#  - sorting BAM with samtools sort
bowtie2 -x tcac-150kb -1 tcac-sampleA-1.fq -2 tcac-sampleA-2.fq \
    | samtools view -bS - \
    | samtools sort - tcac-sampleA-bt2
# Look at most common CIGAR strings
samtools view tcac-sampleA-bt2.bam | cut -f 6 | sort | uniq -c | sort -rn | head
# Index and examine with tview
samtools index tcac-sampleA-bt2.bam
samtools tview tcac-sampleA-bt2.bam tcac-150kb.fa
cgss15/mapping/exercise-basics.txt · Last modified: 2015/01/28 12:14 by standage
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International
Recent changes RSS feed Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki