Exercise: NGS data and quality control

0. Setup

Before beginning this exercise, please make sure you are comfortable with the following.

  • Launching a virtual machine (VM) with iPlant Atmosphere (documentation)
  • Connecting to an Atmosphere VM with VNC Viewer (documentation)
  • Setting up iPlant Data Store access on your machine with the iinit command (documentation)
  • Entering basic UNIX commands on the command line

For this exercise, let's create a directory that will serve as your working environment. It is a good habit to do this for all exercises and assignments to keep your file system organized.

cd
mkdir qc
cd qc
touch 0README

Here are a few hints in case this is your first exposure to UNIX.

  • The command cd takes you to your home directory, typically something like /home/username or /Users/username. You may already be in your home directory, but this command just makes sure.
  • The command mkdir qc makes a directory in your home directory named qc. We will place all of the data files, notes, and so on for the exercise in this directory.
  • The command cd qc takes you to the new directory you just created.
  • The command touch 0README creates a new empty file named 0README. It is common practice to keep notes in some kind of 0README file, for the same reason we keep a lab notebook. You may want to keep raw notes in your 0README files, and then go back later and clean up the notes on the wiki for sharing with the class. There's no one “right” way to do this, so come up with a system that works for you.
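If you like the idea of raw notes but tend to forget to write them, one option is a small helper that appends a timestamped line to 0README. This is only a sketch: the function name note is made up for this example and is not part of any course software.

```shell
# Hypothetical helper: append a timestamped note to 0README
# in the current directory (the file is created if it is missing).
note() {
    printf '%s  %s\n' "$(date '+%Y-%m-%d %H:%M')" "$*" >> 0README
}
```

After defining it in your terminal session, something like note "downloaded both read files with iget" adds one line to the file, which you can review later with less 0README.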

Running on lab iMacs

It looks like the iPlant Atmosphere maintenance is taking longer than originally expected. The exercise should be simple enough to do on the iMacs in the lab, but we'll have to install a couple of programs.

1. Data access

Log in to your virtual machine and download the following data set from the iPlant Data Store (using the iget command). This data is bacterial genomic DNA and was obtained from NCBI's Sequence Read Archive (SRA). In the interest of Getting Things Done In A Reasonable Timeframe, a small subset of the entire data was selected for this exercise. However, every aspect of this exercise should be realistic, except that when analyzing a complete data set you should expect things to take a bit more time.

The data files are available at the following paths.

  • /iplant/home/standage/CGS/smel-subset-1.fq.gz
  • /iplant/home/standage/CGS/smel-subset-2.fq.gz

Go ahead and download the files by entering the following commands in your terminal.

iget -V /iplant/home/standage/CGS/smel-subset-1.fq.gz
iget -V /iplant/home/standage/CGS/smel-subset-2.fq.gz
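Before moving on, it can be worth confirming that the downloads are intact. The sketch below uses gzip -t, which tests a compressed file without unpacking it. The function name check_gz is made up for this example.

```shell
# Hypothetical helper: test each gzip file given as an argument.
# Prints OK or BROKEN per file and returns non-zero if any file fails.
check_gz() {
    status=0
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "BROKEN  $f"
            status=1
        fi
    done
    return "$status"
}
```

For this exercise you would run check_gz smel-subset-1.fq.gz smel-subset-2.fq.gz. If a file is reported as BROKEN, downloading it again is the simplest fix.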

2. Examine the data with FastQC

What is in these mysterious data files we just downloaded? One way to check is to actually look at the files (with a program like less), and in fact that's not a bad idea. However, since most .fastq files contain millions of sequence reads, there are too many data points to look at manually. Fortunately we have programs that will analyze the data for us!
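If you do want a quick look by hand, the following sketch may help. It assumes the usual FASTQ layout of four lines per read (header, sequence, a + line, and quality string). The function names first_record and count_reads are made up for this example.

```shell
# Hypothetical helper: print the first read (4 lines) of a gzipped FASTQ file.
first_record() {
    gzip -dc "$1" | head -n 4
}

# Hypothetical helper: count the reads in a gzipped FASTQ file,
# assuming every read occupies exactly 4 lines.
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $((lines / 4))
}
```

For example, first_record smel-subset-1.fq.gz shows what a single read looks like, and count_reads smel-subset-1.fq.gz reports how many reads the file holds. Since the two files contain the two halves of each read pair, both should report the same number.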

To examine the data, launch the FastQC program from your terminal. On most installations the launcher is named fastqc; run with no arguments, it starts the graphical interface.

This will bring up a new window, which is the interface of the FastQC program. To open the data files, locate the menu at the top of the window and click File and then Open. This will bring up a file system dialog, which you can use to locate and select the data files you just downloaded. After locating and selecting the files, click Ok. FastQC will only take a few seconds to analyze this data set.

3. Interpret reports

FastQC generates about a dozen reports for each data set, looking at a variety of sequence characteristics. A green icon indicates that a particular analysis module passed without any concerns. A yellow icon indicates that the analysis module has issued a warning, which means there may be minor issues with the data set. A red icon indicates that the analysis module has issued an error, typically indicative of serious issues with the data set.

Click on the icon for each analysis module and look at the report. A more detailed explanation for each module can be found by clicking Help, and then Contents, and then Analysis Modules.
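The same pass, warning, and error verdicts are also available as text. When FastQC is run non-interactively (with file names on the command line), the report it writes includes a tab-separated summary.txt with one line per module: the verdict (PASS, WARN, or FAIL), the module name, and the file name. The sketch below lists only the modules that did not pass. The function name flagged_modules is made up for this example, and the path to summary.txt depends on where you unpacked the report.

```shell
# Hypothetical helper: print every module in a FastQC summary.txt
# whose verdict is anything other than PASS.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}
```

Running flagged_modules on the summary.txt for each data file gives a short list of the reports that deserve a closer look.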

4. Groom the sequence data

The FastQC reports usually give us a pretty good idea what the potential problems with a data set might be. Using this information, we can come up with a grooming procedure to clean up the data.

For this particular data set, there seems to be a large number of reads with low quality scores, as well as some over-represented k-mers. We'll use the Trimmomatic program to trim (or in some cases completely discard) low-quality sequences and remove adapter sequences, which may in part be responsible for the suspicious k-mer profiles.
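To put "low quality" in numbers: a Phred quality score Q corresponds to an estimated probability of 10^(-Q/10) that the base call is wrong. The sketch below does that conversion with awk. The function name phred_to_prob is made up for this example.

```shell
# Hypothetical helper: convert a Phred quality score Q into the
# estimated probability that the base call is wrong, 10^(-Q/10).
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.6f\n", 10 ^ (-q / 10) }'
}
```

A score of 20 corresponds to a 1 in 100 chance of error and a score of 30 to 1 in 1000. The threshold of 28 used in the trimming command in this section works out to an error probability of about 0.0016.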

First, create a file with all of the Illumina paired-end adapter sequences.

cat /usr/local/src/NGS-DIR/TRIMMOMATIC/Trimmomatic-0.32/adapters/*-PE.fa > all-PE.fa

Then run the Trimmomatic procedure.

java -jar /usr/local/src/NGS-DIR/TRIMMOMATIC/Trimmomatic-0.32/trimmomatic-0.32.jar PE \
     -threads 1 -phred33 \
     smel-subset-1.fq.gz smel-subset-2.fq.gz \
     smel-subset-1-trim.fq smel-subset-1-trim-unpaired.fq \
     smel-subset-2-trim.fq smel-subset-2-trim-unpaired.fq \
     ILLUMINACLIP:all-PE.fa:2:40:15 \
     LEADING:28 \
     TRAILING:28

If successful, this command will create four files. The files smel-subset-1-trim.fq and smel-subset-2-trim.fq contain read pairs that have been cleaned up. The files smel-subset-1-trim-unpaired.fq and smel-subset-2-trim-unpaired.fq contain reads whose mate was discarded (most of the time we can ignore these files).
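Because the two paired output files hold the two halves of each surviving read pair, they should contain the same number of reads. A quick sanity check along those lines might look like the sketch below, which assumes uncompressed FASTQ with four lines per read. The function names fastq_reads and pairs_match are made up for this example.

```shell
# Hypothetical helper: count reads in an uncompressed FASTQ file,
# assuming every read occupies exactly 4 lines.
fastq_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Hypothetical helper: succeed only if two FASTQ files
# contain the same number of reads.
pairs_match() {
    [ "$(fastq_reads "$1")" -eq "$(fastq_reads "$2")" ]
}
```

For example, pairs_match smel-subset-1-trim.fq smel-subset-2-trim.fq && echo "pairs in sync" prints the message only when the counts agree. Comparing fastq_reads on the trimmed files against the read count of the original data also shows how much was discarded.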

5. Examine the cleaned-up data with FastQC

If our grooming procedure was successful, the issues reported by the initial FastQC analysis should now be resolved. We should verify this, though, by opening the cleaned-up data in FastQC and checking the analysis module reports. If the issues have been resolved, we are ready to analyze the data. If FastQC still reports issues, we may need to refine our grooming procedure and run it again.

cgss15/ngs/exercise.txt · Last modified: 2015/01/21 10:19 by standage
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International