NGS: quality control

Adapter/barcode contamination

Each sequencing technology uses different experimental approaches in its library preparation and sequencing protocols. These protocols often involve attaching short DNA sequences to the genomic (or transcriptomic) fragments you want to sequence: adapters for affixing the DNA to a sequencing plate, unique barcode sequences enabling simultaneous sequencing of multiple samples, and so on. The sequencing machinery cannot distinguish between nucleotides from your sample and technically introduced nucleotides. Therefore, an important preliminary quality control step is to remove any such nucleotides prior to downstream analysis. Often this will be done by the sequencing center, but it is never a bad idea to check their work.
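As a rough illustration of adapter removal, the sketch below clips a 3' adapter from a read using exact string matching only. The function name trim_adapter, the min_overlap parameter, and the example adapter string are assumptions made for this illustration; real trimming tools also tolerate mismatches and indels, which this sketch does not.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Clip a 3' adapter from a read (exact matches only; illustrative sketch)."""
    # Case 1: the full adapter occurs somewhere in the read.
    # Everything from its first occurrence onward is treated as technical sequence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]

    # Case 2: only the beginning of the adapter was sequenced, so the
    # end of the read matches a prefix of the adapter. Try long overlaps first.
    min_overlap = max(min_overlap, 1)
    max_overlap = min(len(adapter) - 1, len(read))
    for overlap in range(max_overlap, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]

    # No adapter found: return the read unchanged.
    return read


# Hypothetical example values, chosen only for demonstration.
ADAPTER = "AGATCGGAAGAGC"
full = trim_adapter("ACGTACGTAGATCGGAAGAGCTT", ADAPTER)
partial = trim_adapter("ACGTACGTAGATC", ADAPTER)
clean = trim_adapter("ACGTACGT", ADAPTER)
```

Requiring a minimum overlap is a trade-off: a very short suffix match is likely to occur by chance, so a small min_overlap removes more genuine sample nucleotides.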

Removing potential sequencing errors

Occasionally NGS instruments make mistakes when reporting the nucleotide sequence for a particular DNA fragment. For the Illumina platform, the overall error rate is about 2%: that is, for every 100 nucleotides reported by the instrument, on average 2 will be incorrect. Each sequencing error makes an NGS analysis more difficult: for genome assembly, each error increases the size of the assembly graph and decreases the likelihood that the correct sequence will be assembled; for read mapping, each error decreases the likelihood that the read will map to the correct location in the genome. While a single isolated sequencing error is usually inconsequential, removing or correcting sequencing errors across an entire data set can improve the quality of results for downstream analysis.
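To make the error-rate arithmetic concrete, the following sketch computes the expected number of errors in a read and the chance that a read contains no errors at all. It assumes errors are independent and equally likely at every position, which is a simplification (real error rates vary along the read); the function names are made up for this example.

```python
def expected_errors(read_length, error_rate):
    """Expected number of erroneous bases in a read of the given length."""
    return read_length * error_rate


def prob_error_free(read_length, error_rate):
    """Probability that every base is correct, assuming independent errors."""
    return (1.0 - error_rate) ** read_length


# A 100-nucleotide read at the 2% error rate quoted above.
errors_per_read = expected_errors(100, 0.02)
error_free = prob_error_free(100, 0.02)
print(f"expected errors: {errors_per_read:.1f}")
print(f"probability of an error-free read: {error_free:.3f}")
```

Under these assumptions only about 13% of 100-nucleotide reads would be entirely error-free, which is why errors matter at the scale of a whole data set even though each one is individually minor.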

One reason NGS instruments report quality scores for each nucleotide is that these scores can often help identify sequencing errors. One common quality control step is quality trimming, which involves looking at each read and removing nucleotides from each end if their quality scores fall below some threshold. An alternative approach is to scan the entire read and discard it altogether if it contains too many low quality scores.
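Both strategies described above can be sketched in a few lines. This example assumes quality strings are encoded as Phred+33 ASCII characters (the common FASTQ convention; check your own data), and the threshold values and function names are illustrative choices rather than recommendations.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def trim_ends(sequence, quality, threshold=20):
    """Remove bases from both ends while their quality is below the threshold."""
    scores = phred_scores(quality)
    start, end = 0, len(scores)
    while start < end and scores[start] < threshold:
        start += 1
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return sequence[start:end], quality[start:end]


def passes_filter(quality, threshold=20, max_low_fraction=0.5):
    """Keep a read only if at most max_low_fraction of its bases are low quality."""
    scores = phred_scores(quality)
    if not scores:
        return False
    low = sum(score < threshold for score in scores)
    return low / len(scores) <= max_low_fraction


# '#' encodes Phred 2 and 'I' encodes Phred 40 under Phred+33.
trimmed_seq, trimmed_qual = trim_ends("AACGTACGTT", "##IIIIII##")
keep_good = passes_filter("##IIIIII##")
keep_bad = passes_filter("#####IIIII#")
```

Trimming keeps the usable middle of a read, while filtering discards the read entirely; which is preferable depends on how the downstream tool handles short reads.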

Correcting sequencing errors

A different approach for quality control involves decomposing the input data into k-mers (subsequences of length k) and analyzing the set of all k-mers present in the data (the k-mer spectrum). For example, if your input data is AACGTT and k is 3, then the k-mer spectrum is the following set of 3-mers: AAC, ACG, CGT, and GTT.
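The decomposition in the example above can be written as a one-line sliding window. The function name kmers is simply a label chosen for this sketch.

```python
def kmers(sequence, k):
    """Return every subsequence of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


spectrum = kmers("AACGTT", 3)
print(spectrum)
```

A sequence of length n yields n - k + 1 k-mers, so the six-nucleotide example produces four 3-mers.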

Given the amount of data generated by NGS platforms, the expectation is that each genomic (or transcriptomic) nucleotide will on average be represented multiple times in the short read data. Therefore, we expect each k-mer that genuinely occurs in the sample to be observed at least some minimum number of times in the input data. The k-mers that occur very infrequently are likely to include sequencing errors, and comparing them against similar, more abundant k-mers can help identify and correct the erroneous base.
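A minimal sketch of the first half of this idea, counting k-mers across reads and flagging the rare ones, is shown below. The min_count cutoff, the toy reads, and the function names are assumptions for illustration; the sketch ignores reverse complements and does not attempt the correction step itself.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmers(counts, min_count=2):
    """Return k-mers seen fewer than min_count times (candidate errors)."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy data: three identical reads plus one read with a G->C change at position 4.
reads = ["AACGTT", "AACGTT", "AACGTT", "AACCTT"]
counts = count_kmers(reads, 3)
suspects = rare_kmers(counts, min_count=2)
```

Every k-mer overlapping the changed base appears only once, while the k-mers from the agreeing reads appear three or more times, which is the signal an error corrector would act on.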

Alternative approaches for error correction use suffix trees (or suffix arrays) and multiple sequence alignments. See the review by Yang, Chockalingam, and Aluru (2012) for a thorough treatment of the subject.

Removal of duplicate reads

Duplicate reads may trip up assembly and other NGS analyses because they do not represent independent data points of the original sequencing experiment (i.e., reads randomly sampled from the genome) but rather technical amplification of such data points in the PCR step. It may therefore be necessary to remove duplicate reads as part of quality control, before subsequent analyses in the workflow. See FastUniq ..., PLOS One 2012, 7(12): e52249 for discussion and a removal tool.
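The simplest form of duplicate removal keeps the first copy of each exact sequence and drops later copies, as sketched below. This is an illustration only and is not claimed to be how FastUniq or any other tool works; the function name is hypothetical, and sequencing errors mean that true PCR duplicates are not always exact matches.

```python
def remove_duplicates(reads):
    """Keep the first occurrence of each read, preserving input order.

    Items may be sequence strings or, for paired-end data, (read1, read2) tuples.
    """
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


single = remove_duplicates(["ACGT", "ACGT", "TTGA", "ACGT"])
paired = remove_duplicates([("ACGT", "GGCA"), ("ACGT", "TTTT"), ("ACGT", "GGCA")])
```

Treating a read pair as a single tuple means a pair is only dropped when both mates are identical to a pair already seen.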

Digital normalization

When sequencing genomic DNA, we assume the reads provide roughly uniform average coverage of the chromosome sequences we are trying to reconstruct. However, when sequencing microbial communities or transcriptomes, some molecules are present in extremely high abundance, while others are relatively low in abundance. Often the sample must be sequenced at high depth in order to provide sufficient coverage of low-abundance molecules. This has two direct effects: first, high-abundance molecules will be inordinately overrepresented; and second, the size of the data set will of necessity be quite large. As the size of the data set increases, so do the time and computational power needed to assemble the data.

The idea behind digital normalization is to reduce the amount of data in a data set without reducing the amount of information in the data set. For example, imagine a data set that contains 115,000 reads sampled from a highly expressed transcript. It's likely that the majority of these reads contain redundant information. The idea of normalization is to select a small number of these reads that would enable us to reconstruct the complete transcript sequence while discarding the rest of the reads.
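One way to decide which reads are redundant, offered here as an assumption since the text above does not specify a method, is to estimate each read's coverage from the median count of its k-mers among the reads kept so far, and to discard the read once that estimate reaches a cutoff. The function name, parameters, and toy data are illustrative, and exact counting in memory as done here would not scale to real data sets.

```python
from collections import Counter
from statistics import median


def normalize(reads, k=20, cutoff=20):
    """Keep a read only while its estimated coverage is below the cutoff."""
    counts = Counter()  # k-mer counts over the reads kept so far
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read is shorter than k and cannot be assessed
        # Median k-mer count serves as a coverage estimate for this read.
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Toy data: 100 copies of an abundant read, then one low-abundance read.
reads = ["AACGTTGCA"] * 100 + ["TTGACCGTA"]
kept = normalize(reads, k=4, cutoff=5)
```

In this toy run the abundant read is kept only until its estimated coverage reaches the cutoff, while the single low-abundance read is retained, so the data shrink without losing either sequence.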

Quality control tools

Quality analysis

Trimming, clipping, and filtering

Error correction

Digital normalization

cgss15/ngs/qc.txt · Last modified: 2015/03/02 10:20 by vbrendel
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International