NGS: data formats

The Fasta format

The Fasta format is easily the most widely used and recognizable format for encoding biological sequences. It's simple, easy for humans to read, and easy for computers to parse.

  • The first line of each sequence begins with a chevron (> symbol), followed by whatever textual description is desired.
  • All subsequent lines in the entry contain the DNA (or RNA or protein) sequence.
  • The next chevron indicates the beginning of a new entry.
>seq1
CTCTGCCTATACACACACATTGGACGAAACATCTACTATGGATCCTACCTATACTCAGAAACCTGAAACA
CAGGTATTATACTACTACTAATCACCATAGCCACCGCCTTCATAGGATATGTCCTTCCATGAGGACAAAT
ATCATTCTGAGGGGCAACCGTAATTACTAACCTCTTCTCAGCAATTCCCTACATCGGCACAAACCTAGTA
GAATGAATTTGAGGAGGCTTTTCGGTAGATAAAGCAACCTTAAACCGATTCTTCGCCTTCCATTTCATCC
TTCCATTTACTATAGTTGCACTAGCAGGAGTGCACCTAACCTTTCTTCACGAAACAGGCTCAAACAACCC
ACTAGGTCTCACTTCAGACTCAGATAAAATTCCCTTCCACCCGTACTATACTATCAAAGACTTCCTAGGA
CTACTTATCCTAATTTTACTCCTTCTACTCTTAGCCCTACTATCTCCA
>seq2 optional description here
TCTGAACTCTAACAATAGATTTACTAACACTTACATGAATTGGCAGTCAACCAGTAGAATACCCCTACAC
CATTATTGGCCAAATAGCCTCAATTCTATACTTCTCCATTATTCTAGCTTTCCTACCAATTGCAGGANTA
ATCGAAAACTACC
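
The three rules above are enough to write a small parser. The following is a minimal sketch in Python, not a replacement for an established library; the function name parse_fasta and the shortened example sequences are made up for illustration.

```python
def parse_fasta(lines):
    """Yield (description, sequence) tuples from an iterable of Fasta lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A chevron starts a new entry: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' line")
            # Sequence lines of one entry are concatenated.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Shortened, hypothetical example in the same layout as the entries above.
example = """>seq1
CTCTGC
CAGGTA
>seq2 optional description here
TCTGAA
"""
records = list(parse_fasta(example.splitlines()))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of example.splitlines().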

These days you don't typically see NGS data encoded in Fasta format. However, the Fasta format is still the most common format for storing genome assemblies, protein sequences, and gene sequences, all of which will come up in this class.

The Fastq format(s)

Pre-NGS (Sanger) sequencers typically produced paired .fasta and .qual files, one containing the sequence for each read and the other containing a sequence of quality scores. These quality scores encode how confident the instrument is in the accuracy of each nucleotide for each read.

With the advent of NGS, it has become much more common to encode sequence data and quality scores in a single file using the Fastq format. The Fastq format is similar to the Fasta format, with a few extensions: each entry corresponds to a single read and contains 4 lines.

  • The first line begins with an @ symbol and contains an identifier for the read. Sequencing machines typically use this identifier to store information about the instrument, flowcell, tile, etc., from which the read originated, but any string of (ASCII) characters is valid.
  • The second line contains the nucleotide sequence of the read.
  • The third line begins with the + character, optionally followed by the read identifier. Usually, this line either repeats the first line of the entry (with + in place of @) or contains only the + character.
  • The fourth line contains quality scores for each nucleotide in the read.
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
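
The 4-line layout described above can be read in groups of four lines. This is a minimal sketch in Python, assuming each entry occupies exactly 4 lines and the input has no trailing blank lines; the function name parse_fastq and the short example read are hypothetical.

```python
from itertools import islice


def parse_fastq(lines):
    """Yield (identifier, sequence, quality) tuples from an iterable of Fastq lines.

    Assumes strict 4-line entries with no blank lines between or after them.
    """
    it = iter(lines)
    while True:
        # Take the next four lines as one entry.
        entry = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not entry:
            return  # end of input
        if len(entry) != 4:
            raise ValueError("truncated Fastq entry")
        header, sequence, plus, quality = entry
        if not header.startswith("@"):
            raise ValueError("first line of an entry must begin with '@'")
        if not plus.startswith("+"):
            raise ValueError("third line of an entry must begin with '+'")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        yield header[1:], sequence, quality


# Hypothetical two-read example.
example = """@read1 some description
GATTACA
+
IIIIII9
@read2
ACGT
+read2
IG9I
"""
reads = list(parse_fastq(example.splitlines()))
```

The length check on the second and fourth lines is a cheap way to catch files that were truncated or corrupted in transfer.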

As sequencing platforms have changed over time, several incompatible variants of the Fastq format have emerged, each with a different way of interpreting the quality values in the fourth line of each entry (see Cock et al. (2010) for a thorough treatment of this issue). Some software packages can guess which encoding scheme is used for a particular Fastq file, but it's not always possible to guess correctly, so it's a good idea to get this information before analyzing the data. If you ordered the data from a sequencing center, make sure they provide this information to you. If you are downloading someone else's data from a public database, this information is usually provided in the metadata.
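
To make the encoding issue concrete, here is a minimal sketch in Python. It assumes the two ASCII offsets in common use, 33 (Sanger and recent Illumina) and 64 (older Illumina and Solexa), and the standard Phred relation P = 10^(-Q/10). The function names are hypothetical, and the guessing rule is only a heuristic: characters below ASCII 59 cannot occur in an offset-64 file, but their absence proves nothing.

```python
def decode_quality(quality, offset=33):
    """Convert a quality string to a list of integer scores for a given ASCII offset."""
    return [ord(ch) - offset for ch in quality]


def error_probability(score):
    """Phred relation: the probability that a base call with this score is wrong."""
    return 10 ** (-score / 10)


def guess_offset(quality_strings):
    """Return 33 if the data can only be offset-33 encoded, else None (ambiguous)."""
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    # Offset-64 variants never use characters below ';' (ASCII 59).
    if lowest < 59:
        return 33
    return None


# The last eight quality characters of the example read above.
scores = decode_quality("II9IG9IC", offset=33)
# scores is [40, 40, 24, 40, 38, 24, 40, 34]
```

Decoding the same string with offset=64 would give negative scores for the 9 characters, which is how a wrong assumption about the encoding typically shows up.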

See this page about converting between different Fastq variants.

Paired-end vs single-end data

Some NGS protocols sequence from one end of each DNA fragment: this is called single-end sequencing. Other protocols sequence both ends of each fragment, producing two paired reads: this is called paired-end sequencing. For some types of analysis, paired-end data provide valuable information that can improve the result. When you are working with NGS data, you need to check whether your reads are single-end or paired-end. Some software supports only one data type or the other, and software that supports both requires you to specify the data type.

Interleaved vs paired files

To add to the confusion, there are two different conventions for storing paired reads. In an interleaved Fastq file, the two reads of each pair are adjacent entries in the same file: that is, the first and second entries correspond to the same DNA fragment, the third and fourth entries correspond to another DNA fragment, and so on. In paired Fastq files, the two reads of each pair are stored in separate files, in the same order: that is, the first entry in file A is paired with the first entry in file B. Always check whether a particular software package expects paired-end data in interleaved Fastq files or paired Fastq files.
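
The relationship between the two conventions can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the records can be any objects, for example (identifier, sequence, quality) tuples produced by a Fastq parser.

```python
from itertools import zip_longest


def interleave(reads_a, reads_b):
    """Merge two paired iterables into one interleaved stream of records."""
    missing = object()  # marker for a file that ran out early
    for read_a, read_b in zip_longest(reads_a, reads_b, fillvalue=missing):
        if read_a is missing or read_b is missing:
            raise ValueError("paired inputs contain different numbers of reads")
        yield read_a
        yield read_b


def deinterleave(records):
    """Split an interleaved stream into (first read, second read) pairs."""
    it = iter(records)
    for first in it:
        try:
            second = next(it)
        except StopIteration:
            raise ValueError("interleaved input has an odd number of reads") from None
        yield first, second


# Hypothetical read identifiers standing in for full Fastq entries.
file_a = ["frag1/1", "frag2/1"]
file_b = ["frag1/2", "frag2/2"]
merged = list(interleave(file_a, file_b))
pairs = list(deinterleave(merged))
```

Both functions fail loudly when the inputs do not pair up, since silently dropping an unpaired read would shift every later pair out of register.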

See this page about converting between interleaved and paired Fastq files.

cgss15/ngs/formats.txt · Last modified: 2015/01/12 14:06 by standage
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International