Assignment: transcriptome assembly

Raw data files for all 12 Polistes dominula libraries are available in the following directory in the iPlant data store.

ils /iplant/home/standage/Polistes_dominula/sequence/transcriptome

The goal for this assignment is to download the data, check data for quality issues, perform proper quality control, assemble transcript sequences, and assess the assembly. Assessment of transcriptome assemblies will be discussed further in class, but much of what we learned by assessing genome assemblies will apply here.
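
Before running any dedicated quality-control tool, it is worth confirming that the two files of a pair contain the same number of reads. The following is a minimal sketch of that check, assuming uncompressed FASTQ files with exactly four lines per record. The toy file names and their contents are made up for illustration.

```shell
# Create two toy FASTQ files (hypothetical data, 2 reads each).
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' > toy-check-1.fq
printf '@r1/2\nTTAA\n+\nIIII\n@r2/2\nCCGG\n+\nIIII\n' > toy-check-2.fq

# Count reads in a FASTQ file: number of lines divided by 4.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

n1=$(count_reads toy-check-1.fq)
n2=$(count_reads toy-check-2.fq)

# Paired files should always hold the same number of reads.
if [ "$n1" -eq "$n2" ]; then
    echo "OK: $n1 read pairs"
else
    echo "Mismatch: $n1 vs $n2 reads" >&2
fi
```

To use this on real data, replace the toy file names with the two files of one library.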

The transcriptome assembly produced by our research group for this data is accessible at /iplant/home/standage/CGS/pdom-tsa-r1.2.fa.gz for comparison.
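
As a starting point for comparing your assembly with ours, a few summary statistics can be computed with standard shell tools. The sketch below reports the number of sequences, the total length, and the N50 for a toy FASTA file. The file names and sequences are made up for illustration, and N50 is only a coarse summary borrowed from genome assembly assessment.

```shell
# Toy assembly (hypothetical sequences) with lengths 10, 6 and 4.
printf '>t1\nACGTACGTAC\n>t2\nACGTAC\n>t3\nACGT\n' > toy-assembly.fa

# One length per sequence, longest first; handles sequences wrapped over several lines.
awk '/^>/ { if (n) print n; n = 0; next } { n += length($0) } END { if (n) print n }' \
    toy-assembly.fa | sort -rn > toy-lengths.txt

# Number of sequences, total length, and N50 (the length of the sequence at which
# the running total, longest first, reaches half of the total length).
awk '{ len[NR] = $1; total += $1 }
     END {
         for (i = 1; i <= NR; i++) {
             running += len[i]
             if (running >= total / 2) { n50 = len[i]; break }
         }
         print "sequences=" NR, "total=" total, "N50=" n50
     }' toy-lengths.txt > toy-stats.txt
cat toy-stats.txt
```

For a gzip-compressed assembly, decompress it on the fly with gzip -dc and pipe the output into the first awk command instead of naming the file.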

Practical considerations

In an ideal world, you would all assemble all 12 libraries. However, given the constraints on our time and resources, you are free to reasonably reduce the size of your input data. I suggest doing this in two ways.

  • Select 6 libraries instead of 12. Perhaps you can choose 3 arbitrary queen samples and 3 arbitrary worker samples, or all 6 queens or all 6 workers. Either way, this should reduce the amount of runtime required to assemble the transcripts, and while leaving out libraries may make it impossible to recover a small number of transcripts, most should be recovered even from a subset of libraries.
  • For the libraries you choose to include, reduce the coverage of your data by digital normalization. In principle, this should reduce the amount of data you are providing as input without reducing the amount of information contained in that data. In other words, you are eliminating redundancy.
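
If you go with 3 arbitrary queen and 3 arbitrary worker libraries, the choice can be made and recorded in the shell. This is only a sketch: the labels q1 through q6 and w1 through w6 are assumed from the file naming used on this page (only q1 actually appears here), and the output file name is hypothetical.

```shell
# Hypothetical library labels: q1..q6 (queens) and w1..w6 (workers).
# Only "q1" appears in the file names on this page; the rest are assumed.
queens=$(printf 'q%s\n' 1 2 3 4 5 6 | shuf -n 3 | sort)
workers=$(printf 'w%s\n' 1 2 3 4 5 6 | shuf -n 3 | sort)

# Record the selection so the same subset is used in every later step.
printf '%s\n' $queens $workers > selected-libraries.txt
cat selected-libraries.txt
```

Writing the selection to a file makes it easy to loop over the same libraries for quality control, normalization, and assembly.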

Digital normalization

On the CGS-IU-v1.2 iPlant VMs, a digital normalization script should already be installed. If you want to use this script elsewhere, you must install the khmer package. Note that to run digital normalization on paired-end data, you must first interleave the paired files.

# Interleave the paired data into a single file.
paste <(paste - - - - < pdom-rnaseq-q1-1.fq) <(paste - - - - < pdom-rnaseq-q1-2.fq) | tr '\t' '\n' > pdom-rnaseq-q1-int.fq
# Run the digital normalization script (from khmer) with these options:
#   -k 20 -C 20 -N 4 -x 4e9 --paired --out pdom-diginorm-q1-int.fq pdom-rnaseq-q1-int.fq
# Decouple (split) the interleaved output into paired files
paste - - - - - - - - < pdom-diginorm-q1-int.fq \
    | tee >(cut -f 1-4 | tr '\t' '\n' > pdom-diginorm-q1-1.fq) \
    | cut -f 5-8 | tr '\t' '\n' > pdom-diginorm-q1-2.fq

Scripts for interleaving and splitting paired reads are also available (via khmer) on your VMs, but in my experience they are much slower than the shell commands shown in this example.
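
The interleave and split commands above rely on every FASTQ record occupying exactly four lines, so it is worth checking on a toy data set that a round trip returns the original files. The sketch below uses the same paste, cut, and tr steps, but with intermediate files instead of process substitution, so that it also runs in shells without that feature. All file names and reads are made up for illustration.

```shell
# Toy paired FASTQ files (hypothetical data, 2 reads each).
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' > toy-rt-1.fq
printf '@r1/2\nTTAA\n+\nIIII\n@r2/2\nCCGG\n+\nIIII\n' > toy-rt-2.fq

# Interleave: one record per line in each file, join the files side by side,
# then turn the tabs back into line breaks.
paste - - - - < toy-rt-1.fq > toy-rt-1.tab
paste - - - - < toy-rt-2.fq > toy-rt-2.tab
paste toy-rt-1.tab toy-rt-2.tab | tr '\t' '\n' > toy-rt-int.fq

# Split: one read pair (8 lines) per line, fields 1-4 are read 1, fields 5-8 are read 2.
paste - - - - - - - - < toy-rt-int.fq > toy-rt-int.tab
cut -f 1-4 toy-rt-int.tab | tr '\t' '\n' > toy-rt-out-1.fq
cut -f 5-8 toy-rt-int.tab | tr '\t' '\n' > toy-rt-out-2.fq

# The split files should be byte-for-byte identical to the inputs.
if cmp -s toy-rt-1.fq toy-rt-out-1.fq && cmp -s toy-rt-2.fq toy-rt-out-2.fq; then
    echo "round trip OK"
else
    echo "round trip FAILED" >&2
fi
```

If the round trip fails on real data, the most likely cause is a FASTQ file whose sequence or quality strings are wrapped over several lines.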


In class we've mentioned read de-duplication several times, but have not discussed specific software or recipes. For a while I used the FastUniq tool, but for several reasons I ended up implementing my own de-duplication program, sequniq. To install this program on your VM, simply run the following commands.

git clone
cd sequniq
sudo python install

Then, assuming your reads are interleaved, you can de-duplicate the reads with the following command.

./ < /path/to/data/pdom-q1-int.fq > /path/to/data/pdom-q1-int-dedup.fq
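
To make concrete what de-duplicating interleaved paired reads means, here is a rough shell-only sketch that keeps the first read pair seen for each combination of the two sequences. It is a stand-in for illustration only, not sequniq itself, and it holds every distinct sequence pair in memory, so it is not meant for full-size libraries. The file names and reads are made up.

```shell
# Toy interleaved FASTQ (hypothetical data): pairs p1 and p3 have identical sequences.
printf '@p1/1\nACGT\n+\nIIII\n@p1/2\nTTAA\n+\nIIII\n'  > toy-dd-int.fq
printf '@p2/1\nGGCC\n+\nIIII\n@p2/2\nCCGG\n+\nIIII\n' >> toy-dd-int.fq
printf '@p3/1\nACGT\n+\nIIII\n@p3/2\nTTAA\n+\nIIII\n' >> toy-dd-int.fq

# Put each read pair (8 lines) on one tab-separated line, keep the first pair
# seen for every (sequence 1, sequence 2) combination, then restore the lines.
# Field 2 is the first read's sequence and field 6 is the second read's sequence.
paste - - - - - - - - < toy-dd-int.fq \
    | awk -F '\t' '!seen[$2 FS $6]++' \
    | tr '\t' '\n' > toy-dd-int-dedup.fq

# Report how many pairs were kept.
echo "pairs before: $(( $(wc -l < toy-dd-int.fq) / 8 ))"
echo "pairs after:  $(( $(wc -l < toy-dd-int-dedup.fq) / 8 ))"
```

Comparing the pair counts before and after is also a useful check on the output of whichever de-duplication tool you end up using.
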
cgss15/transcript-assembly/assignment.txt · Last modified: 2015/04/20 09:04 by standage
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International