Raw data files for all 12 Polistes dominula libraries are available in the following directory in the iPlant data store.
ils /iplant/home/standage/Polistes_dominula/sequence/transcriptome
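To copy the raw data to your VM you can use the iRODS icommands; as a sketch, a recursive download of the whole directory (assuming the icommands are installed and configured for the iPlant data store) would look like this.

# Recursively download all raw data files into the current directory
iget -r /iplant/home/standage/Polistes_dominula/sequence/transcriptome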
The goal for this assignment is to download the data, check it for quality issues, perform appropriate quality control, assemble transcript sequences, and assess the assembly. Assessment of transcriptome assemblies will be discussed further in class, but much of what we learned from assessing genome assemblies will apply here.
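For the initial quality check, one option (not required by the assignment) is FastQC; a minimal sketch, assuming FastQC is installed and using the file names from the examples below, looks like this.

# Generate an HTML quality report for each read file
fastqc pdom-rnaseq-q1-1.fq pdom-rnaseq-q1-2.fq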
The transcriptome assembly produced by our research group for these data is available at /iplant/home/standage/CGS/pdom-tsa-r1.2.fa.gz for comparison.
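A sketch of fetching the reference assembly and doing a crude first comparison follows; my-assembly.fa is a placeholder for your own assembly file.

iget /iplant/home/standage/CGS/pdom-tsa-r1.2.fa.gz
gunzip pdom-tsa-r1.2.fa.gz
# Crude first comparison: count the sequences in each assembly
grep -c '^>' pdom-tsa-r1.2.fa my-assembly.fa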
In an ideal world it would be great if you all could do an assembly with all 12 libraries. However, because we have constraints on our time and resources, you are free to reduce the size of your data inputs within reason. I would suggest two approaches, both described below: digital normalization and read de-duplication.
On the CGS-IU-v1.2 iPlant VMs, there should be a script installed called normalize-by-median.py.
If you want to use this script elsewhere, you must install the khmer package.
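If you do need khmer on another machine, it is distributed on PyPI; depending on your environment the install may require sudo or a virtualenv.

pip install khmer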
Note that to run digital normalization on paired-end data, you must first interleave the paired files.
# Interleave the paired data into a single file
paste <(paste - - - - < pdom-rnaseq-q1-1.fq) <(paste - - - - < pdom-rnaseq-q1-2.fq) | tr '\t' '\n' > pdom-rnaseq-q1-int.fq

# Run the digital normalization script
normalize-by-median.py -k 20 -C 20 -N 4 -x 4e9 --paired --out pdom-diginorm-q1-int.fq pdom-rnaseq-q1-int.fq

# Decouple (split) the interleaved output into paired files
paste - - - - - - - - < pdom-diginorm-q1-int.fq \
    | tee >(cut -f 1-4 | tr '\t' '\n' > pdom-diginorm-q1-1.fq) \
    | cut -f 5-8 | tr '\t' '\n' > pdom-diginorm-q1-2.fq
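If you normalize several libraries, a simple loop over the same commands saves typing. This is only a sketch: the library labels q1 q2 q3 are placeholders for however many of the 12 libraries you actually use, and the file names follow the pattern from the example above.

for lib in q1 q2 q3; do
    # Interleave, then normalize, one library at a time
    paste <(paste - - - - < pdom-rnaseq-${lib}-1.fq) \
          <(paste - - - - < pdom-rnaseq-${lib}-2.fq) \
        | tr '\t' '\n' > pdom-rnaseq-${lib}-int.fq
    normalize-by-median.py -k 20 -C 20 -N 4 -x 4e9 --paired \
        --out pdom-diginorm-${lib}-int.fq pdom-rnaseq-${lib}-int.fq
done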
The scripts interleave-reads.py and split-paired-reads.py are also available (via khmer) on your VMs, but in my experience they are much slower than the shell commands shown in the example above.
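If you do want to try them, the invocations look roughly like this; options vary between khmer releases, so check each script's --help output on your VM.

interleave-reads.py -o pdom-rnaseq-q1-int.fq pdom-rnaseq-q1-1.fq pdom-rnaseq-q1-2.fq
split-paired-reads.py pdom-diginorm-q1-int.fq   # writes .1 and .2 output files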
In class we've mentioned read de-duplication several times, but we have not discussed specific software or recipes. For a while I used the FastUniq tool, but for several reasons I ended up implementing my own de-duplication program, sequniq. To install this program on your VM, run the following commands.
git clone https://github.com/standage/sequniq.git
cd sequniq
sudo python setup.py install
Then, assuming your reads are interleaved, you can de-duplicate the reads with the following command.
./dedup.py < /path/to/data/pdom-q1-int.fq > /path/to/data/pdom-q1-int-dedup.fq
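Putting the pieces together, a full de-duplication pass on one library might look like the sketch below; the file names are placeholders following the earlier examples, and dedup.py is run from the sequniq clone as above.

# Interleave the pairs, de-duplicate, then split back into paired files
paste <(paste - - - - < pdom-rnaseq-q1-1.fq) <(paste - - - - < pdom-rnaseq-q1-2.fq) \
    | tr '\t' '\n' > pdom-q1-int.fq
./dedup.py < pdom-q1-int.fq > pdom-q1-int-dedup.fq
paste - - - - - - - - < pdom-q1-int-dedup.fq \
    | tee >(cut -f 1-4 | tr '\t' '\n' > pdom-q1-dedup-1.fq) \
    | cut -f 5-8 | tr '\t' '\n' > pdom-q1-dedup-2.fq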