This is a sitemap over all available pages ordered by namespaces.
RNA-seq provides a powerful way to simultaneously reconstruct RNA transcripts and estimate their abundances in a sample. This can be done without a reference genome, which we will learn about later in the class. However, when a reference genome is available, it is common to map the RNA-seq reads to the genome as a first step in identifying transcripts and profiling their expression.
Multi-condition contrasts are very common in biology, and especially in RNA-seq analysis it is common to compare gene expression across different tissues, developmental stages, environmental pressures, or other experimental conditions. Genes that are differentially expressed across conditions are candidates for additional experiments to determine their cellular function and investigate their potential causative role.
The data for this exercise was taken from the jumping ant Harpegnathos saltator. It includes a 150kb region selected at random from the genome, as well as 6 samples of simulated RNA-seq data. For the sake of this exercise, we will assume that samples A, B, and C were taken from queens, while samples X, Y, and Z were taken from workers. Don't worry though, you don't need to know anything about insect biology for this exercise. We could just as easily labeled these 'condition A' and 'condition B' if we had wanted to.
Step 0 is not shown here as it was for previous exercises, but make sure to start with a clean working directory for this exercise.
As usual, the data can be accessed from the iPlant data store.
# Download the genome iget -V /iplant/home/standage/CGS/hsal-150kb.fa # Download queen samples iget -V /iplant/home/standage/CGS/hsal-sampleA-1.fq.gz iget -V /iplant/home/standage/CGS/hsal-sampleA-2.fq.gz iget -V /iplant/home/standage/CGS/hsal-sampleB-1.fq.gz iget -V /iplant/home/standage/CGS/hsal-sampleB-2.fq.gz iget -V /iplant/home/standage/CGS/hsal-sampleC-1.fq.gz iget -V /iplant/home/standage/CGS/hsal-sampleC-2.fq.gz # Download worker samples iget -V /iplant/home/standage/CGS/hsal-sampleX-1.fq.gz iget -V /iplant/home/standage/CGS/hsal-sampleX-2.fq.gz iget -V /iplant/home/standage/CGS/hsal-sampleY-1.fq.gz iget -V /iplant/home/standage/CGS/hsal-sampleY-2.fq.gz iget -V /iplant/home/standage/CGS/hsal-sampleZ-1.fq.gz iget -V /iplant/home/standage/CGS/hsal-sampleZ-2.fq.gz
The Tophat program uses Bowtie2 for alignment, so we first need to index the genome for mapping with Bowtie2.
bowtie2-build hsal-150kb.fa hsal-150kb
For the RNA-seq data, I simulated 85bp paired-end reads using an insert size of about 500bp, with a standard deviation of about 50bp. Insert size refers to the distance from one end of the cDNA fragment to the other or, in other words, its length (for more on terminology, this blog post gives a very clear description). You should usually be able to get this type of information from your sequencing center (if you're sequencing your own samples) or from the NCBI SRA if you're downloading publicly available data.
PE reads R1---------> <---------R2 fragment (~ are adapters, barcodes, etc) ~~~========================================~~~ insert (actual sequence of interest) ======================================== inner mate ====================
Tophat wants us to provide not the insert size, but the “inner distance” between the read pair. We can calculate this by subtracting 2x the read length from the insert size. For this example, we get the following.
inner distance = insert size - 2 x read length = 500bp - 2*85bp = 500bp - 170bp = 330bp
Once we calculate this value we can run Tophat for each sample as follows.
# Sample A tophat --output-dir hsal-sampleA-th \ --mate-inner-dist 330 \ --mate-std-dev 50 \ --num-threads 1 \ hsal-150kb \ hsal-sampleA-1.fq.gz \ hsal-sampleA-2.fq.gz # Sample B tophat --output-dir hsal-sampleB-th \ --mate-inner-dist 330 \ --mate-std-dev 50 \ --num-threads 1 \ hsal-150kb \ hsal-sampleB-1.fq.gz \ hsal-sampleB-2.fq.gz # Repeat for samples C, X, Y, Z
Once the reads are mapped, we can use Cufflinks to reconstruct transcript sequences. Again we do this on a sample-by-sample basis.
# Sample A cufflinks --output-dir hsal-sampleA-cuff \ --num-threads 1 \ hsal-sampleA-th/accepted_hits.bam # Sample B cufflinks --output-dir hsal-sampleB-cuff \ --num-threads 1 \ hsal-sampleB-th/accepted_hits.bam # Repeat for samples C, X, Y, Z
The Cufflinks software comes with some additional programs for differential expression analysis.
The cuffmerge
command will take multiple assemblies (i.e. from different samples) and combine them into a single consensus assembly.
Then the cuffdiff
command will examine the mapped reads and perform differential expression and differential splicing tests.
# Create a consensus assembly ls *-cuff/transcripts.gtf > assemblies.txt cuffmerge -o cons-assembly \ --num-threads 1 \ --ref-sequence hsal-150kb.fa \ assemblies.txt # Perform differential analysis cuffdiff --output-dir hsal-diff \ --labels Queen,Worker \ --num-threads 1 \ cons-assembly/merged.gtf \ hsal-sampleA-th/accepted_hits.bam,hsal-sampleB-th/accepted_hits.bam,hsal-sampleC-th/accepted_hits.bam \ hsal-sampleX-th/accepted_hits.bam,hsal-sampleY-th/accepted_hits.bam,hsal-sampleZ-th/accepted_hits.bam
The output directory hsal-diff
will contain quite a variety of files.
Files with the extension .diff
correspond to the differential expression/splicing tests.
A description of these files is available on the Cufflinks website.