Exercise: differential expression analysis

RNA-seq provides a powerful way to simultaneously reconstruct RNA transcripts and estimate their abundances in a sample. This can be done without a reference genome, which we will learn about later in the class. However, when a reference genome is available, it is common to map the RNA-seq reads to the genome as a first step in identifying transcripts and profiling their expression.

Multi-condition contrasts are very common in biology, and especially in RNA-seq analysis it is common to compare gene expression across different tissues, developmental stages, environmental pressures, or other experimental conditions. Genes that are differentially expressed across conditions are candidates for additional experiments to determine their cellular function and investigate their potential causative role.

The data for this exercise was taken from the jumping ant Harpegnathos saltator. It includes a 150kb region selected at random from the genome, as well as 6 samples of simulated RNA-seq data. For the sake of this exercise, we will assume that samples A, B, and C were taken from queens, while samples X, Y, and Z were taken from workers. Don't worry though, you don't need to know anything about insect biology for this exercise. We could just as easily labeled these 'condition A' and 'condition B' if we had wanted to.

Step 1: Data access

Step 0 is not shown here as it was for previous exercises, but make sure to start with a clean working directory for this exercise.

As usual, the data can be accessed from the iPlant data store.

# Download the genome
iget -V /iplant/home/standage/CGS/hsal-150kb.fa
# Download queen samples
iget -V /iplant/home/standage/CGS/hsal-sampleA-1.fq.gz
iget -V /iplant/home/standage/CGS/hsal-sampleA-2.fq.gz
iget -V /iplant/home/standage/CGS/hsal-sampleB-1.fq.gz
iget -V /iplant/home/standage/CGS/hsal-sampleB-2.fq.gz
iget -V /iplant/home/standage/CGS/hsal-sampleC-1.fq.gz
iget -V /iplant/home/standage/CGS/hsal-sampleC-2.fq.gz
# Download worker samples
iget -V /iplant/home/standage/CGS/hsal-sampleX-1.fq.gz
iget -V /iplant/home/standage/CGS/hsal-sampleX-2.fq.gz
iget -V /iplant/home/standage/CGS/hsal-sampleY-1.fq.gz
iget -V /iplant/home/standage/CGS/hsal-sampleY-2.fq.gz
iget -V /iplant/home/standage/CGS/hsal-sampleZ-1.fq.gz
iget -V /iplant/home/standage/CGS/hsal-sampleZ-2.fq.gz

Step 2: Align RNA-seq reads with Tophat

The Tophat program uses Bowtie2 for alignment, so we first need to index the genome for mapping with Bowtie2.

bowtie2-build hsal-150kb.fa hsal-150kb

For the RNA-seq data, I simulated 85bp paired-end reads using an insert size of about 500bp, with a standard deviation of about 50bp. Insert size refers to the distance from one end of the cDNA fragment to the other or, in other words, its length (for more on terminology, this blog post gives a very clear description). You should usually be able to get this type of information from your sequencing center (if you're sequencing your own samples) or from the NCBI SRA if you're downloading publicly available data.

PE reads                                      R1--------->                    <---------R2
fragment (~ are adapters, barcodes, etc)     ~~~========================================~~~
insert (actual sequence of interest)            ========================================
inner mate                                                ====================

Tophat wants us to provide not the insert size, but the “inner distance” between the read pair. We can calculate this by subtracting 2x the read length from the insert size. For this example, we get the following.

inner distance = insert size - 2 x read length
               = 500bp - 2*85bp
               = 500bp - 170bp
               = 330bp

Once we calculate this value we can run Tophat for each sample as follows.

# Sample A
tophat --output-dir hsal-sampleA-th \
       --mate-inner-dist 330 \
       --mate-std-dev 50 \
       --num-threads 1 \
       hsal-150kb \
       hsal-sampleA-1.fq.gz \
# Sample B
tophat --output-dir hsal-sampleB-th \
       --mate-inner-dist 330 \
       --mate-std-dev 50 \
       --num-threads 1 \
       hsal-150kb \
       hsal-sampleB-1.fq.gz \
# Repeat for samples C, X, Y, Z

Once the reads are mapped, we can use Cufflinks to reconstruct transcript sequences. Again we do this on a sample-by-sample basis.

# Sample A
cufflinks --output-dir hsal-sampleA-cuff \
          --num-threads 1 \
# Sample B
cufflinks --output-dir hsal-sampleB-cuff \
          --num-threads 1 \
# Repeat for samples C, X, Y, Z

Step 4: Differential expression analysis

The Cufflinks software comes with some additional programs for differential expression analysis. The cuffmerge command will take multiple assemblies (i.e. from different samples) and combine them into a single consensus assembly. Then the cuffdiff command will examine the mapped reads and perform differential expression and differential splicing tests.

# Create a consensus assembly
ls *-cuff/transcripts.gtf > assemblies.txt
cuffmerge -o cons-assembly \
          --num-threads 1 \
          --ref-sequence hsal-150kb.fa \
# Perform differential analysis
cuffdiff --output-dir hsal-diff \
         --labels Queen,Worker \
         --num-threads 1 \
         cons-assembly/merged.gtf \
         hsal-sampleA-th/accepted_hits.bam,hsal-sampleB-th/accepted_hits.bam,hsal-sampleC-th/accepted_hits.bam \

The output directory hsal-diff will contain quite a variety of files. Files with the extension .diff correspond to the differential expression/splicing tests. A description of these files is available on the Cufflinks website.

cgss15/mapping/exercise-diff.txt · Last modified: 2015/02/04 10:40 by standage
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International
Recent changes RSS feed Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki