Assignment: RNA-seq and differential expression analysis

The second exercise for this unit introduced you to a very popular toolset for mapping and analyzing RNA-seq data. For this assignment you will apply what you have learned to analyze a real dataset. The assignment description is provided below.

  • Take a look at the example data set provided below. We will be using the data from this paper and trying to reproduce the analysis. If you have your own research data or a different paper describing an RNA-seq experiment you'd like to replicate, feel free to use that instead.
  • Do a brief quality check and determine whether the data has any serious issues that need to be addressed before proceeding.
  • Align the RNA-seq reads for each sample to the genome. The A. mellifera genome can be downloaded from NCBI, but for convenience I have posted it on the iPlant data store and you can download it with iget at the path /iplant/home/standage/CGS/Amel.gdna.fa.
  • Perform a differential expression (DE) analysis. You can use the Cufflinks/Cuffmerge/Cuffdiff pipeline covered in the exercise, or you can try other tools for measuring abundance (RSEM, eXpress) and DE analysis (edgeR, EBSeq, DESeq).
  • Assess the results. Does the number of differentially expressed genes identified by your analysis match what is reported in the paper? Pick a handful of genes designated as differentially expressed and visualize them with a tool like IGV. Do you think those individual results are reasonable?
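As a rough sketch, the alignment and DE steps might look like the commands below. This assumes the Tophat/Cufflinks toolchain from the exercise, and the sample filenames (c1.1.fq.gz, t1.1.fq.gz, etc.) produced by the download script in the appendix; adjust paths, labels, and thread counts to your own setup.

```shell
# Sketch only -- not a prescribed pipeline. Assumes bowtie2, tophat,
# and the cufflinks suite are installed and on your PATH.

# Fetch the genome and build a Bowtie2 index (one time)
iget /iplant/home/standage/CGS/Amel.gdna.fa
bowtie2-build Amel.gdna.fa Amel

# Align each sample's paired-end reads, then assemble transcripts
for sample in c{1..6} t{1..6}
do
  tophat -p 8 -o ${sample}-tophat Amel ${sample}.1.fq.gz ${sample}.2.fq.gz
  cufflinks -p 8 -o ${sample}-cufflinks ${sample}-tophat/accepted_hits.bam
done

# Merge the per-sample assemblies
ls -1 *-cufflinks/transcripts.gtf > assemblies.txt
cuffmerge -p 8 -s Amel.gdna.fa assemblies.txt   # writes merged_asm/merged.gtf

# Test for differential expression, 6 replicates per condition
ctrl=$(ls c?-tophat/accepted_hits.bam | paste -s -d, -)
trt=$(ls t?-tophat/accepted_hits.bam | paste -s -d, -)
cuffdiff -p 8 -o diff_out -L control,knockdown merged_asm/merged.gtf ${ctrl} ${trt}
```

The DE results (e.g. diff_out/gene_exp.diff) are what you would compare against the counts reported in the paper.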

Appendix: example data set

A recent paper in PNAS describes an RNAi knockdown experiment in honeybee. The authors performed a targeted knockdown of the Dnmt3 methylation enzyme and profiled gene expression in treated individuals versus control individuals (no knockdown). The data is available from the NCBI SRA under accession number SRP024289 and includes 12 samples: 6 biological replicates for each condition (control and treatment). Each sample was sequenced as two SRA runs (24 runs in total), which the script below merges into a single pair of Fastq files per sample.

For convenience, I'm providing a script that will automate the data download and .sra → .fastq conversion process, as well as give the data files more descriptive filenames.

download.sh
#!/usr/bin/env bash
set -eo pipefail
 
SRA=ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR893
 
# Download all 24 sequence runs (2 runs per sample), serially
for i in {21..44}
do
  curl -O ${SRA}/SRR8930${i}/SRR8930${i}.sra
done
 
# Convert all 24 runs to Fastq format (in parallel)
for i in {21..44}
do
  fastq-dump --split-files SRR8930${i}.sra &
done
wait          # Don't delete the .sra files until every conversion has finished
rm *.sra      # Get rid of SRA files to reduce disk usage
 
# Rename controls: two consecutive runs are merged into each replicate
# 112c=c1, 114c=c2, 115c=c3, 131c=c4, 136c=c5, 138c=c6
for i in {21..32}
do
  j=$((i - 20))            # run index among the controls (1..12)
  k=$((i % 2))
  l=$((1 - k))
  rep=$((1 + (j / 2) - l)) # replicate number (1..6)
  sample=c$rep
  cat SRR8930${i}_1.fastq >> ${sample}.1.fq
  cat SRR8930${i}_2.fastq >> ${sample}.2.fq
  rm SRR8930${i}_?.fastq   # Get rid of the original Fastq files to reduce disk usage
done
 
# Rename treatments (Dnmt3 knockdowns): two consecutive runs per replicate
# 187t=t1, 196t=t2, 198t=t3, 205t=t4, 212t=t5, 214t=t6
for i in {33..44}
do
  j=$((i - 32))            # run index among the treatments (1..12)
  k=$((i % 2))
  l=$((1 - k))
  rep=$((1 + (j / 2) - l)) # replicate number (1..6)
  sample=t$rep
  cat SRR8930${i}_1.fastq >> ${sample}.1.fq
  cat SRR8930${i}_2.fastq >> ${sample}.2.fq
  rm SRR8930${i}_?.fastq   # Get rid of the original Fastq files to reduce disk usage
done
 
gzip *.fq # Compress the final fastq files
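The replicate arithmetic in the renaming loops is compact but easy to misread. This standalone loop (no downloads required) prints the run-to-sample mapping it implies, so you can confirm it matches the comments before running the full script:

```shell
# Sanity check: print which sample each of the 24 runs is merged into
for i in {21..44}
do
  if [ "$i" -le 32 ]
  then
    prefix=c; j=$((i - 20))  # controls: runs 21-32
  else
    prefix=t; j=$((i - 32))  # treatments: runs 33-44
  fi
  k=$((i % 2))
  l=$((1 - k))
  rep=$((1 + (j / 2) - l))   # two consecutive runs share one replicate
  echo "SRR8930${i} -> ${prefix}${rep}"
done
```

The first and last lines printed should be "SRR893021 -> c1" and "SRR893044 -> t6", with the treatments starting at SRR893033 -> t1.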

Appendix: Cyberinfrastructure

The amount of disk space and run time you need for this assignment is substantially more than for previous assignments. If you run Tophat on one of the “tiny”-sized Atmosphere VMs, it will take a long time to process these 12 samples, and there's probably not enough disk space to hold the data anyway. Here are a few suggestions related to infrastructure.

  • The iPlant Discovery Environment has a module called TopHat2-PE for workflows. If you upload your fastq files to the iPlant data store (with the iput command), you should be able to run the analysis from the Discovery Environment instead of on the command line.
  • Another alternative is to run these jobs on Mason. This may be a good option if you have prior experience, but if not, then it will take some time and effort to get familiar with the job submission process. We will cover this in class soon.
  • Even if you run Tophat on the Discovery Environment, you'll still need to do some data management and analysis on a VM. You should probably use a “large1” instance size, which provides 8 CPUs and more memory. The iPlant folks also created a large 500GB EBS volume (a cloud “flash drive”) for everyone in the class, which will be essential for completing this assignment. If you try to run things in your normal home directory you WILL RUN OUT OF DISK SPACE! Since we haven't covered EBS volumes in class yet, please read the relevant iPlant documentation.
  • The following commands are useful to monitor your disk usage. The command du -sh will show how much space the current directory is taking up, and df -h will show the status of all disks on the system. If you see your EBS approaching 100% capacity, make sure to free up some space!
  • Since the “large1” instances have 8 processors, you can speed up any program that has a --num-threads option (or something similar) by setting --num-threads 8.
  • Dr. Brendel and I are working on a new image that has VNC working correctly. In the meantime, you may want to learn how to use the tmux program. It allows your terminal session to keep running even after you disconnect from the machine and close your laptop. Unless you're keen on watching your terminal for hours and hours, it is probably a good idea to try tmux. A quick web search will turn up the basic concepts.
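To get you started with tmux, the basic workflow looks like this (assuming tmux is installed on your VM; the session name "rnaseq" is just an example):

```shell
tmux new -s rnaseq      # start a new named session; run your long jobs inside it
# Press Ctrl-b, then d, to detach; the jobs keep running after you log out.
tmux ls                 # after reconnecting to the VM, list running sessions
tmux attach -t rnaseq   # reattach to the session and pick up where you left off
```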
cgss15/mapping/assignment-diff.txt · Last modified: 2015/02/06 11:52 by standage
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International