Exercise: eukaryotic genome assembly

It's difficult to craft a scaled down exercise for eukaryotic genome assembly. Any data set that would complete in 5-10 minutes probably isn't a good example of what you're likely to deal with in “the real world”, but we want to follow the pattern we've used so far in the class by starting off with a manageable example before moving on to full-scale analysis and interpretation.

For the eukaryotic genome assembly assignment, we'll be working with a data set from the paper wasp Polistes dominula. The complete data set has 5 Illumina libraries, but for this preliminary exercise we will use only the 500 bp insert library and the 3 kb insert library. I have applied digital normalization to these libraries to reduce their coverage while retaining their information content.

Despite my best efforts to encourage you to branch out, it seems like most everyone used Velvet for the prokaryotic genome assembly assignment. This exercise will introduce you to another popular assembler from BGI called SOAPdenovo. The assembler is pre-installed on your iPlant VMs, although there is not enough memory on those machines to do anything but the smallest of assemblies. SOAPdenovo is also available on Mason, however, and since we'll need Mason's computing power for the main assignment we may as well use it for the exercise as well.

  1. Just as the iPlant VMs have limited storage space, you only have a few GB of storage in your Mason home directory. However, Mason does provide a scratch directory for each user that has enormous storage capacity. The idea is to set up a working directory for your analyses on the scratch disk, do your computations there, and then transfer the results to long-term storage elsewhere (iPlant data store, local machine, etc.).
  2. Mason does not have the iRODS data store commands (iinit, iget, etc) installed by default. Install iRODS in your home directory (using the instructions below) before you begin the assignment.
  3. Download the following data files from the iPlant data store (two libraries, each with a pair of fastq files).
    • /iplant/home/standage/CGS/pdom-500bp-1.fq
    • /iplant/home/standage/CGS/pdom-500bp-2.fq
    • /iplant/home/standage/CGS/pdom-3kb-1.fq
    • /iplant/home/standage/CGS/pdom-3kb-2.fq
  4. Skip quality control for now and move on to assembly. Use the SOAPdenovo assembler to reconstruct the genome sequence.
  5. Look at the terminal output from Mason (stdout and stderr). Are there any error messages of which you should be aware? Did everything seem to complete correctly? What can you say about the assembly from the asmbleval.pl script?
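
Steps 1 and 3 above could be scripted roughly as follows. This is only a sketch: `WORKDIR` and `DRY_RUN` are names made up for illustration, the working directory defaults to a subdirectory of wherever you run the script (so run it from your scratch directory), and it assumes you have already installed iRODS and authenticated once with `iinit`. By default it only prints the `iget` commands so you can check the paths first.

```shell
#!/bin/bash
# Sketch: stage the exercise data in a working directory.
# WORKDIR is a placeholder (assumption); by default it is created under the
# current directory, so run this from your scratch directory.
WORKDIR="${WORKDIR:-$PWD/assembly-exercise}"
# DRY_RUN=1 (the default here) only prints the iget commands;
# set DRY_RUN=0 to actually download.
DRY_RUN="${DRY_RUN:-1}"

DATASTORE=/iplant/home/standage/CGS
FILES="pdom-500bp-1.fq pdom-500bp-2.fq pdom-3kb-1.fq pdom-3kb-2.fq"

mkdir -p "$WORKDIR"

for f in $FILES; do
    if [ "$DRY_RUN" = "1" ]; then
        echo "iget $DATASTORE/$f $WORKDIR/$f"
    else
        iget "$DATASTORE/$f" "$WORKDIR/$f"
    fi
done
```

Running it with `DRY_RUN=0` performs the downloads; these files are large, so expect the transfer to take a while.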

Installing iRODS

Download the iRODS archive (irods3.1.tgz, the file unpacked in the commands below) to your Mason home directory and then follow these instructions. The setup script will ask you several questions about what you want to build. Answer no to all the questions except "Save configuration" and "Start iRODS build".

cd ~
tar xzf irods3.1.tgz
cd iRODS
./irodssetup
export PATH=~/iRODS/clients/icommands/bin:$PATH
# You'll probably want to add this last command to your .bashrc file.
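
One way to add that PATH line to your `.bashrc` without duplicating it on repeated runs is sketched below. The function name `add_icommands_path` is made up for illustration, and the sketch assumes iRODS was unpacked in `~/iRODS` as in the commands above.

```shell
# Sketch: append the iRODS PATH line to an rc file unless it is already there.
# add_icommands_path is a hypothetical helper name, not part of iRODS.
add_icommands_path() {
    local rcfile="${1:-$HOME/.bashrc}"
    # Single quotes keep ~ and $PATH literal so they expand when .bashrc is read.
    local line='export PATH=~/iRODS/clients/icommands/bin:$PATH'
    # -x matches the whole line, -F treats the pattern as a fixed string.
    grep -qxF "$line" "$rcfile" 2>/dev/null || echo "$line" >> "$rcfile"
}
```

Calling `add_icommands_path` with no argument edits `~/.bashrc`. Afterwards, `command -v iget` in a new login shell should print a path under `~/iRODS`.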

Assembly with SOAPdenovo

Using the example config from the SOAPdenovo website, I created a config file for these data sets.

soap.cnf
[LIB]
#maximal read length
max_rd_len=100
#average insert size
avg_ins=500
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#in which order the reads are used while scaffolding
rank=1
#fastq files
q1=/path/to/your/workdir/pdom-500bp-1.fq
q2=/path/to/your/workdir/pdom-500bp-2.fq
[LIB]
max_rd_len=35
avg_ins=3000
reverse_seq=0
asm_flags=3
rank=2
q1=/path/to/your/workdir/pdom-3kb-1.fq
q2=/path/to/your/workdir/pdom-3kb-2.fq
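
Because the `q1=` and `q2=` entries are placeholders that you have to edit, it can be worth checking that every read file named in the config actually exists before you queue a job. The helper below is a sketch for that purpose only; `check_soap_config` is a made-up name, and it looks at nothing except the `q1`/`q2` lines.

```shell
# Sketch: verify that every q1=/q2= file listed in a SOAPdenovo config exists.
# check_soap_config is a hypothetical helper, not part of SOAPdenovo.
check_soap_config() {
    local cnf="$1"
    local missing=0
    local key path
    # Split each line at the first "="; the "|| [ -n ... ]" part also handles
    # a final line that lacks a trailing newline.
    while IFS='=' read -r key path || [ -n "$key" ]; do
        case "$key" in
            q1|q2)
                if [ ! -f "$path" ]; then
                    echo "missing read file: $path" >&2
                    missing=1
                fi
                ;;
        esac
    done < "$cnf"
    # Exit status 0 if all files were found, 1 otherwise.
    return "$missing"
}
```

For example, `check_soap_config soap.cnf && echo "config looks OK"` prints the confirmation only when all four fastq files are found.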

With your config file prepped, you could run SOAPdenovo as follows.

module load soapdenovo2
time SOAPdenovo-63mer all -s soap.cnf -K 27 -o output-dir

However, remember that we cannot run jobs on the interactive node. We have to create a launch script and submit it to the queue for execution. Here is an example launch script.

run-soapdenovo.sh
#!/bin/bash
 
#PBS -N AssemblyExercise1
#PBS -l nodes=1:ppn=32,walltime=4:00:00,vmem=500gb
#PBS -k oe
#PBS -q shared
#PBS -m bea
#PBS -M youremail@indiana.edu
 
module load soapdenovo2
WORKDIR=/path/to/your/workdir
 
SOAPdenovo-63mer all -s $WORKDIR/soap.cnf -K 27 -p 32 -o $WORKDIR/output-dir

With your config file and launch script in place, you can submit your job like so.

qsub run-soapdenovo.sh
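
When the job finishes, the `#PBS -k oe` line in the launch script means the scheduler should keep the job's stdout and stderr, typically as files named like `AssemblyExercise1.o<jobid>` and `AssemblyExercise1.e<jobid>` in your home directory (this naming is the usual PBS/TORQUE behavior; check what Mason actually produces). A quick way to skim those files for trouble is sketched below. The function name and the list of keywords are my own guesses at what a problem might look like, not messages documented by SOAPdenovo, so treat any hit as a line to read rather than proof of failure.

```shell
# Sketch: flag suspicious lines in job log files.
# scan_job_logs is a hypothetical helper; the keyword list is an assumption.
scan_job_logs() {
    # -H file name, -i case-insensitive, -n line number, -E extended regex.
    if grep -HinE 'error|fail|killed|segmentation fault|cannot|no such file' "$@"; then
        return 1   # at least one suspicious line was printed
    fi
    return 0       # nothing matched
}
```

For example, `scan_job_logs ~/AssemblyExercise1.[oe]*` prints every matching line with its file name and line number.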

Testing jobs interactively

In class we discussed using Mason's interactive queue to troubleshoot and test your jobs. This is often a good idea for any task you need to do, not just this exercise. Once you have your launch script, data files, and any other configuration files in place, you can use qsub to request an interactive session. You will probably want to run the command inside a tmux session; otherwise you will have to keep your terminal open until the interactive session begins. If you use the command below, Mason will send you an email when your session begins, so you can close your terminal and log out if you need to.

Usually you want to request an interactive session with much smaller resource requests so that you will not have to wait as long in the queue. Instead of requesting all 32 processors on a node, request 4, 8, or 16. Also, you probably won't need more than 30-60 minutes to troubleshoot, so requesting a shorter walltime will also reduce your waiting time in the queue.

qsub -I -q shared -l nodes=1:ppn=8,vmem=64gb,walltime=1:00:00 -M youremail@indiana.edu -m abe

Once the session begins, you can go to your working directory and run your launch script. If the SOAPdenovo command in your launch script uses 32 threads but you only requested 8 processors for your interactive session, make sure to change your thread count to 8 before running the script.

bash run-soapdenovo.sh
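
Rather than editing the thread count by hand each time, the launch script could work it out from the session itself. The sketch below assumes Mason's scheduler sets `PBS_NODEFILE` the way PBS/TORQUE normally does, with one line per processor you were granted; `thread_count` is a made-up helper name, and its argument is the fallback to use when that variable is not available.

```shell
# Sketch: derive the thread count from the PBS allocation.
# thread_count is a hypothetical helper; $1 is the fallback value.
thread_count() {
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        # One line per allocated processor slot; strip padding from wc output.
        wc -l < "$PBS_NODEFILE" | tr -d ' '
    else
        echo "${1:-1}"
    fi
}
```

In the launch script you could then write `-p "$(thread_count 32)"` in place of `-p 32`, so the same script runs unchanged in an 8-processor interactive session and in the full 32-processor batch job.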

If there is a problem with your launch script or your config file, the command will probably fail right away. You can then use the error message to fix the problem and try again. Once the command runs for 5-10 minutes without an error, you can be reasonably confident that everything is OK. You can then use Ctrl-C to stop the run, type “exit” to close your interactive session, and then submit your job to the queue with qsub.

cgss15/genome-assembly/exercise-euk.txt · Last modified: 2015/03/09 14:24 by standage
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International