Probing DNA methylation with BS-seq

Bisulfite sequencing - the basics

Bismark - our software of choice

Task 1:

  1. Please install the Bismark software on your VMs.
  2. Create a small “genome” (maybe 500 bases in a FASTA file).
  3. Create a few “reads” matching the genome (single and paired-end, from both forward and reverse strands).
  4. Simulate bisulfite conversion: change most but not all of the C's in the reads to T; the unchanged C's should then be identified by the Bismark software as indicators of cytosine methylation.
  5. Run the Bismark software according to the documentation (“prepare” your genome, map the reads, extract the methylation calls).
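The bisulfite-conversion step (4) can be sketched as a short awk one-liner. This is just one possible toy implementation, assuming made-up file names (reads.fa, reads.bs.fa) and a 90% conversion rate; adjust to your own toy data:

```shell
# Create a toy read (replace with your own reads matching your toy genome):
printf '>read1\nACGTCCGATCGGCCATTACG\n' > reads.fa

# Simulated bisulfite conversion: change each C to T with 90% probability;
# the surviving C's mimic methylated cytosines protected from conversion.
awk 'BEGIN{srand(42)}
     /^>/ {print; next}
     {
       out = ""
       for (i = 1; i <= length($0); i++) {
         b = substr($0, i, 1)
         if (b == "C" && rand() < 0.9) b = "T"
         out = out b
       }
       print out
     }' reads.fa > reads.bs.fa

cat reads.bs.fa
```

Note that reads simulated from the reverse strand would instead show G-to-A changes after mapping; handle the two strands separately when building your test set.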

Task 2 involves running real data on Mason:

  1. Log into Mason and create a task directory under /N/dc2/scratch/<username>/
  2. Copy the template qsub* and Makefile* files from /N/dc2/scratch/sarasank/
  3. Copy the (Apis mellifera) genome directory from /N/dc2/scratch/sarasank/

You will have to edit the qsub file to make it your own. Then take a look at the Makefile you want to run: the qsub script will merely invoke the workflow represented by the Makefile. If all goes well, the entire workflow (with all its dependencies) will be launched and run to completion.
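A qsub script of this kind typically looks like the following sketch. The job name, queue, and resource values here are illustrative placeholders, not the actual course template; copy and edit the provided template rather than this:

```shell
#!/bin/bash
#PBS -N bsseq-wf1-4
#PBS -l nodes=1:ppn=4,vmem=16gb,walltime=12:00:00
#PBS -q batch

cd $PBS_O_WORKDIR            # run from the directory the job was submitted in
make -j 4 -f Makefile_WF1-4  # launch the entire workflow via the Makefile
```

Submit with "qsub <yourscript>" and monitor with "qstat -u <username>".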

Decoding the Makefile

You will be familiar with the make command from software installations. Most source code is distributed with make instructions that do what the name suggests: generate an executable from the source code, incorporating both the downloaded source files and pre-installed system-wide libraries. What makes make so convenient is that the many dependencies typically involved in creating the executable are taken care of automatically. Code is compiled in an orderly fashion; if a prerequisite is missing, the user is told, and once it is supplied, compilation can resume. When some files are updated, make detects which other files are affected and rebuilds only those, producing a suitably updated executable without unnecessary work on unaffected files.

The use of make in the following is analogous, although the end product is not an executable, but output from a workflow processing BS-seq data.

Let's look at one of our sample makefiles section by section. The header section displayed below gives running instructions and customization notes. Running could not be simpler: all we need to do is type “make -f <makefile>”, where “<makefile>” is our edited makefile. The optional argument “-j 4” is one of the reasons make is so powerful: it instructs make to use up to 4 processors in parallel. When different sub-tasks are independent of each other, they are processed simultaneously, speeding up the whole workflow from the user's perspective.
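To see the dependency handling and "-j" parallelism in action without touching the real workflow, you can try a tiny self-contained demonstration (demo.mk and its targets are made up for illustration):

```shell
# Write a toy makefile with two independent targets plus one that combines them.
# (printf is used so the required tab characters in recipes are explicit.)
printf 'all: left.txt right.txt\n\tcat left.txt right.txt > combined.txt\n\nleft.txt:\n\techo left > left.txt\n\nright.txt:\n\techo right > right.txt\n' > demo.mk

# left.txt and right.txt are independent, so with -j 2 make may build them
# simultaneously; "all" runs only after both exist.
make -j 2 -f demo.mk

cat combined.txt   # combined.txt now contains "left" and "right"
```

Running make a second time does nothing: all targets are up to date, which is exactly the behavior the workflow below relies on when it is resumed after an interruption.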

In the following, several variables are set, including paths to program executables, program options, data sources, and user-defined labels. Here, for example, the BS-seq data source is specified by an entry in the NCBI Sequence Read Archive; the makefile will use the “wget” command to retrieve the data. Note that additional input files are assumed to be on the system when you run the workflow: first, the relevant genome sequence file in FASTA format (as ./genome/Amel.gdna.fa), and second, the relevant genome annotation file in GFF3 format (as ./genome/$(GFF3)).

The last few lines in this section are the first generic lines, intended to be shared by all workflow invocations. That's an essential concept in the notion of workflows: we typically want a few input selections and a few knobs to fine-tune parameters, but basically we do a lot of design work up front and then should be able to apply the workflow to loads of new data. The three lines starting with “all:”, “sample_stats:”, and “.PHONY:” tell make what to make: by default “all”, and always “.PHONY”. Note that “sample_stats” appears both as a prerequisite of “all” and as a target in its own right, which keeps the write-up a little cleaner: everything listed to the right of “sample_stats:” will be made as part of “all”.

#Makefile_WF1-4
#
#Makefile for workflow steps 1-4, from downloading SRA data to the Creport.
#Labeling of steps follows Fig. 1 in the manuscript.

#This is the paired-read version.  For single-read analysis, use Makefile_WF1-4s.
#Version: February 2, 2015.

#Usage: make -j 4 -f Makefile_WF1-4
#       make -f Makefile_WF1-4 cleanup

#Please be careful while editing the Makefile as it is 'tab-sensitive'.
#Typical customization should only involve appropriate editing of the variables in the next section.

##########################Variable Settings#####################################
###
GENOME  = Amel.gdna
SAMPLE  = SRR039815
SYNONYM = Amel-queen
SPECIES = Amel
SOURCE  = "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX019/SRX019114/$(SAMPLE)"

BISMARK     = /N/dc2/scratch/sarasank/NGS-DIR/bismark_v0.13.0mod
BOWTIE2     = /N/dc2/scratch/sarasank/NGS-DIR/bowtie2-2.2.3
FASTQC      = /N/dc2/scratch/sarasank/NGS-DIR/FASTQC/FastQC
SAMTOOLS    = /N/dc2/scratch/sarasank/NGS-DIR/SAMTOOLS/samtools
SCRIPTS     = /N/dc2/scratch/sarasank/NGS-DIR
SRATOOLKIT  = /N/dc2/scratch/sarasank/NGS-DIR/SRATOOLKIT/sratoolkit.2.3.5-2-centos_linux64/bin
TRIM_GALORE = /N/dc2/scratch/sarasank/NGS-DIR/TRIM_GALORE/trim_galore_zip

BOWTIE2_OPTIONS      = -p 4 --score_min L,0,-0.6
FASTQC_OPTIONS       = --threads 2
SAMTOOLS_SORT_OPTION = -@ 4 -m 5G
TRIM_GALORE_OPTIONS  = 
BME_OPTIONS          = --multicore 10
GFF3		     = ref_Amel_4.5_top_level.gff3
### ! Note: The --phred33/64 option to trim_galore is set below according to the output of fastqc
###
################################################################################
### ! Typically there would be no need for further editing below this line ! ###


all:	genome fastqc1 fastqc2 $(SYNONYM).sam $(SYNONYM).M-bias.eval $(SYNONYM).Creport genome_size sample_stats genome/$(SPECIES).%.gff3
sample_stats:	rawread_nb rawread_lgth rawsample_size trmread_nb trmread_lgth trmsample_size genome_coverage
.PHONY:	./genome/$(GENOME) Bisulfite_Genome



The next sections specify what actually should be done for each “target”, and what dependencies are involved. For example, “genome” depends on “./genome/$(GENOME)” and “Bisulfite_Genome”. Read on, and you see what each of those dependencies involves: for the first, we just check that the genome FASTA file is present; for the second, we run the “bismark_genome_preparation” step that you are familiar with from the first exercise. Next, “wget” is used to download the specified read data, and an SRA Toolkit program is used to unpack the “.sra” file into the “.fastq” files for left and right reads.

###1.3 bismark genome preparation.
###    (requires the genome of interest, $(GENOME).fa, to have been placed into the folder ./genome)
#
genome:	./genome/$(GENOME) Bisulfite_Genome

./genome/$(GENOME):	
ifeq ("$(wildcard ./genome/$(GENOME).fa)","")
	$(error No genome file ./genome/$(GENOME).fa found. Please provide.)
endif

Bisulfite_Genome:	
ifeq ("$(wildcard ./genome/Bisulfite_Genome/*_conversion/*.fa)","")
	$(BISMARK)/bismark_genome_preparation --path_to_bowtie $(BOWTIE2) --bowtie2 ./genome
endif


###1.1 Obtain the SRA reads from NCBI:
#
$(SAMPLE).sra:	
	wget $(SOURCE)/$(SAMPLE).sra

%_1.fastq %_2.fastq:	%.sra
	$(SRATOOLKIT)/fastq-dump --split-files $*.sra


Quality checks:


###2.2 Run FastQC to check the quality of the reads:
#
fastqc1:	FastQCdir FastQC/$(SAMPLE)_1_fastqc/fastqc_data.txt FastQC/$(SAMPLE)_2_fastqc/fastqc_data.txt

FastQCdir:	
ifeq ("$(wildcard ./FastQC)","")
	mkdir FastQC
endif

FastQC/$(SAMPLE)_1_fastqc/fastqc_data.txt:	$(SAMPLE)_1.fastq
	$(FASTQC)/fastqc  $(FASTQC_OPTIONS)  --outdir=FastQC  $(SAMPLE)_1.fastq
	cd FastQC && unzip -o  $(SAMPLE)_1_fastqc.zip

FastQC/$(SAMPLE)_2_fastqc/fastqc_data.txt:	$(SAMPLE)_2.fastq
	$(FASTQC)/fastqc  $(FASTQC_OPTIONS)  --outdir=FastQC  $(SAMPLE)_2.fastq
	cd FastQC && unzip -o  $(SAMPLE)_2_fastqc.zip


###1.2 Trim the reads to remove low quality bases and adapter sequences:
#
%_1_val_1.fq %_2_val_2.fq:	%_1.fastq %_2.fastq FastQC/$(SAMPLE)_1_fastqc/fastqc_data.txt
	$(eval ENCODING = $(shell awk 'NR==6' FastQC/$(SAMPLE)_1_fastqc/fastqc_data.txt | awk -F" " '{printf("%s",$$5)}'))
	if [ $(ENCODING) = "1.9" ]; then \
		$(TRIM_GALORE)/trim_galore  $(TRIM_GALORE_OPTIONS) --phred33  --paired $*_1.fastq $*_2.fastq; \
	else \
		$(TRIM_GALORE)/trim_galore  $(TRIM_GALORE_OPTIONS) --phred64  --paired $*_1.fastq $*_2.fastq; \
	fi 


###2.4 Run FastQC again to check the quality of the trimmed reads:
#
fastqc2:	FastQCdir FastQC/$(SAMPLE)_1_val_1.fq_fastqc/fastqc_data.txt FastQC/$(SAMPLE)_2_val_2.fq_fastqc/fastqc_data.txt

FastQC/$(SAMPLE)_1_val_1.fq_fastqc/fastqc_data.txt:	$(SAMPLE)_1_val_1.fq
	$(FASTQC)/fastqc  $(FASTQC_OPTIONS)  --outdir=FastQC  $(SAMPLE)_1_val_1.fq
	cd FastQC && unzip -o  $(SAMPLE)_1_val_1.fq_fastqc.zip

FastQC/$(SAMPLE)_2_val_2.fq_fastqc/fastqc_data.txt:	$(SAMPLE)_2_val_2.fq
	$(FASTQC)/fastqc  $(FASTQC_OPTIONS)  --outdir=FastQC  $(SAMPLE)_2_val_2.fq
	cd FastQC && unzip -o  $(SAMPLE)_2_val_2.fq_fastqc.zip


Mapping the trimmed reads to the genome:


###2.4 Map the reads to the genome (using bowtie2 via bismark):
#
$(SAMPLE)_1_val_1.fq_bismark_bt2_pe.sam:	$(SAMPLE)_1_val_1.fq $(SAMPLE)_2_val_2.fq
	$(BISMARK)/bismark  --path_to_bowtie $(BOWTIE2) --bowtie2 $(BOWTIE2_OPTIONS)  ./genome -1 $(SAMPLE)_1_val_1.fq -2 $(SAMPLE)_2_val_2.fq


Scrubbing the data further to remove PCR artifacts (which manifest as duplicated reads) and read-end positions whose methylation calls are biased and don't reflect expected C-to-T conversion rates:


###2.5-6 Remove potential PCR artifacts using the bismark deduplication script:
#
$(SYNONYM).sam:	$(SAMPLE)_1_val_1.fq_bismark_bt2_pe.sam
	$(BISMARK)/deduplicate_bismark -p $(SAMPLE)_1_val_1.fq_bismark_bt2_pe.sam
	mv $(SAMPLE)_1_val_1.fq_bismark_bt2_pe.deduplicated.sam $(SYNONYM).sam


###2.7-8 Determine potential methylation-biased positions in the reads:
#
$(SYNONYM).M-bias.txt:	$(SYNONYM).sam
	$(BISMARK)/bismark_methylation_extractor $(BME_OPTIONS) --mbias_only -p $(SYNONYM).sam
	mv $(SYNONYM)_splitting_report.txt $(SYNONYM)_mbias_only_splitting_report.txt

$(SYNONYM).M-bias.eval:	$(SYNONYM).M-bias.txt
	python $(SCRIPTS)/eval_prmbias.py $(SYNONYM).M-bias.txt > $(SYNONYM).M-bias.eval


Ignore the biased end positions when using bismark_methylation_extractor to generate methylation calls (Creport file):


###3.9 Generate the Creport file:
#
$(SYNONYM).bismark.cov:	$(SYNONYM).M-bias.eval $(SYNONYM).sam
	$(eval R1FP = $(shell egrep "^--ignore" $(SYNONYM).M-bias.eval | awk -F" " '{print $$2}'))
	$(eval R1TP = $(shell egrep "^--ignore" $(SYNONYM).M-bias.eval | awk -F" " '{print $$4}'))
	$(eval R2FP = $(shell egrep "^--ignore" $(SYNONYM).M-bias.eval | awk -F" " '{print $$6}'))
	$(eval R2TP = $(shell egrep "^--ignore" $(SYNONYM).M-bias.eval | awk -F" " '{print $$8}'))
	$(BISMARK)/bismark_methylation_extractor $(BME_OPTIONS) --mbias_off -p --ignore $(R1FP) --ignore_3prime $(R1TP) --ignore_r2 $(R2FP) --ignore_3prime_r2 $(R2TP)  --bedGraph --CX --buffer_size 10G --scaffolds $(SYNONYM).sam

$(SYNONYM).cov:	$(SYNONYM).bismark.cov
	sort -V $(SYNONYM).bismark.cov > $(SYNONYM).cov

creport:	./genome/$(GENOME).fa $(SYNONYM).cov
	$(BISMARK)/coverage2cytosine -o creport --genome_folder ./genome --CX  $(SYNONYM).cov

$(SYNONYM).Creport:	creport
	sort -V creport > $(SYNONYM).Creport


Generate some sample statistics (note that some parts of this are independent of the last few workflow steps, and thus will be executed in parallel):



#Genome size
#
$(GENOME).stats:	./genome/$(GENOME).fa
	$(SCRIPTS)/asmbleval.pl < ./genome/$(GENOME).fa > $(GENOME).stats

genome_size:	$(GENOME).stats
	$(eval genome_size = $(shell awk 'NR==1{print $$4}' $(GENOME).stats))
	@echo "Genome size: $(genome_size) bp"


#Number of raw reads:
# (to obtain the number of reads, we add the number of lines in both read files divided by 4)
#
rawread_nb:	$(SAMPLE)_1.fastq $(SAMPLE)_2.fastq		
	$(eval RAWREAD_NB1 = $(shell wc -l $(SAMPLE)_1.fastq|cut -d' ' -f1))
	$(eval RAWREAD_NB2 = $(shell wc -l $(SAMPLE)_2.fastq|cut -d' ' -f1 ))
	$(eval RAWREAD_NB  = $(shell awk "BEGIN {print $(RAWREAD_NB1)/4+$(RAWREAD_NB2)/4}"))
	@echo "Number of raw reads: $(RAWREAD_NB) (sum of left and right reads)" > $(SAMPLE).stats

#Raw read length:
#
rawread_lgth:	FastQC/$(SAMPLE)_1_fastqc/fastqc_data.txt FastQC/$(SAMPLE)_2_fastqc/fastqc_data.txt	
	$(eval RAWREAD_LGTH1 = $(shell awk 'NR==9' FastQC/$(SAMPLE)_1_fastqc/fastqc_data.txt | awk -F" " '{print $$3}'))
	$(eval RAWREAD_LGTH2 = $(shell awk 'NR==9' FastQC/$(SAMPLE)_2_fastqc/fastqc_data.txt | awk -F" " '{print $$3}'))
	@echo "Raw read length (per FastQC report): $(RAWREAD_LGTH1) (left reads), $(RAWREAD_LGTH2) (right reads)" >> $(SAMPLE).stats

#Raw sample size:
#
rawsample_size:	rawread_nb
	$(eval RAWREAD_SZE1 = $(shell awk 'BEGIN{sum=0;}{if(NR%4==2){sum+=length($$0);}}END{print sum;}' $(SAMPLE)_1.fastq))
	$(eval RAWREAD_SZE2 = $(shell awk 'BEGIN{sum=0;}{if(NR%4==2){sum+=length($$0);}}END{print sum;}' $(SAMPLE)_2.fastq))
	$(eval RAWSAMPLE_SZE =$(shell awk "BEGIN {print $(RAWREAD_SZE1) + $(RAWREAD_SZE2)}")) 
	@echo "Raw read sample size: $(RAWSAMPLE_SZE) bp" >> $(SAMPLE).stats


#Number of trimmed reads:
#
trmread_nb:	$(SAMPLE)_1_val_1.fq $(SAMPLE)_2_val_2.fq		
	$(eval TRMREAD_NB1 = $(shell wc -l $(SAMPLE)_1_val_1.fq|cut -d' ' -f1))
	$(eval TRMREAD_NB2 = $(shell wc -l $(SAMPLE)_2_val_2.fq|cut -d' ' -f1 ))
	$(eval TRMREAD_NB  = $(shell awk "BEGIN {print $(TRMREAD_NB1)/4+$(TRMREAD_NB2)/4}"))
	@echo "Number of trimmed reads: $(TRMREAD_NB) (sum of left and right reads)" >> $(SAMPLE).stats

#Trimmed read length:
#
trmread_lgth:	FastQC/$(SAMPLE)_1_val_1.fq_fastqc/fastqc_data.txt FastQC/$(SAMPLE)_2_val_2.fq_fastqc/fastqc_data.txt	
	$(eval TRMREAD_LGTH1 = $(shell awk 'NR==9' FastQC/$(SAMPLE)_1_val_1.fq_fastqc/fastqc_data.txt | awk -F" " '{print $$3}'))
	$(eval TRMREAD_LGTH2 = $(shell awk 'NR==9' FastQC/$(SAMPLE)_2_val_2.fq_fastqc/fastqc_data.txt | awk -F" " '{print $$3}'))
	@echo "Trimmed read length range (per FastQC report): $(TRMREAD_LGTH1) (left reads), $(TRMREAD_LGTH2) (right reads)" >> $(SAMPLE).stats
	$(eval TRMREAD_ML1 = $(shell cat $(SAMPLE)_1_val_1.fq | awk '{if(NR%4==2) print length($$1)}' > /tmp/$(SAMPLE)_readlength1.txt; sort -n /tmp/$(SAMPLE)_readlength1.txt| awk ' { a[i++]=$$1; } END { print a[int(i/2)]; }'))
	$(eval TRMREAD_ML2 = $(shell cat $(SAMPLE)_2_val_2.fq | awk '{if(NR%4==2) print length($$1)}' > /tmp/$(SAMPLE)_readlength2.txt; sort -n /tmp/$(SAMPLE)_readlength2.txt| awk ' { a[i++]=$$1; } END { print a[int(i/2)]; }'))
	@echo "Median length of trimmed reads: $(TRMREAD_ML1) (left reads), $(TRMREAD_ML2) (right reads)" >> $(SAMPLE).stats

#Trimmed sample size:
#
trmsample_size:	trmread_nb
	$(eval TRMREAD_SZE1 = $(shell awk 'BEGIN{sum=0;}{if(NR%4==2){sum+=length($$0);}}END{print sum;}' $(SAMPLE)_1_val_1.fq))
	$(eval TRMREAD_SZE2 = $(shell awk 'BEGIN{sum=0;}{if(NR%4==2){sum+=length($$0);}}END{print sum;}' $(SAMPLE)_2_val_2.fq))
	$(eval TRMSAMPLE_SZE =$(shell awk "BEGIN {print $(TRMREAD_SZE1) + $(TRMREAD_SZE2)}")) 
	@echo "Trimmed sample size: $(TRMSAMPLE_SZE) bp" >> $(SAMPLE).stats


#Genome coverage:
# (using samtools to first convert the SAM file into a sorted BAM file, then "samtools depth" to obtain the total aligned base
#  count, and finally dividing that value by the genome size to obtain the coverage)
genome_coverage:	$(SYNONYM).bam $(GENOME).stats
	$(eval TOTAL_ALGNDBASE_CNT = $(shell $(SAMTOOLS)/samtools depth $(SYNONYM).bam | awk '{sum+=$$3}END{print sum}'))
	$(eval genome_size = $(shell awk 'NR==1{print $$4}' $(GENOME).stats))
	$(eval COVERAGE = $(shell awk "BEGIN {printf \"%.2f\", $(TOTAL_ALGNDBASE_CNT)/$(genome_size) }"))
	@echo "Genome coverage: $(COVERAGE) (= $(TOTAL_ALGNDBASE_CNT) / $(genome_size))" >> $(SAMPLE).stats

$(SYNONYM).bam:	$(SYNONYM).sam
	$(SAMTOOLS)/samtools view -bS $(SYNONYM).sam | $(SAMTOOLS)/samtools sort  $(SAMTOOLS_SORT_OPTION)  - $(SYNONYM)

genome/$(SPECIES).%.gff3:	genome/$(GFF3) genome/$(GENOME).fa
	$(SCRIPTS)/parse_gff3_input.sh genome/$(GENOME).fa genome/$(GFF3) $(SPECIES) genome/GFF3DIR


Finally, some cleanup of files we may not want to store (invoke with "make -f <Makefile> cleanup"):



###4.10 Clean up the output directory:
#
cleanup:
	-\rm $(SAMPLE)*.fastq
	-\rm $(SAMPLE)_1_val_1.fq_bismark_bt2_pe.sam
	-\rm C*_$(SYNONYM).txt
	-\rm $(SYNONYM).bam
	-\rm $(SYNONYM).bismark.cov
	-\rm creport