Algorithms to find genes/products: ORFs (start and stop codons in phase), length, composition, GC content, context (extraneous), homology, codon usage.

Taking these factors into account, you can identify 60-70 percent of all genes.
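Two of the signals listed above, in-phase ORFs and GC content, are simple enough to sketch directly. The code below is only a minimal illustration and not how GENSCAN or Augustus work internally: it scans the forward strand only, and the function names and the `min_len` cutoff are assumptions made for this example.

```python
# Minimal sketch: forward-strand ORF scan plus GC content.
# Real gene finders combine many more signals (codon usage, homology, splice sites).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_orfs(seq, min_len=30):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_len is an arbitrary illustrative cutoff, in nucleotides.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

A fuller version would also scan the reverse complement and score codon usage. For eukaryotic genomic sequence an ORF scan alone is not enough, since introns interrupt the reading frame.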

For eukaryotes, introns complicate issues, but spliced alignment helps.

Assignment: use the webpages in the schedule to predict gene models from a 50 to 100MB sequence input, compare the gene predictions across the different platforms, and then compare them against the published structures.


Once again I am exploring the p53 gene, but this time I am looking at how different programs predict gene structure. These programs apply algorithms to nucleotide sequences in order to predict possible products from the genetic material. Prediction is made difficult by introns, which are found in most eukaryotic organisms and in a select group of microorganisms (Archaea). Introns are stretches of sequence that are spliced out of RNA transcripts during processing, before translation.

For this procedure, the 19193 bp sequence for p53 was downloaded from GenBank as a FASTA file.
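Before pasting the sequence into the prediction servers, it is worth confirming that the download is intact, for example that its length matches the expected 19193 bp. A minimal sketch of a FASTA reader for that check follows; the function name is an assumption for this example, and a library such as Biopython could be used instead.

```python
def read_fasta(lines):
    """Parse an iterable of FASTA text lines into a {header: sequence} dict."""
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records
```

Passing an open file handle for the downloaded file to `read_fasta` and taking `len()` of the single sequence returned would confirm the 19193 bp length.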

From the NCBI page, the genomic content is as listed:

The gene is located on chromosome 17 and is published as having 12 exons forming the coding region.

When the sequence is put into the GENSCAN webpage, the following output is given:

This was generated using the “vertebrate” preset, and the sub-optimal exon cutoff set at 1.00.

It is interesting that GENSCAN calls only 10 exons (type Intr, GENSCAN's label for internal exons) as opposed to the 12 from the NCBI page.
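Counting exons by eye in the output table is error-prone, so a short script can tally the rows by type. The sketch below assumes a whitespace-separated report in which each feature row begins with a gene.exon number (such as 1.02) followed by a type label (Init, Intr, Term and so on). That layout is an assumption about the text report, and the rows in `EXAMPLE` are made up for illustration; they are not the p53 results.

```python
import re
from collections import Counter

# Row identifiers look like "1.03": gene 1, feature 3 (assumed layout).
ROW_ID = re.compile(r"^\d+\.\d+$")


def count_feature_types(report_text):
    """Count the rows of a GENSCAN-style table by their Type column."""
    counts = Counter()
    for line in report_text.splitlines():
        fields = line.split()
        # Header and separator lines do not start with a gene.exon number.
        if len(fields) >= 2 and ROW_ID.match(fields[0]):
            counts[fields[1]] += 1
    return counts


# Made-up rows for illustration only; these are NOT the p53 predictions.
EXAMPLE = """
Gn.Ex Type S .Begin ...End .Len
----- ---- - ------ ------ ----
 1.01 Init +    101    180   80
 1.02 Intr +    301    420  120
 1.03 Intr +    601    690   90
 1.04 Term +    801    900  100
"""

example_counts = count_feature_types(EXAMPLE)
```

Summing the Init, Intr, Term and Sngl counts gives the total number of predicted coding exons, which is the number to compare against the published annotation.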

Now, the sequence was uploaded to the Augustus page and the following output was returned:

This was generated using the “Homo sapiens” setting, “Both” for strand selection, and “few” for the number of alternative transcripts.

Including the “initial” coding region, Augustus returned 11 exons. In addition, Augustus and GENSCAN do not place the exons at the same coordinates: their calls differ by 44 bp. This difference is constant across all the called exons, so perhaps one of the programs is reading the data from the file slightly differently (a formatting issue), which could also be why one program is missing an exon. Additionally, exon predictions can vary considerably between programs because of alternative splice site choices.
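One way to test the idea of a constant coordinate shift is to tally every pairwise difference between exon start positions from the two tools and see whether a single value dominates. The sketch below does that, then removes the shift to see which exons really differ. The coordinates are placeholders chosen to show a 44 bp shift and are not the actual GENSCAN or Augustus calls; the function names are likewise assumptions for this example.

```python
from collections import Counter


def most_common_start_offset(exons_a, exons_b):
    """Return (offset, count) for the most frequent start difference b - a.

    Every exon in A is compared with every exon in B, so the result does not
    depend on both tools calling the same number of exons.
    """
    diffs = Counter(
        b_start - a_start
        for a_start, _ in exons_a
        for b_start, _ in exons_b
    )
    if not diffs:
        return None
    return diffs.most_common(1)[0]


def shift(exons, offset):
    """Move every (start, end) pair by offset."""
    return [(start + offset, end + offset) for start, end in exons]


# Placeholder coordinates for illustration; NOT the real predictions.
tool_a = [(100, 200), (350, 420), (600, 710)]
tool_b = [(144, 244), (394, 464), (644, 754), (900, 980)]

offset, support = most_common_start_offset(tool_a, tool_b)
shared = sorted(set(shift(tool_a, offset)) & set(tool_b))
only_in_b = sorted(set(tool_b) - set(shift(tool_a, offset)))
```

If the dominant offset accounts for nearly all exons, the disagreement is most likely a coordinate-system or input-formatting difference rather than a biological one. Exons left over after the shift, like the single entry in `only_in_b` here, are the genuine differences between the two predictions.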

Comment: Added a tidbit about program choices leading to predicted exon variance.


b2gof15/students/bradbows/class_sessions/2015.10.13.txt · Last modified: 2015/10/19 23:30 by bradbows
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International