Final Project


Until recently, much of the information used in fields of biology such as biochemistry and molecular biology was stored on paper, either in books or in research papers in the peer-reviewed literature. For example, an atlas of protein sequences was published annually to collect all the then-known amino acid sequences (Atlas of Protein Sequence and Structure; Margaret Dayhoff ed.). As a result, information was hard to find, comparisons were difficult, and there was always a significant time lag between the discovery and the publication of knowledge. Laboratory procedures could also be difficult for people outside a given lab to interpret, because shorthand, codes and other local conventions kept an outsider from understanding the text accurately. Paper records were also vulnerable: fires, floods and other disasters made paper recording almost as risky as it was cumbersome.

Thankfully, the development of computers has greatly reduced these problems. New discoveries, techniques and methods can now be stored in databases almost as soon as they arise, and large amounts of data can easily be accessed, compared and analyzed. As bigger, faster computers with more memory are developed, ever larger amounts of data can be analyzed in near real time, more and more previously independent databases, programs and tools are linked together, and running multiple analyses becomes as easy as clicking a few buttons.

Bioinformatics is the application of computer science to solve biological problems. In the beginning, this field was restricted to people with special training in the field of computer science, structural chemistry or advanced mathematics, but now, it has become part of the toolkit for many modern life scientists.

In this project, I explore how different bioinformatics tools can be used to answer biological questions, using human keratin 16 as a reference protein (and other proteins, depending on the tool used). Keratin is the key structural component of the outer layer of human skin, hair and nails. It is also found as a structural component of hooves, teeth and horns. These keratinous structures certainly don’t look or even feel the same: keratin can be flexible, hard or soft, and the texture of skin is different from that of hair or of animal hooves. It is therefore fascinating that they are all built from the same family of proteins. This is why I chose a keratin as my reference model.

Human Keratin Sequence, Alignment and Phylogeny

Databases are useful tools because they allow for easy retrieval of stored information. Many databases exist from which nucleotide sequences, protein sequences or even structures can be obtained. Unique identifiers, such as accession numbers or codes, are assigned to each entry so that the databases stay well organized and data are easy to retrieve. Since DNA and proteins are informational molecules, their ultimate shape, and hence their biological properties, arise largely from their primary structures. Originally, individual databases for proteins and nucleic acids were developed in different parts of the world. Nowadays there are close links between all sequence databases, which make it possible to navigate between them. Examples include the databases hosted by NCBI and the ExPASy portal. Using NCBI, I obtained the partial amino acid sequence of keratin 16 (NCBI accession number AAB30058):


This entry is 244 amino acids long, and more information about it can be found here. After obtaining the amino acid sequence of human keratin, I searched for similar amino acid sequences in other organisms using NCBI BLAST (Basic Local Alignment Search Tool). The result was a long list of similar sequences. I arbitrarily chose sequences from several different organisms, because I thought it would be interesting to see just how similar human keratin is to its counterparts in other organisms. Here you will find the sequences I chose and the organisms they came from.

For the next set of tasks, the importance of a workflow cannot be overemphasized. A workflow is a series of steps carried out in a given order to execute a given task. Workflows ensure a continuous flow of logic from one step of the task to the next. In bioinformatics, there are a number of workflow management systems. The one used here is the BCBB Mobyle Pipeline System. This workflow management system contains various programs used to analyze data, from multiple alignment to building a phylogenetic tree and so on. Usually, in a workflow management system, the output of one process is fed as input into the next process, which makes it easy to analyze data with several different tools in sequence.

Next, multiple sequence alignment was performed. Sequence alignment is a way of arranging sequences of DNA, RNA or protein to show the regions of similarity between them. BLAST is itself a type of sequence alignment tool, since it finds related sequences by aligning each of them locally against the original query sequence and reports measures such as percentage identity. Multiple sequence alignment tools like MUSCLE and CLUSTAL W go further: they align many sequences in one step and insert gaps (shown as dashes) where residues appear to have been inserted or deleted, revealing the pattern of divergence. Their parameters can be altered to suit the purpose of the investigator. The regions of similarity can be investigated further to give information about structural, functional or evolutionary similarities between the organisms the sequences were derived from. Using the BCBB Mobyle Pipeline System platform, I performed multiple sequence alignment of all the above sequences using the MUSCLE program. The MUSCLE output can be seen here.
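To make the idea of aligned columns and gaps concrete, here is a minimal Python sketch that marks fully conserved columns in an alignment that has already been computed, much like the asterisk line printed under CLUSTAL-style output. The three sequences are hypothetical toy data, not the keratin sequences used in this project, and the function name is only illustrative.

```python
def conservation_line(aligned_seqs):
    """Return a string with '*' under every fully conserved, gap-free column."""
    lengths = {len(seq) for seq in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned_seqs):
        # A column counts as conserved only if it has no gap and one residue type.
        conserved = "-" not in column and len(set(column)) == 1
        marks.append("*" if conserved else " ")
    return "".join(marks)


# Hypothetical toy alignment; '-' marks a gap inserted by the aligner.
TOY_ALIGNMENT = {
    "human": "MKVLA-GT",
    "monkey": "MKVLA-GS",
    "fly": "MRIL-QGS",
}

if __name__ == "__main__":
    for name, seq in TOY_ALIGNMENT.items():
        print(f"{name:<8}{seq}")
    print(f"{'':<8}{conservation_line(list(TOY_ALIGNMENT.values()))}")
```

The sketch only reads an existing alignment; deciding where the gaps go is the hard part, and that is what programs such as MUSCLE do.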

This output was fed into the next program, PROTDIST, which uses the sequence alignment to generate a distance matrix. The PROTDIST output file was then fed into the next program in the workflow, NEIGHBOR. NEIGHBOR builds a tree from the distance matrix (using the neighbor-joining or UPGMA method) and writes a tree output file, which is in turn fed into a tree-drawing program such as DRAWTREE or DRAWGRAM to produce the final phylogenetic tree. Here is the PROTDIST output and here is the NEIGHBOR output.
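The distance-matrix step can be illustrated with a small Python sketch. It assumes the simplest possible measure, the p-distance (the fraction of compared positions at which two aligned sequences differ), whereas the distance program in the workflow applies more elaborate amino acid substitution models. The toy alignment and function names are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Fraction of differing residues, ignoring columns where either has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    differences = sum(a != b for a, b in pairs)
    return differences / len(pairs)


def distance_matrix(alignment):
    """Build a symmetric name -> name -> distance table from aligned sequences."""
    names = list(alignment)
    return {
        a: {b: p_distance(alignment[a], alignment[b]) for b in names}
        for a in names
    }


# Hypothetical toy alignment (not the project data).
TOY_ALIGNMENT = {
    "human": "MKVLA-GT",
    "monkey": "MKVLA-GS",
    "fly": "MRIL-QGS",
}

if __name__ == "__main__":
    matrix = distance_matrix(TOY_ALIGNMENT)
    for name, row in matrix.items():
        print(name.ljust(8), " ".join(f"{d:.3f}" for d in row.values()))
```

A tree-building method then takes such a matrix as its only input, which is why the alignment itself is no longer needed after this step.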

Below are the phylogenetic trees produced by DRAWTREE and DRAWGRAM:


From the trees, although unrooted in each case, it can be seen that the keratin protein is conserved across a wide range of species, and that human and monkey keratin are very similar. The human keratin and those of the other organisms in the tree appear to have diverged considerably from that of the fly. Below is a picture of the workflow used to generate the phylogenetic tree:

From this first set of tasks, a relationship has been established between these organisms with regard to similarity in keratin protein. It can therefore be seen that, with a few bioinformatics tools, the question of the relatedness of organisms can be answered fairly easily.

SNPs, Variation and the 1000 Genomes Project

SNP (pronounced “snip”) is short for single nucleotide polymorphism. SNPs are positions in the genome at which different individuals carry different nucleotides. As the name implies, a SNP is a change in a single nucleotide (A, T, C or G), where, for example, T may be substituted for A or G for C. Within a coding region, a SNP may be synonymous, in which case the change in nucleotide does not change the amino acid coded for and so usually does not lead to a significant change in protein structure and function (thanks to the degenerate nature of the genetic code). However, SNPs may also be non-synonymous. In this case the SNP changes the amino acid encoded, which may alter the final structure and/or function of the protein. SNPs are the most common type of genetic variation among people and are often of interest in studying traits such as appearance, disease susceptibility or response to drugs.

The NCBI dbSNP and dbVar databases are useful for identifying polymorphisms and variations at a given locus. The 1000 Genomes Project is one of the most detailed catalogues of human genetic variation. I used these tools to search for polymorphisms and variations in human keratin 16. All three tools identified numerous variations in this keratin. The NCBI SNP database showed a total of 328 SNPs in humans, of which 11 were of pathogenic significance, causing missense mutations. Diseases such as pachyonychia congenita, non-epidermolytic palmoplantar keratoderma and unilateral palmoplantar verrucous nevus have been associated with mutations in the KRT16 gene, found on chromosome 17 in humans. The SNPs observed were not restricted to any particular nucleotide substitution but involved all four bases. The pathogenic SNPs were all of germline origin. For more information on SNPs and variations in keratin 16, visit the 1000 Genomes page by clicking here and the NCBI dbSNP page by clicking here.
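The difference between synonymous and non-synonymous changes can be checked mechanically with the standard genetic code. The following Python sketch is only an illustration: the function name and the example codons are hypothetical and are not taken from the KRT16 records. It further splits non-synonymous changes into missense (a different amino acid) and nonsense (a premature stop codon).

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, new_base):
    """Classify a single-base change at 0-based `position` within a DNA codon."""
    codon, new_base = codon.upper(), new_base.upper()
    if codon[position] == new_base:
        raise ValueError("new base must differ from the reference base")
    mutated = codon[:position] + new_base + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"      # non-synonymous: creates a stop codon
    elif ref_aa == "*":
        kind = "stop-lost"     # non-synonymous: removes a stop codon
    else:
        kind = "missense"      # non-synonymous: different amino acid
    return kind, ref_aa, alt_aa


if __name__ == "__main__":
    print(classify_snp("GAA", 2, "G"))  # GAA -> GAG
    print(classify_snp("GAA", 1, "T"))  # GAA -> GTA
    print(classify_snp("TAC", 2, "A"))  # TAC -> TAA
```

Changes at the third codon position are the ones most often synonymous, which is the degeneracy mentioned above.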

This figure shows the dbSNP result for keratin.

This figure shows the 1000 Genomes result for keratin 16.

Protein Data Bank (PDB)

Three-dimensional structures of biomacromolecules are now principally archived in the Protein Data Bank (PDB) of the Research Collaboratory for Structural Bioinformatics (RCSB), although there are other archives where structures are stored. Structures receive a PDB ID as they are entered into the archive. Apart from structures determined by X-ray crystallography or NMR (Nuclear Magnetic Resonance), the PDB frequently contains variants such as site-directed mutants, proteins crystallized with different ligands, or apoproteins (proteins crystallized without ligands).

The PDB website home page contains a wide range of general information including news, new features, latest structures and learning tools. I entered the keyword “keratin 16” into the search bar and refined the result to “Homo sapiens only” in order to obtain the structure of human keratin 16. Unfortunately, this was unavailable, and the crystal structures I obtained were for keratin 14 instead, with PDB ID 3TNU. More information on this keratin from the PDB can be obtained here. I was able to view the crystal structure of this keratin in the different formats shown below:






Gene Prediction

Usually, after a genome has been sequenced, gene prediction (or gene finding) is the next step in gene annotation, which itself involves attaching pertinent information to the raw DNA sequences. Gene prediction is the process whereby regions of genomic DNA which encode genes are identified. These genes may be protein coding genes, RNA genes or even functional elements like regulatory regions. Gene prediction is generally considered easier in prokaryotes than in eukaryotes. In prokaryotes, the sequence coding for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long, whereas the presence of both introns and exons in eukaryotic genomes makes it harder to predict which regions are coding and which are non-coding.

One method of approaching gene prediction involves investigating codon usage in a reference genome and comparing that to the query sequence. Because the genetic code is redundant but unambiguous (i.e. more than one codon may code for a particular amino acid, but each codon specifies only one amino acid), organisms may have codon bias, using one or more of the redundant codons more than the others. Knowing which codons a group of organisms favors may be helpful in predicting genes in a given sequence. Another method is to check the sequence for regions similar to splice site consensus sequences. These consensus sequences include nearly invariant dinucleotides at each end of the intron: GT at the 5' end and AG at the 3' end. The methods briefly described above are considered ab initio gene prediction methods, because they rely only on signals within the sequence itself; extrinsic evidence is generally still required to conclusively establish that a putative gene is functional.

Markov models are another tool by which coding genes can be predicted, using conditional probability. A Markov model adopts the notion that the probability of future states depends only on the present state. Thus the model allows for dependency between neighboring positions to be taken into account and for probability scores to be attached to open reading frames. More on Markov models and their application in bioinformatics can be found here and here.
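As a rough illustration of how a Markov model can attach a score to a sequence, the Python sketch below trains two first-order Markov chains (each nucleotide depends only on the one before it) and reports a log-odds score between them. The training sequences are tiny hypothetical examples, and all names are illustrative; real gene finders use higher-order, frame-aware models trained on many known genes.

```python
import math

ALPHABET = "ACGT"

# Hypothetical toy training data: one GC-rich "coding" and one AT-rich "background" sequence.
CODING_TRAINING = ["ATGGCGGCGGCGGCCGCGTAA"]
BACKGROUND_TRAINING = ["ATATATTTAATATAAATTTATA"]


def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | previous)."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, row in counts.items():
        total = sum(row.values())
        model[prev] = {nxt: count / total for nxt, count in row.items()}
    return model


def log_likelihood(seq, model):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(model[prev][nxt]) for prev, nxt in zip(seq, seq[1:]))


def log_odds(seq, coding_model, background_model):
    """Positive values mean the sequence looks more like the coding model."""
    return log_likelihood(seq, coding_model) - log_likelihood(seq, background_model)


if __name__ == "__main__":
    coding = train_markov(CODING_TRAINING)
    background = train_markov(BACKGROUND_TRAINING)
    for query in ("GCGGCGGCC", "ATATTTAAT"):
        print(query, round(log_odds(query, coding, background), 2))
```

The pseudocount keeps unseen transitions from getting probability zero, which would otherwise make the logarithm undefined.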

GENSCAN, GeneMark, Augustus and GeneSeqer are useful bioinformatics tools for gene prediction. They implement algorithms that predict coding genes from a given sequence. The human keratin 16 protein was mapped to a larger region of the genome (less than 100 kb) using the NCBI database, and this larger sequence was fed into the gene prediction tools. The GENSCAN output showed 7 predicted peptides while the Augustus output showed 5. Below are the results of both programs.

GENSCAN: Click here for output

AUGUSTUS: Click here for output

Gene Annotation

Gene annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do. An annotation (irrespective of the context) is a note added by way of explanation or commentary. Once a genome is sequenced, it needs to be annotated to make sense of it. These are useful resources for gaining more insight into genome annotation: Resource 1 and Resource 2.

Community genome annotation portals exist on the World Wide Web and allow individuals to annotate, or to revise an existing annotation of, the genome of certain organisms. yrGATE and GEVEAL are community annotation tools whose submitted annotations are curated and corrected through portals such as FlyBase and PlantGDB.

I tried out the yrGATE gene annotation tool using the Oryza sativa genome obtained from PlantGDB.

After clicking on “Annotate New Locus” from the annotation class option on the left options bar, a list of exon coordinates that I could choose from showed up. I chose arbitrarily from the list and, as I chose, the structures appeared (in green) in “Your Structure”.

As the structure appeared, appropriate boxes were filled on the options bar (see the mRNA box circled in red):

I clicked on ORF FINDER (circled in black). A new window opened that looked like this:

After I clicked the “select ORF for annotation” button in the upper left corner, the sequence was entered into the protein sequence box on the yrGATE page. I clicked on the blastp button at the bottom of the box.

The BLAST result showed sequences similar to the protein sequence of my annotation. There was an alignment that was 100% identical to the protein sequence of my annotation. I used the information from this identical protein to fill in the description box of my annotation in yrGATE.

Therefore, with these annotation tools, more information regarding a particular coding sequence can be obtained.
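For readers curious about what an ORF-finding step does in principle, here is a minimal Python sketch. It is not yrGATE's implementation: it scans only the forward strand in the three reading frames and reports stretches that start with ATG and end at the first in-frame stop codon. The example sequence and the min_codons parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs, end exclusive.

    min_codons counts every codon from ATG through the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ATG
    return orfs


if __name__ == "__main__":
    example = "CCATGGCTTGATAGATGAAATAA"  # hypothetical toy sequence
    for start, end, frame in find_orfs(example):
        print(frame, start, end, example[start:end])
```

A complete tool would also scan the reverse complement and, for eukaryotic genes, work on the spliced mRNA rather than on genomic DNA.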

Comparative Genomics

Comparative genomics is a process in which the genomes of different organisms, or of different species and/or strains of organisms, are compared. By comparing finished reference sequences of organisms, researchers can identify regions of similarity and difference. This information can help scientists better understand the structure and function of genes. Comparative genomics also provides a powerful tool for studying evolutionary changes among organisms, helping to identify genes that are conserved among species as well as genes that give each organism its unique characteristics. The genomic features compared may include the DNA sequence, genes, gene order, regulatory sequences and other genomic structural landmarks. The major principle of comparative genomics is that common features of two organisms will often be encoded within the DNA that is evolutionarily conserved between them.

Tools available for such investigation include CoGe, which is short for Comparative Genomics. Using this tool, I decided to compare the genomes of Rhodopseudomonas palustris and Caulobacter crescentus. To do so, I searched for both genomes and added them to CoGe BLAST. This tool requires a query sequence from one of the organisms as a marker. I used the replication protein sequence of R. palustris as my query sequence.

I selected one high scoring segment pair (HSP) for each organism from the HSP table and clicked send to GEvo.

The GEvo result showed conserved regions colored in yellow.

Clicking on each of those regions brought up more information, such as the location of the region (chromosome and position), DNA content, etc. With this tool it can also be seen whether other gene loci in the two organisms are similar.

Gene Expression Analysis and Motif Finding

Gene expression analysis is a process whereby the differential expression of genes across different experimental conditions is analyzed. It is a powerful method for investigating the transcriptional behavior of biological systems, for classifying cell states in disease, and for many other purposes. NCBI GEO and eQTL are gene expression analysis tools.

Using the study on Microarray Analysis of Space-flown Murine Thymus Tissue, I defined the groups as either space flown or control in the GEO analysis tool.

The result can be found here and is visualized below:

Clicking on any of the genes will reveal a graph showing expression values of that gene within the two defined groups.

A box plot showing the value distribution of both groups can also be seen.
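The comparison between two defined groups can be sketched for a single gene in a few lines of Python. This is only an illustration with hypothetical intensity values; it assumes raw (not log-transformed) expression values and uses a log2 fold change with Welch's t statistic, which is not necessarily the statistic the GEO tool applies.

```python
import math
from statistics import mean, variance


def log2_fold_change(treated, control):
    """log2 of the ratio of group means; 1.0 means a two-fold increase."""
    return math.log2(mean(treated) / mean(control))


def welch_t(treated, control):
    """Welch's t statistic for two groups with possibly unequal variances."""
    standard_error = math.sqrt(
        variance(treated) / len(treated) + variance(control) / len(control)
    )
    return (mean(treated) - mean(control)) / standard_error


# Hypothetical expression values for one gene (arbitrary intensity units).
SPACE_FLOWN = [200.0, 220.0, 180.0, 200.0]
CONTROL = [100.0, 110.0, 90.0, 100.0]

if __name__ == "__main__":
    print("log2 fold change:", log2_fold_change(SPACE_FLOWN, CONTROL))
    print("Welch t statistic:", round(welch_t(SPACE_FLOWN, CONTROL), 2))
```

In a real analysis this calculation is repeated for every gene on the array, and the resulting p-values are corrected for multiple testing.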

A motif is a nucleotide or amino acid sequence pattern that is widespread and is conjectured to have biological significance. Motifs are important for understanding gene function, human disease, drug design, etc., and are helpful in finding transcriptional regulatory elements, transcription factor binding sites, and so on. Tools used to find motif sequences include XXmotif and MOTIF.

The MOTIF tool, run against the Pfam database with the keratin 16 sequence as the query, reported 6 motifs.
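A very simple kind of motif search can be written as a regular expression. The Python sketch below looks for the N-glycosylation pattern N-{P}-[ST]-{P} (PROSITE PS00001) in a hypothetical protein sequence. Note that this is a simplification: Pfam entries are profile hidden Markov models rather than fixed patterns, so the MOTIF search described above works differently.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} as a regular expression.
# The lookahead lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence, pattern=N_GLYCOSYLATION):
    """Return (1-based position, matched residues) for every occurrence."""
    return [
        (match.start() + 1, match.group(1))
        for match in pattern.finditer(sequence.upper())
    ]


if __name__ == "__main__":
    toy_protein = "MNGTAANPSANKSW"  # hypothetical sequence, not keratin 16
    for position, residues in find_motif(toy_protein):
        print(position, residues)
```

Short patterns like this one match by chance fairly often, so a hit suggests a candidate site rather than proving a function.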


Bioinformatics will remain relevant in answering biological questions. Integration of data, ease of access and analysis through tools and workflow management systems, and faster and better options for data storage and review are just some of the benefits of this field. As technology continues to advance, the significance of this field is expected to increase to the point where no biological scientist will be able to avoid using it.

VB Comment: Good work.

b2gof14/students/monyezir/final_project.txt · Last modified: 2014/12/18 11:03 by vbrendel
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International