Kasun's Wiki Page for Bioinformatics2go

This is Kasun's Wiki Page for Bioinformatics2go Class

Recommended reading materials: Bioinformatics Data Skills by Vince Buffalo

1. Introduction to WikiBCB and NCBI Resources

How to maintain a WikiBCB account?

Main steps:

  1. Log in with your user name and password
  2. Go to your WikiBCB page (if you already have one) or “Create a New Page” for yourself
  3. Important facts:
    • Type the topic title between six = signs
    • Anything placed between 2* and 2* appears in bold
    • Create hyperlinks to link other web pages

NCBI (National Center for Biotechnology Information)

NCBI is one of the most useful web platforms providing access to biomedical and genomic information. A large number of algorithms and tools have been developed at NCBI (Analysis Tools), and the scientific community uses them routinely to advance and elaborate its findings.

One analysis tool I found interesting is VAST, short for Vector Alignment Search Tool. This algorithm was developed at NCBI and is widely used to identify similar three-dimensional protein structures, which enables the identification of distant homologs that cannot be recognized by sequence comparison.

2. Molecular Phylogeny

Proteins are the functional units of life. A fundamental belief among molecular biologists is that all proteins are related in one way or another. Molecular phylogeny is a field that has grown out of this belief. Now we are learning more about molecular phylogeny!

When a protein sequence is newly identified, the next task is to find its identity. To do so, a database search can be run with the protein sequence, and NCBI BLAST (Basic Local Alignment Search Tool) is one of the most popular tools for this. BLAST results can be sorted by criteria such as E value and % identity. A BLAST search therefore returns a list of protein sequences that show sequence similarity to the query. By aligning the top few sequences (Clustal W) and then generating a phylogenetic tree from the alignment, phylogenetic relationships can also be uncovered.

Let's see how I applied this approach to the following protein sequence (this is also Kasun's answer for the assignment):

Sequence of a newly sequenced protein is given below.

MNPRQGYSLSGYYTHPFQGYEHRQLRYQQPGPGSSPSSFLLKQIEFLKGQLPEAPVIGKQTPSLPPSLPG LRPRFPVLLASSTRGRQVDIRGVPRGVHLGSQGLQRGFQHPSPRGRSLPQRGVDCLSSHFQELSIYQDQE QRILKF

A BLAST search returned a list of proteins. From the first hit, which has an E value of 0.0 and a query coverage of 100%, the protein was identified as ADAR1 (Adenosine Deaminase Acting on RNA). A series of other closely related sequences was chosen; they are shown below.

>Seq1
MNPRQGYSLSGYYTHPFQGYEHRQLRYQQPGPGSSPSSFLLKQIEFLKGQLPEAPVIGKQTPSLPPSLPG
LRPRFPVLLASSTRGRQVDIRGVPRGVHLGSQGLQRGFQHPSPRGRSLPQRGVDCLSSHFQELSIYQDQE
QRILKF

>Seq2
MNPRQGYSLSGYYTHPFQGYEHRQLRYQQPGPGSSPSSFLLKQIEFLKGQLPEAPVIGKQTPSLPPSLPG
LRPRFPVLLASSTRGRQVDIRGVPRGVHLGSQGLQRGFQHPSPRGRSLPQRGVDCLSSHFQELSIYQDQE
QRILKFLEELGEGKATTAHDLSGKLGTPKKEINRVLYSLAKKGKLQKEAGTPPLWKIAVSTQAWNQHSGV
VRPDGHSQGAPNSDPSLEPEDRNSTSVSEDLLEPFIAVSAQAWNQHSGVVRPDSHSQGSPNSDPGLEPED
SNSTSALEDPLEFLDMAEIKEKICDYLFNVSDSSALNLAKNIGLTKARDINAVLIDMERQGDVYRQGTTP
PIWHLTDKKRERMQIKRNTNSVPETAPAAIPETRRNAEFLTCNIPTSNASNNMVTTEKVENGQEPVIKLE
NRQEARPEPARLKPPVHYNGPSKAGYVDFENGQWATDDIPDDLNSIRAAPGEFRAIMEMPSFYSHGLPRC
SPYKKLTECQLKNPISGLLEYAQFASQTCEFNMIEQSGPPHEPRFKFQVVINGREFPPAEAGSKKVAKQD
AAMKAMTILLEEAKAKDSGKSEESSHYSTEKESEKTAESQTPTPSATSFFSGKSPVTTLLECMHKLGNSC
EFRLLSKEGPAHEPKFQYCVAVGAQTFPSVSAPSKKVAKQMAAEEAMKALHGEATNSMASDNQPEGMISE
SLDNLESMMPNKVRKIGELVRYLNTNPVGGLLEYARSHGFAAEFKLVDQSGPPHEPKFVYQAKVGGRWFP
AVCAHSKKQGKQEAADAALRVLIGENEKAERMGFTEVTPVTGASLRRTMLLLSRSPEAQPKTLPLTGSTF
HDQIAMLSHRCFNTLTNSFQPSLLGRKILAAIIMKKDSEDMGVVVSLGTGNRCVKGDSLSLKGETVNDCH
AEIISRRGFIRFLYSELMKYNSQTAKDSIFEPAKGGEKLQKEHFSSAHCYLHFLCPVFCF

>Seq3
MNPRQGYSLSGYYTHPFQGYEHRQLRYQQPGPGSSPSSFLLKQIEFLKGQLPEAPVIGKQTPSLPPSLPG
LRPRFPVLLASSTRGRQVDIRGVPRGVHLGSQGLQRGFQHPSPRGRSLPQRGVDCLSSHFQELSIYQDQE
QRILKFLEELGEGKATTAHDLSGKLGTPKKEINRVLYSLAKKGKLQKEAGTPPLWKIAVSTQAWNQHSGV
VRPDGHSQGAPNSDPSLEPEDRNSTSVSEDLLEPFIAVSAQAWNQHSGVVRPDSHSQGSPNSDPGLEPED
SNSTSALEDPLEFLDMAEIKEKICDYLFNVSDSSALNLAKNIGLTKARDINAVLIDMERQGDVYRQGTTP
PIWHLTDKKRERMQIKRNTNSVPETAPAAIPETKRNAEFLTCNIPTSNASNNMVTTEKVENGQEPVIKLE
NRQEARPEPARLKPPVHYNGPSKAGYVDFENGQWATDDIPDDLNSIRAAPGEFRAIMEMPSFYSHGLPRC
SPYKKLTECQLKNPISGLLEYAQFASQTCEFNMIEQSGPPHEPRFKFQVVINGREFPPAEAGSKKVAKQD
AAMKAMTILLEEAKAKDSGKSEESSHYSTEKESEKTAESQTPTPSATSFFSGKSPVTTLLECMHKLGNSC
EFRLLSKEGPAHEPKFQYCVAVGAQTFPSVSAPSKKVAKQMAAEEAMKALHGEATNSMASDNQPEGMISE
SLDNLESMMPNKVRKIGELVRYLNTNPVGGLLEYARSHGFAAEFKLVDQSGPPHEPKFVYQAKVGGRWFP
AVCAHSKKQGKQEAADAALRVLIGENEKAERMGFTEVTPVTGASLRRTMLLLSRSPEAQPKTLPLTGSTF
HDQIAMLSHRCFNTLTNSFQPSLLGRKILAAIIMKKDSEDMGVVVSLGTGNRCVKGDSLSLKGETVNDCH
AEIISRRGFIRFLYSELMKYNSQTAKDSIFEPAKGGEKLQIKKTVSFHLYISTAPCGDGALFDKSCSDRA
MESTESRHYPVFENPKQGKLRTKVENGEGTIPVESSDIVPTWDGIRLGERLRTMSCSDKILRWNVLGLQG
ALLTHFLQPIYLKSVTLGYLFSQGHLTRAICCRVTRDGSAFEDGLRHPFIVNHPKVGRVSIYDSKRQSGK
TKETSVNWCLADGYDLEILDGTRGTVDGPRNELSRVSKKNIFLLFKKLCSFRYRRDLLRLSYGEAKKAAR
DYETAKNYFKKGLKDMGYGNWISKPQEEKNFYLCPV

>Seq4
MNPRQGYSLSGYYTHPFQGYEHRQLRYQQPGPGSSPSSFLLKQIEFLKGQLPEAPVIGKQTPSLPPSLPG
LRPRFPVLLASSTRGRQVDIRGVPRGVHLGSQGLQRGFQHPSPRGRSLPQRGVDCLSSHFQELSIYQDQE
QRILKFLEELGEGKATTAHDLSGKLGTPKKEINRVLYSLAKKGKLQKEAGTPPLWKIAVSTQAWNQHSGV
VRPDGHSQGAPNSDPSLEPEDRNSTSVSEDLLEPFIAVSAQAWNQHSGVVRPDSHSQGSPNSDPGLEPED
SNSTSALEDPLEFLDMAEIKEKICDYLFNVSDSSALNLAKNIGLTKARDINAVLIDMERQGDVYRQGTTP
PIWHLTDKKRERMQIKRNTNSVPETAPAAIPETRRNAEFLTCNIPTSNASNNMVTTEKVENGQEPVIKLE
NRQEARPEPARLKPPVHYNGPSKAGYVDFENGQWATDDIPDDLNSIRAAPGEFRAIMEMPSFYSHGLPRC
SPYKKLTECQLKNPISGLLEYAQFASQTCEFNMIEQSGPPHEPRFKFQVVINGREFPPAEAGSKKVAKQD
AAMKAMTILLEEAKAKDSGKSEESSHYSTEKESEKTAESQTPTPSATSFFSGKSPVTTLLECMHKLGNSC
EFRLLSKEGPAHEPKFQYCVAVGAQTFPSVSAPSKKVAKQMAAEEAMKALHGEATNSMASDNQPEGMISE
SLDNLESMMPNKVRKIGELVRYLNTNPVGGLLEYARSHGFAAEFKLVDQSGPPHEPKFVYQAKVGGRWFP
AVCAHSKKQGKQEAADAALRVLIGENEKAERMGFTELPLTGSTFHDQIAMLSHRCFNTLTNSFQPSLLGR
KILAAIIMKKDSEDMGVVVSLGTGNRCVKGDSLSLKGETVNDCHAEIISRRGFIRFLYSELMKYNSQTAK
DSIFEPAKGGEKLQIKKTVSFHLYISTAPCGDGALFDKSCSDRAMESTESRHYPVFENPKQGKLRTKVEN
GEGTIPVESSDIVPTWDGIRLGERLRTMSCSDKILRWNVLGLQGALLTHFLQPIYLKSVTLGYLFSQGHL
TRAICCRVTRDGSAFEDGLRHPFIVNHPKVGRVSIYDSKRQSGKTKETSVNWCLADGYDLEILDGTRGTV
DGPRNELSRVSKKNIFLLFKKLCSFRYRRDLLRLSYGEAKKAARDYETAKNYFKKGLKDMGYGNWISKPQ
EEKNFYLCPV

>Seq5
MNPRQGYSLSGYYTHPFQGYEHRQLRYQQPGPGSSPSSFLLKQIEFLKGQLPEAPVIGKQTPSLPPSLPG
LRPRFPVLLASSTRGRQVDIRGVPRGVHLGSQGLQRGFQHPSPRGRSLPQRGVDCLSSHFQELSIYQDQE
QRILKFLEELGEGKATTAHDLSGKLGTPKKEINRVLYSLAKKGKLQKEAGTPPLWKIAVSTQAWNQHSGV
VRPDGHSQGAPNSDPSLEPEDRNSTSVSEDLLEPFIAVSAQAWNQHSGVVRPDSHSQGSPNSDPGLEPED
SNSTSALEDPLEFLDMAEIKEKICDYLFNVSDSSALNLAKNIGLTKARDINAVLIDMERQGDVYRQGTTP
PIWHLTDKKRERMQIKRNTNSVPETAPAAIPETKRNAEFLTCNIPTSNASNNMVTTEKVENGQEPVIKLE
NRQEARPEPARLKPPVHYNGPSKAGYVDFENGQWATDDIPDDLNSIRAAPGEFRAIMEMPSFYSHGLPRC
SPYKKLTECQLKNPISGLLEYAQFASQTCEFNMIEQSGPPHEPRFKFQVVINGREFPPAEAGSKKVAKQD
AAMKAMTILLEEAKAKDSGKSEESSHYSTEKESEKTAESQTPTPSATSFFSGKSPVTTLLECMHKLGNSC
EFRLLSKEGPAHEPKFQYCVAVGAQTFPSVSAPSKKVAKQMAAEEAMKALHGEATNSMASDNQPEGMISE
SLDNLESMMPNKVRKIGELVRYLNTNPVGGLLEYARSHGFAAEFKLVDQSGPPHEPKFVYQAKVGGRWFP
AVCAHSKKQGKQEAADAALRVLIGENEKAERMGFTELPLTGSTFHDQIAMLSHRCFNTLTNSFQPSLLGR
KILAAIIMKKDSEDMGVVVSLGTGNRCVKGDSLSLKGETVNDCHAEIISRRGFIRFLYSELMKYNSQTAK
DSIFEPAKGGEKLQIKKTVSFHLYISTAPCGDGALFDKSCSDRAMESTESRHYPVFENPKQGKLRTKVEN
GEGTIPVESSDIVPTWDGIRLGERLRTMSCSDKILRWNVLGLQGALLTHFLQPIYLKSVTLGYLFSQGHL
TRAICCRVTRDGSAFEDGLRHPFIVNHPKVGRVSIYDSKRQSGKTKETSVNWCLADGYDLEILDGTRGTV
DGPRNELSRVSKKNIFLLFKKLCSFRYRRDLLRLSYGEAKKAARDYETAKNYFKKGLKDMGYGNWISKPQ
EEKNFYLCPV

>Seq6
MNPRQGYSLSGYYTHPFQGYEHRQLRYQQPGPGSSPSSFLLKQIEFLKGQLPEAPVIGKQTPSLPPSLPG
LRPRFPVLLASSTRGRQVDIRGVPRGVHLGSQGLQRGFQHPSPRGRSLPQRGVDCLSSHFQELSIYQDQE
QRILKFLEELGEGKATTAHDLSGKLGTPKKEINRVLYSLAKKGKLQKEAGTPPLWKIAVSTQAWNQHSGV
VRPDGHSQGAPNSDPSLEPEDRNSTSVSEDLLEPFIAVSAQAWNQHSGVVRPDSHSQGSPNSDPGLEPED
SNSTSALEDPLEFLDMAEIKEKICDYLFNVSDSSALNLAKNIGLTKARDINAVLIDMERQGDVYRQGTTP
PIWHLTDKKRERMQIKRNTNSVPETAPAAIPETRRNAEFLTCNIPTSNASNNMVTTEKVENGQEPVIKLE
NRQEARPEPARLKPPVHYNGPSKAGYVDFENGQWATDDIPDDLNSIRAAPGEFRAIMEMPSFYSHGLPRC
SPYKKLTECQLKNPISGLLEYAQFASQTCEFNMIEQSGPPHEPRFKFQVVINGREFPPAEAGSKKVAKQD
AAMKAMTILLEEAKAKDSGKSEESSHYSTEKESEKTAESQTPTPSATSFFSGKSPVTTLLECMHKLGNSC
EFRLLSKEGPAHEPKFQYCVAVGAQTFPSVSAPSKKVAKQMAAEEAMKALHGEATNSMASDNQVRKIGEL
VRYLNTNPVGGLLEYARSHGFAAEFKLVDQSGPPHEPKFVYQAKVGGRWFPAVCAHSKKQGKQEAADAAL
RVLIGENEKAERMGFTELPLTGSTFHDQIAMLSHRCFNTLTNSFQPSLLGRKILAAIIMKKDSEDMGVVV
SLGTGNRCVKGDSLSLKGETVNDCHAEIISRRGFIRFLYSELMKYNSQTAKDSIFEPAKGGEKLQIKKTV
SFHLYISTAPCGDGALFDKSCSDRAMESTESRHYPVFENPKQGKLRTKVENGEGTIPVESSDIVPTWDGI
RLGERLRTMSCSDKILRWNVLGLQGALLTHFLQPIYLKSVTLGYLFSQGHLTRAICCRVTRDGSAFEDGL
RHPFIVNHPKVGRVSIYDSKRQSGKTKETSVNWCLADGYDLEILDGTRGTVDGPRNELSRVSKKNIFLLF
KKLCSFRYRRDLLRLSYGEAKKAARDYETAKNYFKKGLKDMGYGNWISKPQEEKNFYLCPV

>Seq7
MNPRQGYSLSGYYTHPFQGYEHRQLRYQQPGPGSSPSSFLLKQIEFLKGQLPEAPVIGKQTPSLPPSLPG
LRPRFPVLLASSTRGRQVDIRGVPRGVHLGSQGLQRGFQHPSPRGRSLPQRGVDCLSSHFQELSIYQDQE
QRILKFLEELGEGKATTAHDLSGKLGTPKKEINRVLYSLAKKGKLQKEAGTPPLWKIAVSTQAWNQHSGV
VRPDGHSQGAPNSDPSLEPEDRNSTSVSEDLLEPFIAVSAQAWNQHSGVVRPDSHSQGSPNSDPGLEPED
SNSTSALEDPLEFLDMAEIKEKICDYLFNVSDSSALNLAKNIGLTKARDINAVLIDMERQGDVYRQGTTP
PIWHLTDKKRERMQIKRNTNSVPETAPAAIPETKRNAEFLTCNIPTSNASNNMVTTEKVENGQEPVIKLE
NRQEARPEPARLKPPVHYNGPSKAGYVDFENGQWATDDIPDDLNSIRAAPGEFRAIMEMPSFYSHGLPRC
SPYKKLTECQLKNPISGLLEYAQFASQTCEFNMIEQSGPPHEPRFKFQVVINGREFPPAEAGSKKVAKQD
AAMKAMTILLEEAKAKDSGKSEESSHYSTEKESEKTAESQTPTPSATSFFSGKSPVTTLLECMHKLGNSC
EFRLLSKEGPAHEPKFQYCVAVGAQTFPSVSAPSKKVAKQMAAEEAMKALHGEATNSMASDNQVRKIGEL
VRYLNTNPVGGLLEYARSHGFAAEFKLVDQSGPPHEPKFVYQAKVGGRWFPAVCAHSKKQGKQEAADAAL
RVLIGENEKAERMGFTELPLTGSTFHDQIAMLSHRCFNTLTNSFQPSLLGRKILAAIIMKKDSEDMGVVV
SLGTGNRCVKGDSLSLKGETVNDCHAEIISRRGFIRFLYSELMKYNSQTAKDSIFEPAKGGEKLQIKKTV
SFHLYISTAPCGDGALFDKSCSDRAMESTESRHYPVFENPKQGKLRTKVENGEGTIPVESSDIVPTWDGI
RLGERLRTMSCSDKILRWNVLGLQGALLTHFLQPIYLKSVTLGYLFSQGHLTRAICCRVTRDGSAFEDGL
RHPFIVNHPKVGRVSIYDSKRQSGKTKETSVNWCLADGYDLEILDGTRGTVDGPRNELSRVSKKNIFLLF
KKLCSFRYRRDLLRLSYGEAKKAARDYETAKNYFKKGLKDMGYGNWISKPQEEKNFYLCPV

>Seq8
MNPRQGYSLSGYYTHPFQGYEHRQLRYQQPGPGSSPSSFLLKQIEFLKGQLPEAPVIGKQTPSLPPSLPG
LRPRFPVLLASSTRGRQVDIRGVPRGVHLGSQGLQRGFQHPSPRGRSLPQRGVDCLSSHFQELSIYQDQE
QRILKFLEELGEGKATTAHDLSGKLGTPKKEINRVLYSLAKKGKLQKEAGTPPLWKIAVSTQAWNQHSGV
VRPDGHSQGAPNSDPSLEPEDRNSTSVSEDLLEPFIAVSAQAWNQHSRVVRPDSHSQGSPNSDPGLEPED
SNSTSALEDPLEFLDMAEIKEKICDYLFNVSDSSALNLAKNIGLTKARDINAVLIDMERQGDVYRQGTTP
PIWHSTDKKRERMQIKRNTNSVPETAPAAIPETKRNAEFLTCNIPTSNASNNMVTTEKVENGQEPVIKLE
NRQEARPEPARLKPPVHYNGPSKAGYVDFENGQWATDDIPDDLNSIRAAPGEFRAIMEMPSFYSHGLPRC
SPYKKLTECQLKNPVSGLLEYAQFASQTCEFNMIEQSGPPHEPRFKFQVVINGREFPPAEAGSKKVAKQD
AAMKAMTILLEEAKAKDSGKSEESSHYSTEKESEKTAESQTPTPSATSFFSGKSPVTTLLECMHKLGNSC
EFRLLSKEGPAHEPKFQYCVAVGAQTFPSVSAPSKKVAKQMAAEEAMKALHGEATNSMASDNQPEGMISE
SLDNLESMMPDKVRKIGELVRYLNTNPVGGLLEYARSHGFAAEFKLVDQSGPPHEPKFVYQAKVGGRWFP
AVCAHSKKQGKQEAADAALRVLIGENEKAERMGFTEVTPVTGASLRRTMLLLSRSPEAQPKTLPLTGSTF
HDQIAMLSHRCFDTLTNSFQPSLLGRKILAAIIMKKDSEDMGVVVSLGTGNRCVKGDSLSLKGETVNDCH
AEIISRRGFIRFLYSELMKYNSQTAKDSIFEPAKGGEKLQIKKAVSFHLYISTAPCGDGALFDKSCSDRA
MESTESRHYPVFENPKQGKLRTKVENGEGTIPVESSDIVPTWDGIRLGERLRTMSCSDKILRWNVLGLQG
ALLTHFLQPIYLKSVTLGYLFSQGHLTRAICCRVTRDGSAFEDGLRHPFIVNHPKVGRVSIYDSKRQSGK
TKETSVNWCLADGYDLEILDGTRGTVDGPRNELSRVSKKNIFLLFKKLCSFRYRRDLLRLSYGEAKKAAR
DYETAKNYFKKGLKDMGYGNWISKPQEEKNFYLCPV

>Seq9
MNPRQGYSLSGYYTHPFQGYEHRQLRYQQPGPGSSPSSFLLKQIEFLKGQLPEAPVIGKQTPSLPPSLPG
LRPRFPVLLASSTRGRQVDIRGVPRGVHLGSQGLQRGFQHPSPRGRSLPQRGVDCLSSHFQELSIYQDQE
QRILKFLEELGEGKATTAHDLSGKLGTPKKEINRVLYSLAKKGKLQKEAGTPPLWKIAVSTQAWNQHSGV
VRPDGHSQGAPNSDPSLEPEDRNSTSVSEDLLEPFIAVSAQAWNQHSGVVRPDSHSQGSPNSDPGLEPED
SNSTSALEDPLEFLDMAEIKEKICDYLFNVSDSSALNLAKNIGLTKARDINAVLIDMERQGDVYRQGTTP
PIWHLTDKKRERMQIKRNTNSVPETAPAAIPETRRNAEFLTCNIPTSNASNNMVTTEKVENGQEPVIKLE
NRQEARPEPARLKPPVHYNGPSKAGYVDFENGQWATDDIPDDLNSIRAAPGEFRAIMEMPSFYSHGLPRC
SPYKKLTECQLKNPISGLLEYAQFASQTCEFNMIEQSGPPHEPRFKFQVVINGREFPPAEAGSKKVAKQD
AAMKAMTILLEEAKAKDSGKSEESSHYSTEKESEKTAESQTPTPSATSFFSGKSPVTTLLECMHKLGNSC
EFRLLSKEGPAHEPKFQYCVAVGAQTFPSVSAPSKKVAKQMAAEEAMKALHGEATNSMASDNQPEGMISE
SLDNLESMMPNKVRKIGELVRYLNTNPVGGLLEYARSHGFAAEFKLVDQSGPPHEPKFVYQAKVGGRWFP
AVCAHSKKQGKQEAADAALRVLIGENEKAERMGFTEVTPVTGASLRRTMLLLSRSPEAQPKTLPLTGSTF
HDQIAMLSHRCFNTLTNSFQPSLLGRKILAAIIMKKDSEDMGVVVSLGTGNRCVKGDSLSLKGETVNDCH
AEIISRRGFIRFLYSELMKYNSQTAKDSIFEPAKGGEKLQIKKTVSFHLYISTAPCGDGALFDKSCSDRA
MESTESRHYPVFENPKQGKLRTKVENGEGTIPVESSDIVPTWDGIRLGERLRTMSCSDKILRWNVLGLQG
ALLTHFLQPIYLKSVTLGYLFSQGHLTRAICCRVTRDGSAFEDGLRHPFIVNHPKVGRVSIYDSKRQSGK
TKETSVNWCLADGYDLEILDGTRGTVDGPRNELSRVSKKNIFLLFKKLCSFRYRRDLLRLSYGEAKKAAR
DYETAKNYFKKGLKDMGYGNWISKPQEEKNFYLCPV

>Seq10
MNPRQGYSLSGYYTHPFQGYEHRQLRYQQPGPGSSPSSFLLKQIEFLKGQLPEAPVIGKQTPSLPPSLPG
LRPRFPVLLASSTRGRQVDIRGVPRGVHLRSQGLQRGFQHPSPRGRSLPQRGVDCLSSHFQELSIYQDQE
QRILKFLEELGEGKATTAHDLSGKLGTPKKEINRVLYSLAKKGKLQKEAGTPPLWKIAVSTQAWNQHSGV
VRPDGHSQGAPNSDPSLEPEDRNSTSVSEDLLEPFIAVSAQAWNQHSGVVRPDSHSQGSPNSDPGLEPED
SNSTSALEDPLEFLDMAEIKEKICDYLFNVSDSSALNLAKNIGLTKARDINAVLIDMERQGDVYRQGTTP
PIWHLTDKKRERMQIKRNTNSVPETAPAAIPETKRNAEFLTCNIPTSNASNNMVTTEKVENGQEPVIKLE
NRQEARPEPARLKPPVHYNGPSKAGYVDFENGQWATDDIPDDLNSIRAAPGEFRAIMEMPSFYSHGLPRC
SPYKKLTECQLKNPISGLLEYAQFASQTCEFNMIEQSGPPHEPRFKFQVVINGREFPPAEAGSKKVAKQD
AAMKAMTILLEEAKAKDSGKSEESSHYSTEKESEKTAESQTPTPSATSFFSGKSPVTTLLECMHKLGNSC
EFRLLSKEGPAHEPKFQYCVAVGAQTFPSVSAPSKKVAKQMAAEEAMKALHGEATNSMASDNQPEGMISE
SLDNLESMMPNKVRKIGELVRYLNTNPVGGLLEYARSHGFAAEFKLVDQSGPPHEPKFVYQAKVGGRWFP
AVCAHSKKQGKQEAADAALRVLIGENEKAERMGFTEVTPVTGASLRRTMLLLSRSPEAQPKTLPLTGSTF
HDQIAMLSHRCFNTLTNSFQPSLLGRKILAAIIMKKDSEDMGVVVSLGTGNRCVKGDSLSLKGETVNDCH
AEIISRRGFIRFLYSELMKYNSQTAKDSIFEPAKGGEKLQIKKTVSFHLYISTAPCGDGALFDKSCSDRA
MESTESRHYPVFENPKQGKLRTKVENGEGTIPVESSDIVPTWDGIRLGERLRTMSCSDKILRWNVLGLQG
ALLTHFLQPIYLKSVTLGYLFSQGHLTRAICCRVTRDGSAFEDGLRHPFIVNHPKVGRVSIYDSKRQSGK
TKETSVNWCLADGYDLEILDGTRGTVDGPRNELSRVSKKNIFLLFKKLCSFRYRRDLLRLSYGEAKKAAR
DYETAKNYFKKGLKDMGYGNWISKPQEEKNFYLCPV

As mentioned above, a multiple sequence alignment was done using Clustal W. Then, to observe the phylogenetic relationships among these sequences, the following pipeline was generated on BCBB.

The same tree was displayed in three different ways. The resulting phylogenetic trees are as follows:

plotfile_1_.pdf

plotfile.pdf

tree_1_.pdf
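
Distance-based tree methods start from pairwise distances computed on a multiple sequence alignment such as the Clustal W one described above. The following is a minimal sketch in Python of that first step, not the pipeline actually used here. It assumes the sequences are already aligned to equal length; the helper names parse_fasta and p_distance and the toy records are hypothetical and are not the ADAR1 sequences listed above.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


# Toy aligned records (hypothetical data for illustration only).
TOY_ALIGNMENT = """>ToyA
MNPRQGYSLS
>ToyB
MNPRQGYSLT
>ToyC
MNPKQG-SLT
"""

aligned = parse_fasta(TOY_ALIGNMENT)
names = sorted(aligned)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        distance = p_distance(aligned[first], aligned[second])
        print(first, second, round(distance, 3))
```

A matrix of such distances is the kind of input that a neighbor-joining or UPGMA step would then turn into a tree.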

3. Genetic Variation

NCBI hosts two main databases of genetic variation: dbSNP and dbVar. Variations are caused by mutations. Within a coding sequence, a mutation can be synonymous or non-synonymous. Synonymous mutations do not change the encoded amino acid and so usually have no effect on the protein product. Non-synonymous mutations change the amino acid (missense) or introduce a premature stop codon (nonsense), and the latter often leads to a truncated protein or no protein at all. Frameshifts, by contrast, are caused by insertions or deletions rather than by single-base substitutions. SNPs (Single Nucleotide Polymorphisms) can give rise to synonymous or non-synonymous mutations. GWAS (Genome-Wide Association Studies) therefore focus mainly on identifying SNPs that have clinical relevance.

One such clinically important gene in humans is WDR19. This gene encodes a member of the WD repeat protein family. The resulting protein is involved in a wide variety of cellular processes, including signal transduction, cell cycle progression, apoptosis and gene regulation. The gene is conserved from C. elegans to Homo sapiens (humans) and is located on the fourth chromosome. NCBI dbSNP reports that this gene carries cSNPs that can lead to missense mutations changing the amino acid Arg (R) to Gln (Q).
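
The difference between synonymous and non-synonymous changes can be checked against the standard genetic code. The following is a minimal sketch in Python; the function name classify_substitution is hypothetical, and the example codons are illustrative choices of Arg and Gln codons, not the actual WDR19 codons recorded in dbSNP.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


# CGA encodes Arg; each alternative differs from it by one nucleotide.
for alt in ("CGG", "CAA", "TGA"):
    print("CGA ->", alt, CODON_TABLE["CGA"], "->", CODON_TABLE[alt],
          classify_substitution("CGA", alt))
```

In this sketch a G to A change at the second codon position turns an Arg codon (CGA) into a Gln codon (CAA), which is a missense change.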

The 1000 Genomes Browser provides insight into the population diversity of a genetic variation of interest.

Using this valuable web platform, one can examine a genetic variation of interest across different populations.

4. Protein Structure

The human genome comprises ~20000 genes, many of which give rise to a functional protein. From a proteomics perspective, identifying these proteins and their roles in different cellular processes is important. Last Tuesday we discussed SNPs and their importance in disease conditions and drug tolerance, which matters greatly when studying genetic variation in populations. Similarly, if you know the sequence of a protein you can find its identity by running a protein BLAST (pBLAST), and the natural next step is to identify its structure. Protein structures are solved in many ways. One popular method is X-ray crystallography. The use of MS and NMR has made structure elucidation faster and broadened the range of proteins whose structures can be studied, whereas X-ray crystallography can be used only for the subset of proteins that can be crystallized. Databases have been developed to store these resolved structures; one such protein structure database is the PDB. The well-known protein sequence databases, for example NCBI GenPept, UniProtKB and ExPASy, together with the PDB contribute greatly to the development of the field of proteomics.

As I showed previously, SNPs in the apoptotic protein WDR19 are involved in many pathophysiological conditions (as evidenced by PubMed). Although the mRNA and protein sequences of WDR19 are known, the PDB has no record of its 3D structure. BioGRID 3.4 reports 12 main binding partners of WDR19, and their associations are represented by the protein-protein interaction database STRING as follows:

UniProtKB provides an extensive description of WDR19. Even though the PDB has no experimentally solved 3D structure for the WDR19 protein, UniProtKB provides evidence of existing 3D models. One database that offers a way to find existing models for a protein is PMP. It provides access to models computed by comparative modeling methods at different partner sites such as Swiss-Model and ModBase. The ModBase tutorials help in learning how to model a protein structure; the procedure is outlined in the following figure.

PMP provides six existing models derived from various templates and partner sites. In addition, the Swiss-Model web platform can be used to model any protein structure. The following are three possible 3D structures for WDR19 derived from the above database, which also provides a detailed Homology Modeling Report. However, all of these models have very low sequence identity to the target sequence (less than 15%).

Model 01 Model 02 Model 03

These three structures were generated from the template 5a1u.1. The PDB files generated for the three models can be found at the following links: Model 01, Model 02, Model 03

According to NCBI dbSNP, WDR19 carries SNPs in different coding sequences. In the normal population, amino acid positions 1018, 1022, 1160, 1182 and 1178 carry an arginine (R) residue, whereas a non-synonymous (missense) nucleotide change replaces it with a glutamine (Q). How can this affect the protein structure? To investigate this question, the protein sequence was remodeled using Swiss-Model (position 1018 only).

However, according to UniProtKB, WDR19 has two major isoforms. The models above correspond to the 1342-amino-acid isoform 1. Models for the 437-amino-acid isoform 2 are shown below; they were derived using Swiss-Model in the same way as for isoform 1. These models were built on the templates 3shf.1 (for Model 01 and 02) and 4nsx.1 (for Model 03).

Once a 3D structure is resolved, it is important to identify any PTMs (Post-Translational Modifications). A web resource for this task is PhosphoSitePlus, which helps to identify a variety of PTMs on a protein, such as phosphorylation, methylation and ubiquitination. According to PhosphoSitePlus, WDR19 has several phosphorylation sites, which are diagrammed in the following figure (Source: PhosphoSitePlus):

5. Gene Structure Prediction

The central dogma of molecular biology describes the flow of genetic information from DNA to protein. The process is mostly unidirectional, except in retroviruses, which can produce DNA from RNA by reverse transcription. A DNA sequence that codes for an RNA or a protein can be identified as a gene. The human genome is about 3.3 billion base pairs long, so it is impractical to isolate each individual gene by experimental methods alone. With the development of bioinformatics, however, a simpler way of identifying genes emerged that does not require experimental manipulation. This is known as gene prediction. Genes identified by gene prediction can then be validated experimentally to determine whether they are actual genes.

Gene prediction algorithms such as GENIE, GENSCAN, GeneMark and Augustus (a nice review for Augustus can be found here) identify different features of genes. Most of these programs are based on stochastic models such as Hidden Markov Models (HMMs); Augustus, for example, is based on a generalized Hidden Markov Model (GHMM). Prokaryotic genes are generally simple compared to eukaryotic genes. Eukaryotic genes are considerably larger and can carry very long introns, while prokaryotes can produce polycistronic RNAs.

What features do these gene-finding algorithms look for? They are usually capable of identifying multiple features. For example, the start codon (mostly ATG), stop codons (TAA, TGA and TAG), the length of the potential CDS, base composition, codon usage, promoter elements (such as the -10 and -35 elements in bacteria), poly-A signals and consensus sequences at the exon/intron boundaries are widely used to identify genes in a piece of DNA.
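
The simplest of these features, a start codon followed in frame by a stop codon, can be scanned for directly. The following is a minimal sketch in Python, assuming only the forward strand and the ATG start codon; the function name find_orfs and the toy sequence are hypothetical. Real gene finders such as GENSCAN or Augustus combine many more signals in a probabilistic model.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    start is the index of the A in ATG, end is one past the stop codon, and
    min_codons is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


# Toy sequence (hypothetical) containing one short ORF.
toy_dna = "CCATGAAACCCGGGTAAGG"
for start, end, frame in find_orfs(toy_dna):
    print(frame, start, end, toy_dna[start:end])
```

A potential CDS length filter corresponds to the min_codons argument; scanning the reverse complement as well would cover the other strand.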

Codon Usage: The genetic code is degenerate: 61 different codons encode the 20 amino acids, and the remaining three function as translational stop codons. Only methionine (Met) and tryptophan (Trp) have a single codon, while arginine (Arg), leucine (Leu) and serine (Ser) have the highest number, six each. Different organisms have different codon preferences, which is what codon usage describes. Web tools such as GenScript and the Codon Usage Database can be used to look up codon usage tables for different organisms. The initiation codon also varies: even though the standard initiation codon is ATG, we now know that the codons CTG and TTG can also initiate translation.
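
The degeneracy of the code can be tallied from the standard codon table. The following is a minimal sketch in Python; the variable names are hypothetical, and the table is the standard genetic code only, not an organism-specific codon usage table.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering; '*' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons assigned to each amino acid (and to the stop signal).
degeneracy = Counter(CODON_TABLE.values())

stop_codons = sorted(codon for codon, aa in CODON_TABLE.items() if aa == "*")
sense_codon_count = len(CODON_TABLE) - len(stop_codons)
single_codon = sorted(aa for aa, n in degeneracy.items() if n == 1)
six_codon = sorted(aa for aa, n in degeneracy.items() if n == 6)

print("stop codons:", stop_codons)
print("sense codons:", sense_codon_count)
print("amino acids with one codon:", single_codon)
print("amino acids with six codons:", six_codon)
```

Codon usage tables go one step further and record how often each of these synonymous codons is actually used in a given organism.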

As mentioned earlier, gene prediction algorithms such as GENSCAN, GeneMark and Augustus provide a way to identify genes within a given DNA sequence. I used a DNA sequence 41957 bp long to identify the genes it may encode.

I used GENSCAN first to locate any genes within it. GENSCAN predicts the locations and exon-intron structures of genes in genomic sequences from a variety of eukaryotes (e.g. vertebrates, Arabidopsis, maize) (for more information click here). You can submit your input as a FASTA file or paste it directly into the space provided. First, let's look at what we get with the default settings in the GENSCAN interface:

After the query sequence is submitted, the program first analyses it and reports the sequence length and the GC content.
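
Both summary values are easy to compute directly. The following is a minimal sketch in Python, assuming a plain DNA string; the function name gc_content and the toy sequence are hypothetical, and this is not GENSCAN's own code.

```python
def gc_content(dna):
    """Return the percentage of G and C among the unambiguous bases (A, C, G, T)."""
    counted = [base for base in dna.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


# Toy sequence (hypothetical); ambiguous bases such as N are ignored.
toy_dna = "ATGCGCNNTA"
print("length:", len(toy_dna))
print("GC content (%):", gc_content(toy_dna))
```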

The program identified 12 possible genes within the query DNA sequence. One of the predicted genes is shown below:

Key:

-Prom -> Promoter

-Init -> Initial Exon

-Intr -> Internal Exon

-Term -> Terminal Exon

-PlyA -> Poly A Signal

According to these results, the predicted gene has 4 exons, and the length of each exon is given. The program also provides the expected protein product. The predicted protein has 387 amino acids, as indicated below:

Then the same sequence was used to do a BLAST search. The sequence aligned with Arabidopsis thaliana Chromosome 5. It also has a similar gene as indicated below:

The protein was identified as a protein kinase-like protein. According to the NCBI entry, the protein is 410 amino acids long. To investigate the similarity between the two protein products, a pBLAST was done (Note: “align two or more sequences”).

As shown in the dot matrix, the gene identified by GENSCAN is missing a stretch of amino acid sequence compared to the NCBI entry. I had run GENSCAN with the default settings, keeping “Vertebrates” as the source organism. As described previously, different organisms have different codon usages, so we might get the same CDS if we rerun the program with Arabidopsis as the source organism. What if we change it to Arabidopsis?

The expected protein product was observed. The pBLAST confirmed it and the dot-matrix is shown below.
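
A dot matrix of the kind used in these comparisons can be sketched in a few lines. The following is a minimal Python illustration, assuming exact matching of short windows; the function name dot_matrix and the toy sequences are hypothetical, and BLAST's own dot plot is derived from its alignments rather than computed this way.

```python
def dot_matrix(seq_a, seq_b, window=2):
    """Return rows of text in which '*' marks identical windows of the two sequences.

    Row i corresponds to the window starting at position i of seq_a, and
    column j to the window starting at position j of seq_b.
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = []
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row.append("*" if same else ".")
        rows.append("".join(row))
    return rows


# Toy proteins (hypothetical): the second lacks an internal stretch of the first.
full_protein = "MKTAYIAKQR"
partial_protein = "MKTKQR"
for row in dot_matrix(full_protein, partial_protein):
    print(row)
```

Two identical sequences give one unbroken diagonal, while a missing stretch in one of them shows up as a break and a shift in the diagonal.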

The Augustus web server is a recently developed, reliable web platform that can be used to identify genes within a piece of DNA. An important feature of Augustus is that it considers alternative transcripts, which can be produced by alternative splicing. The same sequence was scanned for genes using Augustus. A number of source organisms (both prokaryotes and eukaryotes) are available on this server. Since the BLAST search had already shown that the sequence corresponds to Arabidopsis thaliana, I used it as the source organism and ran the program.

With Arabidopsis as the source organism, Augustus identified 17 genes within the query DNA sequence (GENSCAN identified only 12). Four of them, with their corresponding amino acid sequences, are given below.

The fourth protein sequence in this Augustus output file was compared by BLAST with the protein sequence obtained from GENSCAN (with Arabidopsis as the source). The BLAST search showed that the two protein sequences are the same. The BLAST results are as follows:

When the same sequence is run against a different source organism (here Homo sapiens), the predicted genes in the graphical representation are significantly different.

With Homo sapiens as the source organism, Augustus identified only 10 genes within the query DNA sequence. Two of them, with their corresponding amino acid sequences, are given below.

The first protein sequence in this Augustus output file was compared by BLAST with the protein sequence obtained from GENSCAN (with Arabidopsis as the source). The BLAST search showed that the two protein sequences are still the same. The BLAST results are as follows:

GeneMark is another widely used tool for gene prediction. It can be used with both eukaryotic and prokaryotic sequences, and it offers multiple options for gene prediction depending on the user's requirements.

I used both GeneMark and GeneMark.hmm to predict genes in the above 41957 bp long DNA sequence. Instead of using eukaryotic sources I have used Escherichia coli, which is a bacterium (prokaryote), as the source organism. GeneMark identified only 8 genes in the DNA sequence (in both + and - strands) while GeneMark.hmm identified 31 genes (in both + and - strands). GeneMark.hmm also provides the opportunity to identify each individual gene sequence, protein sequence and graphical representation (as a PDF).

The identified protein sequence was used to search the PDB, and 3409 hits were found. Homology modeling was likewise done using Swiss-Model. Three possible structures are given below:

6. Genome browsers and community genome annotation

Besides NCBI, EMBL and DDBJ, which are the main sources of DNA sequence information, there are now many specialized genome browsers for searching the genomic data of different organisms. In bioinformatics, a genome browser is typically defined as a graphical interface for displaying information from a biological database of genomic data. Many genomic databases are freely available on the internet and, most importantly, there are dedicated genomic databases for many model organisms of interest.

Ex: FlyBase, UCSC Human Genome Browser, TAIR, etc.

Genome annotation is a key step in developing genome browsers. Gene prediction with tools such as Augustus and GeneMark is a powerful way to identify possible genes within a piece of DNA, but the next important step is to validate these predicted genes as real. Once validated, they should be added to a genome browser. This process is known as genome annotation. According to PlantGDB, genome annotation is the process by which pertinent information about these raw DNA sequences is added to the genome databases. This involves describing different regions of the code (e.g. start and stop codons, exons, introns) and identifying which regions can be called genes. One such genome annotation platform is xGDBvm. This web platform allows researchers to annotate genomes and to view or analyze outputs using a web-based graphical user interface (GUI).

Currently this web platform is involved in annotating three genomes: an insect, a plant very close to Arabidopsis, and banana (i.e., Polistes dominula, Capsella rubella and Musa, respectively). For example, Polistes dominula has 11819 gene models, but only 10 of them have been annotated by the community to date. Many annotation tools are available, for example yrGATE. According to PlantGDB, yrGATE is a web-based gene-structure annotation tool for the identification and dissemination of eukaryotic genes. One key advantage of yrGATE is its simplicity. The annotation platform, xGDBvm, uses two main annotation tools: the aforementioned yrGATE and CpGAT. PlantGDB defines CpGAT as a comprehensive tool for annotating genomic regions up to 500 kilobases.

As mentioned above, 11819 gene models are available for Polistes dominula. These gene models can be filtered on several parameters, such as the coverage score and the integrity score. A model with a high coverage score (>90%) and a low integrity score (<75%) is a good candidate for annotation with yrGATE.

When these filters were applied, the number of gene models was reduced to a total of 2029. A model with a coverage of 97% and an integrity score of 34% was chosen for annotation using yrGATE.
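
The filtering rule described above (coverage above 90%, integrity below 75%) can be written down as a simple predicate. The following is a minimal sketch in Python; the GeneModel class, the function name annotation_candidates and the model identifiers are hypothetical, and the web platform applies these filters through its own interface rather than through code like this.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    model_id: str      # hypothetical identifier, for illustration only
    coverage: float    # coverage score in percent
    integrity: float   # integrity score in percent


def annotation_candidates(models, min_coverage=90.0, max_integrity=75.0):
    """Keep models with high coverage but low integrity."""
    return [
        model for model in models
        if model.coverage > min_coverage and model.integrity < max_integrity
    ]


# Toy gene models (hypothetical values).
toy_models = [
    GeneModel("model_A", coverage=97.0, integrity=34.0),
    GeneModel("model_B", coverage=99.0, integrity=92.0),
    GeneModel("model_C", coverage=60.0, integrity=40.0),
]
for model in annotation_candidates(toy_models):
    print(model.model_id, model.coverage, model.integrity)
```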

Different colors in the graphical representation correspond to different components.

Blue represents statistical models derived from computational methods. Purple, however, represents a well-conserved protein that spans all the predicted gene models in the graphic. These gene models may therefore not be different genes but rather a single gene. This hypothesis is supported by the mRNA sequences as well. The gene encoding the conserved protein can therefore be used as a single model for annotation. Once the model is built, the next task is to identify possible open reading frames. yrGATE is linked directly to ORF Finder, so possible ORFs can be identified easily. Once an ORF is determined, the predicted protein sequence is used in a protein BLAST to identify any closely related proteins.

According to the ORF Finder results, the predicted protein is 716 amino acids in length. The protein sequence was then used in a BLAST search. As with ORF Finder, yrGATE is linked directly to NCBI pBLAST.

The predicted protein showed 79% similarity to a PREDICTED: uncharacterized protein LOC408669 isoform X2.

Before submission, the annotation class has to be determined. Once all the necessary information, including the name of the putative protein, has been entered, the annotation can be submitted for review. The submitted annotation will appear in green in the community annotations section.

7. iPlant Collaborative and Comparative Genomics

One of the important platforms available for biologists in iPlant Collobarative. This enables biologists to do data-driven science by providing them with powerful computational infrastructure for handling huge datasets and complex analyses, iPlant fills a niche created by the computing epoch and a rapidly evolving world.

iPlant comprised of a set of platforms and tools and organized in such a way to support the research needs of today's life scientist. It provides world-class experiences in high performance computing, data storage, analysis tools and workflows, visualization and image analysis, computational expertise, and educational resources. iPlant collaborative architecture is as follows:

One such important platform is Atmosphere. It functions as a virtual computer to do various computer driven manipulations and provide a platform to run various programs separately.

It is widely popular among scientists and is one of the most frequently used services of iPlant.

Another important tool is the Discovery Environment.

Comparative genomics is a field of biological research in which the genomic features of different organisms, such as DNA sequence, genes, gene order, regulatory sequences, and other genomic structural landmarks, are compared. One platform that can be used for comparative genomic research is CoGe (Accelerating Comparative Genomics).

This resource contains four major systems: a data engine (to store thousands of genomes), a suite of interconnected web-based tools, a wiki documentation system with hundreds of pages on comparative genomics, and a TinyURL resource for storing links to CoGe to regenerate data and analyses.

8. Gene Expression Analyses and Motif Finding in DNA and Protein Sequences

Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic and transcriptomic projects (such as genome sequencing projects and RNA-seq) to describe gene (and protein) functions and interactions. According to EMBL-EBI, the aim of functional genomics studies is to understand the complex relationship between genotype and phenotype on a global (genome-wide) scale. Different analyses are done to investigate a range of processes such as transcription, translation and epigenetic regulation, in an attempt to answer relevant biological questions such as: When and where are genes expressed? How do gene expression levels differ under various states? How are genes spatially and temporally regulated? How does gene expression change in various diseases or following a treatment?

We know that the human genome comprises only ~20,000-25,000 genes, but the human proteome is estimated to contain more than 1,000,000 proteins.

Therefore, genomic complexity does not correlate with proteomic complexity. Proteomic complexity is generated by two major causes: transcript variations (i.e., alternative splicing, RNA editing, use of differential promoters, translational frameshifting, etc.) and protein modifications/proteolysis. Functional genomics is one way of addressing this phenomenon.
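To make the gap concrete, the figures quoted above can be turned into a rough proteins-per-gene ratio. This is only back-of-the-envelope arithmetic on those estimates (the function name is hypothetical), and since the proteome figure is "greater than" 1,000,000, the ratios are lower bounds.

```python
def proteins_per_gene(proteome_size, gene_count):
    """Average number of distinct protein forms per gene."""
    return proteome_size / gene_count


# Estimates quoted in the text above.
PROTEOME_ESTIMATE = 1_000_000
GENES_LOW, GENES_HIGH = 20_000, 25_000

ratio_min = proteins_per_gene(PROTEOME_ESTIMATE, GENES_HIGH)  # more genes, lower ratio
ratio_max = proteins_per_gene(PROTEOME_ESTIMATE, GENES_LOW)   # fewer genes, higher ratio
print(f"At least {ratio_min:.0f} to {ratio_max:.0f} protein forms per gene on average")
```

So each gene corresponds, on average, to at least 40 to 50 protein forms.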

Functional genomics, in contrast to genomics, focuses on dynamic rather than static aspects. For instance, functional genomics focuses on gene transcription, translation, regulation of gene expression and protein–protein interactions, instead of static genomic information such as DNA sequence or structure. Functional genomics uses novel genome-wide approaches to address various biological questions (for example, the function of DNA at the levels of genes, RNA and proteins) rather than the traditional gene-by-gene approach.

As mentioned above, functional genomic analyses are done at various stages of gene expression. A few of them are listed below. In addition to these three main categories, there are many others.

1. DNA level

Ex: ENCODE, genetic interaction mapping, next-generation sequencing (NGS), etc.

2. RNA level (Transcriptome Analysis)

Ex: Microarray, SAGE (Serial Analysis of Gene Expression), etc.

3. Protein level

Ex: Protein-protein interactions by using Y2H (Yeast two hybrid) system

The rapid development of functional genomics, driven by techniques such as those mentioned above, necessitates repositories for functional genomic data. One such data repository is GEO (Gene Expression Omnibus). GEO is an international public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community. There are three major features of GEO.

1. Provide a versatile database to efficiently store functional genomic data

2. Simple submissions and well annotated data deposits from the research community

3. Provide user friendly mechanisms to query, locate, review or retrieve data

GEO is made up of three components: the GEO DataSets and GEO Profiles databases and the GEO2R analysis tool. The GEO DataSets database stores original submitter-supplied records (Series, Samples and Platforms) as well as curated DataSets, whereas the expression profiles of curated GEO DataSets are stored in GEO Profiles. GEO2R is an interactive web tool that allows users to compare two or more groups of Samples in a GEO Series in order to identify genes that are differentially expressed across experimental conditions (How to use GEO2R?).
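The record types above can be told apart by their accession prefix. The sketch below relies on GEO's accession convention (GSE for Series, GSM for Samples, GPL for Platforms, GDS for curated DataSets); the helper name is hypothetical and the check is purely syntactic, so it does not confirm that a record exists.

```python
import re

# GEO accession prefixes and the record type each one denotes.
GEO_RECORD_TYPES = {
    "GSE": "Series",
    "GSM": "Sample",
    "GPL": "Platform",
    "GDS": "DataSet",
}


def classify_accession(accession):
    """Return the GEO record type implied by an accession such as 'GSE29252'."""
    match = re.fullmatch(r"(GSE|GSM|GPL|GDS)(\d+)", accession.strip().upper())
    if match is None:
        raise ValueError(f"Not a recognised GEO accession: {accession!r}")
    return GEO_RECORD_TYPES[match.group(1)]


print(classify_accession("GSE29252"))  # Series
```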

How to use GEO2R?

As mentioned above, GEO2R is an analysis platform, so the first step is to identify a data set to analyze. One can search GEO by GEO Series accession number (Ex: GSE29252) or by keyword. If the Series is associated with multiple platforms, the program may ask you to select the Platform of interest. The second step is defining groups. The Sample panel contains a link named “Define Groups”; clicking on it lets you define groups (Ex: Control, Test, etc.). After the groups are defined, samples must be assigned to each group. Once samples have been assigned, GEO2R can be run by clicking on “TOP 250”, which performs the calculation with default settings. Results are presented as a table of genes ordered by significance. The top 250 genes are presented and may be viewed as profile graphs.
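The idea behind the steps above (define groups, assign samples, rank genes) can be sketched in a few lines. GEO2R itself runs its comparison in R with the limma package; the sketch below is not that method. It substitutes a plain Welch t-statistic and ranks genes by its absolute value, with no p-values or multiple-testing correction. All gene names, values and function names are hypothetical.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # No within-group variation: the statistic is undefined or infinite.
        return copysign(inf, diff) if diff else 0.0
    return diff / se


def rank_genes(expression, groups, test="Test", control="Control", top=250):
    """Rank genes by |t| between two sample groups, strongest first.

    expression: dict mapping gene name -> list of values, one per sample
    groups:     list of group labels, in the same sample order
    """
    test_idx = [i for i, g in enumerate(groups) if g == test]
    ctrl_idx = [i for i, g in enumerate(groups) if g == control]
    scored = []
    for gene, values in expression.items():
        t = welch_t([values[i] for i in test_idx],
                    [values[i] for i in ctrl_idx])
        scored.append((gene, t))
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:top]


# Hypothetical toy data: three Control and three Test samples.
groups = ["Control", "Control", "Control", "Test", "Test", "Test"]
expression = {
    "geneA": [5.0, 5.1, 4.9, 8.0, 8.2, 7.8],  # higher in Test
    "geneB": [6.0, 6.2, 5.8, 6.1, 5.9, 6.0],  # essentially unchanged
    "geneC": [9.0, 9.1, 8.9, 7.0, 7.2, 6.8],  # lower in Test
}
ranked = rank_genes(expression, groups)
for gene, t in ranked:
    print(gene, round(t, 2))
```

A positive statistic means higher expression in the Test group, a negative one lower expression. With so few samples per group a moderated statistic such as limma's is generally preferred, which is one reason to treat this only as an illustration.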

Here is an example (I have included screenshots of each step):

Or

Keyword Search Results:

1. DataSets Results

2. Profiles Results

First hit of the GEO DataSets:

9. Workflow Design and Execution Tools

BioExtract: not very good. Taverna: a Java application.

Galaxy: record keeping! Important. Why? When you do research you need workflow management. If you need to redo something similar later, you can use Galaxy.

Vimeo

b2gof15/students/jbuddika/start.txt · Last modified: 2015/10/29 14:12 by jbuddika
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International