Bioinformatics 2 Go Final

With sequencing prices at an all-time low and technologies ever expanding, it is now affordable and reasonably simple to perform high-throughput sequencing. Advances in technology have made it easy to generate large amounts of data, for instance millions of sequence reads per organism. Therefore, the problem is no longer obtaining data, but organizing it, mining it, and interpreting it. Bioinformatics is an emerging field that tackles these issues and also generates theoretical models based on user-defined parameters and algorithms. It is very unlikely that a scientist will fail to encounter bioinformatics at some point in their career, so it is important to learn how to navigate public databases and servers. In this Bioinformatics2Go course, we have encountered some of these tools and have recorded our experiences.

Minichromosome maintenance

I picked a subunit of the minichromosome maintenance (MCM) helicase as my protein because it is essential for DNA replication in Archaea and Eukaryotes and because the lab I am in is studying it. The following amino acid sequence, in FASTA format, is from Trypanosoma grayi and can be found on the NCBI website under the accession number XP_009307667.1. It is 952 amino acids long.

MPPHRKRARDEGLPRHARESDEEPEEAEEELGEDLFGENYEQDYLHPDEESEVAEDDVADEDWIDDDTDV SVISDSGRAAVDALLDRRKEMEQRLREEQRQLQEGVFSDVDDEESLISADDDYYDGTNAGEDAGKNGAGG NDDNGDDNMNEGGADGVVGNSKDDGVYVRGELESMDFDWRQPQCDLVEWLAQELPRRVVKNRIYNFYLNY VEHGVSVYEQKVNLMTRENEQSFQLSYSHLSRAYDSVLALWLVDVPDIMIELLEDAANYFVFRLFPHYRK VHRCILVRICDLPLCDPIRDFRQIHMNVLVRVEGVVIRRSPVYPQMQAVRYDCVRCSYIIGPIYQRGDKE QRVSMCPSCHSKGPFRVNMTLTEYRNHQTIVLQESPGKVPPGRLPRSLEVVLTNDLIDRANPGEEVDVTG IYRNNFDPLLNSRQGFPVFTTVLHANNVVRRSAEVGSFRLPDDERARIMDLSKHPNLRRKILRSIAPSIH GRDDIKLGLLLAMLGGVPKDVGGDQSHRIRGDINVLLVGDPGCAKSQFLKFVEKTANRAVFTTGRGSTAV GLTASVHKDSVTGDFVLEGGALVIADRGSCLIDEFDKMSDQDRTSIHEAMEQQTISVARGGIVTTLSARC SILAAANPIGGRYDPSLSFDANVNLTTPILSRFDLLFVVRDEVNVELDEKLATFICHSHIRNHPRSQDEN KRAERELHDRLSKLRHALENAPTEEERRAIEMQLQALRHTVLDKSLQEDEDPSSDKPLPQQLLRKYILYA KAHCHPRVSNIDANTIARLYTELRQESKHGGLAITVRHMESVIRLSEAHARLHLRDFVRDEDVSAAISLF LRCFIQTQKYSLRSAMENRFRKYFDSDTEPLPLIQHHIKVAVQSIRTFERQMSGGVEPTKVRIDVSQLEH CTTSVSKESLNAFFESEEFRREYTLVRDPVTGVALQIEHSLV

More information on this protein can be found on the Wikipedia page and in the Cold Spring Harbor Perspectives in Biology publication, The Minichromosome Maintenance Replicative Helicase.

Molecular Phylogeny

To determine evolutionary relationships between organisms, similar sequences first need to be found. In this case, protein BLAST (blastp) can be used to search the NCBI protein databases for similar protein sequences across species. I searched my query sequence against the Swiss-Prot database, which is a manually curated and therefore reliable collection of protein sequences.

The following are similar amino acid sequences from rice, frog, yeast, archaea (Sulfolobus), and cattle.

MCM_RICE

MDDSENNAPSTPGSPGFSTDRLPPNTTTSRGATDPSSYSDDDDDDVVGAEEAEVDPNVLPEDDGVVAAEE EEDGEDLFNDNYLDDYRRMDEQDQYESVGLDDSIEDERNLDEIMADRRAAEAELDARDVRTGAAPDRKLP RMLHDQDTDEDMSFRRPKRHRANFRPPREPRTPRSDDDGDGATPSSPGRSQRGMYSGGDVPMTDQTDDDP YEDEFDEEDEMNMYRVQGTLREWVTRDEVRRFIAKKFKEFLLTYVNPKNEQGEFEYVRLINEMVLANKCS LEIDYKQFIYIHPNIAIWLADAPQSVIEVMEEVAKNVVFDLHKNYRNIHQKIYVRITNLPVYDQIRNIRQ IHLNTMIRIGGVVTRRSGVFPQLQQVKYDCSKCGTVLGPFFQNSYTEVKVGSCPECQSKGPFTINVEQTI YRNYQKLTLQESPGIVPAGRLPRYKEVILLNDLIDCARPGEEIEVTGIYTNNFDLSLNTKNGFPVFATVV EANYVAKKQDLFSAYKLTDEDKAEIEKLAKDPRIGERIVKSIAPSIYGHEDIKTAIALAMFGGQEKNVKG KHRLRGDINVLLLGDPGTAKSQFLKYVEKTGHRAVYTTGKGASAVGLTAAVHKDPVTREWTLEGGALVLA DRGICLIDEFDKMNDQDRVSIHEAMEQQSISISKAGIVTSLQARCSVIAAANPIGGRYDSSKTFTQNVEL TDPIISRFDVLCVVKDIVDPFTDEMLARFVVDSHARSQPKGANLEDRVPTDVEDDPLAAARQADPDILSQ DMLKKYITYAKLNVFPKIHDADLDKISHVYAELRRESSHGQGVPIAVRHIESIIRMSEAHARMHLRSYVS QEDVDMAIRVLLDSFISTQKFGVQKALQKNFRKYMTYKKDYNELLLLLLRTLVKDVLHFEEIVSGPTTRL THIEVKVEDLKNKAQEYEIYDLRPFFSSAHFRDNNFVLDEGRGIIRHPLAA

MCM_FROG

MADSSESFNIATSPRTGSRRDALTSSPGRDLPPFEDESEGMFGDEVPREEEEDGEELIGDAMERDYRAIS ELDRYEAEGLDDEDDVEDLTASQRDAAEQAMRMRDREMGHELGRMRRGLLYDSDEEDEDRPARKRRMAER AAEGAPEEDEEMIESIENLEDMKGHTVREWVSMAATRLEIYHRFKNFLRTHVDEHGHNVFKEKISDMCKE NKESLVVNYEDLAAREHVLAYFLPEAPAEMLKIFDEAAKEVVLVMYPKYDRIAREIHVRISHLPLVEELR SLRQLHLNQLIRTSGVVTCCTGVLPQLSMVKYNCNKCNFILGPFFQSQNQEVKPGSCPECQSLGPFEINM EETVYQNYQRITIQESPGKVAAGRLPRSKDAILLADLVDSCKPGDEIELTGTYHNNYDGSLNTANGFPVF ATVILANHITKKDDKVAVGELTDEDVKAIVALSKDERIGERIFASIAPSIYGHEDIKRGLALALFGGEAK NPGGKHKVRGDINVLLCGDPGTAKSQFLKYVEKVASRAVFTTGQGASAVGLTAYVQRHPVTKEWTLEAGA LVLADRGVCLIDEFDKMNDQDRTSIHEAMEQQSISISKAGIVTSLQARCTIIAASNPIGGRYDPSLTFSE NVDLTEPIVSRFDILCVVRDTVDPVQDEMLARFVVGSHIKHHPSSKDIANGEEFALPNTFGVEPLPQEVL KKYIMYSKEKIHPKLNQMDQDKVAKMYSDLRKESMATGSIPITVRHIESMIRMAEAHARMHLRDYVVEDD VNMAIRVMLESFIDTQKFSVMRSMRKTFARYLAFRRDNNELLLFVLKQLVAEQTSYQRNRYGAQQDTIEV PEKDLVDKARQINIHNLSAFYDSDLFKMNRFTHDVKKKMIIQQF

MCM_YEAST

MSDNRRRRREEDDSDSENELPPSSPQQHFRGGMNPVSSPIGSPDMINPEGDDNEVDDVPDIDEVEEQMNE VDLMDDNMYEDYAADHNRDRYDPDQVDDREQQELSLSERRRIDAQLNERDRLLRNVAYIDDEDEEQEGAA QLDEMGLPVQRRRRRRQYEDLENSDDDLLSDMDIDPLREELTLESLSNVKANSYSEWITQPNVSRTIARE LKSFLLEYTDETGRSVYGARIRTLGEMNSESLEVNYRHLAESKAILALFLAKCPEEMLKIFDLVAMEATE LHYPDYARIHSEIHVRISDFPTIYSLRELRESNLSSLVRVTGVVTRRTGVFPQLKYVKFNCLKCGSILGP FFQDSNEEIRISFCTNCKSKGPFRVNGEKTVYRNYQRVTLQEAPGTVPPGRLPRHREVILLADLVDVSKP GEEVEVTGIYKNNYDGNLNAKNGFPVFATIIEANSIKRREGNTANEGEEGLDVFSWTEEEEREFRKISRD RGIIDKIISSMAPSIYGHRDIKTAVACSLFGGVPKNVNGKHSIRGDINVLLLGDPGTAKSQILKYVEKTA HRAVFATGQGASAVGLTASVRKDPITKEWTLEGGALVLADKGVCLIDEFDKMNDQDRTSIHEAMEQQSIS ISKAGIVTTLQARCSIIAAANPNGGRYNSTLPLAQNVSLTEPILSRFDILCVVRDLVDEEADERLATFVV DSHVRSHPENDEDREGEELKNNGESAIEQGEDEINEQLNARQRRLQRQRKKEEEISPIPQELLMKYIHYA RTKIYPKLHQMDMDKVSRVYADLRRESISTGSFPITVRHLESILRIAESFAKMRLSEFVSSYDLDRAIKV VVDSFVDAQKVSVRRQLRRSFAIYTLGH

MCM_SULFOLOBUS

MEIPSKQIDYRDVFIEFLTTFKGNNNQNKYIERINELVAYRKKSLIIEFSDVLSFNENLAYEIINNTKII LPILEGALYDHILQLDPTYQRDIEKVHVRIVGIPRVIELRKIRSTDIGKLITIDGILVKVTPVKERIYKA TYKHIHPDCMQEFEWPEDEEMPEVLEMPTICPKCGKPGQFRLIPEKTKLIDWQKAVIQERPEEVPSGQLP RQLEIILEDDLVDSARPGDRVKVTGILDIKQDSPVKRGSRAVFDIYMKVSSIEVSQKVLDEVIISEEDEK KIKDLAKDPWIRDRIISSIAPSIYGHWELKEALALALFGGVPKVLEDTRIRGDIHILIIGDPGTAKSQML QFISRVAPRAVYTTGKGSTAAGLTAAVVREKGTGEYYLEAGALVLADGGIAVIDEIDKMRDEDRVAIHEA MEQQTVSIAKAGIVAKLNARAAVIAAGNPKFGRYISERPVSDNINLPPTILSRFDLIFILKDQPGEQDRE LANYILDVHSGKSTKNIIDIDTLRKYIAYARKYVTPKITSEAKNLITDFFVEMRKKSSETPDSPILITPR QLEALIRISEAYAKMALKAEVTREDAERAINIMRLFLESVGVDMESGKIDIDTIMTGKPKSAREKMMKII EIIDSLAVSSECAKVKDILKEAQQVGIEKSNIEKLLTDMRKSGIIYEAKPECYKKV

MCM_CATTLE

MDLAAAAEPGAGSQHLEVRDEVAEKCQKLFLDFLEEFQNSDGEIKYLQLAEELIRPERNTLVVSFVDLEQ FNQQLSTTIQEEFYRVYPYLCRALKTFVKDRKEIPLAKDFYVAFQDLPTRHKIRELTSSRIGLLTRISGQ VVRTHPVHPELVSGTFLCLDCQTVIKDVEQQFKYTQPNICRNPVCANRRRFLLDTNKSRFVDFQKVRIQE TQAELPRGSIPRSLEVILRAEAVESAQAGDKCDFTGTLIVVPDVSKLSTPGARAETDSRVSGVDGYETEG VRGLRALGVRDLSYRLVFLACCVAPTNPRFGGKELRDEEQTAESIKNQMTVKEWEKVFEMSQDKNLYHNL CTSLFPTIHGNDEVKRGVLLMLFGGVPKTTGEGTSLRGDINVCIVGDPSTAKSQFLKHVEEFSPRAVYTS GKASIAAGLTAAVVRDEESHEFVIEAGALMLADNGVCCIDEFDKMDVRDQVAIHEAMEQQTISITKAGVK ATLNARTSILAAANPISGHYDRSKSLKQNINLSAPIMSRFDLFFILVDECNEVTDYAIARRIVDLHSRIE DSIDRVYSLDEIRRYLLFARQFKPKISKESEDFIVEQYKRLRQRDGSGVTKSSWRITVRQLESMIRLSEA MARMHCCDEVQPKHVKEAFRLLNKSIIRVETPDVNLDQEEDAQMEVDEGPDGINGHADSPAPASGINGHS EDMNQDSVPKASLRLGFSEYCRISNLIVLHLRKMEEEEDESALKRSELVNWYLKEIESEIDSEEELINKK RIIEKVIYRLTHYDHVLIELTQAGLKGSTEGSESYEEDPYLVVNPNYLLED

Once I had obtained the similar sequences in FASTA format from the BLAST search, I created a pipeline using the Mobyle @ Pasteur site. A pipeline is a series of programs linked together so that the output of one program becomes the input of the next. A screenshot of my pipeline is shown below.

The program MUSCLE is a multiple sequence aligner. Its input was the set of similar sequences I selected together with the trypanosome sequence, all in FASTA format. The output (shown below) was fed into the program protdist.

MCM_CATTLE


—MDLAAAAEPGAGSQHLEVR————————————–


———————DEVAEKCQKLFLDFLEEFQ—NSDGEIKYLQLAEELIR PERNTLVVSFVDLE-QFNQQLSTTIQEEFYRVYPYLCRALK–TFVKDRKEIPLAKDFYV AFQDLPTRHKIRELTSSRIGLLTRISGQVVRTHPVHPELVSGTFLCLDCQTVIKDVEQQF KYTQ——PNICRNPVCANRRRFLLDTNKSRFVDFQKVRIQETQAELPRGSIPRSLEV ILRAEAVESAQAGDKCDFTGTLIVVPDVSKLSTPGARAETDSRVSGVDGYETEGVRGLRA LGVRDLSYRLVFLACCVAPT——–NPRFGGKELRDEEQTAESIKNQMTVKEWEKVF EMSQDKNLYHNLCTSLFPTIHGNDEVKRGVLLMLFGGVPKTTGEG–TSLRGDINVCIVG DPSTAKSQFLKHVEEFSPRAVYTSGKASIAAGLTAAVVRDEESHEFVIEAGALMLADNGV CCIDEFDKMDVRDQVAIHEAMEQQTISITKAGVKATLNARTSILAAANPISGHYDRSKSL KQNINLSAPIMSRFDLFFILVDECNEVTDYAIARRIVDLH——————– ———SRIEDSIDR——————————VYSLDEIRRYLL FARQF-KPKISKESEDFIVEQYKRLRQRDGSGVTKSSWRITV–RQLESMIRLSEAMARM HCCDEVQPKHVKEAFRLLNKSIIRVETPDVNLDQEEDAQMEVDEGPDGINGHADSPAPAS GINGHSEDMNQDSVPKASLRLGFSEYCRISNLIVLHLRKMEEEEDESALKRSELVNWYLK EIESEIDSEEELINKKRIIEKVIYRLTHYDHVLIELTQAGLKGSTEGSESYEEDPYLVVN PNYLLED

MCM_SULFOLOBUS


———–MEIPSKQIDYR————————————–


—————————–DVFIEFLTTFK—GNNNQNKYIERINELVA YRKKSLIIEFSDVL-SFNENLAYEIINNTKIILPILEGALYDHILQLDPTYQRDIEKVHV RIVGIPRVIELRKIRSTDIGKLITIDGILVKVTPVKERIYKATYKHIHPD-CMQEFEWPE DEEMPEVLEMPTIC–PKCGKPGQFRLIPEKTKLIDWQKAVIQERPEEVPSGQLPRQLEI ILEDDLVDSARPGDRVKVTGIL————–DIKQDSPVK-RGSRAV——– ——-FDIYMKVSSIEVS——–QKVLDEVIISEEDE————–KKIK DLAKDPWIRDRIISSIAPSIYGHWELKEALALALFGGVPKVLED—TRIRGDIHILIIG DPGTAKSQMLQFISRVAPRAVYTTGKGSTAAGLTAAVVREKGTGEYYLEAGALVLADGGI AVIDEIDKMRDEDRVAIHEAMEQQTVSIAKAGIVAKLNARAAVIAAGNPKFGRYISERPV SDNINLPPTILSRFDLIFILKDQPG-EQDRELANYILDVH———–SGKSTK— ———N————————————–IIDIDTLRKYIA YARKYVTPKITSEAKNLITDFFVEMRKKSSET—PDSPILITPRQLEALIRISEAYAKM ALKAEVTREDAERAINIM-RLFL————————————- —-ESVGVDMESG———–KIDIDTIMTGKPKSAREK—-MMKIIEIIDSLAV SSECAKVKDILKEAQQVGIE–KSNIEKLLTDM—RKSGI——–IYEAKPECYKK V——

MCM_TRYPANOSOME

———————————MPPHRKRARDEGLPRHARESDEEP— ——————EEAEEELGEDLFGENYEQDYLHPDEESEVAEDDVADEDWIDD DTDVSVISDSGRAAVDALLDRRKEMEQRLREEQRQLQEGVFSDVDDEESLISADDDYYDG TNAGEDAGKNGAGGNDDNGDDN————–MNEGGADGVVGNSKDDGVYVRGEL ESM-DFDWRQPQCDLVEWLAQELPRRVVKNRIYNFYLNYV—-EHGVSVYEQKVNLMTR ENEQSFQLSYSHLSRAYDSVLALWLVDVPDIMIELLEDAANYFVFRLFPHYRKVHRCILV RICDLPLCDPIRDFRQIHMNVLVRVEGVVIRRSPVYPQMQAVRYDCVRCSYIIGPIYQRG DKEQR—–VSMC–PSCHSKGPFRVNMTLTEYRNHQTIVLQESPGKVPPGRLPRSLEV VLTNDLIDRANPGEEVDVTGIY————–RNNFDPLLNSRQGFPV——– ——-FTTVLHANNVVRR——–SAEVGSFRLPDDER————–ARIM DLSKHPNLRRKILRSIAPSIHGRDDIKLGLLLAMLGGVPKDVGGDQSHRIRGDINVLLVG DPGCAKSQFLKFVEKTANRAVFTTGRGSTAVGLTASVHKDSVTGDFVLEGGALVIADRGS CLIDEFDKMSDQDRTSIHEAMEQQTISVARGGIVTTLSARCSILAAANPIGGRYDPSLSF DANVNLTTPILSRFDLLFVVRDEVNVELDEKLATFICHSHIRNHPRSQDENKRAERELHD RLSKLRHALENAPTEEERRAIEMQLQA-LRHTVLDKSLQEDEDPSSDKPLPQQLLRKYIL YAKAHCHPRVSNIDANTIARLYTELRQESKHG—-GLAITV–RHMESVIRLSEAHARL HLRDFVRDEDVSAAISLFLRCFI————————————- —-QTQKYSLRSAMENRFRKYFDSDTEPLPLIQHHIKVAVQS—–IRTFERQMSGGV EPTKVRIDVSQLEHCTTSVS–KESLNAFFESEEFRREYTL——–VRDPVTGVALQ IEHSLV-

MCM_YEAST

MSDNRRRRREEDDSDSENELPPSSPQQHFRGGMNPVSSPIGSPDMINPEGDDNEVDDVPD ID—————-EVEEQMNEVDLMDDNMYEDYAADHNRDRYDPDQV——-D DREQQELSLSERRRIDAQLNER—D——–RLLRNVAYIDDEDE—–EQ—EG AAQLDEMGLPVQRRRRRRQYED—————LENSDDDLLSDMDIDPLREELTL ESLSNVKAN—-SYSEWITQPNVSRTIARELKSFLLEYT—DETGRSVYGARIRTLGE MNSESLEVNYRHLA-ESKAILALFLAKCPEEMLKIFDLVAMEATELHYPDYARIHSEIHV RISDFPTIYSLRELRESNLSSLVRVTGVVTRRTGVFPQLKYVKFNCLKCGSILGPFFQDS NEEIR—–ISFC–TNCKSKGPFRVNGEKTVYRNYQRVTLQEAPGTVPPGRLPRHREV ILLADLVDVSKPGEEVEVTGIY————–KNNYDGNLNAKNGFPV——– ——-FATIIEANSIKRREGNTANEGEEGLDVFSWTEEEE————–REFR KISRDRGIIDKIISSMAPSIYGHRDIKTAVACSLFGGVPKNVNGK–HSIRGDINVLLLG DPGTAKSQILKYVEKTAHRAVFATGQGASAVGLTASVRKDPITKEWTLEGGALVLADKGV CLIDEFDKMNDQDRTSIHEAMEQQSISISKAGIVTTLQARCSIIAAANPNGGRYNSTLPL AQNVSLTEPILSRFDILCVVRDLVDEEADERLATFVVDSHVRSHPENDEDREGEELK— ———NNGESAIEQGEDEINEQLNARQRRLQRQRKKEEEIS—PIPQELLMKYIH YARTKIYPKLHQMDMDKVSRVYADLRRESIST—GSFPITV–RHLESILRIAESFAKM RLSEFVSSYDLDRAIKVVVDSFV————————————- —-DAQKVSVRRQLRRSFA—————————————- ———————-IYTLGH——————————–


MCM_RICE

-MD———–DSENNAPSTPGSPGFSTDRLPPNTTTSRGATDPSSYSDDDDDDVVG AEEAEVDPNVLPEDDGVVAAEEEEDGEDLFNDNYLDDYRRMDEQDQYESVGL–DDSIED ERNLDEIMADRR-AAEAELDAR—DVRTGAAPDRKLPRMLHDQDTDE—–DMSFRRP KRHRANFRPPREPRTPRSDDDGDGATPSSPGRSQRGMYSGGDVPMTDQTDDDPYEDE–F DEEDEMNMYRVQGTLREWVTRDEVRRFIAKKFKEFLLTYVNPKNEQGEFEYVRLINEMVL ANKCSLEIDYKQFI-YIHPNIAIWLADAPQSVIEVMEEVAKNVVFDLHKNYRNIHQKIYV RITNLPVYDQIRNIRQIHLNTMIRIGGVVTRRSGVFPQLQQVKYDCSKCGTVLGPFFQNS YTEVK—–VGSC–PECQSKGPFTINVEQTIYRNYQKLTLQESPGIVPAGRLPRYKEV ILLNDLIDCARPGEEIEVTGIY————–TNNFDLSLNTKNGFPV——– ——-FATVVEANYVAKK——–QDLFSAYKLTDEDK————–AEIE KLAKDPRIGERIVKSIAPSIYGHEDIKTAIALAMFGGQEKNVKGK–HRLRGDINVLLLG DPGTAKSQFLKYVEKTGHRAVYTTGKGASAVGLTAAVHKDPVTREWTLEGGALVLADRGI CLIDEFDKMNDQDRVSIHEAMEQQSISISKAGIVTSLQARCSVIAAANPIGGRYDSSKTF TQNVELTDPIISRFDVLCVVKDIVDPFTDEMLARFVVDSHARSQP——KGANLE— ———DRVPTDVEDDPLAAARQA—————DPD—ILSQDMLKKYIT YAKLNVFPKIHDADLDKISHVYAELRRESSHG—QGVPIAV–RHIESIIRMSEAHARM HLRSYVSQEDVDMAIRVLLDSFI————————————- —-STQKFGVQKALQKNFRKYMTYKKDYNELLLLLLRTLVKD—-VLHFEEIVSGPTT RLTHIEVKVEDLKNKAQEYE–IYDLRPFFSSAHF-RDNNF——–VLDEGRGIIRH PLAA—

MCM_FROG

MAD———–SSESFNIATSPRTGSRRDAL—-TSSPGRDLPP–FEDESEGMFG DE—————VPREEEEDGEELIGDAMERDYRAISELDRYEAEGL–D—-D EDDVEDLTASQRDAAEQAMRMR—DREMGHELGRMRRGLLYDSDEED—–ED—– ——-RPARKRRMAERAAEG————————APEEDEEMIES–I ENLEDMKGH—-TVREWVSMAATRLEIYHRFKNFLRTHV—DEHGHNVFKEKISDMCK ENKESLVVNYEDLA-AREHVLAYFLPEAPAEMLKIFDEAAKEVVLVMYPKYDRIAREIHV RISHLPLVEELRSLRQLHLNQLIRTSGVVTCCTGVLPQLSMVKYNCNKCNFILGPFFQSQ NQEVK—–PGSC–PECQSLGPFEINMEETVYQNYQRITIQESPGKVAAGRLPRSKDA ILLADLVDSCKPGDEIELTGTY————–HNNYDGSLNTANGFPV——– ——-FATVILANHITKK——–DDKVAVGELTDEDV————–KAIV ALSKDERIGERIFASIAPSIYGHEDIKRGLALALFGGEAKNPGGK–HKVRGDINVLLCG DPGTAKSQFLKYVEKVASRAVFTTGQGASAVGLTAYVQRHPVTKEWTLEAGALVLADRGV CLIDEFDKMNDQDRTSIHEAMEQQSISISKAGIVTSLQARCTIIAASNPIGGRYDPSLTF SENVDLTEPIVSRFDILCVVRDTVDPVQDEMLARFVVGSHIKHHPSSKDIANGEEFA— ———–LPNTFGVE—————————–PLPQEVLKKYIM YSKEKIHPKLNQMDQDKVAKMYSDLRKESMAT—GSIPITV–RHIESMIRMAEAHARM HLRDYVVEDDVNMAIRVMLESFI————————————- —-DTQKFSVMRSMRKTFARYLAFRRDNNELLLFVLKQLVAE——QTSYQRNRYGA QQDTIEVPEKDLVDKARQIN–IHNLSAFYDSDLF-KMNRF——–THDVKKKMIIQ QF—–

Protdist determines the distances between each pair of sequences. The generated matrix is shown below.
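As a rough illustration of what a pairwise distance matrix contains, the sketch below computes the simple p-distance (the fraction of differing residues among columns where neither sequence has a gap) for every pair of aligned sequences. This is a simplified stand-in and not what protdist itself computes, since protdist estimates distances under a model of amino acid substitution. The function names and the toy alignment are hypothetical and are not taken from the MCM data.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = differing = 0
    for x, y in zip(seq_a, seq_b):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(aligned):
    """Return (names, matrix) where matrix[i][j] is the p-distance of names[i] vs names[j]."""
    names = list(aligned)
    matrix = [[p_distance(aligned[a], aligned[b]) for b in names] for a in names]
    return names, matrix


# Toy alignment (hypothetical, for illustration only).
toy_alignment = {
    "seq1": "MKV-LA",
    "seq2": "MKVELA",
    "seq3": "MRV-IA",
}
names, matrix = distance_matrix(toy_alignment)
for name, row in zip(names, matrix):
    print(name, " ".join(f"{value:.2f}" for value in row))
```

The matrix is symmetric with zeros on the diagonal, which is the same general shape as the protdist output, even though the numbers from a model-based method would differ.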

The protdist output then serves as the input for the program neighbor. Neighbor generates an evolutionary tree. The neighbor output is shown below.
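To make the tree-building step more concrete, here is a minimal sketch of the neighbor-joining method, assuming that is the method neighbor was run with (the program also offers UPGMA). It repeatedly joins the pair of clusters with the smallest Q value and writes the result as a Newick string. The function name and the five-taxon distance matrix are illustrative assumptions; the matrix is a small textbook-style example, not the protdist output for the MCM sequences.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbor-joining.

    names: list of taxon labels; dist: square matrix of pairwise distances.
    Returns (newick_string, joins), where joins lists each merge as
    (cluster_a, cluster_b, branch_length_a, branch_length_b).
    """
    # d[a][b] holds the current distance between clusters a and b.
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names) if b != a}
         for i, a in enumerate(names)}
    joins = []
    while len(d) > 2:
        n = len(d)
        r = {a: sum(d[a].values()) for a in d}  # row sums
        # Pick the pair minimizing Q(a, b) = (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(combinations(d, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        len_a = d[a][b] / 2 + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        joins.append((a, b, len_a, len_b))
        new = f"({a}:{len_a:g},{b}:{len_b:g})"
        # Distance from the new internal node to every remaining cluster.
        new_row = {k: (d[a][k] + d[b][k] - d[a][b]) / 2
                   for k in d if k not in (a, b)}
        del d[a], d[b]
        for k in d:
            del d[k][a], d[k][b]
            d[k][new] = new_row[k]
        d[new] = new_row
    a, b = d  # the two remaining clusters are connected by one final branch
    return f"({a}:{d[a][b]:g},{b});", joins


# Illustrative five-taxon matrix (not the MCM data).
example_names = ["a", "b", "c", "d", "e"]
example_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
example_newick, example_joins = neighbor_joining(example_names, example_dist)
print(example_newick)
```

A Newick string like this is the kind of text output that a drawing program can turn into a tree image.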

These data were then used as the input for the program drawtree. Drawtree creates an image file of the tree, which is shown below.

I am quite surprised that cattle MCM is so closely related to that of Sulfolobus…

Genetic Variation

NCBI's dbSNP database is a collection of polymorphisms across species. A search for MCM in this database returned 328 hits, 33 of which (about 10%) were missense mutations.

I also searched for MCM in the NCBI 1000 Genomes browser and selected the link for MCM6 on chromosome 2. There were two clinical variants, one of which is pathogenic. This site shows variation between individuals throughout 14 populations worldwide.

Protein Structure

When searching for MCM on the PDB site I:

1) entered MCM in the search bar and clicked the search icon

2) scrolled down and clicked the first entry, Crystal structure of an active MCM hexamer. I chose this because there was no structure for trypanosomes. This is a crystal structure of a chimera between the N-terminal domain of Sulfolobus solfataricus MCM and the AAA+ domain of Pyrococcus furiosus MCM. In this chimera, MCM is a homohexamer.

3) clicked 3D view on right under protein image

4) played around with settings

The following image was generated with the custom view option, with style set to “ball and stick” and color set to “rainbow”.

Gene Structure Prediction

Gene prediction tools generate the predicted protein sequence of a gene from its nucleotide sequence. In the case of trypanosome MCM, I copied and pasted the nucleotide sequence from NCBI into the input box of the GENSCAN server to generate a predicted protein sequence (shown below).

This predicted protein sequence is 952 amino acids in length, just like the NCBI entry. Next, I ran a pairwise blastp alignment with the GENSCAN sequence as the query and the NCBI sequence as the subject. The results are shown below.

The BLAST alignment revealed 100% sequence identity.
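For readers unfamiliar with how an identity percentage is obtained, the sketch below counts identical residues over the columns of a pairwise alignment, which is roughly how BLAST reports its "Identities" value (matches divided by alignment length). The function name is hypothetical, the test fragment is the first 20 residues of the Trypanosoma grayi sequence given earlier, and the single substitution in the second example is made up for illustration.

```python
def percent_identity(aligned_query, aligned_subject, gap="-"):
    """Percent of alignment columns in which both sequences have the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(q == s and q != gap
                  for q, s in zip(aligned_query, aligned_subject))
    return 100.0 * matches / len(aligned_query)


# First 20 residues of the T. grayi MCM sequence, compared with itself.
fragment = "MPPHRKRARDEGLPRHARES"
print(percent_identity(fragment, fragment))

# Same fragment with one made-up substitution (H -> Q at position 4).
variant = "MPPQRKRARDEGLPRHARES"
print(percent_identity(fragment, variant))
```

Two identical 952-residue sequences give 100% under this definition, which is consistent with the predicted and NCBI proteins being the same.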

I then put the same NCBI nucleotide sequence into the Augustus site to generate a predicted protein sequence. The blastp alignment of this predicted protein sequence against the sequence from NCBI gave the same result as before and is shown below.

Genome browsers and community genome annotation

yrGATE is a tool for viewing and adding annotations to genomes. This is a nice resource because it makes this information available to the public and allows people to review changes to annotations so that errors can be corrected.

I decided to explore the Polistes dominula genome using the xGDB site. I clicked on “most wanted annotations” and then on “gene fusion”, which brought up the GAEVAL search results. GAEVAL is a tool for gene annotation evaluation that scores annotations based on their integrity. I selected the first annotation, which had a 78% coverage score. The GAEVAL summary is shown below.

I then clicked the “view in genome browser” button and from there selected yrGATE from the drop-down “annotations” menu. I selected various exons from the Evidence Plot and examined the integrity scores. When two nonconsecutive exons are selected, the integrity score and exon sequence coverage are lower than for consecutive exons.

To find open reading frames (ORFs), I clicked on ORF finder. This brought up the screen shown below.

Note that the ORF in magenta is the selected ORF. Once I picked an ORF, I clicked on “Select ORF for annotation”. This entered the protein sequence for the specific ORF into the yrGATE annotation tool. From there, I ran a blastp search in which no putative conserved domains were detected.
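The idea behind an ORF finder can be sketched in a few lines: scan all six reading frames for a start codon (ATG) followed in frame by a stop codon, and translate what lies between. The code below is a minimal sketch using the standard genetic code. It is not the algorithm used by the yrGATE ORF finder, and the function names, the `min_codons` parameter, and the short test sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=2):
    """Find ATG-to-stop ORFs in all six frames.

    Returns tuples (strand, frame, start, end, protein), longest protein first.
    Coordinates are 0-based positions on the scanned strand; `end` is the index
    just past the stop codon. ORFs that run off the end without a stop codon
    are not reported.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein, start = None, None  # protein is None while outside an ORF
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                residue = CODON_TABLE.get(codon, "X")  # X for ambiguous codons
                if protein is None:
                    if codon == "ATG":
                        protein, start = ["M"], pos
                elif residue == "*":
                    if len(protein) >= min_codons:
                        orfs.append((strand, frame, start, pos + 3, "".join(protein)))
                    protein = None
                else:
                    protein.append(residue)
    return sorted(orfs, key=lambda orf: len(orf[4]), reverse=True)


# Short made-up sequence containing one complete ORF (ATG AAA TTT GGG TAG).
for orf in find_orfs("CCATGAAATTTGGGTAGCC"):
    print(orf)
```

The translated protein of a chosen ORF is what would then be passed on to a blastp search.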

Comparative Genomics

CoGe is a tool to compare gene products throughout the genomes of different species. Once at the site, I selected “Organism view”, entered Drosophila melanogaster, and clicked “Launch genome viewer” to get to the screen shown below.

With this viewer, you can look at the sequence at any given location. Also, after your organism is selected, you can select different banked genomes and datasets based on preference.

Also on the CoGe site, you can perform a CoGeBlast search. I selected Drosophila and human.

Before the Blast is run, there are various parameters you can change. Also, you must enter a query sequence. I entered MCM and did not get any hits.

Gene expression analyses

The NCBI Gene Expression Omnibus (GEO) site stores public gene expression experiments. According to their site, GEO “is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles”.

I went to the GEO site and entered Minichromosome maintenance. There was one hit in the GEO DataSets Database, which I selected. I then scrolled down and clicked “Analyze with GEO2R”. Two different groups showed up, which I labeled “control” and “experiment”, as shown below.

I then scrolled down and clicked on “view distribution”, which gave me the graphical data shown below.

Motif finding in DNA and protein sequences

The XXmotif site allows users to enter a nucleotide or amino acid sequence and search for motifs. I entered the trypanosome nucleotide sequence and received the following data:

There were a total of 30 motif hits. The size of each nucleotide letter represents the abundance of that particular nucleotide at that position throughout the total sequences banked.
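To show where the letter sizes in such a motif picture come from, the sketch below tabulates, for each position of a set of aligned motif occurrences, the fraction of sequences carrying each nucleotide. In a standard sequence logo the letter heights are additionally scaled by the information content of the position, so this covers only the frequency part. The function name and the four example sites are made up for illustration and are not XXmotif output.

```python
from collections import Counter


def position_frequencies(sites, alphabet="ACGT"):
    """Return one dict per motif position mapping each base to its relative frequency."""
    if not sites:
        raise ValueError("need at least one motif occurrence")
    if len({len(site) for site in sites}) != 1:
        raise ValueError("all motif occurrences must have the same length")
    frequencies = []
    for column in zip(*sites):  # iterate over motif positions
        counts = Counter(column)
        frequencies.append({base: counts[base] / len(sites) for base in alphabet})
    return frequencies


# Hypothetical aligned occurrences of one motif.
example_sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
for position, freqs in enumerate(position_frequencies(example_sites), start=1):
    print(position, freqs)
```

A position where one base has a frequency near 1.0 would be drawn with a single large letter, while a position with mixed frequencies would show several smaller letters.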

VB Comment: Your report is a bit sparse on conclusions. What do the various program outputs tell you?

b2gof14/students/taynick/final_project.txt · Last modified: 2014/12/18 10:50 by vbrendel
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International