NCBI Sequence Alignment

One of the most common uses of NCBI's BLAST tool is to align a sequence against NCBI's own database of sequences, but it can also compare a sequence to sequences outside that database. The NCBI Tutorial explains how to do this with two sequences that the user inputs, which allows users to compare sequences before they are added to the NCBI database.
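
As a rough illustration of what comparing two user-supplied sequences involves, the sketch below scores the best local alignment between two sequences with the Smith-Waterman recurrence. BLAST uses a faster heuristic, so this is not NCBI's implementation. The function name and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b."""
    # prev holds the previous row of the dynamic-programming matrix.
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # character of a aligned to a gap in b
            left = curr[j - 1] + gap  # character of b aligned to a gap in a
            cell = max(0, diag, up, left)  # a local score never drops below 0
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Two toy sequences that share only the core "ACGT".
print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

Only the score is returned here; recovering the aligned residues would need the full matrix and a traceback step.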

MSMEG_3240

The protein MSMEG_3240, from Mycobacterium smegmatis, shares several domains with response regulators associated with nitrogen sensing, including a receiver domain and a C-terminal DNA-binding domain. This is shown in the graphic display of the genome from NCBI's gene database featured below.

The nucleotide and amino acid sequences are shown below:

Nucleotide:
GTGATGACCGATGCGGCACCCACGGTGATGGTGGTCGACGACCACCCGATCTGGCGAGATGCCGTCGCCC
GCGACCTCGCCGACGACGGTTTCGACGTCGTCGCCACGGCCGACGGCGTCGCGTCGGCGTCCCGCCGTGC
CGCGGTGGTCCGCCCCGACGTGGTCCTGATGGACATGCGCCTCGGCGACGGTTCCGGGGCTCAGGCCACC
GCGGAGGTGCTCGCGGTCTCACCGCGGTCGCGCGTGCTGGTGCTGTCGGCCTCCGACGAACGCGACGACG
TGCTGCAGGCGGTCAAGGCAGGCGCCACGGGATATCTGGTGAAGAGCGCATCGAGAACCGAACTCGCCGA
CGCGGTCCGCGCCACCGCGGAGGGCCGCGCGGTCTTCACCCCCGGTCTGGCGGGACTGGTGCTGGGGGAG
TATCGGCGCATCGCGCAACAACCGGCACAGGAGGGGCCCGCGACGCCCACCCTCACCGAACGCGAGACCG
AGATCCTGCGGTATGTGGCGAAAGGCCTGACGGCCAAACAGATCGCCGCGCGTCTTTCGCTGAGTCACCG
CACCGTGGAGAACCACGTGCAGGCGACGTTCCGCAAGCTCCAGGTCGCCAACCGGGTCGAACTCGCCCGC
TACGCGATAGAACACGGGCTGGACGAGTAG

Amino Acid:
MMTDAAPTVMVVDDHPIWRDAVARDLADDGFDVVATADGVASASRRAAVVRPDVVLMDMRLGDGSGAQAT
AEVLAVSPRSRVLVLSASDERDDVLQAVKAGATGYLVKSASRTELADAVRATAEGRAVFTPGLAGLVLGE
YRRIAQQPAQEGPATPTLTERETEILRYVAKGLTAKQIAARLSLSHRTVENHVQATFRKLQVANRVELAR
YAIEHGLDE
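
The relationship between the two sequences above can be checked with a short translation sketch. It assumes the bacterial genetic code (NCBI translation table 11), in which a GTG in the first codon position is read as methionine; this is why the protein begins with M even though the gene begins with GTG. The function name and the restriction to three start codons are assumptions made for this example.

```python
from itertools import product

# Translation table 11 uses the standard codon-to-amino-acid mapping;
# codons are enumerated in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Assumption for this sketch: only the three most common bacterial start codons.
START_CODONS = {"ATG", "GTG", "TTG"}


def translate_cds(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if i == 0 and codon in START_CODONS:
            protein.append("M")  # any start codon initiates with methionine
            continue
        aa = CODON_TABLE[codon]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# First six codons of the MSMEG_3240 nucleotide sequence listed above.
print(translate_cds("GTGATGACCGATGCGGCA"))
```

The printed peptide can be compared with the first residues of the amino acid sequence above.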

The structure of this protein is not known; however, it has 35% sequence similarity to the protein DevR from Mycobacterium smegmatis, so it can be compared to that protein with Swiss Model. The orange sections of the image below are areas predicted to have little structural similarity to MSMEG_3240, and the blue areas are predicted to have high similarity. While most of MSMEG_3240 has a high degree of similarity to DevR, the alpha helix and turns near the top of the image diverge significantly from DevR.

When the sequence for MSMEG_3240 from NCBI's gene database is BLASTed against the NCBI database, a number of proteins with sequence similarity are found, most of them from various Mycobacterium strains. When the first ten results from different species are aligned with the Mobyl @Pasteur clustalw-multialign tool and the resulting multiple sequence alignment is input into the ClustalW2 Phylogeny program, the following phylogenetic tree is generated.
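
Picking the first ten results from different species can be done by hand, but a small sketch makes the rule explicit. It assumes the BLAST hits have already been reduced to (accession, species) pairs in rank order; that input format, the function name, and the placeholder identifiers in the usage lines are hypothetical and not something BLAST produces directly.

```python
def first_hits_per_species(hits, limit=10):
    """Keep the best-ranked hit for each species, up to `limit` species.

    `hits` is an iterable of (accession, species) pairs already sorted
    from best to worst BLAST score.
    """
    seen = set()
    selected = []
    for accession, species in hits:
        if species in seen:
            continue  # a better hit from this species was already kept
        seen.add(species)
        selected.append((accession, species))
        if len(selected) == limit:
            break
    return selected


# Placeholder identifiers, not real accessions.
toy_hits = [
    ("hit_1", "species A"),
    ("hit_2", "species A"),
    ("hit_3", "species B"),
]
print(first_hits_per_species(toy_hits))
```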

When the amino acid sequence is put into the genome.jp MOTIF finder, the motifs shown below are found. They are primarily helix-turn-helix motifs, which bind DNA. These DNA-binding motifs, combined with a putative response regulator domain, indicate that the protein may be the response regulator part of a two-component regulatory system. MSMEG_3240 would likely activate the expression of certain genes when its as yet unidentified sensor detects some indicator of a hypoxic environment, as DevR does in Mycobacterium tuberculosis.

The sequence for MSMEG_3240 from NCBI's gene database was taken along with the nucleotides from 3 kbp upstream to 7 kbp downstream of the gene, and this sequence was input into Augustus and GeneMarkS. Augustus was run with the S. aureus organism option, as it is the only Gram-positive choice and M. smegmatis is most similar to Gram-positive bacteria, with genes reported on both strands and with few alternative transcripts. With GeneMarkS the settings used were Prokaryotic, LST, and protein sequence output.
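
The flanking region described above can be expressed as simple coordinate arithmetic. The sketch below assumes a plus-strand gene with 1-based inclusive coordinates; the function names and the toy genome are hypothetical. With a 660-nucleotide gene and full flanks the window is 3000 + 660 + 7000 = 10660 nucleotides long, which agrees with the sequence length reported by Augustus, and the gene occupies window positions 3001 to 3660.

```python
def extract_window(genome, gene_start, gene_end, upstream=3000, downstream=7000):
    """Return (window, offset) around a plus-strand gene.

    Coordinates are 1-based and inclusive. `offset` is the genome
    position of the first base in the returned window.
    """
    left = max(1, gene_start - upstream)
    right = min(len(genome), gene_end + downstream)
    return genome[left - 1:right], left


def to_window_coords(position, offset):
    """Convert a 1-based genome position to a 1-based window position."""
    return position - offset + 1


# Toy genome of 20000 placeholder bases with a 660 nt gene at 5001..5660.
genome = "N" * 20000
window, offset = extract_window(genome, 5001, 5660)
print(len(window), to_window_coords(5001, offset), to_window_coords(5660, offset))
```

For a gene on the minus strand the upstream and downstream flanks would need to be swapped.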

Below are the outputs from Augustus and GeneMarkS, respectively. Augustus did not predict the MSMEG_3240 open reading frame; its gene g3 overlaps the end of the gene but lies on the opposite strand. GeneMarkS came closer: its gene 5 (3004 to 3660, 218 amino acids) is identical to MSMEG_3240 except that it starts at the ATG one codon downstream of the GTG start, so it lacks the first methionine. These discrepancies are not particularly surprising. Neither program used a model trained on Mycobacterium: Augustus used its S. aureus parameters and GeneMarkS used a heuristic model. In addition, Mycobacterium are unusual even among bacteria, so it is unsurprising that gene prediction in them may be less reliable. Finally, MSMEG_3240 uses a GTG (valine) start codon, which the programs may not be as good at predicting, although Augustus did predict several other open reading frames with GTG start codons.

# This output was generated with AUGUSTUS (version 3.1.0).
# AUGUSTUS is a gene prediction tool written by Mario Stanke (mario.stanke@uni-greifswald.de),
# Oliver Keller, Stefanie König and Lizzy Gerischer.
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# No extrinsic information on sequences given.
# Initialising the parameters using config directory /data/www/augustus/augustus/config/ ...
# s_aureus version. Using species specific transition matrix: /data/www/augustus/augustus/config/species/s_aureus/s_aureus_trans_shadow_bacterium.pbl
# Using species specific overlap length distribution: /data/www/augustus/augustus/config/species/s_aureus/s_aureus_ovlp_len.pbl
# admissible start codons and their probabilities: ATA(0), ATC(0), ATG(0.834), ATT(0), CTG(0), GTG(0.0746), TTG(0.091)
# Looks like /data/www/augustus/tmp/AUG-1753485020/input.fa is in fasta format.
# We have hints for 0 sequences and for 0 of the sequences in the input set.
#
# ----- prediction on sequence number 1 (length = 10660, name = unnamed-1) -----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 1 on both strands
# start gene g1
unnamed-1	AUGUSTUS	gene	312	1028	0.86	+	.	g1
unnamed-1	AUGUSTUS	transcript	312	1028	0.86	+	.	g1.t1
unnamed-1	AUGUSTUS	start_codon	312	314	.	+	0	transcript_id "g1.t1"; gene_id "g1";
unnamed-1	AUGUSTUS	single	312	1028	0.86	+	0	transcript_id "g1.t1"; gene_id "g1";
unnamed-1	AUGUSTUS	CDS	312	1028	0.86	+	0	transcript_id "g1.t1"; gene_id "g1";
unnamed-1	AUGUSTUS	stop_codon	1026	1028	.	+	0	transcript_id "g1.t1"; gene_id "g1";
# coding sequence = [gtgcatcaaccatctggagaccgtgagtgcggggcggctctacgtcgacggacaactcgtcggctaccgcgaacgcggc
# ggcaaactgcacgagatgaagccgtccgacgtggccaaacagcgtcgcgacgtcggaatggtgttccagcacttcaacttgttcccgcaccgcaccgc
# gctggccaacatcatcgaggcgcccatcaaggtcaagggcgtcaagaagaaggaggccatcgaccgggcccgcgatctgctcaaccaggtgggtctgg
# cggacaaggccgaggcctacccggcgcagctgtcgggtggtcagcaacagcgcgtggccatcgcgcgcgcgctcgcgatgaaccccaagctcatgctg
# ttcgacgagcccacctcggcgctggaccccgaactcgtcggcgatgtcctcggcgtgatgaagaagctcgcctccgagggcatgaccatggtggtggt
# cactcacgagatgggtttcgcgcgcgaggtcgccgacaagctggtcttcatggacggcggcgtcatcgtcgagagcggcgatccccgcgaggtcatgg
# caaacccgaaacacgaacggacaaaagccttcctgtccaaggtgatgtagcccgtcgggtagcgtcggacgggtggcgacaactcgttccgacgtctt
# catcagcaccgcggagctcatccagctgctcgcggcaggcggcccggtga]
# protein sequence = [MHQPSGDRECGAALRRRTTRRLPRTRRQTARDEAVRRGQTASRRRNGVPALQLVPAPHRAGQHHRGAHQGQGRQEEGG
# HRPGPRSAQPGGSGGQGRGLPGAAVGWSATARGHRARARDEPQAHAVRRAHLGAGPRTRRRCPRRDEEARLRGHDHGGGHSRDGFRARGRRQAGLHGR
# RRHRRERRSPRGHGKPETRTDKSLPVQGDVARRVASDGWRQLVPTSSSAPRSSSSCSRQAAR]
# end gene g1
###
# start gene g2
unnamed-1	AUGUSTUS	gene	2035	2433	0.81	+	.	g2
unnamed-1	AUGUSTUS	transcript	2035	2433	0.81	+	.	g2.t1
unnamed-1	AUGUSTUS	start_codon	2035	2037	.	+	0	transcript_id "g2.t1"; gene_id "g2";
unnamed-1	AUGUSTUS	single	2035	2433	0.81	+	0	transcript_id "g2.t1"; gene_id "g2";
unnamed-1	AUGUSTUS	CDS	2035	2433	0.81	+	0	transcript_id "g2.t1"; gene_id "g2";
unnamed-1	AUGUSTUS	stop_codon	2431	2433	.	+	0	transcript_id "g2.t1"; gene_id "g2";
# coding sequence = [atgcgcgacggtctatctgaccgggtcccgcaggcgcgggacgtgggtgatcgccgagctcgtcgtggtggtcgcgctg
# atgctgtcgacggagctggtggcgtccgaacagtggatcgccgacaaccagtcctggccgacgacgctgtgggcgaccaacgccaccatctcggtggc
# gttgcacttcggcccgatcgggggcatgtccgccgggctcgcggtgatggcgacggtcgcgctgctcaagggccatgtgagcgtcaacctcggccgca
# acgccaccatcgtgatcgagctcgcggtcggtctggctgtcgggatggccgcgcagaccgcgcggcgcgcgcacgccgaactggaacgcgccgtgcga
# ctctcggcggccctggaggaacgtga]
# protein sequence = [MRDGLSDRVPQARDVGDRRARRGGRADAVDGAGGVRTVDRRQPVLADDAVGDQRHHLGGVALRPDRGHVRRARGDGDG
# RAAQGPCERQPRPQRHHRDRARGRSGCRDGRADRAARARRTGTRRATLGGPGGT]
# end gene g2
###
# start gene g3
unnamed-1	AUGUSTUS	gene	3473	3670	0.7	-	.	g3
unnamed-1	AUGUSTUS	transcript	3473	3670	0.7	-	.	g3.t1
unnamed-1	AUGUSTUS	stop_codon	3473	3475	.	-	0	transcript_id "g3.t1"; gene_id "g3";
unnamed-1	AUGUSTUS	single	3473	3670	0.7	-	0	transcript_id "g3.t1"; gene_id "g3";
unnamed-1	AUGUSTUS	CDS	3473	3670	0.7	-	0	transcript_id "g3.t1"; gene_id "g3";
unnamed-1	AUGUSTUS	start_codon	3668	3670	.	-	0	transcript_id "g3.t1"; gene_id "g3";
# coding sequence = [atgagtaggactactcgtccagcccgtgttctatcgcgtagcgggcgagttcgacccggttggcgacctggagcttgcg
# gaacgtcgcctgcacgtggttctccacggtgcggtgactcagcgaaagacgcgcggcgatctgtttggccgtcaggcctttcgccacataccgcagga
# tctcggtctcgcgttcggtga]
# protein sequence = [MSRTTRPARVLSRSGRVRPGWRPGACGTSPARGSPRCGDSAKDARRSVWPSGLSPHTAGSRSRVR]
# end gene g3
###
# start gene g4
unnamed-1	AUGUSTUS	gene	3859	5580	0.86	+	.	g4
unnamed-1	AUGUSTUS	transcript	3859	5580	0.86	+	.	g4.t1
unnamed-1	AUGUSTUS	start_codon	3859	3861	.	+	0	transcript_id "g4.t1"; gene_id "g4";
unnamed-1	AUGUSTUS	single	3859	5580	0.86	+	0	transcript_id "g4.t1"; gene_id "g4";
unnamed-1	AUGUSTUS	CDS	3859	5580	0.86	+	0	transcript_id "g4.t1"; gene_id "g4";
unnamed-1	AUGUSTUS	stop_codon	5578	5580	.	+	0	transcript_id "g4.t1"; gene_id "g4";
# coding sequence = [gtggtccccgcaggcgcctcagcagccgccccggcagtggaccccgcaaccggttgcgccggtcgccccggtggccgtg
# cctgcccggcaaccggacaccccgcccaagccccgctcggaagggtggatcggcaaggtactggcgatggccggtgtcgcagtgacgctcgtcggcgt
# ggtgtcgctgctggtgctggccgcacaggccggcatcctgcgccccgaggtgcgagtggcggcaggggcggccctggccgtggcgctggtggccgtcg
# cgatgtggctcgaccgccgtcccggtggccgggtcggggcggtcgcactcgcggcgaccggtgtggccgcggcgtacatggacgtgatcgcggtgacg
# gccatctacgaatgggtcccggccccggtgggtttggcgctggccggggtcgtcggggcctgcggcctgatgctggcgcggtggtggggctccgagca
# actcgggttgctggtgttcgtgccgctgatcgcgttggcgccggtgatcaccgacggcgtgacgctgctgctgatcggtttcatgctggcgctgtcgg
# cggcgtcgcttccggtgcagttcggccgggactggttctggctgcacgccgcgcgcacggccgcggtgacgatcccgctgctcgtcgcactggtctcg
# gccgcgatcggcgggcgcgaggatctccggctggccctggtgtgtgcgctggcggcggtgcctgcgctcgtgggcggtgtgacggtgtcgcgcttcag
# caccaggccggtggcgaccaccgtggtctcggcgctgggaacggttccgctgctgtgtgtttcggccaccgcggaccgcgtgcctgccgtcctgctga
# tcgccgggctggcggccgcggcgctggccgtcgcggccatcggggaccggctgccgggcatcggggcgccggtgcgccgggtgtgggccgcaacctcg
# gcggcggccgctctgatcgccgtgctcgtggcgttcgacggcacggtggccgcgccggtgctgctggcgatgtcgatcgcgatcgcggtgggcgggca
# gcgcgatccggtcgggcgttgcgccgcaatcggtttcgcgttgatcggcgcgatgttctacctcgaccacgcggcgccggccatgctcgtcgaggcga
# caccgctcgacggcccgactgtcgcctcggtcgtgatcggcagtgtgatgctgatcggtgccgcggccgcgaacggctggacatggtcgcggacggtg
# tccgataccgaggttgtgcgcctggtgtgggtcgcggtatcagcggtgatcggatacgcggcgaccgcgctgaccgtcacggtcggggtggcgctggg
# cggggcggaggtgggctttttggccgggcatatggccgcaacgctcagttggatcgtggccgcggccctggcgttcggatacgccgcacgacgtccgg
# gcgcatcgcggtcggtgctgatcggcgggggactggtgctggtggccgcggcgacgggcaagctgttcctgttcgacctcggcacgctggacggcatg
# taccgcgtcgtgctgttcatcgtgggcgggctggtgctgctgggaatgggcgcgggttatgcacggtttctggcccagcagtccgacggccggtcgga
# tgcgcaaccgggaacggatcacgaggcccactcgacgtgacgtggcagggctcacaaagagcaaatttggaatga]
# protein sequence = [MVPAGASAAAPAVDPATGCAGRPGGRACPATGHPAQAPLGRVDRQGTGDGRCRSDARRRGVAAGAGRTGRHPAPRGAS
# GGRGGPGRGAGGRRDVARPPSRWPGRGGRTRGDRCGRGVHGRDRGDGHLRMGPGPGGFGAGRGRRGLRPDAGAVVGLRATRVAGVRAADRVGAGDHRR
# RDAAADRFHAGAVGGVASGAVRPGLVLAARRAHGRGDDPAARRTGLGRDRRARGSPAGPGVCAGGGACARGRCDGVALQHQAGGDHRGLGAGNGSAAV
# CFGHRGPRACRPADRRAGGRGAGRRGHRGPAAGHRGAGAPGVGRNLGGGRSDRRARGVRRHGGRAGAAGDVDRDRGGRAARSGRALRRNRFRVDRRDV
# LPRPRGAGHARRGDTARRPDCRLGRDRQCDADRCRGRERLDMVADGVRYRGCAPGVGRGISGDRIRGDRADRHGRGGAGRGGGGLFGRAYGRNAQLDR
# GRGPGVRIRRTTSGRIAVGADRRGTGAGGRGDGQAVPVRPRHAGRHVPRRAVHRGRAGAAGNGRGLCTVSGPAVRRPVGCATGNGSRGPLDVTWQGSQ
# RANLE]
# end gene g4
###
# start gene g5
unnamed-1	AUGUSTUS	gene	5630	6031	1	-	.	g5
unnamed-1	AUGUSTUS	transcript	5630	6031	1	-	.	g5.t1
unnamed-1	AUGUSTUS	stop_codon	5630	5632	.	-	0	transcript_id "g5.t1"; gene_id "g5";
unnamed-1	AUGUSTUS	single	5630	6031	1	-	0	transcript_id "g5.t1"; gene_id "g5";
unnamed-1	AUGUSTUS	CDS	5630	6031	1	-	0	transcript_id "g5.t1"; gene_id "g5";
unnamed-1	AUGUSTUS	start_codon	6029	6031	.	-	0	transcript_id "g5.t1"; gene_id "g5";
# coding sequence = [gtgcagcagatcggccgtgctgggatcctcggcgtccaccgcgtcgtggacgcgccggatggtgtcgaccgtggcgttg
# atacgggtggtgatgaggtcgacgacatcggctgtgctgcgctcgaacgcggggaattccggcagcgtcgtggtggccgccacggtgtcggaacggcc
# gtcgggaacggcgtccagcgcgcgcatccgttcggcgatcgtgtcgctgccctcgcgcgcgaagtcgaccaactcgtcgagctgcaggtgcaggtcac
# ggaagttgctgcccaccacgttccaatgggcctgtttgccctgcagggacagctcgatcaggtcgacgagaaccttctggaggttgccgccgaactcc
# ggtgtggcatggaaaccttggatatctga]
# protein sequence = [MQQIGRAGILGVHRVVDAPDGVDRGVDTGGDEVDDIGCAALERGEFRQRRGGRHGVGTAVGNGVQRAHPFGDRVAALA
# REVDQLVELQVQVTEVAAHHVPMGLFALQGQLDQVDENLLEVAAELRCGMETLDI]
# end gene g5
###
# start gene g6
unnamed-1	AUGUSTUS	gene	6252	6842	0.79	+	.	g6
unnamed-1	AUGUSTUS	transcript	6252	6842	0.79	+	.	g6.t1
unnamed-1	AUGUSTUS	start_codon	6252	6254	.	+	0	transcript_id "g6.t1"; gene_id "g6";
unnamed-1	AUGUSTUS	single	6252	6842	0.79	+	0	transcript_id "g6.t1"; gene_id "g6";
unnamed-1	AUGUSTUS	CDS	6252	6842	0.79	+	0	transcript_id "g6.t1"; gene_id "g6";
unnamed-1	AUGUSTUS	stop_codon	6840	6842	.	+	0	transcript_id "g6.t1"; gene_id "g6";
# coding sequence = [gtggcgcaaacggaggcacatgtcgtcgggcacctggatgcgcccttcgacatcggtgctctccatacgcgaggcgaca
# ttgaccgcgtcaccccacacgtcgtagaagaaccggcgcgcaccgaccaccccggccaccaccggtccggcggccaggccgatccgcagtggtacgcg
# cctgccttcgggatcggtgagatcggcgacggccgcggccatgtcgagcgcgagtgccgcgagcgcctcggcgtggtcggtgcggggctcggggatgc
# cgccgaccaccatgtacgaatcgccgctggtcttgaccttctccaggcagtgctgctcgacgagcgcatcgagatcggtgtagagcgtgtcgaggaac
# cgcaccagatcacacggcgcggtctcgctggcgcgcttggtgtagccggcgatgtcggcgaacaggatcgaggcgtcgtcgtatcggtcggcgatgat
# ggtgcgcgccgggtctttgagccgtgtcgcgatcgtggcgggaagaatgttcgcgagcaacttctccgagcgctggtactcggcctccatcgcgtcct
# cggcgcgcgcgatctcgcgtag]
# protein sequence = [MAQTEAHVVGHLDAPFDIGALHTRGDIDRVTPHVVEEPARTDHPGHHRSGGQADPQWYAPAFGIGEIGDGRGHVEREC
# RERLGVVGAGLGDAADHHVRIAAGLDLLQAVLLDERIEIGVERVEEPHQITRRGLAGALGVAGDVGEQDRGVVVSVGDDGARRVFEPCRDRGGKNVRE
# QLLRALVLGLHRVLGARDLA]
# end gene g6
###
# command line:
# /data/www/augustus/augustus/bin/augustus --species=s_aureus --strand=both --singlestrand=false --genemodel=partial --codingseq=on --sample=100 --keep_viterbi=true --alternatives-from-sampling=true --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=2 /data/www/augustus/tmp/AUG-1753485020/input.fa --exonnames=on
GeneMark.hmm PROKARYOTIC (Version 3.26)
Date: Tue Oct 13 14:01:24 2015
Sequence file name: seq.fna
Model file name: GeneMark_hmm_heuristic.mod
RBS: false
Model information: Heuristic_model_for_genetic_code_11_and_GC_69

FASTA definition line: empty-fasta-def-line
Predicted genes
   Gene    Strand    LeftEnd    RightEnd       Gene     Class
    #                                         Length
    1        +          <2         148          147        1
    2        +         145         930          786        1
    3        +         953        1813          861        1
    4        +        1868        3004         1137        1
    5        +        3004        3660          657        1
    6        +        3704        5545         1842        1
    7        +        5609        6094          486        1
    8        -        6099        6911          813        1
    9        +        8173        8310          138        1
   10        +        8643        9269          627        1
   11        +        9547      >10659         1113        1




>gene_1|GeneMark.hmm|48_aa|+|2|148	>empty-fasta-def-line
TWYLVITSILMVGQYYLERYYSRGASRKLTTKQLEALAKAQTVGEAHP

>gene_2|GeneMark.hmm|261_aa|+|145|930	>empty-fasta-def-line
VTTAEATSGDYMVRAESVCKNFGALKVLRGVTLNVSKGQVLVLVGPSGSGKSTFLRCINH
LETVSAGRLYVDGQLVGYRERGGKLHEMKPSDVAKQRRDVGMVFQHFNLFPHRTALANII
EAPIKVKGVKKKEAIDRARDLLNQVGLADKAEAYPAQLSGGQQQRVAIARALAMNPKLML
FDEPTSALDPELVGDVLGVMKKLASEGMTMVVVTHEMGFAREVADKLVFMDGGVIVESGD
PREVMANPKHERTKAFLSKVM

>gene_3|GeneMark.hmm|286_aa|+|953|1813	>empty-fasta-def-line
VATTRSDVFISTAELIQLLAAGGPVTLLDVRWTLAEPNGEQAYLDGHLPGAVYVSLDDEL
ADHTVRGRGRHPLPSGRHLEAAARRWGVRDGVPTVVYDDWNRAGSARAWWCLTAAGISGV
RILDGGLGAWVAGGGGVETGPVTPEPGDVRVVHDDLYRGALPTLTADDVQSAAALIDARA
PERFRGEVEPVDPVAGHVPGAVNLPSTGLLNPDGTLRDEAQVRALLADRGVDDTGDTAVG
AYCGSGVTAALTVAGLAAAGVDAALFPGSWSEWVCDPGRPVARGEK

>gene_4|GeneMark.hmm|378_aa|+|1868|3004	>empty-fasta-def-line
VQREPDPVTPLWRAAQGFRLLSCLYALGFHIAITDDLRRPVLGWVLFAGLIVWSAACATV
YLTGSRRRGTWVIAELVVVVALMLSTELVASEQWIADNQSWPTTLWATNATISVALHFGP
IGGMSAGLAVMATVALLKGHVSVNLGRNATIVIELAVGLAVGMAAQTARRAHAELERAVR
LSAALEERERLSRRVHDGAIQVLALVARRGREIGGETAKLAELAGEQERALRRLVSAADT
DTMAGPLTDVGALLRTRASDRVSVSVPAEPVLLDHPVARELFAAAENALDNVAAHAGADA
RAFVLLEDLGEEVTVSIRDDGVGIPEGRLAEAERQGRMGVAKSIVGRMDWLGGTAVLTTG
PDSGTEWELTVPRTRKGQ

>gene_5|GeneMark.hmm|218_aa|+|3004|3660	>empty-fasta-def-line
MTDAAPTVMVVDDHPIWRDAVARDLADDGFDVVATADGVASASRRAAVVRPDVVLMDMRL
GDGSGAQATAEVLAVSPRSRVLVLSASDERDDVLQAVKAGATGYLVKSASRTELADAVRA
TAEGRAVFTPGLAGLVLGEYRRIAQQPAQEGPATPTLTERETEILRYVAKGLTAKQIAAR
LSLSHRTVENHVQATFRKLQVANRVELARYAIEHGLDE

>gene_6|GeneMark.hmm|613_aa|+|3704|5545	>empty-fasta-def-line
MTEPQRAVIARVTADLTAVSAYLNRMAGDLATLDRLVAQQSAAPRPEAVAPQWSPQAPQQ
PPRQWTPQPVAPVAPVAVPARQPDTPPKPRSEGWIGKVLAMAGVAVTLVGVVSLLVLAAQ
AGILRPEVRVAAGAALAVALVAVAMWLDRRPGGRVGAVALAATGVAAAYMDVIAVTAIYE
WVPAPVGLALAGVVGACGLMLARWWGSEQLGLLVFVPLIALAPVITDGVTLLLIGFMLAL
SAASLPVQFGRDWFWLHAARTAAVTIPLLVALVSAAIGGREDLRLALVCALAAVPALVGG
VTVSRFSTRPVATTVVSALGTVPLLCVSATADRVPAVLLIAGLAAAALAVAAIGDRLPGI
GAPVRRVWAATSAAAALIAVLVAFDGTVAAPVLLAMSIAIAVGGQRDPVGRCAAIGFALI
GAMFYLDHAAPAMLVEATPLDGPTVASVVIGSVMLIGAAAANGWTWSRTVSDTEVVRLVW
VAVSAVIGYAATALTVTVGVALGGAEVGFLAGHMAATLSWIVAAALAFGYAARRPGASRS
VLIGGGLVLVAAATGKLFLFDLGTLDGMYRVVLFIVGGLVLLGMGAGYARFLAQQSDGRS
DAQPGTDHEAHST

>gene_7|GeneMark.hmm|161_aa|+|5609|6094	>empty-fasta-def-line
MSARRTESDIQGFHATPEFGGNLQKVLVDLIELSLQGKQAHWNVVGSNFRDLHLQLDELV
DFAREGSDTIAERMRALDAVPDGRSDTVAATTTLPEFPAFERSTADVVDLITTRINATVD
TIRRVHDAVDAEDPSTADLLHGLIDGLEKQAWLIRSENRKV

>gene_8|GeneMark.hmm|270_aa|-|6099|6911	>empty-fasta-def-line
LSVGFVISTVSAAVMVVATVWSALREIARAEDAMEAEYQRSEKLLANILPATIATRLKDP
ARTIIADRYDDASILFADIAGYTKRASETAPCDLVRFLDTLYTDLDALVEQHCLEKVKTS
GDSYMVVGGIPEPRTDHAEALAALALDMAAAVADLTDPEGRRVPLRIGLAAGPVVAGVVG
ARRFFYDVWGDAVNVASRMESTDVEGRIQVPDDMCLRLRHAFVLEERGEVEVKGKGVMRT
WYLVGRRDGERAPLRTGDARSESVGNPAGG

>gene_9|GeneMark.hmm|45_aa|+|8173|8310	>empty-fasta-def-line
MDLLFAVLPGMAGLVLLTAAGGAIGVRHARAAQAVPAPQIARFMA

>gene_10|GeneMark.hmm|208_aa|+|8643|9269	>empty-fasta-def-line
MTGSTTDADRPRRVLIAEDEALIRLDLAEMLREEGYEVVGEAGDGQEAVEMAESLRPDLV
IMDVKMPRRDGIDAASEIASKRIAPIVILTAFSQRELVERARDAGAMAYLVKPFNINDLV
PAIEVAVSRFAELSALETEVATLSERLETRKLVERAKGLLQAKHKMTEPEAFKWIQRAAM
DRRTTMKRVAEVVLETLDDTKQAPAPEQ

>gene_11|GeneMark.hmm|371_aa|+|9547|10659	>empty-fasta-def-line
VVALAIAGCNQSTPEEEAAQTDLKIVEKVQIDENGAEVTGAGDVTPADPAGDGNAVCPPV
MIAMMGALNGPDAALGINIKNGVQMAIDKHNAANAQCQVQLKAFDTEGDPQKATGVAPQI
VDEPFIIGVVGPAFSGETKATGDVFNQAGLVATTASATNVQLSENGWRTFFRGLANDGVQ
GPSVANYMKNTLENKKVCVVDDSTDYGLGLAEAVRTTLGPVADASCNISVKKGDKDFSAA
VTQIKGAAPDSVFYSGYYSEAAPFVQQLKDGGVEATFISADGTKDPEFVKQAGESSKGAL
LSCPCGPATAEFAEEYTQKFGQEPGTYSTEGYDLGTILLKGIDSGAITRADLLNYVRNYE
GQGVARKYQWT
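
The predicted intervals above can be compared with the expected position of MSMEG_3240 in a few lines. This is a minimal sketch: the reference interval 3001 to 3660 on the plus strand is an assumption derived from the 3 kbp upstream flank and the 660-nucleotide gene, and the function name and category labels are made up for the example. The predicted coordinates are copied from the Augustus and GeneMarkS outputs.

```python
def classify_prediction(pred, ref):
    """Compare two (start, end, strand) intervals, 1-based and inclusive."""
    p_start, p_end, p_strand = pred
    r_start, r_end, r_strand = ref
    if pred == ref:
        return "exact"
    overlaps = p_start <= r_end and r_start <= p_end
    if p_strand == r_strand:
        # The stop codon sits at the right end on '+', the left end on '-'.
        same_stop = p_end == r_end if p_strand == "+" else p_start == r_start
        if same_stop:
            return "same stop, different start"
        return "overlap, same strand" if overlaps else "none"
    return "overlap, opposite strand" if overlaps else "none"


# Assumed position of MSMEG_3240 within the 10660 nt input sequence.
REFERENCE = (3001, 3660, "+")

predictions = {
    "Augustus g2": (2035, 2433, "+"),
    "Augustus g3": (3473, 3670, "-"),
    "GeneMarkS gene 5": (3004, 3660, "+"),
}
for name, interval in predictions.items():
    print(name, "->", classify_prediction(interval, REFERENCE))
```

A prediction that shares the stop codon but not the start is in the same reading frame and differs only in the choice of start codon.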

I think there is a good organization and flow to the information you have about MSMEG 3240. -C. Adam Brown

SNPs in the Human Genome

Looking at single nucleotide polymorphisms (SNPs) is an effective way to analyze variation in genes. Given a large enough number of sequences, phenotypes can be correlated to specific nucleotide changes, allowing a greater understanding of what functions a gene performs. As there is no database for SNPs in Mycobacteria, I looked at the SNP rs1131769 in TMEM173, a gene on Homo sapiens chromosome 5 that encodes a transmembrane protein. The location of the gene and the mutation within the gene from 1000 Genomes are shown below.
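
To make the idea of variable positions concrete, the sketch below scans a set of aligned sequences and reports every column where more than one nucleotide occurs, together with the allele counts. The function name and the toy sequences are hypothetical; real SNP data such as the 1000 Genomes calls are distributed in other formats and would need parsing first.

```python
from collections import Counter


def variable_sites(aligned):
    """Return {position: Counter of alleles} for columns with more than one base.

    Positions are 1-based; all sequences must already be aligned to the
    same length.
    """
    if len({len(seq) for seq in aligned}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for pos, column in enumerate(zip(*aligned), start=1):
        counts = Counter(column)
        if len(counts) > 1:
            sites[pos] = counts
    return sites


# Toy alignment with a single variable column.
print(variable_sites(["ACGT", "ACAT", "ACGT"]))
```

Correlating such sites with phenotypes would additionally require a phenotype label for each sequence and a statistical test, which this sketch does not attempt.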

Predicting Introns and Exons

One important part of sequencing a genome is accurately predicting genes. With eukaryotes a problem arises: gene products are differentially spliced, resulting in different-sized transcripts that can confuse automated systems. In these cases it is often useful to have a human look at the sequence and make judgements on what the most highly conserved sequences are. PdomGDB provides a database of putative genes for Polistes dominula. This page gives a list of genes and information about them. Genes can be annotated by clicking on the genome view button and then clicking annotate on the page that opens.

b2gof15/students/jostrnat/start.txt · Last modified: 2015/11/03 19:30 by jostrnat
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International