Using bioinformatics to analyze porcine-derived small intestine submucosa extracellular matrix grafting material (SIS)

1.0 SIS and regenerative medicine

Small intestine submucosa

SIS is a grafting material derived from the inner layer of the small intestine.

The extracellular matrix

The extracellular matrix (ECM) was once thought of primarily as the “glue” that holds cells together. It is now known that the ECM is highly regulated and dynamic, and that it plays important roles in cell signaling, gene expression, and differentiation. While the ECM consists of many components, including elastins, glycoproteins, glycosaminoglycans, and growth factors, its most abundant proteins are collagens. Collagen makes up 25-35% of the protein content of mammals and comprises 28 known types. In humans, more than 90% of collagen is type I, while types II, III, IV, and V make up the majority of the remainder. Irregularities in collagen formation can cause a variety of disorders, including Ehlers-Danlos syndrome.

Collagen structure

Collagen exists as a trimer of alpha chains; type I collagen, for example, is composed of two alpha 1 chains and one alpha 2 chain. These alpha peptide chains largely consist of a repetitive motif, (glycine-X-Y)n, in which X is often proline and Y is often hydroxyproline. Any other amino acid can fill the X or Y positions not occupied by proline or hydroxyproline. The chains are hydroxylated at proline and lysine residues, and some of the resulting hydroxylysines are glycosylated with glucose and galactose, modifications that allow for cross-linking with other chains. Three chains then wind into a triple helix called procollagen. When procollagen is exported out of the cell, its loose ends are trimmed to form tropocollagen. Finally, strands of tropocollagen are covalently cross-linked through their lysine and hydroxylysine residues, forming collagen fibrils.
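The repeating glycine-X-Y motif is simple enough to check in code. The following is a minimal sketch, not part of the original analysis: the function name and the example sequences are hypothetical, and because the one-letter amino acid code has no symbol for hydroxyproline, both proline and hydroxyproline appear as "P".

```python
def gly_xy_fraction(sequence: str) -> float:
    """Return the fraction of residues covered by in-frame Gly-X-Y triplets.

    All three reading frames are tried and the best one is reported,
    because a peptide fragment may start at any position of the repeat.
    """
    seq = sequence.upper()
    if not seq:
        return 0.0
    best_hits = 0
    for frame in range(3):
        # Cut the sequence into complete, non-overlapping triplets.
        triplets = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        hits = sum(1 for triplet in triplets if triplet[0] == "G")
        best_hits = max(best_hits, hits)
    return 3 * best_hits / len(seq)


# Hypothetical sequences, for illustration only.
collagen_like = "GPPGAPGERGPP"
non_collagen = "MKTAYIAKQRQISFVK"
print(gly_xy_fraction(collagen_like))
print(gly_xy_fraction(non_collagen))
```

A collagen-like fragment scores close to 1, while an unrelated sequence scores close to 0.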

The extracellular matrix plays an essential role in the process of healing. The general process of healing involves four overlapping stages: hemostasis, inflammation, proliferation, and remodeling. The processes involved are highly regulated and involve many interacting proteins.

SIS and other tissue ECMs can be decellularized and used as grafting materials that speed the rate of healing, most notably when used to treat chronic wounds such as diabetic ulcers. These materials are thought to act by promoting cell migration and angiogenesis at the site of injury, soaking up overexpressed proteases, and releasing soluble growth factors as they become incorporated into the body. However, many of these materials are poorly characterized, and the identities of many proteins within ECM grafting materials are unknown.

Variance between acellular scaffoldings

ECM grafting materials generally allow wounds to heal more quickly; however, the tissue source and the treated area can have a profound impact on their efficacy. The structure and pore size of the matrix can also affect efficacy. In the figure below, the top row shows ECM materials derived from different tissues within the same species, while the bottom row shows ECM from the same tissue of different species. This illustrates the noticeable structural variance between both tissues and species.

A - porcine dermis B - porcine SIS C - human dermis D - bovine dermis

2.0 Initial analysis

ECM materials are difficult to study because of their insolubility. However, after proteolytic digestion, many of the proteins can be solubilized and identified using a bottom-up approach. For this experiment, three randomized samples of SIS were digested with trypsin and analyzed by mass spectrometry. The total ion current shows the signal produced by the samples as they elute from the HPLC.
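To illustrate what the tryptic digestion step produces, here is a minimal in-silico sketch. It assumes the common simplified rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). Real digests also contain missed cleavages, which this sketch ignores. The function name and the collagen-like test sequence are hypothetical.

```python
def tryptic_digest(sequence: str) -> list[str]:
    """Split a protein sequence into peptides using a simplified trypsin rule.

    Assumed rule: cleave C-terminal to K or R, except when the following
    residue is P. Missed cleavages are not modeled.
    """
    seq = sequence.upper()
    peptides = []
    start = 0
    for i, residue in enumerate(seq):
        is_last = i + 1 == len(seq)
        if residue in "KR" and not is_last and seq[i + 1] != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        # Whatever remains after the final cleavage site is the last peptide.
        peptides.append(seq[start:])
    return peptides


# Hypothetical collagen-like fragment.
print(tryptic_digest("GAKGERGPP"))
# Bradykinin (RPPGFSPFR), the internal standard used in this analysis.
print(tryptic_digest("RPPGFSPFR"))
```

Under this rule bradykinin comes back as a single peptide, because its only arginines sit at the termini and the N-terminal arginine is followed by proline.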

Total Ion Current

Bradykinin (peaks shown below) is used as an internal standard because it does not contain an internal lysine or arginine and will remain intact during tryptic digestion. The corresponding peaks are extracted from each sample, and their intensity is used to normalize the peak areas. Delays in retention time can also be found and accounted for using internal standards, although only a slight lag between samples was measured during this experiment.
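The normalization step can be sketched as follows. This is one plausible scheme, not necessarily the exact calculation used here: each sample is scaled so that its bradykinin intensity matches the mean bradykinin intensity across samples. The function name, sample names, and all numbers are hypothetical.

```python
def normalize_peak_areas(peak_areas, standard_intensities):
    """Scale every sample so its internal standard matches the mean standard.

    peak_areas: {sample: {peptide: raw_area}}
    standard_intensities: {sample: internal standard intensity}
    """
    reference = sum(standard_intensities.values()) / len(standard_intensities)
    normalized = {}
    for sample, areas in peak_areas.items():
        factor = reference / standard_intensities[sample]
        normalized[sample] = {pep: area * factor for pep, area in areas.items()}
    return normalized


# Hypothetical values: the same peptide measured in three samples whose
# internal standard signal differs because of injection variability.
standards = {"sample1": 100.0, "sample2": 200.0, "sample3": 300.0}
raw = {
    "sample1": {"peptideA": 50.0},
    "sample2": {"peptideA": 100.0},
    "sample3": {"peptideA": 150.0},
}
for sample, areas in normalize_peak_areas(raw, standards).items():
    print(sample, areas)
```

After scaling, the three samples report essentially the same area for the peptide, which is the behavior expected when the raw differences come only from instrument response.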

Identification of Proteins

Proteins are identified using SearchGUI. Mascot is traditionally used for most research publications; however, SearchGUI uses a DAS interface that searches multiple databases and tends to outperform Mascot.

Using databases to annotate protein identity

NCBI and UniProt

These sites are the most commonly used annotation databases, because each is a single curated database that is widely used in publications.

Other individual databases can also be found online. These often offer curated and compiled annotations. One such example is the PANTHER database, which was used initially to compile protein identities that were then cross-referenced with UniProt.

Distributed Annotation System (DAS) Enabled websites

DAS is a communication protocol that allows integration of multiple external sources. This is especially helpful in bioinformatics for compiling protein, gene, and genomic data. Rather than depending on a single bank for annotation information, DAS-enabled sites allow users to pull from a pool of sources.


NCBI and many other resources often lack entries for ECM proteins. DASty can be especially helpful when working with poorly annotated targets because it provides an overview of annotations across multiple databases in a single compact window. Below, one can see that UniProt was the only database that contained annotations for our protein of interest.

Other bioinformatic DAS-enabled sites include:

  1. Genomic, gene, and protein annotations - Ensembl
  2. Multiple sequence alignment editing - Jalview
  3. Matching protein annotations to 3D structure - SPICE
  4. Sequence and structure alignment - STRAP
  5. Drosophila-specific genes, genomes, and phenotypes - FlyBase

The number of DAS-based systems has grown non-linearly since the protocol was first developed

Data Sorting

Excel was used in combination with MZmine to extract raw peak areas from the mass spectrometry analysis and calculate relative quantitation. The peptide identities were then concatenated with the SearchGUI peptide identification output. Pivot tables, filters, and simple Excel functions were used to sort and concatenate these large data sets.

Let's take a look at collagen in a new page. In total, 103 unique peptide identities have been identified as belonging to the collagen family.

Using pivot tables, we can view the relative abundance of both the parent protein and peptide identified.
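For readers who prefer a script to a spreadsheet, the pivot-table summary can be approximated in a few lines. This is only a sketch of the idea, assuming each row holds a parent protein, a peptide, and a normalized peak area. The function name, protein labels, and areas below are hypothetical and are not results from this experiment.

```python
from collections import defaultdict


def relative_abundance(rows):
    """Sum peptide peak areas per parent protein and express them as fractions.

    rows: iterable of (parent_protein, peptide, peak_area) tuples.
    """
    totals = defaultdict(float)
    for protein, _peptide, area in rows:
        totals[protein] += area
    grand_total = sum(totals.values())
    return {protein: area / grand_total for protein, area in totals.items()}


# Hypothetical rows, for illustration only.
rows = [
    ("proteinA", "peptide1", 30.0),
    ("proteinA", "peptide2", 20.0),
    ("proteinB", "peptide3", 50.0),
]
for protein, fraction in relative_abundance(rows).items():
    print(f"{protein}: {fraction:.1%}")
```

Grouping on the peptide column instead of the protein column gives the per-peptide view of the same table.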

Now let's cross-reference the peptide identities with an NCBI BLAST search. BLAST searches are used to identify sequences that share homology with the input sequence, and they are available for searching against both nucleotides and proteins. For this experiment we will use the protein BLAST, since our data are protein identities.

Care must be taken to input the correct format when using the BLAST search. Since our data exist in a vertical column, they can be cut and pasted directly into a single search if we first use Excel to prefix each parent protein accession with a greater-than sign (“>”) using the formula “=CONCATENATE(">",[value])”. This is especially helpful for large data sets, but is not necessary for other programs such as STRING.
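The “>” prefix is the marker that begins a header line in FASTA format, which is why it matters for batch input. As a hedged illustration of the same step outside Excel, the sketch below builds FASTA-formatted text from accession and sequence pairs. The function name, accessions, and sequences are hypothetical placeholders.

```python
def to_fasta(records, width=60):
    """Format (accession, sequence) pairs as FASTA text.

    Each record starts with a '>' header line, followed by the sequence
    wrapped to a fixed line width.
    """
    lines = []
    for accession, sequence in records:
        lines.append(f">{accession}")
        for i in range(0, len(sequence), width):
            lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


# Hypothetical accessions and sequences, for illustration only.
records = [
    ("ACC0001", "GPPGAPGER"),
    ("ACC0002", "GAKGERGPP"),
]
print(to_fasta(records), end="")
```

The resulting text can be pasted into a search box that accepts multiple FASTA entries.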

The results show a mapping of the homologous sequences along the protein's sequence. Each protein entry is opened separately using the dropdown box near the top. Below the mapping, a table lists the expect values (E-values), which represent the confidence for each hit.

Interestingly, some of the accession numbers are not in the NCBI database used. This is because NCBI doesn't contain all of the accession numbers obtained from SearchGUI, a DAS-based peptide fingerprinting software. The average confidence for these values was 39% in our search, primarily because only a single 9-amino acid peptide was found.

When this short sequence is run against the database, a large number of matches are identified. Despite the use of a single, short peptide, the search matched a high proportion of collagen proteins. This is because of collagen’s conserved repeating three-amino acid pattern, (G-X-Y)n.

Results can also be pre-filtered to only include specific databases. When run against only pig proteins, the search yields an expect value of 100% for 3 different porcine collagens. The same also held true for the other unlisted collagen.

From here on, we will analyze the proteins using STRING.

Analysis with STRING

Brief overview

Since their development, DAS-enabled sites have grown in functionality as new algorithms are developed, and they now include databases that can quantitatively integrate data. One such site, and currently my favorite DAS-enabled site, is STRING. While STRING is primarily for protein-protein interactions, it also includes functional protein analysis, namely via interactions, structure, expression, genomic context, and sequence similarity. Currently, STRING boasts a database covering 9,643,763 proteins and 2,031 organisms. Moreover, STRING provides a platform for accessing further analysis through other DAS-enabled sites, including SMART for architectural analysis across species and EggNOG for orthologs, phylogeny, and functional annotations. A complete guide can be found here.

While some of the approaches above were outlined using single database systems, STRING is more user friendly and contains data from multiple sources to provide more complete coverage. This is especially helpful when working with poorly annotated protein sets, as is the case with Sus scrofa.

STRING offers a friendly user interface for finding sequence similarity within and between any sequences in the DAS database. It also provides sequence alignment, predicted and experimental functional partners, fusion proteins, neighborhood, occurrence (by phylogeny), coexpression, and references within experiments, and it can text-mine and scan databases, all through one page.

STRING can handle multiple target inputs at once and accepts most formats including FASTA, many database accession numbers, and even plain text.

Protein Interactions

Any potential matches are then displayed with a bitscore and identity confidence. The proteins selected are correct, so we will continue to the top scoring partners.

Here you can see the proteins with putative interactions and their connections displayed graphically:

At the bottom of the page, the parameters can be adjusted to include more protein interactions, although too many may just result in confusion.

Collagen is most frequently found as a trimer and often interacts with many proteins and signaling molecules, so it makes sense that our search table has many high scoring interactions.

When we increase the maximum interactions to 50, the graphical mapping becomes too cumbersome to realistically assess without increasing stringency.

Datasets can also be checked for above-average interactions, either relative to the genome or via a custom parameter. Here, 226 of the 266 unique ECM protein identities could be imported. These were found to have interacting proteins at required confidence levels of both 0.7 and 0.9.

0.7 cutoff

0.9 cutoff
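The effect of raising the required confidence can be mimicked on any exported interaction table. The sketch below assumes each interaction is a pair of proteins with a combined confidence score between 0 and 1. The function name, protein labels, and scores are hypothetical, not values from this data set.

```python
def filter_interactions(edges, required_confidence):
    """Keep only interactions whose score meets the required confidence.

    edges: iterable of (protein_a, protein_b, score) with score in [0, 1].
    """
    return [(a, b, score) for a, b, score in edges if score >= required_confidence]


# Hypothetical interaction scores, for illustration only.
edges = [
    ("proteinA", "proteinB", 0.95),
    ("proteinA", "proteinC", 0.80),
    ("proteinB", "proteinD", 0.45),
]
for cutoff in (0.7, 0.9):
    kept = filter_interactions(edges, cutoff)
    print(f"cutoff {cutoff}: {len(kept)} of {len(edges)} interactions kept")
```

A stricter cutoff leaves fewer edges, which is why the 0.9 network is sparser and easier to read than the 0.7 network.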

Graphically, clustering can be improved by using the “Kmeans” and “MCL” slide bars. These act upon different parameters to promote clustering when the nodes are allowed to “relax”. Non-interacting nodes can also be removed from the image.

Additionally, p-values and counts for GO biological categories, pathways, and domains can be summarized with the enrichment tab.

Gene transcript, motifs, and structure with SMART and PDBsum

Selecting a bubble shows a summary of structure and transcript motifs.

From here, one can continue analysis of the protein within the SMART database by clicking on the transcript, or go to PDBsum by clicking on the 3D rendering.

PDBsum also provides the paper first describing the PDB crystal structure

as well as articles citing the structure


Selecting a protein's bitscore will provide sequence alignment to the current protein being compared. The protein being compared can be swapped out by selecting its name.

Transcript, phylogenetic, and functional genomic analysis with F1S5H3 - von Willebrand factor A domain containing 2 (VWA2)

VWA2 is a domain found within collagen alpha-3(IV) proteins.


The sequence can be obtained by selecting the bubble and then selecting sequence.


The transcript obtained through the SMART link:

The domains within the transcript include a signal peptide, 3 von Willebrand factor A (VWA) domains, 2 epidermal growth factor-like (EGF) domains, and a low complexity region.

The EGF confidence appears low, so we can easily look into these motifs for more information.

Regions within the transcript can be run through an NCBI or EMBL-EBI BLAST search by selecting two points along the scale. The top hit (Sus scrofa, pig) has 100% confidence, but the annotation is listed as a low quality protein. Further investigation could be done to better assess its identity.

The phylogenetic match has a score of 1092 bits, which indicates a confident match.
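To give a sense of why a bit score of this size is convincing, the standard BLAST relationship between bit score S' and expect value is E = m * n * 2^(-S'), where m is the query length and n is the database length. The sketch below works in log10 to avoid floating-point underflow. The query length and database size are hypothetical placeholders, since the actual search space is not reported here.

```python
import math


def log10_expect_value(bit_score, query_length, database_length):
    """Return log10 of the BLAST expect value, E = m * n * 2**(-bit_score).

    Working in log space avoids underflow for large bit scores.
    """
    return (
        math.log10(query_length)
        + math.log10(database_length)
        - bit_score * math.log10(2)
    )


# Assumed search space: a 700-residue query against 10**8 database residues.
log_e = log10_expect_value(1092, query_length=700, database_length=10**8)
print(f"log10(E) is roughly {log_e:.0f}")
```

Even with a very large assumed database, the expect value stays vanishingly small, so a chance match can effectively be ruled out.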

We can further study phylogenetic similarity in SMART using the compositional architecture search. Here we can select species to see other proteins that share VWA2 domains.

The selected proteins can then be formatted into a Newick tree or downloaded.
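Newick is a plain-text tree format built from nested parentheses and terminated by a semicolon. As a minimal sketch of what such a file contains, the function below writes a nested tuple of leaf names as a Newick string without branch lengths. The function name is hypothetical, and the topology shown is purely illustrative, not a result from SMART.

```python
def to_newick(tree):
    """Serialize a nested tuple of leaf names as a Newick string.

    Leaves are strings; internal nodes are tuples of subtrees.
    Branch lengths are omitted in this sketch.
    """
    def render(node):
        if isinstance(node, str):
            return node
        return "(" + ",".join(render(child) for child in node) + ")"

    return render(tree) + ";"


# Illustrative topology only.
example_tree = (("pig", "cow"), "human")
print(to_newick(example_tree))  # ((pig,cow),human);
```

Most tree viewers accept a string in this form directly.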

Co-expression can also be easily visualized for any data set.

Note: large datasets may sometimes not allow the desired phylogeny to be selected if the majority of strong correlations are made with a distant clade or domain.


The world of bioinformatics is constantly expanding. Each year, researchers have cheaper, more powerful tools at their disposal to advance their work. Personally, I've discovered several databases and software programs that will save hours of tedious work, while also providing improved results.

b2gof15/students/ampajako/final_project/start.txt · Last modified: 2015/11/04 20:21 by ampajako
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International