The PASIF System

Prediction of Alternatively Spliced IsoForms

"...PASIF actively evolves transcript isoforms!"


Overview:

The PASIF system implements a computational method for predicting multiple transcript isoforms from alternatively spliced loci in higher eukaryotic taxa, and for eukaryotic gene structure prediction in general. It incorporates a broad array of artificial intelligence techniques, including intelligent search algorithms, constraint satisfaction, neural networks for classification, and lazy instance-based parameter evaluation. A detailed description of the algorithm is presented in Chapter 5 of [Sparks, 2007]. The software implementation, written in the C programming language, can execute on both serial computers and parallel systems that support MPI functionality.

Disclaimer:

Note that we do not advocate the use of PASIF for practical gene annotation tasks: the system grew out of a highly exploratory effort to develop a gene prediction system that deviates from conventional practice, with the aim of escaping the stagnation in this domain's progress [Knapp and Chen, 2007]. That goal was not achieved. Various subsystems of the software implementation are quite useful, however, and can readily be cannibalized by other tools.

Algorithmic Approach:

PASIF utilizes a novel strategy to address the (alternative) gene structure prediction problem: random restart hill-climbing search is used to predict multiple transcript products at genetic loci. Random restarts of the search process provide structural seeds for iterative improvement (with respect to an objective function, described below) by a greedy heuristic local search method. Because the greedy-choice property does not hold in the context of gene prediction, distinct seeds can settle into distinct local optima, which allows PASIF to predict heterogeneous, intralocus gene structures of potential biological relevance. The algorithm executes "seed-and-refine" cycles until a termination criterion is satisfied.
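A minimal sketch of this seed-and-refine loop follows. It assumes placeholder types and callbacks (structure_t, random seeding, best-neighbor generation, and a scoring function); these identifiers are illustrative only and do not reflect PASIF's actual interfaces.

    /* Illustrative sketch of random restart hill-climbing over gene
     * structures; the types and callbacks below are placeholders only. */
    #include <stdlib.h>

    typedef struct structure structure_t;            /* candidate gene structure */

    typedef structure_t *(*seed_fn)(void);           /* random structural seed   */
    typedef structure_t *(*neigh_fn)(structure_t *); /* best local perturbation  */
    typedef double (*score_fn)(structure_t *);       /* multivariate objective   */

    /* One restart: greedily refine a random seed until no neighbor improves it. */
    static structure_t *refine(seed_fn seed, neigh_fn best_neighbor, score_fn score)
    {
      structure_t *current = seed();
      for (;;) {
        structure_t *next = best_neighbor(current);
        if (next == NULL || score(next) <= score(current))
          return current;                            /* local optimum reached    */
        current = next;                              /* accept the improvement   */
      }
    }

    /* Run several independent restarts; distinct seeds can settle into distinct
     * local optima, i.e., distinct candidate isoforms at the locus.             */
    static void seed_and_refine(int restarts, seed_fn seed, neigh_fn best_neighbor,
                                score_fn score, structure_t **isoforms)
    {
      for (int i = 0; i < restarts; ++i)
        isoforms[i] = refine(seed, best_neighbor, score);
    }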

-- Feature Vector for Gene Structures:

Candidate transcript isoforms are characterized along six dimensions.

  1. The mean of posterior donor splice site probabilities is computed using a Bayesian approach described in [Sparks and Brendel, 2005] and references contained therein.

    Given appropriate training data (as produced by, e.g., the GenomeThreader package [Gremme, Brendel, Sparks, and Kurtz, 2005]), it is possible to parameterize such splice site models using a variant of the BSSM4GSQ package, tailored specifically for use with PASIF: BSSM4PASIF.tar.bz2.

  2. The mean of posterior acceptor splice site probabilities is calculated similarly.

  3. The mean of the ρ-values of [Brendel and Kleffe, 1998], which reflect how well a particular splice site might pair with some cognate splice site as a function of all feasible local splicing configurations, is calculated for donor sites.

  4. The mean of acceptor site ρ-values is also computed.

  5. We introduce a new measure, the ζ-value, which indicates how robust the inclusion of a particular intron is to perturbations of the splicing pattern exhibited by its host gene structure. The method first partitions the power set of the structure's introns into those subsets containing the intron of interest and those lacking it. It then takes the difference between the maximally scoring element of the former part and that of the latter, where the "score" of any such element is the mean of the posterior probabilities exhibited by all donor and acceptor sites it contains. This difference is normalized by dividing by the square root of the raw count of introns in the original structure. For any particular spliced isoform, the sum of ζ-values from each intron it hosts (a theoretically unbounded quantity) is squashed into the unit interval using the logistic function, which renders the descriptive feature (a sketch appears after this list).

  6. Likelihoods of putative coding sequences, given coding-specific transition probability estimates, are computed using a new variant of Markov chain models we have designed, called dynamically modulating Markov models, which work as follows:

    Training data for a functional class of sequences, e.g., coding or intron, are partitioned into Z subsets, and transition probabilities are estimated from each of the Z subsets using standard methods for developing Markov chains. This yields a cloud of Z transition probability estimates in a space of dimensionality four (the cardinality of the nucleotide alphabet) raised to the (K+1)th power, where K is the Markov chain's order; each "dimension" corresponds to a nucleotide oligomer of length K+1.

    The breadth of each dimension (oligomer) is delimited using the minimum and maximum transition probabilities observed for the oligomer over the Z distinct estimates. Transition probabilities returned by the dynamically modulating Markov model for this oligomer can vary across this range, subject to the additional constraint that, for all four possible oligomers of length K+1 that share a K-long prefix, their adjusted transition probabilities must collectively represent a legitimate probability mass function. Linear programming techniques, e.g., the simplex algorithm, are used to implement this optimization step.

    It is possible to train these (among other) Markov models using the train_MM.x driver program included in IMMpractical.4PASIF.tar.bz2, a variant of the IMMpractical library [Sparks, Brendel, and Dorman, 2007] configured for use with PASIF.
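Referring back to item 5 above, the following is a minimal sketch of the ζ-value computation. It assumes each intron is summarized only by its donor and acceptor posterior probabilities and enumerates intron subsets with bitmasks; all identifiers (intron_t, zeta_value, zeta_feature, and so on) are illustrative and not PASIF's internal API.

    /* Illustrative sketch of the zeta-value calculation from item 5. */
    #include <math.h>

    typedef struct {
      double donor_prob;    /* posterior probability of the donor site    */
      double acceptor_prob; /* posterior probability of the acceptor site */
    } intron_t;

    /* Score of one intron subset (bitmask): mean of all donor and acceptor
     * posterior probabilities over the introns it contains; 0.0 if empty.  */
    static double subset_score(const intron_t *introns, int n, unsigned mask)
    {
      double sum = 0.0;
      int count = 0;
      for (int i = 0; i < n; ++i)
        if (mask & (1u << i)) {
          sum += introns[i].donor_prob + introns[i].acceptor_prob;
          count += 2;
        }
      return count ? sum / count : 0.0;
    }

    /* zeta-value of intron `target`: difference between the best-scoring subset
     * containing it and the best-scoring subset lacking it, normalized by the
     * square root of the total intron count in the structure.                  */
    static double zeta_value(const intron_t *introns, int n, int target)
    {
      double best_with = -1.0, best_without = -1.0;
      for (unsigned mask = 0; mask < (1u << n); ++mask) {
        double s = subset_score(introns, n, mask);
        if (mask & (1u << target)) { if (s > best_with) best_with = s; }
        else                       { if (s > best_without) best_without = s; }
      }
      return (best_with - best_without) / sqrt((double)n);
    }

    /* Feature 5 for a spliced isoform: sum of its introns' zeta-values,
     * squashed into the unit interval with the logistic function.        */
    static double zeta_feature(const intron_t *introns, int n)
    {
      double total = 0.0;
      for (int i = 0; i < n; ++i)
        total += zeta_value(introns, n, i);
      return 1.0 / (1.0 + exp(-total));
    }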
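For item 6, the fragment below sketches only the bounding step of a dynamically modulating Markov model: deriving per-oligomer transition probability ranges from Z independent Markov chain estimates. The subsequent constrained adjustment (the linear programming step that forces the four extensions of each K-long prefix to form a valid probability mass function within these bounds) is omitted here, and the identifiers are assumptions for this example only.

    /* Sketch of the range-estimation step for a dynamically modulating Markov
     * model of order K over the DNA alphabet; identifiers are illustrative.   */
    #include <float.h>

    #define ALPHA  4                       /* |{A,C,G,T}|                      */
    #define K      2                       /* example Markov chain order       */
    #define NOLIGO (ALPHA*ALPHA*ALPHA)     /* 4^(K+1) oligomers of length K+1  */

    typedef struct {
      double lo[NOLIGO];                   /* minimum observed transition prob */
      double hi[NOLIGO];                   /* maximum observed transition prob */
    } dmmm_bounds_t;

    /* est[z][o] holds the transition probability of oligomer o estimated from
     * training subset z (z = 0..Z-1); the bounds delimit the cloud's breadth. */
    static void dmmm_estimate_bounds(int Z, const double est[][NOLIGO],
                                     dmmm_bounds_t *b)
    {
      for (int o = 0; o < NOLIGO; ++o) {
        b->lo[o] = DBL_MAX;
        b->hi[o] = -DBL_MAX;
        for (int z = 0; z < Z; ++z) {
          if (est[z][o] < b->lo[o]) b->lo[o] = est[z][o];
          if (est[z][o] > b->hi[o]) b->hi[o] = est[z][o];
        }
      }
    }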

-- Multivariate Objective Function:

We assume that a linear (or near-linear) classification boundary exists between biologically relevant spliced isoforms and spurious gene structures when candidates are represented using the descriptive feature vector outlined above, because all dimensions of this feature space are expected to increase concomitantly with the "goodness" of a candidate solution.

Weights for combining elements of this feature vector are learned using two common variants of the classical perceptron: an unthresholded perceptron (linear unit) and a single sigmoid unit. Both return continuous-valued outputs amenable to sorting candidate solutions generated during the search process, so these neurons both guide optimization during the search and classify candidate solutions in a two-class scheme.
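A minimal sketch of such a scoring neuron follows, assuming the six-dimensional feature vector described above plus a bias term and standard delta-rule training; the names, learning rate, and labels (1 for a relevant isoform, 0 for a spurious structure) are illustrative assumptions, not PASIF's learned parameters.

    /* Minimal sketch of a single sigmoid unit over the six-dimensional
     * feature vector described above; identifiers are illustrative.    */
    #include <math.h>

    #define NFEAT 6

    /* Continuous-valued output in (0,1), usable both for ranking candidate
     * structures during search and for two-class classification at 0.5.   */
    static double sigmoid_output(const double w[NFEAT + 1], const double x[NFEAT])
    {
      double net = w[NFEAT];                     /* bias weight */
      for (int i = 0; i < NFEAT; ++i)
        net += w[i] * x[i];
      return 1.0 / (1.0 + exp(-net));
    }

    /* One stochastic gradient (delta rule) update toward the target label. */
    static void sigmoid_train_step(double w[NFEAT + 1], const double x[NFEAT],
                                   double target, double eta)
    {
      double o = sigmoid_output(w, x);
      double delta = eta * (target - o) * o * (1.0 - o);
      for (int i = 0; i < NFEAT; ++i)
        w[i] += delta * x[i];
      w[NFEAT] += delta;                         /* bias input is 1 */
    }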

Software Downloads:

Bibliography:


Copyright © 2006, 2007 Michael E. Sparks