Probalign

Last updated May 18, 2023

Probalign is a sequence alignment tool that calculates a maximum expected accuracy alignment using partition function posterior probabilities.^[1] Base pair probabilities are estimated from a distribution similar to the Boltzmann distribution. The partition function is calculated using a dynamic programming approach.

Algorithm

The following describes the algorithm used by Probalign to determine the base pair probabilities.^[2]

Alignment score

To score an alignment of two sequences, two things are needed:

a similarity function $\sigma (x,y)$ (e.g. PAM, BLOSUM, ...)
an affine gap penalty $g(k)=\alpha +\beta k$ for a gap of length $k$

The score $S(a)$ of an alignment $a$ is defined as:

$S(a)=\sum _{x_{i}-y_{j}\in a}\sigma (x_{i},y_{j})+{\text{gap cost}}$
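
As a minimal sketch of this definition, the following Python function scores an alignment given as two gapped strings of equal length. The names `sigma`, `gap_penalty` and `alignment_score`, the toy match/mismatch similarity and the default values of $\alpha$ and $\beta$ are assumptions made for illustration, not Probalign's actual parameters. Because the gap cost is added to the score, $\alpha$ and $\beta$ are assumed to be negative here.

```python
def sigma(a, b):
    """Toy similarity function (assumption): +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0


def gap_penalty(k, alpha=-2.0, beta=-1.0):
    """Affine gap penalty g(k) = alpha + beta * k for a gap run of length k."""
    return alpha + beta * k


def alignment_score(row_x, row_y, alpha=-2.0, beta=-1.0):
    """Score S(a): similarity of aligned pairs plus g(k) for every gap run."""
    if len(row_x) != len(row_y):
        raise ValueError("both alignment rows must have the same length")
    score = 0.0
    run_len = 0      # length of the gap run currently being read
    run_side = None  # "x" or "y": the row that holds the current gap run
    for a, b in zip(row_x, row_y):
        side = "x" if a == "-" else "y" if b == "-" else None
        if side is not None and side == run_side:
            run_len += 1  # the current gap run continues
            continue
        if run_side is not None:
            score += gap_penalty(run_len, alpha, beta)  # close the finished run
        if side is None:
            score += sigma(a, b)  # aligned pair x_i - y_j
            run_len = 0
        else:
            run_len = 1  # a new gap run starts in this column
        run_side = side
    if run_side is not None:
        score += gap_penalty(run_len, alpha, beta)  # gap run at the very end
    return score


print(alignment_score("ACGT", "A--T"))  # two matches and one gap of length 2
```

A gap that switches from one row to the other is treated as two separate gap runs, each paying its own gap-open cost.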

Now the Boltzmann-weighted score of an alignment $a$ is:

$e^{\frac {S(a)}{T}}=e^{\frac {\sum _{x_{i}-y_{j}\in a}\sigma (x_{i},y_{j})+{\text{gap cost}}}{T}}=\left(\prod _{x_{i}-y_{j}\in a}e^{\frac {\sigma (x_{i},y_{j})}{T}}\right)\cdot e^{\frac {\text{gap cost}}{T}}$

Where $T$ is a scaling factor.

The probability of an alignment, assuming a Boltzmann distribution, is given by

$Pr[a|x,y]={\frac {e^{\frac {S(a)}{T}}}{Z}}$

Where $Z$ is the partition function, i.e. the sum of the Boltzmann weights of all alignments.
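
The normalization can be illustrated with a short Python sketch. It takes the scores of an explicitly listed set of alignments, whereas Probalign sums over all possible alignments of the two sequences, so the $Z$ computed here only stands in for the true partition function. The function name `boltzmann_probabilities` and the example scores are hypothetical.

```python
import math


def boltzmann_probabilities(scores, T=1.0):
    """Turn alignment scores S(a) into probabilities exp(S(a)/T) / Z.

    Z is the sum of the Boltzmann weights of the listed alignments only
    (assumption: the list stands in for the set of all alignments).
    """
    # Subtracting the maximum score leaves the ratios unchanged and
    # avoids overflow in exp() for large scores.
    top = max(scores)
    weights = [math.exp((s - top) / T) for s in scores]
    Z = sum(weights)
    return [w / Z for w in weights]


# Three hypothetical alignments with scores 3, 1 and 1.
print(boltzmann_probabilities([3.0, 1.0, 1.0], T=1.0))
# A larger scaling factor T flattens the distribution.
print(boltzmann_probabilities([3.0, 1.0, 1.0], T=100.0))
```

With a small $T$ the probability mass concentrates on the best-scoring alignment, while a large $T$ makes all alignments nearly equally likely.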

Dynamic programming

Let $Z_{i,j}$ denote the partition function of the prefixes $x_{1},x_{2},\ldots ,x_{i}$ and $y_{1},y_{2},\ldots ,y_{j}$, where an index of 0 denotes the empty prefix. Three different cases are considered:

$Z_{i,j}^{M}:$ the partition function of all alignments of the two prefixes that end in a match.
$Z_{i,j}^{I}:$ the partition function of all alignments of the two prefixes that end in an insertion $(-,y_{j})$.
$Z_{i,j}^{D}:$ the partition function of all alignments of the two prefixes that end in a deletion $(x_{i},-)$.

Then we have: $Z_{i,j}=Z_{i,j}^{M}+Z_{i,j}^{D}+Z_{i,j}^{I}$

Initialization

The matrices are initialized as follows:

$Z_{0,j}^{M}=Z_{i,0}^{M}=0$ for $i,j>0$
$Z_{0,0}^{M}=1$
$Z_{0,j}^{D}=0$
$Z_{i,0}^{I}=0$

Recursion

The partition function for the alignments of two sequences $x$ and $y$ is given by $Z_{|x|,|y|}$, which can be computed recursively:

$Z_{i,j}^{M}=Z_{i-1,j-1}\cdot e^{\frac {\sigma (x_{i},y_{j})}{T}}$
$Z_{i,j}^{D}=Z_{i-1,j}^{D}\cdot e^{\frac {\beta }{T}}+Z_{i-1,j}^{M}\cdot e^{\frac {g(1)}{T}}+Z_{i-1,j}^{I}\cdot e^{\frac {g(1)}{T}}$
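$Z_{i,j}^{I}=Z_{i,j-1}^{I}\cdot e^{\frac {\beta }{T}}+Z_{i,j-1}^{M}\cdot e^{\frac {g(1)}{T}}+Z_{i,j-1}^{D}\cdot e^{\frac {g(1)}{T}}$ (analogous to $Z_{i,j}^{D}$, stepping back in $j$ instead of $i$)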
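
The initialization and recursion can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not Probalign's source code: the function names, the toy similarity `sigma` and the default parameters are hypothetical, and $\alpha$ and $\beta$ are taken to be negative. The boundary cells $Z_{i,0}^{D}$ and $Z_{0,j}^{I}$, which the initialization above does not list, are assumed to follow from applying the same recursion on the border, so that a prefix aligned entirely to gaps gets the weight $e^{g(k)/T}$.

```python
import math


def sigma(a, b):
    """Toy similarity function (assumption): +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0


def partition_matrices(x, y, alpha=-2.0, beta=-1.0, T=1.0):
    """Fill Z^M, Z^D and Z^I for all prefix pairs of x and y.

    Index 0 is the empty prefix, so x_i corresponds to x[i - 1].
    """
    n, m = len(x), len(y)
    ZM = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZD = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZI = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZM[0][0] = 1.0  # the empty alignment
    extend = math.exp(beta / T)               # continue an open gap
    open_gap = math.exp((alpha + beta) / T)   # start a gap: g(1) = alpha + beta
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                prev = ZM[i - 1][j - 1] + ZD[i - 1][j - 1] + ZI[i - 1][j - 1]
                ZM[i][j] = prev * math.exp(sigma(x[i - 1], y[j - 1]) / T)
            if i > 0:  # alignments ending in a deletion (x_i, -)
                ZD[i][j] = (ZD[i - 1][j] * extend
                            + (ZM[i - 1][j] + ZI[i - 1][j]) * open_gap)
            if j > 0:  # alignments ending in an insertion (-, y_j)
                ZI[i][j] = (ZI[i][j - 1] * extend
                            + (ZM[i][j - 1] + ZD[i][j - 1]) * open_gap)
    return ZM, ZD, ZI


def partition_function(x, y, alpha=-2.0, beta=-1.0, T=1.0):
    """Z_{|x|,|y|}: the sum of the three matrices in the last cell."""
    ZM, ZD, ZI = partition_matrices(x, y, alpha, beta, T)
    return ZM[-1][-1] + ZD[-1][-1] + ZI[-1][-1]


print(partition_function("ACGT", "AGT"))
```

The sketch works with plain probabilities for readability. For long sequences the weights can overflow or underflow, so a practical implementation would likely rescale the matrices or work in log space.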

Base pair probability

Finally, the probability that positions $x_{i}$ and $y_{j}$ form a base pair is given by:

$P(x_{i}-y_{j}|x,y)={\frac {Z_{i-1,j-1}\cdot e^{\frac {\sigma (x_{i},y_{j})}{T}}\cdot Z'_{i',j'}}{Z_{|x|,|y|}}}$

Here $Z'$ is the partition function recalculated on the reversed sequences, and $i'$ and $j'$ are the corresponding indices in the reversed sequences.
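
A minimal Python sketch of this formula is given below. It assumes that $i'=|x|-i$ and $j'=|y|-j$, i.e. that $Z'_{i',j'}$ covers the alignments of the suffixes following $x_{i}$ and $y_{j}$; this index mapping is an interpretation of the text above rather than something it states explicitly. The function names, the toy similarity `sigma` and the default parameters are likewise hypothetical.

```python
import math


def sigma(a, b):
    """Toy similarity function (assumption): +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0


def forward_matrices(x, y, alpha=-2.0, beta=-1.0, T=1.0):
    """Return (Z, ZM): the total and the match partition matrices of all prefixes."""
    n, m = len(x), len(y)
    ZM = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZD = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZI = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZM[0][0] = 1.0
    extend = math.exp(beta / T)
    open_gap = math.exp((alpha + beta) / T)  # g(1) = alpha + beta
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                prev = ZM[i - 1][j - 1] + ZD[i - 1][j - 1] + ZI[i - 1][j - 1]
                ZM[i][j] = prev * math.exp(sigma(x[i - 1], y[j - 1]) / T)
            if i > 0:
                ZD[i][j] = (ZD[i - 1][j] * extend
                            + (ZM[i - 1][j] + ZI[i - 1][j]) * open_gap)
            if j > 0:
                ZI[i][j] = (ZI[i][j - 1] * extend
                            + (ZM[i][j - 1] + ZD[i][j - 1]) * open_gap)
    Z = [[ZM[i][j] + ZD[i][j] + ZI[i][j] for j in range(m + 1)]
         for i in range(n + 1)]
    return Z, ZM


def posterior_matrix(x, y, alpha=-2.0, beta=-1.0, T=1.0):
    """P[i-1][j-1] = probability that x_i and y_j are paired (1-based i, j)."""
    n, m = len(x), len(y)
    Z, ZM = forward_matrices(x, y, alpha, beta, T)
    # Z' from the text: the same recursion run on the reversed sequences.
    Z_rev, _ = forward_matrices(x[::-1], y[::-1], alpha, beta, T)
    total = Z[n][m]
    # ZM[i][j] already equals Z_{i-1,j-1} * exp(sigma(x_i, y_j) / T).
    return [[ZM[i][j] * Z_rev[n - i][m - j] / total
             for j in range(1, m + 1)]
            for i in range(1, n + 1)]


for row in posterior_matrix("ACGT", "AGT"):
    print(["%.3f" % p for p in row])
```

Each position can be paired with at most one position of the other sequence in any single alignment, so every row and every column of the resulting matrix sums to at most 1.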

References

[1] U. Roshan and D. R. Livesay, Probalign: multiple sequence alignment using partition function posterior probabilities, Bioinformatics, 22(22):2715-21, 2006
[2] Lecture "Bioinformatics II" at University of Freiburg

External links

Probalign Webservice

