Statistical potential

Example of an interatomic pseudopotential between the β-carbons of isoleucine and valine residues, generated using MyPMFs. [1]

In protein structure prediction, statistical potentials or knowledge-based potentials are scoring functions derived from an analysis of known protein structures in the Protein Data Bank (PDB).


The original method to obtain such potentials is the quasi-chemical approximation, due to Miyazawa and Jernigan. [2] It was later followed by the potential of mean force (statistical PMF [Note 1]), developed by Sippl. [3] Although the obtained scores are often considered approximations of the free energy—thus referred to as pseudo-energies—this physical interpretation is incorrect. [4] [5] Nonetheless, they are applied with success in many cases, because they frequently correlate with actual Gibbs free energy differences. [6]

Overview

Possible features to which a pseudo-energy can be assigned include interatomic distances, torsion angles, solvent exposure, and hydrogen bond geometry.

The classic application is, however, based on pairwise amino acid contacts or distances, thus producing statistical interatomic potentials. For pairwise amino acid contacts, a statistical potential is formulated as an interaction matrix that assigns a weight or energy value to each possible pair of standard amino acids. The energy of a particular structural model is then the combined energy of all pairwise contacts (defined as two amino acids within a certain distance of each other) in the structure. The energies are determined using statistics on amino acid contacts in a database of known protein structures (obtained from the PDB).
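
As a concrete illustration, the following is a minimal sketch of how such a contact-based score might be evaluated; the random 20×20 matrix values, the 8 Å cutoff, and the single-point-per-residue representation are assumptions made for the example (a real matrix would be derived from contact counts in the PDB).

```python
import itertools
import numpy as np

# Hypothetical inputs: one representative coordinate per residue (e.g. the
# beta-carbon), the residue types, and a 20x20 matrix of pairwise contact
# energies. Here the matrix is random; a real one comes from PDB statistics.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

rng = np.random.default_rng(0)
contact_energy = rng.normal(size=(20, 20))
contact_energy = (contact_energy + contact_energy.T) / 2  # symmetrize

def contact_score(coords, sequence, cutoff=8.0):
    """Sum matrix energies over all residue pairs closer than `cutoff` (in Å)."""
    total = 0.0
    for i, j in itertools.combinations(range(len(sequence)), 2):
        if np.linalg.norm(coords[i] - coords[j]) < cutoff:
            total += contact_energy[AA_INDEX[sequence[i]], AA_INDEX[sequence[j]]]
    return total

# Toy usage: three residues on a line, 5 Å apart; pairs (0,1) and (1,2)
# count as contacts, pair (0,2) does not.
coords = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
print(contact_score(coords, "AVK"))
```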

History

Initial development

Many textbooks present the statistical PMFs as proposed by Sippl [3] as a simple consequence of the Boltzmann distribution, as applied to pairwise distances between amino acids. This is incorrect, but it is a useful starting point for introducing the construction of such potentials in practice. The Boltzmann distribution applied to a specific pair of amino acids is given by:

$$P(r) = \frac{1}{Z}\, e^{-\frac{F(r)}{kT}}$$

where $r$ is the distance, $k$ is the Boltzmann constant, $T$ is the temperature and $Z$ is the partition function, with

$$Z = \int e^{-\frac{F(r)}{kT}}\, dr$$

The quantity $F(r)$ is the free energy assigned to the pairwise system. Simple rearrangement results in the inverse Boltzmann formula, which expresses the free energy $F(r)$ as a function of $P(r)$:

$$F(r) = -kT \ln P(r) - kT \ln Z$$

To construct a PMF, one then introduces a so-called reference state with a corresponding distribution $Q_R$ and partition function $Z_R$, and calculates the following free energy difference:

$$\Delta F(r) = -kT \ln \frac{P(r)}{Q_R(r)} - kT \ln \frac{Z}{Z_R}$$

The reference state typically results from a hypothetical system in which the specific interactions between the amino acids are absent. The second term, involving $Z$ and $Z_R$, can be ignored, as it is a constant.

In practice, $P(r)$ is estimated from the database of known protein structures, while $Q_R(r)$ typically results from calculations or simulations. For example, $P(r)$ could be the conditional probability of finding the β-carbons of a valine and a serine at a given distance $r$ from each other, giving rise to the free energy difference $\Delta F$. The total free energy difference of a protein, $\Delta F_{\textrm{T}}$, is then claimed to be the sum of all the pairwise free energies:

$$\Delta F_{\textrm{T}} = \sum_{i<j} \Delta F(r_{ij} \mid a_i, a_j) = -kT \sum_{i<j} \ln \frac{P(r_{ij} \mid a_i, a_j)}{Q_R(r_{ij} \mid a_i, a_j)}$$

where the sum runs over all amino acid pairs $a_i, a_j$ (with $i < j$) and $r_{ij}$ is their corresponding distance. In many studies $Q_R$ does not depend on the amino acid sequence. [7]
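
To make this construction concrete, here is a minimal sketch of the inverse Boltzmann estimate for a single amino acid pair type, assuming distance samples are already at hand; the bin edges, the pseudocount, and the synthetic data are illustrative choices (a real $P(r)$ would be estimated from PDB distance counts, and $Q_R(r)$ from the chosen reference state).

```python
import numpy as np

def statistical_pmf(observed_distances, reference_distances,
                    bins=np.linspace(2.0, 15.0, 27), kT=1.0, pseudocount=1e-6):
    """Inverse Boltzmann estimate: Delta F(r) = -kT * ln(P(r) / Q_R(r)).

    `observed_distances`: pair distances for one amino acid pair type,
    taken from known structures.
    `reference_distances`: distances drawn from the chosen reference state.
    Returns bin centers and the pseudo-energy per bin (in units of kT).
    """
    p, _ = np.histogram(observed_distances, bins=bins, density=True)
    q, _ = np.histogram(reference_distances, bins=bins, density=True)
    delta_f = -kT * np.log((p + pseudocount) / (q + pseudocount))
    centers = (bins[:-1] + bins[1:]) / 2
    return centers, delta_f

# Toy usage with synthetic data: an "observed" distribution peaked at 5 Å
# against a broad, featureless reference.
rng = np.random.default_rng(1)
obs = rng.normal(5.0, 1.0, 10_000)
ref = rng.uniform(2.0, 15.0, 10_000)
centers, delta_f = statistical_pmf(obs, ref)
print(centers[np.argmin(delta_f)])  # most favorable distance, near 5 Å
```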

Conceptual issues

Intuitively, it is clear that a low value of $\Delta F_{\textrm{T}}$ indicates that the set of distances in a structure is more likely in proteins than in the reference state. However, the physical meaning of these statistical PMFs has been widely disputed since their introduction. [4] [5] [8] [9] The main issues are:

  1. The wrong interpretation of this "potential" as a true, physically valid potential of mean force;
  2. The nature of the so-called reference state and its optimal formulation;
  3. The validity of generalizations beyond pairwise distances.

Controversial analogy

In response to the issue regarding the physical validity, the first justification of statistical PMFs was attempted by Sippl. [10] It was based on an analogy with the statistical physics of liquids. For liquids, the potential of mean force is related to the radial distribution function $g(r)$, which is given by: [11]

$$g(r) = \frac{P(r)}{Q_R(r)}$$

where $P(r)$ and $Q_R(r)$ are the respective probabilities of finding two particles at a distance $r$ from each other in the liquid and in the reference state. For liquids, the reference state is clearly defined; it corresponds to the ideal gas, consisting of non-interacting particles. The two-particle potential of mean force $W(r)$ is related to $g(r)$ by:

$$W(r) = -kT \ln g(r)$$

According to the reversible work theorem, the two-particle potential of mean force $W(r)$ is the reversible work required to bring two particles in the liquid from infinite separation to a distance $r$ from each other. [11]
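
The algebra behind the analogy is immediate: substituting the definition $g(r) = P(r)/Q_R(r)$ into the expression for $\Delta F(r)$ from the previous section (with the constant term dropped) gives

$$\Delta F(r) = -kT \ln \frac{P(r)}{Q_R(r)} = -kT \ln g(r) = W(r).$$

The statistical quantity thus has the same algebraic form as the liquid-state potential of mean force; the dispute concerns whether a database of folded proteins and a protein reference state can play the roles that the liquid and the ideal gas play in liquid-state theory.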

Sippl justified the use of statistical PMFs—a few years after he introduced them for use in protein structure prediction—by appealing to the analogy with the reversible work theorem for liquids. For liquids, $g(r)$ can be measured experimentally using small angle X-ray scattering; for proteins, $P(r)$ is obtained from the set of known protein structures, as explained in the previous section. However, as Ben-Naim wrote in a publication on the subject: [5]

[...] the quantities, referred to as "statistical potentials," "structure based potentials," or "pair potentials of mean force", as derived from the protein data bank (PDB), are neither "potentials" nor "potentials of mean force," in the ordinary sense as used in the literature on liquids and solutions.

Moreover, this analogy does not solve the issue of how to specify a suitable reference state for proteins.

Machine learning

In the mid-2000s, authors started to combine multiple statistical potentials, derived from different structural features, into composite scores. [12] For that purpose, they used machine learning techniques, such as support vector machines (SVMs). Probabilistic neural networks (PNNs) have also been applied for the training of a position-specific distance-dependent statistical potential. [13] In 2016, the DeepMind artificial intelligence research laboratory started to apply deep learning techniques to the development of a torsion- and distance-dependent statistical potential. [14] The resulting method, named AlphaFold, won the 13th Critical Assessment of Techniques for Protein Structure Prediction (CASP) by correctly predicting the most accurate structure for 25 out of 43 free modelling domains.

Explanation

Bayesian probability

Baker and co-workers [15] justified statistical PMFs from a Bayesian point of view and used these insights in the construction of the coarse grained ROSETTA energy function. According to Bayesian probability calculus, the conditional probability $P(X \mid A)$ of a structure $X$, given the amino acid sequence $A$, can be written as:

$$P(X \mid A) = \frac{P(A \mid X)\, P(X)}{P(A)} \propto P(A \mid X)\, P(X)$$

$P(X \mid A)$ is proportional to the product of the likelihood $P(A \mid X)$ times the prior $P(X)$. By assuming that the likelihood can be approximated as a product of pairwise probabilities, and applying Bayes' theorem, the likelihood can be written as:

$$P(A \mid X) \approx \prod_{i<j} P(a_i, a_j \mid r_{ij}) \propto \prod_{i<j} \frac{P(r_{ij} \mid a_i, a_j)}{P(r_{ij})}$$

where the product runs over all amino acid pairs $a_i, a_j$ (with $i < j$), and $r_{ij}$ is the distance between amino acids $i$ and $j$. Obviously, the negative of the logarithm of this expression has the same functional form as the classic pairwise distance statistical PMFs, with the denominator playing the role of the reference state. This explanation has two shortcomings: it relies on the unfounded assumption that the likelihood can be expressed as a product of pairwise probabilities, and it is purely qualitative.
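
Taking the negative logarithm of this likelihood makes the correspondence explicit:

$$-\ln P(A \mid X) \approx -\sum_{i<j} \ln \frac{P(r_{ij} \mid a_i, a_j)}{P(r_{ij})} = \frac{1}{kT} \sum_{i<j} \Delta F(r_{ij} \mid a_i, a_j),$$

where the last step holds if the sequence-independent distribution $P(r_{ij})$ is identified with the reference state $Q_R$ of the earlier sections.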

Probability kinematics

Hamelryck and co-workers [6] later gave a quantitative explanation for the statistical potentials, according to which they approximate a form of probabilistic reasoning due to Richard Jeffrey and named probability kinematics. This variant of Bayesian thinking (sometimes called "Jeffrey conditioning") allows updating a prior distribution based on new information on the probabilities of the elements of a partition on the support of the prior. From this point of view, (i) it is not necessary to assume that the database of protein structures—used to build the potentials—follows a Boltzmann distribution, (ii) statistical potentials generalize readily beyond pairwise distances, and (iii) the reference ratio is determined by the prior distribution.
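
Stated compactly, and in the notation of the next section, Jeffrey conditioning updates a prior $Q(X)$ to

$$P(X) = \sum_i P(Y = y_i)\, Q(X \mid Y = y_i),$$

where the events $\{Y = y_i\}$ partition the support of the prior and the $P(Y = y_i)$ are the newly adopted probabilities. When $Y$ is a deterministic function of $X$, this update reduces to the reference ratio expression $P(X) = Q(X)\, P(Y)/Q(Y)$ discussed below.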

Reference ratio

The reference ratio method. $Q(X)$ is a probability distribution that describes the structure of proteins on a local length scale (right). Typically, $Q(X)$ is embodied in a fragment library, but other possibilities are an energy function or a graphical model. In order to obtain a complete description of protein structure, one also needs a probability distribution $P(Y)$ that describes nonlocal aspects, such as hydrogen bonding. $P(Y)$ is typically obtained from a set of solved protein structures from the PDB (left). In order to combine $Q(X)$ with $P(Y)$ in a meaningful way, one needs the reference ratio expression (bottom), which takes the signal in $Q(X)$ with respect to $Y$ into account.

Expressions that resemble statistical PMFs naturally result from the application of probability theory to solve a fundamental problem that arises in protein structure prediction: how to improve an imperfect probability distribution $Q(X)$ over a first variable $X$ using a probability distribution $P(Y)$ over a second variable $Y$, with $Y = f(X)$. [6] Typically, $X$ and $Y$ are fine and coarse grained variables, respectively. For example, $Q(X)$ could concern the local structure of the protein, while $P(Y)$ could concern the pairwise distances between the amino acids. In that case, $X$ could for example be a vector of dihedral angles that specifies all atom positions (assuming ideal bond lengths and angles). In order to combine the two distributions, such that the local structure will be distributed according to $Q(X)$, while the pairwise distances will be distributed according to $P(Y)$, the following expression is needed:

$$P(X) = \frac{P(Y)}{Q(Y)}\, Q(X)$$

where $Q(Y)$ is the distribution over $Y$ implied by $Q(X)$. The ratio in the expression corresponds to the PMF. Typically, $Q(X)$ is brought in by sampling (typically from a fragment library), and not explicitly evaluated; the ratio, which in contrast is explicitly evaluated, corresponds to Sippl's PMF. This explanation is quantitative, and allows the generalization of statistical PMFs from pairwise distances to arbitrary coarse grained variables. It also provides a rigorous definition of the reference state, which is implied by $Q(X)$. Conventional applications of pairwise distance statistical PMFs usually lack two necessary features to make them fully rigorous: the use of a proper probability distribution over pairwise distances in proteins, and the recognition that the reference state is rigorously defined by $Q(X)$.
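
A minimal numerical sketch of this combination, with toy stand-ins throughout: a Gaussian plays the role of the fragment-library distribution $Q(X)$, a cosine plays the coarse-graining function $f$, and a hand-specified histogram plays the PDB-derived $P(Y)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins: X is a dihedral-like scalar, Y = f(X) a coarse-grained feature.
f = np.cos  # hypothetical coarse-graining function

# Q(X): the "local" model we can sample from (stand-in for a fragment library).
x = rng.normal(0.0, 1.0, 200_000)
y = f(x)

# Q(Y): the distribution over Y implied by Q(X), estimated here by a histogram.
bins = np.linspace(-1.0, 1.0, 41)
q_hist, _ = np.histogram(y, bins=bins, density=True)

# P(Y): the target distribution over Y (stand-in for PDB-derived statistics),
# a narrow peak at 0.5, discretized on the same bins and normalized.
centers = (bins[:-1] + bins[1:]) / 2
p_hist = np.exp(-0.5 * ((centers - 0.5) / 0.2) ** 2)
p_hist /= np.sum(p_hist * np.diff(bins))

# The reference ratio P(Y)/Q(Y), evaluated per sample, acts as an importance
# weight: it imposes P on the coarse variable while keeping Q's fine detail.
idx = np.clip(np.digitize(y, bins) - 1, 0, len(centers) - 1)
w = p_hist[idx] / np.maximum(q_hist[idx], 1e-12)
w /= w.sum()

print(np.sum(w * y))  # weighted mean of f(X); close to the mean of P(Y), ~0.5
```

Reweighting samples from $Q(X)$ by the explicitly evaluated ratio mirrors the division of labour described above: $Q(X)$ enters only through sampling, while the ratio is the part that is actually computed.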

Applications

Statistical potentials are used as energy functions in the assessment of an ensemble of structural models produced by homology modeling or protein threading. Many differently parameterized statistical potentials have been shown to successfully identify the native state structure from an ensemble of decoy or non-native structures. [16] Statistical potentials are not only used for protein structure prediction, but also for modelling the protein folding pathway. [17] [18]

Notes

  1. Not to be confused with an actual potential of mean force.


References

  1. Postic, Guillaume; Hamelryck, Thomas; Chomilier, Jacques; Stratmann, Dirk (2018). "MyPMFs: a simple tool for creating statistical potentials to assess protein structural models". Biochimie. 151: 37–41. doi:10.1016/j.biochi.2018.05.013. PMID 29857183.
  2. Miyazawa S, Jernigan R (1985). "Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation". Macromolecules. 18 (3): 534–552. doi:10.1021/ma00145a039.
  3. Sippl MJ (1990). "Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins". J Mol Biol. 213 (4): 859–883. doi:10.1016/s0022-2836(05)80269-4. PMID 2359125.
  4. Thomas PD, Dill KA (1996). "Statistical potentials extracted from protein structures: how accurate are they?". J Mol Biol. 257 (2): 457–469. doi:10.1006/jmbi.1996.0175. PMID 8609636.
  5. Ben-Naim A (1997). "Statistical potentials extracted from protein structures: Are these meaningful potentials?". J Chem Phys. 107 (9): 3698–3706. doi:10.1063/1.474725.
  6. Hamelryck T, Borg M, Paluszewski M, et al. (2010). Flower DR (ed.). "Potentials of mean force for protein structure prediction vindicated, formalized and generalized". PLOS ONE. 5 (11): e13714. arXiv:1008.4006. doi:10.1371/journal.pone.0013714. PMC 2978081. PMID 21103041.
  7. Rooman M, Wodak S (1995). "Are database-derived potentials valid for scoring both forward and inverted protein folding?". Protein Eng. 8 (9): 849–858. doi:10.1093/protein/8.9.849. PMID 8746722.
  8. Koppensteiner WA, Sippl MJ (1998). "Knowledge-based potentials–back to the roots". Biochemistry Mosc. 63 (3): 247–252. PMID 9526121.
  9. Shortle D (2003). "Propensities, probabilities, and the Boltzmann hypothesis". Protein Sci. 12 (6): 1298–1302. doi:10.1110/ps.0306903. PMC 2323900. PMID 12761401.
  10. Sippl MJ, Ortner M, Jaritz M, Lackner P, Flockner H (1996). "Helmholtz free energies of atom pair interactions in proteins". Fold Des. 1 (4): 289–98. doi:10.1016/s1359-0278(96)00042-9. PMID 9079391.
  11. Chandler D (1987). Introduction to Modern Statistical Mechanics. New York: Oxford University Press.
  12. Eramian, David; Shen, Min-yi; Devos, Damien; Melo, Francisco; Sali, Andrej; Marti-Renom, Marc (2006). "A composite score for predicting errors in protein structure models". Protein Science. 15 (7): 1653–1666. doi:10.1110/ps.062095806. PMC 2242555. PMID 16751606.
  13. Zhao, Feng; Xu, Jinbo (2012). "A Position-Specific Distance-Dependent Statistical Potential for Protein Structure and Functional Study". Structure. 20 (6): 1118–1126. doi:10.1016/j.str.2012.04.003. PMC 3372698. PMID 22608968.
  14. Senior AW, Evans R, Jumper J, et al. (2020). "Improved protein structure prediction using potentials from deep learning". Nature. 577 (7792): 706–710. doi:10.1038/s41586-019-1923-7. PMID 31942072.
  15. Simons KT, Kooperberg C, Huang E, Baker D (1997). "Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions". J Mol Biol. 268 (1): 209–225. doi:10.1006/jmbi.1997.0959. PMID 9149153.
  16. Lam SD, Das S, Sillitoe I, Orengo C (2017). "An overview of comparative modelling and resources dedicated to large-scale modelling of genome sequences". Acta Crystallogr D. 73 (8): 628–640. doi:10.1107/S2059798317008920. PMC 5571743. PMID 28777078.
  17. Kmiecik S, Kolinski A (2007). "Characterization of protein-folding pathways by reduced-space modeling". Proc. Natl. Acad. Sci. U.S.A. 104 (30): 12330–12335. doi:10.1073/pnas.0702265104. PMC 1941469. PMID 17636132.
  18. Adhikari AN, Freed KF, Sosnick TR (2012). "De novo prediction of protein folding pathways and structure using the principle of sequential stabilization". Proc. Natl. Acad. Sci. U.S.A. 109 (43): 17442–17447. doi:10.1073/pnas.1209000109. PMC 3491489. PMID 23045636.