The dead-end elimination algorithm (DEE) is a method for minimizing a function over a discrete set of independent variables. The basic idea is to identify "dead ends", i.e., combinations of variables that are not necessary to define a global minimum, because there is always a way of replacing such a combination with a better or equivalent one. The search can then refrain from exploring such combinations further. Hence, dead-end elimination is a mirror image of dynamic programming, in which "good" combinations are identified and explored further.
Although the method itself is general, it has been developed and applied mainly to the problems of predicting and designing the structures of proteins (and in this capacity was cited in the Scientific Background to the 2024 Nobel Prize in Chemistry).[1] It is closely related to the notion of dominance in optimization, also known as substitutability in constraint satisfaction problems. The original description and proof of the dead-end elimination theorem can be found in the 1992 paper by Desmet and co-workers.
An effective DEE implementation requires four pieces of information: a well-defined, finite set of discrete independent variables; a precomputed numerical value (the "energy") associated with each element of the set, and possibly with pairs, triples, etc. of elements; one or more criteria for identifying "dead ends"; and an objective function (the "energy function") to be minimized.
Note that the criteria can easily be reversed to identify the maximum of a given function as well.
Dead-end elimination has been used effectively to predict the structure of side chains on a given protein backbone structure by minimizing an energy function. The dihedral-angle search space of the side chains is restricted to a discrete set of rotamers at each amino acid position in the protein (whose length is fixed). The original DEE description included criteria for the elimination of single rotamers and of rotamer pairs, although these criteria have since been extended.
In the following discussion, let N be the length of the protein and let r_i represent the rotamer of the i-th side chain. Since atoms in proteins are assumed to interact only through two-body potentials, the energy E may be written

E(r_1, \ldots, r_N) = \sum_{i=1}^{N} E(r_i) + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} E(r_i, r_j)

where E(r_i) represents the "self-energy" of a particular rotamer r_i, and E(r_i, r_j) represents the "pair energy" of the rotamers r_i and r_j. Also note that E(r_i, r_i) (that is, the pair energy between a rotamer and itself) is taken to be zero, and thus does not affect the summations. This notation simplifies the description of the pairs criterion below.
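As a concrete illustration of this decomposition, the following Python sketch evaluates the total energy of one rotamer assignment from precomputed energy tables. The data layout (nested tables self_E and pair_E, with pair energies stored only for i < j) is an assumption made for illustration, not part of the original formulation.

def total_energy(assignment, self_E, pair_E):
    """Total energy of one rotamer per position under two-body potentials.

    assignment : chosen rotamer index at each position
    self_E[i][r] : self-energy E(r_i) of rotamer r at position i
    pair_E[i][j][r][s] : pair energy E(r_i, s_j), stored for i < j only
    """
    N = len(assignment)
    E = sum(self_E[i][assignment[i]] for i in range(N))
    for i in range(N - 1):
        for j in range(i + 1, N):
            E += pair_E[i][j][assignment[i]][assignment[j]]
    return E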
If a particular rotamer r_A of side chain i cannot possibly give a better energy than another rotamer r_B of the same side chain, then rotamer r_A can be eliminated from further consideration, which reduces the search space. Mathematically, this condition is expressed by the inequality

E(r_A) + \sum_{j \neq i} \min_{X} E(r_A, X_j) > E(r_B) + \sum_{j \neq i} \max_{X} E(r_B, X_j)

where \min_{X} E(r_A, X_j) is the minimum (best) energy possible between rotamer r_A of side chain i and any rotamer X of side chain j. Similarly, \max_{X} E(r_B, X_j) is the maximum (worst) energy possible between rotamer r_B of side chain i and any rotamer X of side chain j.
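A minimal sketch of this singles check, assuming the same illustrative self_E and pair_E tables as above (a small helper makes the pair lookup work for either ordering of positions):

def pair(pair_E, i, r, j, s):
    # Symmetric lookup into tables stored only for i < j.
    return pair_E[i][j][r][s] if i < j else pair_E[j][i][s][r]

def can_eliminate_single(i, rA, rotamers, self_E, pair_E):
    """Original singles criterion: rA at position i is a dead end if some
    competitor rB beats rA's best case even in rB's worst case."""
    N = len(rotamers)
    best_case_A = self_E[i][rA] + sum(
        min(pair(pair_E, i, rA, j, X) for X in rotamers[j])
        for j in range(N) if j != i)
    for rB in rotamers[i]:
        if rB == rA:
            continue
        worst_case_B = self_E[i][rB] + sum(
            max(pair(pair_E, i, rB, j, X) for X in rotamers[j])
            for j in range(N) if j != i)
        if best_case_A > worst_case_B:
            return True
    return False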
The pairs criterion is more difficult to describe and to implement, but it adds significant eliminating power. For brevity, we define the shorthand variable

\varepsilon(r_A, s_B) = E(r_A) + E(s_B) + E(r_A, s_B)

that is, the intrinsic energy of a pair of rotamers r_A and s_B at positions i and j, respectively.

A given pair of rotamers r_A and s_B at positions i and j, respectively, cannot both be in the final solution (although one or the other may be) if there is another pair r_C and s_D at the same positions that always gives a better energy. Expressed mathematically,

\varepsilon(r_A, s_B) + \sum_{k \neq i,j} \min_{X} [ E(r_A, X_k) + E(s_B, X_k) ] > \varepsilon(r_C, s_D) + \sum_{k \neq i,j} \max_{X} [ E(r_C, X_k) + E(s_D, X_k) ]

where r_C and s_D are rotamers at positions i and j, respectively, and X ranges over the rotamers of each side chain k \neq i, j.
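Continuing the illustrative data layout and the pair() helper sketched above, the pairs check might look like the following:

def eps(i, r, j, s, self_E, pair_E):
    # Intrinsic energy of the pair (rotamer r at i, rotamer s at j).
    return self_E[i][r] + self_E[j][s] + pair(pair_E, i, r, j, s)

def can_eliminate_pair(i, rA, j, sB, rotamers, self_E, pair_E):
    """Pairs criterion: (rA at i, sB at j) is a dead end if some competing
    pair (rC at i, sD at j) beats it even under the most favorable neighbors."""
    others = [k for k in range(len(rotamers)) if k != i and k != j]
    best_case_AB = eps(i, rA, j, sB, self_E, pair_E) + sum(
        min(pair(pair_E, i, rA, k, X) + pair(pair_E, j, sB, k, X)
            for X in rotamers[k])
        for k in others)
    for rC in rotamers[i]:
        for sD in rotamers[j]:
            if (rC, sD) == (rA, sB):
                continue
            worst_case_CD = eps(i, rC, j, sD, self_E, pair_E) + sum(
                max(pair(pair_E, i, rC, k, X) + pair(pair_E, j, sD, k, X)
                    for X in rotamers[k])
                for k in others)
            if best_case_AB > worst_case_CD:
                return True
    return False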
For large N, the matrices of precomputed energies can become costly to store. Let N be the number of amino acid positions, as above, and let p be the number of rotamers at each position (this is usually, but not necessarily, constant over all positions). Each self-energy matrix for a given position requires p entries, so the total number of self-energies to store is Np. Each pair-energy matrix between two positions i and j, with p rotamers at each position, requires a p × p matrix, which makes the total number of entries in an unreduced pair matrix N²p². This can be trimmed somewhat, at the cost of additional complexity in implementation, because the pair energies are symmetrical and the pair energy between a rotamer and itself is zero.
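For example, a hypothetical protein with N = 100 positions and p = 10 rotamers per position would require Np = 1,000 stored self-energies and, unreduced, N²p² = 1,000,000 pair-energy entries; exploiting the symmetry of the pair energies roughly halves the latter figure.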
The above two criteria are normally applied iteratively until convergence, defined as the point at which no more rotamers or pairs can be eliminated. Since this is normally a reduction in the sample space by many orders of magnitude, simple enumeration will suffice to determine the minimum within this pared-down set.
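Putting the pieces together, a simplified driver loop might look like the sketch below. It reuses the hypothetical helpers defined above and, for brevity, prunes only single rotamers before enumerating the survivors; a full implementation would also apply the pairs criterion and feed eliminated pairs back into the singles check.

from itertools import product

def dee(rotamers, self_E, pair_E):
    """Prune dead-ending rotamers until convergence, then enumerate."""
    rotamers = [list(rs) for rs in rotamers]
    changed = True
    while changed:
        changed = False
        for i in range(len(rotamers)):
            for rA in list(rotamers[i]):
                if len(rotamers[i]) > 1 and can_eliminate_single(
                        i, rA, rotamers, self_E, pair_E):
                    rotamers[i].remove(rA)
                    changed = True
    # Exhaustive enumeration of the (much smaller) remaining space.
    return min(product(*rotamers),
               key=lambda a: total_energy(list(a), self_E, pair_E))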
Given this model, it is clear that the DEE algorithm is guaranteed to find the optimal solution; that is, it is a global optimization process. The single-rotamer search scales quadratically in time with the total number of rotamers. The pair search scales cubically and is the slowest part of the algorithm (aside from the energy calculations). This is a dramatic improvement over brute-force enumeration, which scales exponentially, as roughly p^N for p rotamers at each of N positions.
A large-scale benchmark of DEE compared with alternative methods of protein structure prediction and design found that DEE reliably converges to the optimal solution for protein lengths for which it runs in a reasonable amount of time. It significantly outperformed the alternatives under consideration, which involved techniques derived from mean field theory, genetic algorithms, and the Monte Carlo method. However, the other algorithms are appreciably faster than DEE and can thus be applied to larger and more complex problems; their relative accuracy can be extrapolated from a comparison with the DEE solution within the scope of problems accessible to DEE.
The preceding discussion implicitly assumed that the rotamers are all different orientations of the same amino acid side chain; that is, the sequence of the protein was assumed to be fixed. It is also possible to allow multiple amino acid types to "compete" over a position by including rotamers of more than one side-chain type in the set for that position. This allows a novel sequence to be designed onto a given protein backbone; a short zinc finger protein fold has been redesigned in this way. However, this greatly increases the number of rotamers per position, and the protein length must still be fixed.
More powerful and more general criteria have been introduced that improve both the efficiency and the eliminating power of the method for both prediction and design applications. One example is a refinement of the singles elimination criterion known as the Goldstein criterion, which arises from fairly straightforward algebraic manipulation before applying the minimization:

E(r_A) - E(r_B) + \sum_{j \neq i} \min_{X} [ E(r_A, X_j) - E(r_B, X_j) ] > 0

Thus rotamer r_A can be eliminated if any alternative rotamer r_B from the set at position i contributes less to the total energy than r_A, even when the surrounding rotamers are chosen to favor r_A. This is an improvement over the original criterion, which requires comparison of the best possible (that is, the smallest) energy contribution from r_A with the worst possible contribution from an alternative rotamer.
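A sketch of the Goldstein check under the same assumed data layout; the key difference from the original singles check is that the minimization is applied to the energy difference between the two rotamers rather than to each rotamer's interactions separately:

def can_eliminate_goldstein(i, rA, rotamers, self_E, pair_E):
    """Goldstein singles criterion: eliminate rA at position i if some rB is
    better even when every neighboring rotamer X is chosen to favor rA."""
    N = len(rotamers)
    for rB in rotamers[i]:
        if rB == rA:
            continue
        gap = self_E[i][rA] - self_E[i][rB]
        gap += sum(
            min(pair(pair_E, i, rA, j, X) - pair(pair_E, i, rB, j, X)
                for X in rotamers[j])
            for j in range(N) if j != i)
        if gap > 0:
            return True
    return False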
An extended discussion of more elaborate DEE criteria and a benchmark of their relative performance can be found in the literature.
Dynamic programming is both a mathematical optimization method and an algorithmic paradigm. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics.
In data mining and statistics, hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories: agglomerative ("bottom-up") approaches, which merge clusters moving up the hierarchy, and divisive ("top-down") approaches, which split them moving down.
In mathematics, the Smith normal form is a normal form that can be defined for any matrix with entries in a principal ideal domain (PID). The Smith normal form of a matrix is diagonal, and can be obtained from the original matrix by multiplying on the left and right by invertible square matrices. In particular, the integers are a PID, so one can always calculate the Smith normal form of an integer matrix. The Smith normal form is very useful for working with finitely generated modules over a PID, and in particular for deducing the structure of a quotient of a free module. It is named after the Irish mathematician Henry John Stephen Smith.
Feature selection is the process of selecting a subset of relevant features for use in model construction. Stylometry and DNA microarray analysis are two cases where feature selection is used. It should be distinguished from feature extraction.
Protein design is the rational design of new protein molecules in order to engineer novel activity, behavior, or purpose, and to advance basic understanding of protein function. Proteins can be designed from scratch or by making calculated variants of a known protein structure and its sequence. Rational protein design approaches make protein-sequence predictions that will fold to specific structures. These predicted sequences can then be validated experimentally through methods such as peptide synthesis, site-directed mutagenesis, or artificial gene synthesis.
In information theory, the Rényi entropy is a quantity that generalizes various notions of entropy, including Hartley entropy, Shannon entropy, collision entropy, and min-entropy. The Rényi entropy is named after Alfréd Rényi, who looked for the most general way to quantify information while preserving additivity for independent events. In the context of fractal dimension estimation, the Rényi entropy forms the basis of the concept of generalized dimensions.
Geometry processing is an area of research that uses concepts from applied mathematics, computer science and engineering to design efficient algorithms for the acquisition, reconstruction, analysis, manipulation, simulation and transmission of complex 3D models. As the name implies, many of the concepts, data structures, and algorithms are directly analogous to signal processing and image processing. For example, where image smoothing might convolve an intensity signal with a blur kernel formed using the Laplace operator, geometric smoothing might be achieved by convolving a surface geometry with a blur kernel formed using the Laplace-Beltrami operator.
In multi-objective optimization, the Pareto front is the set of all Pareto efficient solutions. The concept is widely used in engineering. It allows the designer to restrict attention to the set of efficient choices, and to make tradeoffs within this set, rather than considering the full range of every parameter.
The Rand index or Rand measure in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index that is adjusted for the chance grouping of elements is the adjusted Rand index. The Rand index can be viewed as the accuracy of deciding, for each pair of elements, whether the two clusterings agree on placing the pair in the same cluster or not.
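For concreteness, a small Python sketch of the (unadjusted) Rand index computed directly from this pair-counting view; the two example labelings are arbitrary illustrative inputs:

from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of element pairs on which the two clusterings agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs)
    return agree / len(pairs)

# Example: two clusterings of five items agree on 7 of the 10 pairs.
print(rand_index([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))  # 0.7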
The self-consistent mean field (SCMF) method is an adaptation of mean field theory used in protein structure prediction to determine the optimal amino acid side chain packing given a fixed protein backbone. It is faster but less accurate than dead-end elimination and is generally used in situations where the protein of interest is too large for the problem to be tractable by DEE.
Stress majorization is an optimization strategy used in multidimensional scaling (MDS) where, for a set of n m-dimensional data items, a configuration X of n points in r-dimensional space is sought that minimizes the so-called stress function σ(X). Usually r is 2 or 3, i.e. the matrix X lists n points in two- or three-dimensional Euclidean space so that the result may be visualised. The function σ is a cost or loss function that measures the squared differences between ideal distances and actual distances in r-dimensional space. It is defined as:

\sigma(X) = \sum_{i < j \le n} w_{ij} ( d_{ij}(X) - \delta_{ij} )^2

where w_{ij} \ge 0 is a weight for the measurement between the pair of points i and j, d_{ij}(X) is the Euclidean distance between points i and j in the configuration X, and \delta_{ij} is their ideal distance.
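A direct transcription of this stress function into Python might look as follows; the argument names (a configuration X, a matrix of ideal distances delta, and optional weights w) are illustrative assumptions:

import numpy as np

def stress(X, delta, w=None):
    """Weighted squared mismatch between configuration distances and
    ideal distances, summed over pairs i < j."""
    n = len(X)
    if w is None:
        w = np.ones((n, n))
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = np.linalg.norm(X[i] - X[j])  # distance in the embedding
            total += w[i, j] * (d_ij - delta[i, j]) ** 2
    return total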
Group method of data handling (GMDH) is a family of inductive algorithms for computer-based mathematical modeling of multi-parametric datasets that features fully automatic structural and parametric optimization of models.
SUBCLU is an algorithm for clustering high-dimensional data by Karin Kailing, Hans-Peter Kriegel and Peer Kröger. It is a subspace clustering algorithm that builds on the density-based clustering algorithm DBSCAN. SUBCLU can find clusters in axis-parallel subspaces, and uses a bottom-up, greedy strategy to remain efficient.
In statistics, the backfitting algorithm is a simple iterative procedure used to fit a generalized additive model. It was introduced in 1985 by Leo Breiman and Jerome Friedman along with generalized additive models. In most cases, the backfitting algorithm is equivalent to the Gauss–Seidel method, an algorithm used for solving a certain linear system of equations.
Graphical models have become powerful frameworks for protein structure prediction, protein–protein interaction, and free energy calculations for protein structures. Using a graphical model to represent the protein structure allows the solution of many problems including secondary structure prediction, protein-protein interactions, protein-drug interaction, and free energy calculations.
In computer science and data mining, MinHash is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder, and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.
In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of vertices in a connected graph, there exists at least one shortest path between the vertices such that either the number of edges that the path passes through or the sum of the weights of the edges is minimized. The betweenness centrality for each vertex is the number of these shortest paths that pass through the vertex.
Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select for an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.
Stochastic chains with memory of variable length are a family of stochastic chains of finite order on a finite alphabet, such that, at each time step, only a finite suffix of the past, called the context, is needed to predict the next symbol. These models were introduced in the information theory literature by Jorma Rissanen in 1983 as a universal tool for data compression, but more recently they have been used to model data in areas such as biology, linguistics and music.
In biochemistry, a backbone-dependent rotamer library provides the frequencies, mean dihedral angles, and standard deviations of the discrete conformations of the amino acid side chains in proteins as a function of the backbone dihedral angles φ and ψ of the Ramachandran map. By contrast, backbone-independent rotamer libraries express the frequencies and mean dihedral angles for all side chains in proteins, regardless of the backbone conformation of each residue type. Backbone-dependent rotamer libraries have been shown to have significant advantages over backbone-independent rotamer libraries, principally when used as an energy term, by speeding up search times of side-chain packing algorithms used in protein structure prediction and protein design.