The dead-end elimination algorithm (DEE) is a method for minimizing a function over a discrete set of independent variables. The basic idea is to identify "dead ends", i.e., combinations of variables that are not necessary to define a global minimum, because there is always a way of replacing such a combination with a better or equivalent one. The algorithm can then refrain from searching such combinations further. Hence, dead-end elimination is a mirror image of dynamic programming, in which "good" combinations are identified and explored further. Although the method itself is general, it has been developed and applied mainly to the problems of predicting and designing the structures of proteins. It is closely related to the notion of dominance in optimization, also known as substitutability in constraint satisfaction problems. The original description and proof of the dead-end elimination theorem can be found in the 1992 paper by Desmet and co-workers.
Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics. In both contexts it refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner. While some decision problems cannot be taken apart this way, decisions that span several points in time do often break apart recursively. Likewise, in computer science, if a problem can be solved optimally by breaking it into sub-problems and then recursively finding the optimal solutions to the sub-problems, then it is said to have optimal substructure.
Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its folding and its secondary and tertiary structure from its primary structure. Structure prediction is fundamentally different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry; it is highly important in medicine and biotechnology. Every two years, the performance of current methods is assessed in the CASP experiment. A continuous evaluation of protein structure prediction web servers is performed by the community project CAMEO3D.
Protein design is the rational design of new protein molecules to engineer novel activity, behavior, or purpose, and to advance basic understanding of protein function. Proteins can be designed from scratch or by making calculated variants of a known protein structure and its sequence. Rational protein design approaches predict amino acid sequences that will fold to specific structures. These predicted sequences can then be validated experimentally through methods such as peptide synthesis, site-directed mutagenesis, or artificial gene synthesis.
An effective DEE implementation requires four pieces of information:
- a well-defined finite set of discrete independent variables;
- a precomputed numerical value (the "energy") associated with each element in the set of variables, and possibly with their pairs, triples, and so on;
- a criterion or criteria for identifying "dead ends"; and
- an objective function (the "energy function") to be minimized.
Note that the criteria can easily be reversed to identify the maximum of a given function as well.
Dead-end elimination has been used effectively to predict the structure of side chains on a given protein backbone structure by minimizing an energy function. The dihedral-angle search space of the side chains is restricted to a discrete set of rotamers for each amino acid position in the protein (which is, by definition, of fixed length). The original DEE description included criteria for the elimination of single rotamers and of rotamer pairs, although these criteria can be expanded.
A dihedral angle is the angle between two intersecting planes. In chemistry it is the angle between planes through two sets of three atoms, having two atoms in common. In solid geometry it is defined as the union of a line and two half-planes that have this line as a common edge. In higher dimensions, a dihedral angle represents the angle between two hyperplanes.
Amino acids are organic compounds containing amine (-NH2) and carboxyl (-COOH) functional groups, along with a side chain (R group) specific to each amino acid. The key elements of an amino acid are carbon (C), hydrogen (H), oxygen (O), and nitrogen (N), although other elements are found in the side chains of certain amino acids. About 500 naturally occurring amino acids are known (though only 20 appear in the genetic code) and can be classified in many ways. They can be classified according to the core structural functional groups' locations as alpha- (α-), beta- (β-), gamma- (γ-) or delta- (δ-) amino acids; other categories relate to polarity, pH level, and side chain group type (aliphatic, acyclic, aromatic, containing hydroxyl or sulfur, etc.). In the form of proteins, amino acid residues form the second-largest component (water is the largest) of human muscles and other tissues. Beyond their role as residues in proteins, amino acids participate in a number of processes such as neurotransmitter transport and biosynthesis.
In the following discussion, let $N$ be the length of the protein and let $r_i$ represent the rotamer of the $i$th side chain. Since atoms in proteins are assumed to interact only through two-body potentials, the energy $E$ may be written

$E = \sum_{i} E(r_i) + \sum_{i} \sum_{j > i} E(r_i, r_j)$
where $E(r_i)$ represents the "self-energy" of a particular rotamer $r_i$, and $E(r_i, r_j)$ represents the "pair energy" of the rotamers $r_i$ and $r_j$.
Also note that $E(r_i, r_i)$ (that is, the pair energy between a rotamer and itself) is taken to be zero, and thus does not affect the summations. This notation simplifies the description of the pairs criterion below.
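To make the notation concrete, the following is a minimal Python sketch of this energy model on a toy problem. The names (self_E, pair_E, total_energy), the toy sizes, and the random energies are illustrative assumptions, not part of any published DEE implementation.

```python
# Toy pairwise energy model for DEE (illustrative; random energies).
import numpy as np

rng = np.random.default_rng(0)
N, p = 4, 3  # N side-chain positions, p rotamers per position (toy sizes)

# self_E[i, A]       ~ E(r_iA): self-energy of rotamer A at position i
# pair_E[i, j, A, B] ~ E(r_iA, r_jB): pair energy between positions i and j
self_E = rng.normal(size=(N, p))
pair_E = rng.normal(size=(N, N, p, p))
pair_E = (pair_E + pair_E.transpose(1, 0, 3, 2)) / 2  # symmetry: E(r_i, r_j) = E(r_j, r_i)
for i in range(N):
    pair_E[i, i] = 0.0  # pair energy of a rotamer with itself is zero

def total_energy(rot):
    """E = sum_i E(r_i) + sum_{i<j} E(r_i, r_j) for a full rotamer assignment."""
    E = sum(self_E[i, rot[i]] for i in range(N))
    E += sum(pair_E[i, j, rot[i], rot[j]]
             for i in range(N) for j in range(i + 1, N))
    return E
```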
If a particular rotamer $r_A$ of side chain $i$ cannot possibly give a better energy than another rotamer $r_B$ of the same side chain, then rotamer $r_A$ can be eliminated from further consideration, which reduces the search space. Mathematically, this condition is expressed by the inequality

$E(r_A) + \sum_{j \neq i} \min_X E(r_A, r_{jX}) > E(r_B) + \sum_{j \neq i} \max_X E(r_B, r_{jX})$
where $\min_X E(r_A, r_{jX})$ is the minimum (best) energy possible between rotamer $r_A$ of side chain $i$ and any rotamer $X$ of side chain $j$. Similarly, $\max_X E(r_B, r_{jX})$ is the maximum (worst) energy possible between rotamer $r_B$ of side chain $i$ and any rotamer $X$ of side chain $j$.
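The singles criterion translates directly into code. The following hedged sketch continues the toy example above (it assumes the self_E and pair_E arrays defined there); live[i] holds the rotamers at position i that have not yet been eliminated.

```python
def singles_eliminate(self_E, pair_E, live):
    """One pass of the original singles criterion over all positions."""
    N = self_E.shape[0]
    changed = False
    for i in range(N):
        others = [j for j in range(N) if j != i]
        for A in list(live[i]):          # iterate over a copy; we may discard A
            # Best case for A: self-energy plus the minimum pair energy
            # against the surviving rotamers at every other position.
            best_A = self_E[i, A] + sum(
                min(pair_E[i, j, A, X] for X in live[j]) for j in others)
            for B in live[i]:
                if B == A:
                    continue
                # Worst case for a competing rotamer B at the same position.
                worst_B = self_E[i, B] + sum(
                    max(pair_E[i, j, B, X] for X in live[j]) for j in others)
                if best_A > worst_B:     # A can never beat B: A is a dead end
                    live[i].discard(A)
                    changed = True
                    break
    return changed
```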
The pairs criterion is more difficult to describe and to implement, but it adds significant eliminating power. For brevity, we define the shorthand variable $\epsilon(r_A, r_B)$, the intrinsic energy of a pair of rotamers $r_A$ and $r_B$ at positions $i$ and $j$, respectively:

$\epsilon(r_A, r_B) = E(r_A) + E(r_B) + E(r_A, r_B)$
A given pair of rotamers $r_A$ and $r_B$ at positions $i$ and $j$, respectively, cannot both be in the final solution (although one or the other may be) if there is another pair $r_C$ and $r_D$ at the same positions that always gives a better energy. Expressed mathematically,

$\epsilon(r_A, r_B) + \sum_{k} \min_X \left[ E(r_A, r_{kX}) + E(r_B, r_{kX}) \right] > \epsilon(r_C, r_D) + \sum_{k} \max_X \left[ E(r_C, r_{kX}) + E(r_D, r_{kX}) \right]$

where $r_C$ and $r_D$ are rotamers at positions $i$ and $j$, respectively, and the sums run over all positions $k \neq i, j$.
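A sketch of the pairs criterion under the same toy assumptions follows. Eliminated pairs are recorded in a set rather than deleted outright, because either member of a dead pair may still appear in the global minimum on its own.

```python
def eps(i, j, A, B):
    # epsilon(r_A, r_B): intrinsic energy of the pair, as defined above
    return self_E[i, A] + self_E[j, B] + pair_E[i, j, A, B]

def pairs_eliminate(live, dead_pairs):
    """One pass of the original pairs criterion over all position pairs."""
    changed = False
    for i in range(N):
        for j in range(i + 1, N):
            ks = [k for k in range(N) if k != i and k != j]
            for A in live[i]:
                for B in live[j]:
                    if (i, j, A, B) in dead_pairs:
                        continue
                    best_AB = eps(i, j, A, B) + sum(
                        min(pair_E[i, k, A, X] + pair_E[j, k, B, X]
                            for X in live[k]) for k in ks)
                    # (A, B) dies if some competing pair (C, D) is always better.
                    if any(best_AB > eps(i, j, C, D) + sum(
                               max(pair_E[i, k, C, X] + pair_E[j, k, D, X]
                                   for X in live[k]) for k in ks)
                           for C in live[i] for D in live[j]):
                        dead_pairs.add((i, j, A, B))
                        changed = True
    return changed
```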
For large $N$, the matrices of precomputed energies can become costly to store. Let $N$ be the number of amino acid positions, as above, and let $p$ be the number of rotamers at each position (this is usually, but not necessarily, constant over all positions). Each self-energy matrix for a given position requires $p$ entries, so the total number of self-energies to store is $Np$. Each pair-energy matrix between two positions $i$ and $j$, for $p$ discrete rotamers at each position, requires a $p \times p$ matrix. This makes the total number of entries in an unreduced pair matrix $N^2p^2$. This can be trimmed somewhat, at the cost of additional complexity in implementation, because pair energies are symmetrical and the pair energy between a rotamer and itself is zero.
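For example (with illustrative numbers), a problem with $N = 100$ positions and $p = 10$ rotamers per position requires $Np = 1{,}000$ stored self-energies and $N^2p^2 = 1{,}000{,}000$ entries in the unreduced pair matrices; exploiting symmetry and skipping the zero $i = j$ blocks cuts this to $\binom{N}{2}p^2 = 495{,}000$ entries.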
The above two criteria are normally applied iteratively until convergence, defined as the point at which no more rotamers or pairs can be eliminated. Since this is normally a reduction in the sample space by many orders of magnitude, simple enumeration will suffice to determine the minimum within this pared-down set.
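Continuing the sketch, a simple driver alternates the two criteria until neither makes progress and then enumerates the surviving combinations; it assumes the arrays and functions from the previous sketches.

```python
import itertools

live = [set(range(p)) for _ in range(N)]  # all rotamers start alive
dead_pairs = set()

# Iterate to convergence: stop when a full pass eliminates nothing.
progressed = True
while progressed:
    progressed = singles_eliminate(self_E, pair_E, live)
    progressed = pairs_eliminate(live, dead_pairs) or progressed

def allowed(rot):
    # Reject assignments that contain an eliminated pair.
    return not any((i, j, rot[i], rot[j]) in dead_pairs
                   for i in range(N) for j in range(i + 1, N))

# Simple enumeration over the pared-down search space.
best = min((rot for rot in itertools.product(*(sorted(s) for s in live))
            if allowed(rot)), key=total_energy)
print("optimal rotamer assignment:", best, "energy:", float(total_energy(best)))
```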
Given this model, it is clear that the DEE algorithm is guaranteed to find the optimal solution; that is, it is a global optimization process. The single-rotamer search scales quadratically in time with the total number of rotamers. The pair search scales cubically and is the slowest part of the algorithm (aside from the energy calculations). This is a dramatic improvement over brute-force enumeration, which scales as $O(p^N)$ for $N$ positions with $p$ rotamers each.
Global optimization is a branch of applied mathematics and numerical analysis that attempts to find the global minima or maxima of a function or a set of functions on a given set. It is usually described as a minimization problem because the maximization of the real-valued function $g(x)$ is equivalent to the minimization of the function $f(x) := -g(x)$.
In mathematics, a function or sequence is said to exhibit quadratic growth when its values are proportional to the square of the function argument or sequence position. "Quadratic growth" often means more generally "quadratic growth in the limit", as the argument or sequence position goes to infinity: in big Theta notation, $f(x) = \Theta(x^2)$. This can be defined both continuously and discretely.
A large-scale benchmark of DEE compared with alternative methods of protein structure prediction and design found that DEE reliably converges to the optimal solution for protein lengths for which it runs in a reasonable amount of time. It significantly outperformed the alternatives under consideration, which involved techniques derived from mean field theory, genetic algorithms, and the Monte Carlo method. However, the other algorithms are appreciably faster than DEE and thus can be applied to larger and more complex problems; their relative accuracy can be extrapolated from a comparison to the DEE solution within the scope of problems accessible to DEE.
In computing, a benchmark is the act of running a computer program, a set of programs, or other operations in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it. The term benchmark is also commonly applied to the elaborately designed benchmarking programs themselves.
In physics and probability theory, mean field theory studies the behavior of large and complex stochastic models by studying a simpler model. Such models consider a large number of small individual components that interact with each other. The effect of all the other individuals on any given individual is approximated by a single averaged effect, thus reducing a many-body problem to a one-body problem.
In computer science and operations research, a genetic algorithm (GA) is a metaheuristic inspired by the process of natural selection that belongs to the larger class of evolutionary algorithms (EA). Genetic algorithms are commonly used to generate high-quality solutions to optimization and search problems by relying on bio-inspired operators such as mutation, crossover and selection. John Holland introduced genetic algorithms in the 1960s based on the concept of Darwin's theory of evolution; afterwards, his student David E. Goldberg extended GA in 1989.
The preceding discussion implicitly assumed that the rotamers are all different orientations of the same amino acid side chain; that is, the sequence of the protein was assumed to be fixed. It is also possible to allow multiple side-chain types to "compete" over a position by including rotamers of both types in the set for that position. This allows a novel sequence to be designed onto a given protein backbone; a short zinc finger protein fold has been redesigned this way. However, this greatly increases the number of rotamers per position, and it still requires a fixed protein length.
A zinc finger is a small protein structural motif that is characterized by the coordination of one or more zinc ions (Zn2+) in order to stabilize the fold. Originally coined to describe the finger-like appearance of a hypothesized structure from Xenopus laevis transcription factor IIIA, the zinc finger name has now come to encompass a wide variety of differing protein structures. Xenopus laevis TFIIIA was originally demonstrated to contain zinc and require the metal for function in 1983, the first such reported zinc requirement for a gene regulatory protein. It often appears as a metal-binding domain in multi-domain proteins.
More powerful and more general criteria have been introduced that improve both the efficiency and the eliminating power of the method for both prediction and design applications. One example is a refinement of the singles elimination criterion known as the Goldstein criterion, which arises from fairly straightforward algebraic manipulation before applying the minimization:

$E(r_A) - E(r_B) + \sum_{j \neq i} \min_X \left[ E(r_A, r_{jX}) - E(r_B, r_{jX}) \right] > 0$
Thus rotamer $r_A$ can be eliminated if some alternative rotamer $r_B$ from the set at position $i$ always contributes less to the total energy than $r_A$. This is an improvement over the original criterion, which requires comparison of the best possible (that is, the smallest) energy contribution from $r_A$ with the worst possible contribution from an alternative rotamer.
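As a sketch under the same toy assumptions, the Goldstein criterion is a small modification of the singles routine above: rather than comparing rotamer A's best case against rotamer B's worst case, it bounds the energy difference between the two directly, term by term.

```python
def goldstein_singles(self_E, pair_E, live):
    """One pass of the Goldstein singles criterion."""
    N = self_E.shape[0]
    changed = False
    for i in range(N):
        others = [j for j in range(N) if j != i]
        for A in list(live[i]):
            for B in live[i]:
                if B == A:
                    continue
                # B dominates A if this lower bound on the energy difference
                # (A in place of B, all other choices held fixed) is positive.
                gap = self_E[i, A] - self_E[i, B] + sum(
                    min(pair_E[i, j, A, X] - pair_E[i, j, B, X]
                        for X in live[j]) for j in others)
                if gap > 0:
                    live[i].discard(A)
                    changed = True
                    break
    return changed
```

In the driver loop sketched earlier, goldstein_singles can substitute for singles_eliminate; because bounding the difference term by term is at least as tight as subtracting a worst case from a best case, it eliminates at least as many rotamers per pass.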
An extended discussion of elaborate DEE criteria and a benchmark of their relative performance can be found in the literature.
The self-consistent mean field (SCMF) method is an adaptation of mean field theory used in protein structure prediction to determine the optimal amino acid side chain packing given a fixed protein backbone. It is faster but less accurate than dead-end elimination and is generally used in situations where the protein of interest is too large for the problem to be tractable by DEE.
Conformational entropy is the entropy associated with the number of conformations of a molecule. The concept is most commonly applied to biological macromolecules such as proteins and RNA, but can also be used for polysaccharides and other molecules. To calculate the conformational entropy, the possible conformations of the molecule may first be discretized into a finite number of states, usually characterized by unique combinations of certain structural parameters, each of which has been assigned an energy. In proteins, backbone dihedral angles and side chain rotamers are commonly used as parameters, and in RNA the base pairing pattern may be used. These characteristics are used to define the degrees of freedom. The conformational entropy associated with a particular structure or state, such as an alpha-helix, a folded or an unfolded protein structure, is then dependent on the probability of the occupancy of that structure.
Graphical models have become a powerful framework for protein structure prediction, protein–protein interaction, and free energy calculations for protein structures. Representing the protein structure as a graphical model allows the solution of many problems, including secondary structure prediction, protein–protein and protein–drug interactions, and free energy calculations.