Adaptive sampling

Last updated

Adaptive sampling is a approach to sampling that uses heuristics to provide efficiency. The term adaptive sampling represents a general approach to the problem of sampling, rather than being a special method itself. Meaning it can be combined with suitable other approaches/methods.

Contents

In some real world problems, sampling is implicitly/explicitly needed and used to obtain practical solutions. The sampling process will need resources and efficient usage of these resources is usually crucial. This is why there are multiple sampling methods instead of the brute-force approach.

Let f(x) be a function that is to be sampled. For simplicity, let C(x,s) be the cost for sample x given the previous set of samples s (For simplicity, we can assume that C(x,s) is constant since sampling cost usually does not depend on the previous samples and the sampling input x to the function. In time-critical systems, where the cost for each sample is strongly related to computation time; usually there are other parameters to the function C like the current time...); and G(x, s) be the gain (anti-cost) from sampling the function at x, given the set of previous samples s. For example, it can be assumed that G(x, s)=0 if x has already been sampled. The sampling problem is then maximizing our cumulative gain minus cumulative cost. Which usually comes down to sampling the function n times until the next sample's estimated/deterministic cost C(x,s) is smaller than the gain G(x,s) of that sample.

Adaptive sampling then assumes that given necessary knowledge about the problem, there is a theoretically optimal sequence s of samples that will maximize the information (gain) induced by that sample; and it is possible to estimate s using heuristics. Adaptive sampling usually focuses on estimating the next optimal sample input x, given the previous set of samples. Thus, being adaptive to the current knowledge about the function.

Computational Molecular Biology

In computational molecular biology, adaptive sampling is used to efficiently simulate protein folding when coupled with molecular dynamics simulations.

Background

Proteins spend a large portion – nearly 96% in some cases [1] – of their folding time "waiting" in various thermodynamic free energy minima. Consequently, a straightforward simulation of this process would spend a great deal of computation to this state, with the transitions between the states – the aspects of protein folding of greater scientific interest – taking place only rarely. [2] Adaptive sampling exploits this property to simulate the protein's phase space in between these states. Using adaptive sampling, molecular simulations that previously would have taken decades can be performed in a matter of weeks. [3]

Theory

If a protein folds through the metastable states A -> B -> C, researchers can calculate the length of the transition time between A and C by simulating the A -> B transition and the B -> C transition. The protein may fold through alternative routes which may overlap in part with the A -> B -> C pathway. Decomposing the problem in this manner is efficient because each step can be simulated in parallel. [3]

Applications

Adaptive sampling is used by the Folding@home distributed computing project in combination with Markov state models. [2] [3]

Disadvantages

While adaptive sampling is useful for short simulations, longer trajectories may be more helpful for certain types of biochemical problems. [4] [5]

See also

Related Research Articles

<span class="mw-page-title-main">Bioinformatics</span> Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The process of analyzing and interpreting data can sometimes be referred to as computational biology, however this distinction between the two terms is often disputed. To some, the term computational biology refers to building and using models of biological systems.

<span class="mw-page-title-main">Computational chemistry</span> Branch of chemistry

Computational chemistry is a branch of chemistry that uses computer simulations to assist in solving chemical problems. It uses methods of theoretical chemistry incorporated into computer programs to calculate the structures and properties of molecules, groups of molecules, and solids. The importance of this subject stems from the fact that, with the exception of some relatively recent findings related to the hydrogen molecular ion, achieving an accurate quantum mechanical depiction of chemical systems analytically, or in a closed form, is not feasible. The complexity inherent in the many-body problem exacerbates the challenge of providing detailed descriptions of quantum mechanical systems. While computational results normally complement information obtained by chemical experiments, it can occasionally predict unobserved chemical phenomena.

<span class="mw-page-title-main">Protein folding</span> Change of a linear protein chain to a 3D structure

Protein folding is the physical process by which a protein, after synthesis by a ribosome as a linear chain of amino acids, changes from an unstable random coil into a more ordered three-dimensional structure. This structure permits the protein to become biologically functional.

<span class="mw-page-title-main">Monte Carlo method</span> Probabilistic problem-solving algorithm

Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying concept is to use randomness to solve problems that might be deterministic in principle. The name comes from the Monte Carlo Casino in Monaco, where the primary developer of the method, mathematician Stanisław Ulam, was inspired by his uncle's gambling habits.

<span class="mw-page-title-main">Molecular dynamics</span> Computer simulations to discover and understand chemical properties

Molecular dynamics (MD) is a computer simulation method for analyzing the physical movements of atoms and molecules. The atoms and molecules are allowed to interact for a fixed period of time, giving a view of the dynamic "evolution" of the system. In the most common version, the trajectories of atoms and molecules are determined by numerically solving Newton's equations of motion for a system of interacting particles, where forces between the particles and their potential energies are often calculated using interatomic potentials or molecular mechanical force fields. The method is applied mostly in chemical physics, materials science, and biophysics.

In statistics, Markov chain Monte Carlo (MCMC) is a class of algorithms used to draw samples from a probability distribution. Given a probability distribution, one can construct a Markov chain whose elements' distribution approximates it – that is, the Markov chain's equilibrium distribution matches the target distribution. The more steps that are included, the more closely the distribution of the sample matches the actual desired distribution.

<span class="mw-page-title-main">Folding@home</span> Distributed computing project simulating protein folding

Folding@home is a distributed computing project aimed to help scientists develop new therapeutics for a variety of diseases by the means of simulating protein dynamics. This includes the process of protein folding and the movements of proteins, and is reliant on simulations run on volunteers' personal computers. Folding@home is currently based at the University of Pennsylvania and led by Greg Bowman, a former student of Vijay Pande.

Global optimization is a branch of applied mathematics and numerical analysis that attempts to find the global minima or maxima of a function or a set of functions on a given set. It is usually described as a minimization problem because the maximization of the real-valued function is equivalent to the minimization of the function .

<span class="mw-page-title-main">Vijay S. Pande</span> American scientist

Vijay Satyanand Pande is a Trinidadian–American scientist and venture capitalist. Pande is best known for orchestrating the distributed computing protein-folding research project known as Folding@home. His research is focused on distributed computing and computer-modelling of microbiology, and on improving computer simulations regarding drug-binding, protein design, and synthetic biomimetic polymers. He is currently a general partner at venture capital firm Andreessen Horowitz.

Protein design is the rational design of new protein molecules to design novel activity, behavior, or purpose, and to advance basic understanding of protein function. Proteins can be designed from scratch or by making calculated variants of a known protein structure and its sequence. Rational protein design approaches make protein-sequence predictions that will fold to specific structures. These predicted sequences can then be validated experimentally through methods such as peptide synthesis, site-directed mutagenesis, or artificial gene synthesis.

<span class="mw-page-title-main">Docking (molecular)</span> Prediction method in molecular modeling

In the field of molecular modeling, docking is a method which predicts the preferred orientation of one molecule to a second when a ligand and a target are bound to each other to form a stable complex. Knowledge of the preferred orientation in turn may be used to predict the strength of association or binding affinity between two molecules using, for example, scoring functions.

<span class="mw-page-title-main">Intrinsically disordered proteins</span> Protein without a fixed 3D structure

In molecular biology, an intrinsically disordered protein (IDP) is a protein that lacks a fixed or ordered three-dimensional structure, typically in the absence of its macromolecular interaction partners, such as other proteins or RNA. IDPs range from fully unstructured to partially structured and include random coil, molten globule-like aggregates, or flexible linkers in large multi-domain proteins. They are sometimes considered as a separate class of proteins along with globular, fibrous and membrane proteins.

<span class="mw-page-title-main">Force field (chemistry)</span> Concept on molecular modeling

In the context of chemistry, molecular physics, physical chemistry, and molecular modelling, a force field is a computational model that is used to describe the forces between atoms within molecules or between molecules as well as in crystals. Force fields are a variety of interatomic potentials. More precisely, the force field refers to the functional form and parameter sets used to calculate the potential energy of a system on the atomistic level. Force fields are usually used in molecular dynamics or Monte Carlo simulations. The parameters for a chosen energy function may be derived from classical laboratory experiment data, calculations in quantum mechanics, or both. Force fields utilize the same concept as force fields in classical physics, with the main difference being that the force field parameters in chemistry describe the energy landscape on the atomistic level. From a force field, the acting forces on every particle are derived as a gradient of the potential energy with respect to the particle coordinates.

In computational biology, de novo protein structure prediction refers to an algorithmic process by which protein tertiary structure is predicted from its amino acid primary sequence. The problem itself has occupied leading scientists for decades while still remaining unsolved. According to Science, the problem remains one of the top 125 outstanding issues in modern science. At present, some of the most successful methods have a reasonable probability of predicting the folds of small, single-domain proteins within 1.5 angstroms over the entire structure.

Transition path sampling (TPS) is a rare-event sampling method used in computer simulations of rare events: physical or chemical transitions of a system from one stable state to another that occur too rarely to be observed on a computer timescale. Examples include protein folding, chemical reactions and nucleation. Standard simulation tools such as molecular dynamics can generate the dynamical trajectories of all the atoms in the system. However, because of the gap in accessible time-scales between simulation and reality, even present supercomputers might require years of simulations to show an event that occurs once per millisecond without some kind of acceleration.

For robot control, Stochastic roadmap simulation is inspired by probabilistic roadmap methods (PRM) developed for robot motion planning.

<span class="mw-page-title-main">Metadynamics</span> Scientific computer simulation method

Metadynamics is a computer simulation method in computational physics, chemistry and biology. It is used to estimate the free energy and other state functions of a system, where ergodicity is hindered by the form of the system's energy landscape. It was first suggested by Alessandro Laio and Michele Parrinello in 2002 and is usually applied within molecular dynamics simulations. MTD closely resembles a number of newer methods such as adaptively biased molecular dynamics, adaptive reaction coordinate forces and local elevation umbrella sampling. More recently, both the original and well-tempered metadynamics were derived in the context of importance sampling and shown to be a special case of the adaptive biasing potential setting. MTD is related to the Wang–Landau sampling.

<span class="mw-page-title-main">Conformational ensembles</span> Computational models of intrinsically-disordered proteins

In computational chemistry, conformational ensembles, also known as structural ensembles, are experimentally constrained computational models describing the structure of intrinsically unstructured proteins. Such proteins are flexible in nature, lacking a stable tertiary structure, and therefore cannot be described with a single structural representation. The techniques of ensemble calculation are relatively new on the field of structural biology, and are still facing certain limitations that need to be addressed before it will become comparable to classical structural description methods such as biological macromolecular crystallography.

Multi-state modeling of biomolecules refers to a series of techniques used to represent and compute the behaviour of biological molecules or complexes that can adopt a large number of possible functional states.

Molecular Operating Environment (MOE) is a drug discovery software platform that integrates visualization, modeling and simulations, as well as methodology development, in one package. MOE scientific applications are used by biologists, medicinal chemists and computational chemists in pharmaceutical, biotechnology and academic research. MOE runs on Windows, Linux, Unix, and macOS. Main application areas in MOE include structure-based design, fragment-based design, ligand-based design, pharmacophore discovery, medicinal chemistry applications, biologics applications, structural biology and bioinformatics, protein and antibody modeling, molecular modeling and simulations, virtual screening, cheminformatics & QSAR. The Scientific Vector Language (SVL) is the built-in command, scripting and application development language of MOE.

References

  1. Robert B Best (2012). "Atomistic molecular simulations of protein folding". Current Opinion in Structural Biology (review). 22 (1): 52–61. doi:10.1016/j.sbi.2011.12.001. PMID   22257762.
  2. 1 2 TJ Lane; Gregory Bowman; Robert McGibbon; Christian Schwantes; Vijay Pande; Bruce Borden (September 10, 2012). "Folding@home Simulation FAQ". Folding@home. Stanford University. Archived from the original on 2012-09-13. Retrieved September 10, 2012.
  3. 1 2 3 G. Bowman; V. Volez; V. S. Pande (2011). "Taming the complexity of protein folding". Current Opinion in Structural Biology. 21 (1): 4–11. doi:10.1016/j.sbi.2010.10.006. PMC   3042729 . PMID   21081274.
  4. David E. Shaw; Martin M. Deneroff; Ron O. Dror; Jeffrey S. Kuskin; Richard H. Larson; John K. Salmon; Cliff Young; Brannon Batson; Kevin J. Bowers; Jack C. Chao; Michael P. Eastwood; Joseph Gagliardo; J. P. Grossman; C. Richard Ho; Douglas J. Ierardi, Ist (2008). "Anton, A Special-Purpose Machine for Molecular Dynamics Simulation". Communications of the ACM. 51 (7): 91–97. doi: 10.1145/1364782.1364802 .
  5. Ron O. Dror; Robert M. Dirks; J.P. Grossman; Huafeng Xu; David E. Shaw (2012). "Biomolecular Simulation: A Computational Microscope for Molecular Biology". Annual Review of Biophysics . 41: 429–52. doi:10.1146/annurev-biophys-042910-155245. PMID   22577825.