De novo protein structure prediction

In computational biology, de novo protein structure prediction refers to an algorithmic process by which a protein's tertiary structure is predicted from its amino acid primary sequence. The problem has occupied leading scientists for decades and remains unsolved. According to Science, it is one of the top 125 outstanding issues in modern science. [1] At present, some of the most successful methods have a reasonable probability of predicting the folds of small, single-domain proteins to within 1.5 angstroms over the entire structure. [2]

De novo methods tend to require vast computational resources, and have thus only been carried out for relatively small proteins. De novo protein structure modeling is distinguished from template-based modeling (TBM) by the fact that no solved homologue of the protein of interest is used, which makes predicting protein structure from amino acid sequence exceedingly difficult. Prediction of protein structure de novo for larger proteins will require better algorithms and larger computational resources, such as those afforded by either powerful supercomputers (such as Blue Gene or MDGRAPE-3) or distributed computing projects (such as Folding@home, Rosetta@home, the Human Proteome Folding Project, or Nutritious Rice for the World). Although the computational barriers are vast, the potential benefits of structural genomics (by predicted or experimental methods) to fields such as medicine and drug design make de novo structure prediction an active research field.

Background

Currently, the gap between known protein sequences and confirmed protein structures is immense. At the beginning of 2008, only about 1% of the sequences listed in the UniProtKB database corresponded to structures in the Protein Data Bank (PDB), leaving a gap between sequence and structure of approximately five million. [3] Experimental techniques for determining tertiary structure have faced serious bottlenecks in their ability to determine structures for particular proteins. For example, whereas X-ray crystallography has yielded structures for approximately 80,000 cytosolic proteins, it has been far less successful with membrane proteins, yielding approximately 280. [4] In light of experimental limitations, devising efficient computer programs to close the gap between known sequence and structure is believed to be the only feasible option. [4]

De novo protein structure prediction methods attempt to predict tertiary structures from sequences based on general principles that govern protein folding energetics and/or statistical tendencies of conformational features that native structures acquire, without the use of explicit templates. Research into de novo structure prediction has focused primarily on three areas: alternative lower-resolution representations of proteins, accurate energy functions, and efficient sampling methods.

A general paradigm for de novo prediction involves sampling conformation space, guided by scoring functions and other sequence-dependent biases, such that a large set of candidate ("decoy") structures is generated. Native-like conformations are then selected from these decoys using scoring functions as well as conformer clustering. High-resolution refinement is sometimes used as a final step to fine-tune native-like structures. There are two major classes of scoring functions. Physics-based functions are based on mathematical models describing aspects of the known physics of molecular interaction. Knowledge-based functions are formed with statistical models capturing aspects of the properties of native protein conformations. [5]
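The generate, score, and cluster paradigm can be illustrated with a minimal Python sketch. Everything here is a toy assumption rather than any published method: a conformation is reduced to a tuple of discrete per-residue states, `score` is a made-up function that simply prefers state 0, and clustering is approximated by counting neighbours within a small Hamming distance.

```python
import random

def score(conformation):
    """Toy scoring function (hypothetical): lower is better.
    Counts positions that differ from an arbitrary 'ideal' state 0."""
    return sum(state != 0 for state in conformation)

def distance(a, b):
    """Number of positions at which two conformations differ."""
    return sum(x != y for x, y in zip(a, b))

def generate_decoys(length, n_decoys, n_states, rng):
    """Sample random conformations, each a tuple of discrete states."""
    return [tuple(rng.randrange(n_states) for _ in range(length))
            for _ in range(n_decoys)]

def select_model(decoys, keep_fraction=0.1, radius=2):
    """Keep the best-scoring fraction of decoys, then return the kept
    decoy that has the most kept neighbours within `radius`."""
    ranked = sorted(decoys, key=score)
    kept = ranked[:max(1, int(len(ranked) * keep_fraction))]

    def neighbours(decoy):
        return sum(distance(decoy, other) <= radius for other in kept)

    return max(kept, key=neighbours)

rng = random.Random(0)
decoys = generate_decoys(length=8, n_decoys=500, n_states=3, rng=rng)
model = select_model(decoys)
print(model, score(model))
```

Real predictors replace each toy piece with something far richer (fragment-based sampling, physics-based or knowledge-based energies, structural clustering by coordinate distance), but the control flow is broadly the same.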

Amino Acid Sequence Determines Protein Tertiary Structure

Several lines of evidence have been presented in favor of the notion that primary protein sequence contains all the information required for overall three-dimensional protein structure, making the idea of de novo protein structure prediction possible. First, proteins with different functions usually have different amino acid sequences. Second, several different human diseases, such as Duchenne muscular dystrophy, can be linked to loss of protein function resulting from a change in just a single amino acid in the primary sequence. Third, proteins with similar functions across many different species often have similar amino acid sequences. Ubiquitin, for example, is a protein involved in regulating the degradation of other proteins; its amino acid sequence is nearly identical in species as far separated as Drosophila melanogaster and Homo sapiens. Fourth, by thought experiment, one can deduce that protein folding must not be a completely random process and that the information necessary for folding must be encoded within the primary structure. For example, assume that each of 100 amino acid residues within a small polypeptide could take up 10 different conformations on average, giving 10^100 different conformations for the polypeptide. If one possible conformation were tested every 10^-13 second, it would take about 10^77 years to sample all possible conformations. However, proteins are properly folded within the body on short timescales all the time, meaning that the process cannot be random and, thus, can potentially be modeled.
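The thought experiment can be checked with a few lines of Python. This is only an arithmetic sketch under the assumptions stated in the text (100 residues, 10 conformations each, one trial every 10^-13 second); the year length of 365.25 days is an added assumption. Carried through directly, these numbers give a figure on the order of 10^79 years, somewhat larger than the figure quoted above, and either value exceeds the age of the universe by many orders of magnitude.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumed year length

residues = 100
states_per_residue = 10
seconds_per_trial = 1e-13

# Total number of conformations: 10^100 (exact integer arithmetic).
conformations = states_per_residue ** residues

# Time to test every conformation once, in seconds and then in years.
total_seconds = conformations * seconds_per_trial
total_years = total_seconds / SECONDS_PER_YEAR

print(f"{total_years:.1e} years")
```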

One of the strongest lines of evidence for the supposition that all the relevant information needed to encode protein tertiary structure is found in the primary sequence was provided in the 1950s by Christian Anfinsen. In a classic experiment, he showed that ribonuclease A could be entirely denatured by being submerged in a solution of urea (to disrupt stabilizing hydrophobic interactions) in the presence of a reducing agent (to cleave stabilizing disulfide bonds). Upon removal of the protein from this environment, the denatured and functionless ribonuclease spontaneously refolded and regained function, demonstrating that protein tertiary structure is encoded in the primary amino acid sequence. Had the disulfide bonds re-formed randomly, over one hundred different combinations of four disulfide bonds could have formed. However, in the majority of cases proteins require the presence of molecular chaperones within the cell for proper folding. The overall shape of a protein may be encoded in its amino acid sequence, but its folding may depend on chaperones for assistance. [6]
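The count of "over one hundred" combinations follows from simple combinatorics. Ribonuclease A contains eight cysteine residues, and the number of ways to pair all of them into four disulfide bonds is 7 × 5 × 3 × 1 = 105. The short Python sketch below computes this; the function name is illustrative only.

```python
def count_pairings(n_cysteines):
    """Number of ways to pair every cysteine with exactly one partner.

    The first cysteine can choose among n-1 partners, the next unpaired
    one among n-3, and so on: (n-1) * (n-3) * ... * 1.
    """
    if n_cysteines % 2:
        raise ValueError("need an even number of cysteines")
    count = 1
    for choices in range(n_cysteines - 1, 0, -2):
        count *= choices
    return count

print(count_pairings(8))  # 7 * 5 * 3 * 1 = 105
```

Only one of these 105 pairings is the native one, so recovering full activity by chance alone would be unlikely.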

Successful De Novo Modeling Requirements

De novo conformation predictors usually function by producing candidate conformations (decoys) and then choosing amongst them based on their thermodynamic stability and energy state. Most successful predictors will have the following three factors in common:

1) An accurate energy function in which the most thermodynamically stable state corresponds to the native structure of a protein

2) An efficient search method capable of quickly identifying low-energy states through conformational search

3) The ability to select native-like models from a collection of decoy structures [3]

De novo programs search three-dimensional space and, in the process, produce candidate protein conformations. As a protein approaches its correctly folded, native state, entropy and free energy decrease. Using this information, de novo predictors can discriminate amongst decoys. Specifically, de novo programs select possible conformations with lower free energies, which are more likely to be correct than structures with higher free energies. [2] [6] [7] As stated by David A. Baker with regard to how his de novo Rosetta predictor works, “during folding, each local segment of the chain flickers between a different subset of local conformations…folding to the native structure occurs when the conformations adopted by the local segments and their relative orientations allow…low energy features of native protein structures. In the Rosetta algorithm…the program then searches for the combination of these local conformations that has the lowest overall energy.” [8]
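One common way to search for low-energy conformations is Metropolis Monte Carlo sampling. The sketch below is a generic illustration and not the Rosetta algorithm: the conformation is a list of backbone torsion angles, and `energy` is a hypothetical toy function in which every angle prefers -60 degrees. The step size and temperature are arbitrary choices.

```python
import math
import random

def energy(angles):
    """Toy energy (hypothetical): each torsion angle prefers -60 degrees."""
    target = math.radians(-60.0)
    return sum(1.0 - math.cos(a - target) for a in angles)

def metropolis_search(n_residues, n_steps, temperature, rng):
    """Return (best_angles, best_energy, initial_energy)."""
    angles = [rng.uniform(-math.pi, math.pi) for _ in range(n_residues)]
    current = initial = energy(angles)
    best_angles, best = list(angles), current
    for _ in range(n_steps):
        # Perturb one randomly chosen torsion angle.
        i = rng.randrange(n_residues)
        old = angles[i]
        angles[i] = old + rng.gauss(0.0, 0.5)
        delta = energy(angles) - current
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current += delta
            if current < best:
                best_angles, best = list(angles), current
        else:
            angles[i] = old  # reject: restore the previous angle
    return best_angles, best, initial

best_angles, best, initial = metropolis_search(10, 5000, 0.5, random.Random(1))
print(f"initial energy {initial:.3f}, best energy found {best:.3f}")
```

Occasionally accepting uphill moves lets the search escape local minima, which matters on the rugged energy landscapes of real proteins even though this toy landscape is smooth.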

However, some de novo methods work by first enumerating the entire conformational space using a simplified representation of protein structure, and then selecting the conformations that are most likely to be native-like. An example of this approach represents protein folds using tetrahedral lattices and builds all-atom models on top of all possible conformations obtained using the tetrahedral representation. Michael Levitt's team used this approach successfully at CASP3 to predict a protein fold whose topology had not been observed before. [9]
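Exhaustive enumeration on a lattice can be sketched in Python as follows. To keep the example short it uses a two-dimensional square lattice and the simple hydrophobic/polar (HP) model rather than the tetrahedral lattice and all-atom rebuilding described above, so it illustrates the general idea only; the function names and the example sequence are assumptions.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def enumerate_walks(n_beads):
    """Yield every self-avoiding walk of n_beads on a 2D square lattice.
    The first step is fixed to (1, 0) to skip rotated duplicates."""
    if n_beads < 2:
        raise ValueError("need at least two beads")

    def extend(path, occupied):
        if len(path) == n_beads:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                path.append(nxt)
                occupied.add(nxt)
                yield from extend(path, occupied)
                path.pop()
                occupied.remove(nxt)

    start = [(0, 0), (1, 0)]
    yield from extend(start, set(start))

def hp_energy(walk, sequence):
    """HP-model energy: -1 per non-bonded contact between two H beads."""
    index_at = {pos: i for i, pos in enumerate(walk)}
    total = 0
    for i, (x, y) in enumerate(walk):
        if sequence[i] != "H":
            continue
        # Look only right and up so each contact is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                total -= 1
    return total

sequence = "HPHPPHHPHH"  # hypothetical example sequence
best = min(enumerate_walks(len(sequence)), key=lambda w: hp_energy(w, sequence))
print(hp_energy(best, sequence), best)
```

The number of walks grows exponentially with chain length, which is why enumeration is only feasible with simplified representations and short chains.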

By developing the QUARK program, Xu and Zhang showed that the structures of some proteins can be successfully constructed ab initio using a knowledge-based force field. [10] [11]

Correctly folded protein conformations (native structures) have lower free energies than partially folded or primary structures. Computers search for these conformations because they indicate correct folding.

Protein Prediction Strategies

If a protein of known tertiary structure shares at least 30% sequence identity with a potential homolog of undetermined structure, comparative methods that overlay the putative unknown structure on the known one can be used to predict the likely structure of the unknown. Below this threshold, however, three other classes of strategy are used to determine possible structure from an initial model: ab initio protein prediction, fold recognition, and threading.
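The 30% threshold can be made concrete with a small Python sketch. It assumes the two sequences have already been aligned (gaps written as "-") and defines identity as the fraction of identical residues over the aligned, gap-free columns; other definitions of the denominator exist, so this choice and the function names are assumptions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

def suggest_strategy(identity, threshold=30.0):
    """Map a percent identity to the class of method described in the text."""
    if identity >= threshold:
        return "comparative modeling"
    return "ab initio, fold recognition, or threading"

identity = percent_identity("ACDE-G", "ACDFWG")
print(identity, suggest_strategy(identity))
```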

  1. Ab Initio Methods: In ab initio methods, an initial effort is made to elucidate secondary structures (alpha helix, beta sheet, beta turn, etc.) from primary structure using physicochemical parameters and neural network algorithms. From that point, algorithms predict tertiary folding. One drawback of this strategy is that it is not yet capable of incorporating the locations and orientations of amino acid side chains.
  2. Fold Recognition: In fold recognition strategies, a prediction of secondary structure is first made and then compared to either a library of known protein folds, such as CATH or SCOP, or what is known as a "periodic table" of possible secondary structure forms. A confidence score is then assigned to likely matches.
  3. Threading: In threading strategies, the fold recognition technique is expanded further. In this process, empirically based energy functions for the interaction of residue pairs are used to place the unknown protein onto a putative backbone as a best fit, accommodating gaps where appropriate. The best interactions are then accentuated in order to discriminate amongst potential decoys and to predict the most likely conformation.

The goal of both fold recognition and threading strategies is to ascertain whether a fold in an unknown protein is similar to a domain in a known one deposited in a database, such as the Protein Data Bank (PDB). This is in contrast to de novo (ab initio) methods, where structure is determined using a physics-based approach in lieu of comparing folds in the protein to structures in a database. [12]
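The idea of placing a query sequence onto candidate backbones and scoring the fit can be sketched in Python. This is a heavily simplified illustration: it is gapless, it scores single residues against a buried ("B") or exposed ("E") environment instead of using empirical residue-pair energies, and the fold library, residue classes, and scores are all hypothetical.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed hydrophobic residue classes

def position_score(residue, environment):
    """-1 when the residue suits its environment (hydrophobic in a buried
    position, polar in an exposed one), +1 otherwise. Lower is better."""
    buried = environment == "B"
    hydrophobic = residue in HYDROPHOBIC
    return -1 if buried == hydrophobic else 1

def thread(query, template_env):
    """Best (score, offset) for a gapless placement, or None if too long."""
    best = None
    for offset in range(len(template_env) - len(query) + 1):
        total = sum(position_score(res, template_env[offset + i])
                    for i, res in enumerate(query))
        if best is None or total < best[0]:
            best = (total, offset)
    return best

def rank_folds(query, library):
    """Return (score, fold_name, offset) tuples, best fit first."""
    results = []
    for name, env in library.items():
        hit = thread(query, env)
        if hit is not None:
            results.append((hit[0], name, hit[1]))
    return sorted(results)

library = {"fold_a": "EBBEEBBE", "fold_b": "BEEBBEEB"}  # hypothetical folds
query = "KLLDEVIK"
for score_, name, offset in rank_folds(query, library):
    print(name, score_, offset)
```

Practical threading programs additionally allow insertions and deletions in the alignment and use pairwise interaction terms, both of which make the optimization considerably harder than this sliding-window search.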

Limitations of De novo Prediction Methods

A major limitation of de novo protein prediction methods is the extraordinary amount of computer time required to successfully solve for the native conformation of a protein. Distributed methods, such as Rosetta@home, have attempted to ameliorate this by recruiting individuals who volunteer idle home computer time to process data. Even these methods face challenges, however. For example, a distributed method was used by a team of researchers at the University of Washington and the Howard Hughes Medical Institute to predict the tertiary structure of the protein T0283 from its amino acid sequence. In a blind test comparing the result of this distributed technique with the experimentally confirmed structure deposited in the Protein Data Bank (PDB), the predictor produced excellent agreement with the deposited structure. However, the time and number of computers required for this feat were enormous: almost two years and approximately 70,000 home computers, respectively. [13]
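Agreement between a predicted and an experimental structure is commonly summarized as a root-mean-square deviation (RMSD) over matched atoms, usually reported in angstroms. The text does not say which measure was used for T0283, so the Python sketch below is a general illustration only. It also assumes the two coordinate sets are already optimally superimposed and listed in the same atom order; the superposition step itself is omitted.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z)
    coordinates that are assumed to be already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
experimental = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(predicted, experimental))  # every atom is displaced by 1.0, so RMSD is 1.0
```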

One method proposed to overcome such limitations involves the use of Markov models (see Markov chain Monte Carlo). Such models could be constructed to assist with free energy computation and protein structure prediction, perhaps by refining computational simulations. [14] Another way of circumventing the limits of computational power is coarse-grained modeling. Coarse-grained protein models allow de novo structure prediction of small proteins, or large protein fragments, in a short computational time. [15]
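The saving from coarse-graining comes from replacing many atoms with a few interaction sites. A minimal Python sketch, assuming one bead per residue placed at the centroid of that residue's atoms (real coarse-grained models differ in how many beads they use and where they are placed), might look like this:

```python
from collections import defaultdict

def coarse_grain(atoms):
    """Collapse atoms into one bead per residue.

    `atoms` is an iterable of (residue_index, x, y, z) tuples; the result
    maps each residue index to the centroid of its atoms.
    """
    groups = defaultdict(list)
    for residue, x, y, z in atoms:
        groups[residue].append((x, y, z))
    return {
        residue: tuple(sum(p[k] for p in points) / len(points) for k in range(3))
        for residue, points in sorted(groups.items())
    }

atoms = [
    (1, 0.0, 0.0, 0.0),
    (1, 2.0, 0.0, 0.0),
    (2, 1.0, 1.0, 1.0),
]
print(coarse_grain(atoms))
```

With fewer sites per residue, each energy evaluation is cheaper and the conformational space is smaller, at the cost of atomic detail that must be rebuilt afterwards.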

An example of distributed computing (Rosetta) in predicting the 3D structure of a protein from its amino-acid sequence. The predicted structure (magenta) of a protein is overlaid with the experimentally determined crystal structure (blue) of that protein. The agreement between the two is very good.

CASP

“Progress for all variants of computational protein structure prediction methods is assessed in the biannual, community wide Critical Assessment of Protein Structure Prediction (CASP) experiments. In the CASP experiments, research groups are invited to apply their prediction methods to amino acid sequences for which the native structure is not known but to be determined and to be published soon. Even though the number of amino acid sequences provided by the CASP experiments is small, these competitions provide a good measure to benchmark methods and progress in the field in an arguably unbiased manner.” [16]

References

  1. "Editorial: So much more to know". Science. 309 (5731): 78–102. 2005. doi:10.1126/science.309.5731.78b. PMID 15994524.
  2. Dill, Ken A.; et al. (2007). "The protein folding problem: when will it be solved?". Current Opinion in Structural Biology. 17 (3): 342–346. doi:10.1016/j.sbi.2007.06.001. PMID 17572080.
  3. Rigden, Daniel J. From Protein Structure to Function with Bioinformatics. Springer Science. 2009. ISBN 978-1-4020-9057-8.
  4. Yonath, Ada. X-ray crystallography at the heart of life science. Current Opinion in Structural Biology. Volume 21, Issue 5, October 2011, Pages 622–626.
  5. Samudrala, R; Moult, J (1998). "An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction". Journal of Molecular Biology. 275 (5): 893–914. CiteSeerX 10.1.1.70.4101. doi:10.1006/jmbi.1997.1479. PMID 9480776.
  6. Nelson, David L. and Cox, Michael M. Lehninger Principles of Biochemistry 5th Edition. W. H. Freeman; June 15, 2008. ISBN 1429224169.
  7. "The Baker Laboratory". Archived from the original on 2012-11-13.
  8. "Rosetta News Article".
  9. Samudrala, R; Xia, Y; Huang, ES; Levitt, M (1999). "Ab initio prediction of protein structure using a combined hierarchical approach". Proteins: Structure, Function, and Genetics. S3 (S3): 194–198. doi:10.1002/(SICI)1097-0134(1999)37:3+<194::AID-PROT24>3.0.CO;2-F. S2CID 1566472.
  10. Xu D, Zhang Y (July 2012). "Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field". Proteins. 80 (7): 1715–35. doi:10.1002/prot.24065. PMC 3370074. PMID 22411565.
  11. Xu D, Zhang J, Roy A, Zhang Y (Aug 2011). "Automated protein structure modeling in CASP9 by I-TASSER pipeline combined with QUARK-based ab initio folding and FG-MD-based structure refinement". Proteins. 79 Suppl 10 (Suppl 10): 147–60. doi:10.1002/prot.23111. PMC 3228277. PMID 22069036.
  12. Gibson, Greg and Muse, Spencer V. A Primer of Genome Science 3rd edition. Sinauer Associates, Inc. 2009. ISBN   978-0-87893-236-8.
  13. Qian et al. High-resolution structure prediction and the crystallographic phase problem. (2007). Nature. Volume 450.
  14. Jayachandran, Guha et al. (2006). Using massively parallel simulation and Markovian models to study protein folding: Examining the dynamics of the villin headpiece. Published online.
  15. Kmiecik, Sebastian; Gront, Dominik; Kolinski, Michal; Wieteska, Lukasz; Dawid, Aleksandra Elzbieta; Kolinski, Andrzej (2016-06-22). "Coarse-Grained Protein Models and Their Applications". Chemical Reviews. 116 (14): 7898–936. doi:10.1021/acs.chemrev.6b00163. ISSN 0009-2665. PMID 27333362.
  16. C.A. Floudas et al. Advances in protein structure prediction and de novo protein design: A review. Chemical Engineering Science 61 (2006) 966 – 988.