Template modeling score

Last updated April 20, 2023

In bioinformatics, the template modeling score or TM-score is a measure of similarity between two protein structures. The TM-score is intended as a more accurate measure of the global similarity of full-length protein structures than the often used RMSD measure. The TM-score indicates the similarity between two structures by a score between $(0,1]$ , where 1 indicates a perfect match between two structures (thus the higher the better).^[1] Generally scores below 0.20 corresponds to randomly chosen unrelated proteins whereas structures with a score higher than 0.5 assume roughly the same fold.^[2] A quantitative study ^[3] shows that proteins of TM-score = 0.5 have a posterior probability of 37% in the same CATH topology family and of 13% in the same SCOP fold family. The probabilities increase rapidly when TM-score > 0.5. The TM-score is designed to be independent of protein lengths.

The TM-score equation

TM-score between two protein structures (e.g., a template structure and a target structure) is defined by

{\text{TM-score}}=\max \left[{\frac {1}{L_{\text{target}}}}\sum _{i}^{L_{\text{common}}}{\frac {1}{1+\left({\frac {d_{i}}{d_{0}(L_{\text{target}})}}\right)^{2}}}\right]

where $L_{\text{target}}$ is the length of the amino acid sequence of the target protein, and $L_{\text{common}}$ is the number of residues that appear in both the template and target structures. $d_{i}$ is the distance between the $i$ th pair of residues in the template and target structures, and $d_{0}(L_{\text{target}})=1.24{\sqrt[{3}]{L_{\text{target}}-15}}-1.8$ is a distance scale that normalizes distances.

When comparing two protein structures that have the same residue order, $L_{\text{common}}$ reads from the C-alpha order number of the structure files (i.e., Column 23-26 in Protein Data Bank (file format)). When comparing two protein structures that have different sequences and/or different residue orders, a structural alignment is usually performed first, and TM-score is then calculated on the commonly aligned residues from the structural alignment.

Other measures

An often used structural similarity measure is root-mean-square deviation (RMSD). Because RMSD $={\sqrt {\sum _{i=1}^{L}d_{i}^{2}/{L}}}$ is calculated as an average of distance error ( $d_{i}$ ) with equal weight over all residue pairs, a large local error on a few residue pairs can result in a quite large RMSD. On the other hand, by putting $d_{i}$ in the denominator, TM-score naturally weights smaller distance errors more strongly than larger distance errors. Therefore, TM-score value is more sensitive to the global structural similarity rather than to the local structural errors, compared to RMSD. Another advantage of TM-score is the introduction of the scale $d_{0}(L_{\text{target}})=1.24{\sqrt[{3}]{L_{\text{target}}-15}}-1.8$ which makes the magnitude of TM-score length-independent for random structure pairs, while RMSD and most other measures are length-dependent metrics.

The Global Distance Test (GDT) algorithm, and its GDT TS score to represent "total score", is another measure of similarity between two protein structures with known amino acid correspondences (e.g. identical amino acid sequences) but different tertiary structures.^[4] GDT score has the same length-dependence issue as RMSD, because the average GDT score for random structure pairs has a power-law dependence on the protein size.^[1]

Related Research Articles

Protein engineering is the process of developing useful or valuable protein and design and production of unnatural polypeptides, often by altering amino acid sequences found in nature. It is a young discipline, with much research taking place into the understanding of protein folding and recognition for protein design principles. It has been used to improve the function of many enzymes for industrial catalysis. It is also a product and services market, with an estimated value of $168 billion by 2017.

<span class="mw-page-title-main">Protein structure prediction</span> Type of biological prediction

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by computational biology; and it is important in medicine and biotechnology.

In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences.

Critical Assessment of Structure Prediction (CASP), sometimes called Critical Assessment of Protein Structure Prediction, is a community-wide, worldwide experiment for protein structure prediction taking place every two years since 1994. CASP provides research groups with an opportunity to objectively test their structure prediction methods and delivers an independent assessment of the state of the art in protein structure modeling to the research community and software users. Even though the primary goal of CASP is to help advance the methods of identifying protein three-dimensional structure from its amino acid sequence many view the experiment more as a “world championship” in this field of science. More than 100 research groups from all over the world participate in CASP on a regular basis and it is not uncommon for entire groups to suspend their other research for months while they focus on getting their servers ready for the experiment and on performing the detailed predictions.

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

Structural bioinformatics is the branch of bioinformatics that is related to the analysis and prediction of the three-dimensional structure of biological macromolecules such as proteins, RNA, and DNA. It deals with generalizations about macromolecular 3D structures such as comparisons of overall folds and local motifs, principles of molecular folding, evolution, binding interactions, and structure/function relationships, working both from experimentally solved structures and from computational models. The term structural has the same meaning as in structural biology, and structural bioinformatics can be seen as a part of computational structural biology. The main objective of structural bioinformatics is the creation of new methods of analysing and manipulating biological macromolecular data in order to solve problems in biology and generate new knowledge.

A Gap penalty is a method of scoring alignments of two or more sequences. When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can. However, minimizing gaps in an alignment is important to create a useful alignment. Too many gaps can cause an alignment to become meaningless. Gap penalties are used to adjust alignment scores based on the number and length of gaps. The five main types of gap penalties are constant, linear, affine, convex, and profile-based.

In molecular biology, protein threading, also known as fold recognition, is a method of protein modeling which is used to model those proteins which have the same fold as proteins of known structures, but do not have homologous proteins with known structure. It differs from the homology modeling method of structure prediction as it is used for proteins which do not have their homologous protein structures deposited in the Protein Data Bank (PDB), whereas homology modeling is used for those proteins which do. Threading works by using statistical knowledge of the relationship between the structures deposited in the PDB and the sequence of the protein which one wishes to model.

In bioinformatics, the root-mean-square deviation of atomic positions, or simply root-mean-square deviation (RMSD), is the measure of the average distance between the atoms of superimposed proteins. Note that RMSD calculation can be applied to other, non-protein molecules, such as small organic molecules. In the study of globular protein conformations, one customarily measures the similarity in three-dimensional structure by the RMSD of the Cα atomic coordinates after optimal rigid body superposition.

T-Coffee is a multiple sequence alignment software using a progressive approach. It generates a library of pairwise alignments to guide the multiple sequence alignment. It can also combine multiple sequences alignments obtained previously and in the latest versions can use structural information from PDB files (3D-Coffee). It has advanced features to evaluate the quality of the alignments and some capacity for identifying occurrence of motifs (Mocca). It produces alignment in the aln format (Clustal) by default, but can also produce PIR, MSF, and FASTA format. The most common input formats are supported.

Homology modeling, also known as comparative modeling of protein, refers to constructing an atomic-resolution model of the "target" protein from its amino acid sequence and an experimental three-dimensional structure of a related homologous protein. Homology modeling relies on the identification of one or more known protein structures likely to resemble the structure of the query sequence, and on the production of an alignment that maps residues in the query sequence to residues in the template sequence. It has been seen that protein structures are more conserved than protein sequences amongst homologues, but sequences falling below a 20% sequence identity can have very different structure.

The global distance test (GDT), also written as GDT_TS to represent "total score", is a measure of similarity between two protein structures with known amino acid correspondences but different tertiary structures. It is most commonly used to compare the results of protein structure prediction to the experimentally determined structure as measured by X-ray crystallography, protein NMR, or, increasingly, cryoelectron microscopy. The metric was developed by Adam Zemla at Lawrence Livermore National Laboratory and originally implemented in the Local-Global Alignment (LGA) program. It is intended as a more accurate measurement than the common root-mean-square deviation (RMSD) metric - which is sensitive to outlier regions created, for example, by poor modeling of individual loop regions in a structure that is otherwise reasonably accurate. The conventional GDT_TS score is computed over the alpha carbon atoms and is reported as a percentage, ranging from 0 to 100. In general, the higher the GDT_TS score, the more closely a model approximates a given reference structure.

The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed. The RMSD represents the square root of the second sample moment of the differences between predicted values and observed values or the quadratic mean of these differences. These deviations are called residuals when the calculations are performed over the data sample that was used for estimation and are called errors when computed out-of-sample. The RMSD serves to aggregate the magnitudes of the errors in predictions for various data points into a single measure of predictive power. RMSD is a measure of accuracy, to compare forecasting errors of different models for a particular dataset and not between datasets, as it is scale-dependent.

CS-BLAST (Context-Specific BLAST) is a tool that searches a protein sequence that extends BLAST, using context-specific mutation probabilities. More specifically, CS-BLAST derives context-specific amino-acid similarities on each query sequence from short windows on the query sequences [4]. Using CS-BLAST doubles sensitivity and significantly improves alignment quality without a loss of speed in comparison to BLAST. CSI-BLAST is the context-specific analog of PSI-BLAST, which computes the mutation profile with substitution probabilities and mixes it with the query profile [2]. CSI-BLAST is the context specific analog of PSI-BLAST. Both of these programs are available as web-server and are available for free download.

The sequential structure alignment program (SSAP) in chemistry, physics, and biology is a method that uses double dynamic programming to produce a structural alignment based on atom-to-atom vectors in structure space. Instead of the alpha carbons typically used in structural alignment, SSAP constructs its vectors from the beta carbons for all residues except glycine, a method which thus takes into account the rotameric state of each residue as well as its location along the backbone. SSAP works by first constructing a series of inter-residue distance vectors between each residue and its nearest non-contiguous neighbors on each protein. A series of matrices are then constructed containing the vector differences between neighbors for each pair of residues for which vectors were constructed. Dynamic programming applied to each resulting matrix determines a series of optimal local alignments which are then summed into a "summary" matrix to which dynamic programming is applied again to determine the overall structural alignment.

I-sites are short sequence-structure motifs that are mined from the Protein Data Bank (PDB) that correlate strongly with three-dimensional structural elements. These sequence-structure motifs are used for the local structure prediction of proteins. Local structure can be expressed as fragments or as backbone angles. Locations in the protein sequence that have high confidence I-sites predictions may be the initiation sites of folding. I-sites have also been identified as discrete models for folding pathways. I-sites consist of about 250 motifs. Each motif has an amino acid profile, a fragment structure and optionally, a 4-dimensional tensor of pairwise sequence covariance.

RaptorX is a software and web server for protein structure and function prediction that is free for non-commercial use. RaptorX is among the most popular methods for protein structure prediction. Like other remote homology recognition/protein threading techniques, RaptorX is able to regularly generate reliable protein models when the widely used PSI-BLAST cannot. However, RaptorX is also significantly different from those profile-based methods in that RaptorX excels at modeling of protein sequences without a large number of sequence homologs by exploiting structure information. RaptorX Server has been designed to ensure a user-friendly interface for users inexpert in protein structure prediction methods.

<span class="mw-page-title-main">I-TASSER</span>

I-TASSER is a bioinformatics method for predicting three-dimensional structure model of protein molecules from amino acid sequences. It detects structure templates from the Protein Data Bank by a technique called fold recognition. The full-length structure models are constructed by reassembling structural fragments from threading templates using replica exchange Monte Carlo simulations. I-TASSER is one of the most successful protein structure prediction methods in the community-wide CASP experiments.

SuperPose is a freely available web server designed to perform both pairwise and multiple protein structure superpositions. The “Structural superposition” term refers to the rotations and translations performed on one structure to make it match or align with another structure or structures. Structural superposition can be quantified either in terms of similarity or difference measures. The optimal superposition is the one in which the similarity measure is maximized or the difference measure is minimized. The “SuperPose” web server uses “RMSD” or Root-Mean-Square Deviation as a difference measure to find the optimal pairwise or multiple protein structure superposition. After an initial sequence and secondary structure alignment, SuperPose generates a Difference Distance (DD) matrix from the equivalent C-alpha atoms of two molecules. The sequence/structure alignment and DD matrix analysis information is then fed into a modified quaternion eigenvalue algorithm to rapidly perform the structural superposition and calculate the RMSD between aligned regions of two macromolecules.

AlphaFold is an artificial intelligence (AI) program developed by DeepMind, a subsidiary of Alphabet, which performs predictions of protein structure. The program is designed as a deep learning system.

References

1 2 Zhang Y and Skolnick J (2004). "Scoring function for automated assessment of protein structure template quality". Proteins. 57 (4): 702–710. doi:10.1002/prot.20264. PMID 15476259. S2CID 7954787.
↑ Zhang Y and Skolnick J (2005). "TM-align: a protein structure alignment algorithm based on the TM-score". Nucleic Acids Res. 33 (7): 2302–2309. doi:10.1093/nar/gki524. PMC 1084323 . PMID 15849316.
↑ Xu J and Zhang Y (2010). "How significant is a protein structure similarity with TM-score = 0.5?". Bioinformatics. 26 (7): 889–895. doi:10.1093/bioinformatics/btq066. PMC 2913670 . PMID 20164152.
↑ Zemla A (2003). "LGA: A method for finding 3D similarities in protein structures". Nucleic Acids Research. 31 (13): 3370–3374. doi:10.1093/nar/gkg571. PMC 168977 . PMID 12824330.

External links

TM-score webserver — by the Yang Zhang research group. Calculates TM-score and supplies source code.
GDT and LGA description services and documentation on structure comparison and similarity measures.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[zhang2004-1] 1 2 Zhang Y and Skolnick J (2004). "Scoring function for automated assessment of protein structure template quality". Proteins. 57 (4): 702–710. doi:10.1002/prot.20264. PMID 15476259. S2CID 7954787.

[zhang2005-2] Zhang Y and Skolnick J (2005). "TM-align: a protein structure alignment algorithm based on the TM-score". Nucleic Acids Res. 33 (7): 2302–2309. doi:10.1093/nar/gki524. PMC 1084323 . PMID 15849316.

[xu2010-3] Xu J and Zhang Y (2010). "How significant is a protein structure similarity with TM-score = 0.5?". Bioinformatics. 26 (7): 889–895. doi:10.1093/bioinformatics/btq066. PMC 2913670 . PMID 20164152.

[Zemla2003-4] Zemla A (2003). "LGA: A method for finding 3D similarities in protein structures". Nucleic Acids Research. 31 (13): 3370–3374. doi:10.1093/nar/gkg571. PMC 168977 . PMID 12824330.

[1]

[2]

[3]

[4]