Threading (protein sequence)

Protein threading, also known as fold recognition, is a method of protein modeling used for proteins that have the same fold as proteins of known structure but have no homologous proteins of known structure. It differs from homology modeling in that it is used for proteins whose homologs have no structures deposited in the Protein Data Bank (PDB), whereas homology modeling is used for proteins whose homologs do. Threading works by using statistical knowledge of the relationship between the structures deposited in the PDB and the sequence of the protein to be modeled.

Protein biological molecule consisting of chains of amino acid residues

Proteins are large biomolecules, or macromolecules, consisting of one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, responding to stimuli, providing structure to cells and organisms, and transporting molecules from one location to another. Proteins differ from one another primarily in their sequence of amino acids, which is dictated by the nucleotide sequence of their genes, and which usually results in protein folding into a specific three-dimensional structure that determines its activity.

Protein folding the process of assisting in the covalent and noncovalent assembly of single chain polypeptides or multisubunit complexes into the correct tertiary structure

Protein folding is the physical process by which a protein chain acquires its native 3-dimensional structure, a conformation that is usually biologically functional, in an expeditious and reproducible manner. It is the physical process by which a polypeptide folds into its characteristic and functional three-dimensional structure from random coil. Each protein exists as an unfolded polypeptide or random coil when translated from a sequence of mRNA to a linear chain of amino acids. This polypeptide lacks any stable (long-lasting) three-dimensional structure. As the polypeptide chain is being synthesized by a ribosome, the linear chain begins to fold into its three-dimensional structure. Folding begins to occur even during translation of the polypeptide chain. Amino acids interact with each other to produce a well-defined three-dimensional structure, the folded protein, known as the native state. The resulting three-dimensional structure is determined by the amino acid sequence or primary structure.

Homology modeling method of protein structure prediction

Homology modeling, also known as comparative modeling of protein, refers to constructing an atomic-resolution model of the "target" protein from its amino acid sequence and an experimental three-dimensional structure of a related homologous protein. Homology modeling relies on the identification of one or more known protein structures likely to resemble the structure of the query sequence, and on the production of an alignment that maps residues in the query sequence to residues in the template sequence. It has been shown that protein structures are more conserved than protein sequences amongst homologues, but sequences falling below a 20% sequence identity can have very different structure.


The prediction is made by "threading" (i.e. placing, aligning) each amino acid in the target sequence to a position in the template structure, and evaluating how well the target fits the template. After the best-fit template is selected, the structural model of the sequence is built based on the alignment with the chosen template. Protein threading is based on two basic observations: that the number of different folds in nature is fairly small (approximately 1300); and that 90% of the new structures submitted to the PDB in the past three years have similar structural folds to ones already in the PDB.

Amino acid Organic compounds containing amine and carboxylic groups

Amino acids are organic compounds containing amine (-NH2) and carboxyl (-COOH) functional groups, along with a side chain (R group) specific to each amino acid. The key elements of an amino acid are carbon (C), hydrogen (H), oxygen (O), and nitrogen (N), although other elements are found in the side chains of certain amino acids. About 500 naturally occurring amino acids are known (though only 20 appear in the genetic code) and can be classified in many ways. They can be classified according to the core structural functional groups' locations as alpha- (α-), beta- (β-), gamma- (γ-) or delta- (δ-) amino acids; other categories relate to polarity, pH level, and side chain group type (aliphatic, acyclic, aromatic, containing hydroxyl or sulfur, etc.). In the form of proteins, amino acid residues form the second-largest component (water is the largest) of human muscles and other tissues. Beyond their role as residues in proteins, amino acids participate in a number of processes such as neurotransmitter transport and biosynthesis.

Classification of protein structure

The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the structural and evolutionary relationships between proteins of known structure. Proteins are classified to reflect both structural and evolutionary relatedness. Many levels exist in the hierarchy, but the principal levels are family, superfamily and fold, as described below.

Protein family certain functional class or family of proteins

A protein family is a group of evolutionarily-related proteins. In many cases a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term protein family should not be confused with family as it is used in taxonomy.

A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred. Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence similarity is evident. Sequence homology can then be deduced even if not apparent. Superfamilies typically contain several protein families which show sequence similarity within each family. The term protein clan is commonly used for protease and glycosyl hydrolases superfamilies based on the MEROPS and CAZy classification systems.

Family (clear evolutionary relationship): Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.

Globin family of globular proteins

The globins are a superfamily of heme-containing globular proteins, involved in binding and/or transporting oxygen. These proteins all incorporate the globin fold, a series of eight alpha helical segments. Two prominent members include myoglobin and hemoglobin. Both of these proteins reversibly bind oxygen via a heme prosthetic group. They are widely distributed in many organisms.

Superfamily (probable common evolutionary origin): Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.

Actin motor protein involved in muscle contraction

Actin is a family of globular multi-functional proteins that form microfilaments. It is found in essentially all eukaryotic cells, where it may be present at a concentration of over 100 μM; its mass is roughly 42 kDa, with a diameter of 4 to 7 nm.


ATPases (EC, adenylpyrophosphatase, ATP monophosphatase, triphosphatase, SV40 T-antigen, adenosine 5'-triphosphatase, ATP hydrolase, complex V (mitochondrial electron transport), (Ca2+ + Mg2+)-ATPase, HCO3-ATPase, adenosine triphosphatase) are a class of enzymes that catalyze the decomposition of ATP into ADP and a free phosphate ion or the inverse reaction. This dephosphorylation reaction releases energy, which the enzyme (in most cases) harnesses to drive other chemical reactions that would not otherwise occur. This process is widely used in all known forms of life.

Heat shock proteins (HSP) are a family of proteins that are produced by cells in response to exposure to stressful conditions. They were first described in relation to heat shock, but are now known to also be expressed during other stresses including exposure to cold, UV light, and during wound healing or tissue remodeling. Many members of this group perform chaperone function by stabilizing new proteins to ensure correct folding or by helping to refold proteins that were damaged by the cell stress. This increase in expression is transcriptionally regulated. The dramatic upregulation of the heat shock proteins is a key part of the heat shock response and is induced primarily by heat shock factor (HSF). HSPs are found in virtually all living organisms, from bacteria to humans.

Fold (major structural similarity): Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favoring certain packing arrangements and chain topologies.


A general paradigm of protein threading consists of the following four steps:

The construction of a structure template database: Select protein structures from the protein structure databases as structural templates. This generally involves selecting protein structures from databases such as PDB, FSSP, SCOP, or CATH, after removing protein structures with high sequence similarities.

The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cryo-electron microscopy, and submitted by biologists and biochemists from around the world, are freely accessible on the Internet via the websites of its member organisations. The PDB is overseen by an organization called the Worldwide Protein Data Bank, wwPDB.

The design of the scoring function: Design a good scoring function to measure the fitness between target sequences and templates based on the knowledge of the known relationships between the structures and the sequences. A good scoring function should contain mutation potential, environment fitness potential, pairwise potential, secondary structure compatibilities, and gap penalties. The quality of the energy function is closely related to the prediction accuracy, especially the alignment accuracy.

Threading alignment: Align the target sequence with each of the structure templates by optimizing the designed scoring function. When the scoring function takes the pairwise contact potential into account, this step is one of the major tasks of any threading-based structure prediction program; when it does not, a dynamic programming algorithm is sufficient.

Threading prediction: Select the threading alignment that is statistically most probable as the threading prediction. Then construct a structure model for the target by placing the backbone atoms of the target sequence at their aligned backbone positions of the selected structural template.
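The four steps above can be illustrated with a minimal sketch. It assumes a deliberately reduced scoring function (an environment fitness term plus a linear gap penalty, with no pairwise potential), so that ordinary global dynamic programming is enough. The residue classes, score values, template names and environment strings are all hypothetical, and the best template is picked by raw score rather than by statistical significance.

```python
# Step 1 (assumed toy data): a template library where each structure is
# reduced to a string of environments, 'B' = buried, 'E' = exposed.
LIBRARY = {"fold_a": "BBEEBB", "fold_b": "EEBBEE"}

HYDROPHOBIC = set("AVLIMFWC")  # assumed, simplified residue classes
GAP = -2.0                     # assumed linear gap penalty


def env_score(residue, env):
    """Step 2: toy environment fitness of one residue in one position."""
    hydrophobic = residue in HYDROPHOBIC
    if env == "B":
        return 1.0 if hydrophobic else -1.0
    return -0.5 if hydrophobic else 0.5


def thread(target, template_env):
    """Step 3: global alignment of the target to one template by dynamic
    programming (Needleman-Wunsch recurrence); returns the best score."""
    n, m = len(target), len(template_env)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * GAP
    for j in range(1, m + 1):
        dp[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + env_score(target[i - 1], template_env[j - 1]),
                dp[i - 1][j] + GAP,  # target residue left unaligned
                dp[i][j - 1] + GAP,  # template position skipped
            )
    return dp[n][m]


def best_template(target, library):
    """Step 4 (simplified): pick the template with the highest raw score."""
    scores = {name: thread(target, env) for name, env in library.items()}
    return max(scores, key=scores.get), scores


name, scores = best_template("LVKDIA", LIBRARY)
print(name, scores)
```

Here the hydrophobic residues of the target line up with the buried positions of `fold_a`, so that template wins. A realistic scoring function would add the other terms listed in step 2.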

Comparison with homology modeling

Homology modeling and protein threading are both template-based methods, and there is no rigorous boundary between them in terms of prediction techniques. What differs is the kind of target. Homology modeling is for targets that have homologous proteins of known structure (usually in the same family), while protein threading is for targets for which only fold-level homology is found. In other words, homology modeling is for "easier" targets and protein threading is for "harder" targets.

Homology modeling treats the template in an alignment as a sequence, and only sequence homology is used for prediction. Protein threading treats the template in an alignment as a structure, and both sequence and structure information extracted from the alignment are used for prediction. When there is no significant homology found, protein threading can make a prediction based on the structure information. That also explains why protein threading may be more effective than homology modeling in many cases.

In practice, when the sequence identity in a sequence-sequence alignment is low (i.e. <25%), homology modeling may not produce a significant prediction. In this case, if distant homology is found for the target, protein threading can generate a good prediction.
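As a rough illustration of the 25% rule of thumb above, the following sketch computes percent identity from a pairwise alignment and applies the cutoff. Identity is defined here over columns in which neither sequence has a gap, which is only one of several conventions in use; the function names and the hard threshold are assumptions made for the example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over alignment columns that contain no gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def suggest_method(identity, threshold=25.0):
    """Hypothetical helper: apply the rule of thumb from the text."""
    return "homology modeling" if identity >= threshold else "threading"


identity = percent_identity("ACDE", "ACQQ")
print(identity, suggest_method(identity))
```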

More about threading

Fold recognition methods can be broadly divided into two types: 1, those that derive a 1-D profile for each structure in the fold library and align the target sequence to these profiles; and 2, those that consider the full 3-D structure of the protein template. A simple example of a profile representation would be to take each amino acid in the structure and simply label it according to whether it is buried in the core of the protein or exposed on the surface. More elaborate profiles might take into account the local secondary structure (e.g. whether the amino acid is part of an alpha helix) or even evolutionary information (how conserved the amino acid is). In the 3-D representation, the structure is modeled as a set of inter-atomic distances, i.e. the distances are calculated between some or all of the atom pairs in the structure. This is a much richer and far more flexible description of the structure, but is much harder to use in calculating an alignment. The profile-based fold recognition approach was first described by Bowie, Lüthy and David Eisenberg in 1991. [1] The term threading was first coined by David Jones, William R. Taylor and Janet Thornton in 1992, [2] and originally referred specifically to the use of a full 3-D structure atomic representation of the protein template in fold recognition. Today, the terms threading and fold recognition are frequently (though somewhat incorrectly) used interchangeably.
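The two representations described above can be sketched side by side. The code below assumes one representative atom per residue (for example the alpha carbon) given as an (x, y, z) tuple, and it labels a residue as buried when enough other residues lie within a cutoff radius. That neighbor count is a crude stand-in for a real burial measure such as solvent accessibility, and the radius and threshold are arbitrary illustrative values.

```python
import math


def distance_matrix(coords):
    """3-D representation: all pairwise distances between residue atoms."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def burial_profile(coords, radius=8.0, min_neighbors=3):
    """1-D profile: 'B' (buried) or 'E' (exposed) for each residue,
    based on how many other residues fall within `radius`."""
    dm = distance_matrix(coords)
    labels = []
    for i, row in enumerate(dm):
        neighbors = sum(1 for j, d in enumerate(row) if j != i and d <= radius)
        labels.append("B" if neighbors >= min_neighbors else "E")
    return "".join(labels)


# Hypothetical coordinates: four residues packed together, one far away.
coords = [(0, 0, 0), (3, 0, 0), (0, 3, 0), (0, 0, 3), (30, 0, 0)]
dm = distance_matrix(coords)
profile = burial_profile(coords)
print(profile)
```

The profile is a short string that a sequence can be aligned against directly, whereas the distance matrix keeps far more information but has no such simple alignment procedure.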

Fold recognition methods are widely used and effective because it is believed that there are a strictly limited number of different protein folds in nature, mostly as a result of evolution but also due to constraints imposed by the basic physics and chemistry of polypeptide chains. There is, therefore, a good chance (currently 70-80%) that a protein which has a similar fold to the target protein has already been studied by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy and can be found in the PDB. Currently there are nearly 1300 different protein folds known, but new folds are still being discovered every year due in significant part to the ongoing structural genomics projects.

Many different algorithms have been proposed for finding the correct threading of a sequence onto a structure, and many make use of dynamic programming in some form. For full 3-D threading, identifying the best alignment is very difficult (it is an NP-hard problem for some models of threading).[citation needed] Researchers have applied many combinatorial optimization methods, such as conditional random fields, simulated annealing, branch and bound, and linear programming, to arrive at heuristic solutions. It is interesting to compare threading methods with methods that attempt to align two protein structures (protein structural alignment), and indeed many of the same algorithms have been applied to both problems.
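To show why pairwise terms call for heuristic search, here is a minimal simulated annealing sketch. It is not the method of any particular published program. It assumes that each target residue is placed on a distinct template position in sequence order, that template contacts are given as pairs of positions, and that the energy is the sum of a hypothetical pairwise potential over occupied contacts (lower is better). The cooling schedule and move set are arbitrary choices.

```python
import math
import random


def energy(target, positions, contacts, potential):
    """Sum the pairwise potential over contacts whose two template
    positions are both occupied by target residues."""
    placed = dict(zip(positions, target))
    total = 0.0
    for p, q in contacts:
        if p in placed and q in placed:
            total += potential(placed[p], placed[q])
    return total


def anneal(target, template_len, contacts, potential,
           steps=2000, t0=2.0, seed=0):
    """Heuristic search over order-preserving placements of the target."""
    rng = random.Random(seed)
    positions = list(range(len(target)))  # start: gapless, offset 0
    current = energy(target, positions, contacts, potential)
    best, best_positions = current, positions[:]
    for step in range(steps):
        temperature = t0 * (1.0 - step / steps) + 1e-6  # linear cooling
        k = rng.randrange(len(positions))
        new_pos = positions[k] + rng.choice((-1, 1))
        lower = positions[k - 1] if k > 0 else -1
        upper = positions[k + 1] if k + 1 < len(positions) else template_len
        if not lower < new_pos < upper:
            continue  # move would break sequence order or leave the template
        old_pos = positions[k]
        positions[k] = new_pos
        candidate = energy(target, positions, contacts, potential)
        delta = candidate - current
        # Metropolis criterion: always accept improvements, sometimes
        # accept worse placements while the temperature is high.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if current < best:
                best, best_positions = current, positions[:]
        else:
            positions[k] = old_pos
    return best, best_positions


def toy_potential(a, b):
    """Hypothetical potential: reward hydrophobic-hydrophobic contacts."""
    hydrophobic = "AVLIMFWC"
    return -1.0 if a in hydrophobic and b in hydrophobic else 0.0


contacts = [(0, 4)]  # assumed template contact between positions 0 and 4
best, placement = anneal("LKL", 5, contacts, toy_potential)
print(best, placement)
```

Because the residue occupying one end of a contact depends on alignment choices made elsewhere, the score of a prefix cannot be fixed independently, which is what defeats a plain dynamic programming recurrence.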

Protein threading software

See also

Related Research Articles

Sequence alignment

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the edit distance cost between strings in a natural language or in financial data.

Structural genomics

Structural genomics seeks to describe the 3-dimensional structure of every protein encoded by a given genome. This genome-based approach allows for a high-throughput method of structure determination by a combination of experimental and modeling approaches. The principal difference between structural genomics and traditional structural prediction is that structural genomics attempts to determine the structure of every protein encoded by the genome, rather than focusing on one particular protein. With full-genome sequences available, structure prediction can be done more quickly through a combination of experimental and modeling approaches, especially because the availability of a large number of sequenced genomes and previously solved protein structures allows scientists to model protein structure on the structures of previously solved homologs.

Protein structure prediction

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its folding and its secondary and tertiary structure from its primary structure. Structure prediction is fundamentally different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry; it is highly important in medicine and biotechnology. Every two years, the performance of current methods is assessed in the CASP experiment. A continuous evaluation of protein structure prediction web servers is performed by the community project CAMEO3D.


Critical Assessment of protein Structure Prediction, or CASP, is a community-wide, worldwide experiment for protein structure prediction taking place every two years since 1994. CASP provides research groups with an opportunity to objectively test their structure prediction methods and delivers an independent assessment of the state of the art in protein structure modeling to the research community and software users. Even though the primary goal of CASP is to help advance the methods of identifying protein three-dimensional structure from its amino acid sequence, many view the experiment more as a “world championship” in this field of science. More than 100 research groups from all over the world participate in CASP on a regular basis and it is not uncommon for entire groups to suspend their other research for months while they focus on getting their servers ready for the experiment and on performing the detailed predictions.

Structural alignment

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

Structural bioinformatics The branch of bioinformatics concerned with the analysis and prediction of the three-dimensional structure of biological macromolecules

Structural bioinformatics is the branch of bioinformatics which is related to the analysis and prediction of the three-dimensional structure of biological macromolecules such as proteins, RNA, and DNA. It deals with generalizations about macromolecular 3D structure such as comparisons of overall folds and local motifs, principles of molecular folding, evolution, and binding interactions, and structure/function relationships, working both from experimentally solved structures and from computational models. The term structural has the same meaning as in structural biology, and structural bioinformatics can be seen as a part of computational structural biology.

This list of structural comparison and alignment software is a compilation of software tools and web portals used in pairwise or multiple structural comparison and structural alignment.

Loop modeling is a problem in protein structure prediction requiring the prediction of the conformations of loop regions in proteins with or without the use of a structural template. Computer programs that solve these problems have been used to research a broad range of scientific topics from ADP to breast cancer. Because protein function is determined by its shape and the physicochemical properties of its exposed surface, it is important to create an accurate model for protein/ligand interaction studies. The problem arises often in homology modeling, where the tertiary structure of an amino acid sequence is predicted based on a sequence alignment to a template, or a second sequence whose structure is known. Because loops have highly variable sequences even within a given structural motif or protein fold, they often correspond to unaligned regions in sequence alignments; they also tend to be located at the solvent-exposed surface of globular proteins and thus are more conformationally flexible. Consequently, they often cannot be modeled using standard homology modeling techniques. More constrained versions of loop modeling are also used in the data fitting stages of solving a protein structure by X-ray crystallography, because loops can correspond to regions of low electron density and are therefore difficult to resolve.

ESyPred3D is an automated homology modeling program. The method benefits from the improved alignment performance of an alignment strategy that uses neural networks. Alignments are obtained by combining, weighting and screening the results of several multiple alignment programs. The final three-dimensional structure is built using the modeling package MODELLER.

RAPTOR (software) protein threading software

RAPTOR is protein threading software used for protein structure prediction. It has been replaced by RaptorX, which is much more accurate than RAPTOR.

HHsearch is an open-source software program for protein sequence searching that is part of the free HH-suite software package. HHpred is a free protein function and protein structure prediction server that is based on HHsearch and HHblits, another program in the HH-suite package. HHpred and HHsearch are among the most popular methods for protein structure prediction and the detection of remotely related sequences, each having been cited over 500 times.

Phyre and Phyre2 are web-based services for protein structure prediction that are free for non-commercial use. Phyre is among the most popular methods for protein structure prediction having been cited over 1500 times. Like other remote homology recognition techniques, it is able to regularly generate reliable protein models when other widely used methods such as PSI-BLAST cannot. Phyre2 has been designed to ensure a user-friendly interface for users inexpert in protein structure prediction methods.

RaptorX for protein structure modeling and function prediction

David Tudor Jones British bioinformatician

David Tudor Jones is a Professor of Bioinformatics and Head of the Bioinformatics Group at University College London. He is also the director of the Bloomsbury Center for Bioinformatics, a joint research centre between UCL and Birkbeck, University of London, which also provides bioinformatics training and support services to biomedical researchers. As of 2013, he was a member of the editorial boards of PLoS ONE, BioData Mining, Advanced Bioinformatics, Chemical Biology & Drug Design, and Proteins: Structure, Function, and Bioinformatics.

SWISS-MODEL is a structural bioinformatics web-server dedicated to homology modeling of 3D protein structures. Homology modeling is currently the most accurate method to generate reliable three-dimensional protein structure models and is routinely used in many practical applications. Homology modelling methods make use of experimental protein structures ("templates") to build models for evolutionary related proteins ("targets").

The HH-suite is an open-source software package for sensitive protein sequence searching. It contains programs that can search for similar protein sequences in protein sequence databases. Sequence searches are a standard tool in modern biology with which the function of unknown proteins can be inferred from the functions of proteins with similar sequences.

CS23D web server to generate 3D structural models from NMR chemical shifts

CS23D is a web server to generate 3D structural models from NMR chemical shifts. CS23D combines maximal fragment assembly with chemical shift threading, de novo structure generation, chemical shift-based torsion angle prediction, and chemical shift refinement. CS23D makes use of RefDB and ShiftX.

I-TASSER software for protein structure prediction and refinement, and structure-based protein function annotations

I-TASSER is a bioinformatics method for predicting three-dimensional structure models of protein molecules from amino acid sequences. It detects structure templates from the Protein Data Bank by a technique called fold recognition. The full-length structure models are constructed by reassembling structural fragments from threading templates using replica exchange Monte Carlo simulations. I-TASSER is one of the most successful protein structure prediction methods in the community-wide CASP experiments.


  1. Bowie JU, Lüthy R, Eisenberg D (1991). "A method to identify protein sequences that fold into a known three-dimensional structure". Science. 253 (5016): 164–170. doi:10.1126/science.1853201. PMID 1853201.
  2. Jones DT, Taylor WR, Thornton JM (1992). "A new approach to protein fold recognition". Nature. 358 (6381): 86–89. doi:10.1038/358086a0. PMID 1614539.
  3. Peng, Jian; Jinbo Xu (2011). "RaptorX: exploiting structure information for protein alignment by statistical inference". Proteins. 79 Suppl 10: 161–171. doi:10.1002/prot.23175. PMC 3226909. PMID 21987485.
  4. Peng, Jian; Jinbo Xu (2010). "Low-homology protein threading". Bioinformatics. 26 (12): i294–i300. doi:10.1093/bioinformatics/btq192. PMC 2881377. PMID 20529920.
  5. Peng, Jian; Jinbo Xu (April 2011). "A multiple-template approach to protein threading". Proteins. 79 (6): 1930–1939. doi:10.1002/prot.23016. PMC 3092796.
  6. Ma, Jianzhu; Sheng Wang; Jinbo Xu (June 2012). "A conditional neural fields model for protein threading". Bioinformatics. 28 (12): i59–i66. doi:10.1093/bioinformatics/bts213. PMC 3371845. PMID 22689779.
  7. Wu S, Zhang Y (2008). "MUSTER: Improving protein sequence profile–profile alignments by using multiple sources of structure information". Proteins. 72 (2): 547–556. doi:10.1002/prot.21945. PMC 2666101. PMID 18247410.
  8. Yang Y, Faraggi E, Zhao H, Zhou Y (2011). "Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates". Bioinformatics. 27 (15): 2076–2082. doi:10.1093/bioinformatics/btr350. PMC 3137224. PMID 21666270.
  9. Gront D, Blaszczyk M, Wojciechowski P, Kolinski A (2012). "BioShell Threader: protein homology detection based on sequence profiles and secondary structure profiles". Nucleic Acids Research. 40 (W1): W257–W262. doi:10.1093/nar/gks555. PMC 3394251. PMID 22693216.

Further reading