RAPTOR (software)

RAPTOR
Original author(s)	Dr. Jinbo Xu
Developer(s)	Bioinformatics Solutions Inc.
Stable release	4.2 / November 2008
Operating system	Windows, Linux
Type	Protein structure prediction
Website	bioinfor.com/raptor

Last updated May 04, 2019

RAPTOR is protein threading software used for protein structure prediction. It has been replaced by RaptorX, which is much more accurate.

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its folding and its secondary and tertiary structure from its primary structure. Structure prediction is fundamentally different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry; it is highly important in medicine and biotechnology. Every two years, the performance of current methods is assessed in the CASP experiment. A continuous evaluation of protein structure prediction web servers is performed by the community project CAMEO3D.


Comparison of techniques

Protein threading vs. homology modeling

Researchers attempting to solve a protein's structure start their study with little more than a protein sequence. Initial steps may include performing a PSI-BLAST or PatternHunter search to locate similar sequences with known structures in the Protein Data Bank (PDB). If highly similar sequences with known structures exist, there is a high probability that the protein's structure, and often its function, will be very similar to theirs. If no homology is found, the researcher must turn to either X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, both of which require considerable time and resources to yield a structure. Where these techniques are too expensive, time-consuming or limited in scope, researchers can use protein threading software, such as RAPTOR, to create a reliable model of the protein.

PatternHunter is commercially available homology search software that uses sequence alignment techniques. It was initially developed in 2002 by three scientists: Bin Ma, John Tromp and Ming Li. They set out to solve a problem that many investigators face in genomics and proteomics studies, which rely heavily on homology searches that find short seed matches and then extend them. Describing homologous genes is an essential part of most evolutionary studies and is crucial to understanding the evolution of gene families and the relationships between domains and families. Homologous genes can only be studied effectively with search tools that find similar regions, or local alignments, between two protein or nucleic acid sequences. Homology is quantified by scores obtained from matching sequences, including mismatch and gap scores.

The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cryo-electron microscopy, and submitted by biologists and biochemists from around the world, are freely accessible on the Internet via the websites of its member organisations. The PDB is overseen by an organization called the Worldwide Protein Data Bank, wwPDB.

X-ray crystallography (XRC) is a technique used for determining the atomic and molecular structure of a crystal, in which the crystalline structure causes a beam of incident X-rays to diffract into many specific directions. By measuring the angles and intensities of these diffracted beams, a crystallographer can produce a three-dimensional picture of the density of electrons within the crystal. From this electron density, the mean positions of the atoms in the crystal can be determined, as well as their chemical bonds, their crystallographic disorder, and various other information.

Protein threading is more effective than homology modeling, especially for proteins which have few homologs detectable by sequence alignment. Both methods predict protein structure from a template. Given a protein sequence, protein threading first aligns (threads) the sequence to each template in a structure library by optimizing a scoring function that measures the fitness of a sequence-structure alignment. The best-scoring template is then used to build the structure model. Unlike homology modeling, which selects a template purely on homology information (sequence alignments), the scoring function used in protein threading uses both homology and structure information (sequence-structure alignments).

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the edit distance cost between strings in a natural language or in financial data.
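The row-and-gap arrangement described above can be illustrated with a minimal global alignment sketch in the style of Needleman-Wunsch. The scoring values (match, mismatch, gap) are arbitrary assumptions chosen for illustration; this is not the scoring used by RAPTOR, PSI-BLAST or PatternHunter.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # substitution score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))
```

Calling `global_alignment("ACGT", "AGT")` returns two rows of equal length in which a hyphen marks the inserted gap.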

If no significant homology is found for a sequence, homology modeling may not give a reliable prediction. Protein threading can still use structure information to produce a good prediction without homology information. Users who fail to obtain a good template with BLAST often process their sequences through RAPTOR instead.

Integer programming vs. dynamic programming

RAPTOR's integer programming approach produces higher quality models than other protein threading methods. Most threading software uses dynamic programming to optimize its scoring function when aligning a sequence with a template. Dynamic programming is much easier to implement than integer programming; however, if the scoring function includes a pairwise contact potential, dynamic programming cannot globally optimize it and instead generates only a locally optimal alignment.

An integer programming problem is a mathematical optimization or feasibility program in which some or all of the variables are restricted to be integers. In many settings the term refers to integer linear programming (ILP), in which the objective function and the constraints are linear.
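For reference, a generic integer linear program can be written in the following standard form. The symbols c, A, b and n are the usual textbook placeholders; this is not RAPTOR's specific threading formulation.

```latex
\begin{aligned}
\max_{x}\quad & c^{\top} x \\
\text{subject to}\quad & A x \le b, \\
& x \in \mathbb{Z}^{n}.
\end{aligned}
```

The objective and the constraints are linear in x; the integrality requirement on x is what distinguishes the problem from ordinary linear programming.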

Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics. In both contexts it refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner. While some decision problems cannot be taken apart this way, decisions that span several points in time do often break apart recursively. Likewise, in computer science, if a problem can be solved optimally by breaking it into sub-problems and then recursively finding the optimal solutions to the sub-problems, then it is said to have optimal substructure.

Pairwise contacts are highly conserved in protein structure and crucial for prediction accuracy. Integer programming can globally optimize a scoring function that includes a pairwise contact potential and produce a globally optimal alignment.
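The effect of a pairwise contact term can be seen in a deliberately contrived toy example. Everything here is an assumption made for illustration: the three-residue template, its single contact, the identity-based singleton score and the charge-based pair score are hypothetical, and exhaustive enumeration merely stands in for a global optimizer. It is neither integer programming nor RAPTOR's scoring function.

```python
from itertools import combinations

POSITIVE, NEGATIVE = set("KR"), set("DE")

def singleton_score(template, seq, placement):
    # One point for each template position whose residue equals the placed residue
    return sum(1 for k, i in enumerate(placement) if template[k] == seq[i])

def pair_score(a, b):
    # Toy contact potential: opposite charges attract, like charges repel
    if (a in POSITIVE and b in NEGATIVE) or (a in NEGATIVE and b in POSITIVE):
        return 2
    if (a in POSITIVE and b in POSITIVE) or (a in NEGATIVE and b in NEGATIVE):
        return -2
    return 0

def full_score(template, contacts, seq, placement):
    # Singleton terms plus one pair term per contacting pair of template positions
    total = singleton_score(template, seq, placement)
    for p, q in contacts:
        total += pair_score(seq[placement[p]], seq[placement[q]])
    return total

def placements(seq, template):
    # All order-preserving ways to place sequence residues on template positions
    return combinations(range(len(seq)), len(template))

template, contacts, seq = "KAR", [(0, 2)], "KARE"

# Optimizing the position-independent terms alone picks one placement...
best_singleton = max(placements(seq, template),
                     key=lambda p: singleton_score(template, seq, p))
# ...while searching every placement under the full score picks another
best_full = max(placements(seq, template),
                key=lambda p: full_score(template, contacts, seq, p))

print(best_singleton, full_score(template, contacts, seq, best_singleton))
print(best_full, full_score(template, contacts, seq, best_full))
```

The placement that maximizes the singleton terms puts two like charges in contact and scores worse overall than the placement found by searching all candidates under the full score. Exhaustive search does not scale beyond toy inputs, which is the gap a global optimizer such as integer programming is meant to fill.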

Components

Threading engines

NoCore, NPCore and IP are the three threading engines implemented in RAPTOR. NoCore and NPCore are based on dynamic programming and are faster than IP. The difference between them is that NPCore parses a template into many "core" regions, where a core is a structurally conserved region. IP is RAPTOR's unique integer programming-based threading engine. It produces better alignments and models than the other two threading engines. Users can start with NoCore and NPCore and switch to IP if the predictions are not good enough. After all three methods are run, a simple consensus may help to find the best prediction.

3D structure modeling module

The default 3D structure modeling tool used in RAPTOR is OWL. Three-dimensional structure modeling involves two steps. The first step is loop modeling, which models regions in the target sequence that map to nothing in the template. In the second step, after all the loops are modeled and the backbone is ready, side chains are attached to the backbone and packed. For loop modeling, a cyclic coordinate descent algorithm is used to fill the loops and avoid clashes. For side chain packing, a tree decomposition algorithm is used to pack all the side chains and avoid clashes. OWL is called automatically in RAPTOR to generate the 3D output.
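The idea behind cyclic coordinate descent can be sketched on a simplified two-dimensional chain: each joint is rotated in turn so that the chain end moves as close as possible to a target point. This is an illustrative assumption-laden sketch, not OWL's implementation; real loop modeling works on three-dimensional backbone dihedral angles and also checks for clashes, which are omitted here. The function names `forward` and `ccd_close` are hypothetical.

```python
import math

def forward(angles, lengths):
    """Joint positions of a planar chain anchored at the origin (relative angles)."""
    points = [(0.0, 0.0)]
    theta = 0.0
    for angle, length in zip(angles, lengths):
        theta += angle
        x, y = points[-1]
        points.append((x + length * math.cos(theta), y + length * math.sin(theta)))
    return points

def ccd_close(angles, lengths, target, tol=1e-3, max_sweeps=200):
    """Adjust one joint angle at a time until the chain end reaches the target."""
    angles = list(angles)
    for _ in range(max_sweeps):
        # Sweep from the last joint back to the first
        for j in reversed(range(len(angles))):
            points = forward(angles, lengths)
            end, pivot = points[-1], points[j]
            # Rotate joint j so the pivot-to-end direction points at the target
            to_end = math.atan2(end[1] - pivot[1], end[0] - pivot[0])
            to_target = math.atan2(target[1] - pivot[1], target[0] - pivot[0])
            angles[j] += to_target - to_end
        if math.dist(forward(angles, lengths)[-1], target) < tol:
            break
    return angles
```

For example, `ccd_close([0.3, 0.3, 0.3], [1.0, 1.0, 1.0], (1.5, 1.0))` returns joint angles for a three-link chain whose end lies near the target point.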

If a researcher has MODELLER, they can also set up RAPTOR to call MODELLER automatically. RAPTOR can also generate ICM-Pro input files, which users then run in ICM-Pro themselves.

PSI-BLAST module

To make RAPTOR a comprehensive tool set, PSI-BLAST is also included so that users can do homology modeling. Users can set all the necessary parameters themselves. Running PSI-BLAST involves two steps. The first step generates the sequence profile, using the NR (non-redundant) database. The second step has PSI-BLAST search the target sequence against the sequences from the Protein Data Bank. Users can also specify their own database for each step.
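Outside RAPTOR, the same two-step workflow can be approximated with the `psiblast` program from NCBI BLAST+. The sketch below only builds the two command lines. It is an assumption about how one might reproduce the steps by hand, not a description of how RAPTOR invokes PSI-BLAST internally. The database names `nr` and `pdbaa` and all file names are hypothetical placeholders for locally installed databases and files.

```python
def psiblast_profile_cmd(query_fasta, profile_db="nr", iterations=3,
                         pssm_out="query.pssm"):
    """Step 1: iterate against a large database and save the sequence profile."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", profile_db,
        "-num_iterations", str(iterations),
        "-out_pssm", pssm_out,          # checkpoint file holding the profile
    ]

def psiblast_search_cmd(pssm_in="query.pssm", structure_db="pdbaa",
                        hits_out="pdb_hits.txt"):
    """Step 2: search sequences of known structure using the saved profile."""
    return [
        "psiblast",
        "-in_pssm", pssm_in,            # used instead of -query
        "-db", structure_db,
        "-out", hits_out,
    ]

step1 = psiblast_profile_cmd("target.fasta")
step2 = psiblast_search_cmd()
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and the databases are installed; substituting other database names corresponds to specifying a custom database for each step.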

Protein structure viewer

There are many different structure viewers. In RAPTOR, Jmol is used as the structure viewer for examining the generated prediction.

Jmol is computer software for molecular modelling of chemical structures in three dimensions. Jmol returns a 3D representation of a molecule that may be used as a teaching tool or for research, e.g., in chemistry and biochemistry. It is written in the programming language Java, so it can run on the operating systems Windows, macOS, Linux, and Unix, if Java is installed. It is free and open-source software released under a GNU Lesser General Public License (LGPL) version 2.0. A standalone application and a software development kit (SDK) exist that can be integrated into other Java applications, such as Bioclipse and Taverna.

Output

After a threading or PSI-BLAST job, users see a ranked list of all the templates. For each template, they can view the alignment, E-value and numerous other specific scores. The functional information of the template and its SCOP classification are also provided, as are the sequence's PSSM (position-specific scoring matrix) and secondary structure prediction. If a template has been reported by more than one method, it is marked with the number of times it has been reported. This helps to identify the best template.
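The marking of templates reported by several methods amounts to a simple count across the per-method rankings. The following is a minimal sketch of that idea, not RAPTOR's code; the function name `report_counts`, the `top_n` cut-off and the template identifiers in the example are hypothetical.

```python
from collections import Counter

def report_counts(rankings, top_n=5):
    """Count how many methods list each template among their top_n hits."""
    counts = Counter()
    for templates in rankings.values():
        # A set ensures each method counts a given template at most once
        counts.update(set(templates[:top_n]))
    # Most frequently reported first; ties broken alphabetically for stable output
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical rankings from the three threading engines
example = {
    "NoCore": ["1abcA", "2xyzB", "3defA"],
    "NPCore": ["2xyzB", "1abcA", "4ghiC"],
    "IP":     ["2xyzB", "5jklA", "1abcA"],
}
marked = report_counts(example, top_n=3)
```

Templates at the head of the returned list are those reported by the most methods, which is the same signal a simple consensus over the threading engines would use.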

Performance in CASP

CASP, Critical Assessment of Techniques for Protein Structure Prediction, is a biennial experiment sponsored by NIH. CASP represents the Olympic Games of the protein structure prediction community and was established in 1994.

RAPTOR first appeared in CAFASP3 (CASP5) in 2002 and was ranked number one in the individual server group for that year. Since then, RAPTOR has actively participated in every CASP for evaluation purposes and has been consistently ranked in the top tier.

CASP8 ran from May 2008 until August 2008. More than 80 prediction servers and more than 100 human expert groups worldwide registered for the event, in which participants attempt to predict the 3D structure of a protein from its sequence. According to the ranking from Zhang's group, RAPTOR ranked 2nd among all the servers (meta servers and individual servers). Baker lab's ROBETTA placed 5th in the same ranking list.

Top five prediction servers in CASP8

Rank	Predictor	Targets Used	TM-score	MaxSub-score	GDT-score	GHA-score
1	Zhang-Server	171	120.65	108.78	114.69	85.55
2	RAPTOR	171	116.13	104.69	110.79	82.92
3	pro-sp3-TASSER	171	116.05	103.38	109.95	80.88
4	Phyre_de_novo	171	115.35	103.47	110.00	82.51
5	BAKER-ROBETTA	171	115.12	102.68	109.27	80.71
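The score columns in the table exceed 1, whereas a single TM-score lies between 0 and 1, so they appear to be totals over the 171 targets. Under that assumption, which the source does not state explicitly, dividing by the number of targets gives a per-target mean that is easier to interpret. The short sketch below does this for the TM-score column using the values from the table.

```python
# (predictor, targets used, total TM-score) copied from the table above
casp8_rows = [
    ("Zhang-Server", 171, 120.65),
    ("RAPTOR", 171, 116.13),
    ("pro-sp3-TASSER", 171, 116.05),
    ("Phyre_de_novo", 171, 115.35),
    ("BAKER-ROBETTA", 171, 115.12),
]

# Assumes each total is a sum of per-target TM-scores
mean_tm = {name: total / targets for name, targets, total in casp8_rows}

for name, value in mean_tm.items():
    print(f"{name}: {value:.3f}")
```

On this reading, the gap between neighbouring servers is a few thousandths of a TM-score unit per target.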

Related Research Articles

Protein engineering is the process of developing useful or valuable proteins. It is a young discipline, with much research taking place into the understanding of protein folding and recognition for protein design principles. It is also a product and services market, with an estimated value of $168 billion by 2017.

Critical Assessment of protein Structure Prediction, or CASP, is a community-wide, worldwide experiment for protein structure prediction taking place every two years since 1994. CASP provides research groups with an opportunity to objectively test their structure prediction methods and delivers an independent assessment of the state of the art in protein structure modeling to the research community and software users. Even though the primary goal of CASP is to help advance the methods of identifying protein three-dimensional structure from its amino acid sequence, many view the experiment more as a “world championship” in this field of science. More than 100 research groups from all over the world participate in CASP on a regular basis and it is not uncommon for entire groups to suspend their other research for months while they focus on getting their servers ready for the experiment and on performing the detailed predictions.

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

Protein threading, also known as fold recognition, is a method of protein modeling which is used to model those proteins which have the same fold as proteins of known structures, but do not have homologous proteins with known structure. It differs from the homology modeling method of structure prediction as it is used for proteins which do not have their homologous protein structures deposited in the Protein Data Bank (PDB), whereas homology modeling is used for those proteins which do. Threading works by using statistical knowledge of the relationship between the structures deposited in the PDB and the sequence of the protein which one wishes to model.

A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences is assumed to have an evolutionary relationship by which the sequences share a linkage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment illustrate mutation events such as point mutations, which appear as differing characters in a single alignment column, and insertion or deletion mutations, which appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

Homology modeling, also known as comparative modeling of protein, refers to constructing an atomic-resolution model of the "target" protein from its amino acid sequence and an experimental three-dimensional structure of a related homologous protein. Homology modeling relies on the identification of one or more known protein structures likely to resemble the structure of the query sequence, and on the production of an alignment that maps residues in the query sequence to residues in the template sequence. It has been shown that protein structures are more conserved than protein sequences amongst homologues, but sequences falling below a 20% sequence identity can have very different structure.

Loop modeling is a problem in protein structure prediction requiring the prediction of the conformations of loop regions in proteins with or without the use of a structural template. Computer programs that solve these problems have been used to research a broad range of scientific topics from ADP to breast cancer. Because protein function is determined by its shape and the physicochemical properties of its exposed surface, it is important to create an accurate model for protein/ligand interaction studies. The problem arises often in homology modeling, where the tertiary structure of an amino acid sequence is predicted based on a sequence alignment to a template, or a second sequence whose structure is known. Because loops have highly variable sequences even within a given structural motif or protein fold, they often correspond to unaligned regions in sequence alignments; they also tend to be located at the solvent-exposed surface of globular proteins and thus are more conformationally flexible. Consequently, they often cannot be modeled using standard homology modeling techniques. More constrained versions of loop modeling are also used in the data fitting stages of solving a protein structure by X-ray crystallography, because loops can correspond to regions of low electron density and are therefore difficult to resolve.

CAFASP, or the Critical Assessment of Fully Automated Structure Prediction, is a large-scale blind experiment in protein structure prediction that studies the performance of automated structure prediction webservers in homology modeling, fold recognition, and ab initio prediction of protein tertiary structures based only on amino acid sequence. The experiment runs once every two years in parallel with CASP, which focuses on predictions that incorporate human intervention and expertise. Compared to related benchmarking techniques LiveBench and EVA, which run weekly against newly solved protein structures deposited in the Protein Data Bank, CAFASP generates much less data, but has the advantage of producing predictions that are directly comparable to those produced by human prediction experts. Recently CAFASP has been run essentially integrated into the CASP results rather than as a separate experiment.

In computational biology, de novo protein structure prediction refers to an algorithmic process by which protein tertiary structure is predicted from its amino acid primary sequence. The problem itself has occupied leading scientists for decades while still remaining unsolved. According to Science, the problem remains one of the top 125 outstanding issues in modern science. At present, some of the most successful methods have a reasonable probability of predicting the folds of small, single-domain proteins within 1.5 angstroms over the entire structure.

ESyPred3D is an automated homology modeling program. The method gets the benefit of the increased alignment performances of an alignment strategy that uses neural networks. Alignments are obtained by combining, weighting and screening the results of several multiple alignment programs. The final three-dimensional structure is built using the modeling package MODELLER.

Phyre and Phyre2 are web-based services for protein structure prediction that are free for non-commercial use. Phyre is among the most popular methods for protein structure prediction having been cited over 1500 times. Like other remote homology recognition techniques, it is able to regularly generate reliable protein models when other widely used methods such as PSI-BLAST cannot. Phyre2 has been designed to ensure a user-friendly interface for users inexpert in protein structure prediction methods.

SWISS-MODEL is a structural bioinformatics web-server dedicated to homology modeling of 3D protein structures. Homology modeling is currently the most accurate method to generate reliable three-dimensional protein structure models and is routinely used in many practical applications. Homology modelling methods make use of experimental protein structures ("templates") to build models for evolutionary related proteins ("targets").

The HH-suite is an open-source software package for sensitive protein sequence searching. It contains programs that can search for similar protein sequences in protein sequence databases. Sequence searches are a standard tool in modern biology with which the function of unknown proteins can be inferred from the functions of proteins with similar sequences. HHsearch and HHblits are two main programs in the package and the entry point to its search function, the latter being a faster iteration. HHpred is an online server for protein structure prediction that uses homology information from HH-suite.

PredictProtein (PP) is an automatic service that searches up-to-date public sequence databases, creates alignments, and predicts aspects of protein structure and function. Users send a protein sequence and receive a single file with results from database comparisons and prediction methods. PP went online in 1992 at the European Molecular Biology Laboratory; since 1999 it has operated from Columbia University and in 2009 it moved to the Technische Universität München. Although many servers have implemented particular aspects, PP remains the most widely used public server for structure prediction: over 1.5 million requests from users in 104 countries have been handled; over 13000 users submitted 10 or more different queries. PP web pages are mirrored in 17 countries on 4 continents. The system is optimized to meet the demands of experimentalists not experienced in bioinformatics. To that end, it incorporates only high-quality methods and collates results while omitting less reliable or less important ones.

GeNMR is the first fully automated template-based method of protein structure determination that utilizes both NMR chemical shifts and NOE-based distance restraints.

CS23D is a web server to generate 3D structural models from NMR chemical shifts. CS23D combines maximal fragment assembly with chemical shift threading, de novo structure generation, chemical shift-based torsion angle prediction, and chemical shift refinement. CS23D makes use of RefDB and ShiftX.


I-TASSER is a bioinformatics method for predicting three-dimensional structure models of protein molecules from amino acid sequences. It detects structure templates from the Protein Data Bank by a technique called fold recognition. The full-length structure models are constructed by reassembling structural fragments from threading templates using replica exchange Monte Carlo simulations. I-TASSER is one of the most successful protein structure prediction methods in the community-wide CASP experiments.

