THEMATICS

Theoretical Microscopic Anomalous Titration Curve Shapes (THEMATICS) is a computational method for predicting the biochemically active amino acids in a protein three-dimensional structure. [1] [2] [3]

The method was developed by Mary Jo Ondrechen, James Clifton, and Dagmar Ringe. [4] It is based on computed electrostatic and chemical properties of the individual amino acids in a protein structure. Specifically, it identifies anomalous shapes in the theoretical titration curves of the ionizable amino acids. Biochemically active amino acids tend to have wide buffer ranges and non-sigmoidal titration patterns.
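The contrast between an ordinary and an anomalous titration curve can be illustrated with a toy model. The sketch below is not the published THEMATICS procedure, which works from computed electrostatic properties of a full protein structure. It only compares an isolated ionizable site, which follows the sigmoidal Henderson–Hasselbalch curve, with a site that interacts with one neighbor. The two-site model, the `coupling` parameter, the pKa values, and the 10%–90% definition of buffer range are all assumptions chosen for illustration.

```python
def isolated_protonation(ph, pka):
    """Henderson-Hasselbalch protonated fraction of a single, non-interacting site."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def coupled_protonation(ph, pka_a, pka_b, coupling):
    """Mean protonation of site A when it interacts with a second site B.

    `coupling` is a hypothetical interaction strength in pK units that
    penalizes the state in which both sites are protonated at once.
    """
    a = 10.0 ** (pka_a - ph)            # statistical weight: only A protonated
    b = 10.0 ** (pka_b - ph)            # statistical weight: only B protonated
    both = a * b * 10.0 ** (-coupling)  # weight: both protonated, with penalty
    z = 1.0 + a + b + both              # partition function over four states
    return (a + both) / z


def buffer_range(curve, lo=0.1, hi=0.9, ph_min=0.0, ph_max=14.0, step=0.01):
    """Width of the pH interval in which the site is partially protonated.

    Counts grid points where the protonated fraction lies in [lo, hi];
    the 10%-90% window is an illustrative choice, not a THEMATICS definition.
    """
    n = int(round((ph_max - ph_min) / step)) + 1
    inside = sum(1 for i in range(n) if lo <= curve(ph_min + i * step) <= hi)
    return inside * step


# Assumed example values: both sites have pKa 7, coupling of 2 pK units.
isolated_width = buffer_range(lambda ph: isolated_protonation(ph, 7.0))
coupled_width = buffer_range(lambda ph: coupled_protonation(ph, 7.0, 7.0, 2.0))

print(f"isolated site buffer range: {isolated_width:.2f} pH units")
print(f"coupled site buffer range:  {coupled_width:.2f} pH units")
```

For the isolated site, the 10%–90% window has an analytic width of 2·log10(9), about 1.9 pH units, whatever the pKa. In this toy model the coupled site stays partially protonated over a noticeably wider pH interval and its curve is no longer a simple sigmoid, which is the kind of behavior the paragraph above describes for biochemically active residues.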

The method predicts biochemically active amino acids successfully on its own, and it also provides input features to a machine learning predictor, Partial Order Optimum Likelihood (POOL). [5] [6]

Related Research Articles

Protein: Biomolecule consisting of chains of amino acid residues

Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, responding to stimuli, providing structure to cells and organisms, and transporting molecules from one location to another. Proteins differ from one another primarily in their sequence of amino acids, which is dictated by the nucleotide sequence of their genes, and which usually results in protein folding into a specific 3D structure that determines its activity.

Protein secondary structure: General three-dimensional form of local segments of proteins

Protein secondary structure is the local spatial conformation of the polypeptide backbone, excluding the side chains. The two most common secondary structural elements are alpha helices and beta sheets, though beta turns and omega loops occur as well. Secondary structure elements typically form spontaneously as an intermediate before the protein folds into its three-dimensional tertiary structure.

Sequence alignment: Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data.

Protein structure prediction: Type of biological prediction

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence, that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by computational biology, and it is important in medicine and biotechnology.

In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences.

Structural alignment: Aligning molecular sequences using sequence and structural information

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

Structural bioinformatics: Bioinformatics subfield

Structural bioinformatics is the branch of bioinformatics that is related to the analysis and prediction of the three-dimensional structure of biological macromolecules such as proteins, RNA, and DNA. It deals with generalizations about macromolecular 3D structures such as comparisons of overall folds and local motifs, principles of molecular folding, evolution, binding interactions, and structure/function relationships, working both from experimentally solved structures and from computational models. The term structural has the same meaning as in structural biology, and structural bioinformatics can be seen as a part of computational structural biology. The main objective of structural bioinformatics is the creation of new methods of analysing and manipulating biological macromolecular data in order to solve problems in biology and generate new knowledge.

An epitope, also known as antigenic determinant, is the part of an antigen that is recognized by the immune system, specifically by antibodies, B cells, or T cells. The part of an antibody that binds to the epitope is called a paratope. Although epitopes are usually non-self proteins, sequences derived from the host that can be recognized are also epitopes.

Protein design is the rational design of new protein molecules to design novel activity, behavior, or purpose, and to advance basic understanding of protein function. Proteins can be designed from scratch or by making calculated variants of a known protein structure and its sequence. Rational protein design approaches make protein-sequence predictions that will fold to specific structures. These predicted sequences can then be validated experimentally through methods such as peptide synthesis, site-directed mutagenesis, or artificial gene synthesis.

In computational biology, protein pKa calculations are used to estimate the pKa values of amino acids as they exist within proteins. These calculations complement the pKa values reported for amino acids in their free state, and are used frequently within the fields of molecular modeling, structural bioinformatics, and computational biology.

The Chou–Fasman method is an empirical technique for the prediction of secondary structures in proteins, originally developed in the 1970s by Peter Y. Chou and Gerald D. Fasman. The method is based on analyses of the relative frequencies of each amino acid in alpha helices, beta sheets, and turns, drawn from known protein structures solved with X-ray crystallography. From these frequencies a set of probability parameters was derived for the appearance of each amino acid in each secondary structure type, and these parameters are used to predict the probability that a given sequence of amino acids would form a helix, a beta strand, or a turn in a protein. The method is at most about 50–60% accurate in identifying correct secondary structures, which is significantly less accurate than modern machine learning–based techniques.

The GOR method is an information theory-based method for the prediction of secondary structures in proteins. It was developed in the late 1970s shortly after the simpler Chou–Fasman method. Like Chou–Fasman, the GOR method is based on probability parameters derived from empirical studies of known protein tertiary structures solved by X-ray crystallography. However, unlike Chou–Fasman, the GOR method takes into account not only the propensities of individual amino acids to form particular secondary structures, but also the conditional probability of the amino acid to form a secondary structure given that its immediate neighbors have already formed that structure. The method is therefore essentially Bayesian in its analysis.

In computational biology, de novo protein structure prediction refers to an algorithmic process by which protein tertiary structure is predicted from its amino acid primary sequence. The problem itself has occupied leading scientists for decades while still remaining unsolved. According to Science, the problem remains one of the top 125 outstanding issues in modern science. At present, some of the most successful methods have a reasonable probability of predicting the folds of small, single-domain proteins within 1.5 angstroms over the entire structure.

Structural and physical properties of DNA provide important constraints on the binding sites formed on surfaces of DNA-binding proteins. Characteristics of such binding sites may be used for predicting DNA-binding sites from the structural and even sequence properties of unbound proteins. This approach has been successfully implemented for predicting the protein–protein interface. Here, this approach is adopted for predicting DNA-binding sites in DNA-binding proteins. The first attempts to use sequence and evolutionary features to predict DNA-binding sites in proteins were made by Ahmad et al. (2004) and Ahmad and Sarai (2005). Some methods use structural information to predict DNA-binding sites and therefore require a three-dimensional structure of the protein, while others use only sequence information and do not require a protein structure in order to make a prediction.

Ronald J. Williams is professor of computer science at Northeastern University, and one of the pioneers of neural networks. He co-authored a paper on the backpropagation algorithm which triggered a boom in neural network research. He also made fundamental contributions to the fields of recurrent neural networks and reinforcement learning. Together with Wenxu Tong and Mary Jo Ondrechen he developed Partial Order Optimum Likelihood (POOL), a machine learning method used in the prediction of active amino acids in protein structures. POOL is a maximum likelihood method with a monotonicity constraint and is a general predictor of properties that depend monotonically on the input features.

Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins. These proteins are usually ones that are poorly studied or predicted based on genomic sequence data. These predictions are often driven by data-intensive computational procedures. Information may come from nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interaction. Protein function is a broad term: the roles of proteins range from catalysis of biochemical reactions to transport to signal transduction, and a single protein may play a role in multiple processes or cellular pathways.


Computer Atlas of Surface Topography of Proteins (CASTp) aims to provide comprehensive and detailed quantitative characterization of the topographic features of proteins, and is now updated to version 3.0. Since its release in 2006, the CASTp server has received about 45,000 visits and fulfilled about 33,000 calculation requests annually. CASTp has proven to be a reliable tool for a wide range of research, including investigations of signaling receptors, discoveries of cancer therapeutics, understanding of the mechanisms of drug action, studies of immune disorder diseases, analysis of protein–nanoparticle interactions, inference of protein functions, and development of high-throughput computational tools. The server is maintained by Jie Liang's lab at the University of Illinois at Chicago.

Mary Jo Ondrechen: Chemist, Educator

Mary Jo Ondrechen is a chemist, educator, researcher, community leader and activist. She serves as Professor of Chemistry and Chemical Biology and Principal Investigator of the Computational Biology Research Group at Northeastern University in Boston, Massachusetts.

Proline-rich protein 21 (PRR21) is a protein of the family of proline-rich proteins. It is encoded by the PRR21 gene, which is found on human chromosome 2, band 2q37.3. The gene exists in several species, both vertebrates and invertebrates, including humans. However, the protein has few conserved regions among species.

Molecular Operating Environment (MOE) is a drug discovery software platform that integrates visualization, modeling and simulations, as well as methodology development, in one package. MOE scientific applications are used by biologists, medicinal chemists and computational chemists in pharmaceutical, biotechnology and academic research. MOE runs on Windows, Linux, Unix, and macOS. Main application areas in MOE include structure-based design, fragment-based design, ligand-based design, pharmacophore discovery, medicinal chemistry applications, biologics applications, structural biology and bioinformatics, protein and antibody modeling, molecular modeling and simulations, virtual screening, cheminformatics & QSAR. The Scientific Vector Language (SVL) is the built-in command, scripting and application development language of MOE.

References

  1. Protein Function Predicted With New "THEMATICS" Method Developed By Northeastern University & Brandeis Scientists. ScienceDaily, (2001).
  2. Borman, S., From sequence to consequence. Chemical and Engineering News, 79(48): p. 31-33 (2001).
  3. Ball, P., Computers spot shape clues. Nature, (2001).
  4. “THEMATICS: A Simple Computational Predictor of Enzyme Function from Structure,” M.J. Ondrechen, J.G. Clifton & D. Ringe, Proc. Natl. Acad. Sci. USA 98, 12473-12478 (2001). PMID 11606719
  5. “Partial Order Optimum Likelihood (POOL): Maximum Likelihood Prediction of Active Site Residues Using 3D Structure and Sequence Properties,” W. Tong, Y. Wei, L.F. Murga, M.J. Ondrechen, and R.J. Williams, PLoS Computational Biology, 5(1): e1000266 (2009). PMID 9148270
  6. Somarowthu, Srinivas; Yang, Huyuan; Hildebrand, David G. C.; Ondrechen, Mary Jo (2011-06-01). "High-performance prediction of functional residues in proteins with machine learning and computed input features". Biopolymers. 95 (6): 390–400. doi:10.1002/bip.21589. ISSN 0006-3525. PMID 21254002.