Protein I-sites

I-sites are short sequence-structure motifs, mined from the Protein Data Bank (PDB), that correlate strongly with three-dimensional structural elements. These motifs are used for local structure prediction of proteins, where local structure can be expressed as fragments or as backbone angles. Positions in a protein sequence with high-confidence I-site predictions may be initiation sites of folding. I-sites have also been identified as discrete models for folding pathways. The I-sites library consists of about 250 motifs. Each motif has an amino acid profile, a fragment structure (represented by a "paradigm" fragment chosen from a protein in the PDB) and, optionally, a 4-dimensional tensor of pairwise sequence covariance.

Construction of I-site Library

The sequence and structure database

The database initially consisted of 471 protein sequence families from the HSSP database, with an average of 47 aligned sequences per family. Each family contained a single known structure (the parent) from the Brookhaven Protein Data Bank. The families were a subset of the PDBSelect-25 list, with no more than 25% sequence identity between any two alignments. Disordered loops were omitted, and gaps and insertions in the sequences were ignored.

Clustering of sequence segments

Each position in the database is described by a weighted amino acid frequency. A similarity measure in sequence space between a segment (p) and a cluster of segments (q) is defined as:

where Pij(p) is the frequency of amino acid i at position j within segment p, Nq is the number of sequence segments k in cluster q, and Fi is the frequency of amino acid type i in the database overall. The optimal values of α and α′ were determined empirically to be 0.5 and 15, respectively. Using this similarity measure, segments of a given length (3 to 15 residues) were clustered with the k-means algorithm.
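
The clustering step can be illustrated with a minimal Python sketch. The exact similarity formula is not reproduced in this text, so the pseudocount-smoothed log-odds score and the city-block distance below are assumptions chosen for illustration, not the published measure. The function names (log_odds, city_block, mean_segment, kmeans_segments) and the evenly spaced initial centroids are likewise hypothetical. A segment is a list of positions, and each position is a list of amino acid frequencies.

```python
import math


def log_odds(segment, background, alpha=0.5):
    """Pseudocount-smoothed log-odds per position and amino acid.

    ASSUMPTION: this functional form is illustrative only; alpha plays
    the role of the first empirically chosen weight (0.5).
    """
    return [[math.log((p + alpha * f) / ((1.0 + alpha) * f))
             for p, f in zip(position, background)]
            for position in segment]


def city_block(a, b):
    """Sum of absolute differences between two equally shaped score tables."""
    return sum(abs(x - y)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def mean_segment(members):
    """Average frequency profile of the segments in one cluster."""
    n = len(members)
    return [[sum(m[j][i] for m in members) / n
             for i in range(len(members[0][j]))]
            for j in range(len(members[0]))]


def kmeans_segments(segments, k, background, iterations=20):
    """Plain k-means over fixed-length profile segments."""
    # Deterministic start: evenly spaced segments (an assumption).
    centroids = [segments[(i * len(segments)) // k] for i in range(k)]
    scored_segments = [log_odds(s, background) for s in segments]
    labels = None
    for _ in range(iterations):
        scored_centroids = [log_odds(c, background) for c in centroids]
        new_labels = [
            min(range(k), key=lambda q: city_block(s, scored_centroids[q]))
            for s in scored_segments
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for q in range(k):
            members = [s for s, lab in zip(segments, labels) if lab == q]
            if members:  # keep the old centroid if a cluster empties
                centroids[q] = mean_segment(members)
    return labels, centroids
```

In this sketch a two-letter alphabet is enough to try the procedure; real profiles would have 20 frequencies per position, and a separate run would be made for each segment length.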

Assessing structure within a cluster; choice of paradigm

The structural similarity between any two peptide segments was evaluated using a combination of the RMS distance matrix error (dme):

where αi→j is the distance between α-carbon atoms i and j in the segment s1 of length L, and the maximum deviation in backbone torsion angles (mda) over the length of the segment is given by:

The paradigm structure for a cluster was chosen from the top-scoring 20 segments in the database as that with the smallest sum of mda values to the other 19. Other structural measures were tried before settling on these two: RMS deviation of α-carbon atoms (rmsd), dme alone, and a structural filter that looked for specific conserved contacts. The latter worked best in discriminating true and false positives, but could not be easily automated. The rmsd and dme were found to be poor discriminators of the two types of helix cap. The mda-dme combined filter best simulates the conserved contacts filter and is rapidly computed.
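
The two structural measures and the paradigm choice can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the published definitions: dme is taken over all atom pairs i < j, and mda over every supplied (phi, psi) pair, whereas the original formulas may restrict which pairs and angles are included. The function names are hypothetical. Coordinates are (x, y, z) tuples for the α-carbons, and torsion angles are in degrees.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between alpha-carbon coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def dme(coords1, coords2):
    """RMS difference between the distance matrices of two segments.

    ASSUMPTION: every pair i < j is included.
    """
    d1, d2 = distance_matrix(coords1), distance_matrix(coords2)
    n = len(coords1)
    squared = [(d1[i][j] - d2[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n)]
    return math.sqrt(sum(squared) / len(squared))


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def mda(torsions1, torsions2):
    """Largest phi or psi deviation between two segments.

    ASSUMPTION: all supplied (phi, psi) pairs are compared.
    """
    return max(max(angle_diff(phi1, phi2), angle_diff(psi1, psi2))
               for (phi1, psi1), (phi2, psi2) in zip(torsions1, torsions2))


def choose_paradigm(torsion_sets):
    """Index of the segment with the smallest summed mda to the others."""
    totals = [sum(mda(t, other) for other in torsion_sets)
              for t in torsion_sets]
    return min(range(len(totals)), key=totals.__getitem__)
```

Passing the torsion angles of the 20 top-scoring segments of a cluster to choose_paradigm would return the index of the candidate paradigm under these assumptions. The angle_diff helper handles wrap-around, so that 170° and −170° differ by 20° rather than 340°.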

References

Bystroff, C; Baker, D (1998). "Prediction of local structure in proteins using a library of sequence-structure motifs". Journal of Molecular Biology. 281 (3): 565–77. CiteSeerX 10.1.1.125.3690. doi:10.1006/jmbi.1998.1943. PMID 9698570.
