Content | |
---|---|
Description | Protein Structure Classification |
Contact | |
Research center | Laboratory of Molecular Biology |
Authors | Alexey G. Murzin, Steven E. Brenner, Tim J. P. Hubbard, and Cyrus Chothia |
Primary citation | PMID 7723011 |
Release date | 1994 |
Access | |
Website | http://scop.mrc-lmb.cam.ac.uk/scop/ |
Miscellaneous | |
Version | 1.75 (June 2009; 110,800 domains in 38,221 structures classed as 3,902 families) [1] |
Curation policy | manual |
Content | |
---|---|
Description | SCOP - extended |
Contact | |
Authors | Naomi K. Fox, Steven E. Brenner, and John-Marc Chandonia |
Primary citation | PMID 24304899 |
Access | |
Website | https://scop.berkeley.edu |
Miscellaneous | |
Version | 2.07 (March 2018; 276,231 domains in 87,224 structures classed as 4,919 families) [2] |
Curation policy | manual (new classifications) and automated (new structures, BLAST) |
The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. A motivation for this classification is to determine the evolutionary relationship between proteins. Proteins with the same shapes but having little sequence or functional similarity are placed in different superfamilies, and are assumed to have only a very distant common ancestor. Proteins having the same shape and some similarity of sequence and/or function are placed in "families", and are assumed to have a closer common ancestor.
Similar to CATH and Pfam databases, SCOP provides a classification of individual structural domains of proteins, rather than a classification of the entire proteins which may include a significant number of different domains.
The SCOP database is freely accessible on the internet. SCOP was created in 1994 in the Centre for Protein Engineering and the Laboratory of Molecular Biology. [3] It was maintained by Alexey G. Murzin and his colleagues in the Centre for Protein Engineering until its closure in 2010 and subsequently at the Laboratory of Molecular Biology in Cambridge, England. [4] [5] [6] [1]
The work on SCOP 1.75 has been discontinued in 2014. Since then SCOPe team from UC Berkeley has been responsible for updating the database in a compatible manner, with a combination of automated and manual methods. As of April 2019 [update] , the latest release is SCOPe 2.07 (March 2018). [2]
The new Structural Classification of Proteins version 2 (SCOP2) database was released at the beginning of 2020. The new update featured an improved database schema, a new API and modernised web interface. This was the most significant update by the Cambridge group since SCOP 1.75 and builds on the advances in schema from the SCOP 2 prototype. [7]
The source of protein structures is the Protein Data Bank. The unit of classification of structure in SCOP is the protein domain. What the SCOP authors mean by "domain" is suggested by their statement that small proteins and most medium-sized ones have just one domain, [8] and by the observation that human hemoglobin, [9] which has an α2β2 structure, is assigned two SCOP domains, one for the α and one for the β subunit.
The shapes of domains are called "folds" in SCOP. Domains belonging to the same fold have the same major secondary structures in the same arrangement with the same topological connections. 1195 folds are given in SCOP version 1.75. Short descriptions of each fold are given. For example, the "globin-like" fold is described as core: 6 helices; folded leaf, partly opened. The fold to which a domain belongs is determined by inspection, rather than by software.
The levels of SCOP version 1.75 are as follows.
The broadest groups on SCOP version 1.75 are the protein fold classes. These classes group structures with similar secondary structure composition, but different overall tertiary structures and evolutionarily origins. This is the top level "root" of the SCOP hierarchical classification.
The number in brackets, called a "sunid", is a SCOP unique integer identifier for each node in the SCOP hierarchy. The number in parentheses indicates how many elements are in each category. For example, there are 284 folds in the "All alpha proteins" class. Each member of the hierarchy is a link to the next level of the hierarchy.
Each class contains a number of distinct folds. This classification level indicates similar tertiary structure, but not necessarily evolutionary relatedness. For example, the "All-α proteins" class contains >280 distinct folds, including: Globin-like (core: 6 helices; folded leaf, partly opened), long alpha-hairpin (2 helices; antiparallel hairpin, left-handed twist) and Type I dockerin domains (tandem repeat of two calcium-binding loop-helix motifs, distinct from the EF-hand).
Domains within a fold are further classified into superfamilies. This is a largest grouping of proteins for which structural similarity is sufficient to indicate evolutionary relatedness and therefore share a common ancestor. However, this ancestor is presumed to be distant, because the different members of a superfamily have low sequence identities. For example, the two superfamilies of the "Globin-like" fold are: the Globin superfamily and alpha-helical ferredoxin superfamily (contains two Fe4-S4 clusters).
Protein families are more closely related than superfamilies. Domains are placed in the same family if that have either:
The similarity in sequence and structure is evidence that these proteins have a closer evolutionary relationship than do proteins in the same superfamily. Sequence tools, such as BLAST, are used to assist in placing domains into superfamilies and families. For example, the four families in the "globin-like" superfamily of the "globin-like" fold are truncated hemoglobin (lack the first helix), nerve tissue mini-hemoglobin (lack the first helix but otherwise is more similar to conventional globins than the truncated ones), globins (Heme-binding protein), and phycocyanin-like phycobilisome proteins (oligomers of two different types of globin-like subunits containing two extra helices at the N-terminus binds a bilin chromophore). Families in SCOP are each assigned a concise classification string, sccs, where the letter identifies the class to which the domain belongs; the following integers identify the fold, superfamily, and family, respectively (e.g., a.1.1.2 for the "Globin" family). [10]
A "TaxId" is the taxonomy ID number and links to the NCBI taxonomy browser, which provides more information about the species to which the protein belongs. Clicking on a species or isoform brings up a list of domains. For example, the "Hemoglobin, alpha-chain from Human (Homo sapiens)" protein has >190 solved protein structures, such as 2dn3 (complexed with cmo), and 2dn1 (complexed with hem, mbn, oxy). Clicking on the PDB numbers is supposed to display the structure of the molecule, but the links are currently broken (links work in pre-SCOP).
Most pages in SCOP contain a search box. Entering "trypsin +human" retrieves several proteins, including the protein trypsinogen from humans. Selecting that entry displays a page that includes the "lineage", which is at the top of most SCOP pages.
Searching for "Subtilisin" returns the protein, "Subtilisin from Bacillus subtilis, carlsberg", with the following lineage.
Although both of these proteins are proteases, they do not even belong to the same fold, which is consistent with them being an example of convergent evolution.
SCOP classification is more dependent on manual decisions than the semi-automatic classification by CATH, its chief rival. Human expertise is used to decide whether certain proteins are evolutionary related and therefore should be assigned to the same superfamily, or their similarity is a result of structural constraints and therefore they belong to the same fold. Another database, FSSP, is purely automatically generated (including regular automatic updates) but offers no classification, allowing the user to draw their own conclusion as to the significance of structural relationships based on the pairwise comparisons of individual protein structures.
By 2009, the original SCOP database manually classified 38,000 PDB entries into a strictly hierarchical structure. With the accelerating pace of protein structure publications, the limited automation of classification could not keep up, leading to a non-comprehensive dataset. The Structural Classification of Proteins extended (SCOPe) database was released in 2012 with far greater automation of the same hierarchical system and is full backwards compatible with SCOP version 1.75. In 2014, manual curation was reintroduced into SCOPe to maintain accurate structure assignment. As of February 2015, SCOPe 2.05 classified 71,000 of the 110,000 total PDB entries. [11]
SCOP2 prototype was a beta version of Structural classification of proteins and classification system that aimed to more the evolutionary complexity inherent in protein structure evolution. [12] It is therefore not a simple hierarchy, but a directed acyclic graph network connecting protein superfamilies representing structural and evolutionary relationships such as circular permutations, domain fusion and domain decay. Consequently, domains are not separated by strict fixed boundaries, but rather are defined by their relationships to the most similar other structures. The prototype was used for the development of the SCOP version 2 database. [7] The SCOP version 2, release January 2020, contains 5134 families and 2485 superfamilies compared to 3902 families and 1962 superfamilies in SCOP 1.75. The classification levels organise more than 41 000 non-redundant domains that represent more than 504 000 protein structures.
The Evolutionary Classification of Protein Domains (ECOD) database released in 2014 is a similar to SCOPe expansion of SCOP version 1.75. Unlike the compatible SCOPe, it renames the class-fold-superfamily-family hierarchy into an architecture-X-homology-topology-family (A-XHTF) grouping, with the last level mostly defined by Pfam and supplemented by HHsearch clustering for uncategorized sequences. [13] ECOD has the best PDB coverage of all three successors: it covers every PDB structure, and is updated biweekly. [14] The direct mapping to Pfam has proven useful to Pfam curators who use the homology-level category to supplement their "clan" grouping. [15]
The beta sheet is a common motif of the regular protein secondary structure. Beta sheets consist of beta strands (β-strands) connected laterally by at least two or three backbone hydrogen bonds, forming a generally twisted, pleated sheet. A β-strand is a stretch of polypeptide chain typically 3 to 10 amino acids long with backbone in an extended conformation. The supramolecular association of β-sheets has been implicated in the formation of the fibrils and protein aggregates observed in amyloidosis, Alzheimer's disease and other proteinopathies.
In biochemistry, globular proteins or spheroproteins are spherical ("globe-like") proteins and are one of the common protein types. Globular proteins are somewhat water-soluble, unlike the fibrous or membrane proteins. There are multiple fold classes of globular proteins, since there are many different architectures that can fold into a roughly spherical shape.
Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by computational biology; it is important in medicine and biotechnology.
In a chain-like biological molecule, such as a protein or nucleic acid, a structural motif is a common three-dimensional structure that appears in a variety of different, evolutionarily unrelated molecules. A structural motif does not have to be associated with a sequence motif; it can be represented by different and completely unrelated sequences in different proteins or RNA.
A protein family is a group of evolutionarily related proteins. In many cases, a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term "protein family" should not be confused with family as it is used in taxonomy.
In biology and biochemistry, protease inhibitors, or antiproteases, are molecules that inhibit the function of proteases. Many naturally occurring protease inhibitors are proteins.
The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues including Janet Thornton and David Jones, and continues to be developed by the Orengo group at University College London. CATH shares many broad features with the SCOP resource, however there are also many areas in which the detailed classification differs greatly.
Protein structure is the three-dimensional arrangement of atoms in an amino acid-chain molecule. Proteins are polymers – specifically polypeptides – formed from sequences of amino acids, which are the monomers of the polymer. A single amino acid monomer may also be called a residue, which indicates a repeating unit of a polymer. Proteins form by amino acids undergoing condensation reactions, in which the amino acids lose one water molecule per reaction in order to attach to one another with a peptide bond. By convention, a chain under 30 amino acids is often identified as a peptide, rather than a protein. To be able to perform their biological function, proteins fold into one or more specific spatial conformations driven by a number of non-covalent interactions, such as hydrogen bonding, ionic interactions, Van der Waals forces, and hydrophobic packing. To understand the functions of proteins at a molecular level, it is often necessary to determine their three-dimensional structure. This is the topic of the scientific field of structural biology, which employs techniques such as X-ray crystallography, NMR spectroscopy, cryo-electron microscopy (cryo-EM) and dual polarisation interferometry, to determine the structure of proteins.
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. Last version of Pfam, 36.0, was released in September 2023 and contains 20,795 families. It is currently provided through InterPro database.
InterPro is a database of protein families, protein domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them.
In protein structures, a beta barrel(β barrel) is a beta sheet composed of tandem repeats that twists and coils to form a closed toroidal structure in which the first strand is bonded to the last strand. Beta-strands in many beta-barrels are arranged in an antiparallel fashion. Beta barrel structures are named for resemblance to the barrels used to contain liquids. Most of them are water-soluble outer membrane proteins and frequently bind hydrophobic ligands in the barrel center, as in lipocalins. Others span cell membranes and are commonly found in porins. Porin-like barrel structures are encoded by as many as 2–3% of the genes in Gram-negative bacteria. It has been shown that more than 600 proteins with various function such as oxidase, dismutase, and amylase contain the beta barrel structure.
The trefoil knot fold is a protein fold in which the protein backbone is twisted into a trefoil knot shape. "Shallow" knots in which the tail of the polypeptide chain only passes through a loop by a few residues are uncommon, but "deep" knots in which many residues are passed through the loop are extremely rare. Deep trefoil knots have been found in the SPOUT superfamily. including methyltransferase proteins involved in posttranscriptional RNA modification in all three domains of life, including bacterium Thermus thermophilus and proteins, in archaea and in eukaryota.
In molecular biology, a protein domain is a region of a protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of several domains, and a domain may appear in a variety of different proteins. Molecular evolution uses domains as building blocks and these may be recombined in different arrangements to create proteins with different functions. In general, domains vary in length from between about 50 amino acids up to 250 amino acids in length. The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges. Domains often form functional units, such as the calcium-binding EF hand domain of calmodulin. Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins.
SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common Ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Cyrus Homi Chothia was an English biochemist who was an emeritus scientist at the Medical Research Council (MRC) Laboratory of Molecular Biology (LMB) at the University of Cambridge and emeritus fellow of Wolfson College, Cambridge.
In molecular biology, protein fold classes are broad categories of protein tertiary structure topology. They describe groups of proteins that share similar amino acid and secondary structure proportions. Each class contains multiple, independent protein superfamilies.
A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred. Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence similarity is evident. Sequence homology can then be deduced even if not apparent. Superfamilies typically contain several protein families which show sequence similarity within each family. The term protein clan is commonly used for protease and glycosyl hydrolases superfamilies based on the MEROPS and CAZy classification systems.
Alexander George Bateman is a computational biologist and Head of Protein Sequence Resources at the European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL) in Cambridge, UK. He has led the development of the Pfam biological database and introduced the Rfam database of RNA families. He has also been involved in the use of Wikipedia for community-based annotation of biological databases.
Julian John Thurstan Gough was a Group Leader in the Laboratory of Molecular Biology (LMB) of the Medical Research Council (MRC). He was previously a professor of bioinformatics at the University of Bristol.