Structural Classification of Proteins database

SCOP
Content
Description: Protein Structure Classification
Contact
Research center: Laboratory of Molecular Biology
Authors: Alexey G. Murzin, Steven E. Brenner, Tim J. P. Hubbard, and Cyrus Chothia
Primary citation: PMID 7723011
Release date: 1994
Access
Website: http://scop.mrc-lmb.cam.ac.uk/scop/
Miscellaneous
Version: 1.75 (June 2009; 110,800 domains in 38,221 structures classed as 3,902 families) [1]
Curation policy: manual

SCOPe
Content
Description: SCOP - extended
Contact
Authors: Naomi K. Fox, Steven E. Brenner, and John-Marc Chandonia
Primary citation: PMID 24304899
Access
Website: https://scop.berkeley.edu
Miscellaneous
Version: 2.07 (March 2018; 276,231 domains in 87,224 structures classed as 4,919 families) [2]
Curation policy: manual (new classifications) and automated (new structures, BLAST)

The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. A motivation for this classification is to determine the evolutionary relationships between proteins. Proteins with the same shape but little sequence or functional similarity are grouped into "superfamilies" and are assumed to have only a very distant common ancestor. Proteins with the same shape and some similarity of sequence and/or function are placed in "families" and are assumed to have a closer common ancestor.

Like the CATH and Pfam databases, SCOP classifies individual structural domains of proteins rather than entire proteins, which may contain several different domains.

The SCOP database is freely accessible on the internet. SCOP was created in 1994 in the Centre for Protein Engineering and the Laboratory of Molecular Biology. [3] It was maintained by Alexey G. Murzin and his colleagues in the Centre for Protein Engineering until its closure in 2010 and subsequently at the Laboratory of Molecular Biology in Cambridge, England. [4] [5] [6] [1]

Work on SCOP 1.75 was discontinued in 2014. Since then, the SCOPe team at UC Berkeley has been responsible for updating the database in a compatible manner, using a combination of automated and manual methods. As of April 2019, the latest release is SCOPe 2.07 (March 2018). [2]

The new Structural Classification of Proteins version 2 (SCOP2) database was released at the beginning of 2020. The update featured an improved database schema, a new API and a modernised web interface. This was the most significant update by the Cambridge group since SCOP 1.75, and it built on the schema advances of the SCOP 2 prototype. [7]

Hierarchical organisation

The source of protein structures is the Protein Data Bank. The unit of classification of structure in SCOP is the protein domain. What the SCOP authors mean by "domain" is suggested by their statement that small proteins and most medium-sized ones have just one domain, [8] and by the observation that human hemoglobin, [9] which has an α2β2 structure, is assigned two SCOP domains, one for the α and one for the β subunit.

The shapes of domains are called "folds" in SCOP. Domains belonging to the same fold have the same major secondary structures in the same arrangement with the same topological connections. SCOP version 1.75 defines 1195 folds and gives a short description of each. For example, the "globin-like" fold is described as "core: 6 helices; folded leaf, partly opened". The fold to which a domain belongs is determined by inspection, rather than by software.

The levels of SCOP version 1.75 are as follows.

  1. Class: Types of folds, e.g., beta sheets.
  2. Fold: The different shapes of domains within a class.
  3. Superfamily: The domains in a fold are grouped into superfamilies, which have at least a distant common ancestor.
  4. Family: The domains in a superfamily are grouped into families, which have a more recent common ancestor.
  5. Protein domain: The domains in families are grouped into protein domains, which are essentially the same protein.
  6. Species: The domains in "protein domains" are grouped according to species.
  7. Domain: part of a protein. For simple proteins, it can be the entire protein.

Classes

The broadest groups in SCOP version 1.75 are the protein fold classes. These classes group structures with similar secondary structure composition but different overall tertiary structures and evolutionary origins. They sit directly below the top-level "root" of the SCOP hierarchical classification.

  1. All alpha proteins [46456] (284): Domains consisting of α-helices
  2. All beta proteins [48724] (174): Domains consisting of β-sheets
  3. Alpha and beta proteins (a/b) [51349] (147): Mainly parallel beta sheets (beta-alpha-beta units)
  4. Alpha and beta proteins (a+b) [53931] (376): Mainly antiparallel beta sheets (segregated alpha and beta regions)
  5. Multi-domain proteins (alpha and beta) [56572] (66): Folds consisting of two or more domains belonging to different classes
  6. Membrane and cell surface proteins and peptides [56835] (58): Does not include proteins in the immune system
  7. Small proteins [56992] (90): Usually dominated by metal ligand, cofactor, and/or disulfide bridges
  8. Coiled-coil proteins [57942] (7): Not a true class
  9. Low resolution protein structures [58117] (26): Peptides and fragments. Not a true class
  10. Peptides [58231] (121): Peptides and fragments. Not a true class
  11. Designed proteins [58788] (44): Experimental structures of proteins with essentially non-natural sequences. Not a true class

The number in brackets, called a "sunid", is a SCOP unique integer identifier for each node in the SCOP hierarchy. The number in parentheses indicates how many elements are in each category. For example, there are 284 folds in the "All alpha proteins" class. Each member of the hierarchy is a link to the next level of the hierarchy.
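The "name [sunid] (count)" notation used in the list above is regular enough to parse mechanically. The following is a minimal Python sketch, assuming entries are written exactly as in this article; the names `ScopNode`, `ENTRY` and `parse_entry` are hypothetical and are not part of any SCOP software.

```python
import re
from typing import NamedTuple


class ScopNode(NamedTuple):
    """One node of the hierarchy as printed in the class list (hypothetical type)."""
    name: str      # human-readable name, e.g. "All alpha proteins"
    sunid: int     # SCOP unique integer identifier (number in brackets)
    children: int  # number of elements in the category (number in parentheses)


# Matches "Name [sunid] (count)"; anything after the count (the description) is ignored.
ENTRY = re.compile(r"^(?P<name>.+?)\s*\[(?P<sunid>\d+)\]\s*\((?P<count>\d+)\)")


def parse_entry(text: str) -> ScopNode:
    """Parse one listing entry into its name, sunid and child count."""
    match = ENTRY.match(text.strip())
    if match is None:
        raise ValueError(f"not a SCOP listing entry: {text!r}")
    return ScopNode(match["name"], int(match["sunid"]), int(match["count"]))


node = parse_entry("All alpha proteins [46456] (284): Domains consisting of α-helices")
print(node.name, node.sunid, node.children)
```

The lazy `.+?` in the pattern lets names that themselves contain parentheses, such as "Alpha and beta proteins (a/b)", be captured whole, because the match only succeeds once a bracketed number is followed by a parenthesised number.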

Folds

Each class contains a number of distinct folds. This classification level indicates similar tertiary structure, but not necessarily evolutionary relatedness. For example, the "All alpha proteins" class contains more than 280 distinct folds, including globin-like (core: 6 helices; folded leaf, partly opened), long alpha-hairpin (2 helices; antiparallel hairpin, left-handed twist) and type I dockerin domains (tandem repeat of two calcium-binding loop-helix motifs, distinct from the EF-hand).

Superfamilies

Domains within a fold are further classified into superfamilies. A superfamily is the largest grouping of proteins for which structural similarity is sufficient to indicate evolutionary relatedness, so its members are presumed to share a common ancestor. However, this ancestor is presumed to be distant, because the different members of a superfamily have low sequence identities. For example, the two superfamilies of the "Globin-like" fold are the Globin superfamily and the alpha-helical ferredoxin superfamily (contains two Fe4-S4 clusters).

Families

Protein families are more closely related than superfamilies. Domains are placed in the same family if they have either:

  1. >30% sequence identity
  2. some sequence identity (e.g., 15%) and perform the same function
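The two criteria above can be sketched as a simple predicate. This is only an illustration under stated assumptions: in SCOP the decision is made by curators, the 15% figure is given in the text merely as an example of "some" identity, and the function name and parameters below are hypothetical.

```python
def same_family(identity_pct: float, same_function: bool,
                high_cutoff: float = 30.0, low_cutoff: float = 15.0) -> bool:
    """Return True if two domains would satisfy either family criterion.

    high_cutoff: identity above which sequence similarity alone suffices.
    low_cutoff:  illustrative stand-in for "some sequence identity"; an
                 assumption taken from the example value in the text.
    """
    # Criterion 1: more than 30% sequence identity.
    if identity_pct > high_cutoff:
        return True
    # Criterion 2: some sequence identity together with the same function.
    return same_function and identity_pct >= low_cutoff


print(same_family(35.0, same_function=False))  # criterion 1 applies
print(same_family(20.0, same_function=True))   # criterion 2 applies
print(same_family(20.0, same_function=False))  # neither criterion applies
```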

The similarity in sequence and structure is evidence that these proteins have a closer evolutionary relationship than do proteins in the same superfamily. Sequence tools, such as BLAST, are used to assist in placing domains into superfamilies and families. For example, the four families in the "globin-like" superfamily of the "globin-like" fold are truncated hemoglobins (which lack the first helix), nerve tissue mini-hemoglobins (which lack the first helix but are otherwise more similar to conventional globins than to the truncated ones), globins (heme-binding proteins), and phycocyanin-like phycobilisome proteins (oligomers of two different types of globin-like subunits that contain two extra helices at the N-terminus and bind a bilin chromophore). Each family in SCOP is assigned a concise classification string (sccs), in which the letter identifies the class to which the domain belongs and the following integers identify the fold, superfamily, and family, respectively (e.g., a.1.1.2 for the "Globin" family). [10]
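Because an sccs string encodes four levels of the hierarchy, it can be split into its components to test how closely two families are related. The following Python sketch assumes the four-part "letter.fold.superfamily.family" form described above; the `Sccs` type and both functions are hypothetical helpers, not part of SCOP itself.

```python
from typing import NamedTuple


class Sccs(NamedTuple):
    """Components of a family-level sccs string such as "a.1.1.2"."""
    scop_class: str   # letter identifying the class
    fold: int         # fold number within the class
    superfamily: int  # superfamily number within the fold
    family: int       # family number within the superfamily


def parse_sccs(sccs: str) -> Sccs:
    """Split a family-level sccs string into class, fold, superfamily and family."""
    parts = sccs.strip().split(".")
    if len(parts) != 4 or not parts[0].isalpha():
        raise ValueError(f"expected 'letter.fold.superfamily.family', got {sccs!r}")
    cls, fold, superfamily, family = parts
    return Sccs(cls, int(fold), int(superfamily), int(family))


def same_superfamily(a: str, b: str) -> bool:
    """True if two sccs strings agree on class, fold and superfamily."""
    return parse_sccs(a)[:3] == parse_sccs(b)[:3]


globin = parse_sccs("a.1.1.2")
print(globin)
```

Comparing prefixes of the parsed tuple mirrors the nesting of the hierarchy: two families that agree on the first three components belong to the same superfamily, and those that agree on the first two share a fold.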

PDB entry domains

A "TaxId" is the taxonomy ID number and links to the NCBI taxonomy browser, which provides more information about the species to which the protein belongs. Clicking on a species or isoform brings up a list of domains. For example, the "Hemoglobin, alpha-chain from Human (Homo sapiens)" protein has >190 solved protein structures, such as 2dn3 (complexed with cmo), and 2dn1 (complexed with hem, mbn, oxy). Clicking on the PDB numbers is supposed to display the structure of the molecule, but the links are currently broken (links work in pre-SCOP).

Example

Most pages in SCOP contain a search box. Entering "trypsin +human" retrieves several proteins, including the protein trypsinogen from humans. Selecting that entry displays a page that includes the "lineage", which is at the top of most SCOP pages.

Human trypsinogen lineage
  1. Root: scop
  2. Class: All beta proteins [48724]
  3. Fold: Trypsin-like serine proteases [50493]
    barrel, closed; n=6, S=8; greek-key
    duplication: consists of two domains of the same fold
  4. Superfamily: Trypsin-like serine proteases [50494]
  5. Family: Eukaryotic proteases [50514]
  6. Protein: Trypsin(ogen) [50515]
  7. Species: Human (Homo sapiens) [TaxId: 9606] [50519]

Searching for "Subtilisin" returns the protein, "Subtilisin from Bacillus subtilis, carlsberg", with the following lineage.

Subtilisin from Bacillus subtilis, carlsberg lineage
  1. Root: scop
  2. Class: Alpha and beta proteins (a/b) [51349]
    Mainly parallel beta sheets (beta-alpha-beta units)
  3. Fold: Subtilisin-like [52742]
    3 layers: a/b/a, parallel beta-sheet of 7 strands, order 2314567; left-handed crossover connection between strands 2 & 3
  4. Superfamily: Subtilisin-like [52743]
  5. Family: Subtilases [52744]
  6. Protein: Subtilisin [52745]
  7. Species: Bacillus subtilis, carlsberg [TaxId: 1423] [52746]

Although both of these proteins are proteases, they do not even belong to the same fold, which is consistent with them being an example of convergent evolution.
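The comparison of the two lineages can be made explicit by walking both from the root and stopping at the first level where they differ. The sketch below transcribes the lineages listed above into Python tuples; the data layout and the function name `deepest_shared_level` are assumptions made for illustration, and the root is given no sunid because the listings do not show one.

```python
from typing import List, Optional, Tuple

# Each lineage entry is (level, name, sunid); values are copied from the listings above.
Lineage = List[Tuple[str, str, Optional[int]]]

TRYPSIN: Lineage = [
    ("Root", "scop", None),
    ("Class", "All beta proteins", 48724),
    ("Fold", "Trypsin-like serine proteases", 50493),
    ("Superfamily", "Trypsin-like serine proteases", 50494),
    ("Family", "Eukaryotic proteases", 50514),
    ("Protein", "Trypsin(ogen)", 50515),
    ("Species", "Human (Homo sapiens)", 50519),
]

SUBTILISIN: Lineage = [
    ("Root", "scop", None),
    ("Class", "Alpha and beta proteins (a/b)", 51349),
    ("Fold", "Subtilisin-like", 52742),
    ("Superfamily", "Subtilisin-like", 52743),
    ("Family", "Subtilases", 52744),
    ("Protein", "Subtilisin", 52745),
    ("Species", "Bacillus subtilis, carlsberg", 52746),
]


def deepest_shared_level(a: Lineage, b: Lineage) -> Optional[str]:
    """Return the name of the deepest level at which two lineages still coincide."""
    shared = None
    for node_a, node_b in zip(a, b):
        if node_a != node_b:
            break
        shared = node_a[0]
    return shared


print(deepest_shared_level(TRYPSIN, SUBTILISIN))  # Root
```

The two proteases diverge immediately below the root, at the class level, which is the observation the text uses to support convergent evolution.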

Comparison to other classification systems

SCOP classification is more dependent on manual decisions than the semi-automatic classification by CATH, its chief rival. Human expertise is used to decide whether certain proteins are evolutionarily related, and should therefore be assigned to the same superfamily, or whether their similarity results from structural constraints, in which case they belong only to the same fold. Another database, FSSP, is generated purely automatically (including regular automatic updates) but offers no classification, allowing users to draw their own conclusions about the significance of structural relationships from pairwise comparisons of individual protein structures.

SCOP successors

By 2009, the original SCOP database had manually classified 38,000 PDB entries into a strictly hierarchical structure. With the accelerating pace of protein structure publications, the limited automation of classification could not keep up, leading to a non-comprehensive dataset. The Structural Classification of Proteins extended (SCOPe) database was released in 2012 with far greater automation of the same hierarchical system and is fully backward compatible with SCOP version 1.75. In 2014, manual curation was reintroduced into SCOPe to maintain accurate structure assignment. As of February 2015, SCOPe 2.05 classified 71,000 of the 110,000 total PDB entries. [11]

The SCOP2 prototype was a beta version of a new structural classification of proteins that aimed to better capture the evolutionary complexity inherent in protein structure evolution. [12] It is therefore not a simple hierarchy but a directed acyclic graph connecting protein superfamilies, representing structural and evolutionary relationships such as circular permutations, domain fusion and domain decay. Consequently, domains are not separated by strict fixed boundaries, but rather are defined by their relationships to the most similar other structures. The prototype was used for the development of the SCOP version 2 database. [7] SCOP version 2, released in January 2020, contains 5,134 families and 2,485 superfamilies, compared to 3,902 families and 1,962 superfamilies in SCOP 1.75. The classification levels organise more than 41,000 non-redundant domains that represent more than 504,000 protein structures.

The Evolutionary Classification of Protein Domains (ECOD) database, released in 2014, is, like SCOPe, an expansion of SCOP version 1.75. Unlike the compatible SCOPe, it replaces the class-fold-superfamily-family hierarchy with an architecture-X-homology-topology-family (A-XHTF) grouping, with the last level mostly defined by Pfam and supplemented by HHsearch clustering for uncategorized sequences. [13] ECOD has the best PDB coverage of all three successors: it covers every PDB structure and is updated biweekly. [14] The direct mapping to Pfam has proven useful to Pfam curators, who use the homology-level category to supplement their "clan" grouping. [15]

References

  1. Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG (January 2008). "Data growth and its impact on the SCOP database: new developments". Nucleic Acids Research. 36 (Database issue): D419–25. doi:10.1093/nar/gkm993. PMC 2238974. PMID 18000004.
  2. Chandonia JM, Fox NK, Brenner SE (January 2019). "SCOPe: classification of large macromolecular structures in the structural classification of proteins-extended database". Nucleic Acids Research. 47 (D1): D475–D481. doi:10.1093/nar/gky1134. PMC 6323910. PMID 30500919.
  3. Murzin AG, Brenner SE, Hubbard T, Chothia C (April 1995). "SCOP: a structural classification of proteins database for the investigation of sequences and structures". Journal of Molecular Biology. 247 (4): 536–40. doi:10.1016/S0022-2836(05)80134-2. PMID 7723011.
  4. Hubbard TJ, Ailey B, Brenner SE, Murzin AG, Chothia C (January 1999). "SCOP: a Structural Classification of Proteins database". Nucleic Acids Research. 27 (1): 254–6. doi:10.1093/nar/27.1.254. PMC 148149. PMID 9847194.
  5. Lo Conte L, Ailey B, Hubbard TJ, Brenner SE, Murzin AG, Chothia C (January 2000). "SCOP: a structural classification of proteins database". Nucleic Acids Research. 28 (1): 257–9. doi:10.1093/nar/28.1.257. PMC 102479. PMID 10592240.
  6. Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AG (January 2004). "SCOP database in 2004: refinements integrate structure and sequence family data". Nucleic Acids Research. 32 (Database issue): D226–9. doi:10.1093/nar/gkh039. PMC 308773. PMID 14681400.
  7. Andreeva A, Kulesha E, Gough J, Murzin AG (January 2020). "SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures". Nucleic Acids Research. 48 (Database issue): D376–D382. doi:10.1093/nar/gkz1064. PMC 7139981. PMID 31724711.
  8. Murzin AG, Brenner SE, Hubbard T, Chothia C (April 1995). "SCOP: a structural classification of proteins database for the investigation of sequences and structures" (PDF). Journal of Molecular Biology. 247 (4): 536–40. doi:10.1016/S0022-2836(05)80134-2. PMID 7723011. Archived from the original (PDF) on 2012-04-26.
  9. PDB: 2DN1; Park SY, Yokoyama T, Shibayama N, Shiro Y, Tame JR (July 2006). "1.25 A resolution crystal structures of human haemoglobin in the oxy, deoxy and carbonmonoxy forms". Journal of Molecular Biology. 360 (3): 690–701. doi:10.1016/j.jmb.2006.05.036. PMID 16765986.
  10. Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG (January 2002). "SCOP database in 2002: refinements accommodate structural genomics". Nucleic Acids Research. 30 (1): 264–7. doi:10.1093/nar/30.1.264. PMC 99154. PMID 11752311.
  11. "What is the relationship between SCOP, SCOPe, and SCOP2". scop.berkeley.edu. Retrieved 2015-08-22.
  12. Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG (January 2014). "SCOP2 prototype: a new approach to protein structure mining". Nucleic Acids Research. 42 (Database issue): D310–4. doi:10.1093/nar/gkt1242. PMC 3964979. PMID 24293656.
  13. Cheng H, Schaeffer RD, Liao Y, Kinch LN, Pei J, Shi S, Kim BH, Grishin NV (December 2014). "ECOD: an evolutionary classification of protein domains". PLOS Computational Biology. 10 (12): e1003926. Bibcode:2014PLSCB..10E3926C. doi:10.1371/journal.pcbi.1003926. PMC 4256011. PMID 25474468.
  14. "Evolutionary Classification of Protein Domains". prodata.swmed.edu. Retrieved 18 May 2019.
  15. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, Qureshi M, Richardson LJ, Salazar GA, Smart A, Sonnhammer EL, Hirsh L, Paladin L, Piovesan D, Tosatto SC, Finn RD (January 2019). "The Pfam protein families database in 2019". Nucleic Acids Research. 47 (D1): D427–D432. doi:10.1093/nar/gky995. PMC 6324024. PMID 30357350.