Content | |
---|---|
Description | Protein Structure Classification |
Contact | |
Research center | University College London |
Laboratory | Institute of Structural and Molecular Biology |
Primary citation | Dawson et al. (2016) [1] |
Release date | 1997 |
Access | |
Website | cathdb |
Download URL | cathdb |
Miscellaneous | |
Data release frequency | CATH-B is released daily. Official releases are approximately annual. |
Version | 4.3 |
The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues including Janet Thornton and David Jones, [2] and continues to be developed by the Orengo group at University College London. CATH shares many broad features with the SCOP resource, however there are also many areas in which the detailed classification differs greatly. [3] [4] [5] [6]
Experimentally determined protein three-dimensional structures are obtained from the Protein Data Bank and split into their consecutive polypeptide chains, where applicable. Protein domains are identified within these chains using a mixture of automatic methods and manual curation. [7]
The domains are then classified within the CATH structural hierarchy: at the Class (C) level, domains are assigned according to their secondary structure content, i.e. all alpha, all beta, a mixture of alpha and beta, or little secondary structure; at the Architecture (A) level, information on the secondary structure arrangement in three-dimensional space is used for assignment; at the Topology/fold (T) level, information on how the secondary structure elements are connected and arranged is used; assignments are made to the Homologous superfamily (H) level if there is good evidence that the domains are related by evolution [2] i.e. they are homologous.
# | Level | Description |
---|---|---|
1 | Class | the overall secondary-structure content of the domain. (Equivalent to the SCOP Class) |
2 | Architecture | high structural similarity but no evidence of homology. |
3 | Topology/fold | a large-scale grouping of topologies which share particular structural features (Equivalent to the 'fold' level in SCOP) |
4 | Homologous superfamily | indicative of a demonstrable evolutionary relationship. (Equivalent to SCOP superfamily) |
Additional sequence data for domains with no experimentally determined structures are provided by CATH's sister resource, Gene3D, which are used to populate the homologous superfamilies. Protein sequences from UniProtKB and Ensembl are scanned against CATH HMMs to predict domain sequence boundaries and make homologous superfamily assignments.
The CATH team releases new data both as daily snapshots, and official releases approximately annually. The latest release of CATH-Gene3D (v4.3) was released in December 2020 and consists of: [8]
CATH is an open source software project, with developers developing and maintaining a number of open-source tools, [9] which are available publicly on GitHub. [10]
Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different from the inverse problem of protein design.
A protein family is a group of evolutionarily related proteins. In many cases, a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term "protein family" should not be confused with family as it is used in taxonomy.
Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.
The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. A motivation for this classification is to determine the evolutionary relationship between proteins. Proteins with the same shapes but having little sequence or functional similarity are placed in different superfamilies, and are assumed to have only a very distant common ancestor. Proteins having the same shape and some similarity of sequence and/or function are placed in "families", and are assumed to have a closer common ancestor.
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, USA.
In molecular biology, protein threading, also known as fold recognition, is a method of protein modeling which is used to model those proteins which have the same fold as proteins of known structures, but do not have homologous proteins with known structure. It differs from the homology modeling method of structure prediction as it is used for proteins which do not have their homologous protein structures deposited in the Protein Data Bank (PDB), whereas homology modeling is used for those proteins which do. Threading works by using statistical knowledge of the relationship between the structures deposited in the PDB and the sequence of the protein which one wishes to model.
InterPro is a database of protein families, protein domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them.
In molecular biology, a protein domain is a region of a protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of several domains, and a domain may appear in a variety of different proteins. Molecular evolution uses domains as building blocks and these may be recombined in different arrangements to create proteins with different functions. In general, domains vary in length from between about 50 amino acids up to 250 amino acids in length. The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges. Domains often form functional units, such as the calcium-binding EF hand domain of calmodulin. Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins.
Dame Janet Maureen Thornton, is a senior scientist and director emeritus at the European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL). She is one of the world's leading researchers in structural bioinformatics, using computational methods to understand protein structure and function. She served as director of the EBI from October 2001 to June 2015, and played a key role in ELIXIR.
Protein subfamily is a level of protein classification, based on their close evolutionary relationship. It is below the larger levels of protein superfamily and protein family.
MicrobesOnline is a publicly and freely accessible website that hosts multiple comparative genomic tools for comparing microbial species at the genomic, transcriptomic and functional levels. MicrobesOnline was developed by the Virtual Institute for Microbial Stress and Survival, which is based at the Lawrence Berkeley National Laboratory in Berkeley, California. The site was launched in 2005, with regular updates until 2011.
SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common Ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
The sequential structure alignment program (SSAP) in chemistry, physics, and biology is a method that uses double dynamic programming to produce a structural alignment based on atom-to-atom vectors in structure space. Instead of the alpha carbons typically used in structural alignment, SSAP constructs its vectors from the beta carbons for all residues except glycine, a method which thus takes into account the rotameric state of each residue as well as its location along the backbone. SSAP works by first constructing a series of inter-residue distance vectors between each residue and its nearest non-contiguous neighbors on each protein. A series of matrices are then constructed containing the vector differences between neighbors for each pair of residues for which vectors were constructed. Dynamic programming applied to each resulting matrix determines a series of optimal local alignments which are then summed into a "summary" matrix to which dynamic programming is applied again to determine the overall structural alignment.
David Tudor Jones is a Professor of Bioinformatics, and Head of Bioinformatics Group in the University College London. He is also the director in Bloomsbury Center for Bioinformatics, which is a joint Research Centre between UCL and Birkbeck, University of London and which also provides bioinformatics training and support services to biomedical researchers. In 2013, he is a member of editorial boards for PLoS ONE, BioData Mining, Advanced Bioinformatics, Chemical Biology & Drug Design, and Protein: Structure, Function and Bioinformatics.
In molecular biology, protein fold classes are broad categories of protein tertiary structure topology. They describe groups of proteins that share similar amino acid and secondary structure proportions. Each class contains multiple, independent protein superfamilies.
A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred. Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence similarity is evident. Sequence homology can then be deduced even if not apparent. Superfamilies typically contain several protein families which show sequence similarity within each family. The term protein clan is commonly used for protease and glycosyl hydrolases superfamilies based on the MEROPS and CAZy classification systems.
Michael Joseph Ezra Sternberg is a professor at Imperial College London, where he is director of the Centre for Integrative Systems Biology and Bioinformatics and Head of the Structural bioinformatics Group.
Julian John Thurstan Gough was a Group Leader in the Laboratory of Molecular Biology (LMB) of the Medical Research Council (MRC). He was previously a professor of bioinformatics at the University of Bristol.
Christine Anne Orengo is a Professor of Bioinformatics at University College London (UCL) known for her work on protein structure, particularly the CATH database. Orengo serves as president of the International Society for Computational Biology (ISCB), the first woman to do so in the history of the society.