CATH database

Last updated
CATH
CATH - Protein Structure Classification Database.png
Content
DescriptionProtein Structure Classification
Contact
Research center University College London
LaboratoryInstitute of Structural and Molecular Biology
Primary citationDawson et al. (2016) [1]
Release date1997
Access
Website cathdb.info
Download URL cathdb.info/download
Miscellaneous
Data release
frequency
CATH-B is released daily. Official releases are approximately annual.
Version4.3
Schematic representation of the three top levels of the CATH classification scheme. CATH hierarchy.png
Schematic representation of the three top levels of the CATH classification scheme.

The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues including Janet Thornton and David Jones, [2] and continues to be developed by the Orengo group at University College London. CATH shares many broad features with the SCOP resource, however there are also many areas in which the detailed classification differs greatly. [3] [4] [5] [6]

Contents

Hierarchical organization

Experimentally determined protein three-dimensional structures are obtained from the Protein Data Bank and split into their consecutive polypeptide chains, where applicable. Protein domains are identified within these chains using a mixture of automatic methods and manual curation. [7]

The domains are then classified within the CATH structural hierarchy: at the Class (C) level, domains are assigned according to their secondary structure content, i.e. all alpha, all beta, a mixture of alpha and beta, or little secondary structure; at the Architecture (A) level, information on the secondary structure arrangement in three-dimensional space is used for assignment; at the Topology/fold (T) level, information on how the secondary structure elements are connected and arranged is used; assignments are made to the Homologous superfamily (H) level if there is good evidence that the domains are related by evolution [2] i.e. they are homologous.

The four main levels of the CATH hierarchy:
#LevelDescription
1Classthe overall secondary-structure content of the domain. (Equivalent to the SCOP Class)
2Architecturehigh structural similarity but no evidence of homology.
3Topology/folda large-scale grouping of topologies which share particular structural features (Equivalent to the 'fold' level in SCOP)
4Homologous superfamilyindicative of a demonstrable evolutionary relationship. (Equivalent to SCOP superfamily)

Additional sequence data for domains with no experimentally determined structures are provided by CATH's sister resource, Gene3D, which are used to populate the homologous superfamilies. Protein sequences from UniProtKB and Ensembl are scanned against CATH HMMs to predict domain sequence boundaries and make homologous superfamily assignments.

Releases

The CATH team releases new data both as daily snapshots, and official releases approximately annually. The latest release of CATH-Gene3D (v4.3) was released in December 2020 and consists of: [8]

Open-source software

CATH is an open source software project, with developers developing and maintaining a number of open-source tools, [9] which are available publicly on GitHub. [10]

Related Research Articles

<span class="mw-page-title-main">Protein structure prediction</span> Type of biological prediction

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different from the inverse problem of protein design.

<span class="mw-page-title-main">Protein family</span> Group of evolutionarily-related proteins

A protein family is a group of evolutionarily related proteins. In many cases, a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term "protein family" should not be confused with family as it is used in taxonomy.

<span class="mw-page-title-main">Structural alignment</span> Aligning molecular sequences using sequence and structural information

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

<span class="mw-page-title-main">Structural Classification of Proteins database</span> Biological database of proteins

The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. A motivation for this classification is to determine the evolutionary relationship between proteins. Proteins with the same shapes but having little sequence or functional similarity are placed in different superfamilies, and are assumed to have only a very distant common ancestor. Proteins having the same shape and some similarity of sequence and/or function are placed in "families", and are assumed to have a closer common ancestor.

<span class="mw-page-title-main">UniProt</span> Database of protein sequences and functional information

UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, USA.

In molecular biology, protein threading, also known as fold recognition, is a method of protein modeling which is used to model those proteins which have the same fold as proteins of known structures, but do not have homologous proteins with known structure. It differs from the homology modeling method of structure prediction as it is used for proteins which do not have their homologous protein structures deposited in the Protein Data Bank (PDB), whereas homology modeling is used for those proteins which do. Threading works by using statistical knowledge of the relationship between the structures deposited in the PDB and the sequence of the protein which one wishes to model.

InterPro is a database of protein families, protein domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them.

<span class="mw-page-title-main">Protein domain</span> Self-stable region of a proteins chain that folds independently from the rest

In molecular biology, a protein domain is a region of a protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of several domains, and a domain may appear in a variety of different proteins. Molecular evolution uses domains as building blocks and these may be recombined in different arrangements to create proteins with different functions. In general, domains vary in length from between about 50 amino acids up to 250 amino acids in length. The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges. Domains often form functional units, such as the calcium-binding EF hand domain of calmodulin. Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins.

<span class="mw-page-title-main">Janet Thornton</span> British bioinformatician and academic

Dame Janet Maureen Thornton, is a senior scientist and director emeritus at the European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL). She is one of the world's leading researchers in structural bioinformatics, using computational methods to understand protein structure and function. She served as director of the EBI from October 2001 to June 2015, and played a key role in ELIXIR.

Protein subfamily is a level of protein classification, based on their close evolutionary relationship. It is below the larger levels of protein superfamily and protein family.

<span class="mw-page-title-main">MicrobesOnline</span>

MicrobesOnline is a publicly and freely accessible website that hosts multiple comparative genomic tools for comparing microbial species at the genomic, transcriptomic and functional levels. MicrobesOnline was developed by the Virtual Institute for Microbial Stress and Survival, which is based at the Lawrence Berkeley National Laboratory in Berkeley, California. The site was launched in 2005, with regular updates until 2011.

SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common Ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

The sequential structure alignment program (SSAP) in chemistry, physics, and biology is a method that uses double dynamic programming to produce a structural alignment based on atom-to-atom vectors in structure space. Instead of the alpha carbons typically used in structural alignment, SSAP constructs its vectors from the beta carbons for all residues except glycine, a method which thus takes into account the rotameric state of each residue as well as its location along the backbone. SSAP works by first constructing a series of inter-residue distance vectors between each residue and its nearest non-contiguous neighbors on each protein. A series of matrices are then constructed containing the vector differences between neighbors for each pair of residues for which vectors were constructed. Dynamic programming applied to each resulting matrix determines a series of optimal local alignments which are then summed into a "summary" matrix to which dynamic programming is applied again to determine the overall structural alignment.

<span class="mw-page-title-main">David T. Jones (biochemist)</span> British bioinformatician

David Tudor Jones is a Professor of Bioinformatics, and Head of Bioinformatics Group in the University College London. He is also the director in Bloomsbury Center for Bioinformatics, which is a joint Research Centre between UCL and Birkbeck, University of London and which also provides bioinformatics training and support services to biomedical researchers. In 2013, he is a member of editorial boards for PLoS ONE, BioData Mining, Advanced Bioinformatics, Chemical Biology & Drug Design, and Protein: Structure, Function and Bioinformatics.

<span class="mw-page-title-main">Protein fold class</span> Categories of protein tertiary structure

In molecular biology, protein fold classes are broad categories of protein tertiary structure topology. They describe groups of proteins that share similar amino acid and secondary structure proportions. Each class contains multiple, independent protein superfamilies.

A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred. Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence similarity is evident. Sequence homology can then be deduced even if not apparent. Superfamilies typically contain several protein families which show sequence similarity within each family. The term protein clan is commonly used for protease and glycosyl hydrolases superfamilies based on the MEROPS and CAZy classification systems.

Michael Joseph Ezra Sternberg is a professor at Imperial College London, where he is director of the Centre for Integrative Systems Biology and Bioinformatics and Head of the Structural bioinformatics Group.

Julian John Thurstan Gough was a Group Leader in the Laboratory of Molecular Biology (LMB) of the Medical Research Council (MRC). He was previously a professor of bioinformatics at the University of Bristol.

<span class="mw-page-title-main">Christine Orengo</span> Professor of Bioinformatics

Christine Anne Orengo is a Professor of Bioinformatics at University College London (UCL) known for her work on protein structure, particularly the CATH database. Orengo serves as president of the International Society for Computational Biology (ISCB), the first woman to do so in the history of the society.

References

  1. Dawson NL, Lewis TE, Das S, Lees JG, Lee D, Ashford P, et al. (January 2017). "CATH: an expanded resource to predict protein function through structure and sequence". Nucleic Acids Research. 45 (D1): D289 –D295. doi:10.1093/nar/gkw1098. PMC   5210570 . PMID   27899584.
  2. 1 2 3 Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM (August 1997). "CATH--a hierarchic classification of protein domain structures". Structure. 5 (8). London, England: 1093–108. doi: 10.1016/s0969-2126(97)00260-8 . PMID   9309224.
  3. "CATH: Protein Structure Classification Database at UCL". Cathdb.info. Retrieved 9 March 2017.
  4. "CATH". Cathdb.info. Retrieved 9 March 2017.
  5. "CATH Database (@CATHDatabase)". Twitter . Retrieved 9 March 2017.
  6. Pearl FM, Bennett CF, Bray JE, Harrison AP, Martin N, Shepherd A, et al. (January 2003). "The CATH database: an extended protein family resource for structural and functional genomics". Nucleic Acids Research. 31 (1): 452–455. doi:10.1093/nar/gkg062. PMC   165509 . PMID   12520050.
  7. "CATH". cathdb.info. Retrieved 14 September 2024.
  8. "CATH". cathdb.info. Retrieved 14 September 2024.
  9. "Tools". cathdb.info. Retrieved 18 December 2016.
  10. UCLOrengoGroup/cath-tools, UCLOrengoGroup, 9 September 2024, retrieved 14 September 2024