Evolutionary Classification of Protein Domains

ECOD
Content
Data types captured: Protein domains
Contact
Research center: Grishin Lab, University of Texas Southwestern Medical Center
Authors: H. Cheng, R. D. Schaeffer, Y. Liao, L. N. Kinch, J. Pei, S. Shi, B. H. Kim, N. V. Grishin
Primary citation: PMID 25474468
Release date: 2014
Access
Website: http://prodata.swmed.edu/ecod/
Miscellaneous
Version: continuously updated
Curation policy: manual for new proteins; automated for ones with close matches

The Evolutionary Classification of Protein Domains (ECOD) is a biological database that classifies protein domains available from the Protein Data Bank. ECOD aims to determine the evolutionary relationships between proteins.

Like Pfam, CATH, and SCOP, ECOD classifies domains rather than whole proteins. However, ECOD places more emphasis on evolutionary relationships: instead of grouping proteins by fold, which may simply reflect convergent evolution, ECOD groups proteins only by demonstrable homology.[1]
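The difference between grouping by fold and grouping by homology can be illustrated with a minimal sketch. The records below are hypothetical toy data: the domain identifiers, fold labels, and homology-group labels are invented for illustration and do not come from ECOD or any other database.

```python
from collections import defaultdict

# Hypothetical toy records: (domain_id, fold_label, homology_group).
# All identifiers are invented for illustration; none are real ECOD entries.
DOMAINS = [
    ("d1", "fold_A", "hom_1"),
    ("d2", "fold_A", "hom_1"),
    ("d3", "fold_A", "hom_2"),  # same fold as d1/d2, but no demonstrated homology
    ("d4", "fold_B", "hom_3"),
]


def group_by(records, key_index):
    """Group domain ids by the field found at position key_index."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[0])
    return dict(groups)


by_fold = group_by(DOMAINS, 1)      # fold-based grouping
by_homology = group_by(DOMAINS, 2)  # homology-based grouping

print(by_fold)
print(by_homology)
```

In this toy data, the fold-based grouping places d3 together with d1 and d2, while the homology-based grouping keeps d3 apart because no common ancestry has been demonstrated for it.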

Related Research Articles

Gram-positive bacteria: Bacteria that give a positive result in the Gram stain test

In bacteriology, gram-positive bacteria are bacteria that give a positive result in the Gram stain test, which is traditionally used to quickly classify bacteria into two broad categories according to their type of cell wall.

Excavata: Supergroup of unicellular organisms belonging to the domain Eukaryota

Excavata is a major supergroup of unicellular organisms belonging to the domain Eukaryota. It was first suggested by Simpson and Patterson in 1999 and introduced by Thomas Cavalier-Smith in 2002 as a formal taxon. It contains a variety of free-living and symbiotic forms, and also includes some important parasites of humans, including Giardia and Trichomonas. Excavates were formerly placed in the now-obsolete kingdom Protista. They are classified based on their flagellar structures and are considered the most basal flagellate lineage. Phylogenomic analyses split the excavates into three groups that are not all closely related: discobids, metamonads, and malawimonads. Except for Euglenozoa, they are all non-photosynthetic.

Protein structure prediction

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence, that is, the prediction of its secondary and tertiary structure from its primary structure. Structure prediction is different from the inverse problem of protein design. It is one of the most important goals of computational biology and is important in medicine and biotechnology.

Protein family

A protein family is a group of evolutionarily-related proteins. In many cases a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term protein family should not be confused with family as it is used in taxonomy.

Structural Classification of Proteins database

The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. A motivation for this classification is to determine the evolutionary relationship between proteins. Proteins with the same shapes but having little sequence or functional similarity are placed in different superfamilies, and are assumed to have only a very distant common ancestor. Proteins having the same shape and some similarity of sequence and/or function are placed in "families", and are assumed to have a closer common ancestor.

Caenophidia: Clade of snakes

The Caenophidia are a derived clade of alethinophidian snakes, which contains over 80% of all the extant species of snakes. The largest family is Colubridae, but it also includes at least seven other families, at least four of which were once classified as "Colubridae" before molecular phylogenetics helped us understand their relationships. It has been found to be monophyletic.

Orthohepevirus A: Species of virus

The hepatitis E virus (HEV) is the causative agent of hepatitis E. It is of the species Orthohepevirus A.

Rossmann fold

The Rossmann fold is a tertiary fold found in proteins that bind nucleotides, such as enzyme cofactors FAD, NAD+, and NADP+. This fold is composed of alternating beta strands and alpha helical segments where the beta strands are hydrogen bonded to each other forming an extended beta sheet and the alpha helices surround both faces of the sheet to produce a three-layered sandwich. The classical Rossmann fold contains six beta strands whereas Rossmann-like folds, sometimes referred to as Rossmannoid folds, contain only five strands. The initial beta-alpha-beta (bab) fold is the most conserved segment of the Rossmann fold. The motif is named after Michael Rossmann who first noticed this structural motif in the enzyme lactate dehydrogenase in 1970 and who later observed that this was a frequently occurring motif in nucleotide binding proteins.

The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues including Janet Thornton and David Jones, and continues to be developed by the Orengo group at University College London. CATH shares many broad features with the SCOP resource; however, there are also many areas in which the detailed classification differs greatly.

Protein structure: Three-dimensional arrangement of atoms in an amino acid-chain molecule

Protein structure is the three-dimensional arrangement of atoms in an amino acid-chain molecule. Proteins are polymers – specifically polypeptides – formed from sequences of amino acids, the monomers of the polymer. A single amino acid monomer may also be called a residue indicating a repeating unit of a polymer. Proteins form by amino acids undergoing condensation reactions, in which the amino acids lose one water molecule per reaction in order to attach to one another with a peptide bond. By convention, a chain under 30 amino acids is often identified as a peptide, rather than a protein. To be able to perform their biological function, proteins fold into one or more specific spatial conformations driven by a number of non-covalent interactions such as hydrogen bonding, ionic interactions, Van der Waals forces, and hydrophobic packing. To understand the functions of proteins at a molecular level, it is often necessary to determine their three-dimensional structure. This is the topic of the scientific field of structural biology, which employs techniques such as X-ray crystallography, NMR spectroscopy, cryo electron microscopy (cryo-EM) and dual polarisation interferometry to determine the structure of proteins.

Sequence homology: Shared ancestry between DNA, RNA or protein sequences

Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a speciation event (orthologs), or a duplication event (paralogs), or else a horizontal gene transfer event (xenologs).
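The three cases named above can be written as a small lookup, shown here as a minimal sketch. The names `Event`, `HOMOLOG_TYPE`, and `homolog_type` are hypothetical and chosen only for illustration; the sketch encodes the definitions and does not infer homology from sequence data.

```python
from enum import Enum


class Event(Enum):
    """Evolutionary event that separated two homologous sequences."""
    SPECIATION = "speciation"
    DUPLICATION = "duplication"
    HORIZONTAL_TRANSFER = "horizontal gene transfer"


# Mapping from the separating event to the term used for the homologs.
HOMOLOG_TYPE = {
    Event.SPECIATION: "ortholog",
    Event.DUPLICATION: "paralog",
    Event.HORIZONTAL_TRANSFER: "xenolog",
}


def homolog_type(event):
    """Return the homolog term for the given separating event."""
    return HOMOLOG_TYPE[event]


for event in Event:
    print(f"{event.value}: {homolog_type(event)}")
```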

Conserved sequence: Similar DNA, RNA or protein sequences within genomes or among species

In evolutionary biology, conserved sequences are identical or similar sequences in nucleic acids or proteins across species, or within a genome, or between donor and receptor taxa. Conservation indicates that a sequence has been maintained by natural selection.

Pfam

Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. Pfam 33.1, released in May 2020, contains 18,259 families.

Cytochrome b: A mitochondrial protein involved in the respiratory chain

Cytochrome b is a protein found in the mitochondria of eukaryotic cells. It functions as part of the electron transport chain and is the main subunit of transmembrane cytochrome bc1 and b6f complexes.

InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them.

Protein domain: Conserved part of a protein

A protein domain is a region of the protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of several domains. One domain may appear in a variety of different proteins. Molecular evolution uses domains as building blocks and these may be recombined in different arrangements to create proteins with different functions. In general, domains vary in length from about 50 to 250 amino acids. The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges. Domains often form functional units, such as the calcium-binding EF hand domain of calmodulin. Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins.

A protein subfamily is a level of protein classification based on close evolutionary relationships. It is below the larger levels of protein superfamily and protein family.

SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Protein fold class

Protein fold classes are broad categories of protein tertiary structure topology. They describe groups of proteins that share similar amino acid and secondary structure proportions. Each class contains multiple, independent protein superfamilies.

A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred. Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence similarity is evident. Sequence homology can then be deduced even if not apparent. Superfamilies typically contain several protein families which show sequence similarity within each family. The term protein clan is commonly used for protease and glycosyl hydrolase superfamilies based on the MEROPS and CAZy classification systems.

References

  1. Cheng, H; Schaeffer, RD; Liao, Y; Kinch, LN; Pei, J; Shi, S; Kim, BH; Grishin, NV (December 2014). "ECOD: an evolutionary classification of protein domains". PLOS Computational Biology. 10 (12): e1003926. Bibcode:2014PLSCB..10E3926C. doi:10.1371/journal.pcbi.1003926. PMC 4256011. PMID 25474468.