Protein family

The human cyclophilin family, as represented by the structures of the isomerase domains of some of its members

A protein family is a group of evolutionarily related proteins. In many cases, a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term "protein family" should not be confused with family as it is used in taxonomy.

Proteins in a family descend from a common ancestor and typically have similar three-dimensional structures, functions, and significant sequence similarity. [1] [2] Sequence similarity (usually amino-acid sequence) is one of the most common indicators of homology, or common evolutionary ancestry. [3] [4] Some frameworks for evaluating the significance of similarity between sequences use sequence alignment methods. Proteins that do not share a common ancestor are unlikely to show statistically significant sequence similarity, making sequence alignment a powerful tool for identifying the members of protein families. [3] [4] Families are sometimes grouped together into larger clades called superfamilies based on structural similarity, even if there is no identifiable sequence homology.
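As a rough illustration of what "sequence similarity" means in practice, the sketch below computes percent identity between two sequences that have already been aligned. The function name and the toy sequences are hypothetical, and percent identity is only a crude proxy: the significance frameworks mentioned above rely on alignment scores and their statistics, which this example does not attempt to reproduce.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over columns where neither sequence has a gap.

    Both inputs must be rows of the same alignment (equal length, '-' for gaps).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    matches = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy, made-up sequences: 4 identical residues out of 5 ungapped columns.
print(percent_identity("MKT-AY", "MKSQAY"))
```

Other conventions for the denominator exist (for example, the full alignment length or the shorter sequence length), so identity values are only comparable when the same convention is used.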

Currently, over 60,000 protein families have been defined, [5] although ambiguity in the definition of "protein family" leads different researchers to report widely varying numbers.

Terminology and usage

The term protein family has broad usage and can be applied to large groups of proteins with barely detectable sequence similarity as well as to narrow groups of proteins with nearly identical sequence, function, and structure. To distinguish between these cases, a hierarchical terminology is in use. At the highest level of classification are protein superfamilies, which group distantly related proteins, often based on their structural similarity. [6] [7] [8] [9] Next are protein families, which refer to proteins with a shared evolutionary origin exhibited by significant sequence similarity. [2] [10] Subfamilies can be defined within families to denote closely related proteins that have similar or identical functions. [11] For example, a superfamily like the PA clan of proteases has less sequence conservation than the C04 family within it.

Above, sequence conservation of 250 members of the PA clan proteases (superfamily). Below, sequence conservation of 70 members of the C04 protease family. Arrows indicate catalytic triad residues, aligned on the basis of structure by DALI.

Protein domains and motifs

Protein families were first recognised when most proteins that were structurally understood were small, single-domain proteins such as myoglobin, hemoglobin, and cytochrome c. Since then, many proteins have been found with multiple independent structural and functional units called domains. Due to evolutionary shuffling, different domains in a protein have evolved independently. This has led to a focus on families of protein domains. Several online resources are devoted to identifying and cataloging these domains. [12] [13]

Different regions of a protein have differing functional constraints. For example, the active site of an enzyme requires certain amino-acid residues to be precisely oriented. A protein–protein binding interface may consist of a large surface with constraints on the hydrophobicity or polarity of the amino-acid residues. Functionally constrained regions of proteins evolve more slowly than unconstrained regions such as surface loops, giving rise to blocks of conserved sequence when the sequences of a protein family are compared (see multiple sequence alignment). These blocks are most commonly referred to as motifs, although many other terms are used (blocks, signatures, fingerprints, etc.). Several online resources are devoted to identifying and cataloging protein motifs. [14]
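The idea of conserved blocks can be sketched in a few lines of code. The example below scores each column of a multiple sequence alignment by the fraction of sequences that share its most common residue, then reports runs of highly conserved columns as candidate motifs. The function names, the threshold, the minimum block length, and the toy alignment are all illustrative assumptions; real motif resources use more sophisticated scoring.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column fraction of sequences carrying the most common residue.

    `alignment` is a list of equal-length strings; '-' marks a gap and
    never counts toward the most common residue.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / n)
    return scores


def conserved_blocks(scores, threshold=0.8, min_len=3):
    """Return (start, end) half-open index ranges of consecutive conserved columns."""
    blocks = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                blocks.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_len:
        blocks.append((start, len(scores)))
    return blocks


# Toy, made-up alignment with one fully conserved block ("GHSD").
toy_alignment = [
    "AKGHSDTL",
    "TRGHSDAV",
    "-QGHSDEL",
    "SEGHSDKI",
]
toy_scores = column_conservation(toy_alignment)
for start, end in conserved_blocks(toy_scores):
    print(start, end, toy_alignment[0][start:end])
```

Counting gaps against conservation is a deliberate simplification here: a column that is mostly gaps is treated as poorly conserved.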

Evolution of protein families

According to current consensus, protein families arise in two ways. First, the separation of a parent species into two genetically isolated descendant species allows a gene/protein to independently accumulate variations (mutations) in these two lineages. This results in a family of orthologous proteins, usually with conserved sequence motifs. Second, a gene duplication may create a second copy of a gene (termed a paralog). Because the original gene is still able to perform its function, the duplicated gene is free to diverge and may acquire new functions (by random mutation).

Certain gene/protein families, especially in eukaryotes, undergo extreme expansions and contractions in the course of evolution, sometimes in concert with whole genome duplications. Expansions are less likely, and losses more likely, for intrinsically disordered proteins and for protein domains whose hydrophobic amino acids are further from the optimal degree of dispersion along the primary sequence. [15] This expansion and contraction of protein families is one of the salient features of genome evolution, but its importance and ramifications are currently unclear.

Phylogenetic tree of the RAS superfamily. This tree was created using FigTree (free online software).

Use and importance of protein families

As the total number of sequenced proteins increases and interest expands in proteome analysis, an effort is ongoing to organize proteins into families and to describe their component domains and motifs. Reliable identification of protein families is critical to phylogenetic analysis, functional annotation, and the exploration of the diversity of protein function in a given phylogenetic branch. The Enzyme Function Initiative uses protein families and superfamilies as the basis for development of a sequence/structure-based strategy for large scale functional assignment of enzymes of unknown function. [16] The algorithmic means for establishing protein families on a large scale are based on a notion of similarity.
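One simple way to turn a notion of similarity into families is single-linkage clustering: any two sequences whose similarity meets a threshold are placed in the same group, and groups are merged transitively. The sketch below is a minimal illustration under stated assumptions, not the method of any particular database. The similarity function (fraction of identical positions in ungapped, equal-length toy sequences), the threshold, and all names are hypothetical stand-ins for a real alignment-based score.

```python
def identity(a: str, b: str) -> float:
    """Toy similarity: fraction of identical positions, no gaps considered."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def cluster_families(seqs: dict, threshold: float = 0.75, similarity=identity):
    """Single-linkage clustering of named sequences using union-find."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = list(seqs)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(seqs[a], seqs[b]) >= threshold:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Made-up sequences: p1 and p3 are only 70% identical to each other,
# but both are linked through p2, so they end up in one cluster.
toy = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQK",
    "p3": "MKSAYLAKQK",
    "p4": "GSHMLEDPVW",
    "p5": "GSHMLEDPAW",
}
print(cluster_families(toy))
```

The chaining behaviour shown by p1 and p3 is one reason that the number of families obtained depends strongly on the similarity measure and threshold chosen.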

Protein family resources

Many biological databases catalog protein families and allow users to match query sequences to known families. These include:

Similarly, many database-searching algorithms exist, for example:


Related Research Articles

Sequence alignment: Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.
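The gap-insertion idea described above can be illustrated with a minimal global alignment in the style of the Needleman-Wunsch dynamic programming algorithm. This is a sketch under simplifying assumptions: the scoring values (match +1, mismatch -1, linear gap penalty -1) are arbitrary choices for illustration, whereas practical protein alignment uses substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Global alignment of two sequences; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy, made-up sequences: each aligned row is printed on its own line.
best, row_a, row_b = needleman_wunsch("MKTAY", "MKAY")
print(best)
print(row_a)
print(row_b)
```

When several alignments share the optimal score, the traceback returns just one of them; which one depends on the order in which the three moves are checked.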

Structural genomics

Structural genomics seeks to describe the 3-dimensional structure of every protein encoded by a given genome. This genome-based approach allows for a high-throughput method of structure determination by a combination of experimental and modeling approaches. The principal difference between structural genomics and traditional structural prediction is that structural genomics attempts to determine the structure of every protein encoded by the genome, rather than focusing on one particular protein. With full-genome sequences available, structure prediction can be done more quickly through a combination of experimental and modeling approaches, especially because the availability of a large number of sequenced genomes and previously solved protein structures allows scientists to model protein structure on the structures of previously solved homologs.

Protein structure prediction: Type of biological prediction

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by computational biology; it is important in medicine and biotechnology.

In a chain-like biological molecule, such as a protein or nucleic acid, a structural motif is a common three-dimensional structure that appears in a variety of different, evolutionarily unrelated molecules. A structural motif does not have to be associated with a sequence motif; it can be represented by different and completely unrelated sequences in different proteins or RNA.

Structural alignment: Aligning molecular sequences using sequence and structural information

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

Structural Classification of Proteins database: Biological database of proteins

The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. A motivation for this classification is to determine the evolutionary relationship between proteins. Proteins with the same shapes but having little sequence or functional similarity are placed in different superfamilies, and are assumed to have only a very distant common ancestor. Proteins having the same shape and some similarity of sequence and/or function are placed in "families", and are assumed to have a closer common ancestor.

In molecular biology, protein threading, also known as fold recognition, is a method of protein modeling which is used to model those proteins which have the same fold as proteins of known structures, but do not have homologous proteins with known structure. It differs from the homology modeling method of structure prediction as it is used for proteins which do not have their homologous protein structures deposited in the Protein Data Bank (PDB), whereas homology modeling is used for those proteins which do. Threading works by using statistical knowledge of the relationship between the structures deposited in the PDB and the sequence of the protein which one wishes to model.

Pfam: Database of protein families

Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. Version 36.0 of Pfam was released in September 2023 and contains 20,795 families. Pfam is currently provided through the InterPro database.

InterPro is a database of protein families, protein domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them.

Protein–protein interaction prediction is a field combining bioinformatics and structural biology in an attempt to identify and catalog physical interactions between pairs or groups of proteins. Understanding protein–protein interactions is important for the investigation of intracellular signaling pathways, modelling of protein complex structures and for gaining insights into various biochemical processes.

Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database originally developed at the Wellcome Trust Sanger Institute in collaboration with Janelia Farm, and currently hosted at the European Bioinformatics Institute. Rfam is designed to be similar to the Pfam database for annotating protein families.

Protein domain: Self-stable region of a protein's chain that folds independently from the rest

In molecular biology, a protein domain is a region of a protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of several domains, and a domain may appear in a variety of different proteins. Molecular evolution uses domains as building blocks and these may be recombined in different arrangements to create proteins with different functions. In general, domains vary in length from about 50 to 250 amino acids. The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges. Domains often form functional units, such as the calcium-binding EF hand domain of calmodulin. Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins.

A protein subfamily is a level of protein classification based on close evolutionary relationship. It sits below the larger levels of protein superfamily and protein family.

SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins. These proteins are usually ones that are poorly studied or predicted based on genomic sequence data. These predictions are often driven by data-intensive computational procedures. Information may come from nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein–protein interactions. Protein function is a broad term: the roles of proteins range from catalysis of biochemical reactions to transport to signal transduction, and a single protein may play a role in multiple processes or cellular pathways.

DNA annotation: The process of describing the structure and function of a genome

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

Cyrus Chothia: English biochemist (1942–2019)

Cyrus Homi Chothia was an English biochemist who was an emeritus scientist at the Medical Research Council (MRC) Laboratory of Molecular Biology (LMB) at the University of Cambridge and emeritus fellow of Wolfson College, Cambridge.

A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred. Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence similarity is evident. Sequence homology can then be deduced even if not apparent. Superfamilies typically contain several protein families which show sequence similarity within each family. The term protein clan is commonly used for protease and glycosyl hydrolase superfamilies based on the MEROPS and CAZy classification systems.

PA clan of proteases

The PA clan is the largest group of proteases with common ancestry as identified by structural homology. Members have a chymotrypsin-like fold and similar proteolysis mechanisms but can have sequence identity below 10%. The clan contains both cysteine and serine proteases. PA clan proteases can be found in plants, animals, fungi, eubacteria, archaea and viruses.

References

  1. "What are protein families? Protein classification". EMBL-EBI. Retrieved 2023-11-14.
  2. Orengo, Christine; Bateman, Alex (2013). "Introduction". In Orengo, Christine; Bateman, Alex (eds.). Protein Families: Relating Protein Sequence, Structure, and Function. Hoboken, New Jersey: John Wiley & Sons, Inc. pp. vii–xi. doi:10.1002/9781118743089.fmatter. ISBN 9781118743089.
  3. Pearson, William R. (2013). "An Introduction to Sequence Similarity ("Homology") Searching". Current Protocols in Bioinformatics. 3: 3.1.1–3.1.8. doi:10.1002/0471250953.bi0301s42. ISSN 1934-3396. PMC 3820096. PMID 23749753.
  4. Chen, Junjie; Guo, Mingyue; Wang, Xiaolong; Liu, Bin (2018-03-01). "A comprehensive review and comparison of different computational methods for protein remote homology detection". Briefings in Bioinformatics. 19 (2): 231–244. doi:10.1093/bib/bbw108. ISSN 1477-4054. PMID 27881430.
  5. Kunin, Victor; Cases, Ildefonso; Enright, Anton J.; de Lorenzo, Victor; Ouzounis, Christos A. (2003). "Myriads of protein families, and still counting". Genome Biology. 4 (2): 401. doi:10.1186/gb-2003-4-2-401. ISSN 1474-760X. PMC 151299. PMID 12620116.
  6. Dayhoff, MO (December 1974). "Computer analysis of protein sequences". Federation Proceedings. 33 (12): 2314–6. PMID 4435228.
  7. Dayhoff, MO; McLaughlin, PJ; Barker, WC; Hunt, LT (1975). "Evolution of sequences within protein superfamilies". Die Naturwissenschaften. 62 (4): 154–161. Bibcode:1975NW.....62..154D. doi:10.1007/BF00608697. S2CID 40304076.
  8. Dayhoff, MO (August 1976). "The origin and evolution of protein superfamilies". Federation Proceedings. 35 (10): 2132–8. PMID 181273.
  9. Orengo, Christine A.; Thornton, Janet M. (2005-06-01). "Protein Families and Their Evolution—A Structural Perspective". Annual Review of Biochemistry. 74 (1): 867–900. doi:10.1146/annurev.biochem.74.082803.133029. ISSN 0066-4154. PMID 15954844.
  10. Veeramachaneni, Vamsi; Makałowski, Wojciech (2004). "Visualizing Sequence Similarity of Protein Families". Genome Research. 14 (6): 1160–1169. doi:10.1101/gr.2079204. ISSN 1088-9051. PMC 419794. PMID 15140831.
  11. Holm, Liisa; Heger, Andreas (2013). "Automated Sequence-Based Approaches for Identifying Domain Families". In Orengo, Christine; Bateman, Alex (eds.). Protein Families: Relating Protein Sequence, Structure, and Function. Hoboken, New Jersey: John Wiley & Sons, Inc. pp. 1–24. doi:10.1002/9781118743089.ch1. ISBN 9781118743089. S2CID 85641264.
  12. Wang, Yan; Zhang, Hang; Zhong, Haolin; Xue, Zhidong (2021-01-01). "Protein domain identification methods and online resources". Computational and Structural Biotechnology Journal. 19: 1145–1153. doi:10.1016/j.csbj.2021.01.041. ISSN 2001-0370. PMC 7895673. PMID 33680357.
  13. Bateman, Alex (2013). "Sequence Classification of Protein Families: Pfam and other Resources". In Orengo, Christine; Bateman, Alex (eds.). Protein Families: Relating Protein Sequence, Structure, and Function. Hoboken, New Jersey: John Wiley & Sons, Inc. pp. 25–36. doi:10.1002/9781118743089.ch2. ISBN 9781118743089.
  14. Mulder, Nicola J.; Apweiler, Rolf (2001-12-19). "Tools and resources for identifying protein families, domains and motifs". Genome Biology. 3 (1): reviews2001.1. doi:10.1186/gb-2001-3-1-reviews2001. ISSN 1474-760X. PMC 150457. PMID 11806833.
  15. James, Jennifer E; Nelson, Paul G; Masel, Joanna (4 April 2023). "Differential Retention of Pfam Domains Contributes to Long-term Evolutionary Trends". Molecular Biology and Evolution. 40 (4): msad073. doi:10.1093/molbev/msad073. PMC 10089649. PMID 36947137.
  16. Gerlt, John A.; Allen, Karen N.; Almo, Steven C.; Armstrong, Richard N.; Babbitt, Patricia C.; Cronan, John E.; Dunaway-Mariano, Debra; Imker, Heidi J.; Jacobson, Matthew P.; Minor, Wladek; Poulter, C. Dale; Raushel, Frank M.; Sali, Andrej; Shoichet, Brian K.; Sweedler, Jonathan V. (2011-11-22). "The Enzyme Function Initiative". Biochemistry. 50 (46): 9950–9962. doi:10.1021/bi201312u. ISSN 0006-2960. PMC 3238057. PMID 21999478.
  17. Gandhimathi, A.; Nair, Anu G.; Sowdhamini, R. (2012). "PASS2 version 4: An update to the database of structure-based sequence alignments of structural domain superfamilies". Nucleic Acids Research. 40 (D1): D531–D534. doi:10.1093/nar/gkr1096. ISSN 1362-4962. PMC 3245109. PMID 22123743.
  18. Emms, David M.; Kelly, Steven (2015-08-06). "OrthoFinder: Solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy". Genome Biology. 16 (1): 157. doi:10.1186/s13059-015-0721-2. ISSN 1474-760X. PMC 4531804. PMID 26243257.
  19. Emms, David M.; Kelly, Steven (2019-11-14). "OrthoFinder: Phylogenetic orthology inference for comparative genomics". Genome Biology. 20 (1): 238. doi:10.1186/s13059-019-1832-y. ISSN 1474-760X. PMC 6857279. PMID 31727128.