Protein subfamily

A protein subfamily is a level of protein classification that groups proteins by close evolutionary relationship. It sits below the larger levels of protein superfamily and protein family. [1]

Proteins typically share greater sequence and functional similarity with other members of their subfamily than with members of the wider family. [1] [2] For example, in the SCOP classification system, members of a subfamily share the same interaction interfaces and interaction partners. [3] These criteria are stricter than those for a family, whose members have similar structures but may be more distantly related and so have different interfaces. Subfamilies are assigned by a variety of methods, including sequence similarity, [4] motifs linked to function, [5] and phylogenetic clade. [6] [7] There is no exact and consistent distinction between a subfamily and a family: the same group of proteins may be described as either, depending on the context.
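As a rough illustration of assignment by sequence similarity, the sketch below groups the members of a family into subfamilies by single-linkage clustering on pairwise identity. It is a toy under stated assumptions, not the method of any cited tool: the sequences are taken to be pre-aligned and of equal length, the 0.7 identity threshold is arbitrary, and the protein names and sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def assign_subfamilies(aligned: dict[str, str], threshold: float = 0.7) -> list[set[str]]:
    """Single-linkage clustering: link any two proteins whose identity meets the threshold."""
    parent = {name: name for name in aligned}

    def find(name: str) -> str:
        # Union-find lookup with path compression.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(aligned, 2):
        if percent_identity(aligned[a], aligned[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in aligned:
        groups.setdefault(find(name), set()).add(name)
    # Sort the clusters so the output order is deterministic.
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical aligned family members (illustrative sequences, not real proteins).
family = {
    "protA1": "MKTAYIAKQR",
    "protA2": "MKTAYIAKQK",
    "protB1": "MRSGYLVNQR",
    "protB2": "MRSGYLVNHR",
}

subfamilies = assign_subfamilies(family, threshold=0.7)
```

Lowering the threshold merges the clusters into one group and raising it splits every protein into its own group, which mirrors the point that the boundary between a family and a subfamily depends on the criteria chosen.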

  1. "What are protein families?". EMBL-EBI Train online. 2011-11-18. Retrieved 2018-03-08.
  2. Das, Sayoni; Orengo, Christine A. (2016). "Protein function annotation using protein domain family resources" (PDF). Methods. 93: 24–34. doi:10.1016/j.ymeth.2015.09.029. PMID 26434392.
  3. Rausell, Antonio; Juan, David; Pazos, Florencio; Valencia, Alfonso (2010-02-02). "Protein interactions and ligand binding: From protein subfamilies to functional specificity". Proceedings of the National Academy of Sciences. 107 (5): 1995–2000. doi:10.1073/pnas.0908044107. PMC 2808218. PMID 20133844.
  4. Brown, Duncan P.; Krishnamurthy, Nandini; Sjölander, Kimmen (2007-08-17). "Automated Protein Subfamily Identification and Classification". PLOS Computational Biology. 3 (8): e160. doi:10.1371/journal.pcbi.0030160. ISSN 1553-7358. PMC 1950344. PMID 17708678.
  5. Eisen, Jonathan A.; Sweder, Kevin S.; Hanawalt, Philip C. (1995-07-25). "Evolution of the SNF2 family of proteins: subfamilies with distinct sequences and functions". Nucleic Acids Research. 23 (14): 2715–2723. doi:10.1093/nar/23.14.2715. ISSN 0305-1048. PMC 307096. PMID 7651832.
  6. Wicker, Nicolas; Perrin, Guy René; Thierry, Jean Claude; Poch, Olivier (2001-08-01). "Secator: A Program for Inferring Protein Subfamilies from Phylogenetic Trees". Molecular Biology and Evolution. 18 (8): 1435–1441. doi:10.1093/oxfordjournals.molbev.a003929. ISSN 0737-4038.
  7. Mi, Huaiyu; Poudel, Sagar; Muruganujan, Anushya; Casagrande, John T.; Thomas, Paul D. (2016-01-04). "PANTHER version 10: expanded protein families and functions, and analysis tools". Nucleic Acids Research. 44 (D1): D336–D342. doi:10.1093/nar/gkv1194. ISSN 0305-1048. PMC 4702852. PMID 26578592.