Superfamily database

SUPERFAMILY
Content
Description The SUPERFAMILY database provides structural and functional annotation for all proteins and genomes.
Data types captured Protein families, genome annotation, alignments, hidden Markov models (HMMs)
Organisms all
Contact
Research center University of Bristol
Laboratory
Primary citation PMID   19036790
Access
Data format FASTA format
Website supfam.org
Download URL supfam.org/SUPERFAMILY/downloads.html
Miscellaneous
License GNU General Public License
Version 1.75

SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. [1] [2] [3] [4] [5] [6] [7] It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. [8] [9] Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. [8] [10] Superfamilies are groups of proteins that have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology. [11]

Annotations

The SUPERFAMILY annotation is based on a collection of hidden Markov models (HMMs), which represent structural protein domains at the SCOP superfamily level. [12] [13] A superfamily groups together domains that have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.

For each protein you can:

For each genome you can:

For each superfamily you can:

All annotation, models and the database dump are freely available for download to everyone.

Features

Sequence Search

Submit a protein or DNA sequence for SCOP superfamily- and family-level classification using the SUPERFAMILY HMMs. Sequences can be submitted either as raw input or by uploading a file, but all must be in FASTA format. Sequences can be amino acids, a fixed-frame nucleotide sequence, or all frames of a submitted nucleotide sequence. Up to 1000 sequences can be run at a time.
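As an illustration of preparing input for such a search, the following is a minimal sketch that parses FASTA text and splits it into batches no larger than the 1000-sequence limit mentioned above. The function names `read_fasta` and `batches` are hypothetical helpers written for this example; they are not part of any SUPERFAMILY tool, and the sketch does not submit anything to the site.

```python
from typing import Iterator, List, Tuple

MAX_BATCH = 1000  # per-run limit stated in the text above

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def batches(records: List[Record], size: int = MAX_BATCH) -> Iterator[List[Record]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]
```

Each batch produced this way could then be written back out in FASTA format and submitted as a separate run.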

Keyword Search

Search the database using a superfamily, family, or species name, or a sequence, SCOP, PDB, or HMM ID. A successful search yields the classes, folds, superfamilies, families, and individual proteins matching the query.

Domain Assignments

The database has domain assignments, alignments, and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.

Comparative Genomics Tools

Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks, and domain distribution across taxonomic kingdoms for each organism.
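To make the idea of adjacent domain pairs concrete, here is a small sketch under stated assumptions: each protein's domain architecture is represented as a list of superfamily identifiers in N- to C-terminal order, and an adjacent pair is any two consecutive entries in that list. The identifiers (`sfA`, `sfB`, ...) and function names are placeholders invented for this example, and the way SUPERFAMILY itself computes these lists may differ.

```python
from collections import Counter
from typing import Dict, List, Tuple

Architecture = List[str]  # superfamily identifiers, N- to C-terminal order


def adjacent_pairs(architecture: Architecture) -> List[Tuple[str, str]]:
    """Return consecutive (left, right) domain pairs of one architecture."""
    return list(zip(architecture, architecture[1:]))


def pair_counts(genome: Dict[str, Architecture]) -> Counter:
    """Count how often each adjacent domain pair occurs across a genome."""
    counts: Counter = Counter()
    for architecture in genome.values():
        counts.update(adjacent_pairs(architecture))
    return counts


# Toy genome with placeholder protein and superfamily names.
genome = {
    "protein1": ["sfA", "sfB", "sfC"],
    "protein2": ["sfA", "sfB"],
    "protein3": ["sfC"],
}
counts = pair_counts(genome)
distinct_pairs = len(counts)  # number of different adjacent pairs
```

Single-domain proteins contribute no pairs, so in this toy genome only `protein1` and `protein2` affect the counts.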

Genome Statistics

For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs, and number of unique domain architectures.
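Several of these statistics are simple ratios, which the following sketch illustrates. It assumes that each protein carries a length and a list of non-overlapping domain assignments with 1-based inclusive coordinates; the `Domain` and `Protein` classes, the field names, and the exact definitions (for example, coverage as matched residues over total residues, and average superfamily size as domains per superfamily) are assumptions made for this example rather than the database's documented formulas.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Domain:
    superfamily: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


@dataclass
class Protein:
    length: int
    domains: List[Domain] = field(default_factory=list)


def genome_stats(proteins: List[Protein]) -> Dict[str, float]:
    """Compute a subset of the per-genome statistics listed above."""
    n = len(proteins)
    n_assigned = sum(1 for p in proteins if p.domains)
    total_len = sum(p.length for p in proteins)
    # Assumes domains within a protein do not overlap.
    matched = sum(d.end - d.start + 1 for p in proteins for d in p.domains)
    n_domains = sum(len(p.domains) for p in proteins)
    superfamilies = {d.superfamily for p in proteins for d in p.domains}
    return {
        "sequences": n,
        "sequences_with_assignment": n_assigned,
        "pct_sequences_with_assignment": 100.0 * n_assigned / n if n else 0.0,
        "pct_sequence_coverage": 100.0 * matched / total_len if total_len else 0.0,
        "domains_assigned": n_domains,
        "superfamilies_assigned": len(superfamilies),
        "avg_superfamily_size": n_domains / len(superfamilies) if superfamilies else 0.0,
        "avg_sequence_length": total_len / n if n else 0.0,
    }
```

For instance, a toy genome of four proteins totalling 500 residues, two of which carry assignments covering 150 residues, would give 50% of sequences with assignment and 30% total sequence coverage under these definitions.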

Gene Ontology

Automatically generated, domain-centric Gene Ontology (GO) annotation.

Due to the growing gap between sequenced proteins and known functions of proteins, it is becoming increasingly important to develop more automated methods for functionally annotating proteins, especially proteins with known domains. SUPERFAMILY uses protein-level GO annotations taken from the Gene Ontology Annotation (GOA) project, which offers high-quality GO annotations directly associated with proteins in UniProtKB over a wide spectrum of species. [15] SUPERFAMILY has generated GO annotations for evolutionarily close domains (at the SCOP family level) and distant domains (at the SCOP superfamily level).

Phenotype Ontology

Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, and Arabidopsis Plant.

Superfamily Annotation

InterPro abstracts for over 1,000 superfamilies, and Gene Ontology (GO) annotation for over 700 superfamilies. This feature allows for the direct annotation of key features, functions, and structures of a superfamily.

Functional Annotation

Functional annotation of SCOP 1.73 superfamilies.

The SUPERFAMILY database uses a scheme of 50 detailed function categories which map to 7 general function categories, similar to the scheme used in the COG database. [16] A general function assigned to a superfamily was used to reflect the major function for that superfamily. The general categories of function are:

  1. Information: storage and maintenance of the genetic code; DNA replication and repair; general transcription and translation.
  2. Regulation: regulation of gene expression and protein activity; information processing in response to environmental input; signal transduction; general regulatory or receptor activity.
  3. Metabolism: anabolic and catabolic processes; cell maintenance and homeostasis; secondary metabolism.
  4. Intra-cellular processes: cell motility and division; cell death; intra-cellular transport; secretion.
  5. Extra-cellular processes: inter- and extra-cellular processes such as cell adhesion; organismal processes such as blood clotting or the immune system.
  6. General: general and multiple functions; interactions with proteins, lipids, small molecules, and ions.
  7. Other/Unknown: an unknown function, viral proteins, or toxins.

Each domain superfamily in SCOP classes a to g was manually annotated using this scheme, [17] [18] [19] drawing on information provided by SCOP, [10] InterPro, [20] [21] Pfam, [22] Swiss-Prot, [23] and various literature sources.
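The two-level scheme described above can be pictured as a lookup from detailed categories to the seven general ones. The sketch below is illustrative only: the seven general categories come from the list above, but the detailed-category labels in `DETAILED_TO_GENERAL` are hypothetical stand-ins chosen from the descriptions (the actual 50 detailed categories are not listed here), and the fallback to Other/Unknown for unmapped labels is a design choice of this example.

```python
from enum import Enum


class GeneralFunction(Enum):
    INFORMATION = "Information"
    REGULATION = "Regulation"
    METABOLISM = "Metabolism"
    INTRA_CELLULAR = "Intra-cellular processes"
    EXTRA_CELLULAR = "Extra-cellular processes"
    GENERAL = "General"
    OTHER_UNKNOWN = "Other/Unknown"


# Hypothetical detailed labels, one per general category, for illustration.
DETAILED_TO_GENERAL = {
    "DNA replication/repair": GeneralFunction.INFORMATION,
    "Signal transduction": GeneralFunction.REGULATION,
    "Secondary metabolism": GeneralFunction.METABOLISM,
    "Cell motility": GeneralFunction.INTRA_CELLULAR,
    "Cell adhesion": GeneralFunction.EXTRA_CELLULAR,
    "Ion binding": GeneralFunction.GENERAL,
    "Toxins": GeneralFunction.OTHER_UNKNOWN,
}


def general_function(detailed: str) -> GeneralFunction:
    """Map a detailed category to its general category.

    Unmapped labels fall back to Other/Unknown (an assumption of this sketch).
    """
    return DETAILED_TO_GENERAL.get(detailed, GeneralFunction.OTHER_UNKNOWN)
```

A superfamily annotated with the detailed label "Cell adhesion" would, under this toy mapping, be reported under the general category "Extra-cellular processes".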

Phylogenetic Trees

Create custom phylogenetic trees by selecting 3 or more available genomes on the SUPERFAMILY site. Trees are generated using heuristic parsimony methods, and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations, or specific clades, can be displayed as individual trees.

Similar Domain Architectures

This feature allows the user to find the 10 domain architectures which are most similar to the domain architecture of interest.
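The similarity measure used by SUPERFAMILY is not described here, so the following is only a sketch of the general idea: rank candidate architectures against a query and keep the top 10. As a stand-in measure it uses the ratio from Python's standard `difflib.SequenceMatcher` applied to lists of superfamily identifiers; that choice, the function name `most_similar`, and the placeholder identifiers are all assumptions of this example.

```python
import heapq
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Architecture = List[str]  # superfamily identifiers, N- to C-terminal order


def similarity(a: Architecture, b: Architecture) -> float:
    """Stand-in similarity in [0, 1]; not SUPERFAMILY's own measure."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def most_similar(
    query: Architecture,
    candidates: Dict[str, Architecture],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Return up to k (name, score) pairs, best score first."""
    scored = ((name, similarity(query, arch)) for name, arch in candidates.items())
    return heapq.nlargest(k, scored, key=lambda item: item[1])


# Toy example with placeholder names.
candidates = {
    "archX": ["sfA", "sfB", "sfC"],
    "archY": ["sfA", "sfB"],
    "archZ": ["sfD"],
}
top = most_similar(["sfA", "sfB", "sfC"], candidates, k=2)
```

Any other sequence-aware measure, such as an edit distance over domain lists, could be substituted for `similarity` without changing the ranking logic.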

Hidden Markov Models

Produce SCOP domain assignments for a sequence using the SUPERFAMILY hidden Markov models.

Profile Comparison

Find remote domain matches when the HMM search fails to find a significant match. Profile Comparer (PRC), [24] a program for aligning and scoring two profile HMMs, is used.

Web Services

Distributed Annotation Server and linking to SUPERFAMILY.

Downloads

Sequences, assignments, models, MySQL database, and scripts - updated weekly.

Use in Research

The SUPERFAMILY database has numerous research applications and has been used by many research groups for various studies. It can serve either as a source of proteins that the user wishes to examine with other methods, or as a means of assigning a function and structure to a novel or uncharacterized protein. One study found SUPERFAMILY to be very adept at correctly assigning an appropriate function and structure to a large number of domains of unknown function by comparing them to the database's hidden Markov models. [25] Another study used SUPERFAMILY to generate a data set of 1,733 fold superfamily (FSF) domains for a comparison of proteomes and functionomes aimed at identifying the origin of cellular diversification. [26]

References

  1. Wilson, D; Pethica, R; Zhou, Y; Talbot, C; Vogel, C; Madera, M; Chothia, C; Gough, J (January 2009). "SUPERFAMILY--sophisticated comparative genomics, data mining, visualization and phylogeny". Nucleic Acids Research . 37 (Database issue): D380-6. doi:10.1093/NAR/GKN762. ISSN   0305-1048. PMC   2686452 . PMID   19036790. Wikidata   Q26781958.
  2. Madera, Martin; Vogel, Christine; Kummerfeld, Sarah K.; Chothia, Cyrus; Gough, Julian (2004-01-01). "The SUPERFAMILY database in 2004: additions and improvements". Nucleic Acids Research. 32 (suppl 1): D235–D239. doi:10.1093/nar/gkh117. ISSN   0305-1048. PMC   308851 . PMID   14681402.
  3. Wilson, D.; Madera, M.; Vogel, C.; Chothia, C.; Gough, J. (2007). "The SUPERFAMILY database in 2007: Families and functions". Nucleic Acids Research. 35 (Database issue): D308–D313. doi:10.1093/nar/gkl910. PMC   1669749 . PMID   17098927.
  4. Gough, J. (2002). "The SUPERFAMILY database in structural genomics". Acta Crystallographica Section D. 58 (Pt 11): 1897–1900. doi: 10.1107/s0907444902015160 . PMID   12393919.
  5. Gough, J.; Chothia, C. (2002). "SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments". Nucleic Acids Research. 30 (1): 268–272. doi:10.1093/nar/30.1.268. PMC   99153 . PMID   11752312.
  6. De Lima Morais, D. A.; Fang, H.; Rackham, O. J. L.; Wilson, D.; Pethica, R.; Chothia, C.; Gough, J. (2010). "SUPERFAMILY 1.75 including a domain-centric gene ontology method". Nucleic Acids Research. 39 (Database issue): D427–D434. doi:10.1093/nar/gkq1130. PMC   3013712 . PMID   21062816.
  7. Oates, M. E.; Stahlhacke, J; Vavoulis, D. V.; Smithers, B; Rackham, O. J.; Sardar, A. J.; Zaucha, J; Thurlby, N; Fang, H; Gough, J (2015). "The SUPERFAMILY 1.75 database in 2014: A doubling of data". Nucleic Acids Research. 43 (Database issue): D227–33. doi:10.1093/nar/gku1041. PMC   4383889 . PMID   25414345.
  8. Hubbard, T. J.; Ailey, B.; Brenner, S. E.; Murzin, A. G.; Chothia, C. (1999). "SCOP: A Structural Classification of Proteins database". Nucleic Acids Research. 27 (1): 254–256. doi:10.1093/nar/27.1.254. PMC   148149 . PMID   9847194.
  9. Lo Conte, L.; Ailey, B.; Hubbard, T. J.; Brenner, S. E.; Murzin, A. G.; Chothia, C. (2000). "SCOP: A Structural Classification of Proteins database". Nucleic Acids Research. 28 (1): 257–259. doi:10.1093/nar/28.1.257. PMC   102479 . PMID   10592240.
  10. Andreeva, Antonina; Howorth, Dave; Brenner, Steven E.; Hubbard, Tim J. P.; Chothia, Cyrus; Murzin, Alexey G. (2004-01-01). "SCOP database in 2004: refinements integrate structure and sequence family data". Nucleic Acids Research. 32 (Database issue): D226–D229. doi:10.1093/nar/gkh039. ISSN   0305-1048. PMC   308773 . PMID   14681400.
  11. Dayhoff, M. O.; McLaughlin, P. J.; Barker, W. C.; Hunt, L. T. (1975-04-01). "Evolution of sequences within protein superfamilies". Naturwissenschaften. 62 (4): 154–161. Bibcode:1975NW.....62..154D. doi:10.1007/BF00608697. ISSN   0028-1042. S2CID   40304076.
  12. Gough, J.; Karplus, K.; Hughey, R.; Chothia, C. (2001). "Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure". Journal of Molecular Biology. 313 (4): 903–919. CiteSeerX   10.1.1.144.6577 . doi:10.1006/jmbi.2001.5080. PMID   11697912.
  13. Karplus, K.; Barrett, C.; Hughey, R. (1998-01-01). "Hidden Markov models for detecting remote protein homologies". Bioinformatics. 14 (10): 846–856. doi: 10.1093/bioinformatics/14.10.846 . ISSN   1367-4803. PMID   9927713.
  14. Botstein, D.; Cherry, J. M.; Ashburner, M.; Ball, C. A.; Blake, J. A.; Butler, H.; Davis, A. P.; Dolinski, K.; Dwight, S. S.; Eppig, J. T.; Harris, M. A.; Hill, D. P.; Issel-Tarver, L.; Kasarskis, A.; Lewis, S.; Matese, J. C.; Richardson, J. E.; Ringwald, M.; Rubin, G. M.; Sherlock, G. (2000). "Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium". Nature Genetics. 25 (1): 25–29. doi:10.1038/75556. PMC   3037419 . PMID   10802651.
  15. Barrell, Daniel; Dimmer, Emily; Huntley, Rachael P.; Binns, David; O’Donovan, Claire; Apweiler, Rolf (2009-01-01). "The GOA database in 2009—an integrated Gene Ontology Annotation resource". Nucleic Acids Research. 37 (suppl 1): D396–D403. doi:10.1093/nar/gkn803. ISSN   0305-1048. PMC   2686469 . PMID   18957448.
  16. Tatusov, Roman L; Fedorova, Natalie D; Jackson, John D; Jacobs, Aviva R; Kiryutin, Boris; Koonin, Eugene V; Krylov, Dmitri M; Mazumder, Raja; Mekhedov, Sergei L (2003-09-11). "The COG database: an updated version includes eukaryotes". BMC Bioinformatics. 4: 41. doi: 10.1186/1471-2105-4-41 . ISSN   1471-2105. PMC   222959 . PMID   12969510.
  17. Vogel, Christine; Berzuini, Carlo; Bashton, Matthew; Gough, Julian; Teichmann, Sarah A. (2004-02-20). "Supra-domains: evolutionary units larger than single protein domains". Journal of Molecular Biology. 336 (3): 809–823. CiteSeerX   10.1.1.116.6568 . doi:10.1016/j.jmb.2003.12.026. ISSN   0022-2836. PMID   15095989.
  18. Vogel, Christine; Teichmann, Sarah A.; Pereira-Leal, Jose (2005-02-11). "The relationship between domain duplication and recombination". Journal of Molecular Biology. 346 (1): 355–365. doi:10.1016/j.jmb.2004.11.050. ISSN   0022-2836. PMID   15663950.
  19. Vogel, Christine; Chothia, Cyrus (2006-05-01). "Protein Family Expansions and Biological Complexity". PLOS Computational Biology. 2 (5): e48. Bibcode:2006PLSCB...2...48V. doi: 10.1371/journal.pcbi.0020048 . ISSN   1553-734X. PMC   1464810 . PMID   16733546.
  20. Mulder, Nicola J.; Apweiler, Rolf; Attwood, Teresa K.; Bairoch, Amos; Barrell, Daniel; Bateman, Alex; Binns, David; Biswas, Margaret; Bradley, Paul (2003-01-01). "The InterPro Database, 2003 brings increased coverage and new features". Nucleic Acids Research. 31 (1): 315–318. doi:10.1093/nar/gkg046. ISSN   0305-1048. PMC   165493 . PMID   12520011.
  21. Mulder, Nicola J.; Apweiler, Rolf; Attwood, Teresa K.; Bairoch, Amos; Bateman, Alex; Binns, David; Bradley, Paul; Bork, Peer; Bucher, Phillip (2005-01-01). "InterPro, progress and status in 2005". Nucleic Acids Research. 33 (Database Issue): D201–D205. doi:10.1093/nar/gki106. ISSN   0305-1048. PMC   540060 . PMID   15608177.
  22. Finn, Robert D.; Mistry, Jaina; Schuster-Böckler, Benjamin; Griffiths-Jones, Sam; Hollich, Volker; Lassmann, Timo; Moxon, Simon; Marshall, Mhairi; Khanna, Ajay (2006-01-01). "Pfam: clans, web tools and services". Nucleic Acids Research. 34 (Database issue): D247–D251. doi:10.1093/nar/gkj149. ISSN   0305-1048. PMC   1347511 . PMID   16381856.
  23. Boeckmann, Brigitte; Blatter, Marie-Claude; Famiglietti, Livia; Hinz, Ursula; Lane, Lydie; Roechert, Bernd; Bairoch, Amos (2005-11-01). "Protein variety and functional diversity: Swiss-Prot annotation in its biological context". Comptes Rendus Biologies. 328 (10–11): 882–899. doi:10.1016/j.crvi.2005.06.001. ISSN   1631-0691. PMID   16286078.
  24. Madera, Martin (2008-11-15). "Profile Comparer: a program for scoring and aligning profile hidden Markov models". Bioinformatics. 24 (22): 2630–2631. doi:10.1093/bioinformatics/btn504. ISSN   1367-4803. PMC   2579712 . PMID   18845584.
  25. Mudgal, Richa; Sandhya, Sankaran; Chandra, Nagasuma; Srinivasan, Narayanaswamy (2015-07-31). "De-DUFing the DUFs: Deciphering distant evolutionary relationships of Domains of Unknown Function using sensitive homology detection methods". Biology Direct. 10 (1): 38. doi: 10.1186/s13062-015-0069-2 . PMC   4520260 . PMID   26228684.
  26. Nasir, Arshan; Caetano-Anollés, Gustavo (2013). "Comparative Analysis of Proteomes and Functionomes Provides Insights into Origins of Cellular Diversification". Archaea. 2013: 648746. doi: 10.1155/2013/648746 . PMC   3892558 . PMID   24492748.