InterPro

Content
Description: InterPro functionally analyzes protein sequences and classifies them into protein families while predicting the presence of domains and functional sites.
Contact
Research center: EMBL
Laboratory: European Bioinformatics Institute
Primary citation: The InterPro protein families and domains database: 20 years on [1]
Release date: 1999
Access
Website: www.ebi.ac.uk/interpro/
Download URL: ftp.ebi.ac.uk/pub/databases/interpro/
Miscellaneous
Data release frequency: 8-weekly
Version: 97.0 (9 November 2023)

InterPro is a database of protein families, protein domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences [2] in order to functionally characterise them. [3] [4]

Contents

The contents of InterPro consist of diagnostic signatures and the proteins that they significantly match. The signatures consist of models (simple types, such as regular expressions, or more complex ones, such as hidden Markov models) that describe protein families, domains or sites. Models are built from the amino acid sequences of known families or domains and are subsequently used to search unknown sequences (such as those arising from novel genome sequencing) in order to classify them. Each member database of InterPro occupies a different niche, from very high-level, structure-based classifications (SUPERFAMILY and CATH-Gene3D) through to quite specific sub-family classifications (PRINTS and PANTHER).
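As an illustration of the simplest signature type mentioned above, the following sketch translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function names and the test sequence are invented for this example, and only a subset of the pattern syntax is handled (residues, `x`, `[...]`, `{...}`, repetition counts and end anchors); it is not the code that any member database actually runs.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Supported subset (an assumption of this sketch): elements separated by
    '-', 'x' for any residue, [ABC] for allowed residues, {ABC} for
    forbidden residues, (n) or (n,m) repetition, '<' and '>' end anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")

        # Optional repetition suffix, e.g. x(2) or x(2,4).
        repeat = ""
        m = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if m:
            element, repeat = m.group(1), "{" + m.group(2) + "}"

        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # forbidden residues
        else:
            core = element                      # single residue or [ABC]

        parts.append(("^" if anchor_start else "") + core + repeat
                     + ("$" if anchor_end else ""))
    return "".join(parts)


def scan(sequence: str, pattern: str):
    """Return (start, end, matched text) for every hit, using 1-based,
    inclusive coordinates. A lookahead is used so overlapping hits are kept."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(1) + 1, m.end(1), m.group(1))
            for m in regex.finditer(sequence)]


# Invented test sequence; the pattern is written in PROSITE pattern syntax.
hits = scan("MKNGSAANPTQNVTL", "N-{P}-[ST]-{P}")
print(hits)  # [(3, 6, 'NGSA'), (12, 15, 'NVTL')]
```

The candidate site starting at position 8 is rejected because the residue after the asparagine is a proline, which the `{P}` element forbids. More complex signatures such as hidden Markov models need dedicated software and are not shown here.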

InterPro's intention is to provide a one-stop shop for protein classification, where all the signatures produced by the different member databases are placed into entries within the InterPro database. Signatures which represent equivalent domains, sites or families are put into the same entry, and entries can also be related to one another. Additional information such as a description, consistent names and Gene Ontology (GO) terms is associated with each entry, where possible.

Data contained in InterPro

InterPro contains three main entities: proteins, signatures (also referred to as "methods" or "models") and entries. The proteins in UniProtKB are also the central protein entities in InterPro. Information regarding which signatures significantly match these proteins is calculated as the sequences are released by UniProtKB, and these results are made available to the public (see below). The matches of signatures to proteins determine how signatures are integrated together into InterPro entries: comparative overlap of matched protein sets and the location of the signatures' matches on the sequences are used as indicators of relatedness. Only signatures deemed to be of sufficient quality are integrated into InterPro. As of version 81.0 (released 21 August 2020), InterPro entries annotated 73.9% of residues found in UniProtKB, with another 9.2% annotated by signatures that are pending integration. [5]

Figure: The coverage of UniProtKB residues by InterPro entries as of InterPro version 81.0 (August 2020).
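The two indicators of relatedness described above (overlap of matched protein sets, and the location of matches on the sequences) can be illustrated with a small sketch. The measures chosen here (Jaccard index and fractional residue overlap), the function names and the accessions are assumptions made for illustration; the text does not state which formulas or thresholds InterPro curators actually apply.

```python
def jaccard(proteins_a: set, proteins_b: set) -> float:
    """Shared proteins divided by all proteins matched by either signature."""
    union = proteins_a | proteins_b
    return len(proteins_a & proteins_b) / len(union) if union else 0.0


def residue_overlap(match_a: tuple, match_b: tuple) -> float:
    """Fraction of the shorter match covered by the other match.

    Matches are (start, end) pairs on the same protein, 1-based inclusive.
    """
    start = max(match_a[0], match_b[0])
    end = min(match_a[1], match_b[1])
    shared = max(0, end - start + 1)
    shorter = min(match_a[1] - match_a[0], match_b[1] - match_b[0]) + 1
    return shared / shorter


# Hypothetical protein sets matched by two signatures.
sig_a = {"P00001", "P00002", "P00003"}
sig_b = {"P00002", "P00003", "P00004"}

print(jaccard(sig_a, sig_b))                 # 0.5
print(residue_overlap((10, 50), (30, 70)))   # 21 shared residues out of 41
```

Two signatures that hit nearly the same proteins at nearly the same positions are candidates for the same entry, whereas a signature whose protein set is a strict subset of another's may instead point to a parent/child relationship between entries.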

InterPro also includes data for splice variants and the proteins contained in the UniParc and UniMES databases.

InterPro consortium member databases

The signatures from InterPro come from 13 "member databases", which are listed below.

CATH-Gene3D
CATH-Gene3D describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. Functional annotation is provided to proteins from multiple resources. Functional prediction and analysis of domain architectures is available from the Gene3D website.
CDD
The Conserved Domain Database (CDD) is a protein annotation resource that consists of a collection of annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST.
HAMAP
HAMAP stands for High-quality Automated and Manual Annotation of microbial Proteomes. HAMAP profiles are manually created by expert curators; they identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded (i.e. chloroplasts, cyanelles, apicoplasts, non-photosynthetic plastids) protein families or subfamilies.
MobiDB
MobiDB is a database annotating intrinsic disorder in proteins.
PANTHER
PANTHER is a large collection of protein families that have been subdivided into functionally related subfamilies, using human expertise. These subfamilies model the divergence of specific functions within protein families, allowing more accurate association with function (human-curated molecular function and biological process classifications and pathway diagrams), as well as inference of amino acids important for functional specificity. Hidden Markov models (HMMs) are built for each family and subfamily for classifying additional protein sequences.
Pfam
Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.
Figure: The 13 member databases of the InterPro consortium grouped by their signature construction method and the biological entity they focus on.
PIRSF
PIRSF is a protein classification system organised as a network with multiple levels of sequence diversity, from superfamilies to subfamilies, that reflects the evolutionary relationship of full-length proteins and domains. The primary PIRSF classification unit is the homeomorphic family, whose members are both homologous (evolved from a common ancestor) and homeomorphic (sharing full-length sequence similarity and a common domain architecture).
PRINTS
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of UniProt. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, their full diagnostic potency deriving from the mutual context afforded by motif neighbours.
PROSITE
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.
SMART
SMART (Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 800 domain families found in signaling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues.
SUPERFAMILY
SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY has been used to carry out structural assignments to all completely sequenced genomes.
SFLD
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.
TIGRFAMs
TIGRFAMs is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. Those entries which are "equivalogs" group homologous proteins which are conserved with respect to function.
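
The fingerprint idea described under PRINTS above (several conserved motifs that occur in order along a sequence without overlapping) can be sketched as follows. This is a deliberately simplified illustration: motifs are written here as fixed-length regular expressions, every motif is required to be present, and the function name, motifs and sequence are invented. It does not reproduce how PRINTS builds or scores its motifs, which the description above does not detail.

```python
import re


def match_fingerprint(sequence: str, motifs: list):
    """Return 1-based inclusive (start, end) positions for each motif if all
    motifs occur in the given order without overlapping, otherwise None.

    Assumes each motif is a fixed-length regular expression, so taking the
    leftmost hit for each motif in turn is sufficient.
    """
    hits = []
    position = 0
    for motif in motifs:
        m = re.compile(motif).search(sequence, position)
        if m is None:
            return None          # a motif is missing or out of order
        hits.append((m.start() + 1, m.end()))
        position = m.end()       # the next motif must start after this one
    return hits


# Invented three-motif fingerprint and invented sequence.
fingerprint = ["G[KR]ST", "DEAD", "HRD"]
print(match_fingerprint("MAGKSTLLPDEADVVHRDLK", fingerprint))
# [(3, 6), (10, 13), (16, 18)]
```

Requiring several motifs in a consistent order is what gives a fingerprint more context than a single motif: a sequence that happens to contain one of the motifs by chance is unlikely to contain all of them in the expected arrangement.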

Data types

InterPro consists of seven types of data provided by different members of the consortium:

Data Types of InterPro

Data Type | Description | Contributing Databases
InterPro Entries | Structural and/or functional domains of proteins predicted using one or more signatures | All 13 member databases
Member Database signatures | Signatures from member databases. These include signatures that are integrated into InterPro, and those that are not | All 13 member databases
Protein | Protein sequences | UniProtKB (Swiss-Prot and TrEMBL)
Proteome | Collection of proteins that belong to a single organism | UniProtKB
Structure | 3-dimensional structures of proteins | PDBe
Taxonomy | Protein taxonomic information | UniProtKB
Set | Groups of evolutionarily related families | Pfam, CDD
Figure: Icons that identify the five entry types found in InterPro (Homologous Superfamily, Family, Domain, Repeat, or Site).

InterPro entry types

InterPro entries can be further broken down into five types:

  • Homologous Superfamily: A group of proteins that share a common evolutionary origin as seen in their structural similarities, even if their sequences are not highly similar. These entries are provided by only two member databases: CATH-Gene3D and SUPERFAMILY.
  • Family: A group of proteins that have a common evolutionary origin determined through structural similarities, related functions, or sequence homology.
  • Domain: A distinct unit in a protein with a particular function, structure, or sequence.
  • Repeat: A sequence of amino acids, usually no longer than 50 residues, that tends to repeat many times in a protein.
  • Site: A short sequence of amino acids where at least one amino acid is conserved. These include post-translational modification sites, conserved sites, binding sites, and active sites.

Access

The database is available for text- and sequence-based searches via a webserver, and for download via anonymous FTP. Like other EBI databases, it is in the public domain, since its content can be used "by any individual and for any purpose". [8] InterPro aims to release data to the public every 8 weeks, typically within a day of the UniProtKB release of the same proteins.

InterPro application programming interface (API)

InterPro provides an API for programmatic access to all InterPro entries and their related entries in JSON format. [9] There are six main endpoints for the API, corresponding to the different InterPro data types: entry, protein, structure, taxonomy, proteome and set.
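
A minimal sketch of calling the API from Python is shown below, using only the standard library. The six endpoint names come from the description above; the base URL, the `interpro` path segment and the accession `IPR000001` are assumptions for illustration and should be checked against the API documentation before use. The helper names `build_url` and `fetch_json` are invented for this example.

```python
import json
import urllib.request

# Assumed base URL; it is not given in the text above.
API_BASE = "https://www.ebi.ac.uk/interpro/api"

# The six main endpoints listed above.
ENDPOINTS = {"entry", "protein", "structure", "taxonomy", "proteome", "set"}


def build_url(endpoint: str, *path: str) -> str:
    """Join the base URL, one of the six endpoints and optional path parts."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    return "/".join([API_BASE, endpoint, *path])


def fetch_json(url: str, timeout: float = 30.0):
    """Request a URL and decode the JSON body of the response."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


url = build_url("entry", "interpro", "IPR000001")
print(url)

# Network call left commented out so the sketch runs offline:
# data = fetch_json(url)
# print(sorted(data.keys()))
```

Keeping URL construction separate from the network call makes it easy to validate endpoint names up front and to add paging or retry logic around `fetch_json` later.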

InterProScan

InterProScan is a software package that allows users to scan sequences against member database signatures. Users can use this signature scanning software to functionally characterize novel nucleotide or protein sequences. [10] InterProScan is frequently used in genome projects in order to obtain a "first-pass" characterisation of the genome of interest. [11] [12] As of December 2020, the public version of InterProScan (v5.x) uses a Java-based architecture. [13] The software package is currently only supported on a 64-bit Linux operating system.
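
When InterProScan is used for a first-pass characterisation, its results are usually post-processed by a script. The sketch below groups tab-separated results by protein and collects the InterPro entries assigned to each. The column layout in `COLUMNS` is an assumption based on the tab-separated output of InterProScan 5 and should be verified against the version in use; the sample rows, accessions and function names are invented.

```python
import csv
import io
from collections import defaultdict

# Assumed column order of the tab-separated output (verify for your version).
COLUMNS = ["protein", "md5", "length", "analysis", "signature_acc",
           "signature_desc", "start", "stop", "score", "status", "date",
           "interpro_acc", "interpro_desc"]


def parse_tsv(handle):
    """Group signature matches by protein accession."""
    matches = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        record = dict(zip(COLUMNS, row))
        record["start"] = int(record["start"])
        record["stop"] = int(record["stop"])
        matches[record["protein"]].append(record)
    return dict(matches)


def integrated_entries(matches):
    """Map each protein to the set of InterPro entries it was assigned,
    skipping matches whose signature is not integrated ('-' or missing)."""
    return {
        protein: {r["interpro_acc"] for r in records
                  if r.get("interpro_acc") not in (None, "", "-")}
        for protein, records in matches.items()
    }


# Invented sample rows, not real InterProScan output.
SAMPLE = "\n".join("\t".join(fields) for fields in [
    ["seq1", "0" * 32, "120", "Pfam", "PF00000", "Invented domain",
     "10", "90", "1.2E-20", "T", "01-01-2024", "IPR000000", "Invented entry"],
    ["seq1", "0" * 32, "120", "PANTHER", "PTHR00000", "Invented family",
     "1", "120", "3.0E-50", "T", "01-01-2024", "-", "-"],
    ["seq2", "1" * 32, "80", "SMART", "SM00000", "Invented repeat",
     "5", "40", "0.001", "T", "01-01-2024", "IPR999999", "Another entry"],
])

matches = parse_tsv(io.StringIO(SAMPLE))
print(integrated_entries(matches))
```

For real data, replace `io.StringIO(SAMPLE)` with an open file handle on the results file.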

InterProScan, along with many other EMBL-EBI bioinformatics tools, can also be accessed programmatically using RESTful and SOAP Web Services APIs. [14]

References

  1. Blum M, Chang HY, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, et al. (November 2020). "The InterPro protein families and domains database: 20 years on". Nucleic Acids Research. 49 (D1): D344–D354. doi:10.1093/nar/gkaa977. PMC 7778928. PMID 33156333.
  2. Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, et al. (January 2012). "InterPro in 2011: new developments in the family and domain prediction database". Nucleic Acids Research. 40 (Database issue): D306–12. doi:10.1093/nar/gkr948. PMC 3245097. PMID 22096229.
  3. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, et al. (January 2001). "The InterPro database, an integrated documentation resource for protein families, domains and functional sites". Nucleic Acids Research. 29 (1): 37–40. doi:10.1093/nar/29.1.37. PMC 29841. PMID 11125043.
  4. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, et al. (December 2000). "InterPro--an integrated documentation resource for protein families, domains and functional sites". Bioinformatics. 16 (12): 1145–50. doi:10.1093/bioinformatics/16.12.1145. PMID 11159333.
  5. Blum M, Chang HY, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, et al. (6 November 2020). "The InterPro protein families and domains database: 20 years on". Nucleic Acids Research. 49 (D1): D344–D354. doi:10.1093/nar/gkaa977. ISSN 0305-1048. PMC 7778928. PMID 33156333.
  6. EMBL-EBI. "Where does the data come from? | InterPro". Retrieved 2020-12-04.
  7. EMBL-EBI. "InterPro entry types | InterPro". Retrieved 2020-12-04.
  8. "Terms of Use for EMBL-EBI Services | European Bioinformatics Institute".
  9. "How to download InterPro data? — InterPro Documentation". interpro-documentation.readthedocs.io. Retrieved 2020-12-04.
  10. Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R (July 2005). "InterProScan: protein domains identifier". Nucleic Acids Research. 33 (Web Server issue): W116–20. doi:10.1093/nar/gki442. PMC 1160203. PMID 15980438.
  11. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. (February 2001). "Initial sequencing and analysis of the human genome". Nature. 409 (6822): 860–921. Bibcode:2001Natur.409..860L. doi:10.1038/35057062. PMID 11237011.
  12. Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, Nusskern DR, et al. (October 2002). "The genome sequence of the malaria mosquito Anopheles gambiae". Science. 298 (5591): 129–49. Bibcode:2002Sci...298..129H. CiteSeerX 10.1.1.149.9058. doi:10.1126/science.1076181. PMID 12364791. S2CID 4512225.
  13. Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, et al. (May 2014). "InterProScan 5: genome-scale protein function classification". Bioinformatics. 30 (9): 1236–40. doi:10.1093/bioinformatics/btu031. PMC 3998142. PMID 24451626.
  14. Madeira F, Park YM, Lee J, Buso N, Gur T, Madhusoodanan N, et al. (July 2019). "The EMBL-EBI search and sequence analysis tools APIs in 2019". Nucleic Acids Research. 47 (W1): W636–W641. doi:10.1093/nar/gkz268. PMC 6602479. PMID 30976793.