Content | |
---|---|
Description | InterPro functionally analyzes protein sequences and classifies them into protein families while predicting the presence of domains and functional sites. |
Contact | |
Research center | EMBL |
Laboratory | European Bioinformatics Institute |
Primary citation | The InterPro protein families and domains database: 20 years on [1] |
Release date | 1999 |
Access | |
Website | www |
Download URL | ftp.ebi.ac.uk/pub/databases/interpro/ |
Miscellaneous | |
Data release frequency | 8-weekly |
Version | 97.0 (9 November 2023) |
InterPro is a database of protein families, protein domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences [2] in order to functionally characterise them. [3] [4]
The contents of InterPro consist of diagnostic signatures and the proteins that they significantly match. The signatures consist of models (simple types, such as regular expressions, or more complex ones, such as hidden Markov models) which describe protein families, domains or sites. Models are built from the amino acid sequences of known families or domains, and they are subsequently used to search unknown sequences (such as those arising from novel genome sequencing) in order to classify them. Each of the member databases of InterPro contributes towards a different niche, from very high-level, structure-based classifications (SUPERFAMILY and CATH-Gene3D) through to quite specific sub-family classifications (PRINTS and PANTHER).
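As a rough illustration of the simplest signature type, the sketch below translates a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. The function names `prosite_to_regex` and `scan` and the toy sequence are hypothetical, and the translation covers only the basic syntax elements (`x`, `[...]`, `{...}`, repeat counts and the `<`/`>` anchors); it is not how any member database actually implements its search.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles only the basic elements: 'x' (any residue), '[ABC]' (one of),
    '{ABC}' (none of), '(n)' / '(n,m)' repeats, and '<' / '>' anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")
        at_end = element.endswith(">")
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # excluded residues
        piece = core + ("{" + repeat + "}" if repeat else "")
        parts.append(("^" if at_start else "") + piece + ("$" if at_end else ""))
    return "".join(parts)


def scan(sequence: str, pattern: str) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (start, end) positions of each match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # A zinc-finger-like pattern in PROSITE syntax and a made-up sequence.
    zinc_finger = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
    toy_sequence = "MKCAACAAALAAAAAAAAHAAAHGG"
    print(prosite_to_regex(zinc_finger))
    print(scan(toy_sequence, zinc_finger))
```

More complex signatures such as hidden Markov models score a sequence probabilistically rather than matching it exactly, so they cannot be reduced to a regular expression in this way.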
InterPro's intention is to provide a one-stop shop for protein classification, where all the signatures produced by the different member databases are placed into entries within the InterPro database. Signatures which represent equivalent domains, sites or families are put into the same entry, and entries can also be related to one another. Additional information, such as a description, consistent names and Gene Ontology (GO) terms, is associated with each entry where possible.
InterPro contains three main entities: proteins, signatures (also referred to as "methods" or "models") and entries. The proteins in UniProtKB are also the central protein entities in InterPro. Information regarding which signatures significantly match these proteins is calculated as the sequences are released by UniProtKB, and these results are made available to the public (see below). The matches of signatures to proteins determine how signatures are integrated together into InterPro entries: comparative overlap of matched protein sets and the location of the signatures' matches on the sequences are used as indicators of relatedness. Only signatures deemed to be of sufficient quality are integrated into InterPro. As of version 81.0 (released 21 August 2020), InterPro entries annotated 73.9% of residues found in UniProtKB, with another 9.2% annotated by signatures that are pending integration. [5]
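The idea of using overlap between matched protein sets as an indicator of relatedness can be sketched as follows. This is only an illustration: the Jaccard index, the 0.75 threshold, the greedy grouping and the accessions are all assumptions made for the example, not InterPro's actual integration criteria, and the sketch ignores the location of matches on the sequences.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Fraction of proteins shared by two matched-protein sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_signatures(
    matches: dict[str, set[str]], threshold: float = 0.75
) -> list[set[str]]:
    """Group signatures whose matched protein sets overlap strongly.

    A signature joins (and merges) every existing group that contains at
    least one signature with Jaccard overlap >= threshold.
    """
    groups: list[set[str]] = []
    for sig in sorted(matches):
        linked = [
            g for g in groups
            if any(jaccard(matches[sig], matches[other]) >= threshold for other in g)
        ]
        merged = {sig}.union(*linked)
        groups = [g for g in groups if g not in linked] + [merged]
    return groups


if __name__ == "__main__":
    # Hypothetical signature and protein accessions.
    example = {
        "SIG_A": {"P1", "P2", "P3", "P4"},
        "SIG_B": {"P1", "P2", "P3", "P4", "P5"},
        "SIG_C": {"P9"},
    }
    print(group_signatures(example))
```

In this toy input, the first two signatures share four of five proteins and would be candidates for the same entry, while the third matches a disjoint protein and stays separate.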
InterPro also includes data for splice variants and the proteins contained in the UniParc and UniMES databases.
The signatures from InterPro come from 13 "member databases", which are listed below.
InterPro consists of seven types of data provided by different members of the consortium:
Data Type | Description | Contributing Databases |
---|---|---|
InterPro Entries | Structural and/or functional domains of proteins predicted using one or more signatures | All 13 member databases |
Member Database signatures | Signatures from member databases. These include signatures that are integrated into InterPro, and those that are not | All 13 member databases |
Protein | Protein sequences | UniProtKB (Swiss-Prot and TrEMBL) |
Proteome | Collection of proteins that belong to a single organism | UniProtKB |
Structure | 3-dimensional structures of proteins | PDBe |
Taxonomy | Protein taxonomic information | UniProtKB |
Set | Groups of evolutionarily related families | Pfam, CDD |
InterPro entries can be further broken down into five types:
The database is available for text- and sequence-based searches via a webserver, and for download via anonymous FTP. Like other EBI databases, it is in the public domain, since its content can be used "by any individual and for any purpose". [8] InterPro aims to release data to the public every 8 weeks, typically within a day of the UniProtKB release of the same proteins.
InterPro provides an API for programmatic access to all InterPro entries and their related entries in JSON format. [9] There are six main endpoints for the API, corresponding to the different InterPro data types: entry, protein, structure, taxonomy, proteome and set.
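A minimal sketch of how a client might address the six endpoints is shown below. The base URL, the `endpoint/source/accession` path layout and the trailing slash are assumptions that should be checked against the API documentation; `build_url` and `fetch_json` are hypothetical helper names.

```python
import json
from typing import Optional
from urllib.request import urlopen

# Assumed base URL; verify against the current API documentation.
BASE_URL = "https://www.ebi.ac.uk/interpro/api"

# The six main endpoints named in the text.
ENDPOINTS = ("entry", "protein", "structure", "taxonomy", "proteome", "set")


def build_url(
    endpoint: str, source: Optional[str] = None, accession: Optional[str] = None
) -> str:
    """Build a request URL of the assumed form BASE/endpoint/source/accession/."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    parts = [BASE_URL, endpoint]
    if source:
        parts.append(source)
        if accession:
            parts.append(accession)
    return "/".join(parts) + "/"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Fetch a URL and decode the JSON response body (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_json(url) to perform the request.
    print(build_url("entry", "interpro", "IPR000001"))
```

Keeping URL construction separate from the network call makes the addressing logic easy to test offline.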
InterProScan is a software package that allows users to scan sequences against member database signatures. Users can use this signature scanning software to functionally characterize novel nucleotide or protein sequences. [10] InterProScan is frequently used in genome projects in order to obtain a "first-pass" characterisation of the genome of interest. [11] [12] As of December 2020, the public version of InterProScan (v5.x) uses a Java-based architecture. [13] The software package is currently only supported on a 64-bit Linux operating system.
InterProScan, along with many other EMBL-EBI bioinformatics tools, can also be accessed programmatically using RESTful and SOAP Web Services APIs. [14]
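Results from a scan are often post-processed with a short script. The sketch below parses tab-separated output under the assumption that the columns are ordered as protein accession, sequence MD5, sequence length, analysis, signature accession, signature description, start, stop, score, status, date, and then optional InterPro accession and description. That column order should be confirmed against the InterProScan documentation for the version in use; the `Match` type, the `parse_tsv` function and the sample rows are hypothetical.

```python
import io
from typing import Iterable, NamedTuple, Optional


class Match(NamedTuple):
    protein: str             # column 1: protein accession
    analysis: str            # column 4: member database / analysis name
    signature: str           # column 5: signature accession
    start: int               # column 7: match start on the sequence
    stop: int                # column 8: match end on the sequence
    interpro: Optional[str]  # column 12: InterPro accession, if integrated


def parse_tsv(lines: Iterable[str]) -> list[Match]:
    """Parse tab-separated scan output under the assumed column order."""
    matches = []
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) < 9:
            continue  # skip blank or truncated lines
        has_entry = len(row) > 11 and row[11] not in ("", "-")
        matches.append(
            Match(row[0], row[3], row[4], int(row[6]), int(row[7]),
                  row[11] if has_entry else None)
        )
    return matches


if __name__ == "__main__":
    # Made-up rows for illustration only.
    sample = io.StringIO(
        "PROT1\tmd5\t120\tPfam\tSIG0001\tExample domain\t10\t90\t1.2e-20\tT\t"
        "01-01-2024\tIPR_EXAMPLE\tExample entry\n"
        "PROT1\tmd5\t120\tPRINTS\tSIG0002\tExample motif\t15\t40\t3.0e-05\tT\t"
        "01-01-2024\t-\t-\n"
    )
    for match in parse_tsv(sample):
        print(match)
```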
A protein family is a group of evolutionarily related proteins. In many cases, a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term "protein family" should not be confused with family as it is used in taxonomy.
The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues including Janet Thornton and David Jones, and continues to be developed by the Orengo group at University College London. CATH shares many broad features with the SCOP resource; however, there are also many areas in which the detailed classification differs greatly.
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, USA.
The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff. Institute leaders such as Rolf Apweiler, Alex Bateman, Ewan Birney, and Guy Cochrane, an adviser on the National Genomics Data Center Scientific Advisory Board, serve as part of the international research network of the BIG Data Center at the Beijing Institute of Genomics.
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The latest version of Pfam, 36.0, was released in September 2023 and contains 20,795 families. It is currently provided through the InterPro database.
Amos Bairoch is a Swiss bioinformatician and Professor of Bioinformatics at the Department of Human Protein Sciences of the University of Geneva where he leads the CALIPHO group at the Swiss Institute of Bioinformatics (SIB) combining bioinformatics, curation, and experimental efforts to functionally characterize human proteins.
PROSITE is a protein database. It consists of entries describing the protein families, domains and functional sites as well as amino acid patterns and profiles in them. These are manually curated by a team of the Swiss Institute of Bioinformatics and tightly integrated into Swiss-Prot protein annotation. PROSITE was created in 1988 by Amos Bairoch, who directed the group for more than 20 years. Since July 2018, the director of PROSITE and Swiss-Prot is Alan Bridge.
Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database originally developed at the Wellcome Trust Sanger Institute in collaboration with Janelia Farm, and currently hosted at the European Bioinformatics Institute. Rfam is designed to be similar to the Pfam database for annotating protein families.
In molecular biology, the PRINTS database is a collection of so-called "fingerprints": it provides both a detailed annotation resource for protein families, and a diagnostic tool for newly determined sequences. A fingerprint is a group of conserved motifs taken from a multiple sequence alignment - together, the motifs form a characteristic signature for the aligned protein family. The motifs themselves are not necessarily contiguous in sequence, but may come together in 3D space to define molecular binding sites or interaction surfaces. The particular diagnostic strength of fingerprints lies in their ability to distinguish sequence differences at the clan, superfamily, family and subfamily levels. This allows fine-grained functional diagnoses of uncharacterised sequences, allowing, for example, discrimination between family members on the basis of the ligands they bind or the proteins with which they interact, and highlighting potential oligomerisation or allosteric sites.
MicrobesOnline is a publicly and freely accessible website that hosts multiple comparative genomic tools for comparing microbial species at the genomic, transcriptomic and functional levels. MicrobesOnline was developed by the Virtual Institute for Microbial Stress and Survival, which is based at the Lawrence Berkeley National Laboratory in Berkeley, California. The site was launched in 2005, with regular updates until 2011.
The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences and their protein products. RefSeq was introduced in 2000. This database is built by the National Center for Biotechnology Information (NCBI) and, unlike GenBank, provides only a single record for each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.
SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
PDBsum is a database that provides an overview of the contents of each 3D macromolecular structure deposited in the Protein Data Bank (PDB).
Simple Modular Architecture Research Tool (SMART) is a biological database that is used in the identification and analysis of protein domains within protein sequences. SMART uses profile-hidden Markov models built from multiple sequence alignments to detect protein domains in protein sequences. The most recent release of SMART contains 1,204 domain models. Data from SMART was used in creating the Conserved Domain Database collection and is also distributed as part of the InterPro database. The database is hosted by the European Molecular Biology Laboratory in Heidelberg.
Rolf Apweiler is a director of the European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL), with Ewan Birney.
SWISS-MODEL is a structural bioinformatics web-server dedicated to homology modeling of 3D protein structures. Homology modeling is currently the most accurate method to generate reliable three-dimensional protein structure models and is routinely used in many practical applications. Homology modeling methods make use of experimental protein structures ("templates") to build models for evolutionarily related proteins ("targets").
The European Nucleotide Archive (ENA) is a repository providing free and unrestricted access to annotated DNA and RNA sequences. It also stores complementary information such as experimental procedures, details of sequence assembly and other metadata related to sequencing projects. The archive is composed of three main databases: the Sequence Read Archive, the Trace Archive and the EMBL Nucleotide Sequence Database. The ENA is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank.
In bioinformatics, the PANTHER classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. PANTHER is part of the Gene Ontology Reference Genome Project designed to classify proteins and their genes for high-throughput analysis.
Alexander George Bateman is a computational biologist and Head of Protein Sequence Resources at the European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL) in Cambridge, UK. He has led the development of the Pfam biological database and introduced the Rfam database of RNA families. He has also been involved in the use of Wikipedia for community-based annotation of biological databases.
Julian John Thurstan Gough was a Group Leader in the Laboratory of Molecular Biology (LMB) of the Medical Research Council (MRC). He was previously a professor of bioinformatics at the University of Bristol.