Biological databases are stores of biological information. [1] The journal Nucleic Acids Research regularly publishes special issues on biological databases and has a list of such databases. The 2018 issue has a list of about 180 such databases and updates to previously described databases. [2] Omics Discovery Index can be used to browse and search several biological databases. Furthermore, the NIAID Data Ecosystem Discovery Portal developed by the National Institute of Allergy and Infectious Diseases (NIAID) enables searching across databases.
Meta databases are databases of databases that collect data about data to generate new data. They are capable of merging information from different sources and making it available in a new and more convenient form, or with an emphasis on a particular disease or organism. Originally, metadata was only a common term referring simply to data about data such as tags, keywords, and markup headers.
Model organism databases provide in-depth biological data for intensively studied organisms.
The primary databases make up the International Nucleotide Sequence Database (INSD). They include:
DDBJ (Japan), GenBank (USA) and European Nucleotide Archive (Europe) are repositories for nucleotide sequence data from all organisms. All three accept nucleotide sequence submissions, and then exchange new and updated data on a daily basis to achieve optimal synchronisation between them. These three databases are primary databases, as they house original sequence data. They collaborate with Sequence Read Archive (SRA), which archives raw reads from high-throughput sequencing instruments.
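As a rough illustration of how a record in one of these primary repositories can be addressed programmatically, the sketch below builds a request URL for the NCBI E-utilities `efetch` service, which serves GenBank nucleotide records. The helper name `genbank_fasta_url` and the example accession are chosen here for illustration; the sketch only constructs the URL and does not perform the network request.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "efetch" service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fasta_url(accession: str) -> str:
    """Build an efetch URL requesting one nucleotide record as FASTA text.

    The function name is illustrative; only the URL is built, nothing is fetched.
    """
    params = {
        "db": "nuccore",     # nucleotide collection, which includes GenBank records
        "id": accession,     # accession, optionally with a version suffix
        "rettype": "fasta",  # return the sequence in FASTA format
        "retmode": "text",   # plain text rather than XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# Example accession used purely for illustration.
print(genbank_fasta_url("U49845.1"))
```

The resulting URL can be passed to any HTTP client. Because the three repositories exchange data daily, the same accession should generally also resolve at DDBJ and the European Nucleotide Archive, although each offers its own retrieval interface.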
Secondary databases are:
Other databases
These databases collect genome sequences, annotate and analyze them, and provide public access. Some add curation of experimental literature to improve computed annotations. These databases may hold many species genomes, or a single model organism genome.
(See also: List of proteins in the human body)
Several publicly available data repositories and resources have been developed to support and manage protein related information, biological knowledge discovery and data-driven hypothesis generation. [15] The databases in the table below are selected from the databases listed in the Nucleic Acids Research (NAR) databases issues and database collection and the databases cross-referenced in the UniProtKB. Most of these databases are cross-referenced with UniProt / UniProtKB so that identifiers can be mapped to each other. [15]
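The identifier mapping that such cross-references make possible can be sketched with a small in-memory index. This is a minimal illustration, not the UniProt mapping service: the function name `build_xref_index` and the row layout are assumptions made here, and the three example rows are included only for illustration and should be checked against the live databases before being relied on.

```python
from collections import defaultdict

# Illustrative cross-reference rows:
# (UniProtKB accession, external database, external identifier).
XREFS = [
    ("P04637", "Ensembl", "ENSG00000141510"),
    ("P04637", "PDB", "1TUP"),
    ("P38398", "Ensembl", "ENSG00000012048"),
]


def build_xref_index(xrefs):
    """Index cross-references in both directions.

    Returns (by_uniprot, by_external):
      by_uniprot[accession][database] -> list of external identifiers
      by_external[(database, external_id)] -> UniProtKB accession
    """
    by_uniprot = defaultdict(lambda: defaultdict(list))
    by_external = {}
    for accession, database, external_id in xrefs:
        by_uniprot[accession][database].append(external_id)
        by_external[(database, external_id)] = accession
    return by_uniprot, by_external


by_uniprot, by_external = build_xref_index(XREFS)
print(by_uniprot["P04637"]["PDB"])                   # external IDs for one protein
print(by_external[("Ensembl", "ENSG00000012048")])   # reverse lookup
```

The reverse index assumes each external identifier maps to a single UniProtKB accession, which does not always hold in practice (one gene can correspond to several protein entries); a fuller version would store a list there as well.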
Proteins in human:
There are about 20,000 protein-coding genes in the standard human genome (roughly 1,200 of them already have Wikipedia articles, as part of the Gene Wiki). Including splice variants, there could be as many as 500,000 unique human proteins. [16]
DB name | DB website | Provider | Data sources | Revenue/Sponsors sources | Integrates | Desc. | Size | DB type | Actively maintained |
---|---|---|---|---|---|---|---|---|---|
InterPro | http://www.ebi.ac.uk/interpro/ | ELIXIR infrastructure | European Bioinformatics Institute | EMBL, Wellcome Trust, BBSRC | CATH-Gene3D, CDD, HAMAP, MobiDB, PANTHER, Pfam, SMART, SUPERFAMILY, SFLD, TIGRFAMs | classifies proteins into families and predicts the presence of domains and sites | | Protein sequence databases | Yes |
NextProt | https://www.nextprot.org/ | CALIPHO (a group at the SIB) | Swiss Institute of Bioinformatics | https://www.sib.swiss/about/funding-sources | UniProt, Cellosaurus, gnomAD, IntAct, SRAA Atlas, UniProt-GOA, Bgee, COSMIC, MassIVE, PeptideAtlas | a human protein-centric knowledge resource | | Protein sequence databases | Yes |
Wiki-pi | http://severus.dbmi.pitt.edu/wiki-pi/ | Madhavi K. Ganapathiraju | At present Wiki-Pi contains 48,419 unique interactions among 10,492 proteins; it is unclear whether all of these are unique proteins. [13] | Protein interaction database | ?? | ||||
Human Protein Reference Database | Institute of Bioinformatics (IOB), Bangalore, India | One source reports 15,000 proteins, [17] but it is unclear how many of these are unique | |||||||
Pfam | Sanger Institute | protein families database of alignments and HMMs | Protein sequence databases | ||||||
Human Proteinpedia | Institute of Bioinformatics (IOB), Bangalore, and Johns Hopkins University | Human Proteinpedia is based on the Human Protein Reference Database (HPRD), a repository hosting over 30,000 human proteins; it is unclear how many of these are unique proteins | |||||||
Human Protein Atlas | The Swedish Government | Contains roughly 10 million immunohistochemistry (IHC) images for just under 25,000 antibodies; it is unclear how many of these are unique | |||||||
PRINTS | Manchester University | a compendium of protein fingerprints | Protein sequence databases | ||||||
PROSITE | database of protein families and domains | Protein sequence databases | |||||||
Protein Information Resource | Georgetown University Medical Center [GUMC] | Protein sequence databases | |||||||
SUPERFAMILY | library of HMMs representing superfamilies and database of (superfamily and family) annotations for all completely sequenced organisms | Protein sequence databases | |||||||
Swiss-Prot | Swiss Institute of Bioinformatics | protein knowledgebase | Protein sequence databases | ||||||
Protein Data Bank | Protein Data Bank in Europe (PDBe), [18] Protein Data Bank in Japan (PDBj), [19] Research Collaboratory for Structural Bioinformatics (RCSB) [20] | (PDB) | Protein structure databases | ||||||
Structural Classification of Proteins (SCOP) | Protein structure databases | ||||||||
CATH database | Protein structure databases | ||||||||
ModBase | Sali Lab, UCSF | database of comparative protein structure models | Protein model databases | ||||||
SIMAP | database of protein similarities computed using FASTA | Protein model databases | |||||||
Swiss-model | server and repository for protein structure models | Protein model databases | |||||||
AAindex | database of amino acid indices, amino acid mutation matrices, and pair-wise contact potentials | Protein model databases | |||||||
BioGRID | Samuel Lunenfeld Research Institute | general repository for interaction datasets | Protein-protein and other molecular interactions | ||||||
RNA-binding protein database | Protein-protein and other molecular interactions | ||||||||
Database of Interacting Proteins | Univ. of California | Protein-protein and other molecular interactions | |||||||
IntAct [21] | EMBL-EBI | open-source database for molecular interactions | Protein-protein and other molecular interactions | ||||||
String | an open source molecular interaction database to study interactions between proteins | Protein-protein and other molecular interactions | |||||||
Human Protein Atlas | Human Protein Atlas | aims at mapping all the human proteins in cells, tissues and organs | Protein expression databases | ||||||
ProteinModelPortal | Protein Model Portal of the PSI-Nature Structural Biology Knowledgebase | ?? | ?? | 3D structure protein databases | |||||
SWISS-MODEL Repository | Database of annotated 3D protein structure models | University of Basel | The Swiss government | 3D structure protein databases | |||||
DisProt | Database of Protein Disorder | ELIXIR infrastructure | Indiana University School of Medicine, Temple University, University of Padua | funding from the European Union's Horizon 2020 | Swiss-Prot/UniProt, CATH, Pfam, Europe PMC, BITEM, ECO, Gene Ontology | database of experimental evidence of disorder in proteins | 3D structure protein databases, Protein sequence databases | ||
MobiDB | Database of intrinsically disordered and mobile proteins | John Moult, Christine Orengo, Predrag Radivojac | University of Padua | Italian Government | database of intrinsic protein disorder annotation | 3D structure protein databases, Protein sequence databases | |||
ModBase | Database of Comparative Protein Structure Models | Ursula Pieper, Ben Webb, Narayanan Eswar, Andrej Sali, Roberto Sanchez | UCSF, Sali Lab | 3D structure protein databases | |||||
PDBsum | Pictorial database of 3D structures in the Protein Data Bank | European Bioinformatics Institute 2013 | Wellcome Trust | 3D structure protein databases | |||||
CCDS | The Consensus CDS protein set database | NCBI | ?? | Sequence databases | |||||
UniProtKB | Universal Protein Resource (UniProt) | ?? | ?? | Sequence databases | |||||
Swiss-Prot/UniProt | https://www.sib.swiss/swiss-prot and https://www.uniprot.org/ | SIB Swiss Institute of Bioinformatics | European Bioinformatics Institute (EMBL-EBI) | Swiss-Prot has collected over 81,000 variants in roughly 13,000 human protein sequence records from peer-reviewed literature. It is unclear how many unique protein types are present in the database. |
Numerous databases collect information about species and other taxonomic categories. The Catalogue of Life is a special case as it is a meta-database of about 150 specialized "global species databases" (GSDs) that have collected the names and other information on (almost) all described and thus "known" species.
Images play a critical role in biomedicine and in fields ranging from anthropology to zoology. However, there are relatively few databases dedicated to image collection, although some projects such as iNaturalist collect photos as a main part of their data. A special case of "images" are 3-dimensional images such as protein structures or 3D reconstructions of anatomical structures. Image databases include, among others: [22]
Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The process of analyzing and interpreting data is sometimes referred to as computational biology, although the distinction between the two terms is often disputed. To some, the term computational biology refers to building and using models of biological systems.
Biological databases are libraries of biological sciences, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures.
In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
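To make the idea concrete, here is a deliberately naive sketch of one ingredient of gene finding: scanning a DNA string for open reading frames (an `ATG` start codon followed in-frame by a stop codon). It is not how production gene predictors work; those typically rely on statistical models, search both strands, and handle introns. The function name `find_orfs` and the `min_codons` threshold are illustrative choices made here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3):
    """Return sorted (start, end, sequence) tuples for forward-strand ORFs.

    An ORF here runs from the first ATG in a reading frame to the next
    in-frame stop codon, inclusive. `start` is 0-based and `end` is exclusive.
    ORFs shorter than `min_codons` codons (stop codon included) are skipped.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                       # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):  # step codon by codon
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                        # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None                     # look for the next ORF
    return sorted(orfs)


print(find_orfs("GGATGAAACCCTAAGG"))
# → [(2, 14, 'ATGAAACCCTAA')]
```

A candidate ORF that never reaches a stop codon before the sequence ends is discarded in this sketch; a real tool would usually report it as a partial gene.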
Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a speciation event (orthologs), or a duplication event (paralogs), or else a horizontal gene transfer event (xenologs).
The Ensembl genome database project is a scientific project at the European Bioinformatics Institute that provides a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms. Ensembl is one of several well-known genome browsers for the retrieval of genomic information.
The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff.
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The latest version of Pfam, 37.0, was released in June 2024 and contains 21,979 families. It is currently provided through the InterPro website.
KEGG is a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances. KEGG is utilized for bioinformatics research and education, including data analysis in genomics, metagenomics, metabolomics and other omics studies, modeling and simulation in systems biology, and translational research in drug development.
The Generic Model Organism Database (GMOD) project provides biological research communities with a toolkit of open-source software components for visualizing, annotating, managing, and storing biological data. The GMOD project is funded by the United States National Institutes of Health, National Science Foundation and the USDA Agricultural Research Service.
MicrobesOnline is a publicly and freely accessible website that hosts multiple comparative genomic tools for comparing microbial species at the genomic, transcriptomic and functional levels. MicrobesOnline was developed by the Virtual Institute for Microbial Stress and Survival, which is based at the Lawrence Berkeley National Laboratory in Berkeley, California. The site was launched in 2005, with regular updates until 2011.
The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences and their protein products. RefSeq was introduced in 2000. This database is built by National Center for Biotechnology Information (NCBI), and, unlike GenBank, provides only a single record for each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.
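RefSeq records can usually be told apart from GenBank records by their accession format: a two-letter prefix, an underscore, and a number. The sketch below classifies an accession using a partial table of common prefixes. The function name `refseq_molecule_type` is an illustrative choice made here, and the prefix table is intentionally incomplete, so a `None` result means only that the accession is not recognised by this sketch.

```python
import re

# Partial table of common RefSeq accession prefixes (not exhaustive).
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA",
    "XR": "predicted non-coding RNA",
    "XP": "predicted protein",
}

# Two uppercase letters, an underscore, digits, and an optional ".version".
_REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_\d+(\.\d+)?$")


def refseq_molecule_type(accession: str):
    """Return the molecule type for a RefSeq-style accession, else None."""
    match = _REFSEQ_PATTERN.match(accession)
    if match is None:
        return None  # not in the two-letter-underscore RefSeq format
    return REFSEQ_PREFIXES.get(match.group(1))


print(refseq_molecule_type("NM_000546.6"))  # RefSeq-style accession
print(refseq_molecule_type("U49845.1"))     # GenBank-style accession
```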
The UCSC Genome Browser is an online and downloadable genome browser hosted by the University of California, Santa Cruz (UCSC). It is an interactive website offering access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. The Browser is a graphical viewer optimized to support fast interactive performance and is an open-source, web-based tool suite built on top of a MySQL database for rapid visualization, examination, and querying of the data at many levels. The Genome Browser Database, browsing tools, downloadable data files, and documentation can all be found on the UCSC Genome Bioinformatics website.
Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins. These proteins are usually ones that are poorly studied or predicted based on genomic sequence data. These predictions are often driven by data-intensive computational procedures. Information may come from nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interaction. Protein function is a broad term: the roles of proteins range from catalysis of biochemical reactions to transport to signal transduction, and a single protein may play a role in multiple processes or cellular pathways.
In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.
In bioinformatics, the PANTHER classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. PANTHER is part of the Gene Ontology Reference Genome Project designed to classify proteins and their genes for high-throughput analysis.
In bioinformatics, a Gene Disease Database is a systematized collection of data, typically structured to model aspects of reality, in a way to comprehend the underlying mechanisms of complex diseases, by understanding multiple composite interactions between phenotype-genotype relationships and gene-disease mechanisms. Gene Disease Databases integrate human gene-disease associations from various expert curated databases and text mining derived associations including Mendelian, complex and environmental diseases.
Single nucleotide polymorphism annotation is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.
The Expression Atlas is a database maintained by the European Bioinformatics Institute that provides information on gene expression patterns from RNA-Seq and microarray studies, and on protein expression from proteomics studies. The Expression Atlas allows searches by gene, splice variant, protein attribute, disease, treatment or organism part. Individual genes or gene sets can be searched for. All datasets in the Expression Atlas have their metadata manually curated and their data analysed through standardised analysis pipelines. The Expression Atlas has two components: the Baseline Atlas and the Differential Atlas.
PomBase is a model organism database that provides online access to the fission yeast Schizosaccharomyces pombe genome sequence and annotated features, together with a wide range of manually curated functional gene-specific data. The PomBase website was redeveloped in 2016 to provide users with a more fully integrated, better-performing service.
Model organism databases (MODs) are biological databases, or knowledgebases, dedicated to the provision of in-depth biological data for intensively studied model organisms. MODs allow researchers to easily find background information on large sets of genes, efficiently plan experiments, integrate their data with existing knowledge, and formulate new hypotheses. They allow users to analyse results and interpret datasets, and the data they generate are increasingly used to describe less well-studied species. Where possible, MODs share common approaches to collecting and representing biological information. For example, all MODs use the Gene Ontology (GO) to describe functions, processes and cellular locations of specific gene products. Projects also exist to enable software sharing for curation, visualization and querying between different MODs. Organismal diversity and varying user requirements, however, mean that MODs are often required to customize the capture, display, and provision of data.