List of biological databases

Last updated

Biological databases are stores of biological information. [1] The journal Nucleic Acids Research regularly publishes special issues on biological databases and has a list of such databases. The 2018 issue has a list of about 180 such databases and updates to previously described databases. [2] Omics Discovery Index can be used to browse and search several biological databases. Furthermore, the NIAID Data Ecosystem Discovery Portal developed by the National Institute of Allergy and Infectious Diseases (NIAID) enables searching across databases.

Contents

Meta databases

Meta databases are databases of databases that collect data about data to generate new data. They are capable of merging information from different sources and making it available in a new and more convenient form, or with an emphasis on a particular disease or organism. Originally, metadata was only a common term referring simply to data about data such as tags, keywords, and markup headers.

Model organism databases

Model organism databases provide in-depth biological data for intensively studied organisms.

Nucleic acid databases

DNA databases

The primary databases make up the International Nucleotide Sequence Database (INSD). The include:

DDBJ (Japan), GenBank (USA) and European Nucleotide Archive (Europe) are repositories for nucleotide sequence data from all organisms. All three accept nucleotide sequence submissions, and then exchange new and updated data on a daily basis to achieve optimal synchronisation between them. These three databases are primary databases, as they house original sequence data. They collaborate with Sequence Read Archive (SRA), which archives raw reads from high-throughput sequencing instruments.

Secondary databases are:[ clarification needed ]

Other databases

Gene expression databases

Generic gene expression databases

Microarray gene expression databases

Genome databases

These databases collect genome sequences, annotate and analyze them, and provide public access. Some add curation of experimental literature to improve computed annotations. These databases may hold many species genomes, or a single model organism genome.

Phenotype databases

RNA databases

Amino acid / protein databases

(See also: List of proteins in the human body)

Several publicly available data repositories and resources have been developed to support and manage protein related information, biological knowledge discovery and data-driven hypothesis generation. [15] The databases in the table below are selected from the databases listed in the Nucleic Acids Research (NAR) databases issues and database collection and the databases cross-referenced in the UniProtKB. Most of these databases are cross-referenced with UniProt / UniProtKB so that identifiers can be mapped to each other. [15]

Proteins in human:

There are about ~20,000 protein coding genes in the standard human genome. (Roughly ~1200 already have Wikipedia articles - the Gene Wiki - about them) if we are Including splice variants, there could be as many as 500,000 unique human proteins [16]

Different types of Protein databases

Signal transduction pathway databases

Metabolic pathway and protein function databases

Taxonomic databases

Numerous databases collect information about species and other taxonomic categories. The Catalogue of Life is a special case as it is a meta-database of about 150 specialized "global species databases" (GSDs) that have collected the names and other information on (almost) all described and thus "known" species.

Image databases

Images play a critical role in biomedicine, ranging from images of anthropological specimens to zoology. However, there are relatively few databases dedicated to image collection, although some projects such as iNaturalist collect photos as a main part of their data. A special case of "images" are 3-dimensional images such as protein structures or 3D-reconstructions of anatomical structures. Image databases include, among others: [22]

Radiologic databases

Additional databases

Exosomal databases

Mathematical model databases

Databases on antimicrobial resistance rates and antibiotic consumption

Databases on antimicrobial resistance mechanisms

Wiki-style databases

Specialized databases

Related Research Articles

<span class="mw-page-title-main">Bioinformatics</span> Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The process of analyzing and interpreting data can sometimes be referred to as computational biology, however this distinction between the two terms is often disputed. To some, the term computational biology refers to building and using models of biological systems.

<span class="mw-page-title-main">Biological database</span> Database of biological information

Biological databases are libraries of biological sciences, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures.

In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

<span class="mw-page-title-main">Sequence homology</span> Shared ancestry between DNA, RNA or protein sequences

Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a speciation event (orthologs), or a duplication event (paralogs), or else a horizontal gene transfer event (xenologs).

<span class="mw-page-title-main">Ensembl genome database project</span> Scientific project at the European Bioinformatics Institute

Ensembl genome database project is a scientific project at the European Bioinformatics Institute, which provides a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms. Ensembl is one of several well known genome browsers for the retrieval of genomic information.

The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff.

<span class="mw-page-title-main">Pfam</span> Database of protein families

Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The latest version of Pfam, 37.0, was released in June 2024 and contains 21,979 families. It is currently provided through InterPro website.

<span class="mw-page-title-main">KEGG</span> Collection of bioinformatics databases

KEGG is a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances. KEGG is utilized for bioinformatics research and education, including data analysis in genomics, metagenomics, metabolomics and other omics studies, modeling and simulation in systems biology, and translational research in drug development.

<span class="mw-page-title-main">Generic Model Organism Database</span>

The Generic Model Organism Database (GMOD) project provides biological research communities with a toolkit of open-source software components for visualizing, annotating, managing, and storing biological data. The GMOD project is funded by the United States National Institutes of Health, National Science Foundation and the USDA Agricultural Research Service.

<span class="mw-page-title-main">MicrobesOnline</span>

MicrobesOnline is a publicly and freely accessible website that hosts multiple comparative genomic tools for comparing microbial species at the genomic, transcriptomic and functional levels. MicrobesOnline was developed by the Virtual Institute for Microbial Stress and Survival, which is based at the Lawrence Berkeley National Laboratory in Berkeley, California. The site was launched in 2005, with regular updates until 2011.

The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences and their protein products. RefSeq was introduced in 2000. This database is built by National Center for Biotechnology Information (NCBI), and, unlike GenBank, provides only a single record for each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.

The UCSC Genome Browser is an online and downloadable genome browser hosted by the University of California, Santa Cruz (UCSC). It is an interactive website offering access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. The Browser is a graphical viewer optimized to support fast interactive performance and is an open-source, web-based tool suite built on top of a MySQL database for rapid visualization, examination, and querying of the data at many levels. The Genome Browser Database, browsing tools, downloadable data files, and documentation can all be found on the UCSC Genome Bioinformatics website.

Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins. These proteins are usually ones that are poorly studied or predicted based on genomic sequence data. These predictions are often driven by data-intensive computational procedures. Information may come from nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interaction. Protein function is a broad term: the roles of proteins range from catalysis of biochemical reactions to transport to signal transduction, and a single protein may play a role in multiple processes or cellular pathways.

<span class="mw-page-title-main">DNA annotation</span> The process of describing the structure and function of a genome

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

In bioinformatics, the PANTHER classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. PANTHER is part of the Gene Ontology Reference Genome Project designed to classify proteins and their genes for high-throughput analysis.

In bioinformatics, a Gene Disease Database is a systematized collection of data, typically structured to model aspects of reality, in a way to comprehend the underlying mechanisms of complex diseases, by understanding multiple composite interactions between phenotype-genotype relationships and gene-disease mechanisms. Gene Disease Databases integrate human gene-disease associations from various expert curated databases and text mining derived associations including Mendelian, complex and environmental diseases.

Single nucleotide polymorphism annotation is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.

The Expression Atlas is a database maintained by the European Bioinformatics Institute that provides information on gene expression patterns from RNA-Seq and Microarray studies, and protein expression from Proteomics studies. The Expression Atlas allows searches by gene, splice variant, protein attribute, disease, treatment or organism part. Individual genes or gene sets can be searched for. All datasets in Expression Atlas have its metadata manually curated and its data analysed through standardised analysis pipelines. There are two components to the Expression Atlas, the Baseline Atlas and the Differential Atlas:

PomBase is a model organism database that provides online access to the fission yeast Schizosaccharomyces pombe genome sequence and annotated features, together with a wide range of manually curated functional gene-specific data. The PomBase website was redeveloped in 2016 to provide users with a more fully integrated, better-performing service.

Model organism databases (MODs) are biological databases, or knowledgebases, dedicated to the provision of in-depth biological data for intensively studied model organisms. MODs allow researchers to easily find background information on large sets of genes, efficiently plan experiments, integrate their data with existing knowledge, and formulate new hypotheses. They allow users to analyse results and interpret datasets, and the data they generate are increasingly used to describe less well studied species. Where possible, MODs share common approaches to collect and represent biological information. For example, all MODs use the Gene Ontology (GO) to describe functions, processes and cellular locations of specific gene products. Projects also exist to enable software sharing for curation, visualization and querying between different MODs. Organismal diversity and varying user requirements however mean that MODs are often required to customize capture, display, and provision of data.

References

  1. Wren JD, Bateman A (October 2008). "Databases, data tombs and dust in the wind". Bioinformatics. 24 (19): 2127–8. doi: 10.1093/bioinformatics/btn464 . PMID   18819940.
  2. "Volume 46 Issue D1 | Nucleic Acids Research | Oxford Academic". academic.oup.com. Retrieved 2018-09-04.
  3. Lock A, Rutherford K, Harris MA, Hayles J, Oliver SG, Bähler J, Wood V (January 2019). "PomBase 2018: user-driven reimplementation of the fission yeast database provides rapid and intuitive access to diverse, interconnected information". Nucleic Acids Research. 47 (D1): D821–D827. doi:10.1093/nar/gky961. PMC   6324063 . PMID   30321395.
  4. Zhu B, Stülke J (January 2018). "SubtiWiki in 2018: from genes and proteins to functional network annotation of the model organism Bacillus subtilis". Nucleic Acids Research. 46 (D1): D743–D748. doi:10.1093/nar/gkx908. PMC   5753275 . PMID   29788229.
  5. Margarita Garcia-Hernandez; Tanya Berardini; Guanghong Chen; Debbie Crist; Aisling Doyle; Eva Huala; Emma Knee; Mark Lambrecht; Neil Miller; Lukas A. Mueller; Suparna Mundodi; Leonore Reiser; Seung Y. Rhee; Randy Scholl; Julie Tacklind; Dan C. Weems; Yihe Wu; Iris Xu; Daniel Yoo; Jungwon Yoon; Peifen Zhang (November 2002). "TAIR: a resource for integrated Arabidopsis data". Functional & Integrative Genomics. 2 (6): 239–253. doi:10.1007/s10142-002-0077-z. PMID   12444417. S2CID   7827488.
  6. Powell S, Forslund K, Szklarczyk D, Trachana K, Roth A, Huerta-Cepas J, et al. (January 2014). "eggNOG v4.0: nested orthology inference across 3686 organisms". Nucleic Acids Research. 42 (Database issue): D231-9. doi: 10.1093/nar/gkt1253 . PMC   3964997 . PMID   24297252.
  7. Huerta-Cepas J, Szklarczyk D, Heller D, Hernández-Plaza A, Forslund SK, Cook H, et al. (January 2019). "eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses". Nucleic Acids Research. 47 (D1): D309–D314. doi: 10.1093/nar/gky1085 . PMC   6324079 . PMID   30418610.
  8. ArrayExpress
  9. GEO
  10. "The Human Protein Atlas". www.proteinatlas.org. Retrieved 2019-05-27.
  11. Dash S, Campbell JD, Cannon EK, Cleary AM, Huang W, Kalberer SR, et al. (January 2016). "Legume information system (LegumeInfo.org): a key component of a set of federated data resources for the legume family". Nucleic Acids Research. 44 (D1): D1181-8. doi:10.1093/nar/gkv1159. PMC   4702835 . PMID   26546515.
  12. "Saccharomyces Genome Database | SGD". www.yeastgenome.org. Retrieved 2018-09-04.
  13. Grant D, Nelson RT, Cannon SB, Shoemaker RC (January 2010). "SoyBase, the USDA-ARS soybean genetics and genomics database". Nucleic Acids Research. 38 (Database issue): D843-6. doi:10.1093/nar/gkp798. PMC   2808871 . PMID   20008513.
  14. "IRESbase".
  15. 1 2 Chen C, Huang H, Wu CH (2017). "Protein Bioinformatics Databases and Resources". In Wu CH, Arighi CN, Ross KE (eds.). Protein Bioinformatics. Methods in Molecular Biology. Vol. 1558. New York, NY: Springer New York. pp. 3–39. doi:10.1007/978-1-4939-6783-4_1. ISBN   978-1-4939-6781-0. PMC   5506686 . PMID   28150231.
  16. Karnkowska, Anna; Treitli, Sebastian C.; Brzoň, Ondřej; Novák, Lukáš; Vacek, Vojtěch; Soukal, Petr; Barlow, Lael D.; Herman, Emily K.; Pipaliya, Shweta V.; Pánek, Tomáš; Žihala, David; Petrželková, Romana; Butenko, Anzhelika; Eme, Laura; Stairs, Courtney W.; Roger, Andrew J.; Eliáš, Marek; Dacks, Joel B.; Hampl, Vladimír (2019). "The Oxymonad Genome Displays Canonical Eukaryotic Complexity in the Absence of a Mitochondrion". Molecular Biology and Evolution. 36 (10): 2292–2312. doi:10.1093/molbev/msz147. PMC   6759080 . PMID   31387118.
  17. Keshava Prasad, T. S.; Goel, R.; Kandasamy, K.; Keerthikumar, S.; Kumar, S.; Mathivanan, S.; Telikicherla, D.; Raju, R.; Shafreen, B.; Venugopal, A.; Balakrishnan, L.; Marimuthu, A.; Banerjee, S.; Somanathan, D. S.; Sebastian, A.; Rani, S.; Ray, S.; Harrys Kishore, C. J.; Kanth, S.; Ahmed, M.; Kashyap, M. K.; Mohmood, R.; Ramachandra, Y. L.; Krishna, V.; Rahiman, B. A.; Mohan, S.; Ranganathan, P.; Ramabadran, S.; Chaerkady, R.; Pandey, A. (2008). "Human Protein Reference Database—2009 update". Nucleic Acids Research. 37 (Database issue): D767–D772. doi:10.1093/nar/gkn892. PMC   2686490 . PMID   18988627.
  18. Mir S, Alhroub Y, Anyango S, Armstrong DR, Berrisford JM, Clark AR, et al. (January 2018). "PDBe: towards reusable data delivery infrastructure at protein data bank in Europe". Nucleic Acids Research. 46 (D1): D486–D492. doi:10.1093/nar/gkx1070. PMC   5753225 . PMID   29126160.
  19. Kinjo AR, Bekker GJ, Suzuki H, Tsuchiya Y, Kawabata T, Ikegawa Y, Nakamura H (January 2017). "Protein Data Bank Japan (PDBj): updated user interfaces, resource description framework, analysis tools for large structures". Nucleic Acids Research. 45 (D1): D282–D288. doi:10.1093/nar/gkw962. PMC   5210648 . PMID   27789697.
  20. Rose PW, Prlić A, Altunkaya A, Bi C, Bradley AR, Christie CH, et al. (January 2017). "The RCSB protein data bank: integrative view of protein, gene and 3D structural information". Nucleic Acids Research. 45 (D1): D271–D281. doi:10.1093/nar/gkw1000. PMC   5210513 . PMID   27794042.
  21. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, et al. (January 2004). "IntAct: an open source molecular interaction database". Nucleic Acids Research. 32 (Database issue): D452-5. doi:10.1093/nar/gkh052. PMC   308786 . PMID   14681455.
  22. 1 2 Ellenberg J, Swedlow JR, Barlow M, Cook CE, Sarkans U, Patwardhan A, et al. (November 2018). "A call for public archives for biological image data". Nature Methods. 15 (11): 849–854. doi:10.1038/s41592-018-0195-8. PMC   6884425 . PMID   30377375.
  23. Tendler BC, Hanayik T, Ansorge O, Bangerter-Christensen S, Berns GS, Bertelsen MF, et al. (March 2022). "The Digital Brain Bank, an open access platform for post-mortem imaging datasets". eLife. 11: e73153. doi: 10.7554/eLife.73153 . PMC   9042233 . PMID   35297760.
  24. Iudin A, Korir PK, Salavert-Torres J, Kleywegt GJ, Patwardhan A (May 2016). "EMPIAR: a public archive for raw electron microscopy image data". Nature Methods. 13 (5): 387–388. doi:10.1038/nmeth.3806. PMID   27067018. S2CID   38996040.
  25. Crickmore, N.; Berry, C.; Panneerselvam, S.; Mishra, R.; Connor, T. R.; Bonning, B. C. (November 2021). "A structure-based nomenclature for Bacillus thuringiensis and other bacteria-derived pesticidal proteins". Journal of Invertebrate Pathology. 186 (D1): 107438. doi: 10.1016/j.jip.2020.107438 . PMID   32652083.
  26. Panneerselvam S; Mishra R; Berry C; Crickmore N; Bonning BC (2022). "BPPRC database: a web-based tool to access and analyse bacterial pesticidal proteins". Database (Oxford). 186 (D1): 107438. doi: 10.1093/database/baac022 . PMC   9216523 . PMID   35396594.
  27. Hounkpe BW, Chenou F, de Lima F, De Paula EV (January 2021). "HRT Atlas v1.0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets". Nucleic Acids Research. 49 (D1): D947–D955. doi: 10.1093/nar/gkaa609 . PMC   7778946 . PMID   32663312.
  28. (IHEC) data portal
  29. CEEHRC
  30. Blueprint
  31. EGA
  32. DEEP
  33. CREST
  34. "Sharing epigenomes globally". Nature Methods. 15 (3): 151. 2018. doi: 10.1038/nmeth.4630 . ISSN   1548-7105.
  35. Valverde H, Cantón FR, Aledo JC (November 2019). "MetOSite: an integrated resource for the study of methionine residues sulfoxidation". Bioinformatics. 35 (22): 4849–4850. doi:10.1093/bioinformatics/btz462. PMC   6853639 . PMID   31197322.