Biocuration is the field of life sciences dedicated to organizing biomedical data, information and knowledge into structured formats, such as spreadsheets, tables and knowledge graphs. [1] [2] The biocuration of biomedical knowledge is made possible by the cooperative work of biocurators, software developers and bioinformaticians, and it underpins the work of biological databases. [1]
A biocurator is a professional scientist who collects, curates, annotates, and validates information that is disseminated by biological and model organism databases. [3] [4] It is a new profession, with the first mentions in the scientific literature dating to 2006, in the context of the work of databases such as the Immune Epitope Database and Analysis Resource. [5] [6] Biocurators usually hold a PhD and combine wet-lab experience with expertise in computational representations of knowledge (e.g. via ontologies). [7]
The role of a biocurator encompasses quality control of primary biological research data intended for publication, the extraction and organization of data from the original scientific literature, and the description of that data with standard annotation protocols and vocabularies that enable powerful queries and interoperability between biological databases. Biocurators communicate with researchers to ensure the accuracy of curated information and to foster data exchanges with research laboratories. [6]
Biocurators are present in diverse research environments, but may not self-identify as biocurators. Projects such as ELIXIR (the European life-sciences Infrastructure for biological Information) and GOBLET (Global Organization for Bioinformatics Learning, Education and Training) [8] promote training and support biocuration as a career path. [9] [10]
In 2011, biocuration was already recognized as a profession, but there were no formal degree courses to train curators of biological data in a targeted fashion. [11] With the growth of the field, the University of Cambridge and EMBL-EBI started to jointly offer a Postgraduate Certificate in Biocuration, [12] considered a step towards recognising biocuration as a discipline in its own right. [13] There is a perceived increase in demand for biocuration, and a need for additional biocuration training in graduate programs. [14]
Organizations that employ biocurators, like Clinical Genome Resource (ClinGen), often provide specialized materials and training for biocuration. [15]
The role of biocurators is best known in the field of biological knowledgebases. Such databases, like UniProt [16] and the PDB, [17] rely on professional biocurators to organize information. Among other things, biocurators work to improve data quality, for example by merging duplicated entries. [18]
An important subset of these knowledgebases are model organism databases, which rely on biocurators to curate information about particular organisms. Notable examples are FlyBase, [19] PomBase, [20] and ZFIN, [21] dedicated to curating information about Drosophila, Schizosaccharomyces pombe and zebrafish, respectively.
Biocuration is the integration of biological information into online databases in a semantically standardized way, using appropriate unique, traceable identifiers and providing the necessary metadata, including source and provenance.
Biocurators commonly employ, and take part in the creation and development of, shared biomedical ontologies: structured, controlled vocabularies that encompass many biological and medical knowledge domains, such as the Open Biomedical Ontologies. These domains include genomics and proteomics, anatomy, animal and plant development, biochemistry, metabolic pathways, taxonomic classification, and mutant phenotypes. Given the variety of existing ontologies, there are guidelines that help researchers choose a suitable one. [22]
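These ontologies are distributed in machine-readable formats such as OBO and can be explored programmatically. Below is a minimal sketch using the obonet Python package (one of several possible OBO parsers) to load the Gene Ontology from its standard OBO Foundry address and look up a term; check that the URL is still current before relying on it:

```python
# Minimal sketch: load the Gene Ontology and look up one term by identifier.
# obonet parses OBO files into a networkx graph; the PURL below is the
# standard address for the GO "basic" release.
import obonet

graph = obonet.read_obo("http://purl.obolibrary.org/obo/go/go-basic.obo")
print(len(graph))                             # number of terms in the ontology
print(graph.nodes["GO:0006915"]["name"])      # "apoptotic process"
```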
The Unified Medical Language System (UMLS) is one such system, integrating and distributing millions of terms used in the life sciences domain. [23]
Biocurators enforce the consistent use of gene nomenclature guidelines and participate in the genetic nomenclature committees of various model organisms, often in collaboration with the HUGO Gene Nomenclature Committee (HGNC). They also enforce other nomenclature guidelines, like those provided by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB), one example of which is the Enzyme Commission (EC) number.
More generally, the use of persistent identifiers is praised by the community as a way to improve clarity and facilitate the linking of knowledge. [24]
In genome annotation, for example, identifiers defined by ontologists and consortia are used to describe parts of the genome. The Gene Ontology (GO), for instance, curates terms for biological processes, which are used to describe what is known about specific genes.
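As an illustration, such a gene-to-term link reduces to a record pairing stable identifiers with an evidence code and a source reference. The Python record below is a simplified sketch, not an official annotation format; the gene and GO identifiers are real public ones, while the reference field is a placeholder:

```python
# A simplified, illustrative annotation record: stable identifiers make the
# claim unambiguous and traceable back to its source.
annotation = {
    "gene_id": "UniProtKB:P04637",   # the human p53 protein in UniProt
    "gene_symbol": "TP53",
    "go_id": "GO:0006915",           # GO biological process: apoptotic process
    "evidence_code": "IDA",          # "Inferred from Direct Assay"
    "reference": "PMID:<source>",    # placeholder for the supporting publication
}
print(f'{annotation["gene_symbol"]} is annotated with {annotation["go_id"]}')
```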
As of 2021, life sciences communication is still done primarily in free natural languages, such as English or German, which carry a degree of ambiguity and make it hard to connect knowledge. Besides annotating biological sequences, biocurators therefore also annotate texts, linking words to unique identifiers. This aids disambiguation, clarifying the intended meaning and making the texts processable by computers. One application of text annotation is to specify the exact gene a scientist is referring to. [25]
Publicly available text annotations make it possible for biologists to take further advantage of biomedical text. Europe PMC offers an application programming interface (API) that centralizes text annotations from a variety of sources and makes them available in a graphical user interface called SciLite. [26] PubTator Central also provides annotations, but is fully based on computerized text mining and does not provide a user interface. [27] There are also programs that allow users to manually annotate the biomedical texts they are interested in, such as the ezTag system. [28]
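A hedged sketch of retrieving such annotations programmatically from the Europe PMC Annotations API follows; the endpoint and parameter names reflect the public documentation at the time of writing and should be verified against the current API reference:

```python
# Fetch gene/protein text annotations for one article from Europe PMC.
import requests

resp = requests.get(
    "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds",
    params={
        "articleIds": "MED:22745249",  # an example PubMed identifier
        "type": "Gene_Proteins",       # restrict to gene/protein annotations
        "format": "JSON",
    },
)
resp.raise_for_status()
for article in resp.json():
    for ann in article.get("annotations", []):
        # each annotation links a text span ("exact") to unique identifiers
        uris = [tag.get("uri") for tag in ann.get("tags", [])]
        print(ann.get("exact"), "->", uris)
```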
A type of biocuration within the field of medical genetics, variant curation is the process of assessing genetic changes according to the likelihood that they cause disease. [29] It is an evidence-based process that draws on data from a multitude of sources, including population data, computational data, functional data, segregation data, de novo data and allelic data, among others. [30] It is a collaborative process that can be automated; however, manual curation is considered the gold standard. [31]
There is no single standardised process of variant curation; different researchers and organisations use different processes. [29] However, a set of internationally accepted [32] standards and guidelines for the interpretation of genetic variants has been jointly developed by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology, [30] known as the ACMG/AMP guidelines. These guidelines provide a framework for classifying genetic variants as "pathogenic", "likely pathogenic", "uncertain significance", "likely benign" or "benign", in order from most to least likely to cause disease. The guidelines also list levels of evidence, ranging from very strong through strong and moderate to supporting. [33] The combination of the types of evidence found, and the levels at which those pieces of evidence exist, allows each variant to be classified along the scale from "pathogenic" to "benign". [32]
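As an illustration of how evidence levels combine, the toy function below implements a small, simplified subset of the published combining rules in Python; real variant curation must follow the full ACMG/AMP guidelines and their subsequent refinements, not this sketch:

```python
# Toy subset of the ACMG/AMP combining rules. Arguments are counts of criteria
# met at each evidence level: PVS (very strong) / PS (strong) / PM (moderate) /
# PP (supporting) for pathogenicity; BA (stand-alone) / BS (strong) /
# BP (supporting) for benign impact. Many published rules are omitted here.
def classify(pvs=0, ps=0, pm=0, pp=0, ba=0, bs=0, bp=0):
    if ba >= 1 or bs >= 2:
        return "benign"
    if (pvs >= 1 and ps >= 1) or ps >= 2:
        return "pathogenic"
    if (pvs >= 1 and pm >= 1) or (ps >= 1 and pm >= 1) or pm >= 3:
        return "likely pathogenic"
    if (bs >= 1 and bp >= 1) or bp >= 2:
        return "likely benign"
    return "uncertain significance"

# One strong plus one moderate pathogenic criterion:
print(classify(ps=1, pm=1))   # -> "likely pathogenic"
```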
The International Society for Biocuration (ISB) is a non-profit organisation that "promotes the field of biocuration and provides a forum for information exchange through meetings and workshops". It grew out of the International Biocuration Conferences and was founded in early 2009. [4]
The ISB offers two awards to members of the biocuration community: the Biocurator Career Award (given annually) and the ISB Award for Exceptional Contributions to Biocuration (given biannually).
The official journal of the ISB, Database, is a venue specializing in articles about databases and biocuration. [34]
Traditionally, biocuration has been done by dedicated experts who integrate data into databases. Community curation has emerged as a promising approach to improve the dissemination of knowledge from published data and provide a cost-effective way to improve the scalability of biocuration. In some cases, community help is leveraged in jamborees that introduce domain experts to curation tasks carried out during the event, [35] while other efforts rely on asynchronous contributions from experts and non-experts. [36]
Several biological databases include author contributions in their functional curation strategy to some extent, ranging from associating gene identifiers with publications or free text to more structured and detailed annotation of sequences and functional data, producing curation to the same standards as professional biocurators. Most community curation at model organism databases involves annotation by the original authors of published research (first-pass annotation) to obtain accurate identifiers for the objects to be curated, or to identify data types for detailed curation.
Other databases, such as PomBase, rely on publication authors to submit highly detailed, ontology-based annotations for their publications, along with metadata associated with genome-wide datasets using controlled vocabularies. A web-based tool, Canto, [41] was developed to facilitate community submissions. Since Canto is freely available, generic and highly configurable, it has been adopted by other projects. [42] Curation is subject to review by professional curators, resulting in high-quality, in-depth curation of all molecular data types. [43]
The widely used UniProt knowledgebase also has a community curation mechanism that allows researchers to add information about proteins. [44]
Bio-wikis rely on their communities to provide content, and a series of wiki-style resources are available for biocuration. [45] [46] AuthorReward, [47] for example, is an extension to MediaWiki that quantifies researchers' contributions to biology wikis. RiceWiki was an example of a wiki-based database for community curation of rice genes equipped with AuthorReward. [48] [49] CAZypedia is another such wiki, for community biocuration of information on carbohydrate-active enzymes (CAZys). [50]
WikiProteins/WikiProfessional was a project to semantically organize biological data, led by Barend Mons. [51] [52] The 2007 project had direct contributions from Jimmy Wales, Wikipedia co-founder, and took Wikipedia as an inspiration. [51] A currently active project that runs on an adaptation of the MediaWiki software is WikiPathways, which crowdsources information about biological pathways. [53]
There is some overlap between the work of biocurators and Wikipedia, with the boundaries between scientific databases and Wikipedia becoming increasingly blurred. [54] [46] [55] Databases like Rfam [56] [57] and the Protein Data Bank, [58] for example, make heavy use of Wikipedia and its editors to curate information. [59] [60] However, most databases offer highly structured data that is searchable in complex combinations, which is usually not possible on Wikipedia, although Wikidata aims to solve this problem to some extent.
The Gene Wiki project used Wikipedia for collaborative curation of thousands of genes and gene products, such as titin and insulin. [61] Several projects also employ Wikipedia as a platform for curation of medical information. [36]
One other way in which Wikipedia is used for biocuration is via its list articles. For example, the Comprehensive Antibiotic Resistance Database integrates its assessment of databases about antibiotic resistance with a particular Wikipedia list. [62]
The Wikimedia knowledge base Wikidata is increasingly being used by the biocuration community as an integrative repository across the life sciences. [63] Wikidata is seen by some as an alternative with better prospects of maintenance and interoperability than smaller, independent biological knowledge bases. [64] [65]
Wikidata has been used to curate information on SARS-CoV-2 and the COVID-19 pandemic [66] [67] and by the Gene Wiki project to curate information about genes. [68] Data from biocuration on Wikidata is reused by external resources via SPARQL queries. [69] Some projects use curation via Wikidata as a path to improve life sciences information on Wikipedia. [70]
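A minimal sketch of such reuse, querying the public Wikidata SPARQL endpoint from Python for a handful of human genes (Q7187, Q15978631, P31, P703 and P353 are real Wikidata identifiers):

```python
# Query Wikidata's SPARQL endpoint for human genes and their HGNC symbols.
import requests

query = """
SELECT ?gene ?symbol WHERE {
  ?gene wdt:P31 wd:Q7187 ;        # instance of: gene
        wdt:P703 wd:Q15978631 ;   # found in taxon: Homo sapiens
        wdt:P353 ?symbol .        # HGNC gene symbol
}
LIMIT 10
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "biocuration-example/0.1"},  # polite identification
)
for row in resp.json()["results"]["bindings"]:
    print(row["symbol"]["value"], row["gene"]["value"])
```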
One approach to involving the crowd in biocuration is via gamified platforms that use game design principles to boost engagement.
Natural language processing and text mining technologies can help biocurators extract information for manual curation. [80] Text mining can scale curation efforts, supporting the identification of gene names, for example, as well as the partial inference of ontologies. [81] [82] The conversion of unstructured assertions into structured information makes use of techniques like named entity recognition and dependency parsing. [83] Text mining of biomedical concepts faces challenges from variation in reporting, and the community is working to increase the machine-readability of articles. [84]
During the COVID-19 pandemic, biomedical text mining was heavily used to cope with the large amount of published scientific research on the topic (over 50,000 articles). [85]
The popular NLP Python package spaCy has a biomedical adaptation, scispaCy, which is maintained by the Allen Institute for AI. [86]
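A minimal usage sketch, assuming scispaCy and its small scientific model ("en_core_sci_sm") have been installed following the project's instructions:

```python
# Biomedical named entity recognition with scispaCy.
# Requires: pip install scispacy, plus the en_core_sci_sm model release
# from the scispaCy project page.
import spacy

nlp = spacy.load("en_core_sci_sm")
doc = nlp("BRCA1 mutations increase the risk of breast cancer.")
for ent in doc.ents:
    # each entity is a text span that can then be linked to an identifier
    print(ent.text, ent.label_)
```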
Among the challenges for text mining applied to biocuration is the difficulty of accessing the full texts of biomedical articles due to paywalls, linking the challenges of biocuration to those of the open-access movement. [87]
A complementary approach to biocuration via text mining involves applying optical character recognition (OCR) to biomedical figures, coupled with automatic annotation algorithms. This has been used, for example, to extract gene information from pathway figures. [88]
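A hedged sketch of the idea: run OCR over a figure with the pytesseract package and match the recognized tokens against known gene symbols; the published annotation algorithms are considerably more sophisticated than this:

```python
# Extract candidate gene symbols from a pathway figure via OCR.
# pytesseract wraps the Tesseract OCR engine; the symbol list is a toy
# stand-in for a proper gene nomenclature lookup.
import re
from PIL import Image
import pytesseract

KNOWN_SYMBOLS = {"TP53", "EGFR", "KRAS", "BRCA1"}   # toy lookup set

text = pytesseract.image_to_string(Image.open("pathway_figure.png"))
tokens = set(re.findall(r"[A-Z0-9]{2,}", text))      # crude symbol-like tokens
print(sorted(tokens & KNOWN_SYMBOLS))                # candidate gene mentions
```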
Suggestions for improving written text to facilitate annotation range from using controlled natural languages [89] to clearly associating concepts (such as genes and proteins) with the particular species of interest. [89]
While challenges remain, text-mining is already an integral part of the workflow of biocuration in several biological knowledgebases. [90]
The BioCreAtivE (Critical Assessment of Information Extraction systems in Biology) Challenge is a community-wide effort to develop and evaluate text mining and information extraction systems for the life sciences. The challenge was first launched in 2004 and has since become an important event in the biocuration and bioinformatics communities. [91] The main goal of the challenge is to foster the development of advanced computational tools that can effectively extract information from the vast amount of biological data available.
The BioCreative Challenge is organized into several subtasks that cover various aspects of text mining and information extraction in the life sciences. These subtasks include gene normalization, relation extraction, entity recognition, and document classification. Participants in the challenge are provided with a set of annotated data to develop and test their systems, and their performance is evaluated based on various metrics, such as precision, recall, and F-score. [91]
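For illustration, these metrics reduce to simple set arithmetic over gold-standard and predicted annotations; the document/mention pairs below are made up:

```python
# Precision, recall and F-score for a toy entity extraction run,
# comparing predicted (document, mention) pairs against a gold standard.
gold = {("doc1", "TP53"), ("doc1", "MDM2"), ("doc2", "EGFR")}
predicted = {("doc1", "TP53"), ("doc2", "EGFR"), ("doc2", "KRAS")}

tp = len(gold & predicted)                # true positives
precision = tp / len(predicted)           # fraction of predictions correct
recall = tp / len(gold)                   # fraction of gold items found
f_score = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")  # P=0.67 R=0.67 F=0.67
```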
The BioCreative Challenge has led to the development of many innovative text mining and information extraction systems that have greatly improved the efficiency and accuracy of biocuration efforts. These systems have been integrated into many biocuration pipelines and have helped to speed up the curation process and enhance the quality of curated data.
Biological databases are libraries of life science information, collected from scientific experiments, published literature, high-throughput experimental technologies, and computational analysis. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations, as well as similarities of biological sequences and structures.
The Gene Ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO, for example via enrichment analysis. GO is part of a larger classification effort, the Open Biomedical Ontologies, being one of the Initial Candidate Members of the OBO Foundry.
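The statistical core of a typical GO enrichment analysis is a hypergeometric test per term; the sketch below uses SciPy and made-up counts:

```python
# One-term GO enrichment: is the term over-represented in a study gene set
# relative to the annotated background? All counts here are illustrative.
from scipy.stats import hypergeom

M = 20000   # background: all annotated genes
n = 150     # background genes annotated with the GO term of interest
N = 300     # study set size (e.g. differentially expressed genes)
k = 12      # study genes annotated with the term

# P(X >= k) when drawing N genes at random without replacement
p_value = hypergeom.sf(k - 1, M, n, N)
print(f"enrichment p-value: {p_value:.2e}")
```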
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, USA.
Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical domain. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies in this field have been applied to the biomedical literature available through services such as PubMed.
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The latest version of Pfam, 36.0, was released in September 2023 and contains 20,795 families. It is currently provided through the InterPro database.
The Biological General Repository for Interaction Datasets (BioGRID) is a curated biological database of protein-protein interactions, genetic interactions, chemical interactions, and post-translational modifications, created in 2003 (originally referred to simply as the General Repository for Interaction Datasets) by Mike Tyers, Bobby-Joe Breitkreutz, and Chris Stark at the Lunenfeld-Tanenbaum Research Institute at Mount Sinai Hospital. It strives to provide a comprehensive curated resource for all major model organism species while attempting to remove redundancy to create a single mapping of data. Users of BioGRID can search for their protein, chemical or publication of interest and retrieve annotations, as well as curated data as reported by the primary literature and compiled by in-house large-scale curation efforts. BioGRID is hosted in Toronto, Ontario, Canada and Dallas, Texas, United States, and is partnered with the Saccharomyces Genome Database, FlyBase, WormBase, PomBase, and the Alliance of Genome Resources. BioGRID is funded by the NIH and CIHR, and is an observer member of the International Molecular Exchange Consortium.
The Saccharomyces Genome Database (SGD) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, which is commonly known as baker's or budding yeast. Further information is located at the Yeastract curated repository.
The Generic Model Organism Database (GMOD) project provides biological research communities with a toolkit of open-source software components for visualizing, annotating, managing, and storing biological data. The GMOD project is funded by the United States National Institutes of Health, National Science Foundation and the USDA Agricultural Research Service.
Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database originally developed at the Wellcome Trust Sanger Institute in collaboration with Janelia Farm, and currently hosted at the European Bioinformatics Institute. Rfam is designed to be similar to the Pfam database for annotating protein families.
The Gene Wiki is a project within Wikipedia that aims to describe the relationships and functions of all human genes. It was established to transfer information from scientific resources to Wikipedia stub articles.
The Arabidopsis Information Resource (TAIR) is a community resource and online model organism database of genetic and molecular biology data for the model plant Arabidopsis thaliana, commonly known as mouse-ear cress.
WikiPathways is a community resource for contributing and maintaining content dedicated to biological pathways. Any registered WikiPathways user can contribute, and anybody can become a registered user. Contributions are monitored by a group of admins, but the bulk of peer review, editorial curation, and maintenance is the responsibility of the user community. WikiPathways was originally built using MediaWiki software, a custom graphical pathway editing tool (PathVisio) and integrated BridgeDb databases covering major gene, protein, and metabolite systems. WikiPathways was founded in 2008 by Thomas Kelder, Alex Pico, Martijn Van Iersel, Kristina Hanspers, Bruce Conklin and Chris Evelo. The current architects are Alex Pico and Martina Summer-Kutmon.
DisProt is a manually curated biological database of intrinsically disordered proteins (IDPs) and regions (IDRs). DisProt annotations cover state information on the protein but also, when available, its state transitions, interactions and functional aspects of disorder detected by specific experimental methods. DisProt is hosted and maintained in the BioComputing UP laboratory.
Teresa K. Attwood is a professor of Bioinformatics in the Department of Computer Science and School of Biological Sciences at the University of Manchester and a visiting fellow at the European Bioinformatics Institute (EMBL-EBI). She held a Royal Society University Research Fellowship at University College London (UCL) from 1993 to 1999 and at the University of Manchester from 1999 to 2002.
Experimental factor ontology, also known as EFO, is an open-access ontology of experimental variables particularly those used in molecular biology. The ontology covers variables which include aspects of disease, anatomy, cell type, cell lines, chemical compounds and assay information. EFO is developed and maintained at the EMBL-EBI as a cross-cutting resource for the purposes of curation, querying and data integration in resources such as Ensembl, ChEMBL and Expression Atlas.
In bioinformatics, a gene disease database is a systematized collection of data, typically structured to model aspects of reality, that helps in understanding the underlying mechanisms of complex diseases by capturing the multiple composite interactions between phenotype-genotype relationships and gene-disease mechanisms. Gene disease databases integrate human gene-disease associations from various expert-curated databases and text mining derived associations, covering Mendelian, complex and environmental diseases.
Alexander George Bateman is a computational biologist and Head of Protein Sequence Resources at the European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL) in Cambridge, UK. He has led the development of the Pfam biological database and introduced the Rfam database of RNA families. He has also been involved in the use of Wikipedia for community-based annotation of biological databases.
PomBase is a model organism database that provides online access to the fission yeast Schizosaccharomyces pombe genome sequence and annotated features, together with a wide range of manually curated functional gene-specific data. The PomBase website was redeveloped in 2016 to provide users with a more fully integrated, better-performing service.
Model organism databases (MODs) are biological databases, or knowledgebases, dedicated to the provision of in-depth biological data for intensively studied model organisms. MODs allow researchers to easily find background information on large sets of genes, plan experiments efficiently, combine their data with existing knowledge, and construct novel hypotheses. They allow users to analyse results and interpret datasets, and the data they generate are increasingly used to describe less well studied species. Where possible, MODs share common approaches to collect and represent biological information. For example, all MODs use the Gene Ontology (GO) to describe functions, processes and cellular locations of specific gene products. Projects also exist to enable software sharing for curation, visualization and querying between different MODs. Organismal diversity and varying user requirements however mean that MODs are often required to customize capture, display, and provision of data.