Gene Ontology

Description: Resource with controlled vocabulary to describe the function of genes and gene products
Primary citation: PMID 36866529
Website: geneontology.org
License: CC BY 4.0

The Gene Ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. [1] More specifically, the project aims to: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO, for example via enrichment analysis. [2] [3] GO is part of a larger classification effort, the Open Biomedical Ontologies, being one of the Initial Candidate Members of the OBO Foundry. [4]

Whereas gene nomenclature focuses on naming genes and gene products, the Gene Ontology focuses on the function of genes and gene products. The GO also extends that effort by using a markup language to make the data (not only of the genes and their products but also of curated attributes) machine readable, and to do so in a way that is unified across all species (whereas gene nomenclature conventions vary by biological taxon).

History

The Gene Ontology was originally constructed in 1998 by a consortium of researchers studying the genomes of three model organisms: Drosophila melanogaster (fruit fly), Mus musculus (mouse), and Saccharomyces cerevisiae (brewer's or baker's yeast). [5] Many other model organism databases have since joined the Gene Ontology Consortium, contributing not only annotation data but also to the development of ontologies and of tools to view and apply the data. Many major plant, animal, and microorganism databases contribute to the project. [6] As of July 2019, the GO contains 44,945 terms; there are 6,408,283 annotations to 4,467 different biological organisms. [6] There is a significant body of literature on the development and use of the GO, and it has become a standard tool in the bioinformatics arsenal. The Consortium's objectives have three aspects: building the ontology, assigning ontology terms to genes and gene products, and developing software and databases for the first two objectives.

Several analyses of the Gene Ontology using formal, domain-independent properties of classes (the metaproperties) are also starting to appear. For instance, there is now an ontological analysis of biological ontologies. [7]

Terms and ontology

From a practical view, an ontology is a representation of something we know about. "Ontologies" consist of representations of things that are detectable or directly observable and the relationships between those things. There is no universal standard terminology in biology and related domains, and term usage may be specific to a species, research area, or even a particular research group. This makes communication and sharing of data more difficult. The Gene Ontology project provides an ontology of defined terms representing gene product properties. The ontology covers three domains: cellular component, molecular function, and biological process.

Each GO term within the ontology has a term name, which may be a word or string of words; a unique alphanumeric identifier; a definition with cited sources; and an ontology indicating the domain to which it belongs. Terms may also have synonyms, which are classed as being exactly equivalent to the term name, broader, narrower, or related; references to equivalent concepts in other databases; and comments on term meaning or usage. The GO ontology is structured as a directed acyclic graph, and each term has defined relationships to one or more other terms in the same domain, and sometimes to other domains. The GO vocabulary is designed to be species-neutral and includes terms applicable to prokaryotes and eukaryotes, single and multicellular organisms.
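The term structure described above can be sketched as a small record type plus a check that the parent links really form a directed acyclic graph. This is only an illustration: the GOTerm class, its field names, and the is_acyclic helper are hypothetical and are not part of any GO software; the two terms come from the example term shown in this article.

```python
from dataclasses import dataclass, field


@dataclass
class GOTerm:
    """Hypothetical minimal record for one GO term."""
    go_id: str                                     # unique identifier
    name: str                                      # term name
    domain: str                                    # domain the term belongs to
    definition: str = ""                           # definition text
    synonyms: list = field(default_factory=list)   # (text, scope) pairs
    parents: list = field(default_factory=list)    # ids of parent terms


def is_acyclic(terms):
    """Return True if following parent links never leads back to a term
    that is still on the current path (depth-first search with colouring)."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {go_id: WHITE for go_id in terms}

    def visit(go_id):
        state[go_id] = GREY
        for parent in terms[go_id].parents:
            if parent not in terms:
                continue  # parent lies outside this toy graph
            if state[parent] == GREY:
                return False  # back edge: a cycle exists
            if state[parent] == WHITE and not visit(parent):
                return False
        state[go_id] = BLACK
        return True

    return all(visit(g) for g in terms if state[g] == WHITE)


terms = {
    "GO:0000016": GOTerm("GO:0000016", "lactase activity",
                         "molecular_function", parents=["GO:0004553"]),
    "GO:0004553": GOTerm("GO:0004553",
                         "hydrolase activity, hydrolyzing O-glycosyl compounds",
                         "molecular_function"),
}
print(is_acyclic(terms))
```

Because a term may list several parents, the structure is a graph rather than a tree; the check only guarantees that no term is its own ancestor.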

GO is not static, and additions, corrections, and alterations are suggested by and solicited from members of the research and annotation communities, as well as by those directly involved in the GO project. [8] For example, an annotator may request a specific term to represent a metabolic pathway, or a section of the ontology may be revised with the help of community experts (e.g. [9] ). Suggested edits are reviewed by the ontology editors, and implemented where appropriate.

The GO ontology and annotation files are freely available from the GO website in a number of formats or can be accessed online using the GO browser AmiGO. [6] The Gene Ontology project also provides downloadable mappings of its terms to other classification systems.

Example term

id: GO:0000016
name: lactase activity
ontology: molecular_function
def: "Catalysis of the reaction: lactose + H2O = D-glucose + D-galactose." [EC:3.2.1.108]
synonym: "lactase-phlorizin hydrolase activity" BROAD [EC:3.2.1.108]
synonym: "lactose galactohydrolase activity" EXACT [EC:3.2.1.108]
xref: EC:3.2.1.108
xref: MetaCyc:LACTASE-RXN
xref: Reactome:20536
is_a: GO:0004553 ! hydrolase activity, hydrolyzing O-glycosyl compounds

Data source: [10]
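A term record in the tag-value layout shown above is straightforward to read programmatically. The following is a minimal sketch, assuming each line has the form "tag: value" and that a tag may repeat; the function names parse_stanza and parent_ids are hypothetical, and the sketch does not aim to handle the full file format.

```python
from collections import defaultdict

STANZA = '''id: GO:0000016
name: lactase activity
ontology: molecular_function
def: "Catalysis of the reaction: lactose + H2O = D-glucose + D-galactose." [EC:3.2.1.108]
synonym: "lactase-phlorizin hydrolase activity" BROAD [EC:3.2.1.108]
synonym: "lactose galactohydrolase activity" EXACT [EC:3.2.1.108]
xref: EC:3.2.1.108
xref: MetaCyc:LACTASE-RXN
xref: Reactome:20536
is_a: GO:0004553 ! hydrolase activity, hydrolyzing O-glycosyl compounds'''


def parse_stanza(text):
    """Collect every 'tag: value' line into a dict mapping tag -> list of values."""
    fields = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first ': ' only, so colons inside the value survive.
        tag, _, value = line.partition(": ")
        fields[tag].append(value)
    return dict(fields)


def parent_ids(fields):
    """Extract parent identifiers from is_a values, dropping the '! name' comment."""
    return [value.split(" ! ")[0] for value in fields.get("is_a", [])]


parsed = parse_stanza(STANZA)
print(parsed["name"], parent_ids(parsed))
```

Storing every tag as a list keeps repeated tags such as synonym and xref without special cases.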

Annotation

Genome annotation encompasses the practice of capturing data about a gene product, and GO annotations use terms from the GO to do so. Annotations from GO curators are integrated and disseminated on the GO website, where they can be downloaded directly or viewed online using AmiGO. [11] In addition to the gene product identifier and the relevant GO term, GO annotations have at least the following data: the reference used to make the annotation (e.g. a journal article); an evidence code denoting the type of evidence upon which the annotation is based; and the date and the creator of the annotation.

Supporting information, depending on the GO term and evidence used, and supplementary information, such as the conditions the function is observed under, may also be included in a GO annotation.

The evidence code comes from a controlled vocabulary of codes, the Evidence Code Ontology, covering both manual and automated annotation methods. [12] For example, Traceable Author Statement (TAS) means a curator has read a published scientific paper and the metadata for that annotation bears a citation to that paper; Inferred from Sequence Similarity (ISS) means a human curator has reviewed the output from a sequence similarity search and verified that it is biologically meaningful. Annotations from automated processes (for example, remapping annotations created using another annotation vocabulary) are given the code Inferred from Electronic Annotation (IEA). In 2010, over 98% of all GO annotations were inferred computationally, not by curators, but as of July 2, 2019, only about 30% of all GO annotations were inferred computationally. [13] [14] As these annotations are not checked by a human, the GO Consortium considers them to be marginally less reliable, and they are commonly made to higher-level, less detailed terms. Full annotation data sets can be downloaded from the GO website. To support the development of annotation, the GO Consortium provides workshops and mentors new groups of curators and developers.
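The share of computationally inferred annotations mentioned above is simply the fraction of annotations carrying the IEA code. A minimal sketch of that tally follows; the function name computational_fraction is hypothetical and the list of codes is made-up toy data, not drawn from any real annotation file.

```python
from collections import Counter

# Codes treated as automated; per the text, IEA marks electronic annotation.
ELECTRONIC_CODES = {"IEA"}


def computational_fraction(evidence_codes):
    """Return the fraction of annotations whose evidence code is electronic."""
    counts = Counter(evidence_codes)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # avoid dividing by zero on an empty data set
    electronic = sum(counts[code] for code in ELECTRONIC_CODES)
    return electronic / total


# Toy data for illustration only.
toy_codes = ["IEA", "TAS", "ISS", "IEA", "IMP"]
print(computational_fraction(toy_codes))
```

The same counter can be reused to filter a data set down to manually reviewed annotations by excluding the codes in ELECTRONIC_CODES.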

Many machine learning algorithms have been designed and implemented to predict Gene Ontology annotations. [15] [16]

Example annotation

Gene product: Actin, alpha cardiac muscle 1, UniProtKB:P68032
GO term: heart contraction; GO:0060047 (biological process)
Evidence code: Inferred from Mutant Phenotype (IMP)
Reference: PMID 17611253
Assigned by: UniProtKB, June 6, 2008

Data source: [17]
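The minimum fields of an annotation can be captured in a small validated record, shown here with the example annotation above. This is a sketch under assumptions: the GOAnnotation class and its field names are hypothetical, and the identifier pattern ("GO:" followed by seven digits) is inferred from the identifiers shown in this article rather than taken from a specification.

```python
import re
from dataclasses import dataclass
from datetime import date

# Assumed identifier shape, inferred from examples such as GO:0060047.
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")


@dataclass(frozen=True)
class GOAnnotation:
    """Hypothetical record holding the minimum data of one GO annotation."""
    gene_product: str     # gene product identifier
    go_id: str            # GO term identifier
    evidence_code: str    # e.g. "IMP", "TAS", "IEA"
    reference: str        # source used to make the annotation
    assigned_by: str      # creator of the annotation
    assigned_on: date     # date of the annotation

    def __post_init__(self):
        if not GO_ID_PATTERN.match(self.go_id):
            raise ValueError(f"unexpected GO identifier: {self.go_id!r}")
        for name in ("gene_product", "evidence_code", "reference", "assigned_by"):
            if not getattr(self, name):
                raise ValueError(f"missing required field: {name}")


example = GOAnnotation(
    gene_product="UniProtKB:P68032",
    go_id="GO:0060047",
    evidence_code="IMP",
    reference="PMID:17611253",
    assigned_by="UniProtKB",
    assigned_on=date(2008, 6, 6),
)
print(example.go_id, example.evidence_code)
```

Rejecting incomplete records at construction time mirrors the requirement that every annotation carries a reference, an evidence code, a date, and a creator.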

Tools

There are a large number of tools available, both online and for download, that use the data provided by the GO project. [18] The vast majority of these come from third parties; the GO Consortium develops and supports two tools, AmiGO and OBO-Edit.

AmiGO [19] [11] is a web-based application that allows users to query, browse, and visualize ontologies and gene product annotation data. It also has a BLAST tool, [20] tools allowing analysis of larger data sets, [21] [22] and an interface to query the GO database directly. [23] AmiGO can be used online at the GO website to access the data provided by the GO Consortium or downloaded and installed for local use on any database employing the GO database schema (e.g. [24] ). It is free open source software and is available as part of the go-dev software distribution. [25]

OBO-Edit is an open source, platform-independent ontology editor developed and maintained by the Gene Ontology Consortium. [26] It is implemented in Java and uses a graph-oriented approach to display and edit ontologies. OBO-Edit includes a comprehensive search and filter interface, with the option to render subsets of terms to make them visually distinct; the user interface can also be customized according to user preferences. OBO-Edit also has a reasoner that can infer links that have not been explicitly stated based on existing relationships and their properties. Although it was developed for biomedical ontologies, OBO-Edit can be used to view, search, and edit any ontology. It is freely available to download. [25]
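The idea of inferring links that were never stated explicitly can be illustrated with a transitive relationship: if term_a is a kind of term_b and term_b is a kind of term_c, then term_a is implicitly a kind of term_c. The sketch below shows only this one rule on toy labels; it is not OBO-Edit's reasoner, and the function name infer_implied_links is hypothetical.

```python
def infer_implied_links(asserted):
    """Given asserted (child, parent) pairs of a transitive relationship,
    return the implied (child, ancestor) pairs that were not asserted."""
    parents = {}
    for child, parent in asserted:
        parents.setdefault(child, set()).add(parent)

    def ancestors(node, seen):
        # Walk upward through parent links, collecting every reachable term.
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    closure = {(child, anc) for child in parents for anc in ancestors(child, set())}
    return closure - set(asserted)


# Toy labels for illustration only.
asserted_links = {("term_a", "term_b"), ("term_b", "term_c")}
print(infer_implied_links(asserted_links))
```

A real reasoner would also take into account the properties of each relationship type, since not every relationship is transitive.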

Consortium

The Gene Ontology Consortium is the set of biological databases and research groups actively involved in the gene ontology project. [14] This includes a number of model organism databases and multi-species protein databases, software development groups, and a dedicated editorial office.

References

  1. The Gene Ontology Consortium (January 2008). "The Gene Ontology project in 2008". Nucleic Acids Research. 36 (Database issue): D440–4. doi:10.1093/nar/gkm883. PMC 2238979. PMID 17984083.
  2. Dessimoz, Christophe; Škunca, Nives, eds. (2017). The Gene Ontology Handbook. Methods in Molecular Biology. Vol. 1446. doi:10.1007/978-1-4939-3743-1. ISBN 9781493937431. ISSN 1064-3745. S2CID 3708801.
  3. Gaudet, Pascale; Škunca, Nives; Hu, James C.; Dessimoz, Christophe (2017). "Primer on the Gene Ontology". The Gene Ontology Handbook. Methods in Molecular Biology. Vol. 1446. pp. 25–37. doi:10.1007/978-1-4939-3743-1_3. ISBN 978-1-4939-3741-7. ISSN 1064-3745. PMC 6377150. PMID 27812933.
  4. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K, Ireland A, Mungall CJ, Leontis N, Rocca-Serra P, Ruttenberg A, Sansone SA, Scheuermann RH, Shah N, Whetzel PL, Lewis S (November 2007). "The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration". Nature Biotechnology. 25 (11): 1251–5. doi:10.1038/nbt1346. PMC 2814061. PMID 17989687.
  5. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (May 2000). "Gene ontology: tool for the unification of biology. The Gene Ontology Consortium". Nature Genetics. 25 (1): 25–9. doi:10.1038/75556. PMC 3037419. PMID 10802651.
  6. "The Gene Ontology Resource". Gene Ontology Consortium.
  7. Deb, B. (2012). "An ontological analysis of some biological ontologies". Frontiers in Genetics. 3: 269. doi:10.3389/fgene.2012.00269. PMC 3509948. PMID 23226158.
  8. Lovering, Ruth C. (2017). "How Does the Scientific Community Contribute to Gene Ontology?". In Dessimoz, C; Skunca, N (eds.). The Gene Ontology Handbook. Methods in Molecular Biology. Vol. 1446. Springer (New York). pp. 85–93. doi:10.1007/978-1-4939-3743-1_7. ISBN 978-1-4939-3741-7. ISSN 1064-3745. PMID 27812937. S2CID 4924457.
  9. Diehl AD, Lee JA, Scheuermann RH, Blake JA (April 2007). "Ontology development for biological systems: immunology". Bioinformatics. 23 (7): 913–5. doi:10.1093/bioinformatics/btm029. PMID 17267433.
  10. "AmiGO 2 Manual: Term Page". Gene Ontology Consortium Wiki. 2013-07-10.
  11. AmiGO--the current official web-based set of tools for searching and browsing the Gene Ontology database
  12. "Evidence Code Ontology". Evidence Code Ontology.
  13. du Plessis L, Skunca N, Dessimoz C (November 2011). "The what, where, how and why of gene ontology--a primer for bioinformaticians". Briefings in Bioinformatics. 12 (6): 723–35. doi:10.1093/bib/bbr002. PMC 3220872. PMID 21330331.
  14. "The GO Consortium". Retrieved 2009-03-16.
  15. Pinoli P, Chicco D, Masseroli M (June 2013). "Computational algorithms to predict Gene Ontology annotation". BMC Bioinformatics. 16 (6): S4. doi:10.1186/1471-2105-16-S6-S4. PMC 4416163. PMID 25916950.
  16. Cozzetto, Domenico; Jones, David T. (2017). "Computational Methods for Annotation Transfers from Sequence". In Dessimoz, C; Skunca, N (eds.). The Gene Ontology Handbook. Methods in Molecular Biology. Vol. 1446. Springer (New York). pp. 55–67. doi:10.1007/978-1-4939-3743-1_5. ISBN 978-1-4939-3741-7. ISSN 1064-3745. PMID 27812935.
  17. The GO Consortium (2009-03-16). "AmiGO: P68032 Associations".
  18. Mosquera JL, Sánchez-Pla A (July 2008). "SerbGO: searching for the best GO tool". Nucleic Acids Research. 36 (Web Server issue): W368–71. doi:10.1093/nar/gkn256. PMC 2447766. PMID 18480123.
  19. Carbon S, Ireland A, Mungall CJ, Shu S, Marshall B, Lewis S (January 2009). AmiGO Hub; Web Presence Working Group. "AmiGO: online access to ontology and annotation data". Bioinformatics. 25 (2): 288–9. doi:10.1093/bioinformatics/btn615. PMC 2639003. PMID 19033274.
  20. "AmiGO BLAST tool". Archived from the original on 2011-08-20. Retrieved 2009-03-13.
  21. AmiGO Term Enrichment tool, archived 2008-04-07 at the Wayback Machine; finds significant shared GO terms in an annotation set
  22. AmiGO Slimmer, archived 2011-09-29 at the Wayback Machine; maps granular annotations up to high-level terms
  23. GOOSE, GO Online SQL Environment, archived 2009-03-01 at the Wayback Machine; allows direct SQL querying of the GO database
  24. The Plant Ontology Consortium (2009-03-16). "Plant Ontology Consortium". Retrieved 2009-03-16.
  25. "Gene Ontology downloads at SourceForge". Retrieved 2009-03-16.
  26. Day-Richter J, Harris MA, Haendel M, Lewis S (August 2007). "OBO-Edit--an ontology editor for biologists". Bioinformatics. 23 (16): 2198–200. doi:10.1093/bioinformatics/btm112. PMID 17545183.