OrthoDB

Content
Description: Catalog of orthologs
Contact
Research center: Swiss Institute of Bioinformatics
Laboratory: Computational Evolutionary Genomics Group
Authors: Evgenia V. Kriventseva
Primary citation: Kriventseva et al. (2015) [1]
Release date: 2007
Access
Website: www.orthodb.org
Download URL: https://www.orthodb.org/?page=filelist
SPARQL endpoint: sparql.orthodb.org/sparql
Miscellaneous
License: CC-BY-3.0

OrthoDB [1] [2] [3] [4] presents a catalog of orthologous protein-coding genes across vertebrates, arthropods, fungi, plants, and bacteria. Orthology is defined relative to the last common ancestor of the species under consideration, so OrthoDB explicitly delineates orthologs at each major radiation along the species phylogeny. The database presents available protein descriptors together with Gene Ontology and InterPro attributes, which provide general descriptive annotations of the orthologous groups and support comprehensive querying. OrthoDB also provides computed evolutionary traits of orthologs, such as gene duplicability and loss profiles, divergence rates, sibling groups, and gene intron-exon architectures.

In comparative genomics, the importance of scale should not be underestimated. Gene orthology delineation requires specific expertise and considerable computational resources, so individual non-specialist research groups cannot readily achieve such scale on their own. OrthoDB addresses this task with comprehensive sets of species and several distinctive features, including extensive functional and evolutionary annotations of orthologous groups and links to other databases that focus on capturing information about gene function. A genome is of limited use as a data source without comparative analyses against other genomes. OrthoDB therefore serves as a comparative genomics resource for researchers ranging from those interested in broad evolutionary questions to those focused on the specific biological functions of individual genes.

Methodology

Orthology is defined relative to the last common ancestor of the species being considered, which makes orthologous classifications inherently hierarchical. OrthoDB addresses this explicitly by applying the orthology delineation procedure at each major radiation point of the considered phylogeny. The OrthoDB implementation employs a Best-Reciprocal-Hit (BRH) clustering algorithm based on all-against-all Smith–Waterman protein sequence comparisons. Gene set pre-processing selects the longest protein-coding transcript of alternatively spliced genes and of very similar gene copies. The procedure triangulates BRHs to progressively build the clusters and requires an overall minimum sequence alignment overlap to avoid domain walking. These core clusters are then expanded to include all more closely related within-species in-paralogs and the very similar gene copies identified during pre-processing.
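The reciprocal-hit and triangulation steps described above can be illustrated with a minimal Python sketch. This is not the OrthoDB implementation: the input format (a dictionary of pairwise alignment scores keyed by `(species, gene_id)` tuples), the function names, and the example scores are all assumptions made for illustration, and the sketch omits the alignment-overlap filter and the in-paralog expansion.

```python
from itertools import combinations


def best_hits(scores):
    """For each gene, find its highest-scoring match in every other species.

    `scores` maps (query, subject) -> alignment score, where each gene is a
    (species, gene_id) tuple. This input format is hypothetical.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query[0] == subject[0]:
            continue  # ignore within-species comparisons
        key = (query, subject[0])
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: hit for key, (hit, _) in best.items()}


def reciprocal_pairs(scores):
    """Return gene pairs that are each other's best hit (BRHs)."""
    best = best_hits(scores)
    pairs = set()
    for (query, _species), subject in best.items():
        if best.get((subject, query[0])) == query:
            pairs.add(frozenset((query, subject)))
    return pairs


def brh_triangles(pairs):
    """Return gene triples in which every pair is a BRH."""
    genes = sorted({gene for pair in pairs for gene in pair})
    return [
        trio
        for trio in combinations(genes, 3)
        if all(frozenset(edge) in pairs for edge in combinations(trio, 2))
    ]


# Made-up scores for three species A, B and C; b2 is a weaker match to a1.
a1, b1, b2, c1 = ("A", "a1"), ("B", "b1"), ("B", "b2"), ("C", "c1")
example_scores = {
    (a1, b1): 300, (b1, a1): 300,
    (a1, b2): 120, (b2, a1): 120,
    (a1, c1): 280, (c1, a1): 280,
    (b1, c1): 260, (c1, b1): 260,
}
example_pairs = reciprocal_pairs(example_scores)
example_triangles = brh_triangles(example_pairs)
```

In this toy data, `b2` has `a1` as its best hit in species A, but `a1` prefers `b1`, so the pair is not reciprocal and `b2` stays out of the core triangle. In the procedure described above, such a gene could still join the cluster later as an in-paralog.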

Data content

The database contains some 600 eukaryotic species and more than 3600 bacteria [1] sourced from Ensembl, UniProt, NCBI, FlyBase, and several other databases. The ever-increasing sampling of sequenced genomes gives a clearer account of most gene genealogies, which facilitates informed hypotheses of gene function in newly sequenced genomes.

Examples of studies that have employed data from OrthoDB include comparative analyses of gene repertoire evolution [5] [6], comparisons of fruit fly and mosquito developmental genes [7], analyses of bloodmeal- or infection-induced changes in gene expression in mosquitoes [8] [9] [10], analysis of the evolution of mammalian milk production [11], and mosquito gene and genome evolution [12]. Other studies citing OrthoDB can be found at PubMed and Google Scholar.

Performance

OrthoDB has performed consistently well in benchmarking assessments alongside other orthology delineation procedures. Results were compared to reference trees for three well-conserved protein families [13] and to a larger set of curated protein families [14].

BUSCO

Benchmarking sets of Universal Single-Copy Orthologs (BUSCO) [15] are built from orthologous groups selected from OrthoDB for the root-level classifications of arthropods, vertebrates, metazoans, fungi, and other major clades. Groups are required to contain single-copy orthologs in at least 90% of the species (in the others they may be lost or duplicated), and the missing species cannot all be from the same clade. Species with frequent losses or duplications are removed from the selection unless they hold a key position in the phylogeny. BUSCOs are therefore expected to be found as single-copy orthologs in any newly sequenced genome from the appropriate phylogenetic clade, and can be used to assess the relative completeness of newly sequenced genomes. The BUSCO assessment tool and datasets are widely used in genomics projects, with most journal editors now requiring such quality assessments before accepting new genome publications.
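The selection rule described above can be sketched as a small Python filter. This is an illustrative reading of the criteria, not the BUSCO pipeline: the function name, the `copy_counts` and `clade_of` inputs, and the interpretation of the clade rule (reject a group when two or more non-single-copy species all belong to one clade) are assumptions. The removal of species with frequent losses or duplications is not modeled.

```python
def is_busco_candidate(copy_counts, clade_of, min_fraction=0.9):
    """Decide whether an orthologous group passes the single-copy criteria.

    copy_counts: species -> number of genes in the group (0 if absent).
    clade_of:    species -> clade label.
    Both inputs and the default threshold are hypothetical illustrations.
    """
    species = list(copy_counts)
    if not species:
        return False

    single_copy = [s for s in species if copy_counts[s] == 1]
    if len(single_copy) / len(species) < min_fraction:
        return False  # fewer than 90% of species are single-copy

    # Species where the gene is lost (0 copies) or duplicated (>1 copy).
    exceptions = [s for s in species if copy_counts[s] != 1]
    exception_clades = {clade_of[s] for s in exceptions}
    if len(exceptions) > 1 and len(exception_clades) == 1:
        return False  # assumed reading: exceptions concentrated in one clade

    return True


# Twenty made-up species split evenly between two made-up clades.
example_species = [f"sp{i}" for i in range(20)]
example_clades = {s: ("cladeX" if i < 10 else "cladeY")
                  for i, s in enumerate(example_species)}

spread = {s: 1 for s in example_species}
spread["sp0"] = 0    # lost in one cladeX species
spread["sp15"] = 2   # duplicated in one cladeY species

concentrated = {s: 1 for s in example_species}
concentrated["sp0"] = 0
concentrated["sp1"] = 0  # both losses fall in cladeX
```

Here `is_busco_candidate(spread, example_clades)` accepts the group because 18 of 20 species are single-copy and the two exceptions lie in different clades, while the `concentrated` group is rejected under the assumed clade rule.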

Notes and references

  1. Kriventseva EV, Tegenfeldt F, Petty TJ, Waterhouse RM, Simão FA, Pozdnyakov IA, Ioannidis P, Zdobnov EM (January 2015). "OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software". Nucleic Acids Res. 43 (Database issue): D250–6. doi:10.1093/nar/gku1220. PMC 4383991. PMID 25428351.
  2. Waterhouse RM, Tegenfeldt F, Li J, Zdobnov EM, Kriventseva EV (January 2013). "OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs". Nucleic Acids Res. 41 (Database issue): D358–65. doi:10.1093/nar/gks1116. PMC 3531149. PMID 23180791.
  3. Waterhouse RM, Zdobnov EM, Tegenfeldt F, Li J, Kriventseva EV (January 2011). "OrthoDB: the hierarchical catalog of eukaryotic orthologs in 2011". Nucleic Acids Res. 39 (Database issue): D283–8. doi:10.1093/nar/gkq930. PMC 3013786. PMID 20972218.
  4. Kriventseva EV, Rahman N, Espinosa O, Zdobnov EM (January 2008). "OrthoDB: the hierarchical catalog of eukaryotic orthologs". Nucleic Acids Res. 36 (Database issue): D271–5. doi:10.1093/nar/gkm845. PMC 2238902. PMID 17947323.
  5. Waterhouse RM, Zdobnov EM, Kriventseva EV (January 2011). "Correlating traits of gene retention, sequence divergence, duplicability and essentiality in vertebrates, arthropods, and fungi". Genome Biol. Evol. 3: 75–86. doi:10.1093/gbe/evq083. PMC 3030422. PMID 21148284.
  6. Hase T, Niimura Y, Tanaka H (2010). "Difference in gene duplicability may explain the difference in overall structure of protein-protein interaction networks among eukaryotes". BMC Evol. Biol. 10: 358. doi:10.1186/1471-2148-10-358. PMC 2994879. PMID 21087510.
  7. Behura SK, Haugen M, Flannery E, Sarro J, Tessier CR, Severson DW, Duman-Scheel M (2011). "Comparative Genomic Analysis of Drosophila melanogaster and Vector Mosquito Developmental Genes". PLOS ONE. 6 (7): e21504. Bibcode:2011PLoSO...621504B. doi:10.1371/journal.pone.0021504. PMC 3130749. PMID 21754989.
  8. Bonizzoni M, Dunn WA, Campbell CL, Olson KE, Dimon MT, Marinotti O, James AA (2011). "RNA-seq analyses of blood-induced changes in gene expression in the mosquito vector species, Aedes aegypti". BMC Genomics. 12: 82. doi:10.1186/1471-2164-12-82. PMC 3042412. PMID 21276245.
  9. Pinto SB, Lombardo F, Koutsos AC, Waterhouse RM, McKay K, An C, Ramakrishnan C, Kafatos FC, Michel K (2009). "Discovery of Plasmodium modulators by genome-wide analysis of circulating hemocytes in Anopheles gambiae". Proc Natl Acad Sci U S A. 106 (50): 21270–5. Bibcode:2009PNAS..10621270P. doi:10.1073/pnas.0909463106. PMC 2783009. PMID 19940242.
  10. Bartholomay LC, Waterhouse RM, Mayhew GF, Campbell CL, Michel K, Zou Z, Ramirez JL, Das S, Alvarez K, Arensburger P, Bryant B, Chapman SB, Dong Y, Erickson SM, Karunaratne SH, Kokoza V, Kodira CD, Pignatelli P, Shin SW, Vanlandingham DL, Atkinson PW, Birren B, Christophides GK, Clem RJ, Hemingway J, Higgs S, Megy K, Ranson H, Zdobnov EM, Raikhel AS, Christensen BM, Dimopoulos G, Muskavitch MA (2010). "Pathogenomics of Culex quinquefasciatus and meta-analysis of infection responses to diverse pathogens". Science. 330 (6000): 88–90. Bibcode:2010Sci...330...88B. doi:10.1126/science.1193162. PMC 3104938. PMID 20929811.
  11. Lemay DG, Lynn DJ, Martin WF, Neville MC, Casey TM, Rincon G, Kriventseva EV, Barris WC, Hinrichs AS, Molenaar AJ, Pollard KS, Maqbool NJ, Singh K, Murney R, Zdobnov EM, Tellam RL, Medrano JF, German JB, Rijnkels M (2009). "The bovine lactation genome: insights into the evolution of mammalian milk". Genome Biol. 10 (4): R43. doi:10.1186/gb-2009-10-4-r43. PMC 2688934. PMID 19393040.
  12. Neafsey DE, Waterhouse RM, Abai MR, Aganezov SS, Alekseyev MA, Allen JE, Amon J, Arcà B, Arensburger P, Artemov G, Assour LA, Basseri H, Berlin A, Birren BW, Blandin SA, Brockman AI, Burkot TR, Burt A, Chan CS, Chauve C, Chiu JC, Christensen M, Costantini C, Davidson VL, Deligianni E, Dottorini T, Dritsou V, Gabriel SB, Guelbeogo WM, Hall AB, Han MV, Hlaing T, Hughes DS, Jenkins AM, Jiang X, Jungreis I, Kakani EG, Kamali M, Kemppainen P, Kennedy RC, Kirmitzoglou IK, Koekemoer LL, Laban N, Langridge N, Lawniczak MK, Lirakis M, Lobo NF, Lowy E, MacCallum RM, Mao C, Maslen G, Mbogo C, McCarthy J, Michel K, Mitchell SN, Moore W, Murphy KA, Naumenko AN, Nolan T, Novoa EM, O'Loughlin S, Oringanje C, Oshaghi MA, Pakpour N, Papathanos PA, Peery AN, Povelones M, Prakash A, Price DP, Rajaraman A, Reimer LJ, Rinker DC, Rokas A, Russell TL, Sagnon N, Sharakhova MV, Shea T, Simão FA, Simard F, Slotman MA, Somboon P, Stegniy V, Struchiner CJ, Thomas GW, Tojo M, Topalis P, Tubio JM, Unger MF, Vontas J, Walton C, Wilding CS, Willis JH, Wu YC, Yan G, Zdobnov EM, Zhou X, Catteruccia F, Christophides GK, Collins FH, Cornman RS, Crisanti A, Donnelly MJ, Emrich SJ, Fontaine MC, Gelbart W, Hahn MW, Hansen IA, Howell PI, Kafatos FC, Kellis M, Lawson D, Louis C, Luckhart S, Muskavitch MA, Ribeiro JM, Riehle MA, Sharakhov IV, Tu Z, Zwiebel LJ, Besansky NJ (January 2015). "Highly evolvable malaria vectors: the genomes of 16 Anopheles mosquitoes". Science. 347 (6217): 1258522. Bibcode:2015Sci...347...43N. doi:10.1126/science.1258522. PMC 4380271. PMID 25554792.
  13. Boeckmann B, Robinson-Rechavi M, Xenarios I, Dessimoz C (September 2011). "Conceptual framework and pilot study to benchmark phylogenomic databases based on reference gene trees". Brief. Bioinform. 12 (5): 423–35. doi:10.1093/bib/bbr034. PMC 3178055. PMID 21737420.
  14. OrthoBench (http://eggnog.embl.de/orthobench). Trachana K, Larsson TA, Powell S, Chen WH, Doerks T, Muller J, Bork P (October 2011). "Orthology prediction methods: a quality assessment using curated protein families". BioEssays. 33 (10): 769–80. doi:10.1002/bies.201100062. PMC 3193375. PMID 21853451.
  15. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM (June 2015). "BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs". Bioinformatics. 31 (19): 3210–2. doi:10.1093/bioinformatics/btv351. PMID 26059717.
