Content | |
---|---|
Description | Catalog of Orthologs. |
Contact | |
Research center | Swiss Institute of Bioinformatics |
Laboratory | Computational Evolutionary Genomics Group |
Authors | Evgenia V. Kriventseva |
Primary citation | Kriventseva et al. (2015) [1] |
Release date | 2007 |
Access | |
Website | www |
Download URL | https://www.orthodb.org/?page=filelist |
Sparql endpoint | sparql |
Miscellaneous | |
License | CC-BY-3.0 |
OrthoDB [1] [2] [3] [4] presents a catalog of orthologous protein-coding genes across vertebrates, arthropods, fungi, plants, and bacteria. Orthology refers to the last common ancestor of the species under consideration, and thus OrthoDB explicitly delineates orthologs at each major radiation along the species phylogeny. The database of orthologs presents available protein descriptors, together with Gene Ontology and InterPro attributes, which serve to provide general descriptive annotations of the orthologous groups, and facilitate comprehensive orthology database querying. OrthoDB also provides computed evolutionary traits of orthologs, such as gene duplicability and loss profiles, divergence rates, sibling groups, and gene intron-exon architectures.
In comparative genomics, the importance of scale cannot be overstated. Because gene orthology delineation requires specific expertise and considerable computational resources, individual non-specialist research groups cannot readily achieve such scale on their own. OrthoDB addresses this with comprehensive sets of species and features such as extensive functional and evolutionary annotations of orthologous groups, together with links to other major databases that capture information about gene function. Interpreting a genome depends on comparative analyses with other genomes, and OrthoDB provides a resource for comparative genomics to researchers ranging from those interested in grand evolutionary questions to those focused on the specific biological functions of individual genes.
Orthology is defined relative to the last common ancestor of the species being considered, thereby determining the hierarchical nature of orthologous classifications. This is explicitly addressed in OrthoDB by application of the orthology delineation procedure at each major radiation point of the considered phylogeny. The OrthoDB implementation employs a Best-Reciprocal-Hit (BRH) clustering algorithm based on all-against-all Smith–Waterman protein sequence comparisons. Gene set pre-processing selects the longest protein-coding transcript of alternatively spliced genes and of very similar gene copies. The procedure triangulates BRHs to progressively build the clusters and requires an overall minimum sequence alignment overlap to avoid domain walking. These core clusters are further expanded to include all more closely related within-species in-paralogs, and the previously identified very similar gene copies.
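The core of the procedure described above can be illustrated with a minimal Python sketch. It is not OrthoDB's implementation: it assumes pairwise alignment scores are already available in a dictionary, and the function names (`best_hits`, `reciprocal_pairs`, `triangulated_clusters`) and the toy data are hypothetical. The minimum alignment overlap filter and the in-paralog expansion are omitted, and triangles are merged whenever they share a gene, which is a simplification.

```python
from collections import defaultdict

# A gene is represented as a (species, gene_id) tuple.
# `scores` maps an unordered gene pair, stored once, to an alignment score.

def best_hits(scores):
    """For each gene, find its best-scoring hit in every other species."""
    best = {}
    for (a, b), s in scores.items():
        for query, target in ((a, b), (b, a)):
            if query[0] == target[0]:
                continue  # ignore within-species comparisons in this sketch
            key = (query, target[0])
            if key not in best or s > best[key][1]:
                best[key] = (target, s)
    return {key: hit for key, (hit, _) in best.items()}

def reciprocal_pairs(scores):
    """Keep only pairs where each gene is the other's best hit (BRH)."""
    best = best_hits(scores)
    pairs = set()
    for (query, _species), target in best.items():
        if best.get((target, query[0])) == query:
            pairs.add(frozenset((query, target)))
    return pairs

def triangulated_clusters(pairs):
    """Build clusters from BRH triangles (three mutually reciprocal genes)."""
    adj = defaultdict(set)
    for p in pairs:
        x, y = tuple(p)
        adj[x].add(y)
        adj[y].add(x)

    parent = {}

    def find(g):
        parent.setdefault(g, g)
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for p in pairs:
        x, y = tuple(p)
        for z in adj[x] & adj[y]:  # z closes a triangle with x and y
            parent[find(y)] = find(x)
            parent[find(z)] = find(x)

    clusters = defaultdict(set)
    for g in list(parent):
        clusters[find(g)].add(g)
    return sorted(sorted(c) for c in clusters.values())

# Toy data: a1, b1, c1 form a triangle; a2 and b2 are only a BRH pair.
scores = {
    (("A", "a1"), ("B", "b1")): 100,
    (("A", "a1"), ("C", "c1")): 90,
    (("B", "b1"), ("C", "c1")): 95,
    (("A", "a2"), ("B", "b2")): 80,
    (("A", "a2"), ("B", "b1")): 30,
    (("A", "a1"), ("B", "b2")): 20,
}
clusters = triangulated_clusters(reciprocal_pairs(scores))
```

In this toy input the pair a2/b2 is a reciprocal best hit but has no third gene to close a triangle, so it does not seed a cluster. A fuller treatment would decide separately how to handle such pairs.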
The database contains some 600 eukaryotic species and more than 3600 bacteria [1] sourced from Ensembl, UniProt, NCBI, FlyBase, and several other databases. The ever-increasing sampling of sequenced genomes brings a clearer account of the majority of gene genealogies that will facilitate informed hypotheses of gene function in newly sequenced genomes.
Examples of studies that have employed data from OrthoDB include comparative analyses of gene repertoire evolution, [5] [6] comparisons of fruit fly and mosquito developmental genes, [7] analyses of bloodmeal- or infection-induced changes in gene expression in mosquitoes, [8] [9] [10] analysis of the evolution of mammalian milk production, [11] and mosquito gene and genome evolution. [12] Other studies citing OrthoDB can be found at PubMed and Google Scholar.
OrthoDB has performed consistently well in benchmarking assessments alongside other orthology delineation procedures. Results were compared to reference trees for three well-conserved protein families, [13] and to a larger set of curated protein families. [14]
Benchmarking sets of Universal Single-Copy Orthologs (BUSCOs) [15]: orthologous groups are selected from OrthoDB for the root-level classifications of arthropods, vertebrates, metazoans, fungi, and other major clades. Groups are required to contain single-copy orthologs in at least 90% of the species (in the others they may be lost or duplicated), and the missing species cannot all be from the same clade. Species with frequent losses or duplications are removed from the selection unless they hold a key position in the phylogeny. BUSCOs are therefore expected to be found as single-copy orthologs in any newly sequenced genome from the appropriate phylogenetic clade, and can be used to assess the relative completeness of newly sequenced genomes. The BUSCO assessment tool and datasets are widely used in genomics projects, and many journals now expect such quality assessments before accepting new genome publications.
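The selection criteria above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the BUSCO pipeline: the function name `select_single_copy_groups`, the input layout (copy counts per species per group, plus a species-to-clade map), and the toy data are all hypothetical. The same-clade rule is assumed to apply only when more than one species lacks a single-copy ortholog, and the prior removal of species with frequent losses or duplications is not modelled.

```python
def select_single_copy_groups(groups, species_clade, min_percent=90):
    """Select groups that are single-copy in at least `min_percent` of species.

    groups: dict group_id -> dict species -> gene copy count (absent = 0)
    species_clade: dict species -> clade label, covering all species
    """
    all_species = set(species_clade)
    selected = []
    for group_id, counts in groups.items():
        single = {s for s in all_species if counts.get(s, 0) == 1}
        # Integer arithmetic avoids floating-point edge cases at the threshold.
        if len(single) * 100 < min_percent * len(all_species):
            continue
        # Species where the gene is lost or duplicated.
        missing = all_species - single
        missing_clades = {species_clade[s] for s in missing}
        # Assumption: reject only if several missing species share one clade.
        if len(missing) > 1 and len(missing_clades) == 1:
            continue
        selected.append(group_id)
    return selected

# Toy data: 20 species, the first 10 in clade X and the rest in clade Y.
species = [f"s{i}" for i in range(20)]
species_clade = {s: ("X" if i < 10 else "Y") for i, s in enumerate(species)}

def counts_without(*not_single):
    """Single copy everywhere except the named species, which have two copies."""
    return {s: (2 if s in not_single else 1) for s in species}

groups = {
    "g_all": counts_without(),                  # single-copy in all species
    "g_same": counts_without("s0", "s1"),       # both exceptions in clade X
    "g_mixed": counts_without("s0", "s10"),     # exceptions in different clades
    "g_low": counts_without("s0", "s1", "s10"), # only 85% single-copy
}
selected = select_single_copy_groups(groups, species_clade)
```

Here `g_same` meets the 90% threshold but is rejected because both exceptions fall in the same clade, while `g_low` fails the threshold outright.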
In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are somehow related. The sequences can be either of genomic, "transcriptomic" (ESTs) or protein origin. For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important to group sequences originating from the same gene before the ESTs are assembled to reconstruct the original mRNA.
A protein family is a group of evolutionarily related proteins. In many cases, a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term "protein family" should not be confused with family as it is used in taxonomy.
Comparative genomics is a field of biological research in which the genomic features of different organisms are compared. The genomic features may include the DNA sequence, genes, gene order, regulatory sequences, and other genomic structural landmarks. In this branch of genomics, whole or large parts of genomes resulting from genome projects are compared to study basic biological similarities and differences as well as evolutionary relationships between organisms. The major principle of comparative genomics is that common features of two organisms will often be encoded within the DNA that is evolutionarily conserved between them. Therefore, comparative genomic approaches start with making some form of alignment of genome sequences and looking for orthologous sequences in the aligned genomes and checking to what extent those sequences are conserved. Based on these, genome and molecular evolution are inferred and this may in turn be put in the context of, for example, phenotypic evolution or population genetics.
Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a speciation event (orthologs), or a duplication event (paralogs), or else a horizontal gene transfer event (xenologs).
KEGG is a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances. KEGG is utilized for bioinformatics research and education, including data analysis in genomics, metagenomics, metabolomics and other omics studies, modeling and simulation in systems biology, and translational research in drug development.
MicrobesOnline is a publicly and freely accessible website that hosts multiple comparative genomic tools for comparing microbial species at the genomic, transcriptomic and functional levels. MicrobesOnline was developed by the Virtual Institute for Microbial Stress and Survival, which is based at the Lawrence Berkeley National Laboratory in Berkeley, California. The site was launched in 2005, with regular updates until 2011.
Neuroblastoma breakpoint family, member 3, also known as NBPF3, is a human gene of the neuroblastoma breakpoint family, which resides on chromosome 1 of the human genome. NBPF3 is located at 1p36.12, immediately upstream of genes ALPL and RAP1GAP.
CCDC186 is a protein that in humans is encoded by the CCDC186 gene. The CCDC186 gene is also known as the CTCL-tumor associated antigen, with accession number NM_018017.
Inparanoid is an algorithm that finds orthologous genes and paralogous genes that arose—most likely by duplication—after some speciation event. Such protein-coding genes are called in-paralogs, as opposed to out-paralogs.
The Viral Bioinformatics Resource Center (VBRC) is an online resource providing access to a database of curated viral genomes and a variety of tools for bioinformatic genome analysis. This resource was one of eight BRCs funded by NIAID with the goal of promoting research against emerging and re-emerging pathogens, particularly those seen as potential bioterrorism threats. The VBRC is now supported by Dr. Chris Upton at the University of Victoria.
In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.
OMA is a database of orthologs extracted from available complete genomes. The orthology predictions of OMA are available in several forms:
PhylomeDB is a public biological database for complete catalogs of gene phylogenies (phylomes). It allows users to interactively explore the evolutionary history of genes through the visualization of phylogenetic trees and multiple sequence alignments. Moreover, phylomeDB provides genome-wide orthology and paralogy predictions which are based on the analysis of the phylogenetic trees. The automated pipeline used to reconstruct trees aims at providing a high-quality phylogenetic analysis of different genomes, including Maximum Likelihood tree inference, alignment trimming and evolutionary model testing.
The human gene Chromosome 3 open reading frame 14 is a gene of uncertain function located at 3p14.2 near fragile site FRA3B, which falls between this gene and the centromere. Its protein is expected to localize to the nucleus and bind DNA. Orthologs have been identified in all of the major animal groups except amphibians and insects, tracing as far back as the sea anemone, which indicates an origin more than 1000 mya and suggests its importance in the animal genome.
The eggNOG database is a database of biological information hosted by the EMBL. It is based on the original idea of COGs and expands that idea to non-supervised orthologous groups constructed from numerous organisms. The database was created in 2007 and updated to version 4.5 in 2015. eggNOG stands for evolutionary genealogy of genes: Non-supervised Orthologous Groups.
Infologs are independently designed synthetic genes derived from one or a few genes where substitutions are systematically incorporated to maximize information. Infologs are designed for perfect diversity distribution to maximize search efficiency.
In molecular phylogenetics, relationships among individuals are determined using character traits, such as DNA, RNA or protein, which may be obtained using a variety of sequencing technologies. High-throughput next-generation sequencing has become a popular technique in transcriptomics, which represent a snapshot of gene expression. In eukaryotes, making phylogenetic inferences using RNA is complicated by alternative splicing, which produces multiple transcripts from a single gene. As such, a variety of approaches may be used to improve phylogenetic inference using transcriptomic data obtained from RNA-Seq and processed using computational phylogenetics.
Christophe Dessimoz is a Swiss National Science Foundation (SNSF) Professor at the University of Lausanne, Associate Professor at University College London and a group leader at the Swiss Institute of Bioinformatics. He was awarded the Overton Prize in 2019 for his contributions to computational biology. Starting in April 2022, he will be joint executive director of the SIB Swiss Institute of Bioinformatics, along with Ron Appel.
OrthoFinder is a command-line software tool for comparative genomics. OrthoFinder determines the correspondence between genes in different organisms. This correspondence provides a framework for understanding the evolution of life on Earth, and enables the extrapolation and transfer of biological knowledge between organisms.