PhylomeDB

PhylomeDB
Content
Description: genome-wide collections of gene phylogenies.
Contact
Laboratory: Comparative Genomics Group, Centre for Genomic Regulation (CRG), Barcelona, Spain.
Authors: Jaime Huerta-Cepas, Salvador Capella-Gutierrez, Leszek Pryszcz, Marina Marcet-Houben, Ernst Thür, Laia Carreté, Miguel Ángel Naranjo-Ortiz and Toni Gabaldón
Primary citation: Huerta-Cepas et al. (2014) [1]
Release date: 2014
Access
Website: http://phylomedb.org

PhylomeDB is a public biological database of complete catalogs of gene phylogenies (phylomes). [1] [2] [3] It allows users to interactively explore the evolutionary history of genes by visualizing phylogenetic trees and multiple sequence alignments. PhylomeDB also provides genome-wide orthology and paralogy predictions, which are derived from the analysis of the phylogenetic trees. The automated pipeline used to reconstruct the trees aims to provide a high-quality phylogenetic analysis of different genomes, and includes maximum likelihood tree inference, alignment trimming [4] and evolutionary model testing.
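As an illustration of how orthology and paralogy can be read off a gene tree, the following minimal sketch uses a species-overlap rule: an internal node whose two child subtrees share at least one species is treated as a duplication, otherwise as a speciation. The rule, the nested-tuple tree encoding and the `Species|gene` leaf names are assumptions made for this example; the text above does not specify the algorithm, and this is not PhylomeDB's own code.

```python
from itertools import product

# A tree is either a leaf string "Species|gene" or a tuple of two subtrees.
# This encoding is hypothetical and chosen only to keep the sketch short.

def leaves(node):
    """Return all leaf names under a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def species(leaf):
    """Extract the species code from a 'Species|gene' leaf name."""
    return leaf.split("|")[0]

def classify_pairs(node, orthologs=None, paralogs=None):
    """Classify every pair of leaves by the node at which they diverge.

    If the two child subtrees share a species, the node is treated as a
    duplication and the pairs it separates are paralogs; otherwise it is
    treated as a speciation and the pairs are orthologs.
    """
    if orthologs is None:
        orthologs, paralogs = set(), set()
    if isinstance(node, str):
        return orthologs, paralogs
    left, right = node
    left_leaves, right_leaves = leaves(left), leaves(right)
    shared = {species(x) for x in left_leaves} & {species(x) for x in right_leaves}
    target = paralogs if shared else orthologs
    for a, b in product(left_leaves, right_leaves):
        target.add(frozenset((a, b)))
    classify_pairs(left, orthologs, paralogs)
    classify_pairs(right, orthologs, paralogs)
    return orthologs, paralogs

# Toy rooted tree: a duplication that predates the human/mouse split,
# with a single yeast gene as the outgroup.
toy_tree = ((("Hsap|A1", "Mmus|A1"), ("Hsap|A2", "Mmus|A2")), "Scer|A")
orthologs, paralogs = classify_pairs(toy_tree)
```

In the toy tree, the two human/mouse pairs that diverge at speciation nodes are classified as orthologs, the yeast gene is orthologous to all four mammalian genes, and every pair that spans the A1 and A2 clades is classified as paralogous.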

PhylomeDB also includes a public download section with the complete set of trees, alignments and orthology predictions, as well as a web API that facilitates cross-linking to trees from external sources. Finally, PhylomeDB provides an advanced tree visualization interface based on the ETE toolkit, [5] which integrates tree topologies, taxonomic information, domain mapping and alignment visualization in a single interactive tree image.

New features in PhylomeDB

The tree search engine of PhylomeDB was updated to provide a gene-centric view of all PhylomeDB resources. After a protein or gene search, all the trees available in PhylomeDB are listed and organized by phylome and tree type. Users can switch among all available seed and collateral trees without losing focus on the searched protein or gene.

In PhylomeDB v4, all the information available for each tree is shown in an integrated layout in which tree topology, taxonomy data, alignments, domain annotations and event-age (phylostratigraphy) information are rendered in the same figure using the visualization features provided by the ETE toolkit v2.2:

  1. Pfam domains have been mapped to each alignment in the database and are displayed in a compact panel at the right side of the tree. For each sequence, domains and their names are shown; clicking a domain gives a short description and an external link to Pfam. Protein regions not mapped to domains are shown using the standard amino acid color codes, while gap regions are represented by a flat line.
  2. Tree images have also been simplified to improve readability. Mappings and cross-links to general and organism-oriented databases have been extended to include the major Arabidopsis thaliana sequence database TAIR, Drosophila's FlyBase, and the Ascomycete-based genome database Genolevures.
  3. Speciation and duplication events are indicated using different node colors, and poorly supported partitions are automatically highlighted with a transparent red bubble whose size is inversely proportional to the branch bootstrap or aLRT value.
  4. Internal tree searches can be performed for any of the annotated node attributes, while links to other databases are provided through the contextual menu of the tree browser that appears when clicking any node.
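The support-dependent bubble in item 3 can be illustrated with a small sizing function. This is a sketch only: the linear scaling, the cutoff above which no bubble is drawn, and the maximum radius are all assumed values for the example, not parameters documented for PhylomeDB or the ETE toolkit.

```python
def bubble_radius(support, max_radius=10.0, threshold=0.9):
    """Return a bubble radius that grows as branch support drops.

    `support` is a bootstrap proportion or aLRT value scaled to [0, 1].
    `max_radius` and `threshold` are hypothetical defaults: nodes with
    support at or above `threshold` get no bubble at all.
    """
    if not 0.0 <= support <= 1.0:
        raise ValueError("support must lie in [0, 1]")
    if support >= threshold:
        return 0.0
    # Linear decrease: zero support gives the largest bubble.
    return max_radius * (1.0 - support)

well_supported = bubble_radius(0.95)   # no highlight
weakly_supported = bubble_radius(0.5)  # medium bubble
unsupported = bubble_radius(0.0)       # largest bubble
```

A linear mapping is only one way to make the bubble shrink as support rises; any monotonically decreasing function would give the same visual cue.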

Users can also download relevant data, including the whole database, a specific phylome or, from the tree entry page, the data corresponding to that tree. This release adds the option to download orthology predictions from a tree in the recently developed OrthoXML standard format, in addition to a tabular format.
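An OrthoXML file can be read with a standard XML parser. The sketch below parses a small, invented OrthoXML document with Python's `xml.etree.ElementTree` and lists the genes in each top-level ortholog group. The sample document, its `example_db` database name and its protein identifiers are made up for illustration and are not taken from a PhylomeDB download; the namespace is the one used by OrthoXML version 0.3.

```python
import xml.etree.ElementTree as ET

# Namespace of OrthoXML 0.3 documents.
OX = "http://orthoXML.org/2011/"
NS = {"ox": OX}

# Invented sample: one human gene orthologous to two mouse in-paralogs.
SAMPLE = """<orthoXML xmlns="http://orthoXML.org/2011/" version="0.3"
          origin="example" originVersion="1">
  <species name="Homo sapiens" NCBITaxId="9606">
    <database name="example_db" version="1">
      <genes>
        <gene id="1" protId="HUMAN_P1"/>
      </genes>
    </database>
  </species>
  <species name="Mus musculus" NCBITaxId="10090">
    <database name="example_db" version="1">
      <genes>
        <gene id="2" protId="MOUSE_P1"/>
        <gene id="3" protId="MOUSE_P2"/>
      </genes>
    </database>
  </species>
  <groups>
    <orthologGroup id="g1">
      <geneRef id="1"/>
      <paralogGroup>
        <geneRef id="2"/>
        <geneRef id="3"/>
      </paralogGroup>
    </orthologGroup>
  </groups>
</orthoXML>
"""

def gene_table(root):
    """Map each internal gene id to (species name, protein id)."""
    table = {}
    for sp in root.findall("ox:species", NS):
        for gene in sp.iterfind(".//ox:gene", NS):
            table[gene.get("id")] = (sp.get("name"), gene.get("protId"))
    return table

def top_level_groups(root):
    """Return the members of every top-level ortholog group,
    flattening any nested ortholog or paralog groups."""
    table = gene_table(root)
    groups = []
    for group in root.findall("ox:groups/ox:orthologGroup", NS):
        refs = group.iter("{%s}geneRef" % OX)
        groups.append([table[ref.get("id")] for ref in refs])
    return groups

root = ET.fromstring(SAMPLE)
groups = top_level_groups(root)
```

Flattening nested groups discards the distinction between orthologs and in-paralogs; a fuller reader would walk the `orthologGroup` and `paralogGroup` elements recursively to keep that structure.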

Quest for Orthologs

The Quest for Orthologs (QfO) consortium involves more than 30 phylogenomic databases. The main goal of the consortium is to improve and standardize orthology predictions through collaboration, and to discuss emerging methods.

  1. Quest for Orthologs
  2. 2015 QfO Meeting

References

  1. Huerta-Cepas, J; Capella-Gutierrez, S; Pryszcz, LP; Marcet-Houben, M; Gabaldón, T (Jan 2014). "PhylomeDB v4: zooming into the plurality of evolutionary histories of a genome". Nucleic Acids Res. England. 42 (Database issue): D897–902. doi:10.1093/nar/gkt1177. PMC 3964985. PMID 24275491.
  2. Huerta-Cepas, J; Bueno, A; Dopazo, J; Gabaldón, T (Jan 2008). "PhylomeDB: a database for genome-wide collections of gene phylogenies". Nucleic Acids Res. England. 36 (Database issue): D491–6. doi:10.1093/nar/gkm899. PMC 2238872. PMID 17962297.
  3. Huerta-Cepas, J; Capella-Gutierrez, S; Pryszcz, LP; Denisov, I; Kormes, D; Marcet-Houben, M; Gabaldón, T (Jan 2011). "PhylomeDB v3.0: an expanding repository of genome-wide collections of trees, alignments and phylogeny-based orthology and paralogy predictions". Nucleic Acids Res. England. 39 (Database issue): D556–60. doi:10.1093/nar/gkq1109. PMC 3013701. PMID 21075798.
  4. Capella-Gutierrez, S; Silla-Martínez, JM; Gabaldón, T (Aug 2009). "trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses". Bioinformatics. 25 (15): 1972–3. doi:10.1093/bioinformatics/btp348. PMC 2712344. PMID 19505945.
  5. Huerta-Cepas, J; Dopazo, J; Gabaldón, T (Jan 2010). "ETE: a python Environment for Tree Exploration". BMC Bioinformatics. 11: 24. doi:10.1186/1471-2105-11-24. PMC 2820433. PMID 20070885.