Original author(s) | Metascape Team |
---|---|
Developer(s) | Yingyao Zhou, Bin Zhou, Lars Pache, Max Chang, Christopher Benner, Sumit Chanda |
Stable release | 3.5 / 1 March 2019 |
Type | Bioinformatics |
License | Freeware |
Website | metascape |
Metascape is a free gene annotation and analysis resource that helps biologists make sense of one or multiple gene lists. Metascape provides automated meta-analysis tools to understand either common or unique pathways and protein networks within a group of orthogonal target-discovery studies.
In the "OMICs" age, it is important to gain biological insights into a list of genes. Although a number of bioinformatics resources exist for this purpose, such as DAVID, not all of them are free, easy to use, and well maintained. Analyzing multiple gene lists that originate from orthogonal but complementary "OMICs" studies often requires computational skills that are beyond the reach of many biologists. According to the Metascape blog, [1] a team of scientists self-organized to address this challenge. The team includes core members Yingyao Zhou, Bin Zhou, Lars Pache, Max Chang, Christopher Benner, and Sumit Chanda, as well as other contributors over time. Metascape was first released as a beta version on Oct 8, 2015. The first Metascape application was published on Dec 9, 2015. [2] Metascape has gone through multiple releases since then. It currently supports key model organisms, pathway enrichment analysis, protein-protein interaction network and component analysis, and automatic presentation of the results as a publication-ready web report, Excel sheets, and PowerPoint presentations.
The paper titled "Metascape provides a biologist-oriented resource for the analysis of systems-level datasets" was published on Apr 3, 2019 in Nature Communications. [3]
Metascape implements a CAME analysis workflow: identifier conversion, annotation, membership search, and enrichment analysis.
Metascape integrates over 40 bioinformatics knowledgebases into a seamless user interface, where experimental biologists can use a single-click Express Analysis feature to turn multiple gene lists into interpretable results.
All analysis results are presented in a web report, which contains Excel annotation and enrichment sheets, PowerPoint slides, and custom analysis files (e.g., .cys file by Cytoscape, .svg by Circos) for further offline analysis or processing.
One notable strength of Metascape is its visualization capability. As of December 2021, Metascape had aided in the interpretation of 2,600 published studies, [4] about two-thirds of which made use of graphs or sheets prepared by Metascape.
Metascape for Bioinformaticians (MSBio) was released in 2021 to meet the growing need of computational biologists to automate Metascape batch analyses of large-scale gene lists. [5] MSBio uses container technology to encapsulate the computational platform in Docker containers. Academic users can conduct offline analyses, which are limited only by the hardware they have access to. Commercial users can add proprietary knowledgebases and conduct secure computations using internal computational assets. MSBio databases are updated in synchronization with the Metascape website.
Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.
The Gene Ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO, for example via enrichment analysis. GO is part of a larger classification effort, the Open Biomedical Ontologies, being one of the Initial Candidate Members of the OBO Foundry.
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, USA.
The Rat Genome Database (RGD) is a database of rat genomics, genetics, physiology and functional data, as well as data for comparative genomics between rat, human and mouse. RGD is responsible for attaching biological information to the rat genome via structured vocabulary, or ontology, annotations assigned to genes and quantitative trait loci (QTL), and for consolidating rat strain data and making it available to the research community. RGD is also developing a suite of tools for mining and analyzing genomic, physiologic and functional data for the rat, and comparative data for rat, mouse, human, and five other species.
In the field of molecular biology, gene expression profiling is the measurement of the activity of thousands of genes at once, to create a global picture of cellular function. These profiles can, for example, distinguish between cells that are actively dividing, or show how the cells react to a particular treatment. Many experiments of this sort measure an entire genome simultaneously, that is, every gene present in a particular cell.
Reactome is a free online database of biological pathways. It is manually curated and authored by PhD-level biologists, in collaboration with Reactome editorial staff. The content is cross-referenced to many bioinformatics databases. The rationale behind Reactome is to visually represent biological pathways in full mechanistic detail, while making the source data available in a computationally accessible format.
The Saccharomyces Genome Database (SGD) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, which is commonly known as baker's or budding yeast. Further information is located at the Yeastract curated repository.
GenMAPP is a free, open-source bioinformatics software tool designed to visualize and analyze genomic data in the context of pathways, connecting gene-level datasets to biological processes and disease. First created in 2000, GenMAPP is developed by an open-source team based in an academic research laboratory. GenMAPP maintains databases of gene identifiers and collections of pathway maps in addition to visualization and analysis tools. Together with other public resources, GenMAPP aims to provide the research community with tools to gain insight into biology through the integration of data types ranging from genes to proteins to pathways to disease.
DAVID is a free online bioinformatics resource developed by the Laboratory of Human Retrovirology and Immunoinformatics. All tools in the DAVID Bioinformatics Resources aim to provide functional interpretation of large lists of genes derived from genomic studies, e.g. microarray and proteomics studies. DAVID can be found at https://david.ncifcrf.gov/
Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins. These proteins are usually ones that are poorly studied or predicted based on genomic sequence data. These predictions are often driven by data-intensive computational procedures. Information may come from nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interaction. Protein function is a broad term: the roles of proteins range from catalysis of biochemical reactions to transport to signal transduction, and a single protein may play a role in multiple processes or cellular pathways.
ArrayTrack is a multi-purpose bioinformatics tool primarily used for microarray data management, analysis, and interpretation. ArrayTrack was developed to support in-house filter array research for the U.S. Food and Drug Administration in 2001, and was made freely available to the public as an integrated research tool for microarrays in 2003. Since then, ArrayTrack has averaged about 5,000 users per year. It is regularly updated by the National Center for Toxicological Research.
dcGO is a comprehensive ontology database for protein domains. As an ontology resource, dcGO integrates Open Biomedical Ontologies from a variety of contexts, ranging from functional information like Gene Ontology to others on enzymes and pathways, from phenotype information across major model organisms to information about human diseases and drugs. As a protein domain resource, dcGO includes annotations to both the individual domains and supra-domains.
In bioinformatics, the PANTHER classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. PANTHER is part of the Gene Ontology Reference Genome Project designed to classify proteins and their genes for high-throughput analysis.
Protein Complex Enrichment Analysis Tool is an online bioinformatics tool used to analyze high-throughput datasets using protein complex enrichment analysis. The tool uses a protein complex resource as the back end annotation data instead of conventional gene ontology- or pathway-based annotations. The tool incorporates several useful features in order to provide a comprehensive data-mining environment, including network-based visualization and interactive querying options.
Gene set enrichment analysis (GSEA) (also called functional enrichment analysis or pathway enrichment analysis) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with different phenotypes (e.g. different organism growth patterns or diseases). The method uses statistical approaches to identify significantly enriched or depleted groups of genes. Transcriptomics technologies and proteomics results often identify thousands of genes, which are used for the analysis.
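One widely used statistical approach for detecting over-representation is the one-sided hypergeometric test. The following is a minimal sketch of that test, not a description of how any particular tool implements it; the function name `hypergeom_enrichment_p` and its parameter names are hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_enrichment_p(population: int, set_size: int,
                           list_size: int, overlap: int) -> float:
    """Return P(X >= overlap) for a hypergeometric random variable X.

    population: number of genes in the background (assumed known)
    set_size:   number of background genes annotated to the gene set
    list_size:  number of genes in the input list
    overlap:    number of input genes that belong to the gene set
    """
    max_overlap = min(set_size, list_size)
    total = comb(population, list_size)
    # Sum the probability mass from the observed overlap up to the maximum.
    tail = sum(
        comb(set_size, k) * comb(population - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 10 background genes, 3 in the set, 4 in the list, 2 shared.
    # The tail sum is (3*21 + 1*7) / 210 = 70/210, i.e. one third.
    print(hypergeom_enrichment_p(10, 3, 4, 2))
```

A small p-value suggests that the overlap is larger than expected by chance. In practice many gene sets are tested at once, so the raw p-values are normally corrected for multiple testing.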
In bioinformatics, a Gene Disease Database is a systematized collection of data, typically structured to model aspects of reality, in a way to comprehend the underlying mechanisms of complex diseases, by understanding multiple composite interactions between phenotype-genotype relationships and gene-disease mechanisms. Gene Disease Databases integrate human gene-disease associations from various expert curated databases and text mining derived associations including Mendelian, complex and environmental diseases.
Pathway is the term from molecular biology for a curated schematic representation of a well characterized segment of the molecular physiological machinery, such as a metabolic pathway describing an enzymatic process within a cell or tissue or a signaling pathway model representing a regulatory process that might, in its turn, enable a metabolic or another regulatory process downstream. A typical pathway model starts with an extracellular signaling molecule that activates a specific receptor, thus triggering a chain of molecular interactions. A pathway is most often represented as a relatively small graph with gene, protein, and/or small molecule nodes connected by edges of known functional relations. While a simpler pathway might appear as a chain, complex pathway topologies with loops and alternative routes are much more common. Computational analyses employ special formats of pathway representation. In the simplest form, however, a pathway might be represented as a list of member molecules with order and relations unspecified. Such a representation, generally called a Functional Gene Set (FGS), can also refer to other functionally characterised groups such as protein families and Gene Ontology (GO) and Disease Ontology (DO) terms. In bioinformatics, methods of pathway analysis might be used to identify key genes or proteins within a previously known pathway in relation to a particular experiment or pathological condition, or to build a pathway de novo from proteins that have been identified as key affected elements. By examining changes in, for example, gene expression in a pathway, its biological activity can be explored. Most frequently, however, pathway analysis refers to a method of initial characterization and interpretation of an experimental condition that was studied with omics tools or a genome-wide association study. Such studies might identify long lists of altered genes.
A visual inspection is then challenging and the information is hard to summarize, since the altered genes map to a broad range of pathways, processes, and molecular functions. In such situations, the most productive way of exploring the list is to identify enrichment of specific FGSs in it. The general approach of enrichment analyses is to identify FGSs, members of which were most frequently or most strongly altered in the given condition, in comparison to a gene set sampled by chance. In other words, enrichment can map canonical prior knowledge structured in the form of FGSs to the condition represented by altered genes.
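The comparison against "a gene set sampled by chance" can be illustrated with a simple Monte Carlo sketch: draw random gene lists of the same size as the altered list from a background, and count how often their overlap with the FGS is at least as large as the observed overlap. This is only an illustration under stated assumptions; the function `empirical_enrichment_p`, its parameters, and the toy gene names are hypothetical, and real tools typically use analytic tests rather than sampling.

```python
import random
from typing import Iterable


def empirical_enrichment_p(background: Iterable[str], gene_set: Iterable[str],
                           altered: Iterable[str], n_samples: int = 2000,
                           seed: int = 0) -> float:
    """Estimate how often a random gene list overlaps the FGS as much as observed."""
    rng = random.Random(seed)            # fixed seed keeps the sketch reproducible
    pool = sorted(set(background))       # sampling needs an ordered sequence
    pool_set = set(pool)
    fgs = set(gene_set) & pool_set       # ignore genes outside the background
    hits_list = set(altered) & pool_set
    observed = len(fgs & hits_list)

    at_least_as_large = 0
    for _ in range(n_samples):
        sample = rng.sample(pool, len(hits_list))
        if sum(gene in fgs for gene in sample) >= observed:
            at_least_as_large += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_large + 1) / (n_samples + 1)


if __name__ == "__main__":
    background = [f"G{i}" for i in range(200)]   # toy background of 200 genes
    fgs = background[:10]                        # toy FGS: first 10 genes
    altered = background[:8] + background[100:102]
    print(empirical_enrichment_p(background, fgs, altered))
```

The estimate is bounded below by 1 / (n_samples + 1), so resolving very small p-values requires many samples; this is one reason analytic tests are usually preferred for large-scale enrichment analysis.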
Biocuration is the field of life sciences dedicated to organizing biomedical data, information and knowledge into structured formats, such as spreadsheets, tables and knowledge graphs. The biocuration of biomedical knowledge is made possible by the cooperative work of biocurators, software developers and bioinformaticians and is at the base of the work of biological databases.
Sumit K. Chanda is an American research scientist who works on viral and immunological human diseases. He also led the team that built and deployed Metascape for the analysis of omics data. The tool attracts over 500,000 users per year and has been cited about 2,000 times a year since its publication in 2019.