Microarray databases

A microarray database is a repository containing microarray gene expression data. The key uses of a microarray database are to store the measurement data, manage a searchable index, and make the data available to other applications for analysis and interpretation (either directly, or via user downloads).
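The three uses named above (storing measurements, maintaining a searchable index, and exporting data to other applications) can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions: the `Experiment` record layout, the `TinyMicroarrayStore` class, and its method names are hypothetical and do not correspond to the schema or API of any real repository.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Experiment:
    # Hypothetical record layout; real repositories store far richer metadata.
    accession: str
    title: str
    organism: str
    # Maps sample name -> {probe id -> expression value}.
    samples: dict[str, dict[str, float]] = field(default_factory=dict)


class TinyMicroarrayStore:
    """Toy in-memory stand-in for a microarray database (illustrative only)."""

    def __init__(self) -> None:
        self._experiments: dict[str, Experiment] = {}
        self._index: dict[str, set[str]] = defaultdict(set)

    def add(self, exp: Experiment) -> None:
        # 1. Store the measurement data.
        self._experiments[exp.accession] = exp
        # 2. Maintain a searchable keyword index over title and organism.
        for word in f"{exp.title} {exp.organism}".lower().split():
            self._index[word].add(exp.accession)

    def search(self, keyword: str) -> list[str]:
        # Case-insensitive exact-word lookup; returns sorted accessions.
        return sorted(self._index.get(keyword.lower(), set()))

    def export_rows(self, accession: str) -> list[tuple[str, str, float]]:
        # 3. Hand the data to other applications as (sample, probe, value) rows.
        exp = self._experiments[accession]
        return [
            (sample, probe, value)
            for sample, probes in sorted(exp.samples.items())
            for probe, value in sorted(probes.items())
        ]
```

A production repository would add persistent storage, controlled vocabularies, and access control; the sketch only shows how the three roles relate to one another.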

Microarray databases can fall into two distinct classes:

  1. A peer-reviewed, public repository that adheres to academic or industry standards and is designed to be used by many analysis applications and groups. Good examples are the Gene Expression Omnibus (GEO) from NCBI and ArrayExpress from EBI.
  2. A specialized repository associated primarily with the brand of a particular entity (lab, company, university, consortium, group), an application suite, a topic, or an analysis method, whether it is commercial, non-profit, or academic. These databases might have one or more of the following characteristics:
    • A subscription or license may be needed to gain full access,
    • The content may come primarily from a specific group (e.g. SMD, UPSC-BASE, or the Immunological Genome Project),
    • There may be constraints on who can use the data or for what purpose data can be used,
    • Special permission may be required to submit new data, or there may be no obvious process at all,
    • Only certain applications may be equipped to use the data, often also associated with the same entity (for example, caArray at NCI is specialized for the caBIG),
    • Further processing or reformatting of the data may be required for standard applications or analysis,
    • It may claim to address the 'urgent need' for a standard, centralized repository for microarray data (see, for example, YMD, last updated in 2003),
    • It may claim an incremental improvement over one of the public repositories,
    • It may be a meta-analysis application that incorporates studies from one or more public databases (e.g. Gemma primarily uses GEO studies; NextBio uses various sources).

Some of the best-known public, curated microarray databases are:


| Database | Scope | Microarray experiment sets | Sample profiles | As of date |
|---|---|---|---|---|
| ArrayTrack | Hosts both public and private data, including MAQC benchmark data, with integrated analysis tools | 1622 | 50,093 | Feb 2012 |
| NCI mAdb | Hosts NCI data with integrated analysis and statistics tools | ? | 105,000 | Mar 2012 |
| ImmGen database | Open access across all immune system cells; expression data, differential expression, coregulated clusters, regulation | 267 | 1059 | Jan 2012 |
| Genevestigator | Gene expression search engine based on manually curated, well annotated public and proprietary microarray and RNA-seq datasets | 3228 | 232,855 | October 2016 |
| Gene Expression Omnibus - NCBI | Any curated MIAME compliant molecular abundance study | 25859 | 641770 | October 28, 2011 |
| ArrayExpress at EBI | Any curated MIAME or MINSEQE compliant transcriptomics data | 24838 | 708914 | October 28, 2011 |
| Stanford Microarray database | Private and published microarray and molecule abundance database (now defunct) | 82542 | ? | October 23, 2011 |
| The Cancer Genome Atlas (TCGA) | Collection of expression data for different cancers | 21229 | ? | August 30, 2013 |
| GeneNetwork system | Open access standard arrays, exon arrays, and RNA-seq data for genetic analysis (eQTL studies) with analysis suite | ~100 | ~10000 | July, 2010 |
| UNC modENCODE Microarray database | Nimblegen customer 2.1 million array | ~6 | 180 | July 17, 2009 |
| UPSC-BASE | Data generated by microarray analysis within Umeå Plant Science Centre (UPSC) | ~100 | ? | November 15, 2007 |
| UPenn RAD database | MIAME compliant public and private studies, associated with ArrayExpress | ~100 | ~2500 | Sept. 1, 2007 |
| UNC Microarray database | Provides the service for microarray data storage, retrieval, analysis, and visualization | ~31 | 2093 | April 1, 2007 |
| MUSC database | Repository for DNA microarray data generated by MUSC investigators as well as researchers in the global research community | ~45 | 555 | April 1, 2007 |
| caArray at NCI | Cancer data, prepared for analysis on caBIG | 41 | 1741 | November 15, 2006 |

Related Research Articles

Biostatistics is a branch of statistics that applies statistical methods to a wide range of topics in biology. It encompasses the design of biological experiments, the collection and analysis of data from those experiments and the interpretation of the results.

Bioinformatics: Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.

DNA microarray: Collection of microscopic DNA spots attached to a solid surface

A DNA microarray is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome. Each DNA spot contains picomoles of a specific DNA sequence, known as a probe. Probes can be short sections of a gene or other DNA elements that are used to hybridize a cDNA or cRNA sample (the target) under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detection of fluorophore-, silver-, or chemiluminescence-labeled targets to determine the relative abundance of nucleic acid sequences in the target. The original nucleic acid arrays were macro arrays approximately 9 cm × 12 cm, and the first computerized image-based analysis was published in 1981. It was invented by Patrick O. Brown. An example of its application is in SNP arrays for polymorphisms in cardiovascular diseases, cancer, pathogens and GWAS analysis. It is also used for the identification of structural variations and the measurement of gene expression.

Biochip: Substrates performing biochemical reactions

In molecular biology, biochips are engineered substrates that can host large numbers of simultaneous biochemical reactions. One of the goals of biochip technology is to efficiently screen large numbers of biological analytes, with potential applications ranging from disease diagnosis to detection of bioterrorism agents. For example, digital microfluidic biochips are under investigation for applications in biomedical fields. In a digital microfluidic biochip, a group of adjacent cells in the microfluidic array can be configured to work as storage or to perform functional operations, as well as to transport fluid droplets dynamically.

The transcriptome is the set of all RNA transcripts, including coding and non-coding, in an individual or a population of cells. The term can also sometimes be used to refer to all RNAs, or just mRNA, depending on the particular experiment. The term transcriptome is a portmanteau of the words transcript and genome; it is associated with the process of transcript production during the biological process of transcription.

Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology.

Gene expression profiling

In the field of molecular biology, gene expression profiling is the measurement of the activity of thousands of genes at once, to create a global picture of cellular function. These profiles can, for example, distinguish between cells that are actively dividing, or show how the cells react to a particular treatment. Many experiments of this sort measure an entire genome simultaneously, that is, every gene present in a particular cell.

Genetic analysis

Genetic analysis is the overall process of studying and researching in fields of science that involve genetics and molecular biology. There are a number of applications that are developed from this research, and these are also considered parts of the process. The base system of analysis revolves around general genetics. Basic studies include identification of genes and inherited disorders. This research has been conducted for centuries on both a large-scale physical observation basis and on a more microscopic scale. Genetic analysis can be used generally to describe methods both used in and resulting from the sciences of genetics and molecular biology, or to applications resulting from this research.

Reactome is a free online database of biological pathways. There are several Reactomes that concentrate on specific organisms; the largest of these is focused on human biology, and the following description concentrates on the human Reactome. It is authored by biologists, in collaboration with Reactome editorial staff. The content is cross-referenced to many bioinformatics databases. The rationale behind Reactome is to visually represent biological pathways in full mechanistic detail, while making the source data available in a computationally accessible format.

Genevestigator is an application consisting of a gene expression database and tools to analyse the data. It exists in two versions, biomedical and plant, depending on the species of the underlying microarray and RNAseq as well as single-cell RNA-sequencing data. It was started in January 2004 by scientists from ETH Zurich and is currently developed and commercialized by Nebion AG.

Microarray analysis techniques

Microarray analysis techniques are used in interpreting the data generated from experiments on DNA, RNA, and protein microarrays, which allow researchers to investigate the expression state of a large number of genes - in many cases, an organism's entire genome - in a single experiment. Such experiments can generate very large amounts of data, allowing researchers to assess the overall state of a cell or organism. Data in such large quantities is difficult - if not impossible - to analyze without the help of computer programs.

The Functional GEnomics Data Society (FGED) was a non-profit, volunteer-run international organization of biologists, computer scientists, and data analysts that aimed to facilitate biological and biomedical discovery through data integration. The approach of FGED was to promote the sharing of basic research data generated primarily via high-throughput technologies that generate large data sets within the domain of functional genomics.

Within computational biology, an MA plot is an application of a Bland–Altman plot for visual representation of genomic data. The plot visualizes the differences between measurements taken in two samples, by transforming the data onto M and A scales, then plotting these values. Though originally applied in the context of two channel DNA microarray gene expression data, MA plots are also used to visualise high-throughput sequencing analysis.
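The transformation onto the M and A scales can be made concrete with a short sketch. It assumes the usual two-channel definitions, M = log2(R) - log2(G) (the log ratio) and A = (log2(R) + log2(G)) / 2 (the mean log intensity), where R and G are the red and green channel intensities; the function name `ma_transform` is chosen here for illustration only.

```python
import math


def ma_transform(red, green):
    """Return (M, A) lists for paired, strictly positive intensities."""
    m_values, a_values = [], []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        m_values.append(log_r - log_g)          # M: log ratio of the two channels
        a_values.append(0.5 * (log_r + log_g))  # A: mean log intensity
    return m_values, a_values
```

Plotting M against A then gives the MA plot; spots with equal intensity in both channels fall on the line M = 0.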

MAGIChip

MAGIChips, also known as "microarrays of gel-immobilized compounds on a chip" or "three-dimensional DNA microarrays", are devices for molecular hybridization produced by immobilizing oligonucleotides, DNA, enzymes, antibodies, and other compounds on a photopolymerized micromatrix of polyacrylamide gel pads of 100 × 100 × 20 µm or smaller. This technology is used for analysis of nucleic acid hybridization, specific binding of DNA and low-molecular-weight compounds with proteins, and protein-protein interactions.

Immunomics is the study of immune system regulation and response to pathogens using genome-wide approaches. With the rise of genomic and proteomic technologies, scientists have been able to visualize biological networks and infer interrelationships between genes and/or proteins; recently, these technologies have been used to help better understand how the immune system functions and how it is regulated. Two thirds of the genome is active in one or more immune cell types and less than 1% of genes are uniquely expressed in a given type of cell. Therefore, it is critical that the expression patterns of these immune cell types be deciphered in the context of a network, and not individually, so that their roles can be correctly characterized and related to one another. Research on defects of the immune system, such as autoimmune diseases, immunodeficiency, and malignancies, can benefit from genomic insights into pathological processes. For example, analyzing the systematic variation of gene expression can relate expression patterns to specific diseases and to gene networks important for immune functions.

Translational bioinformatics (TBI) is a field that emerged in the 2010s to study health informatics, focused on the convergence of molecular bioinformatics, biostatistics, statistical genetics and clinical informatics. Its focus is on applying informatics methodology to the increasing amount of biomedical and genomic data to formulate knowledge and medical tools, which can be utilized by scientists, clinicians, and patients. Furthermore, it involves applying biomedical research to improve human health through the use of computer-based information systems. TBI employs data mining and the analysis of biomedical informatics data to generate clinical knowledge for application. Clinical knowledge includes finding similarities in patient populations and interpreting biological information to suggest therapies and predict health outcomes.

WormBase is an online biological database about the biology and genome of the nematode model organism Caenorhabditis elegans and contains information about other related nematodes. WormBase is used by the C. elegans research community both as an information resource and as a place to publish and distribute their results. The database is regularly updated with new versions being released every two months. WormBase is one of the organizations participating in the Generic Model Organism Database (GMOD) project.

The Expression Atlas is a database maintained by the European Bioinformatics Institute that provides information on gene expression patterns from RNA-Seq and microarray studies, and protein expression from proteomics studies. The Expression Atlas allows searches by gene, splice variant, protein attribute, disease, treatment or organism part. Individual genes or gene sets can be searched for. All datasets in the Expression Atlas have their metadata manually curated and their data analysed through standardised analysis pipelines. There are two components to the Expression Atlas, the Baseline Atlas and the Differential Atlas.

Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.

Minimum information standards are sets of guidelines and formats for reporting data derived by specific high-throughput methods. Their purpose is to ensure the data generated by these methods can be easily verified, analysed and interpreted by the wider scientific community. Ultimately, they facilitate the transfer of data from journal articles into databases in a form that enables data to be mined across multiple data sets. Minimum information standards are available for a wide variety of experiment types including microarray (MIAME), RNAseq (MINSEQE), metabolomics (MSI) and proteomics (MIAPE).
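As an illustration of how a repository might apply such a standard at submission time, the sketch below checks a submission record against a checklist. The six descriptions paraphrase the elements of the MIAME checklist, but the dictionary keys, the `missing_elements` function, and the shape of the submission record are hypothetical and are not an official schema.

```python
# The six descriptions paraphrase the MIAME checklist; the dictionary keys
# are hypothetical field names, not an official schema.
MIAME_ELEMENTS = {
    "raw_data": "raw data for each hybridisation",
    "processed_data": "final processed (normalised) data",
    "sample_annotation": "sample annotation and experimental factors",
    "experimental_design": "experimental design and sample relationships",
    "array_annotation": "annotation of the array (probe identifiers)",
    "protocols": "laboratory and data processing protocols",
}


def missing_elements(submission: dict) -> list:
    """Return checklist keys that are absent or empty in a submission."""
    return [key for key in MIAME_ELEMENTS if not submission.get(key)]
```

A submission would count as complete under this toy check only when `missing_elements` returns an empty list; real curation also assesses the quality of each element, not just its presence.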