A microarray database is a repository containing microarray gene expression data. The key uses of a microarray database are to store the measurement data, manage a searchable index, and make the data available to other applications for analysis and interpretation (either directly, or via user downloads).
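The three uses named above (storing measurements, keeping a searchable index, and serving data to other applications) can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The two-table schema, the function names, and the intensity values are hypothetical and chosen only for illustration; real microarray databases use far richer schemas and annotation.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sample (
            sample_id TEXT PRIMARY KEY,
            tissue    TEXT NOT NULL
        );
        CREATE TABLE measurement (
            sample_id TEXT NOT NULL REFERENCES sample(sample_id),
            gene      TEXT NOT NULL,
            intensity REAL NOT NULL,
            PRIMARY KEY (sample_id, gene)
        );
        -- Searchable index: find all measurements for a gene quickly.
        CREATE INDEX idx_measurement_gene ON measurement(gene);
        """
    )
    # Illustrative (made-up) intensity values, not real experimental data.
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?)",
        [("S1", "liver"), ("S2", "brain")],
    )
    conn.executemany(
        "INSERT INTO measurement VALUES (?, ?, ?)",
        [("S1", "TP53", 812.0), ("S2", "TP53", 455.5), ("S1", "GAPDH", 15230.0)],
    )
    conn.commit()
    return conn


def intensities_for_gene(conn, gene):
    """Return (sample_id, tissue, intensity) rows for one gene, ordered by sample."""
    cursor = conn.execute(
        "SELECT s.sample_id, s.tissue, m.intensity "
        "FROM measurement AS m "
        "JOIN sample AS s ON m.sample_id = s.sample_id "
        "WHERE m.gene = ? "
        "ORDER BY s.sample_id",
        (gene,),
    )
    return cursor.fetchall()
```

An analysis tool could call `intensities_for_gene(conn, "TP53")` to retrieve every stored measurement for that gene together with its sample annotation, which corresponds to the "make the data available to other applications" role.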
Microarray databases can fall into two distinct classes:
Some of the best-known public, curated microarray databases are:
Database | Scope | Microarray experiment sets | Sample profiles | As of date |
---|---|---|---|---|
ArrayTrack | ArrayTrack hosts both public and private data, including MAQC benchmark data, with integrated analysis tools | 1622 | 50,093 | Feb 2012 |
NCI mAdb | Hosts NCI data with integrated analysis and statistics tools | ? | 105,000 | Mar 2012 |
ImmGen database | Open access across all immune system cells; expression data, differential expression, coregulated clusters, regulation | 267 | 1059 | Jan 2012 |
Genevestigator | Gene expression search engine based on manually curated, well annotated public and proprietary microarray and RNA-seq datasets | 3228 | 232,855 | October 2016 |
Gene Expression Omnibus - NCBI | any curated MIAME compliant molecular abundance study | 25859 | 641770 | October 28, 2011 |
ArrayExpress at EBI | Any curated MIAME or MINSEQE compliant transcriptomics data | 24838 | 708914 | October 28, 2011 |
Stanford Microarray database | private and published microarray and molecule abundance database (now defunct) | 82542 | ? | October 23, 2011 |
The Cancer Genome Atlas (TCGA) | collection of expression data for different cancers | 21229 | ? | August 30, 2013 |
GeneNetwork system | Open access standard arrays, exon arrays, and RNA-seq data for genetic analysis (eQTL studies) with analysis suite | ~100 | ~10000 | July, 2010 |
UNC modENCODE Microarray database | Nimblegen customer 2.1 million array | ~6 | 180 | July 17, 2009 |
UPSC-BASE | data generated by microarray analysis within Umeå Plant Science Centre (UPSC). | ~100 | ? | November 15, 2007 |
UPenn RAD database | MIAME compliant public and private studies, associated with ArrayExpress | ~100 | ~2500 | Sept. 1, 2007 |
UNC Microarray database | provides the service for microarray data storage, retrieval, analysis, and visualization | ~31 | 2093 | April 1, 2007 |
MUSC database | The database is a repository for DNA microarray data generated by MUSC investigators as well as researchers in the global research community. | ~45 | 555 | April 1, 2007 |
caArray at NCI | Cancer data, prepared for analysis on caBIG | 41 | 1741 | November 15, 2006 |
Biostatistics is a branch of statistics that applies statistical methods to a wide range of topics in biology. It encompasses the design of biological experiments, the collection and analysis of data from those experiments and the interpretation of the results.
Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.
A DNA microarray is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome. Each DNA spot contains picomoles of a specific DNA sequence, known as a probe. A probe can be a short section of a gene or other DNA element that is used to hybridize a cDNA or cRNA sample (the target) under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detection of fluorophore-, silver-, or chemiluminescence-labeled targets to determine the relative abundance of nucleic acid sequences in the target. The original nucleic acid arrays were macro arrays approximately 9 cm × 12 cm, and the first computerized image-based analysis was published in 1981. The DNA microarray was invented by Patrick O. Brown. Example applications include SNP arrays for polymorphisms in cardiovascular diseases, cancer, pathogens and GWAS analysis. Microarrays are also used to identify structural variations and to measure gene expression.
In molecular biology, biochips are engineered substrates that can host large numbers of simultaneous biochemical reactions. One of the goals of biochip technology is to efficiently screen large numbers of biological analytes, with potential applications ranging from disease diagnosis to detection of bioterrorism agents. For example, digital microfluidic biochips are under investigation for applications in biomedical fields. In a digital microfluidic biochip, a group of adjacent cells in the microfluidic array can be configured dynamically to store fluid droplets, perform functional operations on them, or transport them.
The transcriptome is the set of all RNA transcripts, including coding and non-coding, in an individual or a population of cells. The term can also sometimes be used to refer to all RNAs, or just mRNA, depending on the particular experiment. The term transcriptome is a portmanteau of the words transcript and genome; it is associated with the process of transcript production during the biological process of transcription.
Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology.
In the field of molecular biology, gene expression profiling is the measurement of the activity of thousands of genes at once, to create a global picture of cellular function. These profiles can, for example, distinguish between cells that are actively dividing, or show how the cells react to a particular treatment. Many experiments of this sort measure an entire genome simultaneously, that is, every gene present in a particular cell.
Genetic analysis is the overall process of studying and researching in fields of science that involve genetics and molecular biology. There are a number of applications that are developed from this research, and these are also considered parts of the process. The base system of analysis revolves around general genetics. Basic studies include identification of genes and inherited disorders. This research has been conducted for centuries on both a large-scale physical observation basis and on a more microscopic scale. Genetic analysis can be used generally to describe methods both used in and resulting from the sciences of genetics and molecular biology, or to applications resulting from this research.
Reactome is a free online database of biological pathways. There are several Reactomes that concentrate on specific organisms; the largest of these is focused on human biology, and the following description concentrates on the human Reactome. It is authored by biologists, in collaboration with Reactome editorial staff. The content is cross-referenced to many bioinformatics databases. The rationale behind Reactome is to visually represent biological pathways in full mechanistic detail, while making the source data available in a computationally accessible format.
Genevestigator is an application consisting of a gene expression database and tools to analyse the data. It exists in two versions, biomedical and plant, depending on the species covered by the underlying microarray, RNA-seq and single-cell RNA-sequencing data. It was started in January 2004 by scientists from ETH Zurich and is currently developed and commercialized by Nebion AG.
Microarray analysis techniques are used in interpreting the data generated from experiments on DNA, RNA, and protein microarrays, which allow researchers to investigate the expression state of a large number of genes - in many cases, an organism's entire genome - in a single experiment. Such experiments can generate very large amounts of data, allowing researchers to assess the overall state of a cell or organism. Data in such large quantities is difficult - if not impossible - to analyze without the help of computer programs.
The Functional GEnomics Data Society (FGED) was a non-profit, volunteer-run international organization of biologists, computer scientists, and data analysts that aimed to facilitate biological and biomedical discovery through data integration. The approach of FGED was to promote the sharing of basic research data generated primarily via high-throughput technologies that generate large data sets within the domain of functional genomics.
Within computational biology, an MA plot is an application of a Bland–Altman plot for visual representation of genomic data. The plot visualizes the differences between measurements taken in two samples, by transforming the data onto M and A scales, then plotting these values. Though originally applied in the context of two channel DNA microarray gene expression data, MA plots are also used to visualise high-throughput sequencing analysis.
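The transformation onto the M and A scales can be sketched as follows. For two-channel data with red and green intensities R and G, the usual convention is M = log2(R) - log2(G), the log ratio, and A = (log2(R) + log2(G)) / 2, the mean log intensity. The function name `ma_transform` and the choice to reject non-positive intensities are assumptions made for this illustration, not part of any particular package.

```python
import math


def ma_transform(red, green):
    """Return (M, A) value lists for paired red/green channel intensities.

    M is the log2 ratio of the two channels; A is their mean log2 intensity.
    """
    if len(red) != len(green):
        raise ValueError("red and green must have the same length")

    m_values = []
    a_values = []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            # log2 is undefined for non-positive intensities.
            raise ValueError("intensities must be positive")
        log_r = math.log2(r)
        log_g = math.log2(g)
        m_values.append(log_r - log_g)          # difference of log intensities
        a_values.append(0.5 * (log_r + log_g))  # average of log intensities
    return m_values, a_values
```

Plotting M on the vertical axis against A on the horizontal axis gives the MA plot; spots with equal intensity in both channels fall on the line M = 0.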
MAGIChips, also known as "microarrays of gel-immobilized compounds on a chip" or "three-dimensional DNA microarrays", are devices for molecular hybridization produced by immobilizing oligonucleotides, DNA, enzymes, antibodies, and other compounds on a photopolymerized micromatrix of polyacrylamide gel pads of 100x100x20µm or smaller size. This technology is used for analysis of nucleic acid hybridization, specific binding of DNA, and low-molecular weight compounds with proteins, and protein-protein interactions.
Immunomics is the study of immune system regulation and response to pathogens using genome-wide approaches. With the rise of genomic and proteomic technologies, scientists have been able to visualize biological networks and infer interrelationships between genes and/or proteins; recently, these technologies have been used to help better understand how the immune system functions and how it is regulated. Two thirds of the genome is active in one or more immune cell types, and less than 1% of genes are uniquely expressed in a given type of cell. Therefore, it is critical that the expression patterns of these immune cell types be deciphered in the context of a network, and not individually, so that their roles can be correctly characterized and related to one another. The study of immune system defects such as autoimmune diseases, immunodeficiency, and malignancies can benefit from genomic insights into pathological processes. For example, analyzing the systematic variation of gene expression can relate these patterns to specific diseases and gene networks important for immune functions.
Translational bioinformatics (TBI) is a field that emerged in the 2010s to study health informatics, focused on the convergence of molecular bioinformatics, biostatistics, statistical genetics and clinical informatics. Its focus is on applying informatics methodology to the increasing amount of biomedical and genomic data to formulate knowledge and medical tools that can be used by scientists, clinicians, and patients. Furthermore, it involves applying biomedical research to improve human health through the use of computer-based information systems. TBI employs data mining and biomedical informatics analysis to generate clinical knowledge for application. Clinical knowledge includes finding similarities in patient populations and interpreting biological information to suggest therapy treatments and predict health outcomes.
WormBase is an online biological database about the biology and genome of the nematode model organism Caenorhabditis elegans and contains information about other related nematodes. WormBase is used by the C. elegans research community both as an information resource and as a place to publish and distribute their results. The database is regularly updated with new versions being released every two months. WormBase is one of the organizations participating in the Generic Model Organism Database (GMOD) project.
The Expression Atlas is a database maintained by the European Bioinformatics Institute that provides information on gene expression patterns from RNA-Seq and microarray studies, and protein expression from proteomics studies. The Expression Atlas allows searches by gene, splice variant, protein attribute, disease, treatment or organism part. Individual genes or gene sets can be searched for. All datasets in Expression Atlas have their metadata manually curated and their data analysed through standardised analysis pipelines. The Expression Atlas has two components: the Baseline Atlas and the Differential Atlas.
Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.
Minimum information standards are sets of guidelines and formats for reporting data derived by specific high-throughput methods. Their purpose is to ensure the data generated by these methods can be easily verified, analysed and interpreted by the wider scientific community. Ultimately, they facilitate the transfer of data from journal articles into databases in a form that enables data to be mined across multiple data sets. Minimum information standards are available for a wide variety of experiment types, including microarray (MIAME), RNA-seq (MINSEQE), metabolomics (MSI) and proteomics (MIAPE).