In molecular biology, a batch effect occurs when non-biological factors in an experiment cause changes in the data produced by the experiment. Such effects can lead to inaccurate conclusions when their causes are correlated with one or more outcomes of interest in an experiment. They are common in many types of high-throughput sequencing experiments, including those using microarrays, mass spectrometers, [1] and single-cell RNA-sequencing data. [2] They are most commonly discussed in the context of genomics and high-throughput sequencing research, but they exist in other fields of science as well. [1]
Multiple definitions of the term "batch effect" have been proposed in the literature. Lazar et al. (2013) noted, "Providing a complete and unambiguous definition of the so-called batch effect is a challenging task, especially because its origins and the way it manifests in the data are not completely known or not recorded." Focusing on microarray experiments, they propose a new definition based on several previous ones: "[T]he batch effect represents the systematic technical differences when samples are processed and measured in different batches and which are unrelated to any biological variation recorded during the MAGE [microarray gene expression] experiment." [3]
Many potentially variable factors have been identified as potential causes of batch effects, including the following:
Various statistical techniques have been developed to attempt to correct for batch effects in high-throughput experiments. These techniques are intended for use during the stages of experimental design and data analysis. They have historically mostly focused on genomics experiments, and have only recently begun to expand into other scientific fields such as proteomics. [5] One problem associated with such techniques is that they may unintentionally remove actual biological variation. [6] Some techniques that have been used to detect and/or correct for batch effects include the following:
Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.
A DNA microarray is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome. Each DNA spot contains picomoles of a specific DNA sequence, known as probes. These can be a short section of a gene or other DNA element that are used to hybridize a cDNA or cRNA sample under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detection of fluorophore-, silver-, or chemiluminescence-labeled targets to determine relative abundance of nucleic acid sequences in the target. The original nucleic acid arrays were macro arrays approximately 9 cm × 12 cm and the first computerized image based analysis was published in 1981. It was invented by Patrick O. Brown. An example of its application is in SNPs arrays for polymorphisms in cardiovascular diseases, cancer, pathogens and GWAS analysis. It is also used for the identification of structural variations and the measurement of gene expression.
Functional genomics is a field of molecular biology that attempts to describe gene functions and interactions. Functional genomics make use of the vast data generated by genomic and transcriptomic projects. Functional genomics focuses on the dynamic aspects such as gene transcription, translation, regulation of gene expression and protein–protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional "candidate-gene" approach.
The transcriptome is the set of all RNA transcripts, including coding and non-coding, in an individual or a population of cells. The term can also sometimes be used to refer to all RNAs, or just mRNA, depending on the particular experiment. The term transcriptome is a portmanteau of the words transcript and genome; it is associated with the process of transcript production during the biological process of transcription.
In the field of molecular biology, gene expression profiling is the measurement of the activity of thousands of genes at once, to create a global picture of cellular function. These profiles can, for example, distinguish between cells that are actively dividing, or show how the cells react to a particular treatment. Many experiments of this sort measure an entire genome simultaneously, that is, every gene present in a particular cell.
An RNA spike-in is an RNA transcript of known sequence and quantity used to calibrate measurements in RNA hybridization assays, such as DNA microarray experiments, RT-qPCR, and RNA-Seq.
Microarray analysis techniques are used in interpreting the data generated from experiments on DNA, RNA, and protein microarrays, which allow researchers to investigate the expression state of a large number of genes - in many cases, an organism's entire genome - in a single experiment. Such experiments can generate very large amounts of data, allowing researchers to assess the overall state of a cell or organism. Data in such large quantities is difficult - if not impossible - to analyze without the help of computer programs.
ChIP-sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest. Previously, ChIP-on-chip was the most common technique utilized to study these protein–DNA relations.
RIP-chip is a molecular biology technique which combines RNA immunoprecipitation with a microarray. The purpose of this technique is to identify which RNA sequences interact with a particular RNA binding protein of interest in vivo. It can also be used to determine relative levels of gene expression, to identify subsets of RNAs which may be co-regulated, or to identify RNAs that may have related functions. This technique provides insight into the post-transcriptional gene regulation which occurs between RNA and RNA binding proteins.
RNA-Seq is a sequencing technique that uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample, representing an aggregated snapshot of the cells' dynamic pool of RNAs, also known as transcriptome.
Integromics is a global bioinformatics company headquartered in Granada, Spain, with a second office in Madrid, subsidiaries in the United States and United Kingdom, and distributors in 10 countries. Integromics provides bioinformatics software for data management and data analysis in genomics and proteomics. The company provides a line of products that serve the gene expression, sequencing, and proteomics markets. Customers include genomic research centers, pharmaceutical companies, academic institutions, clinical research organizations, and biotechnology companies.
Extracellular RNA (exRNA) describes RNA species present outside of the cells in which they were transcribed. Carried within extracellular vesicles, lipoproteins, and protein complexes, exRNAs are protected from ubiquitous RNA-degrading enzymes. exRNAs may be found in the environment or, in multicellular organisms, within the tissues or biological fluids such as venous blood, saliva, breast milk, urine, semen, menstrual blood, and vaginal fluid. Although their biological function is not fully understood, exRNAs have been proposed to play a role in a variety of biological processes including syntrophy, intercellular communication, and cell regulation. The United States National Institutes of Health (NIH) published in 2012 a set of Requests for Applications (RFAs) for investigating extracellular RNA biology. Funded by the NIH Common Fund, the resulting program was collectively known as the Extracellular RNA Communication Consortium (ERCC). The ERCC was renewed for a second phase in 2019.
TopHat is an open-source bioinformatics tool for the throughput alignment of shotgun cDNA sequencing reads generated by transcriptomics technologies using Bowtie first and then mapping to a reference genome to discover RNA splice sites de novo. TopHat aligns RNA-Seq reads to mammalian-sized genomes.
Single-cell transcriptomics examines the gene expression level of individual cells in a given population by simultaneously measuring the RNA concentration of hundreds to thousands of genes. Single-cell transcriptomics makes it possible to unravel heterogeneous cell populations, reconstruct cellular developmental pathways, and model transcriptional dynamics — all previously masked in bulk RNA sequencing.
Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.
CITE-Seq is a method for performing RNA sequencing along with gaining quantitative and qualitative information on surface proteins with available antibodies on a single cell level. So far, the method has been demonstrated to work with only a few proteins per cell. As such, it provides an additional layer of information for the same cell by combining both proteomics and transcriptomics data. For phenotyping, this method has been shown to be as accurate as flow cytometry by the groups that developed it. It is currently one of the main methods, along with REAP-Seq, to evaluate both gene expression and protein levels simultaneously in different species.
Bloom filters are space-efficient probabilistic data structures used to test whether an element is a part of a set. Bloom filters require much less space than other data structures for representing sets, however the downside of Bloom filters is that there is a false positive rate when querying the data structure. Since multiple elements may have the same hash values for a number of hash functions, then there is a probability that querying for a non-existent element may return a positive if another element with the same hash values has been added to the Bloom filter. Assuming that the hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of querying a Bloom filter is a function of the number of bits, number of hash functions and number of elements of the Bloom filter. This allows the user to manage the risk of a getting a false positive by compromising on the space benefits of the Bloom filter.
Hilary S. Parker is an American biostatistician and data scientist. She was formerly a senior data analyst at the fashion merchandising company Stitch Fix. Parker co-hosts the data analytics podcast Not So Standard Deviations with Roger Peng. She received her PhD in biostatistics from the Johns Hopkins Bloomberg School of Public Health and has formerly been employed by Etsy.