DESeq2

Last updated
Original author(s) Michael Love
Constantin Ahlmann-Eltze
Kwame Forbes
Simon Anders
Wolfgang Huber
Initial release22 March 2013;11 years ago (2013-03-22)
Stable release
1.40.2 / 20 August 2023;9 months ago (2023-08-20)
Repository DESeq2 on GitHub
Operating system Linux, macOS, Windows
Platform R programming language
Type Bioinformatics
License GNU Lesser General Public License
Website DESeq2 on Bioconductor

DESeq2 is a software package in the field of bioinformatics and computational biology for the statistical programming language R. It is primarily employed for the analysis of high-throughput RNA sequencing (RNA-seq) data to identify differentially expressed genes between different experimental conditions. DESeq2 employs statistical methods to normalize and analyze RNA-seq data, making it a valuable tool for researchers studying gene expression patterns and regulation. It is available through the Bioconductor repository.

Contents

It was first presented in 2014. [1] As of September 2023, its use has been cited over 30,000 times. [2]

Features

One of the key steps in the analysis of RNA-seq data is data normalization. [3] DESeq2 employs the "size factor" normalization method, which adjusts for differences in sequencing depth between samples. [1] This normalization ensures that the expression values of genes are comparable across samples, allowing for accurate identification of differentially expressed genes. In addition to size factor normalization, DESeq2 also employs a variance-stabilizing transformation, which further enhances the quality of the data by stabilizing the variance across different expression levels. [4] This combination of normalization techniques minimizes bias and improves the accuracy of differential expression analysis.

DESeq2 makes available negative binomial distribution models to account for the over-dispersion commonly observed in RNA-seq data. [5] This modeling approach takes into consideration the variability that is not adequately explained by a simple Poisson distribution. By incorporating the negative binomial distribution, DESeq2 accurately models the dispersion of gene expression counts and provides more reliable estimates of differential expression.

DESeq2 also offers an adaptive shrinkage procedure, known as the "apeglm" method, which is particularly useful when dealing with small sample sizes. [6] This technique effectively shrinks the log-fold changes of gene expression estimates, reducing the impact of extreme values and improving the stability of results. This is especially valuable for researchers working with limited biological replicates, as it helps to mitigate the problem of low statistical power.

Further, DESeq2 allows users to incorporate relevant covariates into their analyses. [1] This feature enables researchers to account for potential confounding factors, such as batch effects or experimental conditions, that can influence gene expression. By including covariates in the analysis, DESeq2 offers a more accurate assessment of the true differential expression patterns in the data.

Use

DESeq2 is interfaced through R, via the bioconductor repository. [7] The repository provides comprehensive documentation and tutorials, making it accessible to a wide range of researchers.

Related Research Articles

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. It can be performed on the entire genome, transcriptome or proteome of an organism, and can also involve only selected segments or regions, like tandem repeats and transposable elements. Methodologies used include sequence alignment, searches against biological databases, and others.

<span class="mw-page-title-main">Real-time polymerase chain reaction</span> Laboratory technique of molecular biology

A real-time polymerase chain reaction is a laboratory technique of molecular biology based on the polymerase chain reaction (PCR). It monitors the amplification of a targeted DNA molecule during the PCR, not at its end, as in conventional PCR. Real-time PCR can be used quantitatively and semi-quantitatively.

<span class="mw-page-title-main">Microarray analysis techniques</span>

Microarray analysis techniques are used in interpreting the data generated from experiments on DNA, RNA, and protein microarrays, which allow researchers to investigate the expression state of a large number of genes – in many cases, an organism's entire genome – in a single experiment. Such experiments can generate very large amounts of data, allowing researchers to assess the overall state of a cell or organism. Data in such large quantities is difficult – if not impossible – to analyze without the help of computer programs.

lumi is a free, open source and open development software project for the analysis and comprehension of Illumina expression and methylation microarray data. The project was started in the summer of 2006 and set out to provide algorithms and data management tools of Illumina in the framework of Bioconductor. It is based on the statistical R programming language.

Within computational biology, an MA plot is an application of a Bland–Altman plot for visual representation of genomic data. The plot visualizes the differences between measurements taken in two samples, by transforming the data onto M and A scales, then plotting these values. Though originally applied in the context of two channel DNA microarray gene expression data, MA plots are also used to visualise high-throughput sequencing analysis.

ChIP-sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest. Previously, ChIP-on-chip was the most common technique utilized to study these protein–DNA relations.

<span class="mw-page-title-main">RNA-Seq</span> Lab technique in cellular biology

RNA-Seq is a technique that uses next-generation sequencing to reveal the presence and quantity of RNA molecules in a biological sample, providing a snapshot of gene expression in the sample, also known as transcriptome.

Fold change is a measure describing how much a quantity changes between an original and a subsequent measurement. It is defined as the ratio between the two quantities; for quantities A and B the fold change of B with respect to A is B/A. In other words, a change from 30 to 60 is defined as a fold-change of 2. This is also referred to as a "one fold increase". Similarly, a change from 30 to 15 is referred to as a "0.5-fold decrease". Fold change is often used when analysing multiple measurements of a biological system taken at different times as the change described by the ratio between the time points is easier to interpret than the difference.

Flow cytometry bioinformatics is the application of bioinformatics to flow cytometry data, which involves storing, retrieving, organizing and analyzing flow cytometry data using extensive computational resources and tools. Flow cytometry bioinformatics requires extensive use of and contributes to the development of techniques from computational statistics and machine learning. Flow cytometry and related methods allow the quantification of multiple independent biomarkers on large numbers of single cells. The rapid growth in the multidimensionality and throughput of flow cytometry data, particularly in the 2000s, has led to the creation of a variety of computational analysis methods, data standards, and public databases for the sharing of results.

<span class="mw-page-title-main">Alicia Oshlack</span> Australian bioinformatician

Alicia Yinema Kate Nungarai Oshlack is an Australian bioinformatician and is Co-Head of Computational Biology at the Peter MacCallum Cancer Centre in Melbourne, Victoria, Australia. She is best known for her work developing methods for the analysis of transcriptome data as a measure of gene expression. She has characterized the role of gene expression in human evolution by comparisons of humans, chimpanzees, orangutans, and rhesus macaques, and works collaboratively in data analysis to improve the use of clinical sequencing of RNA samples by RNAseq for human disease diagnosis.

The Expression Atlas is a database maintained by the European Bioinformatics Institute that provides information on gene expression patterns from RNA-Seq and Microarray studies, and protein expression from Proteomics studies. The Expression Atlas allows searches by gene, splice variant, protein attribute, disease, treatment or organism part. Individual genes or gene sets can be searched for. All datasets in Expression Atlas have its metadata manually curated and its data analysed through standardised analysis pipelines. There are two components to the Expression Atlas, the Baseline Atlas and the Differential Atlas:

Single-cell transcriptomics examines the gene expression level of individual cells in a given population by simultaneously measuring the RNA concentration of hundreds to thousands of genes. Single-cell transcriptomics makes it possible to unravel heterogeneous cell populations, reconstruct cellular developmental pathways, and model transcriptional dynamics — all previously masked in bulk RNA sequencing.

Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.

Jean Yee Hwa Yang is an Australian statistician known for her work on variance reduction for microarrays, and for inferring proteins from mass spectrometry data. Yang is a Professor in the School of Mathematics and Statistics at the University of Sydney.

<span class="mw-page-title-main">Trajectory inference</span>

Trajectory inference or pseudotemporal ordering is a computational technique used in single-cell transcriptomics to determine the pattern of a dynamic process experienced by cells and then arrange cells based on their progression through the process. Single-cell protocols have much higher levels of noise than bulk RNA-seq, so a common step in a single-cell transcriptomics workflow is the clustering of cells into subgroups. Clustering can contend with this inherent variation by combining the signal from many cells, while allowing for the identification of cell types. However, some differences in gene expression between cells are the result of dynamic processes such as the cell cycle, cell differentiation, or response to an external stimuli. Trajectory inference seeks to characterize such differences by placing cells along a continuous path that represents the evolution of the process rather than dividing cells into discrete clusters. In some methods this is done by projecting cells onto an axis called pseudotime which represents the progression through the process.

CITE-Seq is a method for performing RNA sequencing along with gaining quantitative and qualitative information on surface proteins with available antibodies on a single cell level. So far, the method has been demonstrated to work with only a few proteins per cell. As such, it provides an additional layer of information for the same cell by combining both proteomics and transcriptomics data. For phenotyping, this method has been shown to be as accurate as flow cytometry by the groups that developed it. It is currently one of the main methods, along with REAP-Seq, to evaluate both gene expression and protein levels simultaneously in different species.

<span class="mw-page-title-main">Cellular deconvolution</span> Set of computational techniques

Cellular deconvolution refers to computational techniques aiming at estimating the proportions of different cell types in samples collected from a tissue. For example, samples collected from the human brain are a mixture of various neuronal and glial cell types in different proportions, where each cell type has a diverse gene expression profile. Since most high-throughput technologies use bulk samples and measure the aggregated levels of molecular information for all cells in a sample, the measured values would be an aggregate of the values pertaining to the expression landscape of different cell types. Therefore, many downstream analyses such as differential gene expression might be confounded by the variations in cell type proportions when using the output of high-throughput technologies applied to bulk samples. The development of statistical methods to identify cell type proportions in large-scale bulk samples is an important step for better understanding of the relationship between cell type composition and diseases.

Single-cell genome and epigenome by transposases sequencing (scGET-seq) is a DNA sequencing method for profiling open and closed chromatin. In contrast to single-cell assay for transposase-accessible chromatin with sequencing (scATAC-seq), which only targets active euchromatin, scGET-seq is also capable of probing inactive heterochromatin.

<span class="mw-page-title-main">BRB-seq</span> Technology for RNA sequencing

Bulk RNA barcoding and sequencing (BRB-seq) is an ultra-high-throughput bulk 3' mRNA-seq technology that uses early-stage sample barcoding and unique molecular identifiers (UMIs) to allow the pooling of up to 384 samples in one tube early in the sequencing library preparation workflow. The transcriptomic technology is compatible with both Illumina and MGI short-read sequencing instruments.

References

  1. 1 2 3 Love, Michael I; Huber, Wolfgang; Anders, Simon (December 2014). "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2". Genome Biology. 15 (12): 550. doi: 10.1186/s13059-014-0550-8 . PMC   4302049 . PMID   25516281.
  2. Love, M. I.; Huber, W.; Anders, S. (2014). "Citation Metrics". Genome Biology. 15 (12). University of Otago: 550. doi: 10.1186/s13059-014-0550-8 . PMC   4302049 . PMID   25516281.
  3. Evans, Ciaran; Hardin, Johanna; Stoebel, Daniel M (28 September 2018). "Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions". Briefings in Bioinformatics. 19 (5): 776–792. doi:10.1093/bib/bbx008. PMC   6171491 . PMID   28334202.
  4. "varianceStabilizingTransformation: Apply a variance stabilizing transformation (VST) to the..." rdrr.io. Archived from the original on 28 September 2023. Retrieved 28 September 2023.
  5. "Gene-level differential expression analysis". HBC Training. Github.io. 15 May 2020. Archived from the original on 28 September 2023. Retrieved 28 September 2023.
  6. Chipman, Hugh A.; Kolaczyk, Eric D.; McCulloch, Robert E. (December 1997). "Adaptive Bayesian Wavelet Shrinkage". Journal of the American Statistical Association. 92 (440): 1413. doi:10.2307/2965411. JSTOR   2965411.
  7. "DESeq2: An Overview of a Popular RNA-Seq Analysis Package". pluto.bio. 18 October 2021. Archived from the original on 27 September 2023. Retrieved 27 September 2023.