DESeq2

Original author(s) Michael Love
Constantin Ahlmann-Eltze
Kwame Forbes
Simon Anders
Wolfgang Huber
Initial release 22 March 2013; 11 years ago (2013-03-22)
Stable release 1.40.2 / 20 August 2023; 10 months ago (2023-08-20)
Repository DESeq2 on GitHub
Operating system Linux, macOS, Windows
Platform R programming language
Type Bioinformatics
License GNU Lesser General Public License
Website DESeq2 on Bioconductor

DESeq2 is a software package in the field of bioinformatics and computational biology for the statistical programming language R. It is primarily employed for the analysis of high-throughput RNA sequencing (RNA-seq) data to identify differentially expressed genes between different experimental conditions. DESeq2 employs statistical methods to normalize and analyze RNA-seq data, making it a valuable tool for researchers studying gene expression patterns and regulation. It is available through the Bioconductor repository.

DESeq2 was first presented in 2014. [1] As of September 2023, it had been cited over 30,000 times. [2]

Features

One of the key steps in the analysis of RNA-seq data is data normalization. [3] DESeq2 employs "size factor" normalization, which adjusts for differences in sequencing depth between samples. [1] Dividing each sample's counts by its size factor makes expression values comparable across samples, which is a prerequisite for identifying differentially expressed genes. DESeq2 also provides a variance-stabilizing transformation, which makes the variance approximately constant across expression levels. [4] Together, normalization and variance stabilization reduce bias in downstream analysis.
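Size factor estimation can be illustrated with a short sketch. The version below follows the median-of-ratios approach commonly described for DESeq2: each sample's size factor is the median, over genes, of the ratio between that sample's count and the gene's geometric mean across samples. The function names `size_factors` and `normalize` are chosen for this illustration and are not part of the package; the code is a simplified sketch, not the package's implementation.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample by the median-of-ratios idea.

    counts: list of genes, each a list of raw counts (one entry per sample).
    Illustrative sketch only; names and structure are assumptions.
    """
    n_samples = len(counts[0])
    # Keep genes with a positive count in every sample so logs are defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has positive counts in every sample")
    # Log of each gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the gene's geometric mean.
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts, factors):
    """Divide every count by the size factor of its sample."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Sample 2 was sequenced about twice as deeply as sample 1.
raw = [[10, 20], [100, 200], [5, 10], [0, 3]]
factors = size_factors(raw)
normalized = normalize(raw, factors)
```

Taking the median rather than the mean of the ratios keeps a handful of strongly changing genes from dominating the estimate of sequencing depth.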

DESeq2 fits negative binomial models to account for the overdispersion commonly observed in RNA-seq data. [5] This approach captures variability beyond what a simple Poisson distribution, whose variance equals its mean, can explain. By estimating a dispersion for each gene, DESeq2 models the spread of gene expression counts and provides more reliable estimates of differential expression.
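The difference between the two distributions can be made concrete with the mean-variance relationship. The sketch below assumes the parameterization in which a gene with mean count `mu` and dispersion `alpha` has variance `mu + alpha * mu**2`; the function names are illustrative only.

```python
def poisson_variance(mu):
    """Under a Poisson model the variance equals the mean."""
    return mu


def negative_binomial_variance(mu, alpha):
    """Variance under a negative binomial model with dispersion alpha.

    Assumed parameterization: Var = mu + alpha * mu^2, so alpha = 0
    recovers the Poisson case.
    """
    return mu + alpha * mu ** 2


mean_count = 100
dispersion = 0.25  # hypothetical value chosen for illustration
print(poisson_variance(mean_count))                      # 100
print(negative_binomial_variance(mean_count, dispersion))  # 2600.0
```

With these illustrative numbers the negative binomial variance is 26 times the Poisson variance, which shows how a Poisson model would understate the noise in a highly expressed gene.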

DESeq2 also offers an adaptive shrinkage procedure, the "apeglm" method, which is particularly useful with small sample sizes. [6] It shrinks the estimated log fold changes toward zero, reducing the impact of extreme values and improving the stability of results. This is especially valuable for researchers working with few biological replicates, where fold change estimates are noisy and statistical power is low.
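The general effect of shrinkage can be sketched with a deliberately simplified model. The code below is not apeglm, which is described as using a heavy-tailed prior; it substitutes a zero-centered normal prior, for which the posterior mean has a closed form. The function name `shrink_lfc` and the parameter `prior_sd` are assumptions made for this illustration.

```python
def shrink_lfc(lfc, se, prior_sd):
    """Shrink a log fold change toward zero under a normal prior.

    Simplified stand-in for adaptive shrinkage (NOT the apeglm method):
    with estimate ~ Normal(true_lfc, se^2) and true_lfc ~ Normal(0, prior_sd^2),
    the posterior mean is the estimate scaled by a weight between 0 and 1.
    """
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return weight * lfc


# Same raw estimate, different precision (hypothetical numbers).
noisy = shrink_lfc(lfc=4.0, se=2.0, prior_sd=1.0)    # imprecise estimate
precise = shrink_lfc(lfc=4.0, se=0.5, prior_sd=1.0)  # precise estimate
```

The imprecise estimate is pulled much closer to zero than the precise one, which is the behaviour that makes shrinkage useful when few replicates are available.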

Further, DESeq2 allows users to incorporate relevant covariates into their analyses. [1] This feature enables researchers to account for potential confounding factors, such as batch effects or experimental conditions, that can influence gene expression. By including covariates in the analysis, DESeq2 offers a more accurate assessment of the true differential expression patterns in the data.
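Covariates enter a model of this kind through a design matrix. The sketch below builds one by treatment (dummy) coding, roughly what an R design formula such as `~ batch + condition` would produce; the helper `design_matrix` and the sample annotations are hypothetical and serve only to show the structure.

```python
def design_matrix(samples, factors):
    """Build a treatment-coded design matrix.

    samples: list of dicts mapping factor name to level for each sample.
    factors: ordered list of factor names to include as covariates.
    The alphabetically first level of each factor is the reference level.
    Illustrative helper only; not part of any package.
    """
    levels = {f: sorted({s[f] for s in samples}) for f in factors}
    columns = ["Intercept"]
    for f in factors:
        # One indicator column per non-reference level.
        columns += [f"{f}_{lvl}" for lvl in levels[f][1:]]
    rows = []
    for s in samples:
        row = [1]  # intercept term
        for f in factors:
            row += [1 if s[f] == lvl else 0 for lvl in levels[f][1:]]
        rows.append(row)
    return columns, rows


# Hypothetical experiment: two batches, two conditions.
samples = [
    {"batch": "A", "condition": "control"},
    {"batch": "A", "condition": "treated"},
    {"batch": "B", "condition": "control"},
    {"batch": "B", "condition": "treated"},
]
columns, rows = design_matrix(samples, ["batch", "condition"])
```

Because the batch indicator has its own column, the coefficient on `condition_treated` reflects the treatment effect after the batch difference has been accounted for.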

Use

DESeq2 is used from within R and is distributed through the Bioconductor repository. [7] The repository provides comprehensive documentation and tutorials, making it accessible to a wide range of researchers.

References

  1. Love, Michael I; Huber, Wolfgang; Anders, Simon (December 2014). "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2". Genome Biology. 15 (12): 550. doi:10.1186/s13059-014-0550-8. PMC 4302049. PMID 25516281.
  2. Love, M. I.; Huber, W.; Anders, S. (2014). "Citation Metrics". Genome Biology. 15 (12). University of Otago: 550. doi:10.1186/s13059-014-0550-8. PMC 4302049. PMID 25516281.
  3. Evans, Ciaran; Hardin, Johanna; Stoebel, Daniel M (28 September 2018). "Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions". Briefings in Bioinformatics. 19 (5): 776–792. doi:10.1093/bib/bbx008. PMC 6171491. PMID 28334202.
  4. "varianceStabilizingTransformation: Apply a variance stabilizing transformation (VST) to the..." rdrr.io. Archived from the original on 28 September 2023. Retrieved 28 September 2023.
  5. "Gene-level differential expression analysis". HBC Training. Github.io. 15 May 2020. Archived from the original on 28 September 2023. Retrieved 28 September 2023.
  6. Chipman, Hugh A.; Kolaczyk, Eric D.; McCulloch, Robert E. (December 1997). "Adaptive Bayesian Wavelet Shrinkage". Journal of the American Statistical Association. 92 (440): 1413. doi:10.2307/2965411. JSTOR 2965411.
  7. "DESeq2: An Overview of a Popular RNA-Seq Analysis Package". pluto.bio. 18 October 2021. Archived from the original on 27 September 2023. Retrieved 27 September 2023.