MA plot

Last updated December 03, 2023. From Wikipedia, The Free Encyclopedia.

Within computational biology, an MA plot is an application of a Bland–Altman plot for visual representation of genomic data. The plot visualizes the differences between measurements taken in two samples by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values. Though originally applied to two-channel DNA microarray gene expression data, MA plots are also used to visualise the results of high-throughput sequencing analysis.^[1]^[2]

Explanation

Microarray data is often normalized within arrays to control for systematic biases in dye coupling and hybridization efficiencies, as well as other technical biases in the DNA probes and the print tip used to spot the array.^[3] Minimizing these systematic variations makes true biological differences easier to detect. To determine whether normalization is needed, one can plot Cy5 (R) intensities against Cy3 (G) intensities and see whether the slope of the line is around 1. An improved method, which is essentially a scaled, 45-degree rotation of the R vs. G plot, is the MA plot.^[4] The MA plot shows the log ratio of the red and green intensities ('M') plotted against the average log intensity ('A'). M and A are defined by the following equations.

M=\log _{2}(R/G)=\log _{2}(R)-\log _{2}(G)

A={\frac {1}{2}}\log _{2}(RG)={\frac {1}{2}}(\log _{2}(R)+\log _{2}(G))

M is, therefore, the binary logarithm of the intensity ratio (or the difference between log intensities), and A is the average log intensity for a dot in the plot. MA plots are then used to visualize the intensity-dependent ratio of raw microarray data (microarrays typically show a bias here, with higher A resulting in higher |M|, i.e. the brighter the spot, the more likely an observed difference between sample and control). The MA plot puts the variable M on the y-axis and A on the x-axis and gives a quick overview of the distribution of the data.
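The two definitions above can be checked with a few lines of R. This is a minimal sketch: the intensity values in `R` and `G` are made up for illustration and do not come from any real array.

```r
# Hypothetical two-channel intensities for five spots (illustrative values only).
R <- c(1200, 800, 150, 4000, 64)   # red (Cy5) channel
G <- c(600, 800, 300, 1000, 64)    # green (Cy3) channel

# M: log2 ratio of the two channels; A: average of the two log2 intensities.
M <- log2(R) - log2(G)
A <- (log2(R) + log2(G)) / 2

# Draw the MA plot only in an interactive session, with M on the y-axis
# and A on the x-axis.
if (interactive()) {
  plot(A, M, xlab = "A (average log2 intensity)", ylab = "M (log2 ratio)")
  abline(h = 0, lty = 2)  # spots with equal intensity in both channels lie here
}
```

A spot that is twice as bright in red as in green has M = 1, and a spot with equal intensities has M = 0, regardless of how bright it is overall.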

In many microarray gene expression experiments, an underlying assumption is that most genes do not change in expression; therefore, the majority of the points should lie near M = 0 on the y-axis, since log(1) is 0. If this is not the case, then a normalization method such as LOESS should be applied to the data before statistical analysis. (In the figure below, the red line runs below the zero mark before normalization and is not straight, which indicates that the data should be normalized. After normalization, the line is straight along zero and is shown in pink/black.)
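The idea of LOESS normalization can be sketched with base R's `loess` function on simulated data. Everything here is an assumption made for illustration: the shape of the simulated bias, the noise level, and the `span` and `degree` settings. It is not the procedure used by any particular Bioconductor package.

```r
set.seed(1)                            # reproducible simulated data
n <- 2000
A <- runif(n, min = 6, max = 14)       # simulated average log2 intensities

# Assumed intensity-dependent bias: large at low A, fading at high A.
bias <- 1.5 * exp(-(A - 6) / 3) - 0.2
M <- bias + rnorm(n, sd = 0.3)         # most "genes" are unchanged apart from bias

# Fit a smooth trend of M as a function of A, then subtract it.
fit <- loess(M ~ A, span = 0.4, degree = 1)
M_norm <- M - fitted(fit)

# Draw the before/after plots only in an interactive session.
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  plot(A, M, pch = ".", main = "Pre normalization")
  abline(h = 0, lty = 2)
  lines(sort(A), fitted(fit)[order(A)], col = "red")
  plot(A, M_norm, pch = ".", main = "Post normalization")
  abline(h = 0, lty = 2)
  par(op)
}
```

After subtracting the fitted trend, the cloud of points is centred on M = 0 across the whole range of A, which is what the assumption of mostly unchanged genes predicts.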

Packages

Several Bioconductor packages for the R software provide facilities for creating MA plots. These include affy (ma.plot, mva.pairs), limma (plotMA), marray (maPlot), and edgeR (maPlot).

Similar "RA" plots can be generated using the raPlot function in the caroline CRAN R package.

An interactive MA plot that can filter genes by M, A and p-values, search by name or with a lasso, and save selected genes is available as the R Shiny application Enhanced-MA-Plot.

Example in the R programming language

    library(affy)
    if (require(affydata)) {
      data(Dilution)
    }
    # expression values for the two arrays being compared
    y <- exprs(Dilution)[, c("20B", "10A")]
    x11()
    ma.plot(rowMeans(log2(y)), log2(y[, 1]) - log2(y[, 2]), cex = 1)
    title("Dilutions Dataset (array 20B v 10A)")
    library(preprocessCore)
    # do a quantile normalization
    x <- normalize.quantiles(y)
    x11()
    ma.plot(rowMeans(log2(x)), log2(x[, 1]) - log2(x[, 2]), cex = 1)
    title("Post Norm: Dilutions Dataset (array 20B v 10A)")

Figure: MA plots of the Dilutions dataset (array 20B v 10A), pre-normalization and post-normalization.
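The quantile normalization step in the example can be illustrated in base R. This is a simplified sketch, not the implementation in preprocessCore: the helper `quantile_normalize` and the small matrix `y_demo` are hypothetical, and ties are broken by order of appearance, which may differ from how `normalize.quantiles` treats them.

```r
# Hypothetical helper: force every column of x to share the same distribution.
quantile_normalize <- function(x) {
  # Rank of each value within its column (ties broken by order of appearance).
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the k-th smallest values across columns.
  ref <- rowMeans(apply(x, 2, sort))
  # Replace each value by the reference value that has the same rank.
  out <- apply(ranks, 2, function(r) ref[r])
  dimnames(out) <- dimnames(x)
  out
}

# Illustrative intensities for four probes on three arrays (made-up values).
y_demo <- cbind(a = c(5, 2, 3, 4),
                b = c(4, 1, 4.5, 2),
                c = c(3, 4, 6, 8))
y_norm <- quantile_normalize(y_demo)

# M and A for arrays a and b after normalization.
M_ab <- log2(y_norm[, "a"]) - log2(y_norm[, "b"])
A_ab <- (log2(y_norm[, "a"]) + log2(y_norm[, "b"])) / 2
```

Because every column ends up with the same set of values, the M values between any two normalized arrays are symmetric around zero by construction.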


References

[1] Robinson, M. D.; McCarthy, D. J.; Smyth, G. K. (11 November 2009). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data". Bioinformatics. 26 (1): 139–140. doi:10.1093/bioinformatics/btp616. PMC 2796818. PMID 19910308.
[2] Love, Michael I.; Huber, Wolfgang; Anders, Simon (5 December 2014). "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2". Genome Biology. 15 (12): 550. doi:10.1186/s13059-014-0550-8. PMC 4302049. PMID 25516281.
[3] Yang, Y. H.; Dudoit, S.; Luu, P.; Lin, D. M.; Peng, V.; Ngai, J.; Speed, T. P. (2002). "Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation". Nucleic Acids Research. 30 (4): e15.
[4] Dudoit, S.; Yang, Y. H.; Callow, M. J.; Speed, T. P. (2002). "Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments". Statistica Sinica. 12 (1): 111–139.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
