Bayesian tool for methylation analysis

Bayesian tool for methylation analysis, also known as BATMAN, is a statistical tool for analysing methylated DNA immunoprecipitation (MeDIP) profiles. It can be applied to large datasets generated using either oligonucleotide arrays (MeDIP-chip) or next-generation sequencing (MeDIP-seq), providing a quantitative estimation of absolute methylation state in a region of interest. [1]

Figure: Batman workflow.

Theory

MeDIP (methylated DNA immunoprecipitation) is an experimental technique used to assess DNA methylation levels by using an antibody to isolate methylated DNA sequences. The isolated fragments of DNA are either hybridized to a microarray chip (MeDIP-chip) or sequenced by next-generation sequencing (MeDIP-seq). While this shows which areas of the genome are methylated, it does not give absolute methylation levels.

Imagine two different genomic regions, A and B. Region A has six CpGs (DNA methylation in mammalian somatic cells generally occurs at CpG dinucleotides [2]), three of which are methylated. Region B has three CpGs, all of which are methylated. As the antibody simply recognizes methylated DNA, it will bind both these regions equally, and subsequent steps will therefore show equal signals for the two regions. This does not give the full picture of methylation: in region A only half the CpGs are methylated, whereas in region B all of them are.

Therefore, to get the full picture of methylation for a given region, the signal from the MeDIP experiment has to be normalized to the number of CpGs in the region, and this is what the Batman algorithm does. Analysing the MeDIP signal of the above example would give Batman scores of 0.5 for region A (i.e. the region is 50% methylated) and 1 for region B (i.e. the region is 100% methylated). In this way Batman converts the signals from MeDIP experiments to absolute methylation levels.
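The two-region example can be written out as a toy calculation. This is only a sketch of the normalization idea, assuming the raw signal is simply proportional to the number of methylated CpGs; it is not Batman's Bayesian inference, and the function name is hypothetical.

```python
def naive_methylation_fraction(methylated_cpgs: int, total_cpgs: int) -> float:
    """Fraction of the CpGs in a region that are methylated."""
    if total_cpgs <= 0:
        raise ValueError("a region must contain at least one CpG")
    return methylated_cpgs / total_cpgs


# (methylated CpGs, total CpGs) for the two regions described in the text.
regions = {"A": (3, 6), "B": (3, 3)}

# Toy assumption: the antibody signal tracks the count of methylated CpGs,
# so both regions produce the same raw signal.
raw_signal = {name: methylated for name, (methylated, _) in regions.items()}

# Normalizing by CpG count separates the two regions.
scores = {
    name: naive_methylation_fraction(methylated, total)
    for name, (methylated, total) in regions.items()
}

print(raw_signal)  # {'A': 3, 'B': 3}
print(scores)      # {'A': 0.5, 'B': 1.0}
```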

Development of Batman

The core principle of the Batman algorithm is to model the effects of varying density of CpG dinucleotides, and the effect this has on MeDIP enrichment of DNA fragments. The basic assumptions of Batman:

  1. Almost all DNA methylation in mammals happens at CpG dinucleotides.
  2. Most CpG-poor regions are constitutively methylated while most CpG-rich regions (CpG islands) are constitutively unmethylated. [3]
  3. There are no fragment biases in the MeDIP experiment (the approximate range of DNA fragment sizes is 400–700 bp).
  4. The errors on the microarray are normally distributed with precision.
  5. Only methylated CpGs contribute to the observed signal.
  6. CpG methylation state is generally highly correlated over hundreds of bases, [4] so CpGs grouped together in 50- or 100-bp windows would have the same methylation state.

Basic parameters in Batman:

  1. C_cp: the coupling factor between probe p and CpG dinucleotide c, defined as the fraction of DNA molecules hybridizing to probe p that contain the CpG c.
  2. C_tot: the total CpG influence parameter, defined as the sum of the coupling factors for any given probe, which provides a measure of local CpG density.
  3. m_c: the methylation status at position c, which represents the fraction of chromosomes in the sample on which it is methylated. m_c is treated as a continuous variable, since the majority of samples used in MeDIP studies contain multiple cell types.
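The coupling factor and the total CpG influence can be illustrated with a small geometric sketch. It assumes, purely for illustration, that the probe is a single genomic position, that fragment lengths are equally likely across the 400–700 bp range quoted above, and that fragments covering the probe start at uniformly distributed positions. The function names are hypothetical, and Batman's own estimate of these quantities may differ.

```python
def coupling_factor(distance: int, fragment_lengths: range) -> float:
    """Fraction of fragments covering the probe that also contain a CpG
    lying `distance` bp away, averaged over equally likely lengths."""
    total = 0.0
    for length in fragment_lengths:
        # A fragment of this length that covers the probe has `length`
        # possible start positions; `length - distance` of them also
        # cover the CpG (none if the CpG is further away than the length).
        total += max(0, length - distance) / length
    return total / len(fragment_lengths)


def total_cpg_influence(probe_pos: int, cpg_positions: list[int],
                        fragment_lengths: range) -> float:
    """Sum of the coupling factors of all CpGs for one probe."""
    return sum(
        coupling_factor(abs(cpg - probe_pos), fragment_lengths)
        for cpg in cpg_positions
    )


lengths = range(400, 701)               # assumed fragment sizes, 400-700 bp
c_at_probe = coupling_factor(0, lengths)     # CpG at the probe itself
c_nearby = coupling_factor(200, lengths)     # CpG 200 bp away
c_distant = coupling_factor(900, lengths)    # CpG beyond any fragment
c_tot = total_cpg_influence(1000, [1000, 1200, 1900], lengths)
print(c_at_probe, c_nearby, c_distant, c_tot)
```

Under these assumptions a CpG at the probe has a coupling factor of 1, the factor falls with distance, and it reaches 0 once the CpG is further away than the longest fragment.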

Based on these assumptions, the signal from the MeDIP channel of the MeDIP-chip or MeDIP-seq experiment depends on the degree of enrichment of DNA fragments overlapping that probe, which in turn depends on the amount of antibody binding, and thus on the number of methylated CpGs on those fragments. In the Batman model, the complete dataset from a MeDIP-chip experiment, A, can be represented by a statistical model in the form of the following probability distribution:
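A sketch of this distribution, in the form recalled from Down et al. [1], is given here. The symbols A_p (observed signal at probe p), A_base (baseline signal), r (signal response per unit of methylated CpG coupling) and ν (precision of the Gaussian noise) are not defined elsewhere in this text and should be read as assumptions of the sketch:

```latex
f(A \mid m) \;=\; \prod_{p} G\!\left(A_p \,\middle|\, A_{\mathrm{base}} + r \sum_{c} C_{cp}\, m_c,\; \nu^{-1}\right)
```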

where G(x|μ, σ²) is a Gaussian probability density function. Standard Bayesian techniques can be used to infer f(m|A), that is, the distribution of likely methylation states given one or more sets of MeDIP-chip/MeDIP-seq outputs. To solve this inference problem, Batman uses nested sampling (http://www.inference.phy.cam.ac.uk/bayesys/) to generate 100 independent samples from f(m|A) for each tiled region of the genome, then summarizes the most likely methylation state in 100-bp windows by fitting beta distributions to these samples. The modes of the most likely beta distributions are used as the final methylation calls.
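The summarization step can be sketched as follows. This is a minimal illustration, not Batman's own fitting routine: it swaps in a method-of-moments beta fit because the text does not specify how the fit is performed. It uses eight made-up sample values where Batman would use 100, and the function names are hypothetical.

```python
from statistics import mean, variance


def fit_beta_moments(samples: list[float]) -> tuple[float, float]:
    """Method-of-moments estimate of beta parameters (a, b) from samples in (0, 1)."""
    m = mean(samples)
    v = variance(samples)
    if not 0.0 < m < 1.0 or v <= 0.0 or v >= m * (1.0 - m):
        raise ValueError("samples are not compatible with a beta distribution")
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def beta_mode(a: float, b: float) -> float:
    """Mode of a beta(a, b) distribution, where a unique mode exists."""
    if a > 1.0 and b > 1.0:
        return (a - 1.0) / (a + b - 2.0)
    if a <= 1.0 < b:
        return 0.0  # density is highest at 0
    if b <= 1.0 < a:
        return 1.0  # density is highest at 1
    raise ValueError("beta distribution has no unique mode")


# Hypothetical methylation samples for one 100-bp window.
window_samples = [0.62, 0.70, 0.66, 0.74, 0.68, 0.71, 0.65, 0.69]
a, b = fit_beta_moments(window_samples)
methylation_call = beta_mode(a, b)
print(f"a={a:.1f}, b={b:.1f}, call={methylation_call:.3f}")
```

With tightly clustered samples such as these, the fitted mode lies close to the sample mean. The two can differ noticeably when the samples pile up near 0 or 1.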

Work flow of Batman

Batman prerequisites:

  1. Installation: install Batman (freely available from https://github.com/dasmoth/batman under the GNU Lesser General Public License), Apache ANT, a MySQL database server, and the MySQL database connector.
  2. Prepare the dataset: break your dataset into small blocks, namely regions of interest (ROIs), each represented by a small number of probes (typically about 100) on a microarray.
  3. Identify the database server: both the MySQL administration tool and many of the Batman programs need to connect to a MySQL database server.
  4. Initialize the Batman database: create a database on your database server.
  5. Register the experiments to be analysed.
  6. Register the array design: The array design (i.e. complete list of probes, with their genomic locations) should be provided as a GFF file.
  7. Load the array data.
  8. Load the genome sequence.

Run Batman:

  1. Calibrate the Batman model: Before any data can be analysed, it is necessary to calibrate each array by estimating how much extra array signal is produced by each methylated CpG. This step can give you a quick idea whether each of your arrays is giving sensible results.
  2. Sample methylation states from the Batman model: You’ll often have multiple arrays from the same experiment, and these should normally be analysed together to improve the confidence of the final calls. Each chromosome can take several days to process; therefore, if possible, run several in parallel.
  3. Summarize methylation states to generate the final calls: The “sample” files generated by Batman contain a large set of plausible methylation states for each region. For most purposes, you’ll actually want a single estimate of the likely methylation state at that position, and perhaps an estimate of how confident you can be that this is actually the correct value.

Visualization of Batman Data:

  1. The output is in GFF format. For each window, a score (range: 0–1) is given which represents a likely fraction of methylation and the interquartile range is given as an estimate of confidence.
  2. Several genome browsers are available, such as the Ensembl genome browser, which uses a colour gradient from 20 (bright yellow) to 80 (dark blue) to show the Batman methylation score for each probe in the ROI.
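Reading a score from the GFF output can be sketched as below. The column order follows the general GFF convention (sequence, source, feature, start, end, score, strand, frame, attributes). The example line, its source and feature names, and the way the interquartile range is stored in the attributes column are all hypothetical; the actual layout of Batman's output should be checked against the manual.

```python
from typing import NamedTuple


class MethylationCall(NamedTuple):
    seqid: str
    start: int
    end: int
    score: float
    attributes: str


def parse_gff_line(line: str) -> MethylationCall:
    """Parse one tab-separated GFF record into a MethylationCall."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a GFF record needs at least 8 tab-separated columns")
    # The attributes column is kept as raw text because its layout is assumed.
    attributes = fields[8] if len(fields) > 8 else ""
    return MethylationCall(
        seqid=fields[0],
        start=int(fields[3]),
        end=int(fields[4]),
        score=float(fields[5]),
        attributes=attributes,
    )


# Hypothetical record for one 100-bp window.
example = "chr6\tbatman\tmethylation\t1001\t1100\t0.82\t.\t.\tiqr=0.07"
call = parse_gff_line(example)
print(call.seqid, call.start, call.end, call.score)
```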

More details on the Batman procedure can be found in the Batman manual, freely available online at https://web.archive.org/web/20110304143135/http://td-blade.gurdon.cam.ac.uk/software/batman/batmanual-alpha-0.2.3.pdf

Limitations

It may be useful to take the following points into account when considering using Batman:

  1. Batman is not a piece of software; it is an algorithm performed using the command prompt. As such it is not especially user-friendly and is quite a computationally technical process.
  2. Because it is non-commercial, there is very little support when using Batman beyond what is in the manual.
  3. It is quite time-consuming (it can take several days to analyse one chromosome). (Note: in one government lab, running Batman on a set of 100 Agilent Human DNA Methylation Arrays (about 250,000 probes per array) took less than an hour to complete in Agilent's Genomic Workbench software. The computer had a 2 GHz processor, 24 GB of RAM, and 64-bit Windows 7.)
  4. Copy number variation (CNV) has to be accounted for. For example, the score for a region with a CNV value of 1.6 in a cancer (a loss of 0.4 compared to normal) would have to be multiplied by 1.25 (=2/1.6) to compensate for the loss.
  5. One of the basic assumptions of Batman is that all DNA methylation occurs at CpG dinucleotides. While this is generally the case for vertebrate somatic cells, there are situations where there is widespread non-CpG methylation, such as in plant cells and embryonic stem cells. [5] [6]
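The copy number correction in point 4 amounts to a simple rescaling, sketched below. The function name is hypothetical, and the sketch assumes a normal copy number of 2 as in the example above. A corrected score can exceed 1, and how to handle that case is not addressed here.

```python
def cnv_corrected_score(score: float, copy_number: float,
                        normal_copy_number: float = 2.0) -> float:
    """Rescale a methylation score to compensate for copy number change."""
    if copy_number <= 0:
        raise ValueError("copy number must be positive")
    return score * normal_copy_number / copy_number


# Correction factor for the example in the text: 2 / 1.6 = 1.25.
factor = cnv_corrected_score(1.0, 1.6)

# A hypothetical raw score of 0.4 in that region becomes about 0.5.
corrected = cnv_corrected_score(0.4, 1.6)
print(round(factor, 2), round(corrected, 2))  # 1.25 0.5
```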

References

  1. Down, T.A. et al. A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis. Nature Biotechnology 26, 779–85 (2008).
  2. Lister, R. et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462, 315–22 (2009).
  3. Bird, A. DNA methylation patterns and epigenetic memory. Genes & Development 16, 6–21 (2002).
  4. Eckhardt, F. et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nature Genetics 38, 1378–85 (2006).
  5. Dodge, J.E., Ramsahoye, B.H., Wo, Z.G., Okano, M. & Li, E. De novo methylation of MMLV provirus in embryonic stem cells: CpG versus non-CpG methylation. Gene 289, 41–8 (2002).
  6. Vanyushin, B.F. DNA methylation in plants. Current Topics in Microbiology and Immunology 301, 67–122 (2006).