Whole genome bisulfite sequencing is a next-generation sequencing technology used to determine the DNA methylation status of single cytosines by treating the DNA with sodium bisulfite before high-throughput DNA sequencing. The DNA methylation status at various genes can reveal information regarding gene regulation and transcriptional activities. [1] This technique was developed in 2009 along with reduced representation bisulfite sequencing after bisulfite sequencing became the gold standard for DNA methylation analysis. [2] [3]
Whole genome bisulfite sequencing measures single-cytosine methylation levels genome-wide and directly estimates the fraction of methylated molecules at each site rather than enrichment levels. Currently, this technique has recognized and tested approximately 95% of all cytosines in known genomes. [4] With the improvement of library preparation methods and next-generation sequencing technology over the past decade, whole genome bisulfite sequencing has become an increasingly widespread and informative method for analyzing DNA methylation in epigenome-wide studies. [5]
Prior to the development of whole genome bisulfite sequencing, genome methylation analysis relied heavily on early non-specific and differential methods such as paper chromatography, high-performance liquid chromatography, and thin-layer chromatography to analyze methylation profiles. [6] These methods were limited by the inability to amplify methylated DNA via polymerase chain reaction in vitro, because methylation status is lost during amplification. [6] As a result, many of these early methods relied on detecting and analyzing naturally-manifested methylated cytosines in vivo rather than chemically methylated cytosines.
In 1970, a breakthrough occurred when it was discovered that treating DNA with sodium bisulfite deaminated cytosine residues into uracil. [6] In the following decade, this discovery led to the revelation that unmethylated cytosine reacted much faster to sodium bisulfite treatment than did 5-methylcytosine. This difference in reaction rates created the possibility of identifying chemical changes in DNA as an easily detectable genetic marker. [6] Whole genome bisulfite sequencing was derived as a combination of this bisulfite treatment and next-generation sequencing technology, such as shotgun sequencing.
Whole genome bisulfite sequencing was first used to map DNA methylation at single-nucleotide resolution in Arabidopsis thaliana in 2008, and shortly after, in 2009, the first single-base-resolution DNA methylation map of the entire human genome was created with the same technique. [7] [5] Since then, various protocols have been developed to improve the efficiency and efficacy of its single-base mapping. As the costs of next-generation sequencing have decreased, whole genome bisulfite sequencing has become more widely used in clinical and experimental research. [3] Currently, multiple public datasets of genomic data have been established, and this technique has recognized and tested approximately 95% of all cytosines in known genomes. [4]
The following steps describe one potential workflow of conventional whole genome bisulfite sequencing: target DNA extraction, bisulfite conversion, library amplification, and bioinformatics analysis. [8] However, various sequencing systems and analysis tools often adapt the technical parameters and the order of these steps to optimize assay coverage and efficacy. [3]
Library preparation involves DNA fragmentation, end repair, dA-tailing, and adapter ligation prior to bisulfite treatment and library amplification. Standard fragmentation for high-throughput platforms such as the Illumina Genome Analyser and Solexa uses nebulization to generate fragments of up to 1200 base pairs. [9] After fragmentation, end repair enzymes and complementary adapters are applied to the DNA in an end-prep polymerase chain reaction and an adapter ligation reaction, respectively. Size selection occurs before the DNA is treated with sodium bisulfite.
Conventional methods of eukaryotic DNA preparation for sequencing use a wide range of DNA input amounts, from as little as 10 ng for novel NGS library alternatives, such as the tagmentation approach, to as much as 500-1000 ng of DNA as sample input. [10]
The adapter-ligated DNA sample is treated with sodium bisulfite, a chemical compound that converts unmethylated cytosines into uracil, at low pH and high temperatures. [11] [12] The chemical reaction is depicted in Figure 1, where sulfonation occurs at the carbon-6 position of cytosine to produce the intermediate cytosine sulfonate. [13] This intermediate then undergoes irreversible hydrolytic deamination to create uracil sulfonate. Under alkaline conditions, uracil sulfonate desulfonates to generate uracil. [13]
This enables methylation detection by distinguishing the methylated cytosines (5-methylcytosine), which resist bisulfite treatment, from uracil. During amplification by polymerase chain reaction, the uracils are copied as thymines, [3] whereas methylated cytosines are still read as cytosines. Their locations are then identified by comparing the bisulfite-treated sequence with the original DNA sequence.
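The read-out logic described above can be illustrated with a minimal Python sketch. It is an in-silico toy model of one strand only, not a laboratory or published protocol: the function names `bisulfite_convert` and `call_methylated`, the example sequence, and the chosen methylated positions are all hypothetical, and conversion is assumed to be complete and error-free.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment followed by PCR on a single strand.

    Assumption: conversion is 100% efficient, so every unmethylated C
    is read as T, while every methylated C is still read as C.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylated(original, converted):
    """Return positions where a C in the original sequence is still a C
    after conversion, i.e. positions inferred to be methylated."""
    return [
        i
        for i, (o, c) in enumerate(zip(original, converted))
        if o == "C" and c == "C"
    ]


# Hypothetical 11-base sequence with methylated cytosines at indices 1 and 9.
original = "ACGTCCGATCG"
converted = bisulfite_convert(original, methylated_positions=[1, 9])
print(converted)                             # ACGTTTGATCG
print(call_methylated(original, converted))  # [1, 9]
```

In real data the comparison is made against a reference genome after alignment rather than against the untreated molecule itself, and incomplete conversion or sequencing errors would have to be modelled as well.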
Following bisulfite treatment, purification of the sample is required to remove unwanted products including bisulfite salts. [13]
In order to amplify the epigenome library, bisulfite-treated DNA is primed to generate DNA with a specific tagging sequence. The 3' end of this sequence is then tagged again, creating DNA fragments with markers on either end. These fragments are amplified in a final polymerase chain reaction, after which the library is prepped for sequencing-by-synthesis. [8] This is demonstrated in Figure 2, in which a high-throughput sequencing system developed by the biotechnology company Illumina performs comprehensive assays based on sequencing-by-synthesis of base pairs. [8]
Following library amplification, a series of analyses can be performed on the expanded library to determine various methylation characteristics or map a genome-wide methylation profile. [8]
One such analysis aligns the new reads against the reference genome in order to directly compare locations of methylated cytosines and C-T mismatches. This requires alignment software such as SOAP for side-by-side comparison of the genomes. [8] Another potential analysis is methylated cytosine calling, which computes methylated cytosine ratios by mapping probabilities based on read quality. This helps determine methylated cytosine locations across the genome. [8] Finally, global trends of the methylome can be analyzed by calculating the distribution ratios of the CG, CHG, and CHH contexts (where H is A, C, or T) among methylated cytosines across the genome. [8] These ratios can reflect features of the whole genome methylation maps of certain species.
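As a rough illustration of the last two analyses, the Python sketch below classifies forward-strand cytosines by sequence context and computes a per-site methylation level from read counts. It is a simplified sketch rather than the method of any particular pipeline: the helper names, the toy reference sequence, and the read counts are hypothetical; the reference is assumed to contain only A, C, G, and T; and reverse-strand cytosines and read-quality weighting are ignored.

```python
from collections import Counter


def cytosine_context(ref, i):
    """Classify the forward-strand cytosine at ref[i] as 'CG', 'CHG', or 'CHH'.

    H stands for A, C, or T. Returns None if ref[i] is not a C or if the
    sequence ends before the context can be determined.
    """
    if ref[i] != "C":
        return None
    downstream = ref[i + 1:i + 3]
    if downstream[:1] == "G":
        return "CG"
    if len(downstream) < 2:
        return None  # too close to the end of the sequence
    return "CHG" if downstream[1] == "G" else "CHH"


def methylation_level(c_reads, t_reads):
    """Fraction of reads supporting methylation at one cytosine:
    reads showing C divided by reads showing C or T."""
    total = c_reads + t_reads
    return c_reads / total if total else None


def context_distribution(ref, methylated_positions):
    """Share of each context among the cytosines called as methylated."""
    counts = Counter(cytosine_context(ref, i) for i in methylated_positions)
    counts.pop(None, None)
    total = sum(counts.values())
    return {ctx: n / total for ctx, n in counts.items()} if total else {}


ref = "ACGTCAGCTTC"  # hypothetical reference fragment
for pos in (1, 4, 7):
    print(pos, cytosine_context(ref, pos))
print(methylation_level(c_reads=8, t_reads=2))
print(context_distribution(ref, [1, 4, 7]))
```

Here positions 1, 4, and 7 fall in the CG, CHG, and CHH contexts respectively, and a site covered by 8 C-reads and 2 T-reads has a methylation level of 0.8.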
Due to its ability to screen methylation status at single-nucleotide resolution across a given genome, whole genome bisulfite sequencing has become increasingly promising in aiding fundamental epigenomics research, novel hypotheses on DNA methylation, and investigations of future large-scale epidemiological studies. [3] [5] This whole genome approach is also capable of sensitive cytosine-methylation detection under specific sequences across an entire genome, which increases its potential to identify specific DNA methylation sites and their relation to certain gene expressions. [6]
The use of whole genome bisulfite sequencing to create the first human DNA methylome in 2009 also helped identify a significant ratio of non-CG methylation. [6] As a result, multiple single-base resolution methylomes of the human genome continue to be produced in order to identify the role of intragenic DNA methylation in gene expression and regulation. Future studies aim to use whole genome bisulfite sequencing to investigate the role DNA methylation has in cellular processes such as cellular differentiation, embryogenesis, X-inactivation, genomic imprinting, and tumorigenesis. [4] Single-nucleotide maps have already been sequenced for two human cell lines, H1 human embryonic stem cells and IMR90 fetal lung fibroblasts, in order to study patterns of non-CG methylation in human cells. [4]
Whole genome bisulfite sequencing has also been applied to developmental biology studies in which non-CG methylation was found to be prevalent in pluripotent stem cells and oocytes. This technique helped researchers discover that non-CG methylation accumulated during oocyte growth and accounted for over half of all methylation in mouse germinal vesicle oocytes. [14] Similarly, in plants, whole genome bisulfite sequencing was used to examine CG, CHH, and CHG methylation. It was then discovered that the plant germline conserved CG and CHG methylation but lost CHH methylation in microspores and sperm cells. [14]
The breadth of data provided by a whole-genome approach has spurred many novel hypotheses on how whole genome bisulfite sequencing could be used in other fields, including disease diagnosis and forensic science. Studies have shown that whole genome bisulfite sequencing could detect abnormal methylation, or more specifically hypermethylated tumor suppressor genes, which are often seen in cancers including leukemia. [14] Additionally, whole genome bisulfite sequencing has been applied to blood spot samples in forensic investigations to generate high-quality DNA methylation analyses on dried stains. [14]
The widespread use of whole genome bisulfite sequencing has been primarily limited by its high cost, complex data output, and minimum coverage requirements. Due to the high amount and subsequent cost of DNA input, many studies using whole genome bisulfite sequencing assays are conducted with few or no biological replicates. [15] For human samples, the US National Institutes of Health (NIH) Roadmap Epigenomics Project recommends a minimum of 30x coverage sequencing to achieve accurate results and approximately 80 million aligned, high quality reads. [16] Consequently, large-scale studies for genome-wide methylation profiling remain less cost-effective, often requiring the entire genome to be re-sequenced multiple times for every experiment. [17] Current studies are being conducted to reduce the conventional minimum coverage requirements while maintaining mapping accuracy.
The technique is also limited by the complexity of its data and the lack of sufficiently advanced analytical tools for downstream computational requirements. [2] The current bioinformatics requirements for accurate data interpretation are ahead of existing technology, which limits the accessibility of sequencing results to the general public.
Additionally, there are biological limitations concerning various steps in the standard protocol, particularly in the library preparation method. One of the biggest concerns is the potential for bias in the base composition of sequences and over-representation of methylated DNA data following bioinformatics analyses. [9] Bias can arise from multiple unintended effects of bisulfite conversion, including DNA degradation. This degradation can cause uneven sequence coverage by misrepresenting genomic sequences and overestimating 5-methylcytosine values. [3] Additionally, the bisulfite conversion process only distinguishes unmethylated cytosine from 5-methylcytosine. As a result, specificity between 5-methylcytosine and 5-hydroxymethylcytosine is limited. [3] Another potential source of bias arises from polymerase chain reaction amplification of the library, which affects sequences with highly skewed base compositions due to high rates of polymerase sequence errors in high AT-content, bisulfite-converted DNA. [3]
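One way to see why coverage requirements matter is to treat the reads covering a single cytosine as independent draws, so that the estimated methylation level follows a binomial distribution. The Python sketch below rests on that simplifying assumption, which is not taken from the cited recommendations; the function names are hypothetical, and real data also carry mapping bias, incomplete conversion, and strand-specific coverage that this model ignores.

```python
import math


def methylation_standard_error(level, depth):
    """Binomial standard error of a per-site methylation estimate.

    Assumption: each of the `depth` reads is an independent observation
    of a site whose true methylation level is `level` (between 0 and 1).
    """
    return math.sqrt(level * (1 - level) / depth)


def depth_for_standard_error(level, target_se):
    """Smallest read depth whose binomial standard error does not exceed
    `target_se` under the same independence assumption."""
    return math.ceil(level * (1 - level) / target_se ** 2)


# Precision of the estimate at a half-methylated site for several depths.
for depth in (5, 10, 30, 100):
    print(depth, round(methylation_standard_error(0.5, depth), 3))

# Depth needed for a standard error of 0.05 at a half-methylated site.
print(depth_for_standard_error(0.5, 0.05))
```

Under this model the uncertainty shrinks only with the square root of the depth, which is one reason lowering coverage without losing accuracy is difficult.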
Cytosine is one of the four nucleotide bases found in DNA and RNA, along with adenine, guanine, and thymine (uracil in RNA). It is a pyrimidine derivative, with a heterocyclic aromatic ring and two substituents attached. The nucleoside of cytosine is cytidine. In Watson–Crick base pairing, it forms three hydrogen bonds with guanine.
Methylation, in the chemical sciences, is the addition of a methyl group on a substrate, or the substitution of an atom by a methyl group. Methylation is a form of alkylation, with a methyl group replacing a hydrogen atom. These terms are commonly used in chemistry, biochemistry, soil science, and biology.
Deamination is the removal of an amino group from a molecule. Enzymes that catalyse this reaction are called deaminases.
5-Methylcytosine is a methylated form of the DNA base cytosine (C) that regulates gene transcription and takes several other biological roles. When cytosine is methylated, the DNA maintains the same sequence, but the expression of methylated genes can be altered. 5-Methylcytosine is incorporated in the nucleoside 5-methylcytidine.
The CpG sites or CG sites are regions of DNA where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along its 5' → 3' direction. CpG sites occur with high frequency in genomic regions called CpG islands.
DNA methylation is a biological process by which methyl groups are added to the DNA molecule. Methylation can change the activity of a DNA segment without changing the sequence. When located in a gene promoter, DNA methylation typically acts to repress gene transcription. In mammals, DNA methylation is essential for normal development and is associated with a number of key processes including genomic imprinting, X-chromosome inactivation, repression of transposable elements, aging, and carcinogenesis.
In biology, the epigenome of an organism is the collection of chemical changes to its DNA and histone proteins that affects when, where, and how the DNA is expressed; these changes can be passed down to an organism's offspring via transgenerational epigenetic inheritance. Changes to the epigenome can result in changes to the structure of chromatin and changes to the function of the genome. The human epigenome, including DNA methylation and histone modification, is maintained through cell division. The epigenome is essential for normal development and cellular differentiation, enabling cells with the same genetic code to perform different functions. The human epigenome is dynamic and can be influenced by environmental factors such as diet, stress, and toxins.
The bisulfite ion (IUPAC-recommended nomenclature: hydrogensulfite) is the ion HSO3−. Salts containing the HSO3− ion are also known as "sulfite lyes". Sodium bisulfite is used interchangeably with sodium metabisulfite (Na2S2O5). Sodium metabisulfite dissolves in water to give a solution of Na+HSO3−.
Bisulfite sequencing (also known as bisulphite sequencing) is the use of bisulfite treatment of DNA before routine sequencing to determine the pattern of methylation. DNA methylation was the first discovered epigenetic mark, and remains the most studied. In animals it predominantly involves the addition of a methyl group to the carbon-5 position of cytosine residues of the dinucleotide CpG, and is implicated in repression of transcriptional activity.
The versatility of polymerase chain reaction (PCR) has led to modifications of the basic protocol being used in a large number of variant techniques designed for various purposes. This article summarizes many of the most common variations currently or formerly used in molecular biology laboratories; familiarity with the fundamental premise by which PCR works and corresponding terms and concepts is necessary for understanding these variant techniques.
Epigenomics is the study of the complete set of epigenetic modifications on the genetic material of a cell, known as the epigenome. The field is analogous to genomics and proteomics, which are the study of the genome and proteome of a cell. Epigenetic modifications are reversible modifications on a cell's DNA or histones that affect gene expression without altering the DNA sequence. Epigenomic maintenance is a continuous process and plays an important role in stability of eukaryotic genomes by taking part in crucial biological mechanisms like DNA repair. Plant flavones are said to be inhibiting epigenomic marks that cause cancers. Two of the most characterized epigenetic modifications are DNA methylation and histone modification. Epigenetic modifications play an important role in gene expression and regulation, and are involved in numerous cellular processes such as in differentiation/development and tumorigenesis. The study of epigenetics on a global level has been made possible only recently through the adaptation of genomic high-throughput assays.
For molecular biology in mammals, DNA demethylation causes replacement of 5-methylcytosine (5mC) in a DNA sequence by cytosine (C). DNA demethylation can occur by an active process at the site of a 5mC in a DNA sequence or, in replicating cells, by preventing addition of methyl groups to DNA so that the replicated DNA will largely have cytosine in the DNA sequence.
The Illumina Methylation Assay using the Infinium I platform uses 'BeadChip' technology to generate a comprehensive genome-wide profiling of human DNA methylation. Similar to bisulfite sequencing and pyrosequencing, this method quantifies methylation levels at various loci within the genome. This assay is used for methylation probes on the Illumina Infinium HumanMethylation27 BeadChip. Probes on the 27k array target regions of the human genome to measure methylation levels at 27,578 CpG dinucleotides in 14,495 genes. In 2008, Illumina released the Infinium HumanMethylation450 BeadChip array, which targets over 450,000 methylation sites. In 2016, the Infinium MethylationEPIC BeadChip ("EPIC") was released, which interrogates over 850,000 methylation sites across the human genome.
Methylated DNA immunoprecipitation is a large-scale purification technique in molecular biology that is used to enrich for methylated DNA sequences. It consists of isolating methylated DNA fragments via an antibody raised against 5-methylcytosine (5mC). This technique was first described by Weber M. et al. in 2005 and has helped pave the way for viable methylome-level assessment efforts, as the purified fraction of methylated DNA can be input to high-throughput DNA detection methods such as high-resolution DNA microarrays (MeDIP-chip) or next-generation sequencing (MeDIP-seq). Nonetheless, understanding of the methylome remains rudimentary; its study is complicated by the fact that, like other epigenetic properties, patterns vary from cell-type to cell-type.
Combined Bisulfite Restriction Analysis is a molecular biology technique that allows for the sensitive quantification of DNA methylation levels at a specific genomic locus on a DNA sequence in a small sample of genomic DNA. The technique is a variation of bisulfite sequencing, and combines bisulfite conversion based polymerase chain reaction with restriction digestion. Originally developed to reliably handle minute amounts of genomic DNA from microdissected paraffin-embedded tissue samples, the technique has since seen widespread usage in cancer research and epigenetics studies.
Reduced representation bisulfite sequencing (RRBS) is an efficient and high-throughput technique for analyzing genome-wide methylation profiles at the single-nucleotide level. It combines restriction enzymes and bisulfite sequencing to enrich for areas of the genome with a high CpG content. Due to the high cost and depth of sequencing needed to analyze methylation status in the entire genome, Meissner et al. developed this technique in 2005 to reduce the number of nucleotides that need to be sequenced to 1% of the genome. The fragments that comprise the reduced genome still include the majority of promoters, as well as regions such as repeated sequences that are difficult to profile using conventional bisulfite sequencing approaches.
Single-cell sequencing examines the nucleic acid sequence information from individual cells with optimized next-generation sequencing technologies, providing a higher resolution of cellular differences and a better understanding of the function of an individual cell in the context of its microenvironment. For example, in cancer, sequencing the DNA of individual cells can give information about mutations carried by small populations of cells. In development, sequencing the RNAs expressed by individual cells can give insight into the existence and behavior of different cell types. In microbial systems, a population of the same species can appear genetically clonal. Still, single-cell sequencing of RNA or epigenetic modifications can reveal cell-to-cell variability that may help populations rapidly adapt to survive in changing environments.
In epitranscriptomic sequencing, most methods focus on either (1) enrichment and purification of the modified RNA molecules before running on the RNA sequencer, or (2) improving or modifying bioinformatics analysis pipelines to call the modification peaks. Most methods have been adapted and optimized for mRNA molecules, except for modified bisulfite sequencing for profiling 5-methylcytidine which was optimized for tRNAs and rRNAs.
Single cell epigenomics is the study of epigenomics in individual cells by single cell sequencing. Since 2013, methods have been created including whole-genome single-cell bisulfite sequencing to measure DNA methylation, whole-genome ChIP-sequencing to measure histone modifications, whole-genome ATAC-seq to measure chromatin accessibility and chromosome conformation capture.
Nucleosome Occupancy and Methylome Sequencing (NOMe-seq) is a genomics technique used to simultaneously detect nucleosome positioning and DNA methylation... This method is an extension of bisulfite sequencing, which is the gold standard for determining DNA methylation. NOMe-seq relies on the methyltransferase M.CviPI, which methylates cytosines in GpC dinucleotides unbound by nucleosomes or other proteins, creating a nucleosome footprint. The mammalian genome naturally contains DNA methylation, but only at CpG sites, so GpC methylation can be differentiated from genomic methylation after bisulfite sequencing. This allows simultaneous analysis of the nucleosome footprint and endogenous methylation on the same DNA molecules. In addition to nucleosome footprinting, NOMe-seq can determine locations bound by transcription factors. Nucleosomes are wrapped by approximately 147 base pairs of DNA, whereas transcription factors or other proteins bind a region of only approximately 10-80 base pairs. Following treatment with M.CviPI, nucleosome and transcription factor sites can be differentiated based on the size of the unmethylated GpC region.