An epigenome-wide association study (EWAS) is an examination of a genome-wide set of quantifiable epigenetic marks, such as DNA methylation, in different individuals to derive associations between epigenetic variation and a particular identifiable phenotype/trait. When patterns of an epigenetic mark, such as DNA methylation at specific loci, differ in a way that discriminates phenotypically affected cases from control individuals, this is considered an indication that an epigenetic perturbation has taken place that is associated, causally or consequentially, with the phenotype. [1]
The epigenome is governed by both genetic and environmental factors, causing it to be highly dynamic and complex. Epigenetic information exists in the cell as DNA and histone marks, as well as non-coding RNAs. DNA methylation (DNAm) patterns change over time, and vary between developmental stage and tissue type. The main type of DNAm is at cytosines within CpG dinucleotides which is known to be involved in gene expression regulation. DNAm pattern changes have been extensively studied in complex diseases such as cancer and diabetes. [1] In a normal cell, the bulk genome is highly methylated at CpGs, whereas CpG islands (CPI) at gene promoter regions remain highly unmethylated. Aberrant DNAm is the most common type of molecular abnormality in cancer cells, where the bulk genome becomes globally ‘hypomethylated’ and CPIs in promoter regions become ‘hypermethylated’, usually leading to silencing of tumour suppressor genes. [2] More recently, studies on diabetes have uncovered further evidence to support an epigenetic component of diseases, including differences in disease-associated epigenetic marks between monozygotic twins, the rising incidence of type 1 diabetes in the general population, and developmental reprogramming events in which in utero or childhood environments can influence disease outcome in adulthood. [1]
Post-translational histone modifications include, but are not limited to, methylation, acetylation and phosphorylation on the core histone tails. These post-translational modifications are read by proteins that can then modify the chromatin state at that locus. [1] Epigenetic variation arises in three distinct ways: it can be inherited and therefore be present in all cells of the adult, including the germline (a process known as transgenerational epigenetic inheritance, a controversial phenomenon that has not yet been observed in humans); it can occur randomly and be present in a subset of cells in the adult, the size of which depends on how early in development the variation occurs; or it can be induced as a result of behavioural or environmental factors. [1] EWAS have previously associated changes in methylation with several diseases and complex conditions that do not have a known epidemiology, and such studies are therefore crucial for identifying epigenetic factors that contribute to, or are a consequence of, the pathogenesis of these diseases. [3]
Retrospective studies compare unrelated individuals who fall into two categories, controls without the disease or phenotype of interest, and cases who have the phenotype of interest. An advantage of such studies is that many cohorts of case-control samples already exist with available genotype and expression data that can be integrated with epigenome data. A downside, however, is that they cannot determine whether epigenetic differences are a result of disease-associated genetic differences, post-disease processes or disease-associated drug interventions. [1]
Studies of parents and their offspring are useful for studying transgenerational inheritance patterns of epigenetic marks. A main limitation of EWAS is deciphering whether a phenotype is associated with epigenetic changes as a result of the variable in question or as a result of pre-existing genomic variants leading to epigenetic alterations. Comparisons between parent and offspring genomic and epigenomic data allow one to rule out the possibility that a disease or phenotype is due to genomic variation. A limitation of this study design is that very few sufficiently large cohorts exist. [1]
Monozygotic twins carry identical genomic information. Therefore, if they are discordant for a particular disease or phenotype it is likely a result of epigenetic differences. However, unless the twins are studied longitudinally it is impossible to determine if epigenetic variation is the cause of or consequence of disease. Another limitation is recruiting a large enough cohort of discordant monozygotic twins with the disease of interest. [1]
Longitudinal studies follow a cohort of individuals over an extended period of time, usually from birth or before disease onset. Samples are taken and records are kept over many years, making these studies extremely useful for determining the causality of particular phenotypes. Since the same individuals are followed at time points before and after disease onset, this design removes the confounding effects of differences between cases and controls. Longitudinal studies are useful not only for risk studies (using DNA samples taken prior to disease onset), but also for intervention studies that compare samples taken before and after treatment with specific exposures to investigate environmental impacts on the epigenome. [4] Major disadvantages are the long timeline of the studies and their expense. Longitudinal studies using disease-discordant monozygotic twins give the added benefit of ruling out genetic influences on epigenetic variation. [1]
The tissue specificity of epigenomic marks creates another challenge when designing an EWAS. Tissue choice is limited by both accessibility and stability of epigenetic patterning. It is crucial to choose a tissue in which epigenetic marks are variable in the population yet stable over time. If this is not possible, multiple serially collected samples from the same individuals are required to report robust associations with a particular phenotype. EWAS for diseases often measure DNA methylation in blood samples because disease-relevant tissues are difficult to obtain. In some cases, the pattern of methylation is not necessarily biologically relevant to the proposed phenotype. The choice of blood also requires stringent analysis and careful interpretation due to variable cell type composition. Choosing a surrogate tissue therefore requires not only that the interindividual differences correlate between the tissue of interest and the surrogate, but also that the exposure induces similar changes in both tissues. To date, an underlying issue is that there is no clear evidence that, in general, epigenetic marks respond to environmental exposures in a similar way across tissues. [5]
The platform used for epigenome-wide DNAm quantification is the high-throughput Illumina Methylation Assay. The earlier 27k Illumina array covered on average two CpG sites in the promoter regions of approximately 14,000 genes and represented less than 0.1% of the 28 million CpG sites in the human genome. This falls short of being representative of the entire human epigenome. None of the early EWAS using this array [6] [7] used independent validation to verify the associated probes. An interesting observation was a bias in the differences between cases and controls towards non-CpG island probes (which were significantly underrepresented in this array design), arguing strongly for the use of the later 450k array, which covers non-CpG islands with a higher density of probes. The Illumina 450k array has since become the most widely used platform for studies reporting EWAS. The array still covers less than 2% of the CpG sites in the genome, but it attempts to cover all known genes with a high density of probes in the promoters (including CpG islands and surrounding sequences) and with a lower density across the gene bodies, 3′ untranslated regions, and other intergenic sequences. [8]
DNA methylation is typically quantified on a scale of 0–1, as the methylation array measures the proportion of DNA molecules that are methylated at a particular CpG site. The initial analyses are univariate tests of association to identify sites where DNA methylation varies with exposure and/or phenotype. These are followed by multiple-testing correction and by an analytical strategy to reduce batch effects and other technical confounding in the quantification of DNA methylation. Potential confounding arising from differences in tissue composition is also taken into account. Additionally, confounding factors that may influence methylation status, such as age, gender and behaviours, are adjusted for as covariates. The association results are also corrected for the genomic control inflation factor in order to account for population stratification.
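The genomic control correction mentioned above can be sketched as follows. This is a minimal illustration rather than the procedure of any particular EWAS pipeline: it assumes each CpG has a two-sided p-value from a 1-degree-of-freedom test, and the function names `inflation_factor` and `genomic_control` are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist, median

_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = _NORMAL.inv_cdf(0.75) ** 2


def _chi_square_stats(p_values):
    # Convert two-sided p-values (each in the open interval 0 < p <= 1)
    # to 1-df chi-square statistics via the squared normal quantile.
    return [_NORMAL.inv_cdf(p / 2) ** 2 for p in p_values]


def inflation_factor(p_values):
    """Genomic control inflation factor (lambda): observed median
    chi-square statistic divided by the median expected under the null."""
    return median(_chi_square_stats(p_values)) / CHI2_1DF_MEDIAN


def genomic_control(p_values):
    """Return p-values after dividing each chi-square statistic by lambda.
    Assumption: statistics are only deflated, never inflated (lambda < 1
    is treated as 1)."""
    scale = max(inflation_factor(p_values), 1.0)
    return [erfc(sqrt(c / scale / 2)) for c in _chi_square_stats(p_values)]
```

A value of lambda close to 1 suggests little systematic inflation of the test statistics, while values well above 1 point to unaccounted-for structure such as population stratification.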
Generally, mean levels of CpG methylation are compared across categories using linear regression [9], which allows for adjustment for confounders and batch effects. [10] A threshold of P < 1e-7 [11] is generally used to identify CpGs associated with the tested phenotype/stimulus. These CpGs are considered to reach epigenome-wide significance. An effect size is also calculated for CpGs reaching this significance level, indicating the difference in methylation between two qualitative groups, or across values of a quantitative phenotype. CpG sites significantly associated with the phenotype and/or treatment/environmental stimulus are typically represented in a Manhattan plot. [12]
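As a rough sketch of such a scan, the example below regresses methylation on a single phenotype variable at each CpG and keeps the sites passing P < 1e-7. It is deliberately simplified: it omits the covariates and batch terms a real analysis would adjust for, and it uses a large-sample normal approximation for the p-value instead of the exact t distribution. The function names and the input layout are assumptions made for illustration.

```python
from math import erfc, sqrt
from statistics import mean

EPIGENOME_WIDE_P = 1e-7  # threshold quoted in the text
# For orientation only: 0.05 / 450_000 is about 1.1e-7, so a threshold of
# this order is roughly a Bonferroni correction over a 450k-probe array.


def simple_regression(x, y):
    """Ordinary least squares of y on x with an intercept.
    Returns (slope, approximate two-sided p-value)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = sqrt(rss / (n - 2) / sxx)
    if se == 0:
        # Perfect fit: no residual variation to test against.
        return slope, (1.0 if slope == 0 else 0.0)
    z = abs(slope / se)
    return slope, erfc(z / sqrt(2))  # normal approximation to the t-test


def scan(methylation, phenotype, threshold=EPIGENOME_WIDE_P):
    """methylation: dict of CpG id -> list of values (one per sample).
    phenotype: list with one value per sample (0/1 for case-control).
    Returns (cpg, effect_size, p) for CpGs passing the threshold."""
    hits = []
    for cpg, values in methylation.items():
        slope, p = simple_regression(phenotype, values)
        if p < threshold:
            hits.append((cpg, slope, p))
    return hits
```

With a 0/1 phenotype the slope equals the difference in mean methylation between the two groups, which is the effect size described above.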
Measurements at single CpG sites are prone to site-specific natural variation and to technical variation, such as faulty microarray probes and outliers. To make associations more robust to such variation, measurements from adjacent sites can be combined, which can help increase power. [13] [14] In previous studies, functionally relevant findings have been associated with genomic regions as opposed to single CpGs. Therefore, looking at the regional level can help identify associated regions with more confidence, guiding downstream functional studies.
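One simple way to move from single sites to regions is to merge neighbouring CpGs that lie close together on the same chromosome. The sketch below is a generic illustration of that idea, not the method of the cited studies; the 500 bp maximum gap, the minimum of two sites per region, and the function name are all assumptions.

```python
from statistics import mean


def group_into_regions(sites, max_gap=500, min_sites=2):
    """sites: iterable of (chromosome, position, effect_size) tuples,
    for example the CpGs passing a nominal significance filter.
    Returns a list of (chromosome, start, end, n_sites, mean_effect)."""
    regions = []
    current = []
    for chrom, pos, effect in sorted(sites):
        # Start a new region on a chromosome change or a large gap.
        if current and (chrom != current[-1][0] or pos - current[-1][1] > max_gap):
            regions.append(current)
            current = []
        current.append((chrom, pos, effect))
    if current:
        regions.append(current)
    return [
        (r[0][0], r[0][1], r[-1][1], len(r), mean(e for _, _, e in r))
        for r in regions
        if len(r) >= min_sites
    ]
```

Isolated sites are dropped here because a lone probe offers no neighbouring measurement to corroborate it.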
Another method of analysis uses unsupervised clustering to create classes of CpG sites based on the similarity of their methylation variation across samples. The average methylation values within each class are used to construct data sets of reduced dimensionality, facilitating efficient tests of association between DNA methylation and phenotypes of interest. [15] This approach takes advantage of the substantial biologically induced correlation between sites in large data sets. It is useful for identifying gross patterns of methylation associated with the tested variable, but may miss specific CpG sites of interest. Besides differences in mean methylation levels, differences in the variation of DNA methylation across samples may also be biologically meaningful, motivating scans for differential variability between groups. [12]
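The dimensionality-reduction idea can be illustrated with a toy example. The clustering rule used here (greedy grouping of CpGs whose profiles correlate with a class seed above a fixed cutoff) is a simple stand-in chosen for brevity, not the clustering method of the cited work; the cutoff of 0.8 and all names are assumptions.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt_product(va, vb)


def sqrt_product(va, vb):
    return (va * vb) ** 0.5


def cluster_cpgs(profiles, min_r=0.8):
    """profiles: dict of CpG id -> list of methylation values per sample.
    Each CpG joins the first class whose seed (first member) it correlates
    with at or above min_r; otherwise it seeds a new class."""
    classes = []
    for cpg, values in profiles.items():
        for members in classes:
            if pearson(values, profiles[members[0]]) >= min_r:
                members.append(cpg)
                break
        else:
            classes.append([cpg])
    return classes


def class_averages(profiles, classes):
    """Per-sample mean methylation within each class: one row per class."""
    n_samples = len(next(iter(profiles.values())))
    return [
        [mean(profiles[cpg][s] for cpg in members) for s in range(n_samples)]
        for members in classes
    ]
```

The rows returned by `class_averages` can then be tested for association with the phenotype in place of the individual CpGs, at the cost of losing site-level resolution.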
The location of the associated CpG sites or islands/regions can then be analyzed in silico to infer possible functional relevance. For example, one can consider whether the associated CpGs lie within a promoter region, or determine their distance from the transcription start site, which may be relevant especially if DNA methylation associated with a phenotype is assumed to act by regulating gene transcription. Further inferences can be drawn from existing biological knowledge if the CpGs in that particular region have been studied and associated with changes in transcription. This can be used as an additional filter for identifying regions to pursue for functional validation. Several bioinformatic tools developed for functional enrichment analysis can be applied to differentially methylated regions by first mapping these regions to genes. This is done using the distance between the CpGs and the promoter of a gene that is potentially regulated by the region. Enrichment analysis based on the genomic region has thus been suggested as a complementary approach and confers substantial interpretive potential. [12] Differentially methylated regions can then be compared to a catalog of genomic regions including, for example, sites enriched for specific chromatin modifications or transcription factor binding sites.
A methylation odds ratio can be calculated if we consider the mean methylation rate at a site in cases (or controls) to represent the methylation probability for a randomly chosen DNA strand in the case (or control) tissue samples. The methylation odds ratio is the odds for a random DNA strand in the tissue sample from a random case to be methylated, divided by the same odds for controls. This provides a measure of effect size that incorporates relative magnitudes, but it does not capture differences between cases and controls in other features of the methylation distribution, such as variance. The methylation odds ratio is also comparable across prospective and retrospective studies; its value only measures association and does not imply causation. Methylation risk scores have also been calculated, which integrate information across CpG sites as the sum of methylation values at each of the markers associated with the phenotype, weighted by marker-specific effect size. [16]
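Both quantities follow directly from the definitions above and can be sketched in a few lines. The function names, argument layout and CpG identifiers are hypothetical; mean methylation values are assumed to lie strictly between 0 and 1 so that the odds are defined.

```python
def methylation_odds_ratio(mean_cases, mean_controls):
    """Odds that a random strand is methylated in cases, divided by the
    same odds in controls. Inputs are mean methylation proportions at
    one CpG site, each strictly between 0 and 1."""
    odds_cases = mean_cases / (1 - mean_cases)
    odds_controls = mean_controls / (1 - mean_controls)
    return odds_cases / odds_controls


def methylation_risk_score(methylation, weights):
    """Weighted sum of one individual's methylation values.
    methylation: dict of CpG id -> methylation value for the individual.
    weights: dict of CpG id -> marker-specific effect size; only CpGs
    listed in weights contribute to the score."""
    return sum(weight * methylation[cpg] for cpg, weight in weights.items())
```

For instance, a site with mean methylation 0.6 in cases and 0.4 in controls has odds of 1.5 and about 0.67 respectively, giving a methylation odds ratio of 2.25.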
Replication using an independent cohort is required to rule out false positives identified in the initial study. This can be done in a human cohort or in a more focused manner in animal models. It is important that, when selecting the replication cohort, the individuals are reflective of the initial cohort and that the same confounding variables are taken into account. Replication, however, can be limited due to the availability of individuals and samples.
Variations in the epigenome can cause disease but can also arise as a consequence of disease, and distinguishing between the two is a major limitation in EWAS. A way to circumvent this is to determine whether the epigenetic variation is present before any symptoms of disease, preferably via longitudinal studies following the same cohort of people over many years (an approach that has its own drawbacks of expense and study time frame). It must also be considered that epigenetic variation arising before disease onset does not necessarily cause the disease.
The most commonly used tissue in EWAS is blood. However, blood samples contain multiple different cell types, each of which has a unique epigenetic signature. It is therefore extremely difficult to determine whether a given sample is homogeneous, and consequently whether the variation in epigenetic marks is due to differences in phenotype/stimulus or to sample heterogeneity.
Currently many EWAS use blood as a surrogate tissue due to its availability and ease of collection. However, epigenetic changes in the blood may not be associated with the changes in the particular tissue associated with the disease. Many intriguing disorders that could have epigenetic causative factors affect tissues such as brain, lung, heart, etc. However, when studying human patients it is not an option to take these tissues for sampling, and they are therefore left unstudied.
EWASdb [17] (http://www.bioapp.org/ewasdb/) is the first epigenome-wide association database (first online in 2015, and first published in Nucleic Acids Res. on 13 October 2018). It stores the results of 1319 EWAS studies associated with 302 diseases/phenotypes (p<1e-7). Three types of EWAS results are stored in EWASdb: EWAS for single epi-markers, EWAS for KEGG pathways and EWAS for GO (Gene Ontology) categories.
EWAS Atlas [18] (http://bigd.big.ac.cn/ewas) is a curated knowledgebase of EWAS that provides a comprehensive collection of EWAS knowledge. Unlike extant data-oriented epigenetic resources, EWAS Atlas features manual curation of EWAS knowledge from extensive publications. In the current implementation, EWAS Atlas focuses on DNA methylation, one of the key epigenetic marks; it integrates 388,851 high-quality EWAS associations, involving 126 tissues/cell lines and covering 351 traits, 2,230 cohorts and 390 ontology entities, all of which are based on manual curation from 649 studies reported in 495 publications. In addition, it is equipped with a trait enrichment analysis tool, which is capable of profiling trait-trait and trait-epigenome relationships. Future developments include regular curation of recent EWAS publications, incorporation of more epigenetic marks and possible integration of EWAS with GWAS. Collectively, EWAS Atlas is dedicated to the curation, integration and standardization of EWAS knowledge and has the potential to help researchers dissect molecular mechanisms of epigenetic modifications associated with biological traits.
EWAS Data Hub [19] (https://bigd.big.ac.cn/ewas/datahub) is a resource for collecting and normalizing DNA methylation array data as well as archiving associated metadata. The current release of EWAS Data Hub integrates a comprehensive collection of DNA methylation array data from 75,344 samples and employs a normalization method to remove batch effects among different datasets. Accordingly, taking advantage of both massive high-quality DNA methylation data and standardized metadata, EWAS Data Hub provides reference DNA methylation profiles under different contexts, involving 81 tissues/cell types (including 25 brain parts and 25 blood cell types), six ancestry categories, and 67 diseases (including 39 cancers). In summary, EWAS Data Hub aims to aid the retrieval and discovery of methylation-based biomarkers for phenotype characterization, clinical treatment and health care.
In biology, epigenetics is the study of heritable traits, or a stable change of cell function, that happen without changes to the DNA sequence. The Greek prefix epi- in epigenetics implies features that are "on top of" or "in addition to" the traditional genetic mechanism of inheritance. Epigenetics usually involves a change that is not erased by cell division, and affects the regulation of gene expression. Such effects on cellular and physiological phenotypic traits may result from environmental factors, or be part of normal development. They can lead to cancer.
The CpG sites or CG sites are regions of DNA where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along its 5' → 3' direction. CpG sites occur with high frequency in genomic regions called CpG islands.
Regulation of gene expression, or gene regulation, includes a wide range of mechanisms that are used by cells to increase or decrease the production of specific gene products. Sophisticated programs of gene expression are widely observed in biology, for example to trigger developmental pathways, respond to environmental stimuli, or adapt to new food sources. Virtually any step of gene expression can be modulated, from transcriptional initiation, to RNA processing, and to the post-translational modification of a protein. Often, one gene regulator controls another, and so on, in a gene regulatory network.
DNA methylation is a biological process by which methyl groups are added to the DNA molecule. Methylation can change the activity of a DNA segment without changing the sequence. When located in a gene promoter, DNA methylation typically acts to repress gene transcription. In mammals, DNA methylation is essential for normal development and is associated with a number of key processes including genomic imprinting, X-chromosome inactivation, repression of transposable elements, aging, and carcinogenesis.
An epigenome consists of a record of the chemical changes to the DNA and histone proteins of an organism; these changes can be passed down to an organism's offspring via transgenerational stranded epigenetic inheritance. Changes to the epigenome can result in changes to the structure of chromatin and changes to the function of the genome.
Bisulfite sequencing (also known as bisulphite sequencing) is the use of bisulfite treatment of DNA before routine sequencing to determine the pattern of methylation. DNA methylation was the first discovered epigenetic mark, and remains the most studied. In animals it predominantly involves the addition of a methyl group to the carbon-5 position of cytosine residues of the dinucleotide CpG, and is implicated in repression of transcriptional activity.
Computational epigenetics uses statistical methods and mathematical modelling in epigenetic research. Due to the recent explosion of epigenome datasets, computational methods play an increasing role in all areas of epigenetic research.
Epigenomics is the study of the complete set of epigenetic modifications on the genetic material of a cell, known as the epigenome. The field is analogous to genomics and proteomics, which are the study of the genome and proteome of a cell. Epigenetic modifications are reversible modifications on a cell's DNA or histones that affect gene expression without altering the DNA sequence. Epigenomic maintenance is a continuous process and plays an important role in stability of eukaryotic genomes by taking part in crucial biological mechanisms like DNA repair. Plant flavones are said to be inhibiting epigenomic marks that cause cancers. Two of the most characterized epigenetic modifications are DNA methylation and histone modification. Epigenetic modifications play an important role in gene expression and regulation, and are involved in numerous cellular processes such as in differentiation/development and tumorigenesis. The study of epigenetics on a global level has been made possible only recently through the adaptation of genomic high-throughput assays.
The Illumina Methylation Assay using the Infinium I platform uses 'BeadChip' technology to generate a comprehensive genome-wide profiling of human DNA methylation. Similar to bisulfite sequencing and pyrosequencing, this method quantifies methylation levels at various loci within the genome. This assay is used for methylation probes on the Illumina Infinium HumanMethylation27 BeadChip. Probes on the 27k array target regions of the human genome to measure methylation levels at 27,578 CpG dinucleotides in 14,495 genes. In 2008, Illumina released the Infinium HumanMethylation450 BeadChip array, which targets over 450,000 methylation sites. In 2016, the Infinium MethylationEPIC BeadChip ("EPIC") was released, which interrogates over 850,000 methylation sites across the human genome.
Methylated DNA immunoprecipitation is a large-scale purification technique in molecular biology that is used to enrich for methylated DNA sequences. It consists of isolating methylated DNA fragments via an antibody raised against 5-methylcytosine (5mC). This technique was first described by Weber M. et al. in 2005 and has helped pave the way for viable methylome-level assessment efforts, as the purified fraction of methylated DNA can be input to high-throughput DNA detection methods such as high-resolution DNA microarrays (MeDIP-chip) or next-generation sequencing (MeDIP-seq). Nonetheless, understanding of the methylome remains rudimentary; its study is complicated by the fact that, like other epigenetic properties, patterns vary from cell-type to cell-type.
In recent years it has become apparent that the environment and underlying mechanisms affect gene expression and the genome outside of the central dogma of biology. It has been found that many epigenetic mechanisms are involved in the regulation and expression of genes such as DNA methylation and chromatin remodeling. These epigenetic mechanisms are believed to be a contributing factor to pathological diseases such as type 2 diabetes. An understanding of the epigenome of diabetes patients may help to elucidate otherwise hidden causes of this disease.
Embryonic stem cells are capable of self-renewing and differentiating to the desired fate depending on their position in the body. Stem cell homeostasis is maintained through epigenetic mechanisms that are highly dynamic in regulating the chromatin structure as well as specific gene transcription programs. Epigenetics has been used to refer to changes in gene expression, which are heritable through modifications not affecting the DNA sequence.
The epigenetics of schizophrenia is the study of how inherited epigenetic changes are regulated and modified by the environment and external factors and how these changes influence the onset and development of, and vulnerability to, schizophrenia. Epigenetics concerns the heritability of those changes, too. Schizophrenia is a debilitating and often misunderstood disorder that affects up to 1% of the world's population. Although schizophrenia is a heavily studied disorder, it has remained largely impervious to scientific understanding; epigenetics offers a new avenue for research, understanding, and treatment.
Differentially methylated regions (DMRs) are genomic regions with different DNA methylation status across different biological samples and regarded as possible functional regions involved in gene transcriptional regulation. The biological samples can be different cells/tissues within the same individual, the same cell/tissue at different times, cells/tissues from different individuals, even different alleles in the same cell.
An epigenetic clock is a biochemical test that can be used to measure age. The test is based on DNA methylation levels, measuring the accumulation of methyl groups to one's DNA molecules.
Neuroepigenetics is the study of how epigenetic changes to genes affect the nervous system. These changes may affect underlying conditions such as addiction, cognition, and neurological development.
CpG island hypermethylation is a phenomenon that is important for the regulation of gene expression in cancer cells, as an epigenetic control aberration responsible for gene inactivation. Hypermethylation of CpG islands has been described in almost every type of tumor.
The human epigenome is the complete set of structural modifications of chromatin and chemical modifications of histones and nucleotides. These modifications differ according to cell type and developmental status. Various studies show that the epigenome depends on exogenous factors.
Cellular deconvolution refers to computational techniques aiming at estimating the proportions of different cell types in samples collected from a tissue. For example, samples collected from the human brain are a mixture of various neuronal and glial cell types in different proportions, where each cell type has a diverse gene expression profile. Since most high-throughput technologies use bulk samples and measure the aggregated levels of molecular information for all cells in a sample, the measured values would be an aggregate of the values pertaining to the expression landscape of different cell types. Therefore, many downstream analyses such as differential gene expression might be confounded by the variations in cell type proportions when using the output of high-throughput technologies applied to bulk samples. The development of statistical methods to identify cell type proportions in large-scale bulk samples is an important step for better understanding of the relationship between cell type composition and diseases.
Epiphenotyping involves studying the relationship between DNA methylation patterns and phenotypic traits in individuals and populations in order to predict a phenotype from a DNA methylation profile.