An epigenome-wide association study (EWAS) is an examination of a genome-wide set of quantifiable epigenetic marks, such as DNA methylation, in different individuals to derive associations between epigenetic variation and a particular identifiable phenotype/trait. When patterns such as DNA methylation at specific loci change in a way that discriminates phenotypically affected cases from control individuals, this is considered an indication that an epigenetic perturbation has taken place that is associated, causally or consequentially, with the phenotype.

Background

The epigenome is governed by both genetic and environmental factors, causing it to be highly dynamic and complex. Epigenetic information exists in the cell as DNA and histone marks, as well as non-coding RNAs. DNA methylation (DNAm) patterns change over time and vary between developmental stages and tissue types. The main type of DNAm occurs at cytosines within CpG dinucleotides and is known to be involved in the regulation of gene expression. DNAm pattern changes have been extensively studied in complex diseases such as cancer and diabetes. In a normal cell, the bulk genome is highly methylated at CpGs, whereas CpG islands (CPI) at gene promoter regions remain largely unmethylated. Aberrant DNAm is the most common type of molecular abnormality in cancer cells, where the bulk genome becomes globally ‘hypomethylated’ and CPIs in promoter regions become ‘hypermethylated’, usually leading to silencing of tumour suppressor genes. More recently, studies on diabetes have uncovered further evidence to support an epigenetic component of diseases, including differences in disease-associated epigenetic marks between monozygotic twins, the rising incidence of type 1 diabetes in the general population, and developmental reprogramming events in which ''in utero'' or childhood environments can influence disease outcome in adulthood.

Post-translational histone modifications include, but are not limited to, methylation, acetylation and phosphorylation on the core histone tails. These post-translational modifications are read by proteins that can then modify the chromatin state at that locus.

Epigenetic variation arises in three distinct ways: it can be inherited and therefore be present in all cells of the adult, including the germline (a process known as transgenerational epigenetic inheritance, a controversial phenomenon that has not yet been observed in humans); it can occur randomly and be present in a subset of cells in the adult, the proportion of which depends on how early in development the variation occurs; or it can be induced as a result of behavioural or environmental factors. EWAS has previously associated changes in methylation with several diseases and complex conditions that do not have a known epidemiology, and is therefore crucial for identifying epigenetic factors that contribute to, or are a consequence of, the pathogenesis of these diseases.

Methods

Types of Study Designs

Retrospective (case-control)

Retrospective studies compare unrelated individuals who fall into two categories: controls without the disease or phenotype of interest, and cases who have the phenotype of interest. An advantage of such studies is that many case-control cohorts already exist with available genotype and expression data that can be integrated with epigenome data. A downside, however, is that they cannot determine whether epigenetic differences are a result of disease-associated genetic differences, post-disease processes or disease-associated drug interventions.

Family studies

Family studies are useful for studying transgenerational inheritance patterns of epigenetic marks. A main limitation of EWAS is deciphering whether a phenotype is associated with epigenetic changes as a result of the variable in question or as a result of pre-existing genomic variants that lead to epigenetic alterations. Comparisons between parent and offspring genomic and epigenomic data allow one to rule out the possibility that a disease or phenotype is due to genomic variation. A limitation of this study design is that very few sufficiently large cohorts exist.

Monozygotic twin studies

Monozygotic twins carry identical genomic information. Therefore, if they are discordant for a particular disease or phenotype it is likely a result of epigenetic differences. However, unless the twins are studied longitudinally it is impossible to determine if epigenetic variation is the cause of or consequence of disease. Another limitation is recruiting a large enough cohort of discordant monozygotic twins with the disease of interest.

Longitudinal cohorts

Longitudinal studies follow a cohort of individuals over an extended period of time, usually from birth or before disease onset. Samples are taken and records are kept over many years, making these studies extremely useful for determining the causality of particular phenotypes. Since the same individuals are followed at time points before and after disease onset, this design removes the confounding effects of differences between cases and controls. Longitudinal studies are useful not only for risk studies (using DNA samples taken prior to disease onset), but also for intervention studies that use pre- and post-treatment samples with specific exposures to investigate environmental impacts on the epigenome. Major disadvantages are the long timeline of the studies and their expense. Longitudinal studies using disease-discordant monozygotic twins give the added benefit of ruling out genetic influences on epigenetic variation.

Tissue of Interest

The tissue specificity of epigenomic marks creates another challenge when designing an EWAS. Tissue choice is limited by both accessibility and stability of epigenetic patterning. It is crucial to choose a tissue in which epigenetic marks are variable in the population yet stable over time. If this is not possible, multiple serially collected samples from the same individuals are required to report robust associations with a particular phenotype. EWAS for diseases often measure DNA methylation in blood samples because disease-relevant tissues are difficult to obtain. In some cases, the pattern of methylation in blood is not necessarily biologically relevant to the proposed phenotype. The choice of blood also requires stringent analysis and careful interpretation due to variable cell type composition. Choosing a surrogate tissue therefore requires not only that interindividual differences correlate between the tissue of interest and the surrogate, but also that the exposure induces similar changes in both tissues. To date, an underlying issue is that there is no clear evidence that, in general, epigenetic marks respond to environmental exposures in a similar way across tissues.

Quantification Method: DNA Methylation

Epigenome-wide DNAm quantification typically uses the high-throughput Illumina Methylation Assay. In the past, the 27k Illumina array covered on average two CpG sites in the promoter regions of approximately 14,000 genes and represented less than 0.1% of the 28 million CpG sites in the human genome. This falls short of being representative of the entire human epigenome. None of the early EWAS using this array used independent validation to verify the associated probes. An interesting observation was that the differences between cases and controls were biased towards non-CpG island probes (which were significantly underrepresented in this array design), arguing strongly for the use of the later 450k array, which covers non-CpG island regions with a higher density of probes. The Illumina 450k array has since become the most widely used platform for studies reporting EWAS. The array still covers less than 2% of the CpG sites in the genome, but it attempts to cover all known genes with a high density of probes in the promoters (including CpG islands and surrounding sequences) and with a lower density across gene bodies, 3′ untranslated regions, and other intergenic sequences.
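The coverage figures quoted above can be checked with simple arithmetic. The sketch below is only illustrative: the probe counts are nominal values implied by the array names (an assumption, not exact manufacturer figures), and the total of 28 million CpG sites is taken from the text.

```python
# Rough coverage arithmetic for the two arrays discussed above.
# Probe counts are nominal values implied by the array names (assumptions),
# not exact manufacturer figures.
TOTAL_CPG_SITES = 28_000_000  # CpG sites in the human genome, as stated in the text

NOMINAL_PROBES = {"27k": 27_000, "450k": 450_000}


def coverage_fraction(n_probes, total_sites=TOTAL_CPG_SITES):
    """Return the fraction of genomic CpG sites interrogated by an array."""
    return n_probes / total_sites


for name, n_probes in NOMINAL_PROBES.items():
    print(f"{name}: {coverage_fraction(n_probes):.3%} of CpG sites")
```

Under these nominal counts both arrays stay below the bounds given in the text (0.1% and 2% respectively).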

Data Analysis and Interpretation

Site-by-site analysis

DNA methylation is typically quantified on a scale of 0–1, as the methylation array measures the proportion of DNA molecules that are methylated at a particular CpG site. The initial analyses performed are univariate tests of association to identify sites where DNA methylation varies with exposure and/or phenotype. This is followed by multiple testing correction and by an analytical strategy to reduce batch effects and other technical confounding effects in the quantification of DNA methylation. The potential confounding effects arising from alterations in tissue composition are also taken into account. Additionally, confounding factors that may influence methylation status, such as age, gender and behaviours, are adjusted for as covariates. The association results are also corrected for the genomic control inflation factor in order to account for population stratification. Generally, mean levels of CpG methylation are compared across categories using linear regression, which allows for the adjustment of confounders and batch effects. A P-value threshold of P < 1e-7 is generally used to identify CpGs associated with the tested phenotype/stimulus. These CpGs are considered to reach epigenome-wide significance. An effect size is also calculated at this significance level, indicating the difference in methylation between two qualitative groups, or the change in methylation per unit of a quantitative phenotype. CpG sites significantly associated with the phenotype and/or treatment/environmental stimulus are typically represented in a Manhattan plot.
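Two of the steps described above lend themselves to a short sketch. One possible rationale for a threshold of the order of 1e-7 (an assumption, not stated in the text) is a Bonferroni correction of a 0.05 significance level over roughly 450,000 probes, since 0.05 / 450,000 ≈ 1.1e-7. The genomic control inflation factor is conventionally estimated as the median observed 1-df chi-square statistic divided by its expected median under the null. The function names below are hypothetical, and the code assumes two-sided P-values in (0, 1].

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# obtained by squaring the standard normal quantile at 0.25.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.25) ** 2


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def p_to_chi2(p):
    """Convert a two-sided P-value in (0, 1] to a 1-df chi-square statistic."""
    return _STD_NORMAL.inv_cdf(p / 2) ** 2


def chi2_to_p(chi2):
    """Convert a 1-df chi-square statistic back to a two-sided P-value."""
    return 2 * _STD_NORMAL.cdf(-(chi2 ** 0.5))


def inflation_factor(p_values):
    """Genomic control lambda: observed median chi-square over expected median."""
    return median(p_to_chi2(p) for p in p_values) / CHI2_1DF_MEDIAN


def genomic_control(p_values):
    """Rescale test statistics by lambda (never deflating) and return new P-values."""
    lam = max(inflation_factor(p_values), 1.0)
    return [chi2_to_p(p_to_chi2(p) / lam) for p in p_values]


print(f"Bonferroni threshold for 450,000 tests: {bonferroni_threshold(450_000):.2e}")
```

A lambda close to 1 suggests little systematic inflation; values well above 1 suggest unaccounted confounding such as stratification or batch structure.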

Regional Changes Analysis

Single CpG sites are prone to site-specific natural variation and to technical variation such as bad microarray probes and outliers. Combining measurements from adjacent sites takes such variation into account, yielding more robust associations and helping to increase power. In previous studies, functionally relevant findings have been associated with genomic regions as opposed to single CpGs. Therefore, looking at the regional level can help identify associated regions with more confidence, guiding downstream functional studies.
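The idea of pooling adjacent measurements can be sketched as follows. This is a minimal illustration, not an established region-finding method: it groups CpG sites on one chromosome whenever neighbouring sites lie within a maximum gap, then averages their effect sizes. The 500 bp gap and the function names are arbitrary assumptions.

```python
def group_adjacent_sites(sites, max_gap=500):
    """Group CpG sites on one chromosome into candidate regions.

    sites: list of (position, effect) tuples.
    max_gap: largest distance in bp allowed between neighbouring sites of the
             same region (500 is an arbitrary illustrative choice).
    """
    regions = []
    for pos, effect in sorted(sites):
        # Extend the current region if this site is close to its last site.
        if regions and pos - regions[-1][-1][0] <= max_gap:
            regions[-1].append((pos, effect))
        else:
            regions.append([(pos, effect)])
    return regions


def summarise_region(region):
    """Summarise a region by its span, number of sites and mean effect size."""
    positions = [pos for pos, _ in region]
    effects = [effect for _, effect in region]
    return {
        "start": min(positions),
        "end": max(positions),
        "n_sites": len(region),
        "mean_effect": sum(effects) / len(effects),
    }


example = [(100, 0.1), (300, 0.2), (2000, -0.1), (350, 0.3)]
for region in group_adjacent_sites(example):
    print(summarise_region(region))
```

A region supported by several neighbouring sites with consistent effects is less likely to reflect a single faulty probe than an isolated site.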

Pre-clustering or Grouping of CpG sites

Another method of analysis uses unsupervised clustering to create classes of CpG sites based on the similarity of their methylation variation across samples. The average methylation values within each class are used to construct data sets of reduced dimensionality, facilitating efficient tests of association between DNA methylation and phenotypes of interest. This reduces the dimensionality of large data sets and takes advantage of substantial biologically induced correlation. The method is useful for identifying gross patterns of methylation associated with the tested variable, but may miss specific CpG sites of interest. Besides differences in mean methylation levels, differences in the variation of DNA methylation across samples may also be biologically meaningful, motivating scans for differential variability between groups.
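The reduction step can be illustrated with a deliberately simple stand-in for a real clustering algorithm. The sketch below assigns each CpG site to the first class whose seed site it correlates with above a threshold, then averages methylation within each class per sample. The greedy rule, the 0.8 threshold and the site identifiers are assumptions made for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def greedy_classes(methylation, min_r=0.8):
    """Greedily group sites whose profiles correlate with a class's seed site.

    methylation: dict mapping site id -> list of methylation values per sample.
    """
    classes = []  # each class is a list of site ids; the first id is the seed
    for site, values in methylation.items():
        for members in classes:
            if pearson(values, methylation[members[0]]) >= min_r:
                members.append(site)
                break
        else:
            classes.append([site])
    return classes


def class_means(methylation, classes):
    """Average methylation within each class, giving one value per sample."""
    n_samples = len(next(iter(methylation.values())))
    return [
        [sum(methylation[s][i] for s in members) / len(members) for i in range(n_samples)]
        for members in classes
    ]


data = {  # hypothetical site ids and values
    "cg_a": [0.1, 0.2, 0.3, 0.4],
    "cg_b": [0.2, 0.3, 0.4, 0.5],
    "cg_c": [0.9, 0.1, 0.8, 0.2],
}
classes = greedy_classes(data)
print(classes, class_means(data, classes))
```

The reduced matrix has one row per class instead of one per site, so far fewer association tests are needed, at the cost of losing site-level resolution.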

Functional and Gene Set Enrichment

The location of the associated CpG sites or islands/regions can then be analyzed ''in silico'' to infer possible functional relevance. For example, one can consider whether the associated CpGs lie within a promoter region, or determine their distance from the transcription start site, which may be relevant especially when we assume that DNA methylation associated with a phenotype acts by regulating gene transcription. Many other inferences can be drawn from past biological knowledge if that particular region of CpGs has been studied and associated with changes in transcription. This can be used as an additional filter for identifying regions to pursue for functional validation. Several bioinformatic tools that have been developed for functional enrichment analysis can be applied to differentially methylated regions by first mapping these regions to genes. This is done using the distance between the CpGs and a gene promoter that is potentially regulated by the region. Enrichment analysis based on the genomic region has thus been suggested as a complementary approach and confers substantial interpretive potential. Differentially methylated regions can then be compared to a catalog of genomic regions including, for example, sites enriched for specific chromatin modifications or transcription factor binding sites.
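The mapping from a CpG to a nearby gene can be sketched as a nearest-neighbour lookup over transcription start sites (TSS). This is a simplified illustration: it ignores strand, works on a single chromosome, and uses a symmetric 1,500 bp window as a stand-in for a promoter definition. Gene names, coordinates and the window size are hypothetical.

```python
from bisect import bisect_left


def nearest_tss(cpg_pos, tss_list):
    """Return (gene, signed distance in bp) for the TSS closest to a CpG.

    tss_list: non-empty list of (position, gene) tuples sorted by position,
              all on the same chromosome as the CpG.
    """
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, cpg_pos)
    # Only the TSS just before and just after the CpG can be the nearest one.
    candidates = tss_list[max(i - 1, 0): i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - cpg_pos))
    return gene, cpg_pos - pos


def near_promoter(distance, window=1500):
    """Crude promoter check: is the CpG within `window` bp of the TSS?"""
    return abs(distance) <= window


tss = [(1000, "GENE_A"), (5000, "GENE_B"), (9000, "GENE_C")]  # hypothetical
gene, distance = nearest_tss(5200, tss)
print(gene, distance, near_promoter(distance))
```

The resulting gene list can then be passed to an enrichment tool of choice; a strand-aware definition of upstream and downstream would be needed for real annotation.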

Methylation Odds Ratio

A methylation odds ratio can be calculated if we consider the mean methylation rate at a site in cases (or controls) to represent the methylation probability for a randomly chosen DNA strand in the case (or control) tissue samples. The methylation odds ratio is the odds for a random DNA strand in the tissue sample from a random case to be methylated, divided by the same odds for controls. This provides a measure of effect size that incorporates relative magnitudes, but it does not capture differences between cases and controls in other features of the methylation spectrum, such as variance. The methylation odds ratio is comparable across prospective and retrospective studies; its value measures association only and does not imply causation. Methylation risk scores, which integrate information across CpG sites, have also been calculated: a weighted methylation risk score is the sum of methylation values at each of the markers associated with the phenotype, weighted by the marker-specific effect size.
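Both quantities defined above reduce to short formulas: the odds ratio is [p_case / (1 − p_case)] / [p_control / (1 − p_control)], where p is the mean methylation at a site, and the risk score is a weighted sum of methylation values. The sketch below is a minimal illustration; the function names, site identifiers and weights are hypothetical.

```python
def odds(p):
    """Odds corresponding to a probability strictly between 0 and 1."""
    if not 0 < p < 1:
        raise ValueError("mean methylation must lie strictly between 0 and 1")
    return p / (1 - p)


def methylation_odds_ratio(case_values, control_values):
    """Odds ratio of methylation at one site, from per-sample values on a 0-1 scale."""
    p_case = sum(case_values) / len(case_values)
    p_control = sum(control_values) / len(control_values)
    return odds(p_case) / odds(p_control)


def methylation_risk_score(methylation, weights):
    """Weighted sum of one individual's methylation values over associated markers.

    methylation: dict site id -> methylation value for the individual.
    weights: dict site id -> marker-specific effect size.
    """
    return sum(weight * methylation[site] for site, weight in weights.items())


print(methylation_odds_ratio([0.6, 0.8], [0.4, 0.6]))
print(methylation_risk_score({"cg_x": 0.5, "cg_y": 0.2}, {"cg_x": 2.0, "cg_y": -1.0}))
```

An odds ratio above 1 indicates higher methylation odds in cases than in controls at that site; as the text notes, it says nothing about differences in variance.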

Replication

Replication using an independent cohort is required to rule out false positives identified in the initial study. This can be done in a human cohort or in a more focused manner in animal models. It is important that, when selecting the replication cohort, the individuals are reflective of the initial cohort and that the same confounding variables are taken into account. Replication, however, can be limited due to the availability of individuals and samples.

Limitations and Concerns

Causality or Consequence

Variations in the epigenome can cause disease but can also arise as a consequence of disease, and distinguishing between the two is a major limitation in EWAS. One way to circumvent this is to determine whether the epigenetic variation is present before any symptoms of disease, preferably via longitudinal studies following the same cohort of people over many years (an approach with its own drawbacks of expense and study time frame). It must also be considered that epigenetic variation arising before disease onset does not necessarily cause the disease.

Sample heterogeneity

The most commonly used tissue in EWAS is blood. However, blood samples contain multiple different cell types, each of which has a unique epigenetic signature. It is therefore extremely difficult to determine whether a given sample is homogeneous, and hence whether the variation in epigenetic marks is due to differences in phenotype/stimulus or to sample heterogeneity.

Tissue availability

Currently many EWAS use blood as a surrogate tissue due to its availability and ease of collection. However, epigenetic changes in the blood may not be associated with the changes in the particular tissue associated with the disease. Many intriguing disorders that could have epigenetic causative factors affect tissues such as brain, lung, heart, etc. However, when studying human patients it is not an option to take these tissues for sampling, and they are therefore left unstudied.

Related Databases

EWASdb

EWASdb (http://www.bioapp.org/ewasdb/) is the first epigenome-wide association database (first online in 2015, and first published in Nucleic Acids Res. on 13 October 2018). It stores the results of 1319 EWAS studies associated with 302 diseases/phenotypes (p<1e-7). Three types of EWAS results are stored in EWASdb: EWAS for single epi-markers, EWAS for KEGG pathways, and EWAS for GO (Gene Ontology) categories.

EWAS Atlas

EWAS Atlas (http://bigd.big.ac.cn/ewas) is a curated knowledgebase of EWAS that provides a comprehensive collection of EWAS knowledge. Unlike extant data-oriented epigenetic resources, EWAS Atlas features manual curation of EWAS knowledge from extensive publications. In the current implementation, EWAS Atlas focuses on DNA methylation, one of the key epigenetic marks; it integrates 388,851 high-quality EWAS associations, involving 126 tissues/cell lines and covering 351 traits, 2,230 cohorts and 390 ontology entities, all based on manual curation of 649 studies reported in 495 publications. In addition, it is equipped with a trait enrichment analysis tool, which is capable of profiling trait-trait and trait-epigenome relationships. Future developments include regular curation of recent EWAS publications, incorporation of more epigenetic marks and possible integration of EWAS with GWAS. Collectively, EWAS Atlas is dedicated to the curation, integration and standardization of EWAS knowledge and has the potential to help researchers dissect the molecular mechanisms of epigenetic modifications associated with biological traits.

EWAS Data Hub

EWAS Data Hub (https://bigd.big.ac.cn/ewas/datahub) is a resource for collecting and normalizing DNA methylation array data as well as archiving associated metadata. The current release of EWAS Data Hub integrates a comprehensive collection of DNA methylation array data from 75,344 samples and employs a normalization method to remove batch effects among different datasets. Taking advantage of both massive high-quality DNA methylation data and standardized metadata, EWAS Data Hub provides reference DNA methylation profiles under different contexts, involving 81 tissues/cell types (including 25 brain parts and 25 blood cell types), six ancestry categories, and 67 diseases (including 39 cancers). In summary, EWAS Data Hub aims to aid the retrieval and discovery of methylation-based biomarkers for phenotype characterization, clinical treatment and health care.
