Mendelian randomization

In epidemiology, Mendelian randomization (commonly abbreviated to MR) is a method using measured variation in genes to examine the causal effect of an exposure on an outcome. Under key assumptions (see below), the design reduces both reverse causation and confounding, which often substantially impede or mislead the interpretation of results from epidemiological studies. [1]

Gregor Mendel. The term Mendelian randomization was coined because the random assignment of genetic variants from parents to offspring is fundamental to the method.

The study design was first proposed in 1986 [2] and subsequently described by Gray and Wheatley [3] as a method for obtaining unbiased estimates of the effects of an assumed causal variable without conducting a traditional randomized controlled trial (the standard in epidemiology for establishing causality). These authors also coined the term Mendelian randomization.

Motivation

One of the predominant aims of epidemiology is to identify modifiable causes of health outcomes and disease, especially those of public health concern. To ascertain whether modifying a particular trait (e.g. via an intervention, treatment or policy change) will convey a beneficial effect within a population, firm evidence that this trait causes the outcome of interest is required. However, many observational epidemiological study designs are limited in their ability to distinguish correlation from causation: specifically, whether a particular trait causes an outcome of interest, is simply related to that outcome (but does not cause it), or is a consequence of the outcome itself. Only the first of these is useful in a public health setting, where the aim is to modify that trait to reduce the burden of disease. There are many epidemiological study designs that aim to understand relationships between traits within a population sample, each with shared and unique advantages and limitations in terms of providing causal evidence, with the "gold standard" being randomized controlled trials. [4]

Well-known successful demonstrations of causal evidence consistent across multiple studies with different designs include the identified causal links between smoking and lung cancer, and between blood pressure and stroke. However, there have also been notable failures, in which exposures hypothesized to be causal risk factors for a particular outcome were later shown by well-conducted randomized controlled trials not to be causal. For instance, it was previously thought that hormone replacement therapy would prevent cardiovascular disease, but it is now known to have no such benefit. [5] Another notable example is that of selenium and prostate cancer. Some observational studies found an association between higher circulating selenium levels (usually acquired through various foods and dietary supplements) and lower risk of prostate cancer. However, the Selenium and Vitamin E Cancer Prevention Trial (SELECT) showed evidence that dietary selenium supplementation actually increased the risk of prostate and advanced prostate cancer and had an additional off-target effect of increasing type 2 diabetes risk. [6] Mendelian randomization methods now support the view that high selenium status may not prevent cancer in the general population, and may even increase the risk of specific types. [7]

Such inconsistencies between observational epidemiological studies and randomized controlled trials are likely a function of social, behavioral, or physiological confounding factors in many observational epidemiological designs, which are particularly difficult to measure accurately and to control for. Moreover, randomized controlled trials (RCTs) are usually expensive, time-consuming and laborious, and many epidemiological findings cannot be ethically replicated in clinical trials. Mendelian randomization studies appear capable of resolving questions of potential confounding more efficiently than RCTs. [8]

Definition

Mendelian randomization (MR) is fundamentally an instrumental variables estimation method hailing from econometrics. The method uses the properties of germline genetic variation (usually in the form of single nucleotide polymorphisms or SNPs) strongly associated with a putative exposure as a "proxy" or "instrument" for that exposure to test for and estimate a causal effect of the exposure on an outcome of interest from observational data. The genetic variation used will have either well-understood effects on exposure patterns (e.g. propensity to smoke heavily) or effects that mimic those produced by modifiable exposures (e.g. raised blood cholesterol [2]). Importantly, the genotype must only affect the disease status indirectly via its effect on the exposure of interest. [9]

Directed acyclic graph traditionally used to represent the Mendelian randomization framework and its core assumptions. Z is the genetic variants, X is the exposure, Y is the outcome of interest, and U are possible confounders.

As genotypes are assigned randomly when passed from parents to offspring during meiosis, groups of individuals defined by genetic variation associated with an exposure at a population level should be largely unrelated to the confounding factors that typically plague observational epidemiology studies. Germline genetic variation (i.e. that which can be inherited) is also fixed at conception and not modified by the onset of any outcome or disease, precluding reverse causation. Additionally, given improvements in modern genotyping technologies, measurement error and systematic misclassification are often low with genetic data. In this regard Mendelian randomization can be thought of as analogous to "nature's randomized controlled trial".

Mendelian randomization requires three core instrumental variable assumptions. [10] Namely that:

  1. The genetic variant(s) being used as an instrument for the exposure is associated with the exposure. This is known as the "relevance" assumption.
  2. There are no common causes (i.e. confounders) of the genetic variant(s) and the outcome of interest. This is known as the "independence" or "exchangeability" assumption.
  3. There is no independent pathway between the genetic variant(s) and the outcome other than through the exposure. This is known as the "exclusion restriction" or "no horizontal pleiotropy" assumption.
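What the three assumptions buy can be illustrated with a small simulation. The sketch below is not taken from any cited study: the allele frequency (0.3), the effect sizes, and the function names `simulate`, `naive_slope` and `iv_ratio` are all assumptions chosen for illustration. It generates a genotype Z that affects the exposure X (relevance), is independent of an unmeasured confounder U (independence), and reaches the outcome Y only through X (exclusion restriction), then compares a naive regression slope with the genotype-based ratio estimate.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def simulate(n=100_000, causal_effect=0.5, seed=1):
    """Simulate genotype Z, exposure X and outcome Y with a hidden confounder U."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = (rng.random() < 0.3) + (rng.random() < 0.3)  # 0, 1 or 2 copies of the allele
        ui = rng.gauss(0, 1)                              # confounder, independent of Z
        xi = 0.4 * zi + ui + rng.gauss(0, 1)              # relevance: Z shifts X
        yi = causal_effect * xi + ui + rng.gauss(0, 1)    # Z affects Y only through X
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


def naive_slope(x, y):
    """Slope from regressing Y on X directly; biased here because U drives both."""
    return covariance(x, y) / covariance(x, x)


def iv_ratio(z, x, y):
    """Genotype-outcome covariance divided by genotype-exposure covariance."""
    return covariance(z, y) / covariance(z, x)


z, x, y = simulate()
print(f"naive regression slope: {naive_slope(x, y):.2f}")
print(f"genotype-based estimate: {iv_ratio(z, x, y):.2f}")
```

Under these simulated conditions the naive slope should be pulled well above the true effect of 0.5 by the confounder, while the genotype-based estimate should land close to it. If any of the three assumptions is broken in the simulation (for example by letting `ui` depend on `zi`), that guarantee no longer holds.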

To satisfy the first core assumption, Mendelian randomization requires robust associations between genetic variation and the exposure of interest. These are usually obtained from genome-wide association studies, though they can also come from candidate gene studies. The second assumption relies on there being no population substructure (e.g. geographical factors that induce an association between the genotype and outcome), on mate choice not being associated with genotype (i.e. random mating or panmixia), and on there being no dynastic effects (i.e. where the expression of parental genotype in the parental phenotype directly affects the offspring phenotype). [citation needed]

Statistical analysis

Mendelian randomization is usually applied through the use of instrumental variables estimation with genetic variants acting as instruments for the exposure of interest. [11] This can be implemented using data on the genetic variants, exposure and outcome of interest for a set of individuals in a single dataset or using summary data on the association between the genetic variants and the exposure and the association between the genetic variants and the outcome in separate datasets. The method has also been used in economic research studying the effects of obesity on earnings, and other labor market outcomes. [12]

When a single dataset is used, the methods of estimation applied are those frequently used elsewhere in instrumental variable estimation, such as two-stage least squares. [13] If multiple genetic variants are associated with the exposure, they can either be used individually as instruments or combined to create an allele score, which is used as a single instrument. [citation needed]
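A minimal sketch of the single-dataset approach, assuming individual-level data are available as plain Python lists: several variants are combined into a weighted allele score, which is then used as the single instrument in a hand-rolled two-stage least squares. The function names, the simulated data, and the choice to use the true per-allele effects as score weights are illustrative assumptions; in practice weights are typically estimated from external data.

```python
import random


def ols(x, y):
    """Intercept and slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def allele_score(genotypes, weights):
    """Weighted count of exposure-raising alleles for each individual."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


def two_stage_least_squares(instrument, exposure, outcome):
    """Point estimate of the causal effect using a single instrument."""
    # Stage 1: predict the exposure from the instrument.
    intercept, slope = ols(instrument, exposure)
    predicted = [intercept + slope * s for s in instrument]
    # Stage 2: regress the outcome on the genetically predicted exposure.
    return ols(predicted, outcome)[1]


def simulate(n=50_000, causal_effect=0.5, seed=2):
    """Hypothetical data: three variants, one unmeasured confounder."""
    rng = random.Random(seed)
    effects = [0.5, 0.4, 0.3]  # assumed per-allele effects on the exposure
    genotypes, exposure, outcome = [], [], []
    for _ in range(n):
        row = [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in effects]
        u = rng.gauss(0, 1)  # confounder of exposure and outcome
        x = sum(e * g for e, g in zip(effects, row)) + u + rng.gauss(0, 1)
        y = causal_effect * x + u + rng.gauss(0, 1)
        genotypes.append(row)
        exposure.append(x)
        outcome.append(y)
    return genotypes, effects, exposure, outcome


genotypes, weights, exposure, outcome = simulate()
score = allele_score(genotypes, weights)
print(f"2SLS estimate: {two_stage_least_squares(score, exposure, outcome):.2f}")
print(f"naive OLS slope: {ols(exposure, outcome)[1]:.2f}")
```

This returns a point estimate only. Standard errors read off a manually run second stage are not valid, because they ignore the uncertainty in the first stage; dedicated instrumental variable routines correct for this.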

Analysis using summary data often applies data from genome-wide association studies. In this case the association between genetic variants and the exposure is taken from the summary results produced by a genome-wide association study for the exposure. The association between the same genetic variants and the outcome is then taken from the summary results produced by a genome-wide association study for the outcome. These two sets of summary results are then used to obtain the MR estimate. Given the following notation:

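$\hat{\pi}_j$: estimated effect of genetic variant $j$ on the exposure;
$\hat{\Gamma}_j$: estimated effect of genetic variant $j$ on the outcome;
$\hat{\sigma}_j$: estimated standard error of $\hat{\Gamma}_j$;
$\hat{\beta}_{MR}$: MR estimate of the causal effect of the exposure on the outcome;

and considering the effect of a single genetic variant, the MR estimate can be obtained from the Wald ratio:

$$\hat{\beta}_{MR_j} = \frac{\hat{\Gamma}_j}{\hat{\pi}_j}$$

When multiple genetic variants are used, the individual ratios for each genetic variant are combined using inverse variance weighting, where each individual ratio is weighted by the inverse of the variance of its estimate, so that more precisely estimated ratios count for more. [14] This gives the IVW estimate, which can be calculated as:

$$\hat{\beta}_{IVW} = \frac{\sum_j \hat{\pi}_j \hat{\Gamma}_j \hat{\sigma}_j^{-2}}{\sum_j \hat{\pi}_j^{2} \hat{\sigma}_j^{-2}}$$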

Alternatively, the same estimate can be obtained from a linear regression that uses the genetic variant-outcome association as the outcome and the genetic variant-exposure association as the exposure. This linear regression is weighted by the inverse of the variance of the genetic variant-outcome association and does not include a constant (intercept) term.
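As a rough illustration of the summary-data estimators described in this section, the sketch below computes the Wald ratio for one variant (variant-outcome estimate divided by variant-exposure estimate) and the IVW estimate for several variants, first as a weighted average of Wald ratios and then as the slope of a weighted regression through the origin. The function names and the three sets of summary statistics are hypothetical and do not come from any real genome-wide association study. The weights use the common first-order approximation in which the variance of each ratio is the squared standard error of the variant-outcome estimate divided by the squared variant-exposure estimate.

```python
def wald_ratio(beta_exposure, beta_outcome):
    """Causal estimate from one variant: outcome effect over exposure effect."""
    return beta_outcome / beta_exposure


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Inverse-variance weighted average of the per-variant Wald ratios."""
    ratios = [wald_ratio(bx, by) for bx, by in zip(beta_exposure, beta_outcome)]
    # Approximate variance of each ratio is se^2 / bx^2, so its inverse is bx^2 / se^2.
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    return sum(w * r for w, r in zip(weights, ratios)) / sum(weights)


def weighted_regression_through_origin(beta_exposure, beta_outcome, se_outcome):
    """Slope of outcome effects on exposure effects, weights 1/se^2, no intercept."""
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    return numerator / denominator


# Hypothetical summary statistics for three variants (illustrative values only).
beta_exposure = [0.10, 0.20, 0.15]  # variant-exposure estimates
beta_outcome = [0.05, 0.11, 0.06]   # variant-outcome estimates
se_outcome = [0.01, 0.01, 0.03]     # standard errors of the variant-outcome estimates

ivw = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
slope = weighted_regression_through_origin(beta_exposure, beta_outcome, se_outcome)
print(f"IVW as weighted mean of Wald ratios: {ivw:.4f}")
print(f"IVW as weighted regression slope:    {slope:.4f}")
```

The two printed values agree up to floating-point rounding, which is the equivalence the text describes. With a single variant, `ivw_estimate` reduces to the Wald ratio for that variant.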

These methods only provide reliable estimates of the causal effect of the exposure on the outcome under the core instrumental variable assumptions. Alternative methods are available that are robust to a violation of the third assumption, i.e. that provide reliable results under some types of horizontal pleiotropy. [15] Additionally some biases that arise from violations of the second IV assumption, such as dynastic effects, can be overcome through the use of data which includes siblings or parents and their offspring. [16]

History

The Mendelian randomization method depends on two principles derived from the original work by Gregor Mendel on genetic inheritance. Its foundation comes from Mendel's laws, namely 1) the law of segregation, in which there is complete segregation of the two allelomorphs in equal numbers of germ-cells of a heterozygote, and 2) the law of independent assortment, in which separate pairs of allelomorphs segregate independently of one another. These laws were first published as such in 1906 by Robert Heath Lock. Another progenitor of Mendelian randomization is Sewall Wright, who introduced path analysis, a form of causal diagram used for making causal inference from non-experimental data. Path analysis relies on causal anchors, and the anchors in the majority of his examples were provided by Mendelian inheritance, as is the basis of MR. [17] Another component of the logic of MR is the instrumental gene, the concept of which was introduced by Thomas Hunt Morgan. [18] This is important as it removed the need to understand the physiology of the gene in order to make inferences about genetic processes. [citation needed]

Since that time the literature has included examples of research using molecular genetics to make inferences about modifiable risk factors, which is the essence of MR. One example is the work of Gerry Lower and colleagues in 1979, who used the N-acetyltransferase phenotype as an anchor to draw inferences about various exposures, including smoking and amine dyes, as risk factors for bladder cancer. [19] Another example is the work of Martijn Katan (then of Wageningen University & Research, Netherlands), in which he advocated a study design using Apolipoprotein E alleles as an instrumental variable anchor to study the observed relationship between low blood cholesterol levels and increased risk of cancer. [2] The term "Mendelian randomization" was first used in print by Richard Gray and Keith Wheatley (both of Radcliffe Infirmary, Oxford, UK) in 1991 in a somewhat different context: a method allowing instrumental variable estimation, but relying on Mendelian inheritance rather than genotype. [3] In their 2003 paper, Shah Ebrahim and George Davey Smith used the term again to describe the method of using germline genetic variants for understanding causality in an instrumental variable analysis, and it is this methodology that is now widely used and to which the meaning is ascribed. [20] The Mendelian randomization method is now widely adopted in causal epidemiology, and the number of MR studies reported in the scientific literature has grown every year since the 2003 paper. In 2021 the STROBE-MR guidelines were published to help readers and reviewers of Mendelian randomization studies evaluate the validity and utility of published studies. [21]

References

  1. Haycock PC, Burgess S, Wade KH, Bowden J, Relton C, Davey Smith G (April 2016). "Best (but oft-forgotten) practices: the design, analysis, and interpretation of Mendelian randomization studies". The American Journal of Clinical Nutrition. 103 (4): 965–978. doi:10.3945/ajcn.115.118216. PMC 4807699. PMID 26961927.
  2. Katan MB (March 1986). "Apolipoprotein E isoforms, serum cholesterol, and cancer". Lancet. 1 (8479): 507–508. doi:10.1016/s0140-6736(86)92972-7. PMID 2869248. S2CID 38327985.
  3. Gray R, Wheatley K (1991). "How to avoid bias when comparing bone marrow transplantation with chemotherapy". Bone Marrow Transplantation. 7 (Suppl 3): 9–12. PMID 1855097.
  4. Murad MH, Asi N, Alsawas M, Alahdab F (August 2016). "New evidence pyramid". BMJ Evidence-Based Medicine. 21 (4): 125–127. doi:10.1136/ebmed-2016-110401. ISSN 2515-446X. PMC 4975798. PMID 27339128.
  5. "Benefits and risks of HRT | Information for the public | Menopause: diagnosis and management | Guidance | NICE". www.nice.org.uk. 12 November 2015.
  6. Klein EA, Thompson IM, Tangen CM, Crowley JJ, Lucia MS, Goodman PJ, et al. (October 2011). "Vitamin E and the risk of prostate cancer: the Selenium and Vitamin E Cancer Prevention Trial (SELECT)". JAMA. 306 (14): 1549–1556. doi:10.1001/jama.2011.1437. PMC 4169010. PMID 21990298.
  7. Yuan S, Mason AM, Carter P, Vithayathil M, Kar S, Burgess S, Larsson SC (2022). "Selenium and cancer risk: Wide-angled Mendelian randomization analysis". International Journal of Cancer. 150 (7): 1134–1140.
  8. "Researchers find a way to mimic clinical trials using genetics". MIT Technology Review.
  9. Holmes MV, Ala-Korpela M, Smith GD (October 2017). "Mendelian randomization in cardiometabolic disease: challenges in evaluating causality". Nature Reviews. Cardiology. 14 (10): 577–590. doi:10.1038/nrcardio.2017.78. PMC 5600813. PMID 28569269.
  10. Wade K (2021). "MR Dictionary". MR Dictionary.
  11. Didelez V, Sheehan N (August 2007). "Mendelian randomization as an instrumental variable approach to causal inference". Statistical Methods in Medical Research. 16 (4): 309–330. doi:10.1177/0962280206077743. PMID 17715159. S2CID 6236517.
  12. Böckerman P, Cawley J, Viinikainen J, Lehtimäki T, Rovio S, Seppälä I, et al. (January 2019). "The effect of weight on labor market outcomes: An application of genetic instrumental variables". Health Economics. 28 (1): 65–77. doi:10.1002/hec.3828. PMC 6585973. PMID 30240095.
  13. Wooldridge JM (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). Cambridge, MA: MIT Press. ISBN 978-0-262-23258-6. OCLC 627701062.
  14. Burgess S, Butterworth A, Thompson SG (November 2013). "Mendelian randomization analysis with multiple genetic variants using summarized data". Genetic Epidemiology. 37 (7): 658–665. doi:10.1002/gepi.21758. PMC 4377079. PMID 24114802.
  15. Hemani G, Bowden J, Davey Smith G (August 2018). "Evaluating the potential role of pleiotropy in Mendelian randomization studies". Human Molecular Genetics. 27 (R2): R195–R208. doi:10.1093/hmg/ddy163. PMC 6061876. PMID 29771313.
  16. Brumpton B, Sanderson E, Heilbron K, Hartwig FP, Harrison S, Vie GÅ, et al. (July 2020). "Avoiding dynastic, assortative mating, and population stratification biases in Mendelian randomization through within-family analyses". Nature Communications. 11 (1): 3519. Bibcode:2020NatCo..11.3519B. doi:10.1038/s41467-020-17117-4. PMC 7360778. PMID 32665587.
  17. Wright S (1921). "Correlation and causation". J. Agricultural Research. 20: 557–585.
  18. Morgan TH (1917). "The Theory of the Gene". The American Naturalist. 51 (609): 513–544. doi:10.1086/279629. ISSN 0003-0147. JSTOR 2456204. S2CID 84050307.
  19. Lower GM, Nilsson T, Nelson CE, Wolf H, Gamsky TE, Bryan GT (April 1979). "N-acetyltransferase phenotype and risk in urinary bladder cancer: approaches in molecular epidemiology. Preliminary results in Sweden and Denmark". Environmental Health Perspectives. 29: 71–79. doi:10.1289/ehp.792971. PMC 1637362. PMID 510245.
  20. Smith GD, Ebrahim S (February 2003). "'Mendelian randomization': Can genetic epidemiology contribute to understanding environmental determinants of disease?". International Journal of Epidemiology. 32 (1): 1–22. doi:10.1093/ije/dyg070. PMID 12689998.
  21. Skrivankova VW, Richmond RC, Woolf BA, Davies NM, Swanson SA, VanderWeele TJ, et al. (October 2021). "Strengthening the reporting of observational studies in epidemiology using mendelian randomisation (STROBE-MR): explanation and elaboration". BMJ. 375: n2233. doi:10.1136/bmj.n2233. PMC 8546498. PMID 34702754.

Further reading