Missing heritability problem

The missing heritability problem [1] [2] [3] [4] [5] [6] is the gap between heritability estimates obtained from genetic data and those obtained from twin and family data, observed across many physical and mental traits, including diseases, behaviors, and other phenotypes. The problem has significant implications for medicine: a person's susceptibility to disease may depend more on the combined effect of all the genes in the background than on the disease genes in the foreground, or the role of genes may have been severely overestimated.

Discovery

The missing heritability problem was named as such in 2008 (after the "missing baryon problem" in physics). The Human Genome Project led to optimistic forecasts that the large genetic contributions to many traits and diseases (identified by quantitative genetics, and behavioral genetics in particular) would soon be mapped and pinned down to specific genes and their genetic variants. This was to be done by methods such as candidate-gene studies, which used small samples with limited genetic sequencing to focus on specific genes believed to be involved, examining single-nucleotide polymorphisms (SNPs). While many hits were found, they often failed to replicate in other studies.

The exponential fall in genome sequencing costs led to the use of genome-wide association studies (GWASes), which could examine all candidate genes simultaneously in larger samples than the original findings. In these studies, the candidate-gene hits were found to almost always be false positives, with only 2-6% replicating. [7] [8] [9] [10] [11] [12] For intelligence, only 1 candidate-gene hit replicated; [13] the top 25 schizophrenia candidate genes were no more associated with schizophrenia than chance; [14] [15] and of 15 neuroimaging hits, none replicated. [16] In 2012, the editorial board of Behavior Genetics noted, in setting more stringent requirements for candidate-gene publications, that "the literature on candidate gene associations is full of reports that have not stood up to rigorous replication...it now seems likely that many of the published findings of the last decade are wrong or misleading and have not contributed to real advances in knowledge". [17] Other researchers have characterized the literature as having "yielded an infinitude of publications with very few consistent replications" and called for a phase-out of candidate-gene studies in favor of polygenic scores. [18]

Dilemma

Standard genetics methods have long estimated large heritabilities, such as 80%, for traits such as height or intelligence, yet none of the genes had been found despite sample sizes that, while small, should have been able to detect variants of reasonable effect size such as 1 inch or 5 IQ points. If genes have such strong cumulative effects, where were they? Several resolutions have been proposed; the explanation may be some combination of the following:

  1. Twin studies and other methods were grossly biased by issues long raised by their critics; there was little genetic influence to be found. Therefore, it has been proposed that the genes that supposedly underlie behavior genetic estimates of heritability simply do not exist. [19] For instance, twin studies may have neglected to measure cross-cultural environmental variation by design. [20]
  2. Genetic effects are actually epigenetic.
  3. Genetic effects are generally non-additive and due to complex interactions. Among many proposals, a model has been introduced that takes into account the effect of epigenetic inheritance on the risk and recurrence risk of a complex disease. [4] The limiting pathway (LP) model has also been introduced, in which a trait depends on the value of k inputs that can have rate limitations due to stoichiometric ratios, reactants required in a biochemical pathway, or proteins required for transcription of a gene. Each of these k inputs is a strictly additive trait that depends on a set of common or rare variants. When k = 1, the LP model is simply a standard additive trait. [2]
  4. Genetic effects are not due to the common SNPs examined in the candidate-gene studies & GWASes, but due to very rare mutations, copy-number variations, and other exotic kinds of genetic variants. These variants tend to be harmful and kept at low frequencies by natural selection. Whole-genome sequencing would be required to track down specific rare variants.
  5. Traits are all misdiagnoses: one person's 'schizophrenia' is due to entirely different causes than another's, and so while a gene may be involved in one case, it will not be involved in another, rendering GWASes futile.
  6. GWASes are unable to detect genes with moderate effects on phenotypes when those genes segregate at high frequencies [21]
  7. Traits are genuine but inconsistently diagnosed or genetically influenced from country to country and time to time, leading to measurement error which, combined with genetic heterogeneity due to either race or environment, will bias meta-analyzed GWAS & GCTA results towards zero. [22] [23] [24] [25] [26] [27]
  8. Genetic effects are indeed through common SNPs acting additively, but are highly polygenic: dispersed over hundreds or thousands of variants, each of small effect (like a fraction of an inch or a fifth of an IQ point) and with low prior probability. Each variant is unexpected enough that a candidate-gene study is unlikely to select the right SNP out of hundreds of thousands of known SNPs, and GWASes up to 2010, with n<20000, would be unable to find hits which reach genome-wide statistical-significance thresholds. Much larger GWAS sample sizes, often n>100k, would be required to find any hits at all, and the number of hits would steadily increase after that.
This resolution to the missing heritability problem was supported by the introduction of genome-wide complex trait analysis (GCTA) in 2010, which demonstrated that trait similarity could be predicted by the genetic similarity of unrelated strangers on common SNPs treated additively, and that for many traits the SNP heritability was indeed a substantial fraction of the overall heritability. The GCTA results were further supported by findings that a small percentage of trait variance could be predicted in GWASes without any genome-wide statistically-significant hits by a linear model including all SNPs regardless of p-value. If there were no SNP contribution, this would be unlikely, but it is what one would expect from SNPs whose effects were very imprecisely estimated by a too-small sample. Combined with the upper bound on maximum effect sizes set by the GWASes up to then, this strongly implied that the highly polygenic theory was correct. Examples of complex traits where increasingly large-scale GWASes have yielded initial hits, and then increasing numbers of hits as sample sizes increased from n<20k to n>100k or n>300k, include height, [28] educational attainment, [29] and schizophrenia.
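The sample-size argument behind the polygenic resolution can be illustrated with a rough power calculation. The sketch below is not taken from the cited studies. It assumes a single additive SNP tested with a two-sided z-test, the conventional genome-wide significance threshold of 5×10^-8, an IQ standard deviation of 15 points, and an allele frequency of 0.5; the function names `variance_explained` and `gwas_power` are hypothetical.

```python
from statistics import NormalDist

# Assumption: conventional genome-wide significance threshold (not from the text).
GENOME_WIDE_ALPHA = 5e-8
# Assumption: IQ is scaled to a standard deviation of 15 points.
IQ_SD = 15.0


def variance_explained(beta_sd, allele_freq):
    """Fraction of trait variance explained by one additive SNP.

    beta_sd is the per-allele effect in trait standard deviations.
    """
    return 2.0 * allele_freq * (1.0 - allele_freq) * beta_sd ** 2


def gwas_power(n, beta_sd, allele_freq=0.5, alpha=GENOME_WIDE_ALPHA):
    """Approximate power to detect one SNP at significance level alpha."""
    q2 = variance_explained(beta_sd, allele_freq)
    # Non-centrality parameter of the 1-df association test.
    ncp = n * q2 / (1.0 - q2)
    shift = ncp ** 0.5
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    # Two-sided test: probability that the statistic falls beyond either cutoff.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    large_effect = 5.0 / IQ_SD   # "5 IQ points" per allele
    small_effect = 0.2 / IQ_SD   # "a fifth of an IQ point" per allele
    for n in (2_000, 20_000, 100_000, 300_000):
        print(
            f"n={n:>7}: power(5 points)={gwas_power(n, large_effect):.3g}, "
            f"power(0.2 points)={gwas_power(n, small_effect):.3g}"
        )
```

Under these assumptions, a variant worth 5 IQ points would be detected almost surely even with a few thousand participants, whereas a variant worth a fifth of a point has negligible power at n = 20,000 and still less than even odds at n = 300,000. With thousands of such variants, the expected number of hits rises with sample size, which matches the pattern described above.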

References

  1. Manolio, T. A.; Collins, F. S.; Cox, N. J.; Goldstein, D. B.; Hindorff, L. A.; Hunter, D. J.; McCarthy, M. I.; Ramos, E. M.; Cardon, L. R.; Chakravarti, A.; Cho, J. H.; Guttmacher, A. E.; Kong, A.; Kruglyak, L.; Mardis, E.; Rotimi, C. N.; Slatkin, M.; Valle, D.; Whittemore, A. S.; Boehnke, M.; Clark, A. G.; Eichler, E. E.; Gibson, G.; Haines, J. L.; MacKay, T. F. C.; McCarroll, S. A.; Visscher, P. M. (2009). "Finding the missing heritability of complex diseases". Nature. 461 (7265): 747–753. Bibcode:2009Natur.461..747M. doi:10.1038/nature08494. PMC 2831613. PMID 19812666.
  2. Zuk, O.; Hechter, E.; Sunyaev, S. R.; Lander, E. S. (2012). "The mystery of missing heritability: Genetic interactions create phantom heritability". Proceedings of the National Academy of Sciences. 109 (4): 1193–1198. Bibcode:2012PNAS..109.1193Z. doi:10.1073/pnas.1119675109. PMC 3268279. PMID 22223662.
  3. Lee, S. H.; Wray, N. R.; Goddard, M. E.; Visscher, P. M. (2011). "Estimating Missing Heritability for Disease from Genome-wide Association Studies". American Journal of Human Genetics. 88 (3): 294–305. doi:10.1016/j.ajhg.2011.02.002. PMC 3059431. PMID 21376301.
  4. Slatkin, M. (2009). "Epigenetic Inheritance and the Missing Heritability Problem". Genetics. 182 (3): 845–850. doi:10.1534/genetics.109.102798. PMC 2710163. PMID 19416939.
  5. Eichler, E. E.; Flint, J.; Gibson, G.; Kong, A.; Leal, S. M.; Moore, J. H.; Nadeau, J. H. (2010). "Missing heritability and strategies for finding the underlying causes of complex disease". Nature Reviews Genetics. 11 (6): 446–450. doi:10.1038/nrg2809. PMC 2942068. PMID 20479774.
  6. Maher, Brendan (2008). "Personal genomes: The case of the missing heritability". Nature. 456 (7218): 18–21. doi:10.1038/456018a. PMID 18987709.
  7. Dumas-Mallet, Estelle; Button, Katherine; Boraud, Thomas; Munafo, Marcus; Gonon, François (2016). "Replication Validity of Initial Association Studies: A Comparison between Psychiatry, Neurology and Four Somatic Diseases". PLOS ONE. 11 (6): e0158064. Bibcode:2016PLoSO..1158064D. doi:10.1371/journal.pone.0158064. PMC 4919034. PMID 27336301.
  8. Ioannidis, John P. A.; Tarone, Robert; McLaughlin, Joseph K. (2011). "The False-positive to False-negative Ratio in Epidemiologic Studies". Epidemiology. 22 (4): 450–456. doi:10.1097/EDE.0b013e31821b506e. PMID 21490505.
  9. Samek, Diana R.; Bailey, Jennifer; Hill, Karl G.; Wilson, Sylia; Lee, Susanne; Keyes, Margaret A.; Epstein, Marina; Smolen, Andrew; Miller, Michael; Winters, Ken C.; Hawkins, J. David; Catalano, Richard F.; Iacono, William G.; McGue, Matt (2016). "A Test-Replicate Approach to Candidate Gene Research on Addiction and Externalizing Disorders: A Collaboration Across Five Longitudinal Studies". Behavior Genetics. 46 (5): 608–626. doi:10.1007/s10519-016-9800-8. PMC 5060092. PMID 27444553.
  10. Bevan, Steve; Traylor, Matthew; Adib-Samii, Poneh; Malik, Rainer; Paul, Nicola L. M.; Jackson, Caroline; Farrall, Martin; Rothwell, Peter M.; Sudlow, Cathie; Dichgans, Martin; Markus, Hugh S. (2012). "Genetic Heritability of Ischemic Stroke and the Contribution of Previously Reported Candidate Gene and Genomewide Associations". Stroke. 43 (12): 3161–3167. doi:10.1161/STROKEAHA.112.665760. PMID 23042660.
  11. Siontis, K. C.; Patsopoulos, N. A.; Ioannidis, J. P. (2010). "Replication of past candidate loci for common diseases and phenotypes in 100 genome-wide association studies". European Journal of Human Genetics. 18 (7): 832–837. doi:10.1038/ejhg.2010.26. PMC 2987361. PMID 20234392.
  12. Duncan, Laramie E.; Keller, Matthew C. (2011). "A Critical Review of the First 10 Years of Candidate Gene-by-Environment Interaction Research in Psychiatry". The American Journal of Psychiatry. 168 (10): 1041–1049. doi:10.1176/appi.ajp.2011.11020191. PMC 3222234. PMID 21890791.
  13. Chabris, CF; Hebert, BM; Benjamin, DJ; Beauchamp, J; Cesarini, D; van der Loos, M; Johannesson, M; Magnusson, PK; Lichtenstein, P; Atwood, CS; Freese, J; Hauser, TS; Hauser, RM; Christakis, N; Laibson, D (2012). "Most reported genetic associations with general intelligence are probably false positives". Psychol Sci. 23 (11): 1314–23. doi:10.1177/0956797611435528. PMC 3498585. PMID 23012269.
  14. Johnson, Emma C.; Border, Richard; Melroy-Greif, Whitney E.; de Leeuw, Christiaan A.; Ehringer, Marissa A.; Keller, Matthew C. (2017). "No Evidence That Schizophrenia Candidate Genes Are More Associated With Schizophrenia Than Noncandidate Genes". Biological Psychiatry. 82 (10): 702–708. doi:10.1016/j.biopsych.2017.06.033. PMC 5643230. PMID 28823710.
  15. Avinun, Reut; Nevo, Adam; Knodt, Annchen R.; Elliott, Maxwell L.; Hariri, Ahmad R. (2018). "Replication in Imaging Genetics: The Case of Threat-Related Amygdala Reactivity". Biological Psychiatry. 84 (2): 148–159. doi:10.1016/j.biopsych.2017.11.010. PMC 5955809. PMID 29279201.
  16. Jahanshad, Neda; Ganjgahi, Habib; Bralten, Janita; Braber, Anouk den; Faskowitz, Joshua; Knodt, Annchen R.; Lemaitre, Hervé; Nir, Talia M.; Patel, Binish; Richie, Stuart; Sprooten, Emma; Hoogman, Martine; Hulzen, Kimm van; Zavaliangos-Petropulu, Artemis; Zwiers, Marcel P. (2017). "Do Candidate Genes Affect the Brain's White Matter Microstructure? Large-Scale Evaluation of 6,165 Diffusion MRI Scans". bioRxiv 10.1101/107987.
  17. Hewitt, John K. (2012). "Editorial Policy on Candidate Gene Association and Candidate Gene-by-Environment Interaction Studies of Complex Traits". Behavior Genetics. 42 (1): 1–2. doi:10.1007/s10519-011-9504-z. PMID 21928046.
  18. Arango, C. (2017). "Candidate gene associations studies in psychiatry: Time to move forward". European Archives of Psychiatry and Clinical Neuroscience. 267 (1): 1–2. doi:10.1007/s00406-016-0765-7. PMID 28070643.
  19. Chaufan, Claudia; Joseph, Jay (April 2013). "The 'Missing Heritability' of Common Disorders: Should Health Researchers Care?". International Journal of Health Services. 43 (2): 281–303. doi:10.2190/hs.43.2.f. ISSN 0020-7314. PMID 23821906. S2CID 25092977.
  20. Gillett, George (April 2024). "The problem with genetic heritability estimates in psychiatry: 'missing heritability' or missed cross-cultural environmental variation?". Psychiatry Research. 336. doi:10.1016/j.psychres.2024.115916. PMID 38640570.
  21. Caballero, Armando; Tenesa, Albert; Keightley, Peter D. (December 2015). "The Nature of Genetic Variation for Complex Traits Revealed by GWAS and Regional Heritability Mapping Analyses". Genetics. 201 (4): 1601–1613. doi:10.1534/genetics.115.177220. ISSN 1943-2631. PMC 4676519. PMID 26482794.
  22. De Vlaming, Ronald; Okbay, Aysu; Rietveld, Cornelius A.; Johannesson, Magnus; Magnusson, Patrik K.E.; Uitterlinden, André G.; Van Rooij, Frank J.A.; Hofman, Albert; Groenen, Patrick J.F.; Thurik, A. Roy; Koellinger, Philipp D. (2016). "Meta-GWAS Accuracy and Power (MetaGAP) calculator shows that hiding heritability is partially due to imperfect genetic correlations across studies". bioRxiv 10.1101/048322.
  23. Wray, Naomi R.; Maier, Robert (2014). "Genetic Basis of Complex Genetic Disease: The Contribution of Disease Heterogeneity to Missing Heritability". Current Epidemiology Reports. 1 (4): 220–227. doi:10.1007/s40471-014-0023-3.
  24. Wray, Naomi R.; Lee, Sang Hong; Kendler, Kenneth S. (2012). "Impact of diagnostic misclassification on estimation of genetic correlations using genome-wide genotypes". European Journal of Human Genetics. 20 (6): 668–674. doi:10.1038/ejhg.2011.257. PMC 3355255. PMID 22258521.
  25. Lee et al 2013a, "Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs"
  26. Lee et al 2013b, "General framework for meta-analysis of rare variants in sequencing association studies"
  27. Sham & Purcell 2014, "Statistical power and significance testing in large-scale genetic studies"
  28. "Defining the role of common variation in the genomic and biological architecture of adult human height", Wood et al 2014
  29. Chabris et al 2012 reported only 1 possible hit using a few thousand; "GWAS of 126,559 Individuals Identifies Genetic Variants Associated with Educational Attainment", Rietveld et al 2013 with n=100k reported 3 hits; "Genome-wide association study identifies 74 loci associated with educational attainment", Okbay et al 2016 reported 74 hits using n=293k and ~160 when extended to n=404k