Phenome-wide association study

In genetics and genetic epidemiology, a phenome-wide association study, abbreviated PheWAS, is a study design in which the association between single-nucleotide polymorphisms (SNPs) or other types of DNA variants and a large number of different phenotypes is tested. [1] The aim of PheWASs is to examine the causal linkage between known sequence differences and any type of trait, including molecular, biochemical, and cellular traits, and especially clinical diagnoses and outcomes. [2] [3] [4] It is a complementary approach to the genome-wide association study, or GWAS, methodology. [5] A fundamental difference between GWAS and PheWAS designs is the direction of inference: in a PheWAS it runs from exposure (the DNA variant) to many possible outcomes, that is, from SNPs to differences in phenotypes and disease risk. In a GWAS, the analysis runs from one or a few phenotypes to many possible DNA variants. [3] The approach has proven useful in rediscovering previously reported genotype-phenotype associations, [2] [5] as well as in identifying new ones. [6]

The PheWAS approach was originally developed in response to the widespread availability of both anonymized human clinical electronic health record (EHR) data and matched genotype data, using phenotypes defined by groupings of International Classification of Diseases (ICD) codes called phecodes. [7] Massive genome and phenome data sets assembled for model organisms have also proved effective for PheWAS. [8] PheWASs have also been conducted using data from existing epidemiological studies. In 2010, a proof-of-concept PheWAS was published based on EHR billing codes from a single study site. [9] Though this study was generally underpowered, its results suggested the potential existence of new associations between multiple phenotypes, possibly due to a common underlying cause. This paper also coined the abbreviation "PheWAS". [10] As of 2019, PheWASs in the EHR have been conducted using ICD-9-CM, [11] ICD-10, and ICD-10-CM [12] diagnosis codes.

Methods

PheWAS grew out of the increasing use of electronic medical records (EMRs) in clinical practice and patient care. [9] One of the main components of an EMR system is the set of International Classification of Diseases, version 9-CM (ICD9) codes, which are used for medical billing records. [9] This system includes information on 14,000 diseases binned into hierarchical codes. [9] This phenotypic information is the basis of a PheWAS, which associates a genetic variant (or a combination of variants) with a wide range of phenotypes. [5]

Most PheWASs divide the cohort into two groups for each ICD9 code: individuals who have the code in their record are considered “cases”, while individuals who do not are treated as “controls”. [13] Starting from a given genetic variant (typically a SNP), a PheWAS systematically tests how a particular genotype is associated with each phenotype. [13] From the variant data, the genotype distribution and the chi-squared distribution are calculated, followed by Fisher's exact test to obtain a P-value that indicates how strongly a genotype is associated with a given phenotype of interest from the EMR. [9] [14] Often, a Bonferroni correction is then applied to account for the multiple comparisons made when calculating the P-values.
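The testing step described above can be illustrated with a minimal sketch. It is not the implementation used in the cited studies: the function names, the 2x2 table layout, and the counts are hypothetical, and only the Fisher's exact test and Bonferroni steps are shown, written with the Python standard library.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout: rows are carriers / non-carriers of the variant,
    columns are cases / controls for one phenotype code.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers among the cases.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(p, 1.0)


def phewas_scan(tables, alpha=0.05):
    """Test one variant against many phenotypes using a Bonferroni threshold."""
    threshold = alpha / len(tables)
    results = {}
    for phenotype, (a, b, c, d) in tables.items():
        p = fisher_exact_two_sided(a, b, c, d)
        results[phenotype] = (p, p < threshold)
    return results


# Made-up counts per phenotype:
# (carrier cases, carrier controls, non-carrier cases, non-carrier controls)
toy_tables = {
    "phenotype_A": (30, 70, 10, 90),
    "phenotype_B": (12, 88, 11, 89),
}
toy_results = phewas_scan(toy_tables)
for name, (p, significant) in toy_results.items():
    print(f"{name}: p={p:.3g}, significant after Bonferroni: {significant}")
```

Dividing the significance level by the number of phenotypes tested is the simplest form of the Bonferroni correction; with thousands of phenotype codes the per-test threshold becomes correspondingly strict.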

Proof of Concept

The first PheWAS was performed on a population of 6000 European-Americans, with 5 SNPs of interest picked for validation, including rs1333049, rs2200733, rs3135388, and rs6457620. [9] Quality control was done by examining marker and sample genotyping efficiency, allele frequency calculations, and Hardy-Weinberg equilibrium tests. [9]
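As an illustration of one of these quality-control checks, the sketch below tests a single marker for departure from Hardy-Weinberg equilibrium. The source does not state which test statistic was used; a chi-squared goodness-of-fit test with one degree of freedom is assumed here, and the function name is hypothetical.

```python
from math import erfc, sqrt


def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """Chi-squared goodness-of-fit test (1 degree of freedom) for HWE.

    Arguments are the observed genotype counts for AA, AB and BB.
    Assumes both alleles are observed, so no expected count is zero.
    Returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = hardy_weinberg_chi2(25, 50, 25)
print(chi2, p_value)
```

A marker with a very small p-value deviates from the expected genotype proportions, which can indicate genotyping problems and is a common reason to exclude it before association testing.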

This initial PheWAS aimed to examine the impact of genetic variants across various phenotypes. [3] Since the ICD9 was not specifically designed for research purposes, the study devised a new way to simplify the codes for genetic studies. Specifically, three modifications were made to the ICD9:

  1. First, three-digit codes for diseases that arise from the same or a similar origin were combined. For example, tuberculosis has three subtypes, and all three were merged into one case group, 010. [9]
  2. Secondly, a fourth-digit identifier was added for phenotypes that are clinically distinct but are categorized under the same code. An example is Type I and Type II diabetes, two clinically distinct phenotypes that both fall under the ICD9 code ‘250’; an additional fourth digit was added to differentiate the two. [9]
  3. Lastly, codes deemed not useful for genotype-phenotype analysis were ignored. Cases such as foreign object contamination or non-specific symptoms and non-specific laboratory results fall under this category. [9]
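The three modifications can be sketched as a small mapping function. Everything in this block is illustrative: the merged codes, the excluded codes, the output labels `250.1` and `250.2`, and the function name are placeholders rather than the groupings defined in the cited study. The sketch also assumes the ICD-9-CM convention that the fifth digit of a 250 code distinguishes Type I from Type II diabetes.

```python
# Illustrative rules only; the real groupings are defined in the cited study.
MERGED_GROUPS = {"010": "010", "011": "010", "012": "010"}  # placeholder tuberculosis subtypes
EXCLUDED_CODES = {"780", "790", "930"}  # placeholder symptom, lab-finding and foreign-body codes


def to_case_group(icd9_code):
    """Map a raw ICD9 billing code to a simplified phenotype group.

    Returns None for codes that are ignored in the analysis.
    """
    code = icd9_code.strip()
    three_digit = code.split(".")[0]
    if three_digit in EXCLUDED_CODES:
        return None
    if three_digit == "250":
        # Assumed ICD-9-CM convention: fifth digit 1 or 3 is Type I, 0 or 2 is Type II.
        decimals = code.split(".")[1] if "." in code else ""
        if len(decimals) == 2:
            return "250.1" if decimals[1] in "13" else "250.2"
        return "250"
    # Merge related three-digit codes; otherwise keep the three-digit code.
    return MERGED_GROUPS.get(three_digit, three_digit)


for raw in ["011.9", "250.01", "250.00", "780.6", "401.1"]:
    print(raw, "->", to_case_group(raw))
```

Once every billing code is mapped to a group, an individual can be counted as a case for a group if any of their codes maps to it, and as a control otherwise.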

As one example of its successes, this PheWAS showed evidence of a strong association between rs3135388 and multiple sclerosis (MS), an association that had been reported previously. [9] Twenty-two other diseases also demonstrated significant associations with P < 0.05. [9]

Applications

Pleiotropy Study

One of the main advantages of a PheWAS is its potential to identify genomic variants with pleiotropic properties. [5] Understanding cross-phenotype (CP) associations, where one genetic variant affects two or more independent phenotypes, is key to understanding pleiotropic effects. [13] One study of pleiotropic effects began by obtaining summary genotype and phenotype data from the Population Architecture using Genomics and Epidemiology (PAGE) study sites. [1] After several quality control and data organization steps, either standard logistic or linear regression was performed, depending on the type of phenotype. [1] All continuous phenotypes were log-transformed before the association between the SNPs and the transformed phenotypes was calculated. [1]
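The continuous-phenotype branch of this analysis can be sketched as an ordinary least-squares fit of the log-transformed phenotype on genotype. This is a simplified illustration, not the PAGE pipeline: the additive 0/1/2 allele-dosage coding, the function name, and the toy values are assumptions, and the logistic regression used for binary phenotypes is not shown.

```python
from math import log


def log_linear_association(dosages, phenotype_values):
    """Least-squares slope and intercept of log(phenotype) on allele dosage.

    dosages: copies of the tested allele per individual (assumed 0, 1 or 2).
    phenotype_values: strictly positive continuous measurements.
    """
    y = [log(v) for v in phenotype_values]  # natural-log transform
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (yi - mean_y) for x, yi in zip(dosages, y))
    slope = sxy / sxx  # change in log(phenotype) per copy of the allele
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data in which the phenotype doubles with each copy of the allele.
slope, intercept = log_linear_association([0, 1, 2], [1.0, 2.0, 4.0])
print(slope, intercept)
```

On the log scale, a slope of log(2) corresponds to a doubling of the untransformed phenotype per allele copy, which is one reason the transform is convenient for skewed laboratory measurements.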

Generally, there are two types of results from a PheWAS study:

Even when novel associations between phenotypes are discovered, further biological studies are necessary to determine whether they actually reflect the underlying biological system. [15]

Drug Response Variability

A PheWAS has also successfully highlighted differences in drug response among individuals. A quantitative PheWAS was carried out to identify variation in thiopurine response. [16] The EMR stored quantitative values of thiopurine S-methyltransferase (TPMT) activity for patients with inflammatory bowel disease (IBD), which allowed researchers to split the patients into three categories: low TPMTa, normal TPMTa, and very high TPMTa. [16] Very high TPMTa was found to be associated with diabetes mellitus and iron-deficiency anemia, and thiopurine therapy was three times more likely to fail in patients with very high TPMTa. [16] [14] Giving thiopurine therapy to patients with very high TPMTa may increase the frequency of anemia episodes. [16] This PheWAS finding may advance personalized treatment based on a patient's own measurements: instead of receiving conventional thiopurine treatment, such IBD patients may benefit more from more intensive therapy or other approaches. [16]

Clinical Significance

A PheWAS has also been applied to clinical data from HIV patients, obtained from AIDS Clinical Trials Group (ACTG) datasets from 27 different laboratories. [15] Establishing how well PheWAS results agree with clinical trial findings is important before PheWAS is used to inform clinical decisions. Forty-seven percent of the previously reported associations were successfully reproduced in this study, demonstrating the capability of PheWAS to work with clinical data. [15] Additionally, several pleiotropic effects were discovered using this clinical data. Specifically, a block of SNPs on chromosome 7 was associated with both LDL-C phenotypes and total cholesterol level in this study. [15] For clinical relevance, more research needs to be done to validate the pleiotropic effects obtained from PheWAS. [15]

Limitations

Despite its promise, PheWAS has some potential limitations:

References

  1. Pendergrass SA, Brown-Gentry K, Dudek SM, Torstenson ES, Ambite JL, Avery CL, et al. (July 2011). "The use of phenome-wide association studies (PheWAS) for exploration of novel genotype-phenotype relationships and pleiotropy discovery". Genetic Epidemiology. 35 (5): 410–422. doi:10.1002/gepi.20589. PMC 3116446. PMID 21594894.
  2. Denny JC, Bastarache L, Roden DM (August 2016). "Phenome-Wide Association Studies as a Tool to Advance Precision Medicine". Annual Review of Genomics and Human Genetics. 17: 353–373. doi:10.1146/annurev-genom-090314-024956. PMC 5480096. PMID 27147087.
  3. Bush WS, Oetjens MT, Crawford DC (March 2016). "Unravelling the human genome-phenome relationship using phenome-wide association studies". Nature Reviews. Genetics. 17 (3): 129–145. doi:10.1038/nrg.2015.36. PMID 26875678. S2CID 32967414.
  4. Wang X, Pandey AK, Mulligan MK, Williams EG, Mozhui K, Li Z, et al. (February 2016). "Joint mouse-human phenome-wide association to test gene function and disease risk". Nature Communications. 7: 10464. Bibcode:2016NatCo...710464W. doi:10.1038/ncomms10464. PMC 4740880. PMID 26833085.
  5. Hebbring SJ (February 2014). "The challenges, advantages and future of phenome-wide association studies". Immunology. 141 (2): 157–165. doi:10.1111/imm.12195. PMC 3904236. PMID 24147732.
  6. Cronin RM, Field JR, Bradford Y, Shaffer CM, Carroll RJ, Mosley JD, et al. (2014). "Phenome-wide association studies demonstrating pleiotropy of genetic variants within FTO with and without adjustment for body mass index". Frontiers in Genetics. 5: 250. doi:10.3389/fgene.2014.00250. PMC 4134007. PMID 25177340.
  7. Bastarache L (July 2021). "Using Phecodes for Research with the Electronic Health Record: From PheWAS to PheRS". Annual Review of Biomedical Data Science. 4: 1–19. doi:10.1146/annurev-biodatasci-122320-112352. PMC 9307256. PMID 34465180.
  8. Li H, Wang X, Rukina D, Huang Q, Lin T, Sorrentino V, et al. (January 2018). "An Integrated Systems Genetics and Omics Toolkit to Probe Gene Function". Cell Systems. 6 (1): 90–102.e4. doi:10.1016/j.cels.2017.10.016. PMID 29199021.
  9. Denny JC, Ritchie MD, Basford MA, Pulley JM, Bastarache L, Brown-Gentry K, et al. (May 2010). "PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations". Bioinformatics. 26 (9): 1205–1210. doi:10.1093/bioinformatics/btq126. PMC 2859132. PMID 20335276.
  10. Roden DM (June 2017). "Phenome-wide association studies: a new method for functional genomics in humans". The Journal of Physiology. 595 (12): 4109–4115. doi:10.1113/jp273122. PMC 5471509. PMID 28229460.
  11. Wei WQ, Bastarache LA, Carroll RJ, Marlo JE, Osterman TJ, Gamazon ER, et al. (2017). "Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record". PLOS ONE. 12 (7): e0175508. Bibcode:2017PLoSO..1275508W. doi:10.1371/journal.pone.0175508. PMC 5501393. PMID 28686612.
  12. Wu P, Gifford A, Meng X, Li X, Campbell H, Varley T, et al. (November 2019). "Mapping ICD-10 and ICD-10-CM Codes to Phecodes: Workflow Development and Initial Evaluation". JMIR Medical Informatics. 7 (4): e14325. doi:10.2196/14325. PMC 6911227. PMID 31553307.
  13. Pendergrass SA, Ritchie MD (June 2015). "Phenome-Wide Association Studies: Leveraging Comprehensive Phenotypic and Genotypic Data for Discovery". Current Genetic Medicine Reports. 3 (2): 92–100. doi:10.1007/s40142-015-0067-9. PMC 4489156. PMID 26146598.
  14. Robinson JR, Denny JC, Roden DM, Van Driest SL (March 2018). "Genome-wide and Phenome-wide Approaches to Understand Variable Drug Actions in Electronic Health Records". Clinical and Translational Science. 11 (2): 112–122. doi:10.1111/cts.12522. PMC 5866959. PMID 29148204.
  15. Moore CB, Verma A, Pendergrass S, Verma SS, Johnson DH, Daar ES, et al. (January 2015). "Phenome-wide Association Study Relating Pretreatment Laboratory Parameters With Human Genetic Variants in AIDS Clinical Trials Group Protocols". Open Forum Infectious Diseases. 2 (1): ofu113. doi:10.1093/ofid/ofu113. PMC 4396430. PMID 25884002.
  16. Neuraz A, Chouchana L, Malamut G, Le Beller C, Roche D, Beaune P, et al. (December 2013). "Phenome-wide association studies on a quantitative trait: application to TPMT enzyme activity and thiopurine therapy in pharmacogenomics". PLOS Computational Biology. 9 (12): e1003405. Bibcode:2013PLSCB...9E3405N. doi:10.1371/journal.pcbi.1003405. PMC 3873228. PMID 24385893.