GWAS catalog

The GWAS catalog is a free online database that compiles data from genome-wide association studies (GWAS), summarizing unstructured data from different literature sources into accessible, high-quality data. [1] It was created by the National Human Genome Research Institute (NHGRI) in 2008 and has been a collaborative project between the NHGRI and the European Bioinformatics Institute (EBI) since 2010. [1] As of September 2018, it included 71,673 SNP–trait associations from 3,567 publications. [2]

Contents

A GWAS identifies genetic loci associated with common traits and diseases through the analysis of categorized variants across the genome. The catalog provides information from all published GWAS results that meet its criteria. [3] It contains publication information, study group information (origin, size) and SNP–disease association information (including SNP identifier, P-value, gene and risk allele). [4] Over the years, the GWAS catalog has enhanced its data release frequency by adding features such as a graphical user interface, ontology-supported search functionality and a curation interface. [3]

The GWAS catalog is widely used by biologists, bioinformaticians and other researchers to identify causal variants and understand disease mechanisms. [4] [5] [6] [7] [8] Diseases for which GWAS have identified associated common genomic loci include cardiovascular disease, inflammatory bowel disease, type 2 diabetes and breast cancer. [3]

Accessibility of data

The public can access the GWAS Catalog’s data in three ways: [4]

  1. Search on the NHGRI web interface: provides information on traits and study publications, along with a tab-delimited file that is available for download. [4]
  2. Interactive interface: provides a visualization of all SNP-associated traits in the GWAS catalog as well as the SNPs’ positions on human chromosomes. [4] All SNPs associated with a particular trait are displayed with web links to related literature from different databases. [4]
  3. Data portals: Ensembl, the UCSC Genome Browser, PheGenI and other data portals provide access to the GWAS catalog through web links. [4]
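
As an illustration of working with a tab-delimited download of this kind, the following minimal Python sketch reads SNP–trait association rows and keeps those for one trait below a P-value cutoff. The column names (`snp_id`, `trait`, `p_value`, `gene`, `risk_allele`) and the sample rows are hypothetical placeholders chosen to mirror the fields described above; the actual file may label and order its columns differently. The default cutoff of 5e-8 is the conventional genome-wide significance threshold and is not taken from the catalog's own documentation.

```python
import csv
import io

# Hypothetical column names and made-up rows, for illustration only.
# The real download may use different headers and many more columns.
SAMPLE_TSV = (
    "snp_id\ttrait\tp_value\tgene\trisk_allele\n"
    "rs0000001\texample trait A\t3e-9\tGENE1\tA\n"
    "rs0000002\texample trait B\t2e-6\tGENE2\tG\n"
    "rs0000003\texample trait A\t1e-12\tGENE3\tT\n"
)


def load_associations(handle):
    """Read tab-delimited rows into dicts, converting the P-value to float."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        row["p_value"] = float(row["p_value"])
        rows.append(row)
    return rows


def filter_by_trait(rows, trait, max_p=5e-8):
    """Keep rows for one trait whose P-value is at or below max_p."""
    return [r for r in rows if r["trait"] == trait and r["p_value"] <= max_p]


# In practice the handle would be an opened copy of the downloaded file.
associations = load_associations(io.StringIO(SAMPLE_TSV))
hits = filter_by_trait(associations, "example trait A")
for hit in hits:
    print(hit["snp_id"], hit["gene"], hit["risk_allele"], hit["p_value"])
```

To use the sketch on a real file, the `io.StringIO` handle would be replaced with `open(path, newline="", encoding="utf-8")` and the dictionary keys adjusted to match the headers in that file.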

Applications

Current applications of the GWAS Catalog include studies of the genetics of human diseases [5] [6] and of the heritability of human traits. [7] GWAS catalog data can also be used as a pool of markers for SNP studies. [8]

References

  1. "About the GWAS Catalog". GWAS Catalog. Retrieved 2019-06-17.
  2. Buniello A, MacArthur JA, Cerezo M, Harris LW, Hayhurst J, Malangone C, et al. (January 2019). "The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019". Nucleic Acids Research. 47 (D1): D1005–D1012. doi:10.1093/nar/gky1120. PMC 6323933. PMID 30445434.
  3. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. (January 2017). "The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog)". Nucleic Acids Research. 45 (D1): D896–D901. doi:10.1093/nar/gkw1133. PMC 5210590. PMID 27899670.
  4. Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, et al. (January 2014). "The NHGRI GWAS Catalog, a curated resource of SNP-trait associations". Nucleic Acids Research. 42 (Database issue): D1001–D1006. doi:10.1093/nar/gkt1229. PMC 3965119. PMID 24316577.
  5. Morales J, Welter D, Bowler EH, Cerezo M, Harris LW, McMahon AC, et al. (February 2018). "A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog". Genome Biology. 19 (1): 21. doi:10.1186/s13059-018-1396-2. PMC 5815218. PMID 29448949.
  6. Loos RJ, Yeo GS (January 2014). "The bigger picture of FTO: the first GWAS-identified obesity gene". Nature Reviews. Endocrinology. 10 (1): 51–61. doi:10.1038/nrendo.2013.227. PMC 4188449. PMID 24247219.
  7. López-Cortegano E, Caballero A (July 2019). "Inferring the Nature of Missing Heritability in Human Traits Using Data from the GWAS Catalog". Genetics. 212 (3): 891–904. doi:10.1534/genetics.119.302077. PMC 6614893. PMID 31123044. Retrieved 2019-11-07.
  8. Pal LR, Moult J (July 2015). "Genetic Basis of Common Human Disease: Insight into the Role of Missense SNPs from Genome-Wide Association Studies". Journal of Molecular Biology. 427 (13): 2271–2289. doi:10.1016/j.jmb.2015.04.014. PMC 4893807. PMID 25937569.