Association mapping

In genetics, association mapping, also known as "linkage disequilibrium mapping", is a method of mapping quantitative trait loci (QTLs) that takes advantage of historic linkage disequilibrium to link phenotypes (observable characteristics) to genotypes (the genetic constitution of organisms), uncovering genetic associations. [1] [2]

Theory

Association mapping is based on the idea that a trait that has entered a population only recently will still be linked to the surrounding genetic sequence of the ancestor in which it arose; in other words, it will be found within a given haplotype more often than outside it. It is most often performed by scanning the entire genome for significant associations between a particular phenotype and a panel of single nucleotide polymorphisms (SNPs), which in many cases are spotted onto glass slides to create "SNP chips". These associations must then be independently verified to show that they either (a) contribute to the trait of interest directly, or (b) are in linkage disequilibrium with a quantitative trait locus (QTL) that contributes to the trait of interest. [3]

Association mapping seeks to identify specific functional genetic variants (loci, alleles) linked to phenotypic differences in a trait, in order to facilitate the detection of trait-causing DNA sequence polymorphisms and the selection of genotypes that closely resemble the phenotype. Identifying these functional variants requires high-throughput markers such as SNPs. [4]
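The genome scan described above can be illustrated with a minimal sketch. It assumes genotypes coded as allele dosages (0, 1 or 2 copies) and a quantitative phenotype, and it ranks markers by the squared correlation between dosage and phenotype. The marker names and data are invented for illustration, and a real analysis would use a formal test statistic with multiple-testing correction rather than a bare correlation.

```python
from math import fsum


def r_squared(dosage, phenotype):
    """Squared Pearson correlation between allele dosage and phenotype."""
    n = len(dosage)
    mean_g = fsum(dosage) / n
    mean_y = fsum(phenotype) / n
    sxy = fsum((g - mean_g) * (y - mean_y) for g, y in zip(dosage, phenotype))
    sxx = fsum((g - mean_g) ** 2 for g in dosage)
    syy = fsum((y - mean_y) ** 2 for y in phenotype)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker or constant trait carries no signal
    return sxy * sxy / (sxx * syy)


# Hypothetical toy data: 8 individuals, 3 markers (names are illustrative).
phenotype = [1.0, 1.2, 2.1, 2.0, 2.9, 3.1, 2.0, 1.1]
genotypes = {
    "snp_a": [0, 0, 1, 1, 2, 2, 1, 0],
    "snp_b": [0, 1, 0, 1, 0, 1, 0, 1],
    "snp_c": [2, 2, 1, 2, 1, 1, 2, 2],
}

# Scan every marker and keep the one most strongly associated with the trait.
scan = {snp: r_squared(dosage, phenotype) for snp, dosage in genotypes.items()}
best_snp = max(scan, key=scan.get)
```

In this toy panel `snp_a` tracks the phenotype most closely. As the text notes, such a hit would still need independent verification, since the marker may only be in linkage disequilibrium with the causal locus.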

Use

The advantage of association mapping is that it can map quantitative traits with high resolution in a way that is statistically very powerful. Association mapping, however, also requires extensive knowledge of SNPs within the genome of the organism of interest, and is therefore difficult to perform in species that have not been well studied or do not have well-annotated genomes. [5] Association mapping has been most widely applied to the study of human disease, specifically in the form of a genome-wide association study (GWAS). A genome-wide association study is performed by scanning an entire genome for SNPs associated with a particular trait of interest or, in the case of human disease, with a particular disease of interest. [3] [6] To date, thousands of genome-wide association studies have been performed on the human genome in an attempt to identify SNPs associated with a wide variety of complex human diseases (e.g. cancer, Alzheimer's disease, and obesity). The results of all such published GWAS are maintained in an NIH database (Figure 1). Whether these studies have been clinically or therapeutically useful, however, remains controversial. [6]

Figure 1. Published genome-wide associations through 6/2009: 439 published GWA at p < 5 × 10⁻⁸.
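A stringent significance cut-off is needed because a genome scan performs a very large number of tests. As a rough sketch, a Bonferroni correction divides the desired family-wise error rate by the number of tests. The figure of one million independent tests is an assumption often quoted for human GWAS, not a value taken from this article, and the p-values below are invented for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Assumption: roughly one million independent common-variant tests.
threshold = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical per-SNP p-values from a scan.
p_values = {"snp_a": 3e-9, "snp_b": 2e-6, "snp_c": 0.4}
hits = [snp for snp, p in p_values.items() if p < threshold]
```

Under these assumptions the threshold works out to 5 × 10⁻⁸, and only `snp_a` would be reported as genome-wide significant.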

Types and variations

(A) Association mapping in populations whose members are assumed to be independent

Several standard methods are used to test for association. Case-control studies were among the first approaches used to determine whether a particular genetic variant is associated with increased risk of disease in humans. Woolf, in 1955, proposed a relative risk statistic that could be used to assess genotype-dependent risk. However, a persistent concern regarding these studies is the adequacy of the matching of cases and controls. In particular, population stratification can produce false positive associations. In response to this concern, Falk and Rubinstein (1987) suggested a method for assessing relative risk that uses family-based controls, obviating this source of potential error. In essence, the method uses as its control sample the parental alleles or haplotypes not transmitted to affected offspring.
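A case-control comparison of this kind can be sketched as follows. The sketch assumes the relative risk statistic is computed in its commonly presented form, the cross-product (odds) ratio of a 2 × 2 table with a confidence interval built on the log scale. The counts are invented for illustration.

```python
from math import exp, log, sqrt


def odds_ratio_log_ci(case_with, case_without, control_with, control_without, z=1.96):
    """Odds ratio for a 2x2 case-control table with a log-scale confidence interval.

    All four cell counts must be positive; z=1.96 gives an approximate 95% interval.
    """
    cells = (case_with, case_without, control_with, control_without)
    if min(cells) <= 0:
        raise ValueError("all cell counts must be positive")
    odds_ratio = (case_with * control_without) / (case_without * control_with)
    # Standard error of log(odds ratio): square root of the summed reciprocal counts.
    se_log = sqrt(sum(1.0 / c for c in cells))
    lower = exp(log(odds_ratio) - z * se_log)
    upper = exp(log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: 30 of 100 cases and 15 of 100 controls carry the variant.
odds_ratio, ci_lower, ci_upper = odds_ratio_log_ci(30, 70, 15, 85)
```

Here the odds ratio is about 2.43 and the interval excludes 1, which would suggest an association. If cases and controls were drawn from different subpopulations, however, the same result could arise from stratification alone, which is the concern that motivated family-based controls.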

(B) Association mapping in populations whose members are assumed to be related

In the real world it is very hard to find independent (unrelated) individuals. Population-based association mapping has been modified to control for population stratification or relatedness, as in nested association mapping. Population-based QTL mapping has a further limitation: the favorable allele must be present at a relatively high frequency to be detected. Favorable alleles, however, are usually rare mutant alleles (for example, a resistant parent might be 1 out of 10,000 genotypes). Another variant of association mapping in related populations is family-based association mapping, in which multiple unrelated families or pedigrees are used instead of multiple unrelated individuals. Family-based association mapping [7] can be used in situations where the mutant alleles have been introgressed into populations. One popular family-based method is the transmission disequilibrium test. For details, see Family-based QTL mapping.
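The transmission disequilibrium test mentioned above can be sketched in its basic biallelic form, which compares how often heterozygous parents transmit a given allele to affected offspring against how often they do not. The sketch assumes the standard McNemar-type statistic with one degree of freedom, and the counts are invented for illustration.

```python
from math import erfc, sqrt


def transmission_disequilibrium_test(transmitted, not_transmitted):
    """Basic TDT for one allele, counted over heterozygous parents of affected offspring.

    Returns the chi-square statistic (1 degree of freedom) and its p-value.
    """
    total = transmitted + not_transmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi_square = (transmitted - not_transmitted) ** 2 / total
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical counts: the allele was transmitted 60 times and withheld 40 times.
chi_square, p_value = transmission_disequilibrium_test(60, 40)
```

Because each transmitted allele is compared with the untransmitted allele of the same parent, the comparison is made within families, which is why this kind of test is robust to population stratification.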

Advantages

The advantages of population-based association mapping, which uses a sample of individuals from germplasm collections or a natural population, over traditional QTL mapping in biparental crosses are primarily due to the availability of broader genetic variation with a wider background for marker-trait correlations. Association mapping can map quantitative traits with high resolution and high statistical power. The resolution of the mapping depends on the extent of LD, the non-random association of markers, across the genome. Association mapping offers the opportunity to investigate diverse genetic material and potentially to identify multiple alleles and mechanisms underlying traits. It uses recombination events that have occurred over an extended period of time. It also allows historically measured trait data to be exploited for association and, lastly, it does not require the development of expensive and tedious biparental populations, which makes the approach time-saving and cost-effective. [8] [9]
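Since mapping resolution depends on the extent of LD, it can help to see how LD between two markers is quantified. The sketch below assumes phased haplotype counts for two biallelic loci and computes the usual coefficient D and the squared correlation r². The counts are invented for illustration.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) for two biallelic loci from phased haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total           # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / total   # frequency of allele B at the second locus
    d = p_AB - p_A * p_B          # departure from independent association
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denominator


# Hypothetical sample of 100 haplotypes.
d, r_squared = linkage_disequilibrium(40, 10, 10, 40)
```

For these counts D is 0.15 and r² is 0.36. Where r² decays quickly with physical distance, an associated marker must lie close to the causal variant, which gives finer resolution but requires denser marker coverage.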

Limitations

A major issue with association studies is a tendency to find false positives. Populations showing a desired trait may also carry a specific gene variant not because the variant actually controls the trait, but because of genetic relatedness. In particular, indirect associations that are not causal will not be eliminated by increasing the sample size or the number of markers. The main sources of such false positives are linkage between causal and noncausal sites, the presence of more than one causal site, and epistasis. These indirect associations are not randomly distributed throughout the genome and are less common than false positives arising from population structure. [10]

Population structure is likewise a persistent issue, because it leads to spurious associations between markers and the trait. This is generally not a problem in linkage analysis, because researchers know the genetic structure of the family they created. In association mapping, however, where relationships between diverse populations are not necessarily well understood, marker-trait associations arising from kinship and evolutionary history can easily be mistaken for causal ones. This can be accounted for with a mixed linear model (MLM). Also called the Q+K model, it was developed to further reduce the false positive rate by controlling for both population structure and cryptic familial relatedness. [11]
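The effect of population structure can be seen in a small simulation. The sketch below is not the Q+K mixed model itself: it illustrates only the structure (Q) component, assuming subpopulation membership is known and adjusting for it by centering within each subpopulation, and it omits the kinship (K) random effect entirely. All parameter values are invented, and the marker is simulated with no effect on the trait.

```python
import random


def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def center_within(values, groups):
    """Subtract each group's mean, equivalent to fitting group membership as a covariate."""
    totals, counts = {}, {}
    for value, group in zip(values, groups):
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return [value - totals[group] / counts[group] for value, group in zip(values, groups)]


rng = random.Random(0)
groups, dosage, trait = [], [], []
# Two subpopulations that differ in allele frequency AND in trait mean;
# the marker itself has no effect on the trait.
for population, allele_freq, trait_mean in (("A", 0.1, 0.0), ("B", 0.9, 2.0)):
    for _ in range(2000):
        groups.append(population)
        dosage.append(sum(rng.random() < allele_freq for _ in range(2)))
        trait.append(rng.gauss(trait_mean, 1.0))

naive_effect = slope(dosage, trait)
adjusted_effect = slope(center_within(dosage, groups), center_within(trait, groups))
```

The naive estimate shows a large apparent marker effect driven purely by the difference between subpopulations, while the adjusted estimate is close to zero. A full mixed model would additionally account for cryptic relatedness among individuals within each subpopulation.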

References

  1. Breseghello, Flavio; Sorrells, Mark E (2006-02-01). "Association Mapping of Kernel Size and Milling Quality in Wheat (Triticum aestivum L.) Cultivars". Genetics. 172 (2): 1165–1177. doi:10.1534/genetics.105.044586. ISSN 1943-2631. PMC 1456215. PMID 16079235.
  2. Zondervan, Krina T.; Cardon, Lon R. (2007-02-01). "Designing candidate gene and genome-wide case–control association studies". Nature Protocols. 2 (10): 2492–2501. doi:10.1038/nprot.2007.366. ISSN 1750-2799. PMC 4180089. PMID 17947991.
  3. Gibson, G.; Muse, S.V. (2009). A Primer of Genome Science. MA: Sinauer Associates.
  4. Hoeschele, I. (2004-07-15). "Mapping Quantitative Trait Loci in Outbred Pedigrees". Handbook of Statistical Genetics. Chichester: John Wiley & Sons, Ltd. doi:10.1002/0470022620.bbc17. ISBN 978-0470022627.
  5. Yu, J.; Holland, J.B.; McMullen, M.D.; Buckler, E.S. (2008). "Genetic design and statistical power of nested association mapping in maize". Genetics. 178 (1): 539–551. doi:10.1534/genetics.107.074245. PMC 2206100. PMID 18202393.
  6. Nussbaum, R.L.; McInnes, R.R.; Willard, H.F. (2007). Genetics in Medicine. Philadelphia, PA: Saunders Elsevier.
  7. Rosyara, U.R.; Gonzalez-Hernandez, J.L.; Glover, K.D.; Gedye, K.R.; Stein, J.M. (2009). "Family-based mapping of quantitative trait loci in plant breeding populations with resistance to Fusarium head blight in wheat as an illustration". Theoretical and Applied Genetics. 118: 1617–1631.
  8. Abdurakhmonov, Ibrokhim Y.; Abdukarimov, Abdusattor (2008-06-08). "Application of Association Mapping to Understanding the Genetic Diversity of Plant Germplasm Resources". International Journal of Plant Genomics. Hindawi. 2008: 574927. doi:10.1155/2008/574927. ISSN 1687-5370. PMC 2423417. PMID 18551188. S2CID 7629296.
  9. Kraakman, A. T. W. (2004-09-01). "Linkage Disequilibrium Mapping of Yield and Yield Stability in Modern Spring Barley Cultivars". Genetics. 168 (1): 435–446. doi:10.1534/genetics.104.026831. ISSN 0016-6731. PMC 1448125. PMID 15454555.
  10. Platt, A.; Vilhjalmsson, B. J.; Nordborg, M. (2010-09-02). "Conditions Under Which Genome-Wide Association Studies Will be Positively Misleading". Genetics. 186 (3): 1045–1052. doi:10.1534/genetics.110.121665. ISSN 0016-6731. PMC 2975277. PMID 20813880.
  11. Yu, Jianming; Pressoir, Gael; Briggs, William H; Vroh Bi, Irie; Yamasaki, Masanori; Doebley, John F; McMullen, Michael D; Gaut, Brandon S; Nielsen, Dahlia M (2005-12-25). "A unified mixed-model method for association mapping that accounts for multiple levels of relatedness". Nature Genetics. 38 (2): 203–208. doi:10.1038/ng1702. ISSN 1061-4036. PMID 16380716. S2CID 8507433.