Haplotype estimation

In genetics, haplotype estimation (also known as "phasing") is the statistical estimation of haplotypes from genotype data. The most common situation arises when genotypes are collected at a set of polymorphic sites from a group of individuals. For example, in human genetics, genome-wide association studies collect genotypes in thousands of individuals at between 200,000 and 5,000,000 SNPs using microarrays. Haplotype estimation methods are used in the analysis of these datasets and allow genotype imputation [1] [2] of alleles from reference databases such as the HapMap Project and the 1000 Genomes Project.

Genotypes and haplotypes

Genotypes measure the unordered combination of alleles at each site, whereas haplotypes are the two sequences of alleles that have been inherited together from the individual's parents. When there are N heterozygous genotypes present in an individual's set of genotypes, there will be 2^(N-1) possible pairs of haplotypes that could underlie the genotypes. For example, when N = 2 and both sites have genotype A/T, the phased configurations are AA/TT, AT/TA, TA/AT, and TT/AA; because the order of the two haplotypes within a pair does not matter, these correspond to two distinct pairs, AA/TT and AT/TA. If there are missing genotypes then the number of possible haplotype pairs increases.
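The counting argument can be checked by brute force. The following is a minimal Python sketch, not taken from any phasing program: the function name `compatible_haplotype_pairs` and the representation of a genotype as a 2-tuple of allele characters are assumptions made for illustration. It tries both orientations at every site and keeps each unordered pair once, so N heterozygous sites (N at least 1) yield 2^(N-1) pairs.

```python
from itertools import product


def compatible_haplotype_pairs(genotypes):
    """Return all unordered haplotype pairs consistent with the genotypes.

    genotypes: one 2-tuple of alleles per site, e.g. [("A", "T"), ("A", "A")].
    (Hypothetical representation chosen for this sketch.)
    """
    pairs = set()
    # At each site the two alleles can be assigned to the two haplotypes
    # in either orientation; enumerate every combination of orientations.
    for orientation in product((0, 1), repeat=len(genotypes)):
        h1 = "".join(site[o] for site, o in zip(genotypes, orientation))
        h2 = "".join(site[1 - o] for site, o in zip(genotypes, orientation))
        # Sorting the pair removes the arbitrary ordering of the two haplotypes.
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


# Two heterozygous A/T sites: two distinct unordered pairs.
print(compatible_haplotype_pairs([("A", "T"), ("A", "T")]))
```

Homozygous sites do not increase the count, since both orientations give the same pair. The enumeration is exponential in the number of sites, so it is only useful for very short stretches.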

Haplotype estimation methods

Many statistical methods have been proposed for estimation of haplotypes. Some of the earliest approaches used a simple multinomial model in which each possible haplotype consistent with the sample was given an unknown frequency parameter, and these parameters were estimated with an expectation–maximization (EM) algorithm. These approaches were only able to handle small numbers of sites at once, although sequential versions were later developed, specifically the SNPHAP method.
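The multinomial approach can be illustrated with a small EM loop. This is a simplified Python sketch under stated assumptions rather than the implementation of any published program: it assumes random pairing of haplotypes (Hardy–Weinberg proportions), no missing genotypes, a uniform starting point, and a fixed number of iterations, and the names `candidate_pairs` and `em_haplotype_frequencies` are hypothetical.

```python
from itertools import product


def candidate_pairs(genotype):
    """Unordered haplotype pairs consistent with one individual's genotypes."""
    pairs = set()
    for orientation in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[o] for site, o in zip(genotype, orientation))
        h2 = "".join(site[1 - o] for site, o in zip(genotype, orientation))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM (assumes random pairing)."""
    candidates = [candidate_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in candidates for pair in pairs for h in pair})
    # Start from uniform frequencies over every haplotype seen in any candidate pair.
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        expected = dict.fromkeys(haplotypes, 0.0)
        for pairs in candidates:
            # E-step: posterior weight of each candidate pair for this individual.
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                expected[a] += w / total
                expected[b] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freq = {h: c / (2 * len(genotypes)) for h, c in expected.items()}
    return freq


# Toy sample: three AA/AA homozygotes, three TT/TT homozygotes, one double heterozygote.
sample = [[("A", "A"), ("A", "A")]] * 3 + [[("T", "T"), ("T", "T")]] * 3 + [[("A", "T"), ("A", "T")]]
print(em_haplotype_frequencies(sample))
```

In this toy sample the homozygous individuals make AA and TT common, so the estimate resolves the double heterozygote as AA/TT and the frequencies of AT and TA shrink toward zero. Because every candidate haplotype gets its own parameter, the number of parameters grows exponentially with the number of sites, which is why such methods handle only a few sites at once.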

The most accurate and widely used methods for haplotype estimation utilize some form of hidden Markov model (HMM) to carry out inference. For a long time PHASE [3] was the most accurate method. PHASE was the first method to utilize ideas from coalescent theory concerning the joint distribution of haplotypes. This method used a Gibbs sampling approach in which each individual's haplotypes were updated conditional upon the current estimates of haplotypes from all other samples. Approximations to the distribution of a haplotype conditional upon a set of other haplotypes were used for the conditional distributions of the Gibbs sampler. PHASE was used to estimate the haplotypes from the HapMap Project. However, PHASE was limited by its speed and was not applicable to datasets from genome-wide association studies.

The fastPHASE [4] and BEAGLE methods [5] introduced haplotype cluster models applicable to GWAS-sized datasets. Subsequently, the IMPUTE2 [6] and MaCH [7] methods were introduced; these are similar to the PHASE approach but much faster. These methods iteratively update the haplotype estimates of each sample conditional upon a subset of K haplotype estimates of other samples. IMPUTE2 introduced the idea of carefully choosing which subset of haplotypes to condition on to improve accuracy. Accuracy increases with K, but the computational cost grows quadratically with K.
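To make "conditioning on K haplotypes" concrete, the following Python sketch computes the probability of one haplotype under a generic copying HMM, in which the haplotype is modelled as an imperfect mosaic of the K conditioning haplotypes. It is an illustration under assumptions, not the model of any program named above: the function name `copying_likelihood`, the constant `switch` and `mismatch` probabilities, and the biallelic "0"/"1" encoding are all hypothetical simplifications.

```python
def copying_likelihood(target, panel, switch=0.05, mismatch=0.01):
    """Forward algorithm for a haploid copying HMM.

    target: string of "0"/"1" alleles for the haplotype being evaluated.
    panel:  list of K equal-length strings, the conditioning haplotypes.
    switch: assumed per-site probability of jumping to a uniformly chosen
            panel haplotype; mismatch: assumed per-site copying error.
    """
    K = len(panel)

    def emit(k, s):
        # Probability of the observed allele given that haplotype k is copied.
        return (1.0 - mismatch) if panel[k][s] == target[s] else mismatch

    # Initial step: each panel haplotype is equally likely to be copied.
    fwd = [emit(k, 0) / K for k in range(K)]
    for s in range(1, len(target)):
        total = sum(fwd)
        # Either stay on the same haplotype or switch to a uniformly chosen one.
        fwd = [((1.0 - switch) * fwd[k] + switch * total / K) * emit(k, s)
               for k in range(K)]
    return sum(fwd)


panel = ["000", "011", "110"]
print(copying_likelihood("011", panel), copying_likelihood("101", panel))
```

This haploid version costs on the order of K operations per site. If the hidden state is instead a pair of conditioning haplotypes, one for each of an individual's two chromosomes, the state space has K × K entries per site, which is one way a cost quadratic in K can arise.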

The SHAPEIT1 method made a major advance by introducing a linear complexity method that operates only on the space of haplotypes consistent with an individual’s genotypes. [8] The HAPI-UR method subsequently proposed a very similar method. [9] SHAPEIT2 [10] combines the best features of SHAPEIT1 and IMPUTE2 to improve efficiency and accuracy.

References

  1. Marchini, J.; Howie, B. (2010). "Genotype imputation for genome-wide association studies". Nature Reviews Genetics. 11 (7): 499–511. doi:10.1038/nrg2796. PMID 20517342. S2CID 1465707.
  2. Howie, B.; Fuchsberger, C.; Stephens, M.; Marchini, J.; Abecasis, G. A. R. (2012). "Fast and accurate genotype imputation in genome-wide association studies through pre-phasing". Nature Genetics. 44 (8): 955–959. doi:10.1038/ng.2354. PMC 3696580. PMID 22820512.
  3. Stephens, M.; Smith, N. J.; Donnelly, P. (2001). "A New Statistical Method for Haplotype Reconstruction from Population Data". The American Journal of Human Genetics. 68 (4): 978–989. doi:10.1086/319501. PMC 1275651. PMID 11254454.
  4. Scheet, P.; Stephens, M. (2006). "A Fast and Flexible Statistical Model for Large-Scale Population Genotype Data: Applications to Inferring Missing Genotypes and Haplotypic Phase". The American Journal of Human Genetics. 78 (4): 629–644. doi:10.1086/502802. PMC 1424677. PMID 16532393.
  5. Browning, S. R.; Browning, B. L. (2007). "Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies by Use of Localized Haplotype Clustering". The American Journal of Human Genetics. 81 (5): 1084–1097. doi:10.1086/521987. PMC 2265661. PMID 17924348.
  6. Howie, B. N.; Donnelly, P.; Marchini, J. (2009). Schork, Nicholas J (ed.). "A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies". PLOS Genetics. 5 (6): e1000529. doi:10.1371/journal.pgen.1000529. PMC 2689936. PMID 19543373.
  7. Li, Y.; Willer, C. J.; Ding, J.; Scheet, P.; Abecasis, G. A. R. (2010). "MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes". Genetic Epidemiology. 34 (8): 816–834. doi:10.1002/gepi.20533. PMC 3175618. PMID 21058334.
  8. Delaneau, O.; Marchini, J.; Zagury, J. F. O. (2011). "A linear complexity phasing method for thousands of genomes". Nature Methods. 9 (2): 179–181. doi:10.1038/nmeth.1785. PMID 22138821. S2CID 13765612.
  9. Williams, A. L.; Patterson, N.; Glessner, J.; Hakonarson, H.; Reich, D. (2012). "Phasing of Many Thousands of Genotyped Samples". The American Journal of Human Genetics. 91 (2): 238–251. doi:10.1016/j.ajhg.2012.06.013. PMC 3415548. PMID 22883141.
  10. Delaneau, O.; Zagury, J. F.; Marchini, J. (2012). "Improved whole-chromosome phasing for disease and population genetic studies". Nature Methods. 10 (1): 5–6. doi:10.1038/nmeth.2307. PMID 23269371. S2CID 205421216.