Fixation index

Fst values between European populations

The fixation index (FST) is a measure of population differentiation due to genetic structure. It is frequently estimated from genetic polymorphism data, such as single-nucleotide polymorphisms (SNPs) or microsatellites. Developed as a special case of Wright's F-statistics, it is one of the most commonly used statistics in population genetics. Its values range from 0 to 1; values above about 0.15 are conventionally regarded as substantial differentiation, and a value of 1 indicates complete differentiation.

Interpretation

This comparison of genetic variability within and between populations is frequently used in applied population genetics. The values range from 0 to 1. A zero value implies complete panmixia; that is, that the two populations are interbreeding freely. A value of one implies that all genetic variation is explained by the population structure, and that the two populations do not share any genetic diversity.

For idealized models such as Wright's finite island model, FST can be used to estimate migration rates. Under that model, the migration rate is

$$\hat{M} \approx \frac{1 - F_{ST}}{4 F_{ST}},$$

where $m$ is the migration rate per generation, and $\mu$ is the mutation rate per generation. [1]
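As a rough illustration, the quantity (1 - FST) / (4 FST) can be evaluated directly. The sketch below assumes the standard island-model approximation FST ≈ 1 / (4 N_e m + 1) with negligible mutation, under which this quantity equals the effective number of migrants per generation, N_e m. The function name `migrants_per_generation` is hypothetical and chosen only for this example.

```python
def migrants_per_generation(fst: float) -> float:
    """Invert F_ST ~ 1 / (4 * Ne * m + 1) to obtain Ne * m.

    Assumes the island-model approximation with negligible mutation;
    real populations rarely satisfy these assumptions exactly.
    """
    if not 0.0 < fst <= 1.0:
        raise ValueError("fst must lie in the interval (0, 1]")
    return (1.0 - fst) / (4.0 * fst)


if __name__ == "__main__":
    for fst in (0.05, 0.15, 0.25):
        print(f"F_ST = {fst:.2f} -> Ne*m ~ {migrants_per_generation(fst):.2f}")
```

Under these assumptions, roughly one migrant per generation already keeps FST near 0.2, which is why such estimates are best read as order-of-magnitude indications.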

The interpretation of FST can be difficult when the data analyzed are highly polymorphic. In this case, the probability of identity by descent is very low and FST can have an arbitrarily low upper bound, which might lead to misinterpretation of the data. Also, strictly speaking FST is not a distance in the mathematical sense, as it does not satisfy the triangle inequality.
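The low upper bound can be seen in a small numerical sketch. It assumes a heterozygosity-based form of the statistic, FST = (H_T - H_S) / H_T, with equally weighted subpopulations; the helper names and the allele frequencies are invented for illustration.

```python
def heterozygosity(freqs):
    """Probability that two randomly drawn alleles differ: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in freqs)


def fst_multiallelic(pop_freqs):
    """(H_T - H_S) / H_T for equally weighted subpopulations (an assumption)."""
    n_pops = len(pop_freqs)
    n_alleles = len(pop_freqs[0])
    # Average within-subpopulation diversity.
    h_s = sum(heterozygosity(f) for f in pop_freqs) / n_pops
    # Diversity of the pooled (total) population.
    pooled = [sum(f[k] for f in pop_freqs) / n_pops for k in range(n_alleles)]
    h_t = heterozygosity(pooled)
    return (h_t - h_s) / h_t


# Two populations that share no alleles at all:
# each carries 10 private alleles, every one at frequency 0.1.
pop_a = [0.1] * 10 + [0.0] * 10
pop_b = [0.0] * 10 + [0.1] * 10
print(round(fst_multiallelic([pop_a, pop_b]), 4))  # 0.0526

# Contrast: two populations fixed for different alleles at a biallelic locus.
print(fst_multiallelic([[1.0, 0.0], [0.0, 1.0]]))  # 1.0
```

Although the two highly polymorphic populations have no allele in common, the statistic stays near 0.05 because within-population diversity is already close to its maximum.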

For populations of plants which clearly belong to the same species, values of FST greater than 15% are considered "great" or "significant" differentiation, while values below 5% are considered "small" or "insignificant" differentiation. [2] For mammal populations, values between subspecies or closely related species are typically of the order of 5% to 20%. FST between the Eurasian and North American populations of the gray wolf was reported at 9.9%, and that between the red wolf and gray wolf populations at between 17% and 18%. The eastern wolf, a recently recognized, highly admixed "wolf-like species", has values of FST below 10% in comparison with both Eurasian (7.6%) and North American gray wolves (5.7%) and with the red wolf (8.5%), and an even lower value when paired with the coyote (4.5%). [3]

Definition

Two of the most commonly used definitions for FST at a given locus are based on 1) the variance of allele frequencies among populations, and on 2) the probability of identity by descent.

If $\bar{p}$ is the average frequency of an allele in the total population, $\sigma_S^2$ is the variance in the frequency of the allele among different subpopulations, weighted by the sizes of the subpopulations, and $\sigma_T^2$ is the variance of the allelic state in the total population, FST is defined as [4]

$$F_{ST} = \frac{\sigma_S^2}{\sigma_T^2} = \frac{\sigma_S^2}{\bar{p}(1 - \bar{p})}.$$

Wright's definition illustrates that FST measures the amount of genetic variance that can be explained by population structure. This can also be thought of as the fraction of total diversity that is not a consequence of the average diversity within subpopulations, where diversity is measured by the probability that two randomly selected alleles are different, namely $2p(1-p)$. If the allele frequency in the $i$th population is $p_i$ and the relative size of the $i$th population is $c_i$, then

$$F_{ST} = \frac{\bar{p}(1 - \bar{p}) - \sum_i c_i\, p_i (1 - p_i)}{\bar{p}(1 - \bar{p})}.$$
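The two ways of writing the definition, as a ratio of variances and as a fraction of total diversity, can be checked against each other numerically. The following is a minimal sketch for a single biallelic locus; the function names are hypothetical, and weights default to equal subpopulation sizes when none are given.

```python
def _normalised(weights, n):
    """Return relative subpopulation sizes that sum to 1."""
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    return [w / total for w in weights]


def fst_from_variance(freqs, weights=None):
    """Among-subpopulation variance of allele frequency over p_bar * (1 - p_bar)."""
    c = _normalised(weights, len(freqs))
    p_bar = sum(ci * pi for ci, pi in zip(c, freqs))
    var_s = sum(ci * (pi - p_bar) ** 2 for ci, pi in zip(c, freqs))
    var_t = p_bar * (1.0 - p_bar)
    if var_t == 0.0:
        raise ValueError("locus is monomorphic in the total population")
    return var_s / var_t


def fst_from_diversity(freqs, weights=None):
    """Fraction of total diversity not explained by mean within-subpopulation diversity."""
    c = _normalised(weights, len(freqs))
    p_bar = sum(ci * pi for ci, pi in zip(c, freqs))
    total = p_bar * (1.0 - p_bar)  # the common factor 2 cancels in the ratio
    if total == 0.0:
        raise ValueError("locus is monomorphic in the total population")
    within = sum(ci * pi * (1.0 - pi) for ci, pi in zip(c, freqs))
    return (total - within) / total


freqs = [0.2, 0.8]  # invented allele frequencies in two subpopulations
print(round(fst_from_variance(freqs), 4))   # 0.36
print(round(fst_from_diversity(freqs), 4))  # 0.36
```

Both forms agree because the weighted variance among subpopulations equals the total diversity minus the weighted mean of the within-subpopulation diversities.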

Alternatively, [5]

$$F_{ST} = \frac{f_0 - \bar{f}}{1 - \bar{f}},$$

where $f_0$ is the probability of identity by descent of two individuals given that the two individuals are in the same subpopulation, and $\bar{f}$ is the probability that two individuals from the total population are identical by descent. Using this definition, FST can be interpreted as measuring how much closer two individuals from the same subpopulation are, compared to the total population. If the mutation rate is small, this interpretation can be made more explicit by linking the probability of identity by descent to coalescent times: let $T_0$ and $T$ denote the average time to coalescence for individuals from the same subpopulation and the total population, respectively. Then,

$$F_{ST} \approx 1 - \frac{T_0}{T}.$$

This formulation has the advantage that the expected time to coalescence can easily be estimated from genetic data, which led to the development of various estimators for FST.
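The step from identity by descent to coalescence times can be sketched as follows. Write $f_0$ and $\bar{f}$ for the identity-by-descent probabilities within a subpopulation and in the total population. The sketch assumes that two lineages coalescing $t$ generations ago are identical by descent only if neither has mutated, which happens with probability about $e^{-2\mu t}$, and that the mutation rate $\mu$ is small enough for a first-order expansion; the symbol $\mu$ and this mutation model are assumptions of the sketch.

```latex
\begin{align*}
f_0     &= \mathbb{E}\!\left[e^{-2\mu t_0}\right] \approx 1 - 2\mu T_0, \\
\bar{f} &= \mathbb{E}\!\left[e^{-2\mu t}\right]   \approx 1 - 2\mu T, \\
F_{ST}  &= \frac{f_0 - \bar{f}}{1 - \bar{f}}
         \approx \frac{2\mu T - 2\mu T_0}{2\mu T}
         = 1 - \frac{T_0}{T}.
\end{align*}
```

The mutation rate cancels in the last step, which is why the approximation depends only on the ratio of the two mean coalescence times.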

Estimation

In practice, none of the quantities used for the definitions can be easily measured. As a consequence, various estimators have been proposed. A particularly simple estimator applicable to DNA sequence data is: [6]

$$F_{ST} = \frac{\pi_{\text{Between}} - \pi_{\text{Within}}}{\pi_{\text{Between}}},$$

where $\pi_{\text{Between}}$ and $\pi_{\text{Within}}$ represent the average number of pairwise differences between two individuals sampled from different sub-populations ($\pi_{\text{Between}}$) or from the same sub-population ($\pi_{\text{Within}}$). The average pairwise difference within a population can be calculated as the sum of the pairwise differences divided by the number of pairs. However, this estimator is biased when sample sizes are small or if they vary between populations. Therefore, more elaborate methods are used to compute FST in practice. Two of the most widely used procedures are the estimator by Weir & Cockerham (1984) [7] and analysis of molecular variance (AMOVA). A list of implementations is available at the end of this article.
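The simple pairwise-difference estimator can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code of any published package: sequences are aligned strings of equal length, every population contains at least two sequences, the within-population term is the unweighted mean of the per-population averages, and all function names and toy sequences are invented for the example.

```python
from itertools import combinations, product


def pairwise_differences(seq_a, seq_b):
    """Number of sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def pi_within(populations):
    """Mean over populations of the average pairwise difference inside each one."""
    per_pop = []
    for pop in populations:
        pairs = list(combinations(pop, 2))  # needs at least two sequences
        per_pop.append(sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs))
    return sum(per_pop) / len(per_pop)


def pi_between(populations):
    """Average pairwise difference over all pairs drawn from different populations."""
    diffs = [pairwise_differences(a, b)
             for pop_x, pop_y in combinations(populations, 2)
             for a, b in product(pop_x, pop_y)]
    return sum(diffs) / len(diffs)


def fst_pairwise(populations):
    """(pi_between - pi_within) / pi_between."""
    between = pi_between(populations)
    return (between - pi_within(populations)) / between


# Toy data: two populations of two sequences each.
pops = [["AAAA", "AAAT"], ["TTTT", "TTTA"]]
print(pi_within(pops), pi_between(pops))  # 1.0 3.5
print(round(fst_pairwise(pops), 4))       # 0.7143
```

With samples this small the estimate is strongly affected by the bias mentioned above, so the example only shows the arithmetic, not a realistic analysis.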

FST in humans

FST values in selected populations

FST values depend strongly on the choice of populations. Closely related ethnic groups, such as the Danes vs. the Dutch, or the Portuguese vs. the Spaniards, show values significantly below 1%, indistinguishable from panmixia. Within Europe, the most divergent ethnic groups have been found to have values of the order of 7% (Sámi vs. Sardinians).

Larger values are found if highly divergent homogeneous groups are compared: the highest such value found was close to 46%, between Mbuti and Papuans. [8]

A genetic distance of 0.125 implies that kinship between unrelated individuals of the same ancestry relative to the world population is equivalent to kinship between half siblings in a randomly mating population. This also implies that if a human from a given ancestral population has a mixed half-sibling, that human is closer genetically to an unrelated individual of their ancestral population than to their mixed half-sibling. [9]

Autosomal genetic distances based on classical markers

In their study The History and Geography of Human Genes (1994), Cavalli-Sforza, Menozzi and Piazza provide some of the most detailed and comprehensive estimates of genetic distances between human populations, within and across continents. Their initial database contains 76,676 gene frequencies (using 120 blood polymorphisms), corresponding to 6,633 samples in different locations. By culling and pooling such samples, they restrict their analysis to 491 populations.

The 42 world populations used by the authors and their reported FSTs.

They focus on aboriginal populations that were at their present location at the end of the 15th century when the great European migrations began. [10] When studying genetic difference at the world level, the number is reduced to 42 representative populations, aggregating subpopulations characterized by a high level of genetic similarity. For these 42 populations, Cavalli-Sforza and coauthors report bilateral distances computed from 120 alleles. Among this set of 42 world populations, the greatest genetic distance observed is between Mbuti Pygmies and Papua New Guineans, where the Fst distance is 0.4573, while the smallest genetic distance (0.0021) is between the Danish and the English.

When considering more disaggregated data for 26 European populations, the smallest genetic distance (0.0009) is between the Dutch and the Danes, and the largest (0.0667) is between the Lapps and the Sardinians. The mean genetic distance among the 861 available pairings of the 42 selected populations was found to be 0.1338.

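The following table shows Fst calculated by Cavalli-Sforza (1994) for some populations:

| | Bantu | Nilo-Saharan | W. African | Mbuti | Japanese | Korean | Thai | Filipino | S. Chinese | Danish | English | Melanesian | New Guinean | Australian |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bantu | 0.00 | | | | | | | | | | | | | |
| Nilo-Saharan | 0.01 | 0.00 | | | | | | | | | | | | |
| W. African | 0.02 | 0.02 | 0.00 | | | | | | | | | | | |
| Mbuti | 0.07 | 0.08 | 0.08 | 0.00 | | | | | | | | | | |
| Japanese | 0.24 | 0.25 | 0.23 | 0.31 | 0.00 | | | | | | | | | |
| Korean | 0.27 | 0.24 | 0.18 | 0.30 | 0.01 | 0.00 | | | | | | | | |
| Thai | 0.34 | 0.30 | 0.25 | 0.39 | 0.07 | 0.06 | 0.00 | | | | | | | |
| Filipino | 0.29 | 0.30 | 0.23 | 0.38 | 0.10 | 0.12 | 0.06 | 0.00 | | | | | | |
| S. Chinese | 0.30 | 0.29 | 0.20 | 0.34 | 0.05 | 0.05 | 0.01 | 0.03 | 0.00 | | | | | |
| Danish | 0.17 | 0.17 | 0.15 | 0.15 | 0.12 | 0.09 | 0.13 | 0.13 | 0.13 | 0.00 | | | | |
| English | 0.23 | 0.18 | 0.15 | 0.24 | 0.12 | 0.10 | 0.11 | 0.11 | 0.12 | 0.00 | 0.00 | | | |
| Melanesian | 0.34 | 0.31 | 0.27 | 0.40 | 0.11 | 0.11 | 0.08 | 0.05 | 0.06 | 0.14 | 0.16 | 0.00 | | |
| New Guinean | 0.34 | 0.33 | 0.28 | 0.46 | 0.12 | 0.14 | 0.18 | 0.13 | 0.15 | 0.16 | 0.16 | 0.07 | 0.00 | |
| Australian | 0.33 | 0.36 | 0.27 | 0.43 | 0.06 | 0.07 | 0.13 | 0.13 | 0.11 | 0.14 | 0.15 | 0.11 | 0.10 | 0.00 |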
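A lower-triangular table like this one is convenient to query once it is expanded into a symmetric lookup. The sketch below transcribes the values from the table above (with the first off-diagonal label read as Nilo-Saharan) and finds the most and least differentiated pairs; the variable and function names are hypothetical.

```python
POPULATIONS = [
    "Bantu", "Nilo-Saharan", "W. African", "Mbuti", "Japanese", "Korean", "Thai",
    "Filipino", "S. Chinese", "Danish", "English", "Melanesian", "New Guinean",
    "Australian",
]

# Row i holds the distances from population i to populations 0..i (diagonal last).
LOWER_TRIANGLE = [
    [0.00],
    [0.01, 0.00],
    [0.02, 0.02, 0.00],
    [0.07, 0.08, 0.08, 0.00],
    [0.24, 0.25, 0.23, 0.31, 0.00],
    [0.27, 0.24, 0.18, 0.30, 0.01, 0.00],
    [0.34, 0.30, 0.25, 0.39, 0.07, 0.06, 0.00],
    [0.29, 0.30, 0.23, 0.38, 0.10, 0.12, 0.06, 0.00],
    [0.30, 0.29, 0.20, 0.34, 0.05, 0.05, 0.01, 0.03, 0.00],
    [0.17, 0.17, 0.15, 0.15, 0.12, 0.09, 0.13, 0.13, 0.13, 0.00],
    [0.23, 0.18, 0.15, 0.24, 0.12, 0.10, 0.11, 0.11, 0.12, 0.00, 0.00],
    [0.34, 0.31, 0.27, 0.40, 0.11, 0.11, 0.08, 0.05, 0.06, 0.14, 0.16, 0.00],
    [0.34, 0.33, 0.28, 0.46, 0.12, 0.14, 0.18, 0.13, 0.15, 0.16, 0.16, 0.07, 0.00],
    [0.33, 0.36, 0.27, 0.43, 0.06, 0.07, 0.13, 0.13, 0.11, 0.14, 0.15, 0.11, 0.10, 0.00],
]


def pairwise_table(names, rows):
    """Map each unordered pair of populations to its tabulated distance."""
    if any(len(row) != i + 1 for i, row in enumerate(rows)):
        raise ValueError("rows must form a lower triangle including the diagonal")
    return {frozenset((names[i], names[j])): rows[i][j]
            for i in range(len(names)) for j in range(i)}


distances = pairwise_table(POPULATIONS, LOWER_TRIANGLE)
largest_pair, largest = max(distances.items(), key=lambda item: item[1])
print(len(distances))                 # 91
print(sorted(largest_pair), largest)  # ['Mbuti', 'New Guinean'] 0.46
print(distances[frozenset(("Danish", "English"))])  # 0.0
```

Keying the dictionary by `frozenset` makes the lookup independent of the order in which the two populations are named, which mirrors the symmetry of the statistic.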

Autosomal genetic distances based on SNPs

A 2012 study based on International HapMap Project data estimated FST between the three major "continental" populations of Europeans (combined from Utah residents of Northern and Western European ancestry from the CEPH collection and Italians from Tuscany), East Asians (combining Han Chinese from Beijing, Chinese from metropolitan Denver and Japanese from Tokyo, Japan) and Sub-Saharan Africans (combining Luhya of Webuye, Kenya, Maasai of Kinyawa, Kenya and Yoruba of Ibadan, Nigeria). It reported a value close to 12% between continental populations, and values close to panmixia (smaller than 1%) within continental populations. [11]

Intercontinental autosomal genetic distances based on SNPs [12]

| | Europe (CEU) | Sub-Saharan Africa (Yoruba) | East-Asia (Japanese) |
|---|---|---|---|
| Sub-Saharan Africa (Yoruba) | 0.153 | | |
| East-Asia (Japanese) | 0.111 | 0.190 | |
| East-Asia (Chinese) | 0.110 | 0.192 | 0.007 |

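Intra-European/Mediterranean autosomal genetic distances based on SNPs [12] [13]

| | Italians | Palestinians | Swedish | Finns | Spanish | Germans | Russians | French | Greeks |
|---|---|---|---|---|---|---|---|---|---|
| Palestinians | 0.0064 | | | | | | | | |
| Swedish | 0.0064-0.0090 | 0.0191 | | | | | | | |
| Finns | 0.0130-0.0230 | | 0.0050-0.0110 | | | | | | |
| Spanish | 0.0010-0.0050 | 0.0101 | 0.0040-0.0055 | 0.0110-0.0170 | | | | | |
| Germans | 0.0029-0.0080 | 0.0136 | 0.0007-0.0010 | 0.0060-0.0130 | 0.0015-0.0030 | | | | |
| Russians | 0.0088-0.0120 | 0.0202 | 0.0030-0.0036 | 0.0060-0.0120 | 0.0070-0.0079 | 0.0030-0.0037 | | | |
| French | 0.0030-0.0050 | | 0.0020 | 0.0080-0.0150 | 0.0010 | 0.0010 | 0.0050 | | |
| Greeks | 0.0000 | 0.0057 | 0.0084 | | 0.0035 | 0.0039 | 0.0108 | | |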

Programs for calculating FST

Modules for calculating FST

References

  1. Peter Beerli, Estimation of migration rates and population sizes in geographically structured populations (1998), Advances in molecular ecology (ed. G. Carvalho). NATO Science Series A: Life Sciences, IOS Press, Amsterdam, 39-53.
  2. Frankham, R., Ballou, J.D., Briscoe, D.A., 2002. Introduction to Conservation Genetics. Cambridge University Press, Cambridge. Hartl DL, Clark AG (1997) Principles of Population Genetics, 3rd edn. Sinauer Associates, Inc, Sunderland, MA.
  3. B. M. von Holdt et al., "Whole-genome sequence analysis shows that two endemic species of North American wolf are admixtures of the coyote and gray wolf", Science Advances 27 Jul 2016: Vol. 2, no. 7, e1501714, doi:10.1126/sciadv.1501714.
  4. Holsinger, Kent E.; Bruce S. Weir (2009). "Genetics in geographically structured populations: defining, estimating and interpreting FST". Nat Rev Genet. 10 (9): 639–650. doi:10.1038/nrg2611. ISSN 1471-0056. PMC 4687486. PMID 19687804.
  5. Richard Durrett (12 August 2008). Probability Models for DNA Sequence Evolution. Springer. ISBN 978-0-387-78168-6. Retrieved 25 October 2012.
  6. Hudson, RR.; Slatkin, M.; Maddison, WP. (Oct 1992). "Estimation of Levels of Gene Flow from DNA Sequence Data". Genetics. 132 (2): 583–9. doi:10.1093/genetics/132.2.583. PMC 1205159. PMID 1427045.
  7. Weir, B. S.; Cockerham, C. Clark (1984). "Estimating F-Statistics for the Analysis of Population Structure". Evolution. 38 (6): 1358–1370. doi:10.2307/2408641. ISSN 0014-3820. JSTOR 2408641. PMID 28563791.
  8. Cavalli-Sforza et al. (1994), cited after V. Ginsburgh, S. Weber, The Palgrave Handbook of Economics and Language, Springer (2016), p. 182.
  9. Harpending, Henry (2002). "Kinship and Population Subdivision". Population and Environment. 24 (2): 141–147. doi:10.1023/A:1020815420693. S2CID 15208802.
  10. Cavalli-Sforza et al., 1994, p. 24.
  11. Elhaik, Eran (2012). "Empirical Distributions of FST from Large-Scale Human Polymorphism Data". PLOS ONE. 7 (11): e49837. Bibcode:2012PLoSO...749837E. doi:10.1371/journal.pone.0049837. PMC 3504095. PMID 23185452.
  12. Nelis, Mari; et al. (2009-05-08). Fleischer, Robert C. (ed.). "Genetic Structure of Europeans: A View from the North–East". PLOS ONE. 4 (5): e5472. Bibcode:2009PLoSO...4.5472N. doi:10.1371/journal.pone.0005472. PMC 2675054. PMID 19424496. See table.
  13. Tian, Chao; et al. (November 2009). "European Population Genetic Substructure: Further Definition of Ancestry Informative Markers for Distinguishing among Diverse European Ethnic Groups". Molecular Medicine. 15 (11–12): 371–383. doi:10.2119/molmed.2009.00094. ISSN 1076-1551. PMC 2730349. PMID 19707526. See table.
  14. Crawford, Nicholas G. (2010). "smogd: software for the measurement of genetic diversity". Molecular Ecology Resources. 10 (3): 556–557. doi:10.1111/j.1755-0998.2009.02801.x. PMID 21565057. S2CID 205970662.
  15. Kitada S, Kitakado T, Kishino H (2007). "Empirical Bayes inference of pairwise F(ST) and its distribution in the genome". Genetics. 177 (2): 861–73. doi:10.1534/genetics.107.077263. PMC 2034649. PMID 17660541.

Further reading

  • Evolution and the Genetics of Populations Volume 2: the Theory of Gene Frequencies, pg 294–295, S. Wright, Univ. of Chicago Press, Chicago, 1969
  • A haplotype map of the human genome, The International HapMap Consortium, Nature 2005
