# Goodness of fit

The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question. Such measures can be used in statistical hypothesis testing, e.g. to test for normality of residuals, to test whether two samples are drawn from identical distributions (see Kolmogorov–Smirnov test), or whether outcome frequencies follow a specified distribution (see Pearson's chi-square test). In the analysis of variance, one of the components into which the variance is partitioned may be a lack-of-fit sum of squares.

## Fit of distributions

In assessing whether a given distribution is suited to a data set, a number of tests and their underlying measures of fit can be used; well-known examples include the Kolmogorov–Smirnov test, the Anderson–Darling test, and Pearson's chi-square test.

## Regression analysis

In regression analysis, goodness of fit is assessed with quantities such as the coefficient of determination (R²) and the lack-of-fit sum of squares.

## Categorical data

The following are examples that arise in the context of categorical data.

### Pearson's chi-square test

Pearson's chi-square test uses a measure of goodness of fit which is the sum of differences between observed and expected outcome frequencies (that is, counts of observations), each squared and divided by the expectation:

${\displaystyle \chi ^{2}=\sum _{i=1}^{n}{\frac {(O_{i}-E_{i})^{2}}{E_{i}}}}$

where:

• ${\textstyle O_{i}}$ = the observed count for bin i
• ${\textstyle E_{i}}$ = the expected count for bin i, asserted by the null hypothesis.
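As a concrete illustration, the statistic can be computed directly from the observed and expected counts. The following Python sketch mirrors the formula above; the four bin counts are made-up values used purely for illustration.

```python
import numpy as np

# Hypothetical observed and expected counts for 4 bins (illustrative values only).
observed = np.array([18, 55, 22, 5])
expected = np.array([20, 50, 25, 5])

# Pearson's chi-square statistic: sum of (O_i - E_i)^2 / E_i over all bins.
chi_square = np.sum((observed - expected) ** 2 / expected)
print(chi_square)  # 1.06 for these illustrative counts
```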

The expected frequency is calculated by:

${\displaystyle E_{i}\,=\,{\bigg (}F(Y_{u})\,-\,F(Y_{l}){\bigg )}\,N}$

where:

• ${\textstyle F}$ = the cumulative distribution function of the distribution being tested
• ${\textstyle Y_{u}}$ = the upper limit of bin i
• ${\textstyle Y_{l}}$ = the lower limit of bin i
• ${\textstyle N}$ = the sample size

The resulting value can be compared with a chi-square distribution to determine the goodness of fit. The chi-square distribution has (k − c) degrees of freedom, where k is the number of non-empty cells and c is the number of estimated parameters (including location, scale, and shape parameters) for the distribution plus one. For example, for a 3-parameter Weibull distribution, c = 4.
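A minimal sketch of the whole procedure in Python is given below, using SciPy, simulated data, and a normal distribution as an assumed example. The expected count for each bin is obtained from the fitted CDF as E_i = (F(Y_u) − F(Y_l)) N, and the statistic is referred to a chi-square distribution with k − c degrees of freedom, where c counts the two estimated parameters plus one. This is a simplified illustration, not a complete implementation (for instance, the outermost bins here do not extend to the tails of the fitted distribution).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=500)  # simulated sample
N = data.size

# Fit the candidate distribution (normal: 2 estimated parameters).
loc, scale = stats.norm.fit(data)

# Bin the data; Y_l and Y_u are the lower and upper limits of each bin.
observed, edges = np.histogram(data, bins=8)
cdf = stats.norm.cdf(edges, loc=loc, scale=scale)
expected = (cdf[1:] - cdf[:-1]) * N  # E_i = (F(Y_u) - F(Y_l)) * N

# Keep non-empty cells only, following the degrees-of-freedom rule above.
mask = observed > 0
chi_square = np.sum((observed[mask] - expected[mask]) ** 2 / expected[mask])

k = mask.sum()   # number of non-empty cells
c = 2 + 1        # estimated parameters (loc, scale) plus one
p_value = stats.chi2.sf(chi_square, df=k - c)
print(chi_square, p_value)
```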

#### Example: equal frequencies of men and women

For example, to test the hypothesis that a random sample of 100 people has been drawn from a population in which men and women are equal in frequency, the observed number of men and women would be compared to the theoretical frequencies of 50 men and 50 women. If there were 44 men in the sample and 56 women, then

${\displaystyle \chi ^{2}={(44-50)^{2} \over 50}+{(56-50)^{2} \over 50}=1.44}$

If the null hypothesis is true (i.e., men and women are chosen with equal probability in the sample), the test statistic will be drawn from a chi-square distribution with one degree of freedom. Though one might expect two degrees of freedom (one each for the men and women), we must take into account that the total number of men and women is constrained (100), and thus there is only one degree of freedom (2 − 1). In other words, if the male count is known the female count is determined, and vice versa.

Consultation of the chi-square distribution for 1 degree of freedom shows that the probability of observing a value of ${\displaystyle \chi ^{2}}$ at least as large as 1.44, if men and women are equally numerous in the population, is approximately 0.23. This probability is higher than the conventionally accepted criteria for statistical significance (a probability of 0.001–0.05), so normally we would not reject the null hypothesis that the number of men in the population is the same as the number of women (i.e., we would consider our sample within the range of what we would expect for a 50/50 male/female ratio).
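This worked example can be reproduced with SciPy's chisquare function, which returns both the statistic and the p-value; a brief sketch:

```python
from scipy.stats import chisquare

# Observed counts: 44 men and 56 women; expected 50/50 under the null hypothesis.
result = chisquare(f_obs=[44, 56], f_exp=[50, 50])
print(result.statistic)  # 1.44
print(result.pvalue)     # about 0.23 (chi-square with 1 degree of freedom)
```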

Note the assumption that the mechanism that has generated the sample is random, in the sense of independent random selection with the same probability, here 0.5 for both males and females. If, for example, each of the 44 males selected brought a male buddy, and each of the 56 females brought a female buddy, each ${\textstyle {(O_{i}-E_{i})}^{2}}$ will increase by a factor of 4, while each ${\textstyle E_{i}}$ will increase by a factor of 2. The value of the statistic will double to 2.88. Knowing this underlying mechanism, we should of course be counting pairs. In general, the mechanism, if not defensibly random, will not be known. The distribution to which the test statistic should be referred may, accordingly, be very different from chi-square. [5]
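The doubling described above can be checked numerically with the same SciPy call; the counts below are simply the earlier counts with every person counted twice.

```python
from scipy.stats import chisquare

# Buddy scenario: every observed and expected count doubles (88 men, 112 women,
# expected 100 each out of 200), so each (O - E)^2 grows by a factor of 4 while
# each E grows by a factor of 2, and the statistic doubles from 1.44 to 2.88.
print(chisquare(f_obs=[88, 112], f_exp=[100, 100]).statistic)  # 2.88
```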

#### Binomial case

A binomial experiment is a sequence of independent trials in which each trial can result in one of two outcomes, success or failure. There are n trials, each with probability of success denoted by p. Provided that ${\textstyle np_{i}\gg 1}$ for every i (where i = 1, 2, ..., k), then

${\displaystyle \chi ^{2}=\sum _{i=1}^{k}{\frac {(N_{i}-np_{i})^{2}}{np_{i}}}=\sum _{\mathrm {all\ cells} }^{}{\frac {(\mathrm {O} -\mathrm {E} )^{2}}{\mathrm {E} }}.}$

This has approximately a chi-square distribution with k − 1 degrees of freedom. The fact that there are k − 1 degrees of freedom is a consequence of the restriction ${\textstyle \sum N_{i}=n}$. There are k observed cell counts, but once any k − 1 of them are known, the remaining one is uniquely determined. In other words, only k − 1 cell counts are freely determined, hence k − 1 degrees of freedom.
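As a hedged sketch of this multi-cell form of the statistic, the following Python code uses SciPy and made-up cell counts (a hypothesized fair six-sided die) and refers the statistic to a chi-square distribution with k − 1 degrees of freedom, since no parameters are estimated from the data.

```python
import numpy as np
from scipy import stats

n = 120
p = np.array([1/6] * 6)                       # hypothesized cell probabilities (fair die, illustrative)
counts = np.array([25, 17, 15, 23, 24, 16])   # hypothetical observed counts, summing to n

expected = n * p                              # np_i for each cell
chi_square = np.sum((counts - expected) ** 2 / expected)
p_value = stats.chi2.sf(chi_square, df=len(counts) - 1)  # k - 1 degrees of freedom
print(chi_square, p_value)
```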

### G-test

G-tests are likelihood-ratio tests of statistical significance that are increasingly being used in situations where Pearson's chi-square tests were previously recommended. [6]

The general formula for G is

${\displaystyle G=2\sum _{i}{O_{i}\cdot \ln \left({\frac {O_{i}}{E_{i}}}\right)},}$

where ${\textstyle O_{i}}$ and ${\textstyle E_{i}}$ are the same as for the chi-square test, ${\textstyle \ln }$ denotes the natural logarithm, and the sum is taken over all non-empty cells. Furthermore, the total observed count should be equal to the total expected count:

${\displaystyle \sum _{i}O_{i}=\sum _{i}E_{i}=N}$

where ${\textstyle N}$ is the total number of observations.
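A minimal sketch of the G statistic in Python follows, reusing the men/women counts from the earlier example; SciPy's power_divergence function with the "log-likelihood" option computes the same quantity.

```python
import numpy as np
from scipy import stats

observed = np.array([44, 56])
expected = np.array([50, 50])   # totals match: sum(O) = sum(E) = N = 100

# G = 2 * sum over non-empty cells of O_i * ln(O_i / E_i)
G = 2 * np.sum(observed * np.log(observed / expected))

# Equivalent built-in: the "log-likelihood" power-divergence statistic.
result = stats.power_divergence(f_obs=observed, f_exp=expected, lambda_="log-likelihood")
print(G, result.statistic)      # both about 1.44
```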

G-tests have been recommended at least since the 1981 edition of the popular statistics textbook by Robert R. Sokal and F. James Rohlf. [7]

## Related Research Articles

In statistics, the Kolmogorov–Smirnov test is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution, or to compare two samples. In essence, the test answers the question "What is the probability that this collection of samples could have been drawn from that probability distribution?" or, in the second case, "What is the probability that these two sets of samples were drawn from the same probability distribution?". It is named after Andrey Kolmogorov and Nikolai Smirnov.

In statistics, the likelihood-ratio test assesses the goodness of fit of two competing statistical models based on the ratio of their likelihoods, specifically one found by maximization over the entire parameter space and another found after imposing some constraint. If the constraint is supported by the observed data, the two likelihoods should not differ by more than sampling error. Thus the likelihood-ratio test tests whether this ratio is significantly different from one, or equivalently whether its natural logarithm is significantly different from zero.

In probability theory and statistics, the chi-squared distribution with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables. The chi-squared distribution is a special case of the gamma distribution and is one of the most widely used probability distributions in inferential statistics, notably in hypothesis testing and in construction of confidence intervals. This distribution is sometimes called the central chi-squared distribution, a special case of the more general noncentral chi-squared distribution.

In statistics, the (binary) logistic model is a statistical model that models the probability of one event taking place by having the log-odds for the event be a linear combination of one or more independent variables ("predictors"). In regression analysis, logistic regression estimates the parameters of a logistic model. Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names.

Pearson's chi-squared test is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. It is the most widely used of many chi-squared tests – statistical procedures whose results are evaluated by reference to the chi-squared distribution. Its properties were first investigated by Karl Pearson in 1900. In contexts where it is important to improve a distinction between the test statistic and its distribution, names similar to Pearson χ-squared test or statistic are used.

A chi-squared test is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof. Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.

In population genetics, the Hardy–Weinberg principle, also known as the Hardy–Weinberg equilibrium, model, theorem, or law, states that allele and genotype frequencies in a population will remain constant from generation to generation in the absence of other evolutionary influences. These influences include genetic drift, mate choice, assortative mating, natural selection, sexual selection, mutation, gene flow, meiotic drive, genetic hitchhiking, population bottleneck, founder effect and inbreeding.

An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled. Exact "F-tests" mainly arise when the models have been fitted to the data using least squares. The name was coined by George W. Snedecor, in honour of Ronald Fisher. Fisher initially developed the statistic as the variance ratio in the 1920s.

In statistics, deviance is a goodness-of-fit statistic for a statistical model; it is often used for statistical hypothesis testing. It is a generalization of the idea of using the sum of squares of residuals (RSS) in ordinary least squares to cases where model-fitting is achieved by maximum likelihood. It plays an important role in exponential dispersion models and generalized linear models.

The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.

Fisher's exact test is a statistical significance test used in the analysis of contingency tables. Although in practice it is employed when sample sizes are small, it is valid for all sample sizes. It is named after its inventor, Ronald Fisher, and is one of a class of exact tests, so called because the significance of the deviation from a null hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity, as with many statistical tests.

In statistics, G-tests are likelihood-ratio or maximum likelihood statistical significance tests that are increasingly being used in situations where chi-squared tests were previously recommended.

Estimation theory is a branch of statistics that deals with estimating the values of parameters based on measured empirical data that has a random component. The parameters describe an underlying physical setting in such a way that their value affects the distribution of the measured data. An estimator attempts to approximate the unknown parameters using the measurements. In estimation theory, two approaches are generally considered: the probabilistic approach and the set-membership approach.

In statistics, the score test assesses constraints on statistical parameters based on the gradient of the likelihood function—known as the score—evaluated at the hypothesized parameter value under the null hypothesis. Intuitively, if the restricted estimator is near the maximum of the likelihood function, the score should not differ from zero by more than sampling error. While the finite sample distributions of score tests are generally unknown, they have an asymptotic χ2-distribution under the null hypothesis as first proved by C. R. Rao in 1948, a fact that can be used to determine statistical significance.

In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).

Omnibus tests are a kind of statistical test. They test whether the explained variance in a set of data is significantly greater than the unexplained variance, overall. One example is the F-test in the analysis of variance. There can be legitimate significant effects within a model even if the omnibus test is not significant. For instance, in a model with two independent variables, if only one variable exerts a significant effect on the dependent variable and the other does not, then the omnibus test may be non-significant. This fact does not affect the conclusions that may be drawn from the one significant variable. In order to test effects within an omnibus test, researchers often use contrasts.

In statistics, the multinomial test is the test of the null hypothesis that the parameters of a multinomial distribution equal specified values; it is used for categorical data.

In statistics, Wilks' theorem offers an asymptotic distribution of the log-likelihood ratio statistic, which can be used to produce confidence intervals for maximum-likelihood estimates or as a test statistic for performing the likelihood-ratio test.

Log-linear analysis is a technique used in statistics to examine the relationship between more than two categorical variables. The technique is used for both hypothesis testing and model building. In both these uses, models are tested to find the most parsimonious model that best accounts for the variance in the observed frequencies.

In statistics, minimum chi-square estimation is a method of estimation of unobserved quantities based on observed data.

## References

1. Liu, Qiang; Lee, Jason; Jordan, Michael (20 June 2016). "A Kernelized Stein Discrepancy for Goodness-of-fit Tests". Proceedings of the 33rd International Conference on Machine Learning. The 33rd International Conference on Machine Learning. New York, New York, USA: Proceedings of Machine Learning Research. pp. 276–284.
2. Chwialkowski, Kacper; Strathmann, Heiko; Gretton, Arthur (20 June 2016). "A Kernel Test of Goodness of Fit". Proceedings of the 33rd International Conference on Machine Learning. The 33rd International Conference on Machine Learning. New York, New York, USA: Proceedings of Machine Learning Research. pp. 2606–2615.
3. Zhang, Jin (2002). "Powerful goodness-of-fit tests based on the likelihood ratio" (PDF). J. R. Stat. Soc. B. 64 (2): 281–294. doi:10.1111/1467-9868.00337 . Retrieved 5 November 2018.
4. Vexler, Albert; Gurevich, Gregory (2010). "Empirical Likelihood Ratios Applied to Goodness-of-Fit Tests Based on Sample Entropy". Computational Statistics and Data Analysis. 54 (2): 531–545. doi:10.1016/j.csda.2009.09.025.
5. Maindonald, J. H.; Braun, W. J. (2010). Data Analysis and Graphics Using R: An Example-Based Approach (Third ed.). New York: Cambridge University Press. pp. 116–118. ISBN 978-0-521-76293-9.
6. McDonald, J.H. (2014). "G–test of goodness-of-fit". Handbook of Biological Statistics (Third ed.). Baltimore, Maryland: Sparky House Publishing. pp. 53–58.
7. Sokal, R. R.; Rohlf, F. J. (1981). Biometry: The Principles and Practice of Statistics in Biological Research (Second ed.). W. H. Freeman. ISBN 0-7167-2411-1.
• Huber-Carol, C.; Balakrishnan, N.; Nikulin, M. S.; Mesbah, M., eds. (2002), Goodness-of-Fit Tests and Model Validity, Springer
• Ingster, Yu. I.; Suslina, I. A. (2003), Nonparametric Goodness-of-Fit Testing Under Gaussian Models, Springer
• Rayner, J. C. W.; Thas, O.; Best, D. J. (2009), Smooth Tests of Goodness of Fit (2nd ed.), Wiley
• Vexler, Albert; Gurevich, Gregory (2010), "Empirical likelihood ratios applied to goodness-of-fit tests based on sample entropy", Computational Statistics & Data Analysis, 54 (2): 531–545, doi:10.1016/j.csda.2009.09.025