# Statistical model


A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process. [1]


A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen). [2]

All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally, statistical models are part of the foundation of statistical inference.

## Introduction

Informally, a statistical model can be thought of as a statistical assumption (or set of statistical assumptions) with a certain property: that the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions about the dice.

Statistics, like all mathematical disciplines, does not infer valid conclusions from nothing. Inferring interesting conclusions about real statistical populations almost always requires some background assumptions. Those assumptions must be made carefully, because incorrect assumptions can generate wildly inaccurate conclusions.

The first statistical assumption is this: for each of the dice, the probability of each face (1, 2, 3, 4, 5, and 6) coming up is 1/6. From that assumption, we can calculate the probability of both dice coming up 5: 1/6 × 1/6 = 1/36. More generally, we can calculate the probability of any event: e.g. (1 and 2) or (3 and 3) or (5 and 6).

The alternative statistical assumption is this: for each of the dice, the probability of the face 5 coming up is 1/8 (because the dice are weighted). From that assumption, we can calculate the probability of both dice coming up 5: 1/8 × 1/8 = 1/64. We cannot, however, calculate the probability of any other nontrivial event, as the probabilities of the other faces are unknown.

The first statistical assumption constitutes a statistical model: because with the assumption alone, we can calculate the probability of any event. The alternative statistical assumption does not constitute a statistical model: because with the assumption alone, we cannot calculate the probability of every event.
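The point that the first assumption determines the probability of *every* event can be made concrete with a minimal Python sketch (the event names are made up for illustration): under the i.i.d. uniform assumption, each of the 36 ordered outcomes has probability 1/36, and any event's probability is just a sum over its outcomes.

```python
from fractions import Fraction
from itertools import product

# Model under the first assumption: each face of each die has probability 1/6,
# and the dice are independent, so every ordered pair of faces has probability 1/36.
outcomes = list(product(range(1, 7), repeat=2))
p = {o: Fraction(1, 36) for o in outcomes}

def prob(event):
    """Probability of an arbitrary event, i.e. a set of outcomes."""
    return sum(p[o] for o in event)

both_fives = {(5, 5)}
print(prob(both_fives))            # 1/36
at_least_one_five = {o for o in outcomes if 5 in o}
print(prob(at_least_one_five))     # 11/36
```

Under the alternative assumption, only `p[(5, 5)]` would be known, so `prob` could not be evaluated for most events; that is exactly why it fails to be a statistical model.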

In the example above, with the first assumption, calculating the probability of an event is easy. With some other examples, though, the calculation can be difficult, or even impractical (e.g. it might require millions of years of computation). For an assumption to constitute a statistical model, such difficulty is acceptable: doing the calculation does not need to be practicable, just theoretically possible.

## Formal definition

In mathematical terms, a statistical model is usually thought of as a pair $(S, \mathcal{P})$, where $S$ is the set of possible observations, i.e. the sample space, and $\mathcal{P}$ is a set of probability distributions on $S$. [3]

The intuition behind this definition is as follows. It is assumed that there is a "true" probability distribution induced by the process that generates the observed data. We choose $\mathcal{P}$ to represent a set (of distributions) which contains a distribution that adequately approximates the true distribution.

Note that we do not require that $\mathcal{P}$ contains the true distribution, and in practice that is rarely the case. Indeed, as Burnham & Anderson state, "A model is a simplification or approximation of reality and hence will not reflect all of reality" [4], whence the saying "all models are wrong".

The set $\mathcal{P}$ is almost always parameterized: $\mathcal{P} = \{P_{\theta} : \theta \in \Theta\}$. The set $\Theta$ defines the parameters of the model. A parameterization is generally required to have distinct parameter values give rise to distinct distributions, i.e. $P_{\theta_1} = P_{\theta_2} \Rightarrow \theta_1 = \theta_2$ must hold (in other words, the mapping $\theta \mapsto P_{\theta}$ must be injective). A parameterization that meets the requirement is said to be identifiable. [3]
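Identifiability can be illustrated numerically. In the sketch below (plain Python, made-up example), the standard $(\mu, \sigma)$ parameterization of the Gaussian is identifiable, while the hypothetical parameterization $\theta = (a, b) \mapsto N(a + b, 1)$ is not: the distinct parameter values $(0, 1)$ and $(1, 0)$ index the same distribution.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the univariate Gaussian N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Non-identifiable parameterization (hypothetical): theta = (a, b) -> N(a + b, 1).
# Distinct parameter values (0, 1) and (1, 0) give identical densities everywhere.
grid = [i / 10 for i in range(-50, 51)]
d1 = [normal_pdf(x, 0 + 1, 1) for x in grid]
d2 = [normal_pdf(x, 1 + 0, 1) for x in grid]
print(all(math.isclose(a, b) for a, b in zip(d1, d2)))  # True
```

Because two distinct parameter values produce the same distribution, no amount of data could distinguish them, which is precisely what the injectivity requirement rules out.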

## An example

Suppose that we have a population of school children, with the ages of the children distributed uniformly in the population. The height of a child will be stochastically related to the age: e.g. when we know that a child is of age 7, this influences the chance of the child being 1.5 meters tall. We could formalize that relationship in a linear regression model, like this: $\text{height}_i = b_0 + b_1 \text{age}_i + \varepsilon_i$, where $b_0$ is the intercept, $b_1$ is a parameter that age is multiplied by to obtain a prediction of height, $\varepsilon_i$ is the error term, and $i$ identifies the child. This implies that height is predicted by age, with some error.

An admissible model must be consistent with all the data points. Thus, a straight line ($\text{height}_i = b_0 + b_1 \text{age}_i$) cannot be the equation for a model of the data, unless it exactly fits all the data points, i.e. all the data points lie perfectly on the line. The error term, $\varepsilon_i$, must be included in the equation, so that the model is consistent with all the data points.

To do statistical inference, we would first need to assume some probability distributions for the $\varepsilon_i$. For instance, we might assume that the $\varepsilon_i$ distributions are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters: $b_0$, $b_1$, and the variance of the Gaussian distribution.

We can formally specify the model in the form $(S, \mathcal{P})$ as follows. The sample space, $S$, of our model comprises the set of all possible pairs (age, height). Each possible value of $\theta = (b_0, b_1, \sigma^2)$ determines a distribution on $S$; denote that distribution by $P_{\theta}$. If $\Theta$ is the set of all possible values of $\theta$, then $\mathcal{P} = \{P_{\theta} : \theta \in \Theta\}$. (The parameterization is identifiable, and this is easy to check.)

In this example, the model is determined by (1) specifying $S$ and (2) making some assumptions relevant to $\mathcal{P}$. There are two assumptions: that height can be approximated by a linear function of age; that errors in the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify $\mathcal{P}$, as they are required to do.
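To make the example concrete, the three parameters can be estimated from simulated data. A minimal sketch, assuming NumPy is available; the "true" values $b_0 = 0.75$, $b_1 = 0.06$, $\sigma = 0.05$ (heights in meters, ages in years) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate height_i = b0 + b1*age_i + eps_i with i.i.d. zero-mean Gaussian errors.
n = 500
age = rng.uniform(6, 12, size=n)
eps = rng.normal(0.0, 0.05, size=n)
height = 0.75 + 0.06 * age + eps

# Estimate (b0, b1) by least squares, and sigma^2 from the residuals.
X = np.column_stack([np.ones(n), age])
b_hat, *_ = np.linalg.lstsq(X, height, rcond=None)
resid = height - X @ b_hat
sigma2_hat = resid.var()

print(b_hat, sigma2_hat)  # estimates should be near (0.75, 0.06) and 0.0025
```

The fitted triple $(\hat{b}_0, \hat{b}_1, \hat{\sigma}^2)$ is a point in $\Theta$, i.e. it selects one distribution $P_{\hat{\theta}}$ from the model $\mathcal{P}$.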

## General remarks

A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the above example with children's heights, $\varepsilon_i$ is a stochastic variable; without that stochastic variable, the model would be deterministic.

Statistical models are often used even when the data-generating process being modeled is deterministic. For instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic (via a Bernoulli process).

Choosing an appropriate statistical model to represent a given data-generating process is sometimes extremely difficult, and may require knowledge of both the process and relevant statistical analyses. Relatedly, the statistician Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis". [5]

There are three purposes for a statistical model, according to Konishi & Kitagawa. [6]

• Predictions
• Extraction of information
• Description of stochastic structures

Those three purposes are essentially the same as the three purposes indicated by Friendly & Meyer: prediction, estimation, description. [7] The three purposes correspond with the three kinds of logical reasoning: deductive reasoning, inductive reasoning, abductive reasoning.

## Dimension of a model

Suppose that we have a statistical model $(S, \mathcal{P})$ with $\mathcal{P} = \{P_{\theta} : \theta \in \Theta\}$. The model is said to be parametric if $\Theta$ has a finite dimension. In notation, we write that $\Theta \subseteq \mathbb{R}^k$, where $k$ is a positive integer ($\mathbb{R}$ denotes the real numbers; other sets can be used, in principle). Here, $k$ is called the dimension of the model.

As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that

${\displaystyle {\mathcal {P}}=\left\{P_{\mu ,\sigma }(x)\equiv {\frac {1}{{\sqrt {2\pi }}\sigma }}\exp \left(-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}\right):\mu \in \mathbb {R} ,\sigma >0\right\}}$.

In this example, the dimension, $k$, equals 2.
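The two components of $\theta = (\mu, \sigma)$ are estimated jointly; for the Gaussian family the maximum-likelihood estimates have closed forms (the sample mean, and the root of the mean squared deviation). A sketch in plain Python, with made-up data:

```python
import math

# MLE for the 2-dimensional parameter theta = (mu, sigma) of a univariate Gaussian:
# mu_hat is the sample mean; sigma_hat is the root mean squared deviation
# (i.e. the biased sample standard deviation, dividing by n rather than n - 1).
data = [2.1, 1.9, 2.4, 2.0, 1.6]
mu_hat = sum(data) / len(data)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / len(data))
print(mu_hat, sigma_hat)  # mu_hat is 2.0 for this data
```

Fitting the model thus amounts to choosing a single point in the 2-dimensional parameter set $\Theta = \mathbb{R} \times (0, \infty)$.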

As another example, suppose that the data consists of points (x, y) that we assume are distributed according to a straight line with i.i.d. Gaussian residuals (with zero mean): this leads to the same statistical model as was used in the example with children's heights. The dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals. (Note that in geometry, a straight line has dimension 1.)

Although formally $\theta \in \Theta$ is a single parameter that has dimension $k$, it is sometimes regarded as comprising $k$ separate parameters. For example, with the univariate Gaussian distribution, $\theta$ is formally a single parameter with dimension 2, but it is sometimes regarded as comprising 2 separate parameters: the mean and the standard deviation.

A statistical model is nonparametric if the parameter set $\Theta$ is infinite dimensional. A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if $k$ is the dimension of $\Theta$ and $n$ is the number of samples, both semiparametric and nonparametric models have $k \rightarrow \infty$ as $n \rightarrow \infty$. If $k/n \rightarrow 0$ as $n \rightarrow \infty$, then the model is semiparametric; otherwise, the model is nonparametric.

Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies". [8]

## Nested models

Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions. As a second example, the quadratic model

$y = b_0 + b_1 x + b_2 x^2 + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$

has, nested within it, the linear model

$y = b_0 + b_1 x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$

that is, we constrain the parameter $b_2$ to equal 0.

In both those examples, the first model has a higher dimension than the second model (for the first example, the zero-mean model has dimension 1). Such is often, but not always, the case. As a different example, the set of positive-mean Gaussian distributions, which has dimension 2, is nested within the set of all Gaussian distributions.
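The quadratic/linear nesting can be made concrete in code: fitting both models to the same (made-up) data, the linear fit is exactly the quadratic fit with the $x^2$ column removed, i.e. with $b_2$ forced to 0. A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 200)
y = 0.5 + 2.0 * x + rng.normal(0, 0.3, size=x.size)  # data generated by a linear model

# Quadratic model: design matrix with columns [1, x, x^2].
Xq = np.column_stack([np.ones_like(x), x, x ** 2])
bq, *_ = np.linalg.lstsq(Xq, y, rcond=None)

# Nested linear model: imposing the constraint b2 = 0 just drops the x^2 column.
Xl = Xq[:, :2]
bl, *_ = np.linalg.lstsq(Xl, y, rcond=None)

print(bq)  # fitted b2 should be near 0, since the data are in fact linear
print(bl)
```

Because the linear model is obtained from the quadratic one purely by a parameter constraint, every distribution in the linear model also belongs to the quadratic model.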

## Comparing models

Comparing statistical models is fundamental for much of statistical inference. Indeed, Konishi & Kitagawa (2008, p. 75) state the following.

> The majority of the problems in statistical inference can be considered to be problems related to statistical modeling. They are typically formulated as comparisons of several statistical models.

Common criteria for comparing models include the following: $R^2$, the Bayes factor, and the likelihood-ratio test together with its generalization, the relative likelihood.
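As an illustration, two of these criteria can be computed for a nested linear/quadratic pair, assuming NumPy and Gaussian errors; the data are made up, and for Gaussian errors the likelihood-ratio statistic reduces to $n \ln(\mathrm{RSS}_{\text{lin}}/\mathrm{RSS}_{\text{quad}})$:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 100)
y = 1.0 + 3.0 * x + rng.normal(0, 0.2, size=x.size)  # true model is linear

def residuals(X, y):
    """Least-squares residuals of y on the columns of X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ b

r_lin = residuals(np.column_stack([np.ones_like(x), x]), y)
r_quad = residuals(np.column_stack([np.ones_like(x), x, x ** 2]), y)
rss_lin, rss_quad = (r_lin ** 2).sum(), (r_quad ** 2).sum()

# R^2 of the linear model.
r2_lin = 1 - rss_lin / ((y - y.mean()) ** 2).sum()

# Likelihood-ratio statistic for the nested pair under Gaussian errors;
# it would be compared against a chi-squared distribution with 1 degree of freedom.
lr = x.size * np.log(rss_lin / rss_quad)
print(round(r2_lin, 3), round(lr, 3))
```

Since the data were generated from the linear model, we expect a high $R^2$ for the linear fit and a small likelihood-ratio statistic, i.e. no evidence for the extra quadratic parameter.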

## Notes

1. Cox 2006, p. 178
2. Adèr 2008, p. 280
3. McCullagh 2002
4. Burnham & Anderson 2002, §1.2.5
5. Cox 2006, p. 197
6. Konishi & Kitagawa 2008, §1.1
7. Friendly & Meyer 2016, §11.6
8. Cox 2006, p. 2

## References

• Adèr, H. J. (2008), "Modelling", in Adèr, H. J.; Mellenbergh, G. J. (eds.), Advising on Research Methods: A consultant's companion, Huizen, The Netherlands: Johannes van Kessel Publishing, pp. 271–304.
• Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference (2nd ed.), Springer-Verlag.
• Cox, D. R. (2006), Principles of Statistical Inference, Cambridge University Press.
• Friendly, M.; Meyer, D. (2016), Discrete Data Analysis with R, Chapman & Hall.
• Konishi, S.; Kitagawa, G. (2008), Information Criteria and Statistical Modeling, Springer.
• McCullagh, P. (2002), "What is a statistical model?", Annals of Statistics, 30 (5): 1225–1310, doi:10.1214/aos/1035844977.