Identifiability

Last updated November 12, 2024

In statistics, identifiability is a property which a model must satisfy for precise inference to be possible. A model is identifiable if it is theoretically possible to learn the true values of this model's underlying parameters after obtaining an infinite number of observations from it. Mathematically, this is equivalent to saying that different values of the parameters must generate different probability distributions of the observable variables. Usually the model is identifiable only under certain technical restrictions, in which case the set of these requirements is called the identification conditions.

A model that fails to be identifiable is said to be non-identifiable or unidentifiable: two or more parametrizations are observationally equivalent. In some cases, even though a model is non-identifiable, it is still possible to learn the true values of a certain subset of the model parameters. In this case we say that the model is partially identifiable. In other cases it may be possible to learn the location of the true parameter up to a certain finite region of the parameter space, in which case the model is set identifiable.

Aside from strictly theoretical exploration of the model properties, identifiability can be referred to in a wider scope when a model is tested with experimental data sets, using identifiability analysis.^[1]

Definition

Let ${\mathcal {P}}=\{P_{\theta }:\theta \in \Theta \}$ be a statistical model with parameter space $\Theta$ . We say that ${\mathcal {P}}$ is identifiable if the mapping $\theta \mapsto P_{\theta }$ is one-to-one:^[2]

P_{\theta _{1}}=P_{\theta _{2}}\quad \Rightarrow \quad \theta _{1}=\theta _{2}\quad \ {\text{for all }}\theta _{1},\theta _{2}\in \Theta .

This definition means that distinct values of θ should correspond to distinct probability distributions: if θ₁≠θ₂, then also P_θ₁≠P_θ₂.^[3] If the distributions are defined in terms of the probability density functions (pdfs), then two pdfs should be considered distinct only if they differ on a set of non-zero measure (for example two functions ƒ₁(x) = 1_{0 ≤ x < 1} and ƒ₂(x) = 1_{0 ≤ x ≤ 1} differ only at a single point x = 1 — a set of measure zero — and thus cannot be considered as distinct pdfs).

Identifiability of the model in the sense of invertibility of the map $\theta \mapsto P_{\theta }$ is equivalent to being able to learn the model's true parameter if the model can be observed indefinitely long. Indeed, if {X_t} ⊆ S is the sequence of observations from the model, then by the strong law of large numbers,

{\frac {1}{T}}\sum _{t=1}^{T}\mathbf {1} _{\{X_{t}\in A\}}\ {\xrightarrow {\text{a.s.}}}\ \Pr[X_{t}\in A],

for every measurable set A ⊆ S (here 1_{...} is the indicator function). Thus, with an infinite number of observations we will be able to find the true probability distribution P₀ in the model, and since the identifiability condition above requires that the map $\theta \mapsto P_{\theta }$ be invertible, we will also be able to find the true value of the parameter which generated given distribution P₀.

Examples

Example 1

Let ${\mathcal {P}}$ be the normal location-scale family:

{\mathcal {P}}={\Big \{}\ f_{\theta }(x)={\tfrac {1}{{\sqrt {2\pi }}\sigma }}e^{-{\frac {1}{2\sigma ^{2}}}(x-\mu )^{2}}\ {\Big |}\ \theta =(\mu ,\sigma ):\mu \in \mathbb {R} ,\,\sigma \!>0\ {\Big \}}.

Then

{\begin{aligned}&f_{\theta _{1}}(x)=f_{\theta _{2}}(x)\\[6pt]\Longleftrightarrow {}&{\frac {1}{{\sqrt {2\pi }}\sigma _{1}}}\exp \left(-{\frac {1}{2\sigma _{1}^{2}}}(x-\mu _{1})^{2}\right)={\frac {1}{{\sqrt {2\pi }}\sigma _{2}}}\exp \left(-{\frac {1}{2\sigma _{2}^{2}}}(x-\mu _{2})^{2}\right)\\[6pt]\Longleftrightarrow {}&{\frac {1}{\sigma _{1}^{2}}}(x-\mu _{1})^{2}+\ln \sigma _{1}={\frac {1}{\sigma _{2}^{2}}}(x-\mu _{2})^{2}+\ln \sigma _{2}\\[6pt]\Longleftrightarrow {}&x^{2}\left({\frac {1}{\sigma _{1}^{2}}}-{\frac {1}{\sigma _{2}^{2}}}\right)-2x\left({\frac {\mu _{1}}{\sigma _{1}^{2}}}-{\frac {\mu _{2}}{\sigma _{2}^{2}}}\right)+\left({\frac {\mu _{1}^{2}}{\sigma _{1}^{2}}}-{\frac {\mu _{2}^{2}}{\sigma _{2}^{2}}}+\ln \sigma _{1}-\ln \sigma _{2}\right)=0\end{aligned}}

This expression is equal to zero for almost all x only when all its coefficients are equal to zero, which is only possible when |σ₁| = |σ₂| and μ₁ = μ₂. Since in the scale parameter σ is restricted to be greater than zero, we conclude that the model is identifiable: ƒ_θ₁ = ƒ_θ₂ ⇔ θ₁ = θ₂.

Example 2

Let ${\mathcal {P}}$ be the standard linear regression model:

y=\beta 'x+\varepsilon ,\quad \mathrm {E} [\,\varepsilon \mid x\,]=0

(where ′ denotes matrix transpose). Then the parameter β is identifiable if and only if the matrix $\mathrm {E} [xx']$ is invertible. Thus, this is the identification condition in the model.

Example 3

Suppose ${\mathcal {P}}$ is the classical errors-in-variables linear model:

{\begin{cases}y=\beta x^{*}+\varepsilon ,\\x=x^{*}+\eta ,\end{cases}}

where (ε,η,x*) are jointly normal independent random variables with zero expected value and unknown variances, and only the variables (x,y) are observed. Then this model is not identifiable^[4], only the product βσ²_∗ is (where σ²_∗ is the variance of the latent regressor x*). This is also an example of a set identifiable model: although the exact value of β cannot be learned, we can guarantee that it must lie somewhere in the interval (β_yx, 1÷β_xy), where β_yx is the coefficient in OLS regression of y on x, and β_xy is the coefficient in OLS regression of x on y.^[5]

If we abandon the normality assumption and require that x* were not normally distributed, retaining only the independence condition ε ⊥ η ⊥ x*, then the model becomes identifiable.^[4]

Related Research Articles

In statistics, a location parameter of a probability distribution is a scalar- or vector-valued parameter $, which determines the "location" or shift of the distribution. In the literature of location parameter estimation, the probability distributions with such parameter are found to be formally defined in one of the following equivalent ways:$

In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is $The parameter is the mean or expectation of the distribution, while the parameter is the variance. The standard deviation of the distribution is (sigma). A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate .$

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data. A statistical model represents, often in considerably idealized form, the data-generating process. When referring specifically to probabilities, the corresponding term is probabilistic model. All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally, statistical models are part of the foundation of statistical inference. A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory".

In statistics, the likelihood-ratio test is a hypothesis test that involves comparing the goodness of fit of two competing statistical models, typically one found by maximization over the entire parameter space and another found after imposing some constraint, based on the ratio of their likelihoods. If the more constrained model is supported by the observed data, the two likelihoods should not differ by more than sampling error. Thus the likelihood-ratio test tests whether this ratio is significantly different from one, or equivalently whether its natural logarithm is significantly different from zero.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the base form $and with parametric extension for arbitrary real constants a, b and non-zero c . It is named after the mathematician Carl Friedrich Gauss. The graph of a Gaussian is a characteristic symmetric "bell curve" shape. The parameter a is the height of the curve's peak, b is the position of the center of the peak, and c controls the width of the "bell".$

In information geometry, the Fisher information metric is a particular Riemannian metric which can be defined on a smooth statistical manifold, i.e., a smooth manifold whose points are probability measures defined on a common probability space. It can be used to calculate the informational difference between measurements.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.

In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted $, is a type of statistical distance: a measure of how one reference probability distribution P is different from a second probability distribution Q . Mathematically, it is defined as$

In statistics, a consistent estimator or asymptotically consistent estimator is an estimator—a rule for computing estimates of a parameter θ₀—having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to θ₀. This means that the distributions of the estimates become more and more concentrated near the true value of the parameter being estimated, so that the probability of the estimator being arbitrarily close to θ₀ converges to one.

Estimation theory is a branch of statistics that deals with estimating the values of parameters based on measured empirical data that has a random component. The parameters describe an underlying physical setting in such a way that their value affects the distribution of the measured data. An estimator attempts to approximate the unknown parameters using the measurements. In estimation theory, two approaches are generally considered:

In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of maximum likelihood estimation.

In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).

In statistics, a parametric model or parametric family or finite-dimensional model is a particular class of statistical models. Specifically, a parametric model is a family of probability distributions that has a finite number of parameters.

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation and selection: in each generation (iteration) new individuals are generated by variation of the current parental individuals, usually in a stochastic way. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value $. Like this, individuals with better and better -values are generated over the generation sequence.$

A ratio distribution is a probability distribution constructed as the distribution of the ratio of random variables having two other known distributions. Given two random variables X and Y, the distribution of the random variable Z that is formed as the ratio Z = X/Y is a ratio distribution.

In probability and statistics, the class of exponential dispersion models (EDM), also called exponential dispersion family (EDF), is a set of probability distributions that represents a generalisation of the natural exponential family. Exponential dispersion models play an important role in statistical theory, in particular in generalized linear models because they have a special structure which enables deductions to be made about appropriate statistical inference.

In statistics, the variance function is a smooth function that depicts the variance of a random quantity as a function of its mean. The variance function is a measure of heteroscedasticity and plays a large role in many settings of statistical modelling. It is a main ingredient in the generalized linear model framework and a tool used in non-parametric regression, semiparametric regression and functional data analysis. In parametric modeling, variance functions take on a parametric form and explicitly describe the relationship between the variance and the mean of a random quantity. In a non-parametric setting, the variance function is assumed to be a smooth function.

In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It is part of the families of probabilistic graphical models and variational Bayesian methods.

The reparameterization trick is a technique used in statistical machine learning, particularly in variational inference, variational autoencoders, and stochastic optimization. It allows for the efficient computation of gradients through random variables, enabling the optimization of parametric probability models using stochastic gradient descent, and the variance reduction of estimators.

References

Citations

↑ Raue, A.; Kreutz, C.; Maiwald, T.; Bachmann, J.; Schilling, M.; Klingmuller, U.; Timmer, J. (2009-08-01). "Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood". Bioinformatics. 25 (15): 1923–1929. doi: 10.1093/bioinformatics/btp358 . PMID 19505944.
↑ Lehmann & Casella 1998 , Ch. 1, Definition 5.2
↑ van der Vaart 1998 , p. 62
1 2 Reiersøl 1950
↑ Casella & Berger 2002 , p. 583

Sources

Casella, George; Berger, Roger L. (2002), Statistical Inference (2nd ed.), ISBN 0-534-24312-6, LCCN 2001025794
Hsiao, Cheng (1983), Identification, Handbook of Econometrics, Vol. 1, Ch.4, North-Holland Publishing Company
Lehmann, E. L.; Casella, G. (1998), Theory of Point Estimation (2nd ed.), Springer, ISBN 0-387-98502-6
Reiersøl, Olav (1950), "Identifiability of a linear relation between variables which are subject to error", Econometrica , 18 (4): 375–389, doi:10.2307/1907835, JSTOR 1907835
van der Vaart, A. W. (1998), Asymptotic Statistics, Cambridge University Press, ISBN 978-0-521-49603-2