Algorithmic inference

Algorithmic inference gathers new developments in the statistical inference methods made feasible by the powerful computing devices widely available to any data analyst. Cornerstones in this field are computational learning theory, granular computing, bioinformatics and, long ago, structural probability (Fraser 1966). The main focus is on the algorithms that compute statistics rooting the study of a random phenomenon, along with the amount of data they must feed on to produce reliable results. This shifts the interest of mathematicians from the study of distribution laws to the functional properties of the statistics, and the interest of computer scientists from the algorithms for processing data to the information they process.

The Fisher parametric inference problem

Concerning the identification of the parameters of a distribution law, the mature reader may recall lengthy disputes in the mid 20th century about the interpretation of their variability in terms of fiducial distribution (Fisher 1956), structural probabilities (Fraser 1966), priors/posteriors (Ramsey 1925), and so on. From an epistemological viewpoint, this entailed a companion dispute as to the nature of probability: is it a physical feature of phenomena to be described through random variables, or a way of synthesizing data about a phenomenon? Opting for the latter, Fisher defines a fiducial distribution law of parameters of a given random variable that he deduces from a sample of its specifications. With this law he computes, for instance, "the probability that μ (mean of a Gaussian variable – our note) is less than any assigned value, or the probability that it lies between any assigned values, or, in short, its probability distribution, in the light of the sample observed".

The classic solution

Fisher fought hard to defend the difference and superiority of his notion of parameter distribution in comparison to analogous notions, such as Bayes' posterior distribution, Fraser's constructive probability and Neyman's confidence intervals. For half a century, Neyman's confidence intervals won out for all practical purposes, crediting the phenomenological nature of probability. With this perspective, when you deal with a Gaussian variable, its mean μ is fixed by the physical features of the phenomenon you are observing, where the observations are random operators, hence the observed values are specifications of a random sample. Because of their randomness, you may compute from the sample specific intervals containing the fixed μ with a given probability that you denote confidence.

Example

Let $X$ be a Gaussian variable [1] with parameters $\mu$ and $\sigma^2$ and $\{X_1,\ldots,X_m\}$ a sample drawn from it. Working with the statistics

$$s_\mu=\sum_{i=1}^m x_i \qquad\text{and}\qquad s_{\sigma^2}=\sum_{i=1}^m (x_i-\bar x)^2,$$

where $\bar x = s_\mu/m$ is the sample mean, we recognize that

$$T=\frac{\bar X-\mu}{\sqrt{S_{\sigma^2}}}\sqrt{m(m-1)}$$

follows a Student's t distribution (Wilks 1962) with parameter (degrees of freedom) $m-1$, so that

$$f_T(t)=\frac{\Gamma(m/2)}{\Gamma\left(\frac{m-1}{2}\right)}\frac{1}{\sqrt{\pi(m-1)}}\left(1+\frac{t^2}{m-1}\right)^{-m/2}.$$

Gauging $T$ between two quantiles and inverting its expression as a function of $\mu$, you obtain confidence intervals for $\mu$.

With a sample specification of size $m = 10$ you compute the statistics $s_\mu$ and $s_{\sigma^2}$ and obtain, for instance, a 0.90 confidence interval for $\mu$ with extremes (3.03, 5.65).
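
As a minimal computational sketch of this interval (using SciPy's Student's t quantiles; the sample values below are hypothetical placeholders, not the specification referred to above):

    import numpy as np
    from scipy import stats

    x = np.array([7.1, 6.3, 3.9, 6.5, 0.2, 2.9, 4.1, 4.7, 6.0, 1.6])  # hypothetical sample
    m = len(x)

    s_mu = x.sum()                       # s_mu = sum_i x_i
    x_bar = s_mu / m                     # sample mean
    s_sigma2 = ((x - x_bar) ** 2).sum()  # s_sigma2 = sum_i (x_i - x_bar)^2

    # T = (x_bar - mu) * sqrt(m (m - 1)) / sqrt(s_sigma2) follows a Student's t
    # distribution with m - 1 degrees of freedom; invert it between two quantiles.
    t_q = stats.t.ppf(0.95, df=m - 1)    # two-sided 0.90 confidence
    half_width = t_q * np.sqrt(s_sigma2 / (m * (m - 1)))
    print("0.90 confidence interval for mu:", (x_bar - half_width, x_bar + half_width))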

Inferring functions with the help of a computer

From a modeling perspective the entire dispute looks like a chicken-and-egg dilemma: either fixed data first and the probability distribution of their properties as a consequence, or fixed properties first and the probability distribution of the observed data as a corollary. The classic solution has one benefit and one drawback. The former was appreciated particularly back when people still did computations with sheet and pencil. Per se, the task of computing a Neyman confidence interval for the fixed parameter θ is hard: you do not know θ, yet you seek to place around it an interval with a possibly very low probability of failing to contain it. The analytical solution is available only for a very limited number of theoretical cases. Vice versa, a large variety of instances may be quickly solved in an approximate way via the central limit theorem, in terms of a confidence interval around a Gaussian distribution – that's the benefit. The drawback is that the central limit theorem is applicable only when the sample size is sufficiently large. Therefore, it is less and less applicable with the samples involved in modern inference instances. The fault does not lie with the sample size itself; rather, this size is not sufficiently large because of the complexity of the inference problem.

With the availability of large computing facilities, scientists refocused from isolated parameter inference to complex function inference, i.e. regarding sets of highly nested parameters identifying functions. In these cases we speak about learning of functions (in terms, for instance, of regression, neuro-fuzzy systems or computational learning) on the basis of highly informative samples. A first effect of having a complex structure linking data is the reduction of the number of sample degrees of freedom, i.e. the burning of a part of the sample points, so that the effective sample size to be considered in the central limit theorem is too small. Focusing on the sample size ensuring a limited learning error with a given confidence level, the consequence is that the lower bound on this size grows with complexity indices such as the VC dimension or the detail of the class to which the function we want to learn belongs.
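
As a rough illustration of how such size requirements grow with a complexity index, the sketch below computes the classic sufficient sample size of Blumer, Ehrenfeucht, Haussler and Warmuth for PAC learning a class of a given VC dimension; this standard bound is assumed here only for illustration and is not the sharper bound based on the detail of the class mentioned above.

    import math

    def pac_sample_size(vc_dim: int, eps: float, delta: float) -> int:
        """Sample size sufficient to learn with error <= eps and confidence >= 1 - delta
        (Blumer et al. sufficient bound, assumed here for illustration)."""
        bound = max(4 / eps * math.log2(2 / delta),
                    8 * vc_dim / eps * math.log2(13 / eps))
        return math.ceil(bound)

    # The required size grows with the VC dimension of the class to be learnt:
    for d in (1, 6, 20):
        print(d, pac_sample_size(vc_dim=d, eps=0.1, delta=0.01))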

Example

A sample of 1,000 independent bits is enough to ensure an absolute error of at most 0.081 on the estimation of the parameter p of the underlying Bernoulli variable with a confidence of at least 0.99. The same size cannot guarantee a threshold less than 0.088 with the same confidence 0.99 when the error is identified with the probability that a 20-year-old man living in New York does not fit the ranges of height, weight and waistline observed on 1,000 Big Apple inhabitants. The accuracy shortage occurs because both the VC dimension and the detail of the class of parallelepipeds, to which the one observed from the 1,000 inhabitants' ranges belongs, are equal to 6.
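
A quick Monte Carlo sanity check of the Bernoulli part of this claim (an empirical illustration, not the combinatorial argument behind the bound):

    import numpy as np

    rng = np.random.default_rng(0)
    m, p, trials = 1000, 0.5, 100_000        # p = 0.5 is the worst case for the estimate
    p_hat = rng.binomial(m, p, size=trials) / m
    # The frequency of missing p by more than 0.081 should stay well below 0.01.
    print("empirical P(|p_hat - p| > 0.081):", np.mean(np.abs(p_hat - p) > 0.081))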

The general inversion problem solving the Fisher question

With insufficiently large samples, the approach fixed sample – random properties suggests inference procedures in three steps:

1. Sampling mechanism. It consists of a pair $(Z, g_{\boldsymbol\theta})$, where the seed $Z$ is a random variable without unknown parameters, while the explaining function $g_{\boldsymbol\theta}$ is a function mapping from samples of $Z$ to samples of the random variable $X$ we are interested in. The parameter vector $\boldsymbol\theta$ is a specification of the random parameter $\boldsymbol\Theta$; its components are the parameters of the $X$ distribution law. The integral transform theorem ensures the existence of such a mechanism for each (scalar or vector) $X$ when the seed coincides with the random variable $U$ uniformly distributed in $[0,1]$.
Example. For $X$ following a Pareto distribution with parameters $a$ and $k$, i.e.

$$F_X(x)=1-\left(\frac{k}{x}\right)^a, \qquad x\geq k,$$

a sampling mechanism for $X$ with seed $U$ reads:

$$x=\frac{k}{(1-u)^{1/a}}$$

or, equivalently,

$$x=\frac{k}{u^{1/a}}.$$
2. Master equations. The actual connection between the model and the observed data is cast in terms of a set of relations between statistics on the data and unknown parameters that come as a corollary of the sampling mechanisms. We call these relations master equations. Pivoting around the statistic $s=h(x_1,\ldots,x_m)$, the general form of a master equation is:

$$s=\rho(\boldsymbol\theta;z_1,\ldots,z_m).$$

With these relations we may inspect the values of the parameters that could have generated a sample with the observed statistic from a particular setting of the seeds representing the seed of the sample. Hence, to the population of sample seeds corresponds a population of parameters. In order to ensure that this population has clean properties, it is enough to draw the seed values randomly and to involve either sufficient statistics or, simply, well-behaved statistics w.r.t. the parameters in the master equations.

For example, the statistics $s_1=\sum_{i=1}^m\ln x_i$ and $s_2=\min_{i=1,\ldots,m}x_i$ prove to be sufficient for the parameters $a$ and $k$ of a Pareto random variable $X$. Thanks to the (equivalent form of the) sampling mechanism we may read them as

$$s_1=m\ln k+\frac{1}{a}\sum_{i=1}^m\ln\frac{1}{u_i} \qquad\text{and}\qquad s_2=k\left(\max_{i=1,\ldots,m}u_i\right)^{-1/a},$$

respectively.

3. Parameter population. Having fixed a set of master equations, you may map sample seeds into parameters either numerically through a population bootstrap, or analytically through a twisting argument. Hence from a population of seeds you obtain a population of parameters.
Example. From the above master equations we can draw a pair of parameters $(a,k)$ compatible with the observed sample by solving the following system of equations:

$$s_1=m\ln k+\frac{1}{a}\sum_{i=1}^m\ln\frac{1}{u_i}, \qquad s_2=k\left(\max_{i=1,\ldots,m}u_i\right)^{-1/a},$$

where $s_1$ and $s_2$ are the observed statistics and $\{u_1,\ldots,u_m\}$ a set of uniform seeds. Transferring to the parameters the probability (density) affecting the seeds, you obtain the distribution law of the random parameters $A$ and $K$ compatible with the statistics you have observed (see the sketch after this list).
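
A minimal population-bootstrap sketch for the Pareto case, assuming the sampling mechanism and master equations written above; the observed sample is itself simulated here with hypothetical parameter values:

    import numpy as np

    rng = np.random.default_rng(42)

    # Observed sample (simulated with hypothetical a = 2, k = 1 via x = k * u**(-1/a)).
    m = 30
    x_obs = 1.0 * rng.uniform(size=m) ** (-1.0 / 2.0)
    s1, s2 = np.log(x_obs).sum(), x_obs.min()   # well-behaved (sufficient) statistics

    # Population bootstrap: draw m uniform seeds per replica and solve the master
    # equations  s1 = m*ln(k) + (1/a)*sum(ln(1/u_i)),  s2 = k * max(u_i)**(-1/a)
    # for (a, k).
    n_replicas = 10_000
    a_pop, k_pop = np.empty(n_replicas), np.empty(n_replicas)
    for j in range(n_replicas):
        u = rng.uniform(size=m)
        a_pop[j] = (m * np.log(u.max()) - np.log(u).sum()) / (s1 - m * np.log(s2))
        k_pop[j] = s2 * u.max() ** (1.0 / a_pop[j])

    # The empirical joint distribution of (a_pop, k_pop) approximates the law of (A, K).
    print("median a:", np.median(a_pop), " median k:", np.median(k_pop))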

Compatibility denotes parameters of compatible populations, i.e. of populations that could have generated a sample giving rise to the observed statistics. You may formalize this notion as follows:

Definition

For a random variable $X$ and a sample drawn from it, a compatible distribution is a distribution having the same sampling mechanism of $X$ with a value $\boldsymbol\theta$ of the random parameter $\boldsymbol\Theta$ derived from a master equation rooted on a well-behaved statistic $s$.

Example

Figure: joint empirical cumulative distribution function of the parameters $(A,K)$ of a Pareto random variable.
Figure: cumulative distribution function of the mean $M$ of a Gaussian random variable.

You may find the distribution law of the Pareto parameters $A$ and $K$ as an implementation example of the population bootstrap method, as in the figure on the left.

Implementing the twisting argument method, you get the distribution law of the mean $M$ of a Gaussian variable $X$ on the basis of the statistic $s_\mu=\sum_{i=1}^m x_i$ when $\sigma$ is known (Apolloni, Malchiodi & Gaito 2006). Its expression is:

$$F_M(\mu)=\Phi\!\left(\frac{m\mu-s_\mu}{\sqrt{m}\,\sigma}\right),$$

shown in the figure on the right, where $\Phi$ is the cumulative distribution function of a standard normal distribution.

Figure: upper (purple curve) and lower (blue curve) extremes of a 90% confidence interval of the mean $M$ of a Gaussian random variable, for a fixed $\sigma$ and different values of the statistic $s_\mu$.

Computing a confidence interval for $M$ given its distribution function is straightforward: we need only find two quantiles (for instance the $(1-\delta)/2$ and $(1+\delta)/2$ quantiles, in case we are interested in a confidence interval of level $\delta$ symmetric in the tails' probabilities), as in the diagram showing the behavior of the two bounds for different values of the statistic $s_\mu$.
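
As a sketch, inverting the expression of $F_M$ reconstructed above gives the interval extremes directly; the sample values and $\sigma$ below are hypothetical:

    import numpy as np
    from scipy import stats

    x = np.array([7.1, 6.3, 3.9, 6.5, 0.2, 2.9, 4.1, 4.7, 6.0, 1.6])  # hypothetical sample
    sigma = 2.0                                                       # assumed known
    m, s_mu = len(x), x.sum()

    delta = 0.90
    lo_q, hi_q = (1 - delta) / 2, (1 + delta) / 2
    # Inverting F_M(mu) = Phi((m*mu - s_mu) / (sqrt(m)*sigma)):
    # the q-quantile of M is (s_mu + sqrt(m)*sigma*Phi^{-1}(q)) / m.
    lower = (s_mu + np.sqrt(m) * sigma * stats.norm.ppf(lo_q)) / m
    upper = (s_mu + np.sqrt(m) * sigma * stats.norm.ppf(hi_q)) / m
    print(f"{delta:.0%} confidence interval for M: ({lower:.2f}, {upper:.2f})")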

The Achilles heel of Fisher's approach lies in the joint distribution of more than one parameter, say the mean and variance of a Gaussian distribution. On the contrary, with the last approach (and the above-mentioned methods: population bootstrap and twisting argument) we may learn the joint distribution of many parameters. For instance, focusing on the distribution of two or more parameters, in the figures below we report two confidence regions where the function to be learnt falls with a confidence of 90%. The former concerns the probability with which an extended support vector machine attributes a binary label 1 to the points of the plane. The two surfaces are drawn on the basis of a set of sample points in turn labelled according to a specific distribution law (Apolloni et al. 2008). The latter concerns the confidence region of the hazard rate of breast cancer recurrence computed from a censored sample (Apolloni, Malchiodi & Gaito 2006).

Figure: 90% confidence region for the family of support vector machines endowed with hyperbolic tangent profile function.
Figure: 90% confidence region for the hazard function of breast cancer recurrence computed from the censored sample $t=(9,13,>13,18,12,23,31,34,>45,48,>161)$, with $>t$ denoting a censored time.


Notes

  1. By default, capital letters (such as U, X) will denote random variables and small letters (u, x) their corresponding specifications.

References