Variance function

Last updated

In statistics, the variance function is a smooth function that depicts the variance of a random quantity as a function of its mean. The variance function is a measure of heteroscedasticity and plays a large role in many settings of statistical modelling. It is a main ingredient in the generalized linear model framework and a tool used in non-parametric regression, [1] semiparametric regression [1] and functional data analysis. [2] In parametric modeling, variance functions take on a parametric form and explicitly describe the relationship between the variance and the mean of a random quantity. In a non-parametric setting, the variance function is assumed to be a smooth function.

Contents

Intuition

In a regression model setting, the goal is to establish whether or not a relationship exists between a response variable and a set of predictor variables. Further, if a relationship does exist, the goal is then to be able to describe this relationship as best as possible. A main assumption in linear regression is constant variance or (homoscedasticity), meaning that different response variables have the same variance in their errors, at every predictor level. This assumption works well when the response variable and the predictor variable are jointly normal. As we will see later, the variance function in the Normal setting is constant; however, we must find a way to quantify heteroscedasticity (non-constant variance) in the absence of joint Normality.

When it is likely that the response follows a distribution that is a member of the exponential family, a generalized linear model may be more appropriate to use, and moreover, when we wish not to force a parametric model onto our data, a non-parametric regression approach can be useful. The importance of being able to model the variance as a function of the mean lies in improved inference (in a parametric setting), and estimation of the regression function in general, for any setting.

Variance functions play a very important role in parameter estimation and inference. In general, maximum likelihood estimation requires that a likelihood function be defined. This requirement then implies that one must first specify the distribution of the response variables observed. However, to define a quasi-likelihood, one need only specify a relationship between the mean and the variance of the observations to then be able to use the quasi-likelihood function for estimation. [3] Quasi-likelihood estimation is particularly useful when there is overdispersion. Overdispersion occurs when there is more variability in the data than there should otherwise be expected according to the assumed distribution of the data.

In summary, to ensure efficient inference of the regression parameters and the regression function, the heteroscedasticity must be accounted for. Variance functions quantify the relationship between the variance and the mean of the observed data and hence play a significant role in regression estimation and inference.

Types

The variance function and its applications come up in many areas of statistical analysis. A very important use of this function is in the framework of generalized linear models and non-parametric regression.

Generalized linear model

When a member of the exponential family has been specified, the variance function can easily be derived. [4] :29 The general form of the variance function is presented under the exponential family context, as well as specific forms for Normal, Bernoulli, Poisson, and Gamma. In addition, we describe the applications and use of variance functions in maximum likelihood estimation and quasi-likelihood estimation.

Derivation

The generalized linear model (GLM), is a generalization of ordinary regression analysis that extends to any member of the exponential family. It is particularly useful when the response variable is categorical, binary or subject to a constraint (e.g. only positive responses make sense). A quick summary of the components of a GLM are summarized on this page, but for more details and information see the page on generalized linear models.

A GLM consists of three main ingredients:

1. Random Component: a distribution of y from the exponential family,
2. Linear predictor:
3. Link function:

First it is important to derive a couple key properties of the exponential family.

Any random variable in the exponential family has a probability density function of the form,

with loglikelihood,

Here, is the canonical parameter and the parameter of interest, and is a nuisance parameter which plays a role in the variance. We use the Bartlett's Identities to derive a general expression for the variance function. The first and second Bartlett results ensures that under suitable conditions (see Leibniz integral rule), for a density function dependent on ,

These identities lead to simple calculations of the expected value and variance of any random variable in the exponential family .

Expected value of Y: Taking the first derivative with respect to of the log of the density in the exponential family form described above, we have

Then taking the expected value and setting it equal to zero leads to,

Variance of Y: To compute the variance we use the second Bartlett identity,

We have now a relationship between and , namely

and , which allows for a relationship between and the variance,

Note that because , then is invertible. We derive the variance function for a few common distributions.

Example – normal

The normal distribution is a special case where the variance function is a constant. Let then we put the density function of y in the form of the exponential family described above:

where

To calculate the variance function , we first express as a function of . Then we transform into a function of

Therefore, the variance function is constant.

Example – Bernoulli

Let , then we express the density of the Bernoulli distribution in exponential family form,

logit(p), which gives us expit
and
expit

This give us

Example – Poisson

Let , then we express the density of the Poisson distribution in exponential family form,

which gives us
and

This give us

Here we see the central property of Poisson data, that the variance is equal to the mean.

Example – Gamma

The Gamma distribution and density function can be expressed under different parametrizations. We will use the form of the gamma with parameters

Then in exponential family form we have

And we have

Application – weighted least squares

A very important application of the variance function is its use in parameter estimation and inference when the response variable is of the required exponential family form as well as in some cases when it is not (which we will discuss in quasi-likelihood). Weighted least squares (WLS) is a special case of generalized least squares. Each term in the WLS criterion includes a weight that determines that the influence each observation has on the final parameter estimates. As in regular least squares, the goal is to estimate the unknown parameters in the regression function by finding values for parameter estimates that minimize the sum of the squared deviations between the observed responses and the functional portion of the model.

While WLS assumes independence of observations it does not assume equal variance and is therefore a solution for parameter estimation in the presence of heteroscedasticity. The Gauss–Markov theorem and Aitken demonstrate that the best linear unbiased estimator (BLUE), the unbiased estimator with minimum variance, has each weight equal to the reciprocal of the variance of the measurement.

In the GLM framework, our goal is to estimate parameters , where . Therefore, we would like to minimize and if we define the weight matrix W as

where are defined in the previous section, it allows for iteratively reweighted least squares (IRLS) estimation of the parameters. See the section on iteratively reweighted least squares for more derivation and information.

Also, important to note is that when the weight matrix is of the form described here, minimizing the expression also minimizes the Pearson distance. See Distance correlation for more.

The matrix W falls right out of the estimating equations for estimation of . Maximum likelihood estimation for each parameter , requires

, where is the log-likelihood.

Looking at a single observation we have,

This gives us

, and noting that
we have that

The Hessian matrix is determined in a similar manner and can be shown to be,

Noticing that the Fisher Information (FI),

, allows for asymptotic approximation of
, and hence inference can be performed.

Application – quasi-likelihood

Because most features of GLMs only depend on the first two moments of the distribution, rather than the entire distribution, the quasi-likelihood can be developed by just specifying a link function and a variance function. That is, we need to specify

  • the link function,
  • the variance function, , where the

With a specified variance function and link function we can develop, as alternatives to the log-likelihood function, the score function, and the Fisher information, a quasi-likelihood , a quasi-score, and the quasi-information. This allows for full inference of .

Quasi-likelihood (QL)

Though called a quasi-likelihood, this is in fact a quasi-log-likelihood. The QL for one observation is

And therefore the QL for all n observations is

From the QL we have the quasi-score

Quasi-score (QS)

Recall the score function, U, for data with log-likelihood is

We obtain the quasi-score in an identical manner,

Noting that, for one observation the score is

The first two Bartlett equations are satisfied for the quasi-score, namely

and

In addition, the quasi-score is linear in y.

Ultimately the goal is to find information about the parameters of interest . Both the QS and the QL are actually functions of . Recall, , and , therefore,

Quasi-information (QI)

The quasi-information, is similar to the Fisher information,

QL, QS, QI as functions of

The QL, QS and QI all provide the building blocks for inference about the parameters of interest and therefore it is important to express the QL, QS and QI all as functions of .

Recalling again that , we derive the expressions for QL, QS and QI parametrized under .

Quasi-likelihood in ,

The QS as a function of is therefore

Where,

The quasi-information matrix in is,

Obtaining the score function and the information of allows for parameter estimation and inference in a similar manner as described in Application – weighted least squares.

Non-parametric regression analysis

A scattor plot of years in the major league against salary (x$1000). The line is the trend in the mean. The plot demonstrates that the variance is not constant. Years in Major League vs. Salary.png
A scattor plot of years in the major league against salary (x$1000). The line is the trend in the mean. The plot demonstrates that the variance is not constant.
The smoothed conditional variance against the smoothed conditional mean. The quadratic shape is indicative of the Gamma Distribution. The variance function of a Gamma is V(
m
{\displaystyle \mu }
) =
m
2
{\displaystyle \mu ^{2}} Variance Function.png
The smoothed conditional variance against the smoothed conditional mean. The quadratic shape is indicative of the Gamma Distribution. The variance function of a Gamma is V() =

Non-parametric estimation of the variance function and its importance, has been discussed widely in the literature [5] [6] [7] In non-parametric regression analysis, the goal is to express the expected value of your response variable(y) as a function of your predictors (X). That is we are looking to estimate a mean function, without assuming a parametric form. There are many forms of non-parametric smoothing methods to help estimate the function . An interesting approach is to also look at a non-parametric variance function, . A non-parametric variance function allows one to look at the mean function as it relates to the variance function and notice patterns in the data.

An example is detailed in the pictures to the right. The goal of the project was to determine (among other things) whether or not the predictor, number of years in the major leagues (baseball), had an effect on the response, salary, a player made. An initial scatter plot of the data indicates that there is heteroscedasticity in the data as the variance is not constant at each level of the predictor. Because we can visually detect the non-constant variance, it useful now to plot , and look to see if the shape is indicative of any known distribution. One can estimate and using a general smoothing method. The plot of the non-parametric smoothed variance function can give the researcher an idea of the relationship between the variance and the mean. The picture to the right indicates a quadratic relationship between the mean and the variance. As we saw above, the Gamma variance function is quadratic in the mean.

Notes

  1. 1 2 Muller and Zhao (1995). "On a semi parametric variance function model and a test for heteroscedasticity". The Annals of Statistics. 23 (3): 946–967. doi: 10.1214/aos/1176324630 . JSTOR   2242430.
  2. Muller, Stadtmuller and Yao (2006). "Functional Variance Processes". Journal of the American Statistical Association. 101 (475): 1007–1018. doi:10.1198/016214506000000186. JSTOR   27590778. S2CID   13712496.
  3. Wedderburn, R.W.M. (1974). "Quasi-likelihood functions, generalized linear models, and the Gauss–Newton Method". Biometrika. 61 (3): 439–447. doi:10.1093/biomet/61.3.439. JSTOR   2334725.
  4. McCullagh, Peter; Nelder, John (1989). Generalized Linear Models (second ed.). London: Chapman and Hall. ISBN   0-412-31760-5.
  5. Muller and StadtMuller (1987). "Estimation of Heteroscedasticity in Regression Analysis". The Annals of Statistics. 15 (2): 610–625. doi: 10.1214/aos/1176350364 . JSTOR   2241329.
  6. Cai and Wang, T.; Wang, Lie (2008). "Adaptive Variance Function Estimation in Heteroscedastic Nonparametric Regression". The Annals of Statistics. 36 (5): 2025–2054. arXiv: 0810.4780 . Bibcode:2008arXiv0810.4780C. doi:10.1214/07-AOS509. JSTOR   2546470. S2CID   9184727.
  7. Rice and Silverman (1991). "Estimating the Mean and Covariance structure nonparametrically when the data are curves". Journal of the Royal Statistical Society. 53 (1): 233–243. JSTOR   2345738.

Related Research Articles

<span class="mw-page-title-main">Normal distribution</span> Probability distribution

In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

<span class="mw-page-title-main">Beta distribution</span> Probability distribution

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] or in terms of two positive parameters, denoted by alpha (α) and beta (β), that appear as exponents of the variable and its complement to 1, respectively, and control the shape of the distribution.

<span class="mw-page-title-main">Gamma distribution</span> Probability distribution

In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution. There are two equivalent parameterizations in common use:

  1. With a shape parameter and a scale parameter .
  2. With a shape parameter and an inverse scale parameter , called a rate parameter.
<span class="mw-page-title-main">Cramér–Rao bound</span> Lower bound on variance of an estimator

In estimation theory and statistics, the Cramér–Rao bound (CRB) relates to estimation of a deterministic parameter. The result is named in honor of Harald Cramér and C. R. Rao, but has also been derived independently by Maurice Fréchet, Georges Darmois, and by Alexander Aitken and Harold Silverstone. It states that the precision of any unbiased estimator is at most the Fisher information; or (equivalently) the reciprocal of the Fisher information is a lower bound on its variance.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

<span class="mw-page-title-main">Stable distribution</span> Distribution of variables which satisfies a stability property under linear combinations

In probability theory, a distribution is said to be stable if a linear combination of two independent random variables with this distribution has the same distribution, up to location and scale parameters. A random variable is said to be stable if its distribution is stable. The stable distribution family is also sometimes referred to as the Lévy alpha-stable distribution, after Paul Lévy, the first mathematician to have studied it.

<span class="mw-page-title-main">Consistent estimator</span> Statistical estimator converging in probability to a true parameter as sample size increases

In statistics, a consistent estimator or asymptotically consistent estimator is an estimator—a rule for computing estimates of a parameter θ0—having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to θ0. This means that the distributions of the estimates become more and more concentrated near the true value of the parameter being estimated, so that the probability of the estimator being arbitrarily close to θ0 converges to one.

In physics, spherically symmetric spacetimes are commonly used to obtain analytic and numerical solutions to Einstein's field equations in the presence of radially moving matter or energy. Because spherically symmetric spacetimes are by definition irrotational, they are not realistic models of black holes in nature. However, their metrics are considerably simpler than those of rotating spacetimes, making them much easier to analyze.

<span class="mw-page-title-main">Ornstein–Uhlenbeck process</span> Stochastic process modeling random walk with friction

In mathematics, the Ornstein–Uhlenbeck process is a stochastic process with applications in financial mathematics and the physical sciences. Its original application in physics was as a model for the velocity of a massive Brownian particle under the influence of friction. It is named after Leonard Ornstein and George Eugene Uhlenbeck.

A ratio distribution is a probability distribution constructed as the distribution of the ratio of random variables having two other known distributions. Given two random variables X and Y, the distribution of the random variable Z that is formed as the ratio Z = X/Y is a ratio distribution.

The term generalized logistic distribution is used as the name for several different families of probability distributions. For example, Johnson et al. list four forms, which are listed below.

A product distribution is a probability distribution constructed as the distribution of the product of random variables having two other known distributions. Given two statistically independent random variables X and Y, the distribution of the random variable Z that is formed as the product is a product distribution.

For many paramagnetic materials, the magnetization of the material is directly proportional to an applied magnetic field, for sufficiently high temperatures and small fields. However, if the material is heated, this proportionality is reduced. For a fixed value of the field, the magnetic susceptibility is inversely proportional to temperature, that is

<span class="mw-page-title-main">Hermite distribution</span>

In probability theory and statistics, the Hermite distribution, named after Charles Hermite, is a discrete probability distribution used to model count data with more than one parameter. This distribution is flexible in terms of its ability to allow a moderate over-dispersion in the data.

Bayesian hierarchical modelling is a statistical model written in multiple levels that estimates the parameters of the posterior distribution using the Bayesian method. The sub-models combine to form the hierarchical model, and Bayes' theorem is used to integrate them with the observed data and account for all the uncertainty that is present. The result of this integration is the posterior distribution, also known as the updated probability estimate, as additional evidence on the prior distribution is acquired.

The generalized functional linear model (GFLM) is an extension of the generalized linear model (GLM) that allows one to regress univariate responses of various types on functional predictors, which are mostly random trajectories generated by a square-integrable stochastic processes. Similarly to GLM, a link function relates the expected value of the response variable to a linear predictor, which in case of GFLM is obtained by forming the scalar product of the random predictor function with a smooth parameter function . Functional Linear Regression, Functional Poisson Regression and Functional Binomial Regression, with the important Functional Logistic Regression included, are special cases of GFLM. Applications of GFLM include classification and discrimination of stochastic processes and functional data.

<span class="mw-page-title-main">Batch normalization</span> Method used to make artificial neural networks faster and stable by re-centering and re-scaling

Batch normalization is a method used to make training of artificial neural networks faster and more stable through normalization of the layers' inputs by re-centering and re-scaling. It was proposed by Sergey Ioffe and Christian Szegedy in 2015.

<span class="mw-page-title-main">Variational autoencoder</span> Deep learning generative model to encode data representation

In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It is part of the families of probabilistic graphical models and variational Bayesian methods.

References