Generalized linear mixed model

In statistics, a generalized linear mixed model (GLMM) is an extension of the generalized linear model (GLM) in which the linear predictor contains random effects in addition to the usual fixed effects. [1] [2] [3] GLMMs also inherit from generalized linear models the idea of extending linear mixed models to non-normal data.

Generalized linear mixed models provide a broad range of models for the analysis of grouped data, since the differences between groups can be modelled as a random effect. These models are useful in the analysis of many kinds of data, including longitudinal data. [4]

Model

Generalized linear mixed models are generally defined such that, conditioned on the random effects $u$, the dependent variable $y$ is distributed according to the exponential family with its expectation related to the linear predictor $X\beta + Zu$ via a link function $g$:

$$E[y \mid u] = g^{-1}(X\beta + Zu).$$

Here $X$ and $\beta$ are the fixed-effects design matrix and the fixed effects, respectively; $Z$ and $u$ are the random-effects design matrix and the random effects, respectively. To understand this very brief definition, it helps to first understand the definitions of a generalized linear model and of a mixed model.
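As a rough illustration of this definition, the following sketch simulates binary data from a GLMM with a logit link and a single normally distributed random intercept per group. The function name, parameter values, and group sizes are hypothetical choices made for the example, not part of the definition.

```python
import numpy as np

def simulate_logistic_glmm(n_groups=50, n_per_group=10,
                           beta=(-0.5, 1.0), sigma_u=1.0, seed=0):
    """Simulate y | u ~ Bernoulli(g^{-1}(X beta + Z u)) with a logit link g.

    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n = n_groups * n_per_group
    group = np.repeat(np.arange(n_groups), n_per_group)

    # Fixed-effects design matrix X: intercept plus one covariate.
    X = np.column_stack([np.ones(n), rng.normal(size=n)])

    # Random-effects design matrix Z: one indicator column per group.
    Z = np.zeros((n, n_groups))
    Z[np.arange(n), group] = 1.0

    # Random effects u: one normally distributed intercept per group.
    u = rng.normal(scale=sigma_u, size=n_groups)

    # Linear predictor and conditional mean via the inverse link.
    eta = X @ np.asarray(beta) + Z @ u
    mu = 1.0 / (1.0 + np.exp(-eta))

    # Conditional on u, the responses are independent Bernoulli draws.
    y = rng.binomial(1, mu)
    return y, X, Z, u, group
```

Observations in the same group share one entry of $u$, which is what induces within-group correlation in the simulated responses.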

Generalized linear mixed models are a special case of hierarchical generalized linear models in which the random effects are normally distributed.

The complete likelihood [5]

$$\ln p(y) = \ln \int p(y \mid u)\, p(u)\, du$$

has no general closed form, and integrating over the random effects is usually extremely computationally intensive. In addition to numerically approximating this integral (e.g. via Gauss–Hermite quadrature), methods motivated by Laplace approximation have been proposed. [6] For example, the penalized quasi-likelihood method, which essentially involves repeatedly fitting a weighted normal mixed model with a working variate (making it doubly iterative), [7] is implemented by various commercial and open source statistical programs.
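To make the integration over the random effects concrete, here is a minimal sketch of Gauss–Hermite quadrature for the simplest case: a logistic model with one normal random intercept per group, so that each group contributes a one-dimensional integral. The function name `marginal_loglik` and its arguments are assumptions made for this illustration; models with several correlated random effects need multidimensional integration and are not covered.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def marginal_loglik(y, X, group, beta, sigma_u, n_nodes=20):
    """Approximate the marginal log-likelihood of a random-intercept
    logistic GLMM by Gauss-Hermite quadrature (illustrative sketch)."""
    # Nodes and weights for integrals of the form int exp(-x^2) f(x) dx.
    nodes, weights = hermgauss(n_nodes)

    # Change of variables u = sqrt(2) * sigma_u * x turns the N(0, sigma_u^2)
    # density into exp(-x^2) / sqrt(pi).
    u_k = np.sqrt(2.0) * sigma_u * nodes
    log_w = np.log(weights) - 0.5 * np.log(np.pi)

    eta_fixed = X @ beta
    total = 0.0
    for g in np.unique(group):
        idx = group == g
        # Linear predictor for every observation in the group at every node.
        eta = eta_fixed[idx, None] + u_k[None, :]
        # Bernoulli log-likelihood given u: y * eta - log(1 + exp(eta)).
        loglik_k = (y[idx, None] * eta - np.logaddexp(0.0, eta)).sum(axis=0)
        # Log of the weighted sum over nodes, computed stably.
        total += np.logaddexp.reduce(loglik_k + log_w)
    return total
```

The cost grows with the number of nodes per group and, for models with several random effects per group, exponentially with the dimension of the integral, which is one reason approximations such as Laplace-based methods are used.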

Fitting a model

Fitting generalized linear mixed models via maximum likelihood, and evaluating likelihood-based criteria such as the Akaike information criterion (AIC), involves integrating over the random effects. In general, those integrals cannot be expressed in analytical form. Various approximate methods have been developed, but none has good properties for all possible models and data sets (e.g. ungrouped binary data are particularly problematic). For this reason, methods involving numerical quadrature or Markov chain Monte Carlo have increased in use, as increasing computing power and advances in methods have made them more practical.
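As a hedged sketch of what maximum likelihood fitting by numerical quadrature can look like, the code below maximizes a Gauss–Hermite approximation to the marginal likelihood of a random-intercept logistic model with a general-purpose optimizer from SciPy. The function names, the log-parameterization of the random-effect standard deviation, and the choice of BFGS are assumptions made for illustration; dedicated GLMM software uses more refined algorithms.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.optimize import minimize

def neg_marginal_loglik(params, y, X, group, n_nodes=20):
    """Negative quadrature-approximated marginal log-likelihood.

    params holds the fixed effects followed by log(sigma_u); the log scale
    keeps the standard deviation positive during optimization.
    """
    beta, sigma_u = params[:-1], np.exp(params[-1])
    nodes, weights = hermgauss(n_nodes)
    u_k = np.sqrt(2.0) * sigma_u * nodes
    log_w = np.log(weights) - 0.5 * np.log(np.pi)
    eta_fixed = X @ beta
    total = 0.0
    for g in np.unique(group):
        idx = group == g
        eta = eta_fixed[idx, None] + u_k[None, :]
        loglik_k = (y[idx, None] * eta - np.logaddexp(0.0, eta)).sum(axis=0)
        total += np.logaddexp.reduce(loglik_k + log_w)
    return -total

def fit_random_intercept_logit(y, X, group, n_nodes=20):
    """Return (beta_hat, sigma_u_hat, maximized log-likelihood)."""
    start = np.zeros(X.shape[1] + 1)  # beta = 0 and sigma_u = 1
    result = minimize(neg_marginal_loglik, start,
                      args=(y, X, group, n_nodes), method="BFGS")
    return result.x[:-1], np.exp(result.x[-1]), -result.fun
```

Here `y` is a 0/1 array, `X` the fixed-effects design matrix, and `group` an integer array of group labels. The returned log-likelihood is only as accurate as the quadrature rule, so the number of nodes is worth varying as a sanity check.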

The Akaike information criterion is a common criterion for model selection. Estimates of the conditional Akaike information for generalized linear mixed models based on certain exponential family distributions have been obtained. [8]

Software

See also

References

  1. Breslow, N. E.; Clayton, D. G. (1993), "Approximate Inference in Generalized Linear Mixed Models", Journal of the American Statistical Association, 88 (421): 9–25, doi:10.2307/2290687, JSTOR 2290687
  2. Stroup, W.W. (2012), Generalized Linear Mixed Models, CRC Press
  3. Jiang, J. (2007), Linear and Generalized Linear Mixed Models and Their Applications, Springer
  4. Fitzmaurice, G. M.; Laird, N. M.; Ware, J. H. (2011), Applied Longitudinal Analysis (2nd ed.), John Wiley & Sons, ISBN 978-0-471-21487-8
  5. Pawitan, Yudi. In All Likelihood: Statistical Modelling and Inference Using Likelihood (Paperback ed.). OUP Oxford. p. 459. ISBN 978-0199671229.
  6. Breslow, N. E.; Clayton, D. G. (20 December 2012). "Approximate Inference in Generalized Linear Mixed Models". Journal of the American Statistical Association. 88 (421): 9–25. doi:10.1080/01621459.1993.10594284.
  7. Wolfinger, Russ; O'Connell, Michael (December 1993). "Generalized linear mixed models: a pseudo-likelihood approach". Journal of Statistical Computation and Simulation. 48 (3–4): 233–243. doi:10.1080/00949659308811554.
  8. Saefken, B.; Kneib, T.; van Waveren, C.-S.; Greven, S. (2014), "A unifying approach to the estimation of the conditional Akaike information in generalized linear mixed models", Electronic Journal of Statistics, 8: 201–225, doi:10.1214/14-EJS881
  9. Pinheiro, J. C.; Bates, D. M. (2000), Mixed-effects models in S and S-PLUS, Springer, New York
  10. Berridge, D. M.; Crouchley, R. (2011), Multivariate Generalized Linear Mixed Models Using R, CRC Press
  11. "lme4 package - RDocumentation". www.rdocumentation.org. Retrieved 15 September 2022.
  12. "glmm package - RDocumentation". www.rdocumentation.org. Retrieved 15 September 2022.
  13. "IBM Knowledge Center". www.ibm.com. Retrieved 6 December 2017.
  14. "Statsmodels Documentation". www.statsmodels.org. Retrieved 17 March 2021.
  15. "Details of the parameter estimation · MixedModels". juliastats.org. Retrieved 16 June 2021.
  16. Installing, loading and citing the package, retrieved 2022-08-24