Generalized additive model for location, scale and shape

Last updated

The Generalized Additive Model for Location, Scale and Shape (GAMLSS) is an approach to statistical modelling and learning. GAMLSS is a modern distribution-based approach to (semiparametric) regression. A parametric distribution is assumed for the response (target) variable but the parameters of this distribution can vary according to explanatory variables using linear, nonlinear or smooth functions. In machine learning parlance, GAMLSS is a form of supervised machine learning.

Contents

In particular, the GAMLSS statistical framework enables flexible regression and smoothing models to be fitted to the data. The GAMLSS model assumes the response variable has any parametric distribution which might be heavy or light-tailed, and positively or negatively skewed. In addition, all the parameters of the distribution [location (e.g., mean), scale (e.g., variance) and shape (skewness and kurtosis)] can be modeled as linear, nonlinear or smooth functions of explanatory variables.

Overview of the model

The generalized additive model for location, scale and shape (GAMLSS) is a statistical model developed by Rigby and Stasinopoulos (and later expanded) to overcome some of the limitations associated with the popular generalized linear models (GLMs) and generalized additive models (GAMs). For an overview of these limitations see Nelder and Wedderburn (1972) [1] and Hastie's and Tibshirani's book. [2]

In GAMLSS the exponential family distribution assumption for the response variable, (), (essential in GLMs and GAMs), is relaxed and replaced by a general distribution family, including highly skew and/or kurtotic continuous and discrete distributions.

The systematic part of the model is expanded to allow modeling not only of the mean (or location) but other parameters of the distribution of y as linear and/or nonlinear, parametric and/or additive non-parametric functions of explanatory variables and/or random effects.

GAMLSS is especially suited for modelling a leptokurtic or platykurtic and/or positively or negatively skewed response variable. For count type response variable data it deals with over-dispersion by using proper over-dispersed discrete distributions. Heterogeneity also is dealt with by modeling the scale or shape parameters using explanatory variables. There are several packages written in R related to GAMLSS models, [3] and tutorials for using and interpreting GAMLSS. [4]

A GAMLSS model assumes independent observations for with probability (density) function conditional on a vector of four distribution parameters, each of which can be a function of the explanatory variables. The first two population distribution parameters and are usually characterized as location and scale parameters, while the remaining parameter(s), if any, are characterized as shape parameters, e.g. skewness and kurtosis parameters, although the model may be applied more generally to the parameters of any population distribution with up to four distribution parameters, and can be generalized to more than four distribution parameters.

where μ, σ, ν, τ and are vectors of length , is a parameter vector of length , is a fixed known design matrix of order and is a smooth non-parametric function of explanatory variable , and . are link functions.

For centile estimation the WHO Multicentre Growth Reference Study Group have recommended GAMLSS and the Box–Cox power exponential (BCPE) distributions [5] for the construction of the WHO Child Growth Standards. [6] [7]

What distributions can be used

The form of the distribution assumed for the response variable y, is very general. For example, an implementation of GAMLSS in R [8] has around 100 different distributions available. Such implementations also allow use of truncated distributions and censored (or interval) response variables. [8]

Related Research Articles

<span class="mw-page-title-main">Least squares</span> Approximation method in statistics

The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems by minimizing the sum of the squares of the residuals made in the results of each individual equation.

<span class="mw-page-title-main">Logistic regression</span> Statistical model for a binary dependent variable

In statistics, the logistic model is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression analysis, logistic regression is estimating the parameters of a logistic model. Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman–Darmois family. The terms "distribution" and "family" are often used loosely: specifically, an exponential family is a set of distributions, where the specific distribution varies with the parameter; however, a parametric family of distributions is often referred to as "a distribution", and the set of all exponential families is sometimes loosely referred to as "the" exponential family. They are distinct because they possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

<span class="mw-page-title-main">Regression analysis</span> Set of statistical processes for estimating the relationships among variables

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. The most common form of regression analysis is linear regression, in which one finds the line that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line that minimizes the sum of squared differences between the true data and that line. For specific mathematical reasons, this allows the researcher to estimate the conditional expectation of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters or estimate the conditional expectation across a broader collection of non-linear models.

The general linear model or general multivariate regression model is a compact way of simultaneously writing several multiple linear regression models. In that sense it is not a separate statistical linear model. The various multiple linear regression models may be compactly written as

In probability theory and statistics, the generalized extreme value (GEV) distribution is a family of continuous probability distributions developed within extreme value theory to combine the Gumbel, Fréchet and Weibull families also known as type I, II and III extreme value distributions. By the extreme value theorem the GEV distribution is the only possible limit distribution of properly normalized maxima of a sequence of independent and identically distributed random variables. Note that a limit distribution needs to exist, which requires regularity conditions on the tail of the distribution. Despite this, the GEV distribution is often used as an approximation to model the maxima of long (finite) sequences of random variables.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the input dataset and the output of the (linear) function of the independent variable.

<span class="mw-page-title-main">Beta prime distribution</span> Probability distribution

In probability theory and statistics, the beta prime distribution is an absolutely continuous probability distribution. If has a beta distribution, then the odds has a beta prime distribution.

In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear response variable depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.

<span class="mw-page-title-main">Log-logistic distribution</span>

In probability and statistics, the log-logistic distribution is a continuous probability distribution for a non-negative random variable. It is used in survival analysis as a parametric model for events whose rate increases initially and decreases later, as, for example, mortality rate from cancer following diagnosis or treatment. It has also been used in hydrology to model stream flow and precipitation, in economics as a simple model of the distribution of wealth or income, and in networking to model the transmission times of data considering both the network and the software.

The generalized normal distribution or generalized Gaussian distribution (GGD) is either of two families of parametric continuous probability distributions on the real line. Both families add a shape parameter to the normal distribution. To distinguish the two families, they are referred to below as "symmetric" and "asymmetric"; however, this is not a standard nomenclature.

<span class="mw-page-title-main">Errors-in-variables models</span> Regression models accounting for possible errors in independent variables

In statistics, errors-in-variables models or measurement error models are regression models that account for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.

In statistics, projection pursuit regression (PPR) is a statistical model developed by Jerome H. Friedman and Werner Stuetzle which is an extension of additive models. This model adapts the additive models in that it first projects the data matrix of explanatory variables in the optimal direction before applying smoothing functions to these explanatory variables.

In statistics, the variance function is a smooth function that depicts the variance of a random quantity as a function of its mean. The variance function is a measure of heteroscedasticity and plays a large role in many settings of statistical modelling. It is a main ingredient in the generalized linear model framework and a tool used in non-parametric regression, semiparametric regression and functional data analysis. In parametric modeling, variance functions take on a parametric form and explicitly describe the relationship between the variance and the mean of a random quantity. In a non-parametric setting, the variance function is assumed to be a smooth function.

The generalized functional linear model (GFLM) is an extension of the generalized linear model (GLM) that allows one to regress univariate responses of various types on functional predictors, which are mostly random trajectories generated by a square-integrable stochastic processes. Similarly to GLM, a link function relates the expected value of the response variable to a linear predictor, which in case of GFLM is obtained by forming the scalar product of the random predictor function with a smooth parameter function . Functional Linear Regression, Functional Poisson Regression and Functional Binomial Regression, with the important Functional Logistic Regression included, are special cases of GFLM. Applications of GFLM include classification and discrimination of stochastic processes and functional data.

In statistics and econometrics, the maximum score estimator is a nonparametric estimator for discrete choice models developed by Charles Manski in 1975. Unlike the multinomial probit and multinomial logit estimators, it makes no assumptions about the distribution of the unobservable part of utility. However, its statistical properties are more complicated than the multinomial probit and logit models, making statistical inference difficult. To address these issues, Joel Horowitz proposed a variant, called the smoothed maximum score estimator.

In statistics, linear regression is a linear approach for modelling a predictive relationship between a scalar response and one or more explanatory variables, which are measured without error. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.If the explanatory variables are measured with error then errors-in-variables models are required, also known as measurement error models.

In statistics, the class of vector generalized linear models (VGLMs) was proposed to enlarge the scope of models catered for by generalized linear models (GLMs). In particular, VGLMs allow for response variables outside the classical exponential family and for more than one parameter. Each parameter can be transformed by a link function. The VGLM framework is also large enough to naturally accommodate multiple responses; these are several independent responses each coming from a particular statistical distribution with possibly different parameter values.

An additive process, in probability theory, is a cadlag, continuous in probability stochastic process with independent increments. An additive process is the generalization of a Lévy process. An example of an additive process that is not a Lévy process is a Brownian motion with a time-dependent drift. The additive process was introduced by Paul Lévy in 1937.

Nonlinear mixed-effects models constitute a class of statistical models generalizing linear mixed-effects models. Like linear mixed-effects models, they are particularly useful in settings where there are multiple measurements within the same statistical units or when there are dependencies between measurements on related statistical units. Nonlinear mixed-effects models are applied in many fields including medicine, public health, pharmacology, and ecology.

References

  1. Nelder, J.A.; Wedderburn, R.W.M (1972). "Generalized linear models". J. R. Stat. Soc. A. 135 (3): 370–384. doi:10.2307/2344614. JSTOR   2344614.
  2. Hastie, TJ; Tibshirani, RJ (1990). Generalized additive models. London: Chapman and Hall.
  3. Stasinopoulos, D. Mikis; Rigby, Robert A (December 2007). "Generalized additive models for location scale and shape (GAMLSS) in R". Journal of Statistical Software. 23 (7). doi: 10.18637/jss.v023.i07 .
  4. David, Bann; Liam, Wright; Tim J, Cole (2022). "Risk factors relate to the variability of health outcomes as well as the mean: A GAMLSS tutorial". eLife. 11 (11). doi: 10.7554/eLife.72357 . PMC   8791632 . PMID   34985412.
  5. Rigby, Robert; Stasinopoulos, D. Mikis (February 2004). "Smooth Centile Curves for Skew and Kurtotic data Modelled Using the Box–Cox Power Exponential Distribution". Statistics in Medicine. 23 (19): 3053–3076. doi:10.1002/sim.1861. PMID   15351960.
  6. Borghi, E.; De Onis, M.; Garza, C.; Van Den Broeck, J.; Frongillo, E. A.; Grummer-Strawn, L.; Van Buuren, S.; Pan, H.; Molinari, L.; Martorell, R.; Onyango, A. W.; Martines, J. C.; WHO Multicentre Growth Reference Study Group (2006). "Construction of the World Health Organization child growth standards: Selection of methods for attained growth curves". Statistics in Medicine. 25 (2): 247–265. doi:10.1002/sim.2227. PMID   16143968.
  7. WHO Multicentre Growth Reference Study Group (2006) WHO Child Growth Standards: Length/height-for-age, weight-for-age, weight-for-length, weight-for-height and body mass index-for-age: Methods and development. Geneva: World Health Organization.
  8. 1 2 "The R packages | gamlss". The R packages | gamlss. Retrieved 4 May 2020.

Further reading