Generalized estimating equation

In statistics, a generalized estimating equation (GEE) is used to estimate the parameters of a generalized linear model with a possible unmeasured correlation between observations from different timepoints. [1] [2] Although GEEs are sometimes described as robust to any choice of working correlation matrix, this robustness extends only to consistency: the regression coefficient estimates remain consistent under a misspecified working correlation, while efficiency may be lost.

Regression beta coefficient estimates from the Liang-Zeger GEE are consistent and asymptotically normal even when the working correlation is misspecified, under mild regularity conditions. GEE is more efficient than an ordinary generalized linear model (GLM), which ignores within-subject correlation, in the presence of high autocorrelation. [1] When the true working correlation is known, consistency does not require the assumption that missing data are missing completely at random. [1] Huber-White standard errors improve the efficiency of the Liang-Zeger GEE in the absence of serial autocorrelation but may remove its marginal interpretation. GEE estimates the average response over the population ("population-averaged" effects) when paired with Liang-Zeger standard errors, and individual-level effects when paired with Huber-White standard errors, also known as "robust standard error" or "sandwich variance" estimates. [3] Based on a limited literature review, Huber-White GEE has been in use since at least 1997, while Liang-Zeger GEE dates to the 1980s. [4] Several independent formulations of these standard error estimators contribute to GEE theory; placing them all under the umbrella term "GEE" may exemplify abuse of terminology.
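
For reference, the robust "sandwich" variance estimator discussed above takes the following standard textbook form, stated here in the notation of the Formulation section below with $D_i = \partial \mu_i / \partial \beta$; this is a generic presentation rather than a formula taken from the sources cited in this article:

$$\widehat{\operatorname{Cov}}(\hat{\beta}) = B^{-1} M B^{-1}, \qquad B = \sum_{i=1}^{N} D_i^{\top} V_i^{-1} D_i, \qquad M = \sum_{i=1}^{N} D_i^{\top} V_i^{-1} \hat{e}_i \hat{e}_i^{\top} V_i^{-1} D_i,$$

where $\hat{e}_i = Y_i - \mu_i(\hat{\beta})$ is the residual vector for subject $i$. The outer "bread" $B^{-1}$ comes from the model-based Hessian, while the inner "meat" $M$ uses the empirical residual covariance; it is this empirical middle term that makes the estimator robust to misspecification of $V_i$.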

GEEs belong to a class of regression techniques referred to as semiparametric because they rely on specification of only the first two moments. They are a popular alternative to the likelihood-based generalized linear mixed model, which is more sensitive to misspecification of the variance structure. [5] The trade-off for retaining consistent regression coefficient estimates under a misspecified variance structure is a loss of efficiency: the standard errors are larger than those of the optimal estimator, which inflates Wald test p-values. [6] GEEs are commonly used in large epidemiological studies, especially multi-site cohort studies, because they can handle many types of unmeasured dependence between outcomes.

Formulation

Given a mean model $\mu_{ij}$ for subject $i$ at time $j$ that depends upon regression parameters $\beta$, and a variance structure $V_i$, the estimating equation is formed via: [7]

$$U(\beta) = \sum_{i=1}^{N} \frac{\partial \mu_i}{\partial \beta} V_i^{-1} \{ Y_i - \mu_i(\beta) \}$$

where $\mu_i$ and $Y_i$ stack, respectively, the means $\mu_{ij}$ and the outcomes for subject $i$.

The parameters $\beta$ are estimated by solving $U(\beta) = 0$, typically via the Newton–Raphson algorithm. The variance structure is chosen to improve the efficiency of the parameter estimates. The Hessian of the solution to the GEEs in the parameter space can be used to calculate robust standard error estimates. The term "variance structure" refers to the algebraic form of the covariance matrix between outcomes, $Y$, in the sample. Examples of variance structure specifications include independence, exchangeable, autoregressive, stationary m-dependent, and unstructured. The most popular form of inference on GEE regression parameters is the Wald test using naive or robust standard errors, though the score test is also valid and preferable when it is difficult to obtain estimates of information under the alternative hypothesis. The likelihood ratio test is not valid in this setting because the estimating equations are not necessarily likelihood equations. Model selection can be performed with the GEE equivalent of the Akaike information criterion (AIC), the quasi-likelihood under the independence model criterion (QIC). [8]
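
As a concrete illustration of the workflow this section describes (choose a working correlation, solve the estimating equations, then inspect robust standard errors and QIC), here is a minimal sketch using Python's statsmodels, one of the implementations listed under Computation below. The simulated dataset, column names, and Poisson family are illustrative assumptions, not part of the article:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical longitudinal data: 100 subjects observed at 4 timepoints,
# with a shared per-subject effect inducing within-subject correlation.
rng = np.random.default_rng(0)
n_subj, n_time = 100, 4
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_subj), n_time),
    "time": np.tile(np.arange(n_time), n_subj),
    "x": rng.normal(size=n_subj * n_time),
})
subj_effect = rng.normal(scale=0.3, size=n_subj)
df["y"] = rng.poisson(np.exp(0.5 + 0.3 * df["x"] + np.repeat(subj_effect, n_time)))

# Exchangeable working correlation: every pair of observations within a
# subject is assumed to share one common correlation parameter.
model = sm.GEE.from_formula(
    "y ~ x + time",
    groups="id",
    data=df,
    family=sm.families.Poisson(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()   # sandwich ("robust") standard errors by default
print(result.summary())
print(result.qic())    # QIC and QICu criteria for model selection, as in [8]
```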

Relationship with Generalized Method of Moments

The generalized estimating equation is a special case of the generalized method of moments (GMM). [9] This relationship is apparent from the requirement that the score-like function satisfy the moment condition

$$\operatorname{E}\big[ U(\beta) \big] = 0$$

at the true parameter value, which is exactly a GMM moment condition.

Computation

Software for solving generalized estimating equations is available in MATLAB, [10] SAS (proc genmod [11] ), SPSS (the gee procedure [12] ), Stata (the xtgee command [13] ), R (packages glmtoolbox, [14] gee, [15] geepack [16] and multgee [17] ), Julia (package GEE.jl [18] ) and Python (package statsmodels [19] ).
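
For instance, continuing the hypothetical statsmodels sketch from the Formulation section, the model-based ("naive") and sandwich ("robust") standard errors can be compared directly, making the distinction drawn earlier between the two estimators concrete:

```python
# Continues the sketch from the Formulation section (`model`, pandas as pd).
# Refit the same hypothetical GEE under both covariance estimators.
res_robust = model.fit(cov_type="robust")  # sandwich / Huber-White
res_naive = model.fit(cov_type="naive")    # model-based
print(pd.DataFrame({"robust SE": res_robust.bse,
                    "naive SE": res_naive.bse}))
```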

Comparisons among software packages for the analysis of binary correlated data [20] [21] and ordinal correlated data [22] via GEE are available.

References

  1. Kung-Yee Liang; Scott Zeger (1986). "Longitudinal data analysis using generalized linear models". Biometrika. 73 (1): 13–22. doi:10.1093/biomet/73.1.13.
  2. Hardin, James; Hilbe, Joseph (2003). Generalized Estimating Equations . London: Chapman and Hall/CRC. ISBN   978-1-58488-307-4.
  3. Abadie, Alberto; Athey, Susan; Imbens, Guido W; Wooldridge, Jeffrey M (October 2022). "When Should You Adjust Standard Errors for Clustering?". The Quarterly Journal of Economics. 138 (1): 1–35. arXiv: 1710.02926 . doi:10.1093/qje/qjac038.
  4. Wolfe, Frederick; Anderson, Janice; Harkness, Deborah; Bennett, Robert M.; Caro, Xavier J.; Goldenberg, Don L.; Russell, I. Jon; Yunus, Muhammad B. (1997). "A prospective, longitudinal, multicenter study of service utilization and costs in fibromyalgia". Arthritis & Rheumatism. 40 (9): 1560–1570. doi:10.1002/art.1780400904. PMID   9324009.
  5. Fong, Y; Rue, H; Wakefield, J (2010). "Bayesian inference for generalized linear mixed models". Biostatistics. 11 (3): 397–412. doi:10.1093/biostatistics/kxp053. PMC   2883299 . PMID   19966070.
  6. O'Brien, Liam M.; Fitzmaurice, Garrett M.; Horton, Nicholas J. (October 2006). "Maximum Likelihood Estimation of Marginal Pairwise Associations with Multiple Source Predictors". Biometrical Journal. 48 (5): 860–875. doi:10.1002/bimj.200510227. ISSN   0323-3847. PMC   1764610 . PMID   17094349.
  7. Diggle, Peter J.; Patrick Heagerty; Kung-Yee Liang; Scott L. Zeger (2002). Analysis of Longitudinal Data. Oxford Statistical Science Series. ISBN   978-0-19-852484-7.
  8. Pan, W. (2001), "Akaike's information criterion in generalized estimating equations", Biometrics , 57 (1): 120–125, doi:10.1111/j.0006-341X.2001.00120.x, PMID   11252586, S2CID   7862441 .
  9. Breitung, Jörg; Chaganty, N. Rao; Daniel, Rhian M.; Kenward, Michael G.; Lechner, Michael; Martus, Peter; Sabo, Roy T.; Wang, You-Gan; Zorn, Christopher (2010). "Discussion of 'Generalized Estimating Equations: Notes on the Choice of the Working Correlation Matrix'". Methods of Information in Medicine. 49 (5): 426–432. doi:10.1055/s-0038-1625133. S2CID   3213776.
  10. Sarah J. Ratcliffe; Justine Shults (2008). "GEEQBOX: A MATLAB Toolbox for Generalized Estimating Equations and Quasi-Least Squares". Journal of Statistical Software. 25 (14): 1–14.
  11. "The GENMOD Procedure". The SAS Institute.
  12. "IBM SPSS Advanced Statistics". IBM SPSS website. 5 April 2024.
  13. "Stata's implementation of GEE" (PDF). Stata website.
  14. "glmtoolbox: Set of Tools to Data Analysis using Generalized Linear Models". CRAN. 10 October 2023.
  15. "gee: Generalized Estimation Equation solver". CRAN. 7 November 2019.
  16. geepack: Generalized Estimating Equation Package, CRAN, 18 December 2020.
  17. multgee: GEE solver for correlated nominal or ordinal multinomial responses using a local odds ratios parameterization, CRAN, 13 May 2021.
  18. Shedden, Kerby (23 June 2022). "Generalized Estimating Equations in Julia". GitHub. Retrieved 24 June 2022.
  19. "Generalized Estimating Equations — statsmodels".
  20. Andreas Ziegler; Ulrike Grömping (1998). "The generalised estimating equations: a comparison of procedures available in commercial statistical software packages". Biometrical Journal. 40 (3): 245–260. doi:10.1002/(sici)1521-4036(199807)40:3<245::aid-bimj245>3.0.co;2-n.
  21. Nicholas J. Horton; Stuart R. Lipsitz (1999). "Review of software to fit generalized estimating equation regression models". The American Statistician. 53 (2): 160–169. CiteSeerX 10.1.1.22.9325. doi:10.1080/00031305.1999.10474451.
  22. Nazanin Nooraee; Geert Molenberghs; Edwin R. van den Heuvel (2014). "GEE for longitudinal ordinal data: Comparing R-geepack, R-multgee, R-repolr, SAS-GENMOD, SPSS-GENLIN" (PDF). Computational Statistics & Data Analysis. 77: 70–83. doi:10.1016/j.csda.2014.03.009. S2CID   15063953.
