Quasi-variance

Quasi-variance (QV) estimates are a statistical device for communicating the effects of a categorical explanatory variable within a statistical model. In standard statistical models the effects of a categorical explanatory variable are assessed by setting one category (or level) as a benchmark against which all other categories are compared. The benchmark category is usually referred to as the 'reference' or 'base' category. So that comparisons can be made, the coefficient of the reference category is arbitrarily fixed at zero. Statistical software then formally tests whether each other level of the categorical variable differs from the reference category, and these tests generate the well-known 'significance values' of the parameter estimates (i.e., coefficients). Whilst it is straightforward to compare any one category with the reference category, it is more difficult to formally compare two other categories (or levels) of an explanatory variable with each other when neither is the reference category. This is known as the reference category problem.
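To see why this is a problem, write b_j and b_k for the estimated coefficients of two non-reference categories j and k (notation introduced here purely for illustration). The standard error of their difference depends on the covariance between the two estimates as well as on their individual variances:

    Var(b_j - b_k) = Var(b_j) + Var(b_k) - 2 Cov(b_j, b_k)

Routine software output reports only the individual standard errors, so the covariance term needed for this comparison is unavailable unless the full variance-covariance matrix of the estimates is also published.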

Quasi-variances are approximations of variances. They are statistics associated with the parameter estimates (coefficients) of the different levels of a categorical explanatory variable within a statistical model. Quasi-variances can be presented alongside the parameter estimates so that readers can assess the difference between any combination of parameter estimates for a categorical explanatory variable. The approach is beneficial because such comparisons are not usually possible without access to the full variance-covariance matrix of the estimates.
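In outline, the method assigns a quasi-variance q_j to each level j of the categorical variable, with the q_j chosen so that, for every pair of levels j and k,

    Var(b_j - b_k) ≈ q_j + q_k

An approximate standard error for any contrast between two levels can then be obtained as sqrt(q_j + q_k) from the published quasi-variances alone, without reference to the covariance matrix.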

Using quasi-variance estimates addresses the reference category problem. The underlying idea was first proposed by Ridout, [1] but the technique was fully set out by David Firth and Renée X. de Menezes. [2] [3] The suitability of the technique for social science data analysis has been demonstrated by Gayle and Lambert. [4] An online tool for calculating quasi-variance estimates is available, together with a short technical description of the methodology.

Quasi-variances can be calculated in Stata using the QV module [5] and in R using the qvcalc package.
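As an illustration of the R route, the sketch below follows the standard pattern of use for qvcalc: fit a model containing a factor, then pass the fitted model and the name of the factor to qvcalc(). The ship-damage data (ships, from the MASS package) are used here only as an assumed example dataset.

    # Quasi-variances for the levels of the factor 'type'
    # (illustrative sketch; the ships data are taken from the MASS package)
    library(MASS)      # provides the ships data
    library(qvcalc)    # provides qvcalc()

    ships$year   <- as.factor(ships$year)
    ships$period <- as.factor(ships$period)

    fit <- glm(incidents ~ type + year + period,
               family = quasipoisson,
               data   = ships,
               subset = (service > 0),
               offset = log(service))

    qv <- qvcalc(fit, "type")   # quasi-variances for the levels of 'type'
    summary(qv)                 # estimates, quasi-variances and quasi-standard errors
    plot(qv)                    # comparison intervals based on quasi-standard errors

The quasi-standard errors reported by summary() allow the standard error of the difference between any two levels of 'type' to be approximated as the square root of the sum of the corresponding quasi-variances.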

References

  1. Ridout, M. S. (1989). Summarizing the Results of Fitting Generalized Linear Models to Data from Designed Experiments. New York: Springer-Verlag. pp. 262–269.
  2. Firth, David (2003). "Overcoming the Reference Category Problem in the Presentation of Statistical Models". Sociological Methodology. 33 (1): 1–18. doi:10.1111/j.0081-1750.2003.t01-1-00125.x.
  3. Firth, David; de Menezes, Renée X. (2004). "Quasi-variances" (PDF). Biometrika. 91 (1): 65–80. doi:10.1093/biomet/91.1.65. Retrieved 2017-03-17.
  4. Gayle, Vernon; Lambert, Paul S. (2007). "Using Quasi-variance to Communicate Sociological Results from Statistical Models". Sociology. 41 (6): 1191–1208. CiteSeerX 10.1.1.611.3153. doi:10.1177/0038038507084830. ISSN 0038-0385.
  5. Chen, Aspen (2014-07-21). QV: Stata module to compute quasi-variances. Retrieved 2017-03-15.

An extended set of resources with examples in Stata and SPSS is also available.