Goodman and Kruskal's gamma

In statistics, Goodman and Kruskal's gamma is a measure of rank correlation, i.e., the similarity of the orderings of the data when ranked by each of the quantities. It measures the strength of association of the cross tabulated data when both variables are measured at the ordinal level. It makes no adjustment for either table size or ties. Values range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.

This statistic (which is distinct from Goodman and Kruskal's lambda) is named after Leo Goodman and William Kruskal, who proposed it in a series of papers from 1954 to 1972.[1][2][3][4]

Definition

The estimate of gamma, G, depends on two quantities:

  • Ns, the number of pairs of cases ranked in the same order on both variables (number of concordant pairs),
  • Nd, the number of pairs of cases ranked in reversed order on the two variables (number of discordant, or reversed, pairs),

where "ties" (cases where either of the two variables in the pair are equal) are dropped. Then

This statistic can be regarded as the maximum likelihood estimator of the theoretical quantity \gamma, where

\gamma = \frac{P_s - P_d}{P_s + P_d},

and where Ps and Pd are the probabilities that a randomly selected pair of observations will place in the same or opposite order respectively, when ranked by both variables.
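
To make the pair counting concrete, here is a minimal Python sketch (an illustrative helper, not part of the original article) that tallies concordant and discordant pairs directly and then applies the formula for G; it assumes the ordinal categories are coded as numbers, so that ordinary comparisons reflect the intended ordering.

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Estimate Goodman and Kruskal's gamma for two equally long sequences
    of numerically coded ordinal observations (minimal O(n^2) sketch)."""
    n_s = 0  # concordant pairs: ranked in the same order on both variables
    n_d = 0  # discordant (reversed) pairs: ranked in opposite order
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            n_s += 1
        elif direction < 0:
            n_d += 1
        # pairs tied on either variable (direction == 0) are dropped
    return (n_s - n_d) / (n_s + n_d)

# Hypothetical ordinal ratings (coded 1-4) of the same eight cases
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [1, 2, 1, 3, 3, 4, 4, 4]
print(goodman_kruskal_gamma(x, y))  # close to +1: strong positive association
```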

Critical values for the gamma statistic are sometimes found by using an approximation, whereby a transformed value t of the statistic is referred to the Student t distribution, where[citation needed]

t \approx G \sqrt{\frac{N_s + N_d}{n\left(1 - G^{2}\right)}}

and where n is the number of observations (not the number of pairs):

n \neq N_s + N_d.
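
A short continuation of the same idea, shown below, evaluates this approximate t statistic from a computed G, the pair counts, and the number of observations; the helper name gamma_t_approx and the example values are illustrative assumptions, and the approximation itself is quoted above without a source.

```python
import math

def gamma_t_approx(G, n_s, n_d, n):
    """Approximate t value for G, using the transformation quoted above
    (illustrative sketch; refer the result to a Student t distribution)."""
    return G * math.sqrt((n_s + n_d) / (n * (1.0 - G ** 2)))

# Hypothetical values: G = 0.6 from n_s = 20 concordant and n_d = 5
# discordant pairs among n = 10 observations.
print(gamma_t_approx(0.6, 20, 5, 10))  # about 1.19
```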

Yule's Q

A special case of Goodman and Kruskal's gamma is Yule's Q, also known as the Yule coefficient of association,[5] which is specific to 2×2 matrices. Consider the following contingency table of events, where each value is a count of an event's frequency:

             Yes      No       Totals
Positive     a        b        a + b
Negative     c        d        c + d
Totals       a + c    b + d    n

Yule's Q is given by:

Q = \frac{ad - bc}{ad + bc}.

Although computed in the same fashion as Goodman and Kruskal's gamma, it has a slightly broader interpretation because the distinction between nominal and ordinal scales becomes a matter of arbitrary labeling for dichotomous distinctions. Thus, whether Q is positive or negative depends merely on which pairings the analyst considers to be concordant, but is otherwise symmetric.

Q varies from −1 to +1: −1 reflects perfect negative association, +1 reflects perfect positive association, and 0 reflects no association at all. The sign depends on which pairings the analyst initially considered to be concordant, but this choice does not affect the magnitude.

In terms of the odds ratio OR, Yule's Q is given by

Q = \frac{OR - 1}{OR + 1}

and so Yule's Q and Yule's Y are related by

Q = \frac{2Y}{1 + Y^{2}},

Y = \frac{1 - \sqrt{1 - Q^{2}}}{Q}.
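
As a quick check of these identities, the sketch below (with hypothetical cell counts, not taken from the source) computes Yule's Q directly from a 2×2 table and then recovers the same value from the odds ratio and from Yule's Y; swapping the two columns flips only the sign, illustrating the symmetry noted above.

```python
import math

def yules_q(a, b, c, d):
    """Yule's Q for a 2x2 table with cells laid out as in the table above."""
    return (a * d - b * c) / (a * d + b * c)

# Hypothetical cell counts: a = Positive/Yes, b = Positive/No,
# c = Negative/Yes, d = Negative/No
a, b, c, d = 30, 10, 5, 25

q = yules_q(a, b, c, d)
odds_ratio = (a * d) / (b * c)
yules_y = (math.sqrt(odds_ratio) - 1) / (math.sqrt(odds_ratio) + 1)

print(q)                                    # 0.875, direct formula
print((odds_ratio - 1) / (odds_ratio + 1))  # 0.875, via the odds ratio
print(2 * yules_y / (1 + yules_y ** 2))     # 0.875, via Yule's Y
print(yules_q(b, a, d, c))                  # -0.875, columns swapped: sign flips
```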

References

  1. Goodman, Leo A.; Kruskal, William H. (1954). "Measures of Association for Cross Classifications". Journal of the American Statistical Association. 49 (268): 732–764. doi:10.2307/2281536. JSTOR 2281536.
  2. Goodman, Leo A.; Kruskal, William H. (1959). "Measures of Association for Cross Classifications. II: Further Discussion and References". Journal of the American Statistical Association. 54 (285): 123–163. doi:10.1080/01621459.1959.10501503. JSTOR 2282143.
  3. Goodman, Leo A.; Kruskal, William H. (1963). "Measures of Association for Cross Classifications III: Approximate Sampling Theory". Journal of the American Statistical Association. 58 (302): 310–364. doi:10.1080/01621459.1963.10500850. JSTOR 2283271.
  4. Goodman, Leo A.; Kruskal, William H. (1972). "Measures of Association for Cross Classifications, IV: Simplification of Asymptotic Variances". Journal of the American Statistical Association. 67 (338): 415–421. doi:10.1080/01621459.1972.10482401. JSTOR 2284396.
  5. Yule, G. U. (1912). "On the methods of measuring association between two attributes" (PDF). Journal of the Royal Statistical Society. 75 (6): 579–652. JSTOR 2340126.
