Correlation coefficient

Last updated

A correlation coefficient is a numerical measure of some type of linear correlation, meaning a statistical relationship between two variables. [a] The variables may be two columns of a given data set of observations, often called a sample, or two components of a multivariate random variable with a known distribution.[ citation needed ]

Contents

Several types of correlation coefficient exist, each with their own definition and own range of usability and characteristics. They all assume values in the range from −1 to +1, where ±1 indicates the strongest possible correlation and 0 indicates no correlation. [2] As tools of analysis, correlation coefficients present certain problems, including the propensity of some types to be distorted by outliers and the possibility of incorrectly being used to infer a causal relationship between the variables (for more, see Correlation does not imply causation). [3]

Types

There are several different measures for the degree of correlation in data, depending on the kind of data: principally whether the data is a measurement, ordinal, or categorical.

Pearson

The Pearson product-moment correlation coefficient, also known as r, R, or Pearson's r, is a measure of the strength and direction of the linear relationship between two variables that is defined as the covariance of the variables divided by the product of their standard deviations. [4] This is the best-known and most commonly used type of correlation coefficient. When the term "correlation coefficient" is used without further qualification, it usually refers to the Pearson product-moment correlation coefficient.

Intra-class

Intraclass correlation (ICC) is a descriptive statistic that can be used, when quantitative measurements are made on units that are organized into groups; it describes how strongly units in the same group resemble each other.

Rank

Rank correlation is a measure of the relationship between the rankings of two variables, or two rankings of the same variable:

Tetrachoric and polychoric

The polychoric correlation coefficient measures association between two ordered-categorical variables. It's technically defined as the estimate of the Pearson correlation coefficient one would obtain if:

  1. The two variables were measured on a continuous scale, instead of as ordered-category variables.
  2. The two continuous variables followed a bivariate normal distribution.

When both variables are dichotomous instead of ordered-categorical, the polychoric correlation coefficient is called the tetrachoric correlation coefficient.

Interpreting correlation coefficient values

The correlation between two variables have different associations that are measured in values such as r or R. Correlation values range from −1 to +1, where ±1 indicates the strongest possible correlation and 0 indicates no correlation between variables. [5]

r or Rr or RStrength or weakness of association between variables [6]
+1.0 to +0.8-1.0 to -0.8Perfect or very strong association
+0.8 to +0.6-0.8 to -0.6Strong association
+0.6 to +0.4-0.6 to -0.4Moderate association
+0.4 to +0.2-0.4 to -0.2Weak association
+0.2 to 0.0-0.2 to 0.0Very weak or no association

See also

Notes

  1. Correlation coefficient: A statistic used to show how the scores from one measure relate to scores on a second measure for the same group of individuals. A high value (approaching +1.00) is a strong direct relationship, values near 0.50 are considered moderate and values below 0.30 are considered to show weak relationship. A low negative value (approaching -1.00) is similarly a strong inverse relationship, and values near 0.00 indicate little, if any, relationship. [1]

Related Research Articles

<span class="mw-page-title-main">Statistics</span> Study of the collection and analysis of data

Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.

<span class="mw-page-title-main">Correlation</span> Statistical concept

In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statistics it usually refers to the degree to which a pair of variables are linearly related. Familiar examples of dependent phenomena include the correlation between the height of parents and their offspring, and the correlation between the price of a good and the quantity the consumers are willing to purchase, as it is depicted in the so-called demand curve.

<span class="mw-page-title-main">Spearman's rank correlation coefficient</span> Nonparametric measure of rank correlation

In statistics, Spearman's rank correlation coefficient or Spearman's ρ, named after Charles Spearman and often denoted by the Greek letter (rho) or as , is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function.

In the social sciences, scaling is the process of measuring or ordering entities with respect to quantitative attributes or traits. For example, a scaling technique might involve estimating individuals' levels of extraversion, or the perceived quality of products. Certain methods of scaling permit estimation of magnitudes on a continuum, while other methods provide only for relative ordering of the entities.

<span class="mw-page-title-main">Scatter plot</span> Plot using the dispersal of scattered dots to show the relationship between variables

A scatter plot, also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram, is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

<span class="mw-page-title-main">Likert scale</span> Psychometric measurement scale

A Likert scale is a psychometric scale named after its inventor, American social psychologist Rensis Likert, which is commonly used in research questionnaires. It is the most widely used approach to scaling responses in survey research, such that the term is often used interchangeably with rating scale, although there are other types of rating scales.

Level of measurement or scale of measure is a classification that describes the nature of information within the values assigned to variables. Psychologist Stanley Smith Stevens developed the best-known classification with four levels, or scales, of measurement: nominal, ordinal, interval, and ratio. This framework of distinguishing levels of measurement originated in psychology and has since had a complex history, being adopted and extended in some disciplines and by some scholars, and criticized or rejected by others. Other classifications include those by Mosteller and Tukey, and by Chrisman.

In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. In computer science and some branches of mathematics, categorical variables are referred to as enumerations or enumerated types. Commonly, each of the possible values of a categorical variable is referred to as a level. The probability distribution associated with a random categorical variable is called a categorical distribution.

In statistics, a contingency table is a type of table in a matrix format that displays the multivariate frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering, and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them. The term contingency table was first used by Karl Pearson in "On the Theory of Contingency and Its Relation to Association and Normal Correlation", part of the Drapers' Company Research Memoirs Biometric Series I published in 1904.

<span class="mw-page-title-main">Coefficient of determination</span> Indicator for how well data points fit a line or curve

In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

In statistics, a rank correlation is any of several statistics that measure an ordinal association — the relationship between rankings of different ordinal variables or different rankings of the same variable, where a "ranking" is the assignment of the ordering labels "first", "second", "third", etc. to different observations of a particular variable. A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them. For example, two common nonparametric methods of significance that use rank correlation are the Mann–Whitney U test and the Wilcoxon signed-rank test.

In statistics, inter-rater reliability is the degree of agreement among independent observers who rate, code, or assess the same phenomenon.

In statistics, confirmatory factor analysis (CFA) is a special form of factor analysis, most commonly used in social science research. It is used to test whether measures of a construct are consistent with a researcher's understanding of the nature of that construct. As such, the objective of confirmatory factor analysis is to test whether the data fit a hypothesized measurement model. This hypothesized model is based on theory and/or previous analytic research. CFA was first developed by Jöreskog (1969) and has built upon and replaced older methods of analyzing construct validity such as the MTMM Matrix as described in Campbell & Fiske (1959).

<span class="mw-page-title-main">Intraclass correlation</span> Descriptive statistic

In statistics, the intraclass correlation, or the intraclass correlation coefficient (ICC), is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups. It describes how strongly units in the same group resemble each other. While it is viewed as a type of correlation, unlike most other correlation measures, it operates on data structured as groups rather than data structured as paired observations.

In statistics, polychoric correlation is a technique for estimating the correlation between two hypothesised normally distributed continuous latent variables, from two observed ordinal variables. Tetrachoric correlation is a special case of the polychoric correlation applicable when both observed variables are dichotomous. These names derive from the polychoric and tetrachoric series which are used for estimation of these correlations.

The multitrait-multimethod (MTMM) matrix is an approach to examining construct validity developed by Campbell and Fiske (1959). It organizes convergent and discriminant validity evidence for comparison of how a measure relates to other measures. The conceptual approach has influenced experimental design and measurement theory in psychology, including applications in structural equation models.

<span class="mw-page-title-main">Exploratory factor analysis</span> Statistical method in psychology

In multivariate statistics, exploratory factor analysis (EFA) is a statistical method used to uncover the underlying structure of a relatively large set of variables. EFA is a technique within factor analysis whose overarching goal is to identify the underlying relationships between measured variables. It is commonly used by researchers when developing a scale and serves to identify a set of latent constructs underlying a battery of measured variables. It should be used when the researcher has no a priori hypothesis about factors or patterns of measured variables. Measured variables are any one of several attributes of people that may be observed and measured. Examples of measured variables could be the physical height, weight, and pulse rate of a human being. Usually, researchers would have a large number of measured variables, which are assumed to be related to a smaller number of "unobserved" factors. Researchers must carefully consider the number of measured variables to include in the analysis. EFA procedures are more accurate when each factor is represented by multiple measured variables in the analysis.

<span class="mw-page-title-main">Bivariate analysis</span> Concept in statistical analysis

Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables, for the purpose of determining the empirical relationship between them.

In statistics, data can have any of various types. Statistical data types include categorical, directional, count, or real intervals.

Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known. These data exist on an ordinal scale, one of four levels of measurement described by S. S. Stevens in 1946. The ordinal scale is distinguished from the nominal scale by having a ranking. It also differs from the interval scale and ratio scale by not having category widths that represent equal increments of the underlying attribute.

References

  1. "correlation coefficient". NCME.org. National Council on Measurement in Education. Archived from the original on July 22, 2017. Retrieved April 17, 2014.
  2. Taylor, John R. (1997). An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements (PDF) (2nd ed.). Sausalito, CA: University Science Books. p. 217. ISBN   0-935702-75-X. Archived from the original (PDF) on 15 February 2019. Retrieved 14 February 2019.
  3. Boddy, Richard; Smith, Gordon (2009). Statistical Methods in Practice: For scientists and technologists. Chichester, U.K.: Wiley. pp. 95–96. ISBN   978-0-470-74664-6.
  4. Weisstein, Eric W. "Statistical Correlation". mathworld.wolfram.com. Retrieved 2020-08-22.
  5. Taylor, John R. (1997). An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements (PDF) (2nd ed.). Sausalito, CA: University Science Books. p. 217. ISBN   0-935702-75-X. Archived from the original (PDF) on 15 February 2019. Retrieved 14 February 2019.
  6. "The Correlation Coefficient (r)". Boston University.