Correlogram

A plot showing 100 random numbers with a "hidden" sine function, and an autocorrelation (correlogram) of the series on the bottom.

In the analysis of data, a correlogram is a chart of correlation statistics. For example, in time series analysis, a plot of the sample autocorrelations $r_h$ versus $h$ (the time lags) is an autocorrelogram. If cross-correlation is plotted, the result is called a cross-correlogram.


The correlogram is a commonly used tool for checking randomness in a data set. If random, autocorrelations should be near zero for any and all time-lag separations. If non-random, then one or more of the autocorrelations will be significantly non-zero.

In addition, correlograms are used in the model identification stage for Box–Jenkins autoregressive moving average time series models. Autocorrelations should be near-zero for randomness; if the analyst does not check for randomness, then the validity of many of the statistical conclusions becomes suspect. The correlogram is an excellent way of checking for such randomness.
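
As a minimal illustration of this check, the following Python sketch (assuming numpy and statsmodels are available; the series length and lag count are arbitrary choices) computes the sample autocorrelations of a purely random series, whose values beyond lag 0 should all lie close to zero:

```python
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(42)
x = rng.standard_normal(200)   # a purely random (white noise) series

r = acf(x, nlags=20)           # sample autocorrelations r_0, r_1, ..., r_20
print(r[0])                    # always exactly 1 at lag 0
print(np.abs(r[1:]).max())     # for random data these stay small,
                               # roughly within about 2/sqrt(N) of zero
```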

Sometimes, corrgrams, color-mapped matrices of correlation strengths in multivariate analysis,[1] are also called correlograms.[2][3]

Applications

The correlogram can help provide answers to the following questions:

- Are the data random?
- Is an observation related to an adjacent observation?
- Is an observation related to an observation twice-removed?
- Is the observed time series white noise?
- Is the observed time series sinusoidal?
- Is the observed time series autoregressive?
- What is an appropriate model for the observed time series?
- Is the model $Y = \text{constant} + \text{error}$ valid and sufficient?
- Is the formula $s_{\bar{Y}} = s/\sqrt{N}$ valid? [4]

Importance

Randomness (along with fixed model, fixed variation, and fixed distribution) is one of the four assumptions that typically underlie all measurement processes. The randomness assumption is critically important for the following three reasons:

- Most standard statistical tests depend on randomness. The validity of the test conclusions is directly linked to the validity of the randomness assumption.
- Many commonly used statistical formulas depend on the randomness assumption, the most common being the formula for the standard error of the sample mean,

$$s_{\bar{Y}} = \frac{s}{\sqrt{N}},$$

where $s$ is the standard deviation of the data. Although heavily used, the results from using this formula are of no value unless the randomness assumption holds (see the simulation sketch following this list).
- For univariate data, the default model is

$$Y = \text{constant} + \text{error}.$$

If the data are not random, this model is incorrect and invalid, and the estimates for the parameters (such as the constant) become nonsensical and invalid.
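
As a rough illustration of the second point, the following Python sketch (numpy only; the AR(1) coefficient of 0.8, the sample size, and the number of replications are arbitrary illustrative choices) compares the usual $s/\sqrt{N}$ standard error with the actual spread of the sample mean for positively autocorrelated data:

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_series(n, phi=0.8):
    """Illustrative AR(1) series: y_t = phi * y_{t-1} + e_t."""
    e = rng.standard_normal(n)
    y = np.empty(n)
    y[0] = e[0]
    for t in range(1, n):
        y[t] = phi * y[t - 1] + e[t]
    return y

N, reps = 200, 2000
means, naive_se = [], []
for _ in range(reps):
    y = ar1_series(N)
    means.append(y.mean())
    naive_se.append(y.std(ddof=1) / np.sqrt(N))   # s / sqrt(N)

print("average naive SE (s/sqrt(N)):", np.mean(naive_se))
print("actual SD of the sample mean:", np.std(means))
# With positive autocorrelation the actual spread of the sample mean is
# substantially larger than the naive formula suggests.
```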

Estimation of autocorrelations

The autocorrelation coefficient at lag $h$ is given by

$$r_h = \frac{c_h}{c_0},$$

where $c_h$ is the autocovariance function

$$c_h = \frac{1}{N}\sum_{t=1}^{N-h}\left(Y_t-\bar{Y}\right)\left(Y_{t+h}-\bar{Y}\right)$$

and $c_0$ is the variance function

$$c_0 = \frac{1}{N}\sum_{t=1}^{N}\left(Y_t-\bar{Y}\right)^{2}.$$

The resulting value of $r_h$ will range between −1 and +1.
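
A minimal sketch of these estimators in Python (plain numpy; the function names are illustrative only):

```python
import numpy as np

def autocovariance(y, h):
    """c_h with the 1/N convention used above."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    ybar = y.mean()
    return np.sum((y[: n - h] - ybar) * (y[h:] - ybar)) / n

def autocorrelation(y, h):
    """r_h = c_h / c_0."""
    return autocovariance(y, h) / autocovariance(y, 0)
```

For example, autocorrelation(y, 1) gives the lag-1 sample autocorrelation of a series y.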

Alternate estimate

Some sources may use the following formula for the autocovariance function:

$$c_h = \frac{1}{N-h}\sum_{t=1}^{N-h}\left(Y_t-\bar{Y}\right)\left(Y_{t+h}-\bar{Y}\right)$$

Although this definition has less bias, the $1/N$ formulation has some desirable statistical properties and is the form most commonly used in the statistics literature. See pages 20 and 49–50 in Chatfield for details.
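
Under the same assumptions as the sketch above, the alternative estimator changes only the divisor:

```python
import numpy as np

def autocovariance_alt(y, h):
    """Alternative c_h using the 1/(N - h) divisor discussed above."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    ybar = y.mean()
    return np.sum((y[: n - h] - ybar) * (y[h:] - ybar)) / (n - h)
```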

Statistical inference with correlograms

Example of a correlogram, where the confidence intervals $B$ (see text) for the estimated correlation coefficients are shown in gray (centered around the x-axis).

In the same graph one can draw upper and lower bounds for the autocorrelation with significance level $\alpha$:

$$B = \pm z_{1-\alpha/2}\,\mathrm{SE}(r_h),$$

with $r_h$ as the estimated autocorrelation at lag $h$.

If the autocorrelation is higher (lower) than this upper (lower) bound, the null hypothesis that there is no autocorrelation at and beyond a given lag is rejected at a significance level of $\alpha$. This test is an approximate one and assumes that the time series is Gaussian.

In the above, $z_{1-\alpha/2}$ is the quantile of the standard normal distribution and $\mathrm{SE}$ is the standard error, which can be computed by Bartlett's formula for MA($\ell$) processes:

$$\mathrm{SE}(r_1) = \frac{1}{\sqrt{N}}$$

$$\mathrm{SE}(r_h) = \sqrt{\frac{1 + 2\sum_{i=1}^{h-1} r_i^{2}}{N}}\quad\text{for } h > 1.$$

In the plot above, we can reject the null hypothesis that there is no autocorrelation between adjacent time points (lag = 1). For the other lags, one cannot reject the null hypothesis of no autocorrelation.

Note that there are two distinct formulas for generating the confidence bands:

1. If the correlogram is being used to test for randomness (i.e., there is no time dependence in the data), the following formula is recommended:

$$\pm\frac{z_{1-\alpha/2}}{\sqrt{N}},$$

where $N$ is the sample size, $z$ is the quantile function of the standard normal distribution and $\alpha$ is the significance level. In this case, the confidence bands have fixed width that depends on the sample size.

2. Correlograms are also used in the model identification stage for fitting ARIMA models. In this case, a moving average model is assumed for the data and the following confidence bands should be generated:

$$\pm z_{1-\alpha/2}\sqrt{\frac{1}{N}\left(1 + 2\sum_{i=1}^{k} r_i^{2}\right)},$$

where $k$ is the lag. In this case, the confidence bands increase as the lag increases (a sketch computing both band types is given below).
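
A minimal sketch of both band types in Python (numpy and scipy are assumed available; the 95% level and the helper name are illustrative choices, not a standard API):

```python
import numpy as np
from scipy.stats import norm

def confidence_bands(r, n, alpha=0.05):
    """Half-widths of the two band types for lags 1..len(r).

    r : sample autocorrelations r_1, r_2, ... (lag 0 excluded)
    n : sample size
    """
    z = norm.ppf(1 - alpha / 2)
    # 1. fixed-width band, appropriate when testing for randomness
    fixed = np.full(len(r), z / np.sqrt(n))
    # 2. growing band for ARIMA model identification:
    #    at lag k it uses the sum r_1^2 + ... + r_k^2 (formula 2 above)
    growing = z * np.sqrt((1 + 2 * np.cumsum(np.asarray(r) ** 2)) / n)
    return fixed, growing
```

An autocorrelation $r_k$ is then flagged as significantly non-zero when $|r_k|$ exceeds the corresponding band value.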

Software

Correlograms are available in most general purpose statistical libraries.

Correlograms:

- python pandas: pandas.plotting.autocorrelation_plot [5] (see the example below)
- R: the functions acf and pacf

Corrgrams:

- R: the corrgram package [2][3]
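
As a minimal sketch, the following Python example (assuming pandas and matplotlib are installed) draws a correlogram of a random series using the pandas function listed above:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

rng = np.random.default_rng(0)
series = pd.Series(rng.standard_normal(100))   # illustrative random series

autocorrelation_plot(series)   # correlogram with approximate confidence bands
plt.show()
```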


References

  1. Friendly, Michael (19 August 2002). "Corrgrams: Exploratory displays for correlation matrices" (PDF). The American Statistician. Taylor & Francis. 56 (4): 316–324. doi:10.1198/000313002533. Retrieved 19 January 2014.
  2. "CRAN – Package corrgram". cran.r-project.org. 29 August 2013. Retrieved 19 January 2014.
  3. "Quick-R: Correlograms". statmethods.net. Retrieved 19 January 2014.
  4. "1.3.3.1. Autocorrelation Plot". www.itl.nist.gov. Retrieved 2018-08-20.
  5. "Visualization § Autocorrelation plot".

Further reading

- Chatfield, C. The Analysis of Time Series: An Introduction. Chapman & Hall.

This article incorporates public domain material from the National Institute of Standards and Technology website https://www.nist.gov.