# Cointegration

Cointegration is a statistical property of a collection (X1, X2, ..., Xk) of time series variables. First, all of the series must be integrated of order d (see Order of integration). Next, if a linear combination of this collection is integrated of order less than d, then the collection is said to be cointegrated. Formally, if (X, Y, Z) are each integrated of order d, and there exist coefficients a, b, c such that aX + bY + cZ is integrated of order less than d, then X, Y, and Z are cointegrated. Cointegration has become an important property in contemporary time series analysis. Time series often have trends, either deterministic or stochastic. In an influential paper, Charles Nelson and Charles Plosser (1982) provided statistical evidence that many US macroeconomic time series (such as GNP, wages, and employment) have stochastic trends.

## Introduction

If two or more series are individually integrated (in the time series sense) but some linear combination of them has a lower order of integration, then the series are said to be cointegrated. A common example is where the individual series are first-order integrated (${\displaystyle I(1)}$) but some (cointegrating) vector of coefficients exists to form a stationary linear combination of them. For instance, a stock market index and the price of its associated futures contract move through time, each roughly following a random walk. Testing the hypothesis that there is a statistically significant connection between the futures price and the spot price could now be done by testing for the existence of a cointegrated combination of the two series.
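The idea can be illustrated with a minimal simulation (a sketch with made-up parameters, not data from any real market): two series that each follow a common random-walk trend are individually ${\displaystyle I(1)}$, yet the right linear combination of them is stationary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Shared stochastic trend: a pure random walk (an I(1) process).
trend = np.cumsum(rng.normal(size=n))

# Two series that each load on the common trend plus stationary noise,
# so the combination y - 2*x is stationary even though x and y are I(1).
x = trend + rng.normal(size=n)
y = 2.0 * trend + rng.normal(size=n)

spread = y - 2.0 * x  # the cointegrating combination

# Each series wanders far from its start, but the spread stays bounded.
print(np.std(x), np.std(spread))
```

Here (1, -2) is a cointegrating vector: the spread's dispersion stays small and roughly constant, while the dispersion of either series alone grows with the length of the sample.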

### History

The first to introduce and analyse the concept of spurious—or nonsense—regression was Udny Yule in 1926. [1] Before the 1980s, many economists used linear regressions on non-stationary time series data, which Nobel laureate Clive Granger and Paul Newbold showed to be a dangerous approach that could produce spurious correlation, [2] [3] since standard detrending techniques can result in data that are still non-stationary. [4] Granger's 1987 paper with Robert Engle formalized the cointegrating vector approach, and coined the term. [5]

For integrated ${\displaystyle I(1)}$ processes, Granger and Newbold showed that de-trending does not eliminate the problem of spurious correlation, and that the superior alternative is to check for cointegration. Two series with ${\displaystyle I(1)}$ trends can be cointegrated only if there is a genuine relationship between the two. Thus the standard current methodology for time series regressions is to check all time series involved for integration. If there are ${\displaystyle I(1)}$ series on both sides of the regression relationship, then it is possible for regressions to give misleading results.

The possible presence of cointegration must be taken into account when choosing a technique to test hypotheses concerning the relationship between two variables having unit roots (i.e. integrated of at least order one). [2] The usual procedure for testing hypotheses concerning the relationship between non-stationary variables was to run ordinary least squares (OLS) regressions on data which had been differenced. This method is biased if the non-stationary variables are cointegrated.

For example, regressing the consumption series for any country (e.g. Fiji) against the GNP for a randomly selected dissimilar country (e.g. Afghanistan) might give a high R-squared relationship (suggesting high explanatory power on Fiji's consumption from Afghanistan's GNP). This is called spurious regression: two integrated ${\displaystyle I(1)}$ series which are not directly causally related may nonetheless show a significant correlation; this phenomenon is called spurious correlation.
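The spurious-regression phenomenon is easy to reproduce by simulation (a hedged sketch with arbitrary sample sizes, not a replication of any published study): regress one independent random walk on another and the naive t-test rejects "no relationship" far more often than its nominal 5% level.

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 200, 500
rejections = 0

for _ in range(trials):
    # Two completely independent random walks (I(1) series).
    x = np.cumsum(rng.normal(size=n))
    y = np.cumsum(rng.normal(size=n))

    # OLS of y on x with intercept; naive t-test of the slope.
    xc, yc = x - x.mean(), y - y.mean()
    beta = (xc @ yc) / (xc @ xc)
    resid = yc - beta * xc
    s2 = (resid @ resid) / (n - 2)
    t_stat = beta / np.sqrt(s2 / (xc @ xc))
    if abs(t_stat) > 1.96:
        rejections += 1

# A valid 5% test would reject in about 5% of trials; with independent
# random walks the rejection rate is dramatically higher.
print(rejections / trials)
```

This is precisely the failure Granger and Newbold documented: the usual OLS standard errors are not valid when both variables are non-stationary.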

## Tests

The three main methods for testing for cointegration are:

### Engle–Granger two-step method

If ${\displaystyle x_{t}}$ and ${\displaystyle y_{t}}$ are non-stationary and integrated of order d = 1, then they are cointegrated if there exists a value of ${\displaystyle \beta }$ for which the linear combination ${\displaystyle y_{t}-\beta x_{t}}$ is stationary. In other words:

${\displaystyle y_{t}-\beta x_{t}=u_{t}\,}$

where ${\displaystyle u_{t}}$ is stationary.

If we knew ${\displaystyle \beta }$, we could simply test the combination for stationarity with something like a Dickey–Fuller test or a Phillips–Perron test and be done. But because we do not know ${\displaystyle \beta }$, we must estimate it first, generally by ordinary least squares (regressing ${\displaystyle y_{t}}$ on ${\displaystyle x_{t}}$ and an intercept), and then run our stationarity test on the estimated ${\displaystyle u_{t}}$ series, often denoted ${\displaystyle {\hat {u}}_{t}}$.

A second regression is then run on the first-differenced variables from the first regression, with the lagged residual ${\displaystyle {\hat {u}}_{t-1}}$ included as a regressor.
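The two steps can be sketched directly with numpy (a minimal illustration on simulated data; a real application would use Engle–Granger critical values rather than the standard Dickey–Fuller table, and typically a library routine such as a packaged cointegration test):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Simulated cointegrated pair: y_t = 0.5 * x_t + u_t with stationary AR(1) u_t.
x = np.cumsum(rng.normal(size=n))          # I(1) regressor
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + rng.normal()   # stationary errors
y = 0.5 * x + u

# Step 1: OLS of y on x (with intercept) to estimate beta and the residuals.
X = np.column_stack([np.ones(n), x])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
u_hat = y - X @ coefs

# Step 2: Dickey-Fuller-style regression of the differenced residuals on
# their own lag; a clearly negative slope indicates mean reversion, i.e.
# stationary residuals and hence cointegration.
du = np.diff(u_hat)
lag = u_hat[:-1]
slope = (lag @ du) / (lag @ lag)
print(coefs[1], slope)
```

With cointegrated data the step-1 estimate of ${\displaystyle \beta }$ is superconsistent, and the step-2 slope estimates ${\displaystyle \rho -1}$ of the residual AR(1) process, which is negative when the residuals revert to their mean.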

### Johansen test

The Johansen test is a test for cointegration that, unlike the Engle–Granger method, allows for more than one cointegrating relationship. However, its critical values rely on asymptotic theory, i.e. large samples. If the sample size is too small, the results will not be reliable and one should use autoregressive distributed lag (ARDL) models instead. [6] [7]
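The core of the Johansen procedure can be sketched with numpy (a simplified illustration with no lagged differences and a constant handled by demeaning; real applications use a library implementation with proper lag selection and Johansen's tabulated critical values):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 1000, 3

# Three series sharing a single common stochastic trend, so there are
# two cointegrating relations (cointegration rank 2).
trend = np.cumsum(rng.normal(size=n))
Y = np.column_stack([trend + rng.normal(size=n) for _ in range(k)])

# VECM skeleton: dY_t = Pi * Y_{t-1} + e_t (no short-run lags here).
dY = np.diff(Y, axis=0)
R0 = dY - dY.mean(axis=0)            # demeaned differences
R1 = Y[:-1] - Y[:-1].mean(axis=0)    # demeaned lagged levels
T = R0.shape[0]

S00 = R0.T @ R0 / T
S11 = R1.T @ R1 / T
S01 = R0.T @ R1 / T

# Eigenvalues of S11^{-1} S10 S00^{-1} S01 are the squared canonical
# correlations between dY_t and Y_{t-1}.
M = np.linalg.solve(S11, S01.T) @ np.linalg.solve(S00, S01)
eigvals = np.sort(np.real(np.linalg.eigvals(M)))[::-1]

# Trace statistic for each hypothesized rank r: the test of "at most
# rank r" uses the eigenvalues beyond the r-th.
trace = [-T * np.sum(np.log(1 - eigvals[r:])) for r in range(k)]
print(eigvals, trace)
```

The number of eigenvalues that are significantly different from zero (judged by comparing the trace statistics against Johansen's critical values) estimates the cointegration rank, i.e. the number of independent cointegrating relationships.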

### Phillips–Ouliaris cointegration test

Peter C. B. Phillips and Sam Ouliaris (1990) show that residual-based unit root tests applied to the estimated cointegrating residuals do not have the usual Dickey–Fuller distributions under the null hypothesis of no cointegration. [8] Because of the spurious regression phenomenon under the null hypothesis, these test statistics have asymptotic distributions that depend on (1) the number of deterministic trend terms and (2) the number of variables with which cointegration is being tested. These distributions are known as Phillips–Ouliaris distributions, and critical values have been tabulated. In finite samples, a superior alternative to the use of these asymptotic critical values is to generate critical values from simulations.
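The simulation idea can be sketched as follows (a simplified Monte Carlo for a bivariate regression with a constant; the actual Phillips–Ouliaris tests also apply corrections for serial correlation, which this sketch omits):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 300, 2000
stats = np.empty(trials)

for i in range(trials):
    # Null of no cointegration: two independent random walks.
    x = np.cumsum(rng.normal(size=n))
    y = np.cumsum(rng.normal(size=n))

    # Step 1: residuals of the cointegrating regression (with intercept).
    X = np.column_stack([np.ones(n), x])
    u = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

    # Step 2: Dickey-Fuller t-statistic on the residuals.
    du, lag = np.diff(u), u[:-1]
    rho = (lag @ du) / (lag @ lag)
    resid = du - rho * lag
    se = np.sqrt((resid @ resid) / (len(du) - 1) / (lag @ lag))
    stats[i] = rho / se

# Empirical 5% critical value under the null; it lies well below the
# standard Dickey-Fuller 5% critical value, reflecting the non-standard
# distribution of residual-based tests.
crit_5pct = np.quantile(stats, 0.05)
print(crit_5pct)
```

Comparing an observed residual-based t-statistic against such simulated quantiles, generated at the actual sample size, avoids relying on asymptotic tabulations in small samples.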

### Multicointegration

In practice, cointegration is often used for two ${\displaystyle I(1)}$ series, but it is more generally applicable and can be used for variables integrated of higher order (to detect correlated accelerations or other second-difference effects). Multicointegration extends the cointegration technique beyond two variables, and occasionally to variables integrated at different orders.

### Variable shifts in long time series

Tests for cointegration assume that the cointegrating vector is constant during the period of study. In reality, the long-run relationship between the underlying variables may change (shifts in the cointegrating vector can occur). Possible reasons include technological progress, economic crises, changes in people's preferences and the resulting behaviour, policy or regime alteration, and organizational or institutional developments. This is especially likely if the sample period is long. To take this issue into account, tests have been introduced for cointegration with one unknown structural break, [9] and tests for cointegration with two unknown breaks are also available. [10]

### Bayesian inference

Several Bayesian methods have been proposed to compute the posterior distribution of the number of cointegrating relationships and the cointegrating linear combinations. [11]

## References

1. Yule, U. (1926). "Why do we sometimes get nonsense-correlations between time series? - A study in sampling and the nature of time series". Journal of the Royal Statistical Society . 89 (1): 11–63. doi:10.2307/2341482. JSTOR   2341482. S2CID   126346450.
2. Granger, C.; Newbold, P. (1974). "Spurious Regressions in Econometrics". Journal of Econometrics. 2 (2): 111–120. doi:10.1016/0304-4076(74)90034-7.
3. Mahdavi Damghani, Babak; et al. (2012). "The Misleading Value of Measured Correlation". Wilmott . 2012 (1): 64–73. doi:10.1002/wilm.10167.
4. Granger, Clive (1981). "Some Properties of Time Series Data and Their Use in Econometric Model Specification". Journal of Econometrics . 16 (1): 121–130. doi:10.1016/0304-4076(81)90079-8.
5. Engle, Robert F.; Granger, Clive W. J. (1987). "Co-integration and error correction: Representation, estimation and testing" (PDF). Econometrica . 55 (2): 251–276. doi:10.2307/1913236. JSTOR   1913236.
6. Giles, David. "ARDL Models - Part II - Bounds Tests" . Retrieved 4 August 2014.
7. Pesaran, M.H.; Shin, Y.; Smith, R.J. (2001). "Bounds testing approaches to the analysis of level relationships". Journal of Applied Econometrics. 16 (3): 289–326. doi:10.1002/jae.616.
8. Phillips, P. C. B.; Ouliaris, S. (1990). "Asymptotic Properties of Residual Based Tests for Cointegration" (PDF). Econometrica . 58 (1): 165–193. doi:10.2307/2938339. JSTOR   2938339.
9. Gregory, Allan W.; Hansen, Bruce E. (1996). "Residual-based tests for cointegration in models with regime shifts" (PDF). Journal of Econometrics. 70 (1): 99–126. doi:10.1016/0304-4076(69)41685-7.
10. Hatemi-J, A. (2008). "Tests for cointegration with two unknown regime shifts with an application to financial market integration". Empirical Economics . 35 (3): 497–505. doi:10.1007/s00181-007-0175-9.
11. Koop, G.; Strachan, R.; van Dijk, H.K.; Villani, M. (January 1, 2006). "Chapter 17: Bayesian Approaches to Cointegration". In Mills, T.C.; Patterson, K. (eds.). Handbook of Econometrics Vol.1 Econometric Theory. Palgrave Macmillan. pp. 871–898. ISBN   978-1-4039-4155-8.