Linear trend estimation

Linear trend estimation is a statistical technique used to analyze data patterns. Data patterns, or trends, occur when the information gathered tends to increase or decrease over time or is influenced by changes in an external factor. Linear trend estimation essentially creates a straight line on a graph of data that models the general direction that the data is heading.

Fitting a trend: Least-squares

Given a set of data, there are a variety of functions that can be chosen to fit the data. The simplest function is a straight line with the dependent variable (typically the measured data) on the vertical axis and the independent variable (often time) on the horizontal axis.

The least-squares fit is a common method to fit a straight line through the data. This method minimizes the sum of the squared errors in the data series $y$. Given a set of points in time $t$ and data values $y_t$ observed for those points in time, values of $\hat{a}$ and $\hat{b}$ are chosen to minimize the sum of squared errors

$$\sum_t \left[ y_t - \left( \hat{a} t + \hat{b} \right) \right]^2 .$$

This formula first calculates the difference between the observed data $y_t$ and the estimate $(\hat{a} t + \hat{b})$; the difference at each data point is squared, and the squares are then added together, giving the "sum of squares" measurement of error. The values of $\hat{a}$ and $\hat{b}$ derived from the data parameterize the simple linear estimator $\hat{y} = \hat{a} t + \hat{b}$. The term "trend" refers to the slope $\hat{a}$ in the least-squares estimator.
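As a minimal sketch of the least-squares fit described above, the following Python code computes the slope and intercept in closed form. The function names and the data values are made up for illustration and do not come from the article.

```python
def fit_trend(t, y):
    """Return (a_hat, b_hat) minimising sum((y_i - (a*t_i + b))**2)."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    # Centred sums of squares and cross-products
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    a_hat = s_ty / s_tt               # slope: the "trend"
    b_hat = y_mean - a_hat * t_mean   # intercept
    return a_hat, b_hat


def sum_squared_errors(t, y, a, b):
    """Sum of squared differences between the data and the line a*t + b."""
    return sum((yi - (a * ti + b)) ** 2 for ti, yi in zip(t, y))


t = [0, 1, 2, 3, 4]
y = [1.0, 3.1, 4.9, 7.2, 8.8]   # made-up illustrative data
a_hat, b_hat = fit_trend(t, y)
print(a_hat, b_hat, sum_squared_errors(t, y, a_hat, b_hat))
```

Any other choice of slope or intercept gives a larger sum of squared errors on the same data, which is what "least squares" means here.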

Data as trend and noise

To analyze a (time) series of data, it can be assumed that it may be represented as trend plus noise:

$$y_t = a t + b + e_t$$

where $a$ and $b$ are unknown constants and the $e_t$'s are randomly distributed errors. If one can reject the null hypothesis that the errors are non-stationary, then the non-stationary series $\{y_t\}$ is called trend-stationary. The least-squares method assumes the errors are independently distributed with a normal distribution. If this is not the case, hypothesis tests about the unknown parameters $a$ and $b$ may be inaccurate. It is simplest if the $e_t$'s all have the same distribution, but if not (if some have higher variance, meaning that those data points are effectively less certain), then this can be taken into account during the least-squares fitting by weighting each point by the inverse of the variance of that point.
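The inverse-variance weighting mentioned above can be sketched as follows. This is an illustrative implementation, assuming the per-point variances are known; the function name and the data are hypothetical.

```python
def fit_trend_weighted(t, y, variances):
    """Weighted least-squares line fit with weights 1 / variance."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    # Weighted means of t and y
    t_bar = sum(wi * ti for wi, ti in zip(w, t)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    s_tt = sum(wi * (ti - t_bar) ** 2 for wi, ti in zip(w, t))
    s_ty = sum(wi * (ti - t_bar) * (yi - y_bar)
               for wi, ti, yi in zip(w, t, y))
    a_hat = s_ty / s_tt
    b_hat = y_bar - a_hat * t_bar
    return a_hat, b_hat


t = [0, 1, 2, 3]
y = [1.0, 3.0, 5.0, 20.0]          # last point is far off the line y = 2t + 1
variances = [1.0, 1.0, 1.0, 1e6]   # ...but it is also declared very uncertain
a_w, b_w = fit_trend_weighted(t, y, variances)
a_u, b_u = fit_trend_weighted(t, y, [1.0] * 4)   # equal weights: ordinary fit
print(a_w, b_w, a_u, b_u)
```

With equal variances the result reduces to the ordinary least-squares fit; with the large variance on the last point, that point has almost no influence on the fitted trend.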

Commonly, where only a single time series exists to be analyzed, the variance of the $e_t$'s is estimated by fitting a trend to obtain the estimated parameter values $\hat{a}$ and $\hat{b}$, thus allowing the predicted values

$$\hat{y} = \hat{a} t + \hat{b}$$

to be subtracted from the data $y_t$ (thus detrending the data), leaving the residuals as the detrended data, and estimating the variance of the $e_t$'s from the residuals. This is often the only way of estimating the variance of the $e_t$'s.

Once the "noise" of the series is known, the significance of the trend can be assessed by making the null hypothesis that the trend, $a$, is not different from 0. Under this null hypothesis, the distribution of trends calculated from random (trendless) data with the same noise variance is known. If the magnitude of the estimated trend, $\hat{a}$, is larger than the critical value for a certain significance level, then the estimated trend is deemed significantly different from zero at that significance level, and the null hypothesis of a zero underlying trend is rejected.
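The steps above (fit, detrend, estimate the noise variance from the residuals, then test the slope) can be sketched in a few lines of Python. This is a rough illustration under the standard assumptions of independent, identically distributed normal errors. The data are made up, and the critical value 3.182 is the tabulated two-sided 5% point of Student's t distribution with 3 degrees of freedom.

```python
import math


def trend_t_statistic(t, y):
    """Fit y = a*t + b by least squares; return (a_hat, se_a, t_stat)."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    a_hat = s_ty / s_tt
    b_hat = y_mean - a_hat * t_mean
    # Detrend: the residuals are the data minus the fitted line
    residuals = [yi - (a_hat * ti + b_hat) for ti, yi in zip(t, y)]
    # Noise variance estimated from the residuals (n - 2 degrees of freedom)
    noise_var = sum(r ** 2 for r in residuals) / (n - 2)
    se_a = math.sqrt(noise_var / s_tt)   # standard error of the slope
    return a_hat, se_a, a_hat / se_a


t = [0, 1, 2, 3, 4]
y = [1.0, 3.1, 4.9, 7.2, 8.8]   # made-up illustrative data
a_hat, se_a, t_stat = trend_t_statistic(t, y)

T_CRIT = 3.182   # two-sided 5% critical value, Student's t with n - 2 = 3 df
significant = abs(t_stat) > T_CRIT
print(a_hat, se_a, t_stat, significant)
```

The critical value depends on the number of data points, so it would need to be looked up again for a series of a different length.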

The use of a linear trend line has been the subject of criticism, leading to a search for alternative approaches to avoid its use in model estimation. One of the alternative approaches involves unit root tests and the cointegration technique in econometric studies.

The estimated coefficient associated with a linear trend variable such as time is interpreted as a measure of the impact of a number of unknown or known but immeasurable factors on the dependent variable over one unit of time. Strictly speaking, this interpretation is applicable for the estimation time frame only. Outside of this time frame, it cannot be determined how these immeasurable factors behave both qualitatively and quantitatively.

Research results by mathematicians, statisticians, econometricians, and economists have been published in response to those questions. For example, detailed notes on the meaning of linear time trends in the regression model are given in Cameron (2005); [1] Granger, Engle, and many other econometricians have written on stationarity, unit root testing, co-integration, and related issues (a summary of some of the works in this area can be found in an information paper [2] by the Royal Swedish Academy of Sciences (2003)); and Ho-Trieu & Tucker (1990) have written on logarithmic time trends with results indicating linear time trends are special cases of cycles.

Noisy time series

It is harder to see a trend in a noisy time series. For example, if the true series is 0, 1, 2, 3, ..., all plus some independent normally distributed "noise" e of standard deviation E, and a sample series of length 50 is given, then if E = 0.1, the trend will be obvious; if E = 100, the trend will probably be visible; but if E = 10000, the trend will be buried in the noise.
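One rough way to quantify how visible a trend is compares the true slope with the standard error of the least-squares slope. The sketch below assumes equally spaced times 0, 1, ..., n-1 and independent noise with a common standard deviation; the function name is hypothetical.

```python
import math


def slope_standard_error(noise_sd, n):
    """Standard error of the least-squares slope for t = 0, 1, ..., n-1
    with independent noise of standard deviation noise_sd."""
    s_tt = n * (n ** 2 - 1) / 12.0   # equals sum((t - mean(t))**2)
    return noise_sd / math.sqrt(s_tt)


TRUE_SLOPE = 1.0   # the series 0, 1, 2, 3, ... rises by 1 per step
for E in (0.1, 100.0, 10000.0):
    se = slope_standard_error(E, 50)
    print(f"E = {E:>8}: slope s.e. = {se:.4g}, slope / s.e. = {TRUE_SLOPE / se:.4g}")
```

For a series of length 50 the ratio of slope to standard error is about 1020 for E = 0.1, about 1.02 for E = 100, and about 0.0102 for E = 10000.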

Consider a concrete example, such as the global surface temperature record of the past 140 years as presented by the IPCC. [3] The interannual variation is about 0.2 °C, and the trend is about 0.6 °C over 140 years, with 95% confidence limits of 0.2 °C (by coincidence, about the same value as the interannual variation). Hence, the trend is statistically different from 0. However, as noted elsewhere, [4] this time series doesn't conform to the assumptions necessary for least-squares to be valid.

Goodness of fit (r-squared) and trend

Figure: Illustration of the effect of filtering on $r^2$. Black = unfiltered data; red = data averaged every 10 points; blue = data averaged every 100 points. All have the same trend, but more filtering leads to higher $r^2$ of the fitted trend line.

The least-squares fitting process produces a value, r-squared ($r^2$), which is 1 minus the ratio of the variance of the residuals to the variance of the dependent variable. It says what fraction of the variance of the data is explained by the fitted trend line. It does not relate to the statistical significance of the trend line (see figure); the statistical significance of the trend is determined by its t-statistic. Often, filtering a series increases $r^2$ while making little difference to the fitted trend.
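The effect of filtering on r-squared can be reproduced with a small simulation. The sketch below uses made-up parameters (a slope of 0.01 per step, noise of standard deviation 3, 1000 points, a fixed random seed) and simple block averaging as the filter; none of these values come from the article's figure.

```python
import random


def fit_line(t, y):
    """Ordinary least-squares slope and intercept."""
    n = len(t)
    t_mean, y_mean = sum(t) / n, sum(y) / n
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    a = s_ty / s_tt
    return a, y_mean - a * t_mean


def r_squared(t, y):
    """1 minus (residual sum of squares / total sum of squares)."""
    a, b = fit_line(t, y)
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - (a * ti + b)) ** 2 for ti, yi in zip(t, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def block_means(values, k):
    """Average consecutive blocks of k values (len(values) divisible by k)."""
    return [sum(values[i:i + k]) / k for i in range(0, len(values), k)]


random.seed(0)
n = 1000
t = list(range(n))
y = [0.01 * ti + random.gauss(0.0, 3.0) for ti in t]   # trend plus noise

results = {}
for k in (1, 10, 100):   # k = 1 is the unfiltered series
    tk, yk = block_means(t, k), block_means(y, k)
    slope, _ = fit_line(tk, yk)
    results[k] = (slope, r_squared(tk, yk))
    print(k, round(slope, 4), round(results[k][1], 3))
```

The fitted slope stays close to 0.01 in all three cases, while r-squared rises as the averaging window grows, because averaging removes noise variance without changing the underlying trend.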

Advanced models

Thus far, the data have been assumed to consist of the trend plus noise, with the noise at each data point being independent and identically distributed random variables with a normal distribution. Real data (for example, climate data) may not fulfill these criteria. This is important, as it makes an enormous difference to the ease with which the statistics can be analyzed so as to extract maximum information from the data series. If there are other non-linear effects that have a correlation to the independent variable (such as cyclic influences), the use of least-squares estimation of the trend is not valid. Also, where the variations are significantly larger than the resulting straight line trend, the choice of start and end points can significantly change the result. That is, the model is mathematically misspecified. Statistical inferences (tests for the presence of a trend, confidence intervals for the trend, etc.) are invalid unless departures from the standard assumptions are properly accounted for.

In R, the linear trend in data can be estimated by using the 'tslm' function of the 'forecast' package.

Medical and biomedical studies often seek to determine a link between sets of data, such as a clinical or scientific metric in three different diseases. But data may also be linked in time (such as change in the effect of a drug from baseline, to month 1, to month 2), or by an external factor that may or may not be determined by the researcher and/or their subject (such as no pain, mild pain, moderate pain, or severe pain). In these cases, one would expect the effect test statistic (e.g., influence of a statin on levels of cholesterol, an analgesic on the degree of pain, or increasing doses of different strengths of a drug on a measurable index, i.e. a dose-response effect) to change in direct order as the effect develops. Suppose the mean level of cholesterol before and after the prescription of a statin falls from 5.6 mmol/L at baseline to 3.4 mmol/L at one month and to 3.7 mmol/L at two months. Given sufficient power, an ANOVA (analysis of variance) would most likely find a significant fall at one and two months, but the fall is not linear. Furthermore, a post-hoc test may be required. An alternative test may be a repeated-measures (two-way) ANOVA or Friedman test, depending on the nature of the data. Nevertheless, because the groups are ordered, a standard ANOVA is inappropriate. Should the cholesterol fall from 5.4 to 4.1 to 3.7, there is a clear linear trend. The same principle may be applied to the effects of allele/genotype frequency, where it could be argued that the genotypes XX, XY, and YY of a single-nucleotide polymorphism in fact form a trend of no Y's, one Y, and then two Y's. [3]

The mathematics of linear trend estimation is a variant of the standard ANOVA, giving different information, and would be the most appropriate test if the researchers hypothesize a trend effect in their test statistic. One example is levels of serum trypsin in six groups of subjects ordered by age decade (10–19 years up to 60–69 years). Levels of trypsin (ng/mL) rise in a direct linear trend of 128, 152, 194, 207, 215, 218 (data from Altman). Unsurprisingly, a 'standard' ANOVA gives p < 0.0001, whereas linear trend estimation gives p = 0.00006. Incidentally, it could be reasonably argued that as age is a natural continuously variable index, it should not be categorized into decades, and that the relationship between age and serum trypsin should instead be sought by correlation (assuming the raw data are available). A further example is of a substance measured at four time points in different groups:

#    mean    SD
1    1.6     0.56
2    1.94    0.75
3    2.22    0.66
4    2.40    0.79

This is a clear trend. ANOVA gives p = 0.091, because the overall variance exceeds the means, whereas linear trend estimation gives p = 0.012. However, should the data have been collected at four time points in the same individuals, linear trend estimation would be inappropriate, and a two-way (repeated measures) ANOVA would have been applied.
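One common way to test for a linear trend across ordered groups is a linear contrast within a one-way ANOVA, which concentrates the between-group variation into a single degree of freedom. The sketch below computes the omnibus and linear-trend F statistics from the means and standard deviations in the table. The group size n = 10 is purely hypothetical (the table does not report group sizes), equal group sizes are assumed, and the p-values quoted in the text cannot be reproduced without the real group sizes and an F distribution.

```python
def trend_vs_omnibus_f(means, sds, n):
    """Return (F_omnibus, F_linear) from group summaries, equal group size n.

    F_omnibus has (k - 1, k*(n - 1)) degrees of freedom;
    F_linear has (1, k*(n - 1)) degrees of freedom.
    """
    k = len(means)
    # Pooled within-group variance (simple average because n is equal)
    mse = sum(sd ** 2 for sd in sds) / k
    grand = sum(means) / k
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    f_omnibus = ms_between / mse
    # Centred, equally spaced linear contrast: -3, -1, 1, 3 for k = 4
    coeffs = [2 * i - (k - 1) for i in range(k)]
    contrast = sum(c * m for c, m in zip(coeffs, means))
    ss_linear = n * contrast ** 2 / sum(c ** 2 for c in coeffs)
    f_linear = ss_linear / mse
    return f_omnibus, f_linear


means = [1.6, 1.94, 2.22, 2.40]
sds = [0.56, 0.75, 0.66, 0.79]
N_PER_GROUP = 10   # hypothetical: not given in the source table
f_omnibus, f_linear = trend_vs_omnibus_f(means, sds, N_PER_GROUP)
print(round(f_omnibus, 2), round(f_linear, 2))
```

With this hypothetical group size, the linear-trend F statistic (about 7.4) is roughly three times the omnibus F statistic (about 2.5), which illustrates why the trend test can be significant when the overall ANOVA is not.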

Notes

  1. "Making Regression More Useful II: Dummies and Trends" (PDF). Retrieved June 17, 2012.
  2. "The Royal Swedish Academy of Sciences" (PDF). 8 October 2003. Retrieved June 17, 2012.
  3. "IPCC Third Assessment Report – Climate Change 2001 – Complete online versions". Archived from the original on November 20, 2009. Retrieved June 17, 2012.
  4. Forecasting: principles and practice. 20 September 2014. Retrieved May 17, 2015.


References