Segmented regression

Last updated April 20, 2024

Segmented regression, also known as piecewise regression or broken-stick regression, is a method in regression analysis in which the independent variable is partitioned into intervals and a separate line segment is fit to each interval. Segmented regression analysis can also be performed on multivariate data by partitioning the various independent variables. It is useful when the independent variables, clustered into different groups, exhibit different relationships with the dependent variable in these regions. The boundaries between the segments are called breakpoints.

Segmented linear regression, two segments

Figures: first limb horizontal (SegReg3.gif); first limb sloping up (SegReg1.gif); first limb sloping down (SegReg2.gif).

Segmented linear regression with two segments separated by a breakpoint can be useful to quantify an abrupt change in the response function (Yr) to a varying influential factor (x). The breakpoint can be interpreted as a critical, safe, or threshold value beyond or below which (un)desired effects occur. The breakpoint can therefore be important in decision making.^[1]

The figures illustrate some of the results and regression types obtainable.

A segmented regression analysis is based on a set of (y, x) data, in which y is the dependent variable and x the independent variable.

The least squares method is applied separately to each segment: the two regression lines are made to fit the data set as closely as possible by minimizing the sum of squares of the differences (SSD) between the observed (y) and calculated (Yr) values of the dependent variable. This results in the following two equations:

Yr = A₁.x + K₁ for x < BP (breakpoint)
Yr = A₂.x + K₂ for x > BP (breakpoint)

where:

Yr is the expected (predicted) value of y for a certain value of x;

A₁ and A₂ are regression coefficients (indicating the slope of the line segments);

K₁ and K₂ are regression constants (indicating the intercept at the y-axis).
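
As an illustration of the two equations, the following minimal Python sketch fits a separate least-squares line on each side of a breakpoint that is assumed to be known in advance. The helper names fit_line and fit_two_segments and the data are hypothetical and chosen only for illustration; they do not come from the cited sources.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + k; returns (a, k)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def fit_two_segments(xs, ys, bp):
    """Fit separate lines to the points with x < bp and with x > bp."""
    left = [(x, y) for x, y in zip(xs, ys) if x < bp]
    right = [(x, y) for x, y in zip(xs, ys) if x > bp]
    a1, k1 = fit_line(*zip(*left))
    a2, k2 = fit_line(*zip(*right))
    return (a1, k1), (a2, k2)


# Hypothetical data: flat at y = 2 up to x = 4, then falling with slope -0.5.
xs = [0, 1, 2, 3, 5, 6, 7, 8]
ys = [2, 2, 2, 2, 1.5, 1.0, 0.5, 0.0]
(a1, k1), (a2, k2) = fit_two_segments(xs, ys, bp=4)
print(a1, k1, a2, k2)
```

Each side needs at least two distinct x values, otherwise the slope is undefined. In practice the breakpoint is usually not known beforehand and has to be estimated as well.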

The data may show many types of trends;^[2] see the figures.

The method also yields two coefficients of determination (R², the squares of the correlation coefficients R), one per segment:

$R_{1}^{2}=1-{\frac {\sum (y-Y_{r})^{2}}{\sum (y-Y_{a1})^{2}}}$ for x < BP (breakpoint)

and

$R_{2}^{2}=1-{\frac {\sum (y-Y_{r})^{2}}{\sum (y-Y_{a2})^{2}}}$ for x > BP (breakpoint)

where $\sum (y-Y_{r})^{2}$ is the minimized SSD per segment, and $Y_{a1}$ and $Y_{a2}$ are the average values of y in the respective segments.
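
The per-segment formula can be sketched in a few lines of Python. The function name r_squared and the numbers are hypothetical; the function simply evaluates one minus the ratio of the SSD to the sum of squared deviations from the segment mean.

```python
def r_squared(ys, predicted):
    """Return 1 - SSD / (sum of squared deviations of ys from their mean)."""
    mean_y = sum(ys) / len(ys)
    ssd = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ssd / sst


# Hypothetical observations in one segment and the values predicted by its line.
ys_seg = [1.0, 2.0, 3.0, 4.0]
yr_seg = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(ys_seg, yr_seg)
print(round(r2, 4))
```

The ratio is undefined when all y values in a segment are identical, because the denominator is then zero; a practical implementation would need to handle that case.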

In the determination of the most suitable trend, statistical tests must be performed to ensure that this trend is reliable (significant).

When no significant breakpoint can be detected, one must fall back on a regression without breakpoint.

Example

For the blue figure at the right that gives the relation between yield of mustard (Yr = Ym, t/ha) and soil salinity (x = Ss, expressed as electric conductivity of the soil solution EC in dS/m) it is found that:^[3]

BP = 4.93, A₁ = 0, K₁ = 1.74, A₂ = −0.129, K₂ = 2.38, R₁² = 0.0035 (insignificant), R₂² = 0.395 (significant) and:

Ym = 1.74 t/ha for Ss < 4.93 (breakpoint)
Ym = −0.129 Ss + 2.38 t/ha for Ss > 4.93 (breakpoint)

indicating that soil salinities < 4.93 dS/m are safe and soil salinities > 4.93 dS/m reduce the yield at a rate of 0.129 t/ha per unit (dS/m) increase of soil salinity.
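
The fitted relation can be written as a small Python function using the coefficients quoted above. The function name mustard_yield is hypothetical, and assigning the point Ss = 4.93 itself to the second segment is an assumption, since the equations only cover Ss < 4.93 and Ss > 4.93.

```python
BP = 4.93              # breakpoint, dS/m
A1, K1 = 0.0, 1.74     # first segment: horizontal
A2, K2 = -0.129, 2.38  # second segment: declining


def mustard_yield(ss):
    """Predicted yield Ym (t/ha) for soil salinity ss (dS/m)."""
    if ss < BP:
        return A1 * ss + K1
    # Assumption: ss == BP is evaluated with the second segment.
    return A2 * ss + K2


for ss in (3.0, 4.93, 8.0):
    print(ss, round(mustard_yield(ss), 3))
```

At the breakpoint the second equation gives about 1.744 t/ha, which agrees with the horizontal first segment (1.74 t/ha) to within rounding of the published coefficients.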

The figure also shows confidence intervals and uncertainty as elaborated hereunder.

Test procedures

Figure: Example time series, type 5 (CHAO.png)

The following statistical tests are used to determine the type of trend:

significance of the breakpoint (BP), by expressing BP as a function of the regression coefficients A₁ and A₂, the means Y₁ and Y₂ of the y data, and the means X₁ and X₂ of the x data (left and right of BP), using the laws of propagation of errors in additions and multiplications to compute the standard error (SE) of BP, and applying Student's t-test;
significance of A₁ and A₂, applying Student's t-distribution and the standard errors (SE) of A₁ and A₂;
significance of the difference of A₁ and A₂, applying Student's t-distribution using the SE of their difference;
significance of the difference of Y₁ and Y₂, applying Student's t-distribution using the SE of their difference.

A more formal statistical approach to test for the existence of a breakpoint is the pseudo score test, which does not require estimation of the segmented line.^[4]
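
One of the tests listed above, the significance of the difference between A₁ and A₂, might be sketched as follows. This is an illustration under stated assumptions, not the procedure of the cited sources: the standard error of each slope is the usual least-squares expression, the two slope estimates are treated as independent so that their variances add, and the data and function names are hypothetical. Only the t statistic is computed; it would still have to be compared with a critical value of Student's t-distribution, for instance with n₁ + n₂ − 4 degrees of freedom (also an assumption).

```python
import math


def slope_and_se(xs, ys):
    """Least-squares slope of one segment and its standard error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    k = my - a * mx
    ssd = sum((y - (a * x + k)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(ssd / (n - 2) / sxx)  # residual variance divided by Sxx
    return a, se


def t_slope_difference(seg1, seg2):
    """t statistic for A1 - A2, assuming independent slope estimates."""
    a1, se1 = slope_and_se(*seg1)
    a2, se2 = slope_and_se(*seg2)
    return (a1 - a2) / math.sqrt(se1 ** 2 + se2 ** 2)


# Hypothetical data left and right of a breakpoint.
left = ([1, 2, 3, 4, 5], [2.1, 1.9, 2.0, 2.1, 1.9])
right = ([6, 7, 8, 9, 10], [1.6, 1.0, 0.6, 0.1, -0.4])
t = t_slope_difference(left, right)
print(round(t, 2))
```

A large absolute t value suggests that the two slopes differ; each segment needs at least three points for the standard error to be defined.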

In addition, use is made of the correlation coefficient of all data (Ra), the coefficient of determination (or coefficient of explanation), confidence intervals of the regression functions, and analysis of variance (ANOVA).^[5]

The coefficient of determination for all data (Cd), which is to be maximized under the conditions set by the significance tests, is found from:

$C_{d}=1-{\sum (y-Y_{r})^{2} \over \sum (y-Y_{a})^{2}}$

where Yr is the expected (predicted) value of y according to the regression equations given earlier and Ya is the average of all y values.

The Cd coefficient ranges from 0 (no explanation at all) to 1 (full explanation, perfect match).
In a pure, unsegmented, linear regression, the values of Cd and Ra² are equal. In a segmented regression, Cd needs to be significantly larger than Ra² to justify the segmentation.
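
The equality of Cd and Ra² for an unsegmented linear regression can be checked numerically. The sketch below uses hypothetical data and hypothetical function names; it computes the Pearson correlation coefficient of all data and, separately, Cd from an ordinary least-squares line.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of all data (Ra)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def cd_single_line(xs, ys):
    """Cd = 1 - SSD / SST for one least-squares line through all data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    k = my - a * mx
    ssd = sum((y - (a * x + k)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1 - ssd / sst


# Hypothetical data.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 1.7, 1.9, 1.2, 0.8, 0.5]
ra = pearson_r(xs, ys)
cd = cd_single_line(xs, ys)
print(round(ra ** 2, 6), round(cd, 6))
```

The two printed values agree up to floating-point rounding, so any gain of Cd over Ra² in a segmented fit is attributable to the segmentation.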

The optimal value of the breakpoint may be found as the value at which the Cd coefficient reaches its maximum.
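
A simple way to search for such a breakpoint is to try every gap between consecutive x values and keep the one with the largest Cd. The following Python sketch does this under several assumptions: candidate breakpoints are midpoints between neighbouring x values, each segment must contain at least min_points observations (a hypothetical parameter), and the significance conditions mentioned above are not checked. The data are hypothetical.

```python
def segment_ssd(points):
    """SSD of an ordinary least-squares line through (x, y) points."""
    xs, ys = zip(*points)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    k = my - a * mx
    return sum((y - (a * x + k)) ** 2 for x, y in zip(xs, ys))


def best_breakpoint(xs, ys, min_points=3):
    """Return (bp, cd) for the candidate breakpoint with the largest Cd."""
    pts = sorted(zip(xs, ys))
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)
    best = None
    for i in range(min_points, len(pts) - min_points + 1):
        if pts[i - 1][0] == pts[i][0]:
            continue  # no gap between equal x values
        bp = (pts[i - 1][0] + pts[i][0]) / 2
        cd = 1 - (segment_ssd(pts[:i]) + segment_ssd(pts[i:])) / sst
        if best is None or cd > best[1]:
            best = (bp, cd)
    return best


# Hypothetical data: flat at y = 2 up to x = 4, then a declining line.
xs = list(range(10))
ys = [2, 2, 2, 2, 2, 1.75, 1.25, 0.75, 0.25, -0.25]
bp, cd = best_breakpoint(xs, ys)
print(bp, cd)
```

The sketch assumes that each segment contains at least two distinct x values. A complete analysis would combine this search with the significance tests, since a high Cd alone does not establish a reliable breakpoint.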

No-effect range

Segmented regression is often used to detect the range over which an explanatory variable (X) has no effect on the dependent variable (Y), while beyond that range there is a clear response, be it positive or negative. The range of no effect may be found at the initial part of the X domain or, conversely, at its last part. For the "no effect" analysis, application of the least squares method for the segmented regression analysis^[6] may not be the most appropriate technique, because the aim is rather to find the longest stretch over which the Y-X relation can be considered to have zero slope, while beyond that stretch the slope is significantly different from zero but knowledge of the best value of this slope is not material. The method to find the no-effect range is progressive partial regression^[7] over the range, extending the range in small steps until the regression coefficient becomes significantly different from zero.
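
The idea of extending the range step by step can be sketched in Python as follows. This is only an illustration of the principle, not the procedure of the cited software: the range grows by one observation at a time starting from the lowest x values, the slope is tested with a t statistic, and a fixed critical value t_crit = 2.0 is used as a rough stand-in for the proper Student's t value, which depends on the number of observations. The function names, the parameters start and t_crit, and the data are all hypothetical.

```python
import math


def slope_t(xs, ys):
    """Absolute t statistic of the least-squares slope (slope / SE)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    k = my - a * mx
    ssd = sum((y - (a * x + k)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(ssd / (n - 2) / sxx)
    if se == 0:
        return 0.0 if a == 0 else math.inf
    return abs(a) / se


def no_effect_range(xs, ys, t_crit=2.0, start=4):
    """Upper x of the longest initial stretch whose slope is not significant.

    Returns None if even the first `start` points show a significant slope.
    """
    pts = sorted(zip(xs, ys))
    last_ok = None
    for end in range(start, len(pts) + 1):
        sx, sy = zip(*pts[:end])
        if slope_t(sx, sy) > t_crit:
            break
        last_ok = pts[end - 1][0]
    return last_ok


# Hypothetical data: roughly constant up to x = 7, declining afterwards.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.0, 2.1, 1.9, 2.0, 2.1, 1.9, 2.0, 1.5, 1.0, 0.5]
upper = no_effect_range(xs, ys)
print(upper)
```

In this example the stretch is extended past the point where the decline visibly starts, because one or two lower values are not yet enough to make the slope significant. This is consistent with the no-effect range tending to be longer than the range found by least squares.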

In the next figure the breakpoint is found at X = 7.9, while for the same data (see the blue figure above for mustard yield) the least squares method yields a breakpoint at only X = 4.9. The latter value is lower, but the fit of the data beyond the breakpoint is better. Hence, which method should be employed depends on the purpose of the analysis.

References

[1] Frequency and Regression Analysis. Chapter 6 in: H.P. Ritzema (ed., 1994), Drainage Principles and Applications, Publ. 16, pp. 175-224, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands. ISBN 90-70754-33-9.
[2] Drainage research in farmers' fields: analysis of data. Part of project "Liquid Gold" of the International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands.
[3] R.J. Oosterbaan, D.P. Sharma, K.N. Singh and K.V.G.K. Rao, 1990, Crop production and soil salinity: evaluation of field data from India by segmented linear regression. In: Proceedings of the Symposium on Land Drainage for Salinity Control in Arid and Semi-Arid Regions, February 25th to March 2nd, 1990, Cairo, Egypt, Vol. 3, Session V, pp. 373-383.
[4] Muggeo, V.M.R. (2016). "Testing with a nuisance parameter present only under the alternative: a score-based approach with application to segmented modelling". Journal of Statistical Computation and Simulation. 86 (15): 3059–3067. doi:10.1080/00949655.2016.1149855. S2CID 124914264.
[5] Statistical significance of segmented linear regression with break-point using variance analysis and F-tests.
[6] Segmented regression analysis, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands.
[7] Partial Regression Analysis, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
