SegReg
Segmented regression software

Figure: screenshot of the graphics tab sheet.

Developer(s): Institute for Land Reclamation and Improvement (ILRI)
Written in: Delphi
Operating system: Microsoft Windows
Available in: English
Type: Statistical software
License: Proprietary freeware
Website: SegReg

In statistics and data analysis, the application software SegReg is a free and user-friendly tool for linear segmented regression analysis, used to determine the breakpoint at which the relation between the dependent variable and the independent variable changes abruptly.[1]


Features

Figure: screenprint of the input tab sheet.
Figure: segmented regression of residuals on the number of irrigations, with confidence intervals shown.
Figure: screenprint of the ANOVA table.

SegReg permits the introduction of one or two independent variables. When two variables are used, it first determines the relation between the dependent variable and the most influential independent variable, after which it finds the relation between the residuals and the second independent variable. Residuals are the deviations of the observed values of the dependent variable from the values obtained by segmented regression on the first independent variable.

The breakpoint is found numerically by adopting a series of tentative breakpoints and performing a linear regression on both sides of each of them. The tentative breakpoint that provides the largest coefficient of determination (as a measure of the fit of the regression lines to the observed data) is selected as the true breakpoint. To ensure that the lines on both sides of the breakpoint intersect each other exactly at the breakpoint, SegReg employs two methods and selects the method giving the best fit.
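The idea of the search can be illustrated with a short sketch. The Python code below is not SegReg's own (Delphi) implementation; it is a minimal illustration, under the assumption that each candidate breakpoint splits the data into two independently fitted lines, of choosing the candidate with the largest coefficient of determination. It omits SegReg's additional step of forcing the two lines to intersect at the breakpoint, and all names are chosen for the example.

```python
import numpy as np

def segmented_fit(x, y, min_points=3):
    """Grid-search a breakpoint: fit one line left and one line right of each
    candidate, keep the candidate with the highest overall R-squared."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    ss_tot = np.sum((y - y.mean()) ** 2)

    best = None
    for bp in np.unique(x)[min_points:-min_points]:    # tentative breakpoints
        left, right = x <= bp, x > bp
        a1, b1 = np.polyfit(x[left], y[left], 1)       # slope, intercept (left)
        a2, b2 = np.polyfit(x[right], y[right], 1)     # slope, intercept (right)
        pred = np.where(left, a1 * x + b1, a2 * x + b2)
        r2 = 1.0 - np.sum((y - pred) ** 2) / ss_tot    # coefficient of determination
        if best is None or r2 > best[0]:
            best = (r2, bp, (a1, b1), (a2, b2))
    return best  # (R2, breakpoint, left line, right line)

# toy data with an abrupt change of slope near x = 5
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 60)
y = np.where(x < 5, 2.0, 2.0 + 1.5 * (x - 5)) + rng.normal(0, 0.2, x.size)
r2, bp, left_line, right_line = segmented_fit(x, y)
print(f"breakpoint ~ {bp:.2f}, R2 = {r2:.3f}")
```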

SegReg recognizes many types of relations and selects the final type on the basis of statistical criteria such as the significance of the regression coefficients. The SegReg output provides statistical confidence belts of the regression lines and a confidence block for the breakpoint.[2] The confidence level can be selected as 90%, 95%, or 98%.

To complete the confidence statements, SegReg provides an analysis of variance and an ANOVA table.[3]
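Reference [3] describes the specific F-tests used for segmented linear regression. As a rough illustration only, the generic nested-model F-test below compares the residual sum of squares of a single regression line with that of a segmented fit; the sample size, sums of squares, and parameter counts are hypothetical, and the exact degrees of freedom depend on how the breakpoint is counted, so this should not be read as SegReg's exact procedure.

```python
import numpy as np
from scipy import stats

def nested_f_test(sse_reduced, df_reduced, sse_full, df_full):
    """Generic F-test for nested regression models: does the fuller model
    (here, the segmented fit) reduce the error sum of squares significantly?"""
    f = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
    p = stats.f.sf(f, df_reduced - df_full, df_full)
    return f, p

# illustrative numbers: n = 60 observations,
# one straight line (2 parameters) vs. a segmented fit (4 parameters + breakpoint)
n = 60
sse_line, sse_segmented = 25.0, 9.0      # hypothetical residual sums of squares
f, p = nested_f_test(sse_line, n - 2, sse_segmented, n - 5)
print(f"F = {f:.2f}, p = {p:.4f}")
```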

During the input phase, the user can indicate a preference for, or an exclusion of, a certain type. A preferred type is accepted only when it is statistically significant, even if another type has higher significance.

ILRI[4] provides examples of application to quantities such as crop yield, water-table depth, and soil salinity.

A list of publications in which SegReg is used is available.[5]

Equations

When only one independent variable is present, the results may look like:

    Y = A1·X + B1 + RY   for X < BP
    Y = A2·X + B2 + RY   for X > BP

where BP is the breakpoint, Y is the dependent variable, X the independent variable, A1 and A2 the regression coefficients, B1 and B2 the regression constants, and RY the residual of Y. When two independent variables are present, the results may look like:

    Y = A1·X + B1 + RY   for X < BPX
    Y = A2·X + B2 + RY   for X > BPX
    RY = C1·Z + D1       for Z < BPZ
    RY = C2·Z + D2       for Z > BPZ

where, additionally, BPX is the breakpoint of X, BPZ is the breakpoint of Z, Z is the second independent variable, C1 and C2 are the regression coefficients, and D1 and D2 the regression constants for the regression of RY on Z.

Substituting the expressions of RY from the second set of equations into the first set yields:

    Y = A1·X + C1·Z + E1   for X < BPX and Z < BPZ
    Y = A1·X + C2·Z + E2   for X < BPX and Z > BPZ
    Y = A2·X + C1·Z + E3   for X > BPX and Z < BPZ
    Y = A2·X + C2·Z + E4   for X > BPX and Z > BPZ

where E1 = B1 + D1, E2 = B1 + D2, E3 = B2 + D1, and E4 = B2 + D2.
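As a quick check on the algebra, the substituted model can be evaluated directly. The short Python sketch below mirrors the symbols used above with hypothetical coefficient values; it is not SegReg output, and the function name is chosen for the example.

```python
def segmented_two_vars(x, z, A, B, C, D, bp_x, bp_z):
    """Evaluate Y = A_i*X + C_j*Z + E, with E = B_i + D_j chosen by the
    positions of X and Z relative to their breakpoints BPX and BPZ."""
    i = 0 if x < bp_x else 1          # selects A1/B1 or A2/B2
    j = 0 if z < bp_z else 1          # selects C1/D1 or C2/D2
    return A[i] * x + C[j] * z + (B[i] + D[j])

# hypothetical coefficients: A = (A1, A2), B = (B1, B2), and so on
y = segmented_two_vars(x=3.0, z=8.0,
                       A=(0.5, 1.2), B=(1.0, -1.1),
                       C=(0.0, -0.4), D=(0.3, 3.5),
                       bp_x=5.0, bp_z=6.0)
print(y)
```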

Alternative

Figure: data showing a tolerance level (threshold) of the wheat crop for soil salinity, expressed as electrical conductivity, at ECe = 7.1 dS/m.

As an alternative to regressions on both sides of the breakpoint (threshold), the method of partial regression can be used to find the longest possible horizontal stretch with an insignificant regression coefficient, beyond which there is a definite slope with a significant regression coefficient. The alternative method can be used for segmented regressions of Type 3 and Type 4 when the intention is to detect a tolerance level of the dependent variable for varying quantities of the independent (explanatory) variable, also called the predictor.[6]
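A rough illustration of this idea, under the assumption that "insignificant" is judged by the p-value of the slope in an ordinary least-squares fit: extend the candidate horizontal stretch from the lowest X values upward, and report the longest stretch over which the slope remains insignificant. This is a simplified sketch, not the procedure of the free partial-regression software cited above.

```python
import numpy as np
from scipy import stats

def longest_flat_stretch(x, y, alpha=0.05, min_points=5):
    """Find the largest x* such that the regression of y on x over x <= x*
    has a statistically insignificant slope (p >= alpha)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    threshold = None
    for k in range(min_points, len(x) + 1):
        slope, intercept, r, p, se = stats.linregress(x[:k], y[:k])
        if p >= alpha:                 # slope still insignificant: stretch may grow
            threshold = x[k - 1]
    return threshold

# toy data: flat response up to x = 7, declining afterwards
rng = np.random.default_rng(2)
x = np.linspace(0, 14, 80)
y = np.where(x < 7, 5.0, 5.0 - 0.8 * (x - 7)) + rng.normal(0, 0.3, x.size)
print("estimated tolerance threshold:", longest_flat_stretch(x, y))
```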

The attached figure concerns the same data as shown in the blue graph in the infobox at the top of this page. Here, the wheat crop has a tolerance for soil salinity up to the level ECe = 7.1 dS/m instead of 4.6 in the blue figure. However, the fit of the data beyond the threshold is not as good as in the blue figure, which was made by minimizing the sum of squares of the deviations of the observed values from the regression lines over the whole domain of the explanatory variable X (i.e., by maximizing the coefficient of determination), whereas the partial regression is designed only to find the point where the horizontal trend changes into a sloping trend.


Related Research Articles

Pearson correlation coefficient – measure of linear correlation

In statistics, the Pearson correlation coefficient ― also known as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficient ― is a measure of linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus it is essentially a normalized measurement of the covariance, such that the result always has a value between −1 and 1. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationship or correlation. As a simple example, one would expect the age and height of a sample of teenagers from a high school to have a Pearson correlation coefficient significantly greater than 0, but less than 1.
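As a worked illustration of the ratio described above (a generic computation, not tied to SegReg; the helper name is chosen for the example):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their
    standard deviations (population form; the sample sizes cancel)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(pearson_r(x, y))        # same value as np.corrcoef(x, y)[0, 1]
```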

In statistics and optimization, errors and residuals are two closely related and easily confused measures of the deviation of an observed value of an element of a statistical sample from its "true value". The error of an observation is the deviation of the observed value from the true value of a quantity of interest. The residual is the difference between the observed value and the estimated value of the quantity of interest. The distinction is most important in regression analysis, where the concepts are sometimes called the regression errors and regression residuals and where they lead to the concept of studentized residuals. In econometrics, "errors" are also called disturbances.

Regression analysis – set of statistical processes for estimating the relationships among variables

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. The most common form of regression analysis is linear regression, in which one finds the line that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line that minimizes the sum of squared differences between the true data and that line. For specific mathematical reasons, this allows the researcher to estimate the conditional expectation of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters or estimate the conditional expectation across a broader collection of non-linear models.

The general linear model or general multivariate regression model is a compact way of simultaneously writing several multiple linear regression models. In that sense it is not a separate statistical linear model. The various multiple linear regression models may be compactly written as Y = XB + U, where Y is a matrix of multivariate measurements, X is a design matrix, B is a matrix of parameters to be estimated, and U is a matrix of errors.

Total least squares

In applied statistics, total least squares is a type of errors-in-variables regression, a least squares data modeling technique in which observational errors on both dependent and independent variables are taken into account. It is a generalization of Deming regression and also of orthogonal regression, and can be applied to both linear and non-linear models.

Nonlinear regression – form of regression analysis

In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations.

In statistics, the coefficient of multiple correlation is a measure of how well a given variable can be predicted using a linear function of a set of other variables. It is the correlation between the variable's values and the best predictions that can be computed linearly from the predictive variables.

Coefficient of determination – indicator of how well data points fit a line or curve

In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).
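For example, computed directly from its usual definition in terms of sums of squares (a generic sketch, not SegReg output):

```python
import numpy as np

def r_squared(y_observed, y_predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y, f = np.asarray(y_observed, float), np.asarray(y_predicted, float)
    ss_res = np.sum((y - f) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9]))
```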

In statistics, omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables. The bias results in the model attributing the effect of the missing variables to those that were included.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the given dataset and those predicted by the linear function of the independent variable.
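A minimal matrix formulation of that minimization, assuming a well-conditioned design matrix so that the normal equations can be solved directly (illustrative data only):

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares: solve the normal equations X'X b = X'y,
    which minimizes the sum of squared residuals ||y - X b||^2."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    return np.linalg.solve(X.T @ X, X.T @ y)

# design matrix with an intercept column and one explanatory variable
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
intercept, slope = ols(X, y)
print(intercept, slope)
```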

Simple linear regression – linear regression model with a single explanatory variable

In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable and finds a linear function that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.
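The single-predictor case has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. A small generic sketch:

```python
import numpy as np

def simple_linear_regression(x, y):
    """Closed-form slope and intercept of the least-squares line y = a*x + b."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    slope = np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

print(simple_linear_regression([0, 1, 2, 3], [1.0, 3.1, 4.9, 7.2]))
```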

Bootstrapping is any test or metric that uses random sampling with replacement, and falls under the broader class of resampling methods. Bootstrapping assigns measures of accuracy to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.
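For instance, a percentile-bootstrap confidence interval for a sample mean (a generic sketch; the number of resamples and the data are arbitrary):

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic:
    resample with replacement, recompute the statistic each time."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, float)
    stats_ = [stat(rng.choice(data, size=data.size, replace=True))
              for _ in range(n_resamples)]
    lo, hi = np.quantile(stats_, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

sample = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4]
print(bootstrap_ci(sample))
```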

In probability theory and statistics, partial correlation measures the degree of association between two random variables, with the effect of a set of controlling random variables removed. If we are interested in finding to what extent there is a numerical relationship between two variables of interest, using their correlation coefficient will give misleading results if there is another, confounding, variable that is numerically related to both variables of interest. This misleading information can be avoided by controlling for the confounding variable, which is done by computing the partial correlation coefficient. This is precisely the motivation for including other right-side variables in a multiple regression; but while multiple regression gives unbiased results for the effect size, it does not give a numerical value of a measure of the strength of the relationship between the two variables of interest.
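One common way to compute it, assuming a single linear control variable: regress each variable of interest on the control and correlate the residuals. A minimal sketch with synthetic data:

```python
import numpy as np

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z:
    correlate the residuals of the x-on-z and y-on-z regressions."""
    x, y, z = (np.asarray(v, float) for v in (x, y, z))
    Z = np.column_stack([np.ones_like(z), z])
    res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

rng = np.random.default_rng(3)
z = rng.normal(size=200)                 # confounder driving both x and y
x = 2 * z + rng.normal(size=200)
y = -3 * z + rng.normal(size=200)
print(np.corrcoef(x, y)[0, 1])           # strongly negative (spurious)
print(partial_correlation(x, y, z))      # near zero once z is controlled
```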

In statistics, the fraction of variance unexplained (FVU) in the context of a regression task is the fraction of variance of the regressand Y which cannot be explained, i.e., which is not correctly predicted, by the explanatory variables X.

Segmented regression, also known as piecewise regression or broken-stick regression, is a method in regression analysis in which the independent variable is partitioned into intervals and a separate line segment is fit to each interval. Segmented regression analysis can also be performed on multivariate data by partitioning the various independent variables. Segmented regression is useful when the independent variables, clustered into different groups, exhibit different relationships between the variables in these regions. The boundaries between the segments are breakpoints.

In statistics and regression analysis, moderation occurs when the relationship between two variables depends on a third variable. The third variable is referred to as the moderator variable or simply the moderator. The effect of a moderating variable is characterized statistically as an interaction; that is, a categorical or quantitative variable that affects the direction and/or strength of the relation between dependent and independent variables. Specifically within a correlational analysis framework, a moderator is a third variable that affects the zero-order correlation between two other variables, or the value of the slope of the dependent variable on the independent variable. In analysis of variance (ANOVA) terms, a basic moderator effect can be represented as an interaction between a focal independent variable and a factor that specifies the appropriate conditions for its operation.

In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y |x). Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.
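Because the model is linear in its coefficients, it can be fitted with the same least-squares machinery by building a design matrix of powers of x. A generic sketch (synthetic data; np.polyfit performs an equivalent fit):

```python
import numpy as np

def polynomial_fit(x, y, degree):
    """Fit y ~ c0 + c1*x + ... + cd*x^d by ordinary least squares on a
    Vandermonde design matrix; the model is linear in the coefficients c."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.vander(x, degree + 1, increasing=True)    # columns: 1, x, x^2, ...
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

x = np.linspace(-2, 2, 30)
y = 1.0 - 2.0 * x + 0.5 * x**2 + np.random.default_rng(4).normal(0, 0.1, x.size)
print(polynomial_fit(x, y, degree=2))    # approximately [1.0, -2.0, 0.5]
```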

An outline is also available as an overview of and topical guide to regression analysis.

Linear least squares (LLS) is the least squares approximation of linear functions to data. It is a set of formulations for solving statistical problems involved in linear regression, including variants for ordinary (unweighted), weighted, and generalized (correlated) residuals. Numerical methods for linear least squares include inverting the matrix of the normal equations and orthogonal decomposition methods.
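The two numerical routes mentioned above can be contrasted in a few lines: solving the normal equations directly versus using a QR (orthogonal) decomposition, which is generally better conditioned. A sketch with arbitrary data:

```python
import numpy as np

# small overdetermined system: fit y = b0 + b1*x to five points
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 2.1, 2.9, 4.2, 4.8])
A = np.column_stack([np.ones_like(x), x])

# route 1: normal equations A'A b = A'y (simple, but squares the condition number)
b_normal = np.linalg.solve(A.T @ A, A.T @ y)

# route 2: orthogonal decomposition A = QR, then solve R b = Q'y
Q, R = np.linalg.qr(A)
b_qr = np.linalg.solve(R, Q.T @ y)

print(b_normal, b_qr)     # the two routes agree for well-conditioned problems
```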

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

References

  1. Statistical principles of segmented regression with break-point.
  2. Determination of the confidence interval of the break-point.
  3. F-tests in the analysis of variance for segmented linear regression.
  4. Drainage research in farmers' fields: analysis of data, 2002. Contribution to the project “Liquid Gold” of the International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands.
  5. List of publications using SegReg.
  6. Free software for partial regression.