Isotonic regression

Last updated October 25, 2024

In statistics and numerical analysis, isotonic regression or monotonic regression is the technique of fitting a free-form line to a sequence of observations such that the fitted line is non-decreasing (or non-increasing) everywhere, and lies as close to the observations as possible.

Applications

Isotonic regression has applications in statistical inference. For example, one might use it to fit an isotonic curve to the means of some set of experimental results when those means are expected to increase according to some particular ordering. A benefit of isotonic regression is that it is not constrained by any functional form, such as the linearity imposed by linear regression, as long as the function is monotonically increasing.

Another application is nonmetric multidimensional scaling,^[1] where a low-dimensional embedding for data points is sought such that the order of distances between points in the embedding matches the order of dissimilarity between points. Isotonic regression is used iteratively to fit ideal distances that preserve the relative dissimilarity order.

Isotonic regression is also used in probabilistic classification to calibrate the predicted probabilities of supervised machine learning models.^[2]
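As a rough illustration of calibration, the sketch below sorts examples by classifier score and fits a non-decreasing sequence to the 0/1 labels; the fitted values serve as calibrated probabilities at the observed scores. The function names, scores, and labels are hypothetical. For brevity the fit uses the brute-force max-min characterization of the unweighted, totally ordered isotonic fit rather than an efficient algorithm, and it assumes distinct scores.

```python
def isotonic_fit(y):
    """Unweighted non-decreasing least-squares fit of a sequence.

    Brute-force max-min formula (cubic time, illustration only):
        fit[i] = max over s <= i of min over t >= i of mean(y[s..t])
    """
    n = len(y)
    prefix = [0.0]
    for v in y:
        prefix.append(prefix[-1] + v)

    def mean(s, t):
        # Mean of y[s..t], both ends inclusive.
        return (prefix[t + 1] - prefix[s]) / (t - s + 1)

    return [
        max(min(mean(s, t) for t in range(i, n)) for s in range(i + 1))
        for i in range(n)
    ]


def calibrate(scores, labels):
    """Return (sorted scores, calibrated probabilities at those scores)."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    sorted_scores = [scores[k] for k in order]
    sorted_labels = [float(labels[k]) for k in order]
    return sorted_scores, isotonic_fit(sorted_labels)


# Hypothetical raw classifier scores and true binary labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 0, 1, 0, 1, 1, 0, 1]
xs, probs = calibrate(scores, labels)
```

On this toy data the calibrated values are 0, 0, 1/2, 1/2, 2/3, 2/3, 2/3, 1: label sequences that violate monotonicity are replaced by their local averages.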

Isotonic regression for the simply ordered case with univariate $x,y$ has been applied to estimating continuous dose-response relationships in fields such as anesthesiology and toxicology. Narrowly speaking, isotonic regression only provides point estimates at observed values of $x$. Estimation of the complete dose-response curve without any additional assumptions is usually done via linear interpolation between the point estimates.^[3]

Software for computing isotone (monotonic) regression has been developed for R,^[4]^[5]^[6] Stata, and Python.^[7]

Problem statement and algorithms

Let $(x_{1},y_{1}),\ldots ,(x_{n},y_{n})$ be a given set of observations, where the $y_{i}\in \mathbb {R}$ and the $x_{i}$ fall in some partially ordered set. For generality, each observation $(x_{i},y_{i})$ may be given a weight $w_{i}\geq 0$, although commonly $w_{i}=1$ for all $i$.

Isotonic regression seeks a weighted least-squares fit ${\hat {y}}_{i}\approx y_{i}$ for all $i$, subject to the constraint that ${\hat {y}}_{i}\leq {\hat {y}}_{j}$ whenever $x_{i}\leq x_{j}$. This gives the following quadratic program (QP) in the variables ${\hat {y}}_{1},\ldots ,{\hat {y}}_{n}$:

$$\min \sum _{i=1}^{n}w_{i}({\hat {y}}_{i}-y_{i})^{2}$$

subject to

$${\hat {y}}_{i}\leq {\hat {y}}_{j}{\text{ for all }}(i,j)\in E$$

where $E=\{(i,j):x_{i}\leq x_{j}\}$ specifies the partial ordering of the observed inputs $x_{i}$ (and may be regarded as the set of edges of some directed acyclic graph (DAG) with vertices $1,2,\ldots ,n$). Problems of this form may be solved by generic quadratic programming techniques.
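To make the constraint set concrete, the following sketch builds $E$ for a small set of two-dimensional inputs under the componentwise partial order, checks whether a candidate fit satisfies every constraint, and evaluates the weighted least-squares objective. The helper names, the data, and the choice of the componentwise order are assumptions made for illustration. The sketch does not solve the QP, and the candidate is offered only as a feasible point.

```python
def leq(a, b):
    """Componentwise partial order on tuples: a <= b iff it holds per coordinate."""
    return all(ai <= bi for ai, bi in zip(a, b))


def build_edges(xs):
    """Edge set E = {(i, j) : x_i <= x_j}, omitting the trivial pairs i == j."""
    n = len(xs)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and leq(xs[i], xs[j])]


def is_feasible(fit, edges, tol=1e-12):
    """True if fit[i] <= fit[j] for every edge (i, j), up to a small tolerance."""
    return all(fit[i] <= fit[j] + tol for i, j in edges)


def objective(fit, y, w):
    """Weighted sum of squared residuals."""
    return sum(wi * (fi - yi) ** 2 for wi, fi, yi in zip(w, fit, y))


# Hypothetical observations on the corners of the unit square.
xs = [(0, 0), (1, 0), (0, 1), (1, 1)]
y = [1.0, 3.0, 0.5, 2.0]
w = [1.0, 1.0, 1.0, 1.0]

edges = build_edges(xs)             # (1, 0) and (0, 1) are incomparable
candidate = [0.75, 2.5, 0.75, 2.5]  # pools the two violating pairs

raw_ok = is_feasible(y, edges)                  # False: y itself violates E
candidate_ok = is_feasible(candidate, edges)    # True
candidate_cost = objective(candidate, y, w)     # 0.625
```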

In the usual setting where the $x_{i}$ values fall in a totally ordered set such as $\mathbb {R}$, we may assume without loss of generality that the observations have been sorted so that $x_{1}\leq x_{2}\leq \cdots \leq x_{n}$, and take $E=\{(i,i+1):1\leq i<n\}$. In this case, a simple iterative algorithm for solving the quadratic program is the pool adjacent violators algorithm (PAVA). Best and Chakravarti^[8] instead studied the problem as an active set identification problem, and proposed a primal algorithm. These two algorithms can be seen as each other's dual, and both have a computational complexity of $O(n)$ on already sorted data.^[8]
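A minimal sketch of the pool adjacent violators idea for the totally ordered case is given below. It assumes the observations are already sorted by $x$ and that all weights are strictly positive (a zero-weight block would need special handling). The function name and block representation are choices made for this illustration, not taken from any particular library.

```python
def pava(y, w=None):
    """Weighted non-decreasing least-squares fit; y must be sorted by x."""
    n = len(y)
    if w is None:
        w = [1.0] * n

    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Pool the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])

    # Expand each block back to one fitted value per observation.
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit


unweighted = pava([1, 3, 2, 4, 3, 5])   # [1.0, 2.5, 2.5, 3.5, 3.5, 5.0]
weighted = pava([3, 1], [1, 3])         # [1.5, 1.5]
```

Each observation is pushed once and every pooling step removes one block, so the total work is linear in $n$, in line with the $O(n)$ complexity on sorted data.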

To complete the isotonic regression task, we may then choose any non-decreasing function $f(x)$ such that $f(x_{i})={\hat {y}}_{i}$ for all $i$. Any such function solves

$$\min _{f}\sum _{i=1}^{n}w_{i}(f(x_{i})-y_{i})^{2}\quad {\text{subject to }}f{\text{ being non-decreasing}}$$

and can be used to predict the $y$ values for new values of $x$. A common choice when $x_{i}\in \mathbb {R}$ would be to interpolate linearly between the points $(x_{i},{\hat {y}}_{i})$, as illustrated in the figure, yielding a continuous piecewise linear function:

$$f(x)={\begin{cases}{\hat {y}}_{1}&{\text{if }}x\leq x_{1}\\{\hat {y}}_{i}+{\frac {x-x_{i}}{x_{i+1}-x_{i}}}({\hat {y}}_{i+1}-{\hat {y}}_{i})&{\text{if }}x_{i}\leq x\leq x_{i+1}\\{\hat {y}}_{n}&{\text{if }}x\geq x_{n}\end{cases}}$$
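The piecewise linear predictor can be sketched as follows. The function name is hypothetical, and the sketch assumes the $x_{i}$ are strictly increasing (tied inputs would have to be merged first) and that the fitted values have already been computed.

```python
from bisect import bisect_right


def interpolate(xs, fit, x):
    """Evaluate the piecewise linear interpolant of (xs[i], fit[i]) at x.

    xs must be strictly increasing; values outside [xs[0], xs[-1]] are
    clamped to the first or last fitted value.
    """
    if x <= xs[0]:
        return fit[0]
    if x >= xs[-1]:
        return fit[-1]
    i = bisect_right(xs, x) - 1          # now xs[i] <= x < xs[i + 1]
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return fit[i] + t * (fit[i + 1] - fit[i])


# Hypothetical sorted inputs and an already computed isotonic fit.
xs = [1.0, 2.0, 3.0, 4.0]
fit = [1.0, 2.5, 2.5, 4.0]

inside = interpolate(xs, fit, 1.5)    # 1.75, halfway between 1.0 and 2.5
flat = interpolate(xs, fit, 2.5)      # 2.5, on a flat segment
below = interpolate(xs, fit, 0.0)     # 1.0, clamped to the first value
above = interpolate(xs, fit, 10.0)    # 4.0, clamped to the last value
```

Because the fitted values are non-decreasing, the interpolant is non-decreasing as well, so it remains a valid solution of the function-valued problem.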

Centered isotonic regression

As this article's first figure shows, in the presence of monotonicity violations the resulting interpolated curve will have flat (constant) intervals. In dose-response applications it is usually known that $f(x)$ is not only monotone but also smooth. The flat intervals are incompatible with the assumed shape of $f(x)$, and can be shown to be biased. A simple improvement for such applications, named centered isotonic regression (CIR), was developed by Oron and Flournoy and shown to substantially reduce estimation error for both dose-response and dose-finding applications.^[9] Both CIR and the standard isotonic regression for the univariate, simply ordered case are implemented in the R package "cir".^[4] This package also provides analytical confidence-interval estimates.

References

[1] Kruskal, J. B. (1964). "Nonmetric Multidimensional Scaling: A numerical method". Psychometrika. 29 (2): 115–129. doi:10.1007/BF02289694. S2CID 11709679.
[2] Niculescu-Mizil, Alexandru; Caruana, Rich (2005). "Predicting good probabilities with supervised learning". In De Raedt, Luc; Wrobel, Stefan (eds.). Proceedings of the Twenty-Second International Conference on Machine Learning (ICML 2005), Bonn, Germany, August 7–11, 2005. ACM International Conference Proceeding Series. Vol. 119. Association for Computing Machinery. pp. 625–632. doi:10.1145/1102351.1102430.
[3] Stylianou, MP; Flournoy, N (2002). "Dose finding using the biased coin up-and-down design and isotonic regression". Biometrics. 58 (1): 171–177. doi:10.1111/j.0006-341x.2002.00171.x. PMID 11890313. S2CID 8743090.
[4] Oron, Assaf. "Package 'cir'". CRAN. R Foundation for Statistical Computing. Retrieved 26 December 2020.
[5] Leeuw, Jan de; Hornik, Kurt; Mair, Patrick (2009). "Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods". Journal of Statistical Software. 32 (5): 1–24. doi:10.18637/jss.v032.i05. ISSN 1548-7660.
[6] Xu, Zhipeng; Sun, Chenkai; Karunakaran, Aman. "Package UniIsoRegression" (PDF). CRAN. R Foundation for Statistical Computing. Retrieved 29 October 2021.
[7] Pedregosa, Fabian; et al. (2011). "Scikit-learn: Machine learning in Python". Journal of Machine Learning Research. 12: 2825–2830. arXiv:1201.0490. Bibcode:2011JMLR...12.2825P.
[8] Best, Michael J.; Chakravarti, Nilotpal (1990). "Active set algorithms for isotonic regression; A unifying framework". Mathematical Programming. 47 (1–3): 425–439. doi:10.1007/bf01580873. ISSN 0025-5610. S2CID 31879613.
[9] Oron, AP; Flournoy, N (2017). "Centered Isotonic Regression: Point and Interval Estimation for Dose-Response Studies". Statistics in Biopharmaceutical Research. 9 (3): 258–267. arXiv:1701.05964. doi:10.1080/19466315.2017.1286256. S2CID 88521189.