# Isotonic regression


In statistics, isotonic regression or monotonic regression is the technique of fitting a free-form line to a sequence of observations such that the fitted line is non-decreasing (or non-increasing) everywhere, and lies as close to the observations as possible.

## Applications

Isotonic regression has applications in statistical inference. For example, one might use it to fit an isotonic curve to the means of some set of experimental results when an increase in those means according to some particular ordering is expected. A benefit of isotonic regression is that it is not constrained by any functional form, such as the linearity imposed by linear regression, as long as the function is monotonically increasing.

Another application is nonmetric multidimensional scaling, [1] where a low-dimensional embedding for data points is sought such that the order of distances between points in the embedding matches the order of dissimilarity between the points. Isotonic regression is used iteratively to fit ideal distances to preserve relative dissimilarity order.

Isotonic regression is also used in probabilistic classification to calibrate the predicted probabilities of supervised machine learning models. [2]

Isotonic regression for the simply ordered case with univariate ${\displaystyle x,y}$ has been applied to estimating continuous dose-response relationships in fields such as anesthesiology and toxicology. Narrowly speaking, isotonic regression only provides point estimates at observed values of ${\displaystyle x.}$ Estimation of the complete dose-response curve without any additional assumptions is usually done via linear interpolation between the point estimates. [3]

Software for computing isotone (monotonic) regression has been developed for R, [4] [5] Stata, and Python. [6]
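In Python, for example, scikit-learn exposes this as `sklearn.isotonic.IsotonicRegression`. A minimal sketch with made-up data (the violating pair is pooled to its mean):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 2.0, 4.0, 5.0])  # one monotonicity violation: 3.0 > 2.0

# Fit a non-decreasing function to (x, y); fitted values at the observed
# x are returned directly. New x values are predicted by linear
# interpolation, clipped to the end values outside the observed range.
iso = IsotonicRegression(out_of_bounds="clip")
y_hat = iso.fit_transform(x, y)

# The violating pair (3.0, 2.0) is pooled to its mean, 2.5.
print(y_hat)  # [1.  2.5 2.5 4.  5. ]
print(iso.predict(np.array([2.5, 10.0])))  # interpolated / clipped predictions
```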

## Problem Statement and Algorithms

Let ${\displaystyle (x_{1},y_{1}),\ldots ,(x_{n},y_{n})}$ be a given set of observations, where the ${\displaystyle y_{i}\in \mathbb {R} }$ and the ${\displaystyle x_{i}}$ fall in some partially ordered set. For generality, each observation ${\displaystyle (x_{i},y_{i})}$ may be given a weight ${\displaystyle w_{i}\geq 0}$, although commonly ${\displaystyle w_{i}=1}$ for all ${\displaystyle i}$.

Isotonic regression seeks a weighted least-squares fit ${\displaystyle {\hat {y}}_{i}\approx y_{i}}$ for all ${\displaystyle i}$, subject to the constraint that ${\displaystyle {\hat {y}}_{i}\leq {\hat {y}}_{j}}$ whenever ${\displaystyle x_{i}\leq x_{j}}$. This gives the following quadratic program (QP) in the variables ${\displaystyle {\hat {y}}_{1},\ldots ,{\hat {y}}_{n}}$:

${\displaystyle \min \sum _{i=1}^{n}w_{i}({\hat {y}}_{i}-y_{i})^{2}}$ subject to ${\displaystyle {\hat {y}}_{i}\leq {\hat {y}}_{j}{\text{ for all }}(i,j)\in E}$

where ${\displaystyle E=\{(i,j):x_{i}\leq x_{j}\}}$ specifies the partial ordering of the observed inputs ${\displaystyle x_{i}}$ (and may be regarded as the set of edges of some directed graph with vertices ${\displaystyle 1,2,\ldots n}$). Problems of this form may be solved by generic quadratic programming techniques.
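As an illustration (not an efficient method), the QP above can be handed to a general-purpose solver such as SciPy's SLSQP; the data values here are hypothetical, and the chain edges correspond to a totally ordered input:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data with one monotonicity violation; all weights equal to 1.
y = np.array([1.0, 3.0, 2.0, 4.0])
w = np.ones_like(y)
# For sorted, totally ordered inputs the edge set reduces to E = {(i, i+1)}.
edges = [(i, i + 1) for i in range(len(y) - 1)]

def objective(yhat):
    # Weighted least-squares criterion sum_i w_i (yhat_i - y_i)^2.
    return np.sum(w * (yhat - y) ** 2)

# SLSQP expects inequality constraints of the form g(yhat) >= 0,
# so each edge (i, j) contributes yhat_j - yhat_i >= 0.
constraints = [
    {"type": "ineq", "fun": (lambda yhat, i=i, j=j: yhat[j] - yhat[i])}
    for i, j in edges
]

res = minimize(objective, y.copy(), method="SLSQP", constraints=constraints)
print(res.x)  # approximately [1.0, 2.5, 2.5, 4.0]
```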

In the usual setting where the ${\displaystyle x_{i}}$ values fall in a totally ordered set such as ${\displaystyle \mathbb {R} }$, we may assume WLOG that the observations have been sorted so that ${\displaystyle x_{1}\leq x_{2}\leq \cdots \leq x_{n}}$, and take ${\displaystyle E=\{(i,i+1):1\leq i<n\}}$. In this case, a simple iterative algorithm for solving the quadratic program is the pool adjacent violators algorithm. Best and Chakravarti [7] instead studied the problem as an active set identification problem, and proposed a primal algorithm. These two algorithms can be seen as each other's dual, and both have a computational complexity of ${\displaystyle O(n)}$ on already sorted data. [7]
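The pool adjacent violators algorithm scans the sorted data left to right, merging each new point into the preceding block whenever it would violate monotonicity; a merged block is summarized by its weighted mean. A minimal sketch with hypothetical data:

```python
import numpy as np

def pava(y, w=None):
    """Pool adjacent violators: weighted least-squares isotonic fit
    for an already sorted (totally ordered) sequence y."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    # Each block stores [mean, total weight, count of pooled points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wsum = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wsum, wsum, c1 + c2])
    # Expand blocks back to per-observation fitted values.
    return np.concatenate([np.full(c, m) for m, _, c in blocks])

print(pava([1.0, 3.0, 2.0, 4.0, 5.0]))  # [1.  2.5 2.5 4.  5. ]
```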

To complete the isotonic regression task, we may then choose any non-decreasing function ${\displaystyle f(x)}$ such that ${\displaystyle f(x_{i})={\hat {y}}_{i}}$ for all i. Any such function obviously solves

${\displaystyle \min _{f}\sum _{i=1}^{n}w_{i}(f(x_{i})-y_{i})^{2}}$ subject to ${\displaystyle f}$ being nondecreasing

and can be used to predict the ${\displaystyle y}$ values for new values of ${\displaystyle x}$. A common choice when ${\displaystyle x_{i}\in \mathbb {R} }$ would be to interpolate linearly between the points ${\displaystyle (x_{i},{\hat {y}}_{i})}$, as illustrated in the figure, yielding a continuous piecewise linear function:

${\displaystyle f(x)={\begin{cases}{\hat {y}}_{1}&{\text{if }}x\leq x_{1}\\{\hat {y}}_{i}+{\frac {x-x_{i}}{x_{i+1}-x_{i}}}({\hat {y}}_{i+1}-{\hat {y}}_{i})&{\text{if }}x_{i}\leq x\leq x_{i+1}\\{\hat {y}}_{n}&{\text{if }}x\geq x_{n}\end{cases}}}$
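NumPy's `np.interp` implements exactly this rule: linear between the knots ${\displaystyle (x_{i},{\hat {y}}_{i})}$ and constant outside the observed range. A short sketch with hypothetical fitted values:

```python
import numpy as np

# Fitted values y_hat at the observed, sorted x's (hypothetical data).
x = np.array([0.0, 1.0, 2.0, 3.0])
y_hat = np.array([0.0, 1.0, 1.0, 4.0])

# np.interp matches the piecewise-linear f(x) above: it interpolates
# linearly between knots and clamps to y_hat[0] / y_hat[-1] outside.
x_new = np.array([-1.0, 0.5, 2.5, 10.0])
print(np.interp(x_new, x, y_hat))  # [0.  0.5 2.5 4. ]
```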

## Centered Isotonic Regression

As this article's first figure shows, in the presence of monotonicity violations the resulting interpolated curve will have flat (constant) intervals. In dose-response applications it is usually known that ${\displaystyle f(x)}$ is not only monotone but also smooth. The flat intervals are incompatible with the assumed shape of ${\displaystyle f(x)}$, and the estimates on them can be shown to be biased. A simple improvement for such applications, named centered isotonic regression (CIR), was developed by Oron and Flournoy and shown to substantially reduce estimation error for both dose-response and dose-finding applications. [8] Both CIR and the standard isotonic regression for the univariate, simply ordered case are implemented in the R package "cir". [4] This package also provides analytical confidence-interval estimates.
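A highly simplified sketch of the CIR idea (not the full estimator in the "cir" package, which includes boundary adjustments and interval estimation): run ordinary isotonic regression, then collapse each flat stretch to a single point at its mean ${\displaystyle x}$ (unweighted here for simplicity) and interpolate between these centered points. Data values are hypothetical:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def cir_knots(x, y):
    """Sketch of CIR: ordinary isotonic fit, then each flat stretch is
    replaced by one point at the stretch's mean x, shared fitted y."""
    x = np.asarray(x, dtype=float)
    y_hat = IsotonicRegression().fit_transform(x, y)
    xs, ys = [], []
    start = 0
    # Group consecutive observations sharing the same fitted value.
    for i in range(1, len(x) + 1):
        if i == len(x) or y_hat[i] != y_hat[start]:
            xs.append(x[start:i].mean())  # center of the flat stretch
            ys.append(y_hat[start])
            start = i
    return np.array(xs), np.array(ys)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 4.0])
cx, cy = cir_knots(x, y)
print(cx, cy)  # strictly increasing knots; interpolate between them for f(x)
```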


## References

1. Kruskal, J. B. (1964). "Nonmetric Multidimensional Scaling: A numerical method". Psychometrika. 29 (2): 115–129. doi:10.1007/BF02289694.
2. Niculescu-Mizil, Alexandru; Caruana, Rich (2005). "Predicting good probabilities with supervised learning". Proceedings of the 22nd International Conference on Machine Learning (ICML).
3. Stylianou, MP; Flournoy, N (2002). "Dose finding using the biased coin up-and-down design and isotonic regression". Biometrics. 58: 171–177. doi:10.1111/j.0006-341x.2002.00171.x.
4. Oron, Assaf. "Package 'cir'". CRAN. R Foundation for Statistical Computing. Retrieved 26 December 2020.
5. Leeuw, Jan de; Hornik, Kurt; Mair, Patrick (2009). "Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods". Journal of Statistical Software. 32 (5): 1–24. ISSN 1548-7660.
6. Pedregosa, Fabian; et al. (2011). "Scikit-learn: Machine learning in Python". Journal of Machine Learning Research. 12: 2825–2830. arXiv:1201.0490. Bibcode:2012arXiv1201.0490P.
7. Best, Michael J.; Chakravarti, Nilotpal (1990). "Active set algorithms for isotonic regression; A unifying framework". Mathematical Programming. 47 (1–3): 425–439. doi:10.1007/bf01580873. ISSN 0025-5610.
8. Oron, AP; Flournoy, N (2017). "Centered Isotonic Regression: Point and Interval Estimation for Dose-Response Studies". Statistics in Biopharmaceutical Research. 9: 258–267. doi:10.1080/19466315.2017.1286256.
• Robertson, T.; Wright, F. T.; Dykstra, R. L. (1988). Order restricted statistical inference. New York: Wiley. ISBN 978-0-471-91787-8.
• Barlow, R. E.; Bartholomew, D. J.; Bremner, J. M.; Brunk, H. D. (1972). Statistical inference under order restrictions; the theory and application of isotonic regression. New York: Wiley. ISBN 978-0-471-04970-8.
• Shively, T. S.; Sager, T. W.; Walker, S. G. (2009). "A Bayesian approach to non-parametric monotone function estimation". Journal of the Royal Statistical Society, Series B. 71 (1): 159–175. doi:10.1111/j.1467-9868.2008.00677.x.
• Wu, W. B.; Woodroofe, M.; Mentz, G. (2001). "Isotonic regression: Another look at the changepoint problem". Biometrika. 88 (3): 793–804. doi:10.1093/biomet/88.3.793.