Total least squares

Last updated October 29, 2024

In applied statistics, total least squares is a type of errors-in-variables regression, a least squares data modeling technique in which observational errors on both dependent and independent variables are taken into account. It is a generalization of Deming regression and also of orthogonal regression, and can be applied to both linear and non-linear models.

Linear model

Background

In the least squares method of data modeling, the objective function to be minimized, S, is a quadratic form:

S=\mathbf {r^{T}Wr} ,

where r is the vector of residuals and W is a weighting matrix. In linear least squares the model contains equations which are linear in the parameters appearing in the parameter vector ${\boldsymbol {\beta }}$ , so the residuals are given by

\mathbf {r=y-X{\boldsymbol {\beta }}} .

There are m observations in y and n parameters in β with m>n. X is a m×n matrix whose elements are either constants or functions of the independent variables, x. The weight matrix W is, ideally, the inverse of the variance-covariance matrix $\mathbf {M} _{y}$ of the observations y. The independent variables are assumed to be error-free. The parameter estimates are found by setting the gradient equations to zero, which results in the normal equations ^{[note 1]}

\mathbf {X^{T}WX{\boldsymbol {\beta }}=X^{T}Wy} .

Allowing observation errors in all variables

Now, suppose that both x and y are observed subject to error, with variance-covariance matrices $\mathbf {M} _{x}$ and $\mathbf {M} _{y}$ respectively. In this case the objective function can be written as

S=\mathbf {r_{x}^{T}M_{x}^{-1}r_{x}+r_{y}^{T}M_{y}^{-1}r_{y}} ,

where $\mathbf {r} _{x}$ and $\mathbf {r} _{y}$ are the residuals in x and y respectively. Clearly^{[ further explanation needed ]} these residuals cannot be independent of each other, but they must be constrained by some kind of relationship. Writing the model function as $\mathbf {f(r_{x},r_{y},{\boldsymbol {\beta }})}$ , the constraints are expressed by m condition equations.^[2]

\mathbf {F=\Delta y-{\frac {\partial f}{\partial r_{x}}}r_{x}-{\frac {\partial f}{\partial r_{y}}}r_{y}-X\Delta {\boldsymbol {\beta }}=0} .

Thus, the problem is to minimize the objective function subject to the m constraints. It is solved by the use of Lagrange multipliers. After some algebraic manipulations,^[3] the result is obtained.

\mathbf {X^{T}M^{-1}X\Delta {\boldsymbol {\beta }}=X^{T}M^{-1}\Delta y} ,

or alternatively $\mathbf {X^{T}M^{-1}X{\boldsymbol {\beta }}=X^{T}M^{-1}y} ,$ where M is the variance-covariance matrix relative to both independent and dependent variables.

\mathbf {M=K_{x}M_{x}K_{x}^{T}+K_{y}M_{y}K_{y}^{T};\ K_{x}=-{\frac {\partial f}{\partial r_{x}}},\ K_{y}=-{\frac {\partial f}{\partial r_{y}}}} .

Example

When the data errors are uncorrelated, all matrices M and W are diagonal. Then, take the example of straight line fitting.

f(x_{i},\beta )=\alpha +\beta x_{i}

in this case

M_{ii}=\sigma _{y,i}^{2}+\beta ^{2}\sigma _{x,i}^{2}

showing how the variance at the ith point is determined by the variances of both independent and dependent variables and by the model being used to fit the data. The expression may be generalized by noting that the parameter $\beta$ is the slope of the line.

M_{ii}=\sigma _{y,i}^{2}+\left({\frac {dy}{dx}}\right)_{i}^{2}\sigma _{x,i}^{2}

An expression of this type is used in fitting pH titration data where a small error on x translates to a large error on y when the slope is large.

Algebraic point of view

As was shown in 1980 by Golub and Van Loan, the TLS problem does not have a solution in general.^[4] The following considers the simple case where a unique solution exists without making any particular assumptions.

The computation of the TLS using singular value decomposition (SVD) is described in standard texts.^[5] We can solve the equation

XB\approx Y

for B where X is m-by-n and Y is m-by-k. ^{[note 2]}

That is, we seek to find B that minimizes error matrices E and F for X and Y respectively. That is,

\mathrm {argmin} _{B,E,F}\|[E\;F]\|_{F},\qquad (X+E)B=Y+F

where $[E\;F]$ is the augmented matrix with E and F side by side and $\|\cdot \|_{F}$ is the Frobenius norm, the square root of the sum of the squares of all entries in a matrix and so equivalently the square root of the sum of squares of the lengths of the rows or columns of the matrix.

This can be rewritten as

[(X+E)\;(Y+F)]{\begin{bmatrix}B\\-I_{k}\end{bmatrix}}=0.

where $I_{k}$ is the $k\times k$ identity matrix. The goal is then to find $[E\;F]$ that reduces the rank of $[X\;Y]$ by k. Define $[U][\Sigma ][V]^{*}$ to be the singular value decomposition of the augmented matrix $[X\;Y]$ .

[X\;Y]=[U_{X}\;U_{Y}]{\begin{bmatrix}\Sigma _{X}&0\\0&\Sigma _{Y}\end{bmatrix}}{\begin{bmatrix}V_{XX}&V_{XY}\\V_{YX}&V_{YY}\end{bmatrix}}^{*}=[U_{X}\;U_{Y}]{\begin{bmatrix}\Sigma _{X}&0\\0&\Sigma _{Y}\end{bmatrix}}{\begin{bmatrix}V_{XX}^{*}&V_{YX}^{*}\\V_{XY}^{*}&V_{YY}^{*}\end{bmatrix}}

where V is partitioned into blocks corresponding to the shape of X and Y.

Using the Eckart–Young theorem, the approximation minimising the norm of the error is such that matrices $U$ and $V$ are unchanged, while the smallest $k$ singular values are replaced with zeroes. That is, we want

[(X+E)\;(Y+F)]=[U_{X}\;U_{Y}]{\begin{bmatrix}\Sigma _{X}&0\\0&0_{k\times k}\end{bmatrix}}{\begin{bmatrix}V_{XX}&V_{XY}\\V_{YX}&V_{YY}\end{bmatrix}}^{*}

so by linearity,

[E\;F]=-[U_{X}\;U_{Y}]{\begin{bmatrix}0_{n\times n}&0\\0&\Sigma _{Y}\end{bmatrix}}{\begin{bmatrix}V_{XX}&V_{XY}\\V_{YX}&V_{YY}\end{bmatrix}}^{*}.

We can then remove blocks from the U and Σ matrices, simplifying to

[E\;F]=-U_{Y}\Sigma _{Y}{\begin{bmatrix}V_{XY}\\V_{YY}\end{bmatrix}}^{*}=-[X\;Y]{\begin{bmatrix}V_{XY}\\V_{YY}\end{bmatrix}}{\begin{bmatrix}V_{XY}\\V_{YY}\end{bmatrix}}^{*}.

This provides E and F so that

[(X+E)\;(Y+F)]{\begin{bmatrix}V_{XY}\\V_{YY}\end{bmatrix}}=0.

Now if $V_{YY}$ is nonsingular, which is not always the case (note that the behavior of TLS when $V_{YY}$ is singular is not well understood yet), we can then right multiply both sides by $-V_{YY}^{-1}$ to bring the bottom block of the right matrix to the negative identity, giving^[6]

[(X+E)\;(Y+F)]{\begin{bmatrix}-V_{XY}V_{YY}^{-1}\\-V_{YY}V_{YY}^{-1}\end{bmatrix}}=[(X+E)\;(Y+F)]{\begin{bmatrix}B\\-I_{k}\end{bmatrix}}=0,

and so

B=-V_{XY}V_{YY}^{-1}.

A naive GNU Octave implementation of this is:

functionB=tls(X, Y)[mn]=size(X);% n is the width of X (X is m by n)Z=[XY];% Z is X augmented with Y.[USV]=svd(Z,0);% find the SVD of Z.VXY=V(1:n,1+n:end);% Take the block of V consisting of the first n rows and the n+1 to last columnVYY=V(1+n:end,1+n:end);% Take the bottom-right block of V.B=-VXY/VYY;end

The way described above of solving the problem, which requires that the matrix $V_{YY}$ is nonsingular, can be slightly extended by the so-called classical TLS algorithm.^[7]

Computation

The standard implementation of classical TLS algorithm is available through Netlib, see also.^[8]^[9] All modern implementations based, for example, on solving a sequence of ordinary least squares problems, approximate the matrix $B$ (denoted $X$ in the literature), as introduced by Van Huffel and Vandewalle. It is worth noting, that this $B$ is, however, not the TLS solution in many cases.^[10]^[11]

Non-linear model

For non-linear systems similar reasoning shows that the normal equations for an iteration cycle can be written as

\mathbf {J^{T}M^{-1}J\Delta {\boldsymbol {\beta }}=J^{T}M^{-1}\Delta y} ,

where $\mathbf {J}$ is the Jacobian matrix.

Geometrical interpretation

When the independent variable is error-free a residual represents the "vertical" distance between the observed data point and the fitted curve (or surface). In total least squares a residual represents the distance between a data point and the fitted curve measured along some direction. In fact, if both variables are measured in the same units and the errors on both variables are the same, then the residual represents the shortest distance between the data point and the fitted curve, that is, the residual vector is perpendicular to the tangent of the curve. For this reason, this type of regression is sometimes called two dimensional Euclidean regression (Stein, 1983)^[12] or orthogonal regression.

Scale invariant methods

A serious difficulty arises if the variables are not measured in the same units. First consider measuring distance between a data point and the line: what are the measurement units for this distance? If we consider measuring distance based on Pythagoras' Theorem then it is clear that we shall be adding quantities measured in different units, which is meaningless. Secondly, if we rescale one of the variables e.g., measure in grams rather than kilograms, then we shall end up with different results (a different line). To avoid these problems it is sometimes suggested that we convert to dimensionless variables—this may be called normalization or standardization. However, there are various ways of doing this, and these lead to fitted models which are not equivalent to each other. One approach is to normalize by known (or estimated) measurement precision thereby minimizing the Mahalanobis distance from the points to the line, providing a maximum-likelihood solution;^{[ citation needed ]} the unknown precisions could be found via analysis of variance.

In short, total least squares does not have the property of units-invariance—i.e. it is not scale invariant. For a meaningful model we require this property to hold. A way forward is to realise that residuals (distances) measured in different units can be combined if multiplication is used instead of addition. Consider fitting a line: for each data point the product of the vertical and horizontal residuals equals twice the area of the triangle formed by the residual lines and the fitted line. We choose the line which minimizes the sum of these areas. Nobel laureate Paul Samuelson proved in 1942 that, in two dimensions, it is the only line expressible solely in terms of the ratios of standard deviations and the correlation coefficient which (1) fits the correct equation when the observations fall on a straight line, (2) exhibits scale invariance, and (3) exhibits invariance under interchange of variables.^[13] This solution has been rediscovered in different disciplines and is variously known as standardised major axis (Ricker 1975, Warton et al., 2006),^[14]^[15] the reduced major axis, the geometric mean functional relationship (Draper and Smith, 1998),^[16]least products regression, diagonal regression, line of organic correlation, and the least areas line (Tofallis, 2002).^[17]

Tofallis (2015, 2023)^[18]^[19] has extended this approach to deal with multiple variables. The calculations are simpler than for total least squares as they only require knowledge of covariances, and can be computed using standard spreadsheet functions.

Notes

↑ An alternative form is $\mathbf {X^{T}WX{\boldsymbol {\Delta }}{\boldsymbol {\beta }}=X^{T}W{\boldsymbol {\Delta }}y}$ , where ${\boldsymbol {\Delta }}{\boldsymbol {\beta }}$ is the parameter shift from some starting estimate of ${\boldsymbol {\beta }}$ and ${\boldsymbol {\Delta }}\mathbf {y}$ is the difference between y and the value calculated using the starting value of ${\boldsymbol {\beta }}$
↑ The notation XB ≈ Y is used here to reflect the notation used in the earlier part of the article. In the computational literature the problem has been more commonly presented as AX ≈ B, i.e. with the letter X used for the n-by-k matrix of unknown regression coefficients.

Related Research Articles

<span class="mw-page-title-main">Multivariate normal distribution</span> Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.

The method of least squares is a parameter estimation method in regression analysis based on minimizing the sum of the squares of the residuals made in the results of each individual equation.

In statistics, the Gauss–Markov theorem states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero. The errors do not need to be normal, nor do they need to be independent and identically distributed. The requirement that the estimator be unbiased cannot be dropped, since biased estimators exist with lower variance. See, for example, the James–Stein estimator, ridge regression, or simply any degenerate estimator.

<span class="mw-page-title-main">Covariance matrix</span> Measure of covariance of components of a random vector

In probability theory and statistics, a covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector.

In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations (iterations).

In statistics, the theory of minimum norm quadratic unbiased estimation (MINQUE) was developed by C. R. Rao. MINQUE is a theory alongside other estimation methods in estimation theory, such as the method of moments or maximum likelihood estimation. Similar to the theory of best linear unbiased estimation, MINQUE is specifically concerned with linear regression models. The method was originally conceived to estimate heteroscedastic error variance in multiple linear regression. MINQUE estimators also provide an alternative to maximum likelihood estimators or restricted maximum likelihood estimators for variance components in mixed effects models. MINQUE estimators are quadratic forms of the response variable and are used to estimate a linear function of the variances.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the input dataset and the output of the (linear) function of the independent variable. Some sources consider OLS to be linear regression.

Weighted least squares (WLS), also known as weighted linear regression, is a generalization of ordinary least squares and linear regression in which knowledge of the unequal variance of observations (heteroscedasticity) is incorporated into the regression. WLS is also a specialization of generalized least squares, when all the off-diagonal entries of the covariance matrix of the errors, are null.

In statistics, simple linear regression (SLR) is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable and finds a linear function that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.

In statistics, generalized least squares (GLS) is a method used to estimate the unknown parameters in a linear regression model. It is used when there is a non-zero amount of correlation between the residuals in the regression model. GLS is employed to improve statistical efficiency and reduce the risk of drawing erroneous inferences, as compared to conventional least squares and weighted least squares methods. It was first described by Alexander Aitken in 1935.

Bayesian linear regression is a type of conditional modeling in which the mean of one variable is described by a linear combination of other variables, with the goal of obtaining the posterior probability of the regression coefficients and ultimately allowing the out-of-sample prediction of the regressandconditional on observed values of the regressors. The simplest and most widely used version of this model is the normal linear model, in which $given is distributed Gaussian. In this model, and under a particular choice of prior probabilities for the parameters—so-called conjugate priors—the posterior can be found analytically. With more arbitrarily chosen priors, the posteriors generally have to be approximated.$

In statistics, Bayesian multivariate linear regression is a Bayesian approach to multivariate linear regression, i.e. linear regression where the predicted outcome is a vector of correlated random variables rather than a single scalar random variable. A more general treatment of this approach can be found in the article MMSE estimator.

In probability theory and statistics, partial correlation measures the degree of association between two random variables, with the effect of a set of controlling random variables removed. When determining the numerical relationship between two variables of interest, using their correlation coefficient will give misleading results if there is another confounding variable that is numerically related to both variables of interest. This misleading information can be avoided by controlling for the confounding variable, which is done by computing the partial correlation coefficient. This is precisely the motivation for including other right-side variables in a multiple regression; but while multiple regression gives unbiased results for the effect size, it does not give a numerical value of a measure of the strength of the relationship between the two variables of interest.

The topic of heteroskedasticity-consistent (HC) standard errors arises in statistics and econometrics in the context of linear regression and time series analysis. These are also known as heteroskedasticity-robust standard errors, Eicker–Huber–White standard errors, to recognize the contributions of Friedhelm Eicker, Peter J. Huber, and Halbert White.

In statistics, the projection matrix $, sometimes also called the influence matrix or hat matrix, maps the vector of response values to the vector of fitted values. It describes the influence each response value has on each fitted value. The diagonal elements of the projection matrix are the leverages, which describe the influence each response value has on the fitted value for that same observation.$

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box–Cox transformed regressors ( $).$

In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). PCR is a form of reduced rank regression.More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.

In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables. That is, high-leverage points have no neighboring points in $space, where is the number of independent variables in a regression model. This makes the fitted model likely to pass close to a high leverage observation. Hence high-leverage points have the potential to cause large changes in the parameter estimates when they are deleted i.e., to be influential points. Although an influential point will typically have high leverage, a high leverage point is not necessarily an influential point. The leverage is typically defined as the diagonal elements of the hat matrix.$

Linear least squares (LLS) is the least squares approximation of linear functions to data. It is a set of formulations for solving statistical problems involved in linear regression, including variants for ordinary (unweighted), weighted, and generalized (correlated) residuals. Numerical methods for linear least squares include inverting the matrix of the normal equations and orthogonal decomposition methods.

Numerical methods for linear least squares entails the numerical analysis of linear least squares problems.

References

↑ I. Markovsky and S. Van Huffel, Overview of total least squares methods. Signal Processing, vol. 87, pp. 2283–2302, 2007. preprint
↑ W.E. Deming, Statistical Adjustment of Data, Wiley, 1943
↑ Gans, Peter (1992). Data Fitting in the Chemical Sciences. Wiley. ISBN 9780471934127 . Retrieved 4 December 2012.
↑ G. H. Golub and C. F. Van Loan, An analysis of the total least squares problem. Numer. Anal., 17, 1980, pp. 883–893.
↑ Golub, Gene H.; Van Loan, Charles F. (1996). Matrix Computations (3rd ed.). The Johns Hopkins University Press. pp 596.
↑ Bjõrck, Ake (1996) Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics. ISBN 978-0898713602 ^{[ page needed ]}
↑ S. Van Huffel and J. Vandewalle (1991) The Total Least Squares Problems: Computational Aspects and Analysis. SIAM Publications, Philadelphia PA.
↑ S. Van Huffel, Documented Fortran 77 programs of the extended classical total least squares algorithm, the partial singular value decomposition algorithm and the partial total least squares algorithm, Internal Report ESAT-KUL 88/1, ESAT Lab., Dept. of Electrical Engineering, Katholieke Universiteit Leuven, 1988.
↑ S. Van Huffel, The extended classical total least squares algorithm, J. Comput. Appl. Math., 25, pp. 111–119, 1989.
↑ M. Plešinger, The Total Least Squares Problem and Reduction of Data in AX ≈ B. Doctoral Thesis, TU of Liberec and Institute of Computer Science, AS CR Prague, 2008. Ph.D. Thesis
↑ I. Hnětynková, M. Plešinger, D. M. Sima, Z. Strakoš, and S. Van Huffel, The total least squares problem in AX ≈ B. A new classification with the relationship to the classical works. SIMAX vol. 32 issue 3 (2011), pp. 748–770.
↑ Stein, Yaakov J. "Two Dimensional Euclidean Regression" (PDF).{{cite journal}}: Cite journal requires |journal= (help)
↑ Samuelson, Paul A. (1942). "A Note on Alternative Regressions". Econometrica. 10 (1): 80–83. doi:10.2307/1907024. JSTOR 1907024.
↑ Ricker, W. E. (1975). "A note concerning Professor Jolicoeur's Comments". Journal of the Fisheries Research Board of Canada. 32 (8): 1494–1498. doi:10.1139/f75-172.
↑ Warton, David I.; Wright, Ian J.; Falster, Daniel S.; Westoby, Mark (2006). "Bivariate line-fitting methods for allometry". Biological Reviews. 81 (2): 259–291. CiteSeerX 10.1.1.461.9154 . doi:10.1017/S1464793106007007. PMID 16573844. S2CID 16462731.
↑ Draper, NR and Smith, H. Applied Regression Analysis, 3rd edition, pp. 92–96. 1998
↑ Tofallis, Chris (2002). "Model Fitting for Multiple Variables by Minimising the Geometric Mean Deviation". In Van Huffel, Sabine; Lemmerling, P. (eds.). Total Least Squares and Errors-in-Variables Modeling: Analysis, Algorithms and Applications. Dordrecht: Kluwer Academic Publ. ISBN 978-1402004766. SSRN 1077322.
↑ Tofallis, Chris (2015). "Fitting Equations to Data with the Perfect Correlation Relationship". SSRN 2707593.
↑ Tofallis, C. (2023). Fitting an Equation to Data Impartially. Mathematics, 11(18), 3957. https://ssrn.com/abstract=4556739 https://doi.org/10.3390/math11183957

Others

I. Hnětynková, M. Plešinger, D. M. Sima, Z. Strakoš, and S. Van Huffel, The total least squares problem in AX ≈ B. A new classification with the relationship to the classical works. SIMAX vol. 32 issue 3 (2011), pp. 748–770. Available as a preprint.
M. Plešinger, The Total Least Squares Problem and Reduction of Data in AX ≈ B. Doctoral Thesis, TU of Liberec and Institute of Computer Science, AS CR Prague, 2008. Ph.D. Thesis
C. C. Paige, Z. Strakoš, Core problems in linear algebraic systems. SIAM J. Matrix Anal. Appl. 27, 2006, pp. 861–875. doi : 10.1137/040616991
S. Van Huffel and P. Lemmerling, Total Least Squares and Errors-in-Variables Modeling: Analysis, Algorithms and Applications. Dordrecht, The Netherlands: Kluwer Academic Publishers, 2002.
S. Jo and S. W. Kim, Consistent normalized least mean square filtering with noisy data matrix. IEEE Trans. Signal Process., vol. 53, no. 6, pp. 2112–2123, Jun. 2005.
R. D. DeGroat and E. M. Dowling, The data least squares problem and channel equalization. IEEE Trans. Signal Process., vol. 41, no. 1, pp. 407–411, Jan. 1993.
S. Van Huffel and J. Vandewalle, The Total Least Squares Problems: Computational Aspects and Analysis. SIAM Publications, Philadelphia PA, 1991. doi : 10.1137/1.9781611971002
T. Abatzoglou and J. Mendel, Constrained total least squares, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’87), Apr. 1987, vol. 12, pp. 1485–1488.
P. de Groen An introduction to total least squares, in Nieuw Archief voor Wiskunde, Vierde serie, deel 14, 1996, pp. 237–253 arxiv.org.
G. H. Golub and C. F. Van Loan, An analysis of the total least squares problem. SIAM J. on Numer. Anal., 17, 1980, pp. 883–893. doi : 10.1137/0717073
Perpendicular Regression Of A Line at MathPages
A. R. Amiri-Simkooei and S. Jazaeri Weighted total least squares formulated by standard least squares theory, in Journal of Geodetic Science, 2 (2): 113–124, 2012 .

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[2] An alternative form is $\mathbf {X^{T}WX{\boldsymbol {\Delta }}{\boldsymbol {\beta }}=X^{T}W{\boldsymbol {\Delta }}y}$ , where ${\boldsymbol {\Delta }}{\boldsymbol {\beta }}$ is the parameter shift from some starting estimate of ${\boldsymbol {\beta }}$ and ${\boldsymbol {\Delta }}\mathbf {y}$ is the difference between y and the value calculated using the starting value of ${\boldsymbol {\beta }}$

[7] The notation XB ≈ Y is used here to reflect the notation used in the earlier part of the article. In the computational literature the problem has been more commonly presented as AX ≈ B, i.e. with the letter X used for the n-by-k matrix of unknown regression coefficients.

[1] I. Markovsky and S. Van Huffel, Overview of total least squares methods. Signal Processing, vol. 87, pp. 2283–2302, 2007. preprint

[3] W.E. Deming, Statistical Adjustment of Data, Wiley, 1943

[4] Gans, Peter (1992). Data Fitting in the Chemical Sciences. Wiley. ISBN 9780471934127 . Retrieved 4 December 2012.

[5] G. H. Golub and C. F. Van Loan, An analysis of the total least squares problem. Numer. Anal., 17, 1980, pp. 883–893.

[6] Golub, Gene H.; Van Loan, Charles F. (1996). Matrix Computations (3rd ed.). The Johns Hopkins University Press. pp 596.

[8] Bjõrck, Ake (1996) Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics. ISBN 978-0898713602 ^{[ page needed ]}

[9] S. Van Huffel and J. Vandewalle (1991) The Total Least Squares Problems: Computational Aspects and Analysis. SIAM Publications, Philadelphia PA.

[10] S. Van Huffel, Documented Fortran 77 programs of the extended classical total least squares algorithm, the partial singular value decomposition algorithm and the partial total least squares algorithm, Internal Report ESAT-KUL 88/1, ESAT Lab., Dept. of Electrical Engineering, Katholieke Universiteit Leuven, 1988.

[11] S. Van Huffel, The extended classical total least squares algorithm, J. Comput. Appl. Math., 25, pp. 111–119, 1989.

[12] M. Plešinger, The Total Least Squares Problem and Reduction of Data in AX ≈ B. Doctoral Thesis, TU of Liberec and Institute of Computer Science, AS CR Prague, 2008. Ph.D. Thesis

[13] I. Hnětynková, M. Plešinger, D. M. Sima, Z. Strakoš, and S. Van Huffel, The total least squares problem in AX ≈ B. A new classification with the relationship to the classical works. SIMAX vol. 32 issue 3 (2011), pp. 748–770.

[14] Stein, Yaakov J. "Two Dimensional Euclidean Regression" (PDF).{{cite journal}}: Cite journal requires |journal= (help)

[15] Samuelson, Paul A. (1942). "A Note on Alternative Regressions". Econometrica. 10 (1): 80–83. doi:10.2307/1907024. JSTOR 1907024.

[16] Ricker, W. E. (1975). "A note concerning Professor Jolicoeur's Comments". Journal of the Fisheries Research Board of Canada. 32 (8): 1494–1498. doi:10.1139/f75-172.

[17] Warton, David I.; Wright, Ian J.; Falster, Daniel S.; Westoby, Mark (2006). "Bivariate line-fitting methods for allometry". Biological Reviews. 81 (2): 259–291. CiteSeerX 10.1.1.461.9154 . doi:10.1017/S1464793106007007. PMID 16573844. S2CID 16462731.

[18] Draper, NR and Smith, H. Applied Regression Analysis, 3rd edition, pp. 92–96. 1998

[19] Tofallis, Chris (2002). "Model Fitting for Multiple Variables by Minimising the Geometric Mean Deviation". In Van Huffel, Sabine; Lemmerling, P. (eds.). Total Least Squares and Errors-in-Variables Modeling: Analysis, Algorithms and Applications. Dordrecht: Kluwer Academic Publ. ISBN 978-1402004766. SSRN 1077322.

[20] Tofallis, Chris (2015). "Fitting Equations to Data with the Perfect Correlation Relationship". SSRN 2707593.

[21] Tofallis, C. (2023). Fitting an Equation to Data Impartially. Mathematics, 11(18), 3957. https://ssrn.com/abstract=4556739 https://doi.org/10.3390/math11183957

[1]

[note 1]

[2]

[3]

[4]

[5]

[note 2]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]