Matrix regularization

Last updated January 05, 2024

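In the field of statistical learning theory, matrix regularization generalizes notions of vector regularization to cases where the object to be learned is a matrix. The purpose of regularization is to enforce conditions, for example sparsity or smoothness, that can produce stable predictive functions. For example, in the more common vector framework, Tikhonov regularization optimizes over a coefficient vector $w$ by minimizing $\|Xw-y\|^{2}+\lambda \|w\|^{2}$, a least-squares error plus a squared-norm penalty. Matrix regularization extends penalties of this kind from vector norms to matrix norms.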

Basic definition

Consider a matrix $W$ to be learned from a set of examples, $S=(X_{i}^{t},y_{i}^{t})$, where $i$ goes from $1$ to $n$, and $t$ goes from $1$ to $T$. Let each input matrix $X_{i}^{t}$ be in $\mathbb {R} ^{D\times T}$, and let $W$ be of size $D\times T$. A general model for the output $y$ can be posed as

y_{i}^{t}=\langle W,X_{i}^{t}\rangle _{F},

where the inner product is the Frobenius inner product. For different applications the matrices $X_{i}$ will have different forms,^[1] but for each of these the optimization problem to infer $W$ can be written as

\min _{W\in {\mathcal {H}}}E(W)+R(W),

where $E$ defines the empirical error for a given $W$, and $R(W)$ is a matrix regularization penalty. The function $R(W)$ is typically chosen to be convex and is often selected to enforce sparsity (using $\ell ^{1}$-norms) and/or smoothness (using $\ell ^{2}$-norms). Finally, $W$ is in the space of matrices ${\mathcal {H}}$ with Frobenius inner product $\langle \cdot ,\cdot \rangle _{F}$.

General applications

Matrix completion

In the problem of matrix completion, the matrix $X_{i}^{t}$ takes the form

X_{i}^{t}=e_{t}\otimes e_{i}',

where $(e_{t})_{t}$ and $(e_{i}')_{i}$ are the canonical bases of $\mathbb {R} ^{T}$ and $\mathbb {R} ^{D}$, respectively. In this case the role of the Frobenius inner product is to select individual elements $w_{i}^{t}$ from the matrix $W$. Thus, the output $y$ is a sampling of entries from the matrix $W$.
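
The selection property can be checked numerically. The following is a minimal sketch, assuming NumPy, zero-based indices, and that the basis vectors are arranged so the sampling matrix has the same $D\times T$ shape as $W$; the helper names `sampling_matrix` and `frobenius_inner` are hypothetical and not part of the source.

```python
import numpy as np

D, T = 3, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((D, T))  # stand-in for the matrix to be learned


def sampling_matrix(i, t, D, T):
    """D x T matrix with a single 1 at row i, column t (zero-based)."""
    e_i = np.zeros(D)
    e_i[i] = 1.0
    e_t = np.zeros(T)
    e_t[t] = 1.0
    # Outer product of canonical basis vectors, arranged to match W's shape.
    return np.outer(e_i, e_t)


def frobenius_inner(A, B):
    """Frobenius inner product: sum of entrywise products."""
    return float(np.sum(A * B))


i, t = 1, 2
y = frobenius_inner(W, sampling_matrix(i, t, D, T))
# y is exactly the (i, t) entry of W.
```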

The problem of reconstructing $W$ from a small set of sampled entries is possible only under certain restrictions on the matrix, and these restrictions can be enforced by a regularization function. For example, it might be assumed that $W$ is low-rank, in which case the regularization penalty can take the form of a nuclear norm.^[2]

R(W)=\lambda \|W\|_{*}=\lambda \sum |\sigma _{i}|,

where $\sigma _{i}$, with $i$ from $1$ to $\min(D,T)$, are the singular values of $W$.
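
As an illustration, the nuclear norm can be computed directly from the singular values. This is a small sketch assuming NumPy; the function name `nuclear_norm` and the rank-one test matrix are chosen here for illustration only.

```python
import numpy as np


def nuclear_norm(W):
    """Sum of the singular values of W."""
    sigma = np.linalg.svd(W, compute_uv=False)
    return float(np.sum(sigma))


# A rank-one matrix u v^T has a single nonzero singular value, ||u|| * ||v||.
u = np.array([3.0, 4.0])       # Euclidean norm 5
v = np.array([1.0, 2.0, 2.0])  # Euclidean norm 3
W = np.outer(u, v)

value = nuclear_norm(W)  # equals 5 * 3 up to rounding
```

For a fixed Frobenius norm, matrices whose energy is concentrated in few singular values have a smaller nuclear norm, which is why the penalty favors low-rank solutions.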

Multivariate regression

Models used in multivariate regression are parameterized by a matrix of coefficients. In the Frobenius inner product above, each matrix $X$ is

X_{i}^{t}=e_{t}\otimes x_{i}\,

such that the output of the inner product is the dot product of one row of the input with one column of the coefficient matrix. The familiar form of such models is

Y=XW+b\,

Many of the vector norms used in single-variable regression can be extended to the multivariate case. One example is the squared Frobenius norm, which can be viewed as an $\ell ^{2}$-norm acting either entrywise or on the singular values of the matrix:

R(W)=\lambda \|W\|_{F}^{2}=\lambda \sum \sum |w_{ij}|^{2}=\lambda \operatorname {Tr} (W^{*}W)=\lambda \sum \sigma _{i}^{2}.
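
The chain of equalities above can be verified numerically. The sketch below assumes NumPy and a real-valued $W$, so that $W^{*}$ reduces to the transpose; the random test matrix is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))

entrywise = np.sum(np.abs(W) ** 2)           # sum of squared entries
trace_form = np.trace(W.T @ W)               # Tr(W^* W) for real W
sigma = np.linalg.svd(W, compute_uv=False)
spectral = np.sum(sigma ** 2)                # sum of squared singular values
```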

In the multivariate case the effect of regularizing with the Frobenius norm is the same as in the vector case: very complex models will have larger norms and thus will be penalized more.

Multi-task learning

The setup for multi-task learning is almost the same as the setup for multivariate regression. The primary difference is that the input variables are also indexed by task (columns of $Y$). The representation with the Frobenius inner product is then

X_{i}^{t}=e_{t}\otimes x_{i}^{t}.

The role of matrix regularization in this setting can be the same as in multivariate regression, but matrix norms can also be used to couple learning problems across tasks. In particular, note that for the optimization problem

\min _{W}\|XW-Y\|_{2}^{2}+\lambda \|W\|_{2}^{2}

the solutions corresponding to each column of $Y$ are decoupled. That is, the same solution can be found by solving the joint problem, or by solving an isolated regression problem for each column. The problems can be coupled by adding an additional regularization penalty on the covariance of solutions
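
The decoupling can be demonstrated with the closed-form ridge solution. This is a sketch assuming NumPy and reading both norms in the objective as Frobenius norms; the helper `ridge` and the random data are illustrative.

```python
import numpy as np


def ridge(X, Y, lam):
    """Closed-form minimizer of ||XW - Y||^2 + lam * ||W||^2 (Frobenius norms)."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)


rng = np.random.default_rng(0)
n, D, T = 20, 5, 3
X = rng.standard_normal((n, D))
Y = rng.standard_normal((n, T))
lam = 0.1

# Joint problem: all T output columns solved at once.
W_joint = ridge(X, Y, lam)

# Isolated problems: one ridge regression per column of Y.
W_cols = np.column_stack([ridge(X, Y[:, t], lam) for t in range(T)])
```

Both routes give the same coefficient matrix, because the penalty and the error each split into a sum over columns.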

\min _{W,\Omega }\|XW-Y\|_{2}^{2}+\lambda _{1}\|W\|_{2}^{2}+\lambda _{2}\operatorname {Tr} (W\Omega ^{-1}W^{T})

where $\Omega$ models the relationship between tasks. This scheme can be used to both enforce similarity of solutions across tasks, and to learn the specific structure of task similarity by alternating between optimizations of $W$ and $\Omega$.^[3] When the relationship between tasks is known to lie on a graph, the Laplacian matrix of the graph can be used to couple the learning problems.
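
One way to see how a graph Laplacian couples tasks is sketched below. It assumes NumPy, a $D\times T$ matrix $W$ with one column per task, an unweighted task graph, and a penalty of the form $\operatorname {Tr} (WLW^{T})$; this specific form and the chain graph are assumptions made for illustration, not taken from the source.

```python
import numpy as np

# Hypothetical task graph: a chain over T = 3 tasks (0-1 and 1-2 are related).
edges = [(0, 1), (1, 2)]
T = 3
A = np.zeros((T, T))
for s, t in edges:
    A[s, t] = A[t, s] = 1.0
L = np.diag(A.sum(axis=1)) - A  # graph Laplacian: degree matrix minus adjacency

rng = np.random.default_rng(0)
W = rng.standard_normal((4, T))  # one column of coefficients per task

# Trace form of the penalty.
trace_penalty = np.trace(W @ L @ W.T)

# Equivalent form: squared distance between solutions of connected tasks.
pairwise_penalty = sum(np.sum((W[:, s] - W[:, t]) ** 2) for s, t in edges)
```

The two quantities agree, so under these assumptions the penalty is small exactly when tasks joined by an edge have similar coefficient vectors.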

Spectral regularization

Regularization by spectral filtering has been used to find stable solutions to problems such as those discussed above by addressing ill-posed matrix inversions (see for example Filter function for Tikhonov regularization). In many cases the regularization function acts on the input (or kernel) to ensure a bounded inverse by eliminating small singular values, but it can also be useful to have spectral norms that act on the matrix that is to be learned.

There are a number of matrix norms that act on the singular values of the matrix. Frequently used examples include the Schatten p-norms, with p = 1 or 2. For example, matrix regularization with a Schatten 1-norm, also called the nuclear norm, can be used to enforce sparsity in the spectrum of a matrix. This has been used in the context of matrix completion when the matrix in question is believed to have a restricted rank.^[2] In this case the optimization problem becomes:

\min \|W\|_{*}

subject to

W_{ij}=Y_{ij}.
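
To illustrate how the nuclear norm produces sparsity in the spectrum, the sketch below applies soft thresholding to the singular values, which is the proximal operator of $\lambda \|W\|_{*}$. It assumes NumPy; the function name is hypothetical. This is a building block that iterative solvers may use, not a solver for the constrained problem above.

```python
import numpy as np


def singular_value_threshold(W, lam):
    """Shrink every singular value of W by lam, clipping at zero."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)
    return (U * s_shrunk) @ Vt  # rescale columns of U, then recombine


# Diagonal example: the singular values are 5, 2 and 0.5.
W = np.diag([5.0, 2.0, 0.5])
W_shrunk = singular_value_threshold(W, lam=1.0)
# The smallest singular value falls below the threshold and is set to zero,
# so the result has rank 2.
```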

Spectral regularization is also used to enforce a reduced-rank coefficient matrix in multivariate regression.^[4] In this setting, a reduced-rank coefficient matrix can be found by keeping just the top $n$ singular values, but this can be extended to keep any reduced set of singular values and vectors.
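
Keeping only the leading singular values can be sketched as a truncated SVD. The example assumes NumPy and uses `k` for the number of retained singular values; it shows the truncation step on an arbitrary matrix and is not a full reduced-rank regression estimator.

```python
import numpy as np


def truncate_rank(W, k):
    """Keep the k largest singular values of W and their singular vectors."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))  # stand-in for a full-rank coefficient matrix
W_k = truncate_rank(W, k=2)
```

Keeping a different subset of singular values only changes which indices are selected from `s`, `U` and `Vt`.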

Structured sparsity

Sparse optimization has become the focus of much research interest as a way to find solutions that depend on a small number of variables (see e.g. the Lasso method). In principle, entry-wise sparsity can be enforced by penalizing the entry-wise $\ell ^{0}$-norm of the matrix, but the $\ell ^{0}$-norm is not convex. In practice this can be implemented by convex relaxation to the $\ell ^{1}$-norm. While entry-wise regularization with an $\ell ^{1}$-norm will find solutions with a small number of nonzero elements, applying an $\ell ^{1}$-norm to different groups of variables can enforce structure in the sparsity of solutions.^[5]

The most straightforward example of structured sparsity uses the $\ell _{p,q}$ norm with $p=2$ and $q=1$:

\|W\|_{2,1}=\sum \|w_{i}\|_{2}.

For example, the $\ell _{2,1}$ norm is used in multi-task learning to group features across tasks, such that all the elements in a given row of the coefficient matrix can be forced to zero as a group.^[6] The grouping effect is achieved by taking the $\ell ^{2}$-norm of each row, and then taking the total penalty to be the sum of these row-wise norms. This regularization results in rows that will tend to be all zeros, or dense. The same type of regularization can be used to enforce sparsity column-wise by taking the $\ell ^{2}$-norms of each column.
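
The row-wise construction can be written in a few lines. This is a minimal sketch assuming NumPy; the function name `l21_norm` and the example matrix are illustrative.

```python
import numpy as np


def l21_norm(W):
    """Sum over rows of the Euclidean norm of each row."""
    return float(np.sum(np.linalg.norm(W, axis=1)))


W = np.array([
    [3.0, 4.0],    # row norm 5
    [0.0, 0.0],    # an all-zero row contributes nothing
    [5.0, 12.0],   # row norm 13
])

row_penalty = l21_norm(W)       # 5 + 0 + 13
column_penalty = l21_norm(W.T)  # the same function groups by column of W
```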

More generally, the $\ell _{2,1}$ norm can be applied to arbitrary groups of variables:

R(W)=\lambda \sum _{g}^{G}{\sqrt {\sum _{j}^{|G_{g}|}|w_{g}^{j}|^{2}}}=\lambda \sum _{g}^{G}\|w_{g}\|_{g}

where the index $g$ is across groups of variables, and $|G_{g}|$ indicates the cardinality of group $g$.

Algorithms for solving these group sparsity problems extend the better-known Lasso and group Lasso methods, for example by allowing overlapping groups, and have been implemented via matching pursuit^[7] and proximal gradient methods.^[8] By writing the proximal gradient with respect to a given coefficient, $w_{g}^{i}$, it can be seen that this norm enforces a group-wise soft threshold^[1]

\operatorname {prox} _{\lambda ,R_{g}}(w_{g})^{i}=\left(w_{g}^{i}-\lambda {\frac {w_{g}^{i}}{\|w_{g}\|_{g}}}\right)\mathbf {1} _{\|w_{g}\|_{g}\geq \lambda }.

where $\mathbf {1} _{\|w_{g}\|_{g}\geq \lambda }$ is the indicator function for group norms $\geq \lambda$.
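
The group-wise soft threshold can be sketched as follows, assuming NumPy, non-overlapping groups, and rows of the matrix as the groups; the function names are hypothetical.

```python
import numpy as np


def group_soft_threshold(w_g, lam):
    """Shrink the whole group toward zero; zero it out if its norm is at most lam."""
    norm = np.linalg.norm(w_g)
    if norm <= lam:
        # At norm == lam the formula already gives zero; this branch also
        # avoids dividing by zero for an all-zero group.
        return np.zeros_like(w_g)
    return w_g - lam * w_g / norm


def prox_l21_rows(W, lam):
    """Apply the group soft threshold to each row of W."""
    return np.vstack([group_soft_threshold(row, lam) for row in W])


W = np.array([
    [3.0, 4.0],   # norm 5: shrunk by the factor 1 - 1/5
    [0.3, 0.4],   # norm 0.5: below the threshold, removed as a group
])
W_prox = prox_l21_rows(W, lam=1.0)
```

Each group is either scaled down as a unit or set to zero entirely, which matches the all-zero-or-dense behavior described for rows.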

Thus, using $\ell _{2,1}$ norms it is straightforward to enforce structure in the sparsity of a matrix either row-wise, column-wise, or in arbitrary blocks. By enforcing group norms on blocks in multivariate or multi-task regression, for example, it is possible to find groups of input and output variables, such that defined subsets of output variables (columns in the matrix $Y$) will depend on the same sparse set of input variables.

Multiple kernel selection

The ideas of structured sparsity and feature selection can be extended to the nonparametric case of multiple kernel learning.^[9] This can be useful when there are multiple types of input data (color and texture, for example) with different appropriate kernels for each, or when the appropriate kernel is unknown. If there are two kernels, for example, with feature maps $A$ and $B$ that lie in corresponding reproducing kernel Hilbert spaces ${\mathcal {H_{A}}},{\mathcal {H_{B}}}$, then a larger space, ${\mathcal {H_{D}}}$, can be created as the sum of two spaces:

{\mathcal {H_{D}}}:f=h+h';h\in {\mathcal {H_{A}}},h'\in {\mathcal {H_{B}}}

assuming linear independence in $A$ and $B$. In this case the $\ell _{2,1}$-norm is again the sum of norms:

\|f\|_{{\mathcal {H_{D}}},1}=\|h\|_{\mathcal {H_{A}}}+\|h'\|_{\mathcal {H_{B}}}

Thus, by choosing a matrix regularization function as this type of norm, it is possible to find a solution that is sparse in terms of which kernels are used, but dense in the coefficient of each used kernel. Multiple kernel learning can also be used as a form of nonlinear variable selection, or as a model aggregation technique (e.g. by taking the sum of squared norms and relaxing sparsity constraints). For example, each kernel can be taken to be the Gaussian kernel with a different width.
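
The Gaussian-kernel example can be made concrete with a small sketch. It assumes NumPy and a nonnegative weighted sum of Gram matrices as the combined kernel; the widths and weights are hypothetical values chosen by hand, whereas in multiple kernel learning the weights would be learned.

```python
import numpy as np


def gaussian_gram(X, width):
    """Gram matrix of the Gaussian kernel exp(-||x - x'||^2 / (2 * width^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * width ** 2))


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))  # five sample points in two dimensions

widths = [0.5, 1.0, 2.0]
grams = [gaussian_gram(X, w) for w in widths]

# A zero weight drops a kernel entirely (sparse in which kernels are used);
# each remaining kernel still contributes a dense Gram matrix.
weights = np.array([0.0, 1.0, 0.5])
K = sum(w * G for w, G in zip(weights, grams))
```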

References

[1] Rosasco, Lorenzo; Poggio, Tomaso (December 2014). "A Regularization Tour of Machine Learning". MIT-9.520 Lectures Notes (Manuscript).
[2] Candès, Emmanuel J.; Recht, Benjamin (2009). "Exact Matrix Completion via Convex Optimization". Foundations of Computational Mathematics. 9 (6): 717–772. doi:10.1007/s10208-009-9045-5.
[3] Zhang; Yeung (2012). "A Convex Formulation for Learning Task Relationships in Multi-Task Learning". Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI2010). arXiv:1203.3536. Bibcode:2012arXiv1203.3536Z.
[4] Izenman, Alan J. (1975). "Reduced Rank Regression for the Multivariate Linear Model". Journal of Multivariate Analysis. 5 (2): 248–264. doi:10.1016/0047-259X(75)90042-1.
[5] Kakade; Shalev-Shwartz; Tewari (2012). "Regularization Techniques for Learning with Matrices". Journal of Machine Learning Research. 13: 1865–1890.
[6] Argyriou, A.; Evgeniou, T.; Pontil, M. (2008). "Convex multi-task feature learning". Machine Learning. 73 (3): 243–272. doi:10.1007/s10994-007-5040-8.
[7] Huang; Zhang; Metaxas (2011). "Learning with Structured Sparsity". Journal of Machine Learning Research. 12: 3371–3412.
[8] Chen, Xi; et al. (2012). "Smoothing Proximal Gradient Method for General Structured Sparse Regression". Annals of Applied Statistics. 6 (2): 719–752. arXiv:1005.4717. doi:10.1214/11-AOAS514.
[9] Sonnenburg; Ratsch; Schafer; Scholkopf (2006). "Large Scale Multiple Kernel Learning". Journal of Machine Learning Research. 7: 1531–1565.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
