Regularization perspectives on support vector machines

Within mathematical analysis, regularization perspectives on support-vector machines provide a way of interpreting support-vector machines (SVMs) in the context of other regularization-based machine-learning algorithms. SVM algorithms categorize binary data, with the goal of fitting the training-set data in a way that minimizes the average of the hinge-loss function and the L2 norm of the learned weights. This strategy avoids overfitting via Tikhonov regularization in the L2-norm sense, and also corresponds to minimizing the bias and variance of the estimator of the weights. Estimators with lower mean squared error predict better, i.e. generalize better, when given unseen data.
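To make the objective concrete, the following minimal NumPy sketch (an illustration, not part of the original formulation) evaluates the regularized hinge-loss objective described above; the data, the weights w, and the regularization parameter lam are hypothetical placeholders.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Average hinge loss plus L2 penalty: (1/n) sum max(0, 1 - y_i <w, x_i>) + lam * ||w||^2."""
    margins = y * (X @ w)                   # y_i <w, x_i> for each training point
    hinge = np.maximum(0.0, 1.0 - margins)  # hinge loss per example
    return hinge.mean() + lam * np.dot(w, w)

# Hypothetical toy data: 4 points in R^2 with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([0.5, 0.5])
print(svm_objective(w, X, y, lam=0.1))
```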

Specifically, Tikhonov regularization algorithms produce a decision boundary that minimizes the average training-set error and constrain the decision boundary not to be excessively complicated or overfit the training data via an L2-norm penalty on the weights. The training- and test-set errors can be measured without bias and in a fair way using accuracy, precision, AUC-ROC, precision-recall, and other metrics.
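For illustration, these metrics are all available in scikit-learn; the labels, predictions, and decision scores below are hypothetical placeholders.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             roc_auc_score, average_precision_score)

y_true = [1, 1, -1, -1, 1, -1]           # hypothetical test labels
y_pred = [1, -1, -1, -1, 1, 1]           # hypothetical hard predictions
scores = [0.9, 0.4, 0.2, 0.1, 0.8, 0.6]  # hypothetical decision-function scores

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))          # positive class is +1
print(roc_auc_score(y_true, scores))            # AUC-ROC
print(average_precision_score(y_true, scores))  # precision-recall summary
```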

Regularization perspectives on support-vector machines interpret SVM as a special case of Tikhonov regularization, specifically Tikhonov regularization with the hinge loss for a loss function. This provides a theoretical framework with which to analyze SVM algorithms and compare them to other algorithms with the same goals: to generalize without overfitting. SVM was first proposed in 1995 by Corinna Cortes and Vladimir Vapnik, and framed geometrically as a method for finding hyperplanes that can separate multidimensional data into two categories.[1] This traditional geometric interpretation of SVMs provides useful intuition about how SVMs work, but is difficult to relate to other machine-learning techniques for avoiding overfitting, like regularization, early stopping, sparsity and Bayesian inference. However, once it was discovered that SVM is also a special case of Tikhonov regularization, regularization perspectives on SVM provided the theory necessary to fit SVM within a broader class of algorithms.[2][3][4] This has enabled detailed comparisons between SVM and other forms of Tikhonov regularization, and theoretical grounding for why it is beneficial to use SVM's loss function, the hinge loss.[5]

Theoretical background

In the statistical learning theory framework, an algorithm is a strategy for choosing a function $f \colon \mathbf{X} \to \mathbf{Y}$ given a training set $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$ of inputs $x_i$ and their labels $y_i$ (the labels are usually $\pm 1$). Regularization strategies avoid overfitting by choosing a function that fits the data, but is not too complex. Specifically:

$$f = \underset{f \in \mathcal{H}}{\operatorname{argmin}} \left\{ \frac{1}{n} \sum_{i=1}^n V(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2 \right\},$$

where $\mathcal{H}$ is a hypothesis space[6] of functions, $V \colon \mathbf{Y} \times \mathbf{Y} \to \mathbb{R}$ is the loss function, $\|\cdot\|_{\mathcal{H}}$ is a norm on the hypothesis space of functions, and $\lambda \in \mathbb{R}$ is the regularization parameter.[7]

When $\mathcal{H}$ is a reproducing kernel Hilbert space, there exists a kernel function $K \colon \mathbf{X} \times \mathbf{X} \to \mathbb{R}$ that can be written as an $n \times n$ symmetric positive-definite matrix $\mathbf{K}$. By the representer theorem,[8]

$$f(x_i) = \sum_{j=1}^n c_j \mathbf{K}_{ij}, \qquad \|f\|_{\mathcal{H}}^2 = \langle f, f \rangle_{\mathcal{H}} = \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) = c^T \mathbf{K} c.$$
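A minimal sketch of these objects, assuming a Gaussian (RBF) kernel (an assumption made only for this example) and hypothetical data: it builds the kernel matrix $\mathbf{K}$, evaluates the representer expansion $f(x_i) = \sum_j c_j \mathbf{K}_{ij}$, and computes the squared norm $c^T \mathbf{K} c$.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel: K(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

X = np.array([[0.0], [1.0], [2.0]])  # hypothetical training inputs
c = np.array([0.5, -1.0, 0.5])       # hypothetical expansion coefficients

K = rbf_kernel(X, X)                 # n x n symmetric positive-definite matrix
f_train = K @ c                      # f(x_i) = sum_j c_j K_ij
rkhs_norm_sq = c @ K @ c             # ||f||_H^2 = c^T K c
print(f_train, rkhs_norm_sq)
```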

Special properties of the hinge loss

[Figure: the hinge loss and the misclassification (0–1) loss, plotted as functions of the margin $y f(x)$.]

The simplest and most intuitive loss function for categorization is the misclassification loss, or 0–1 loss, which is 0 if $f(x_i) = y_i$ and 1 if $f(x_i) \neq y_i$, i.e. the Heaviside step function on $-y_i f(x_i)$. However, this loss function is not convex, which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0–1 loss. The hinge loss, $V(y_i, f(x_i)) = (1 - y_i f(x_i))_+$, where $(s)_+ = \max(s, 0)$, provides such a convex relaxation. In fact, the hinge loss is the tightest convex upper bound to the 0–1 misclassification loss function,[4] and with infinite data returns the Bayes-optimal solution:[5][9]

$$f_b(x) = \begin{cases} 1, & p(1 \mid x) > p(-1 \mid x), \\ -1, & p(1 \mid x) < p(-1 \mid x). \end{cases}$$
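The upper-bound property is easy to verify numerically. This small sketch (with hypothetical margin values) tabulates both losses and checks that the hinge loss dominates the 0–1 loss everywhere.

```python
import numpy as np

margins = np.linspace(-2.0, 2.0, 9)      # hypothetical values of y * f(x)
zero_one = (margins <= 0).astype(float)  # 0-1 loss: 1 on a misclassification (ties counted as errors)
hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss (1 - y f(x))_+

for m, z, h in zip(margins, zero_one, hinge):
    print(f"yf(x) = {m:+.1f}   0-1 loss = {z:.0f}   hinge = {h:.2f}")
assert np.all(hinge >= zero_one)         # hinge is an upper bound on the 0-1 loss
```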

Derivation

The Tikhonov regularization problem can be shown to be equivalent to traditional formulations of SVM by expressing it in terms of the hinge loss.[10] With the hinge loss

$$V(y_i, f(x_i)) = (1 - y_i f(x_i))_+,$$

where $(s)_+ = \max(s, 0)$, the regularization problem becomes

$$f = \underset{f \in \mathcal{H}}{\operatorname{argmin}} \left\{ \frac{1}{n} \sum_{i=1}^n (1 - y_i f(x_i))_+ + \lambda \|f\|_{\mathcal{H}}^2 \right\}.$$

Multiplying by $1/(2\lambda)$ yields

$$f = \underset{f \in \mathcal{H}}{\operatorname{argmin}} \left\{ C \sum_{i=1}^n (1 - y_i f(x_i))_+ + \frac{1}{2} \|f\|_{\mathcal{H}}^2 \right\},$$

with $C = 1/(2\lambda n)$, which is equivalent to the standard SVM minimization problem.
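One way to sanity-check this equivalence, assuming scikit-learn and hypothetical toy data, is to fit a linear SVM with $C = 1/(2\lambda n)$ and confirm that the standard SVM objective equals the Tikhonov objective up to the constant factor $1/(2\lambda)$. This is an illustrative sketch, not part of the original derivation.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical linearly separable toy data: two Gaussian blobs.
rng = np.random.default_rng(0)
n = 50
X = np.vstack([rng.normal(-2, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
y = np.array([-1] * (n // 2) + [1] * (n // 2))

lam = 0.01
C = 1.0 / (2 * lam * n)  # the mapping from the derivation above
clf = LinearSVC(C=C, loss="hinge", dual=True,
                fit_intercept=False, max_iter=100000).fit(X, y)

w = clf.coef_.ravel()
hinge = np.maximum(0.0, 1.0 - y * (X @ w))
tikhonov_obj = hinge.mean() + lam * np.dot(w, w)   # (1/n) sum hinge + lam ||w||^2
svm_obj = C * hinge.sum() + 0.5 * np.dot(w, w)     # C sum hinge + (1/2) ||w||^2
print(tikhonov_obj, svm_obj * 2 * lam)             # approximately equal
```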

Notes and references

  1. Cortes, Corinna; Vapnik, Vladimir (1995). "Support-Vector Networks". Machine Learning. 20 (3): 273–297. doi:10.1007/BF00994018.
  2. Rosasco, Lorenzo. "Regularized Least-Squares and Support Vector Machines" (PDF).
  3. Rifkin, Ryan (2002). Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning (PDF). MIT (PhD thesis).
  4. Lee, Yoonkyung; Wahba, Grace (2012). "Multicategory Support Vector Machines". Journal of the American Statistical Association. 99 (465): 67–81. CiteSeerX 10.1.1.22.1879. doi:10.1198/016214504000000098. S2CID 261035640.
  5. Rosasco, L.; De Vito, E.; Caponnetto, A.; Piana, M.; Verri, A. (May 2004). "Are Loss Functions All the Same?". Neural Computation. 16 (5): 1063–1076. CiteSeerX 10.1.1.109.6786. doi:10.1162/089976604773135104. PMID 15070510. S2CID 11845688.
  6. A hypothesis space is the set of functions used to model the data in a machine-learning problem. Each function corresponds to a hypothesis about the structure of the data. Typically the functions in a hypothesis space form a Hilbert space of functions with norm formed from the loss function.
  7. For insight on choosing the parameter, see, e.g., Wahba, Grace; Wang, Yonghua (1990). "When is the optimal regularization parameter insensitive to the choice of the loss function". Communications in Statistics – Theory and Methods. 19 (5): 1685–1700. doi:10.1080/03610929008830285.
  8. See Schölkopf, Bernhard; Herbrich, Ralf; Smola, Alex (2001). "A Generalized Representer Theorem". Computational Learning Theory. Lecture Notes in Computer Science. Vol. 2111. pp. 416–426. CiteSeerX 10.1.1.42.8617. doi:10.1007/3-540-44581-1_27. ISBN 978-3-540-42343-0.
  9. Lin, Yi (July 2002). "Support Vector Machines and the Bayes Rule in Classification" (PDF). Data Mining and Knowledge Discovery. 6 (3): 259–275. doi:10.1023/A:1015469627679. S2CID 24759201.
  10. For a detailed derivation, see Rifkin, Ryan (2002). Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning (PDF). MIT (PhD thesis).

Related Research Articles

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik and colleagues, SVMs are one of the most studied models, being based on statistical learning frameworks or VC theory proposed by Vapnik and Chervonenkis (1974).

In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class it belongs to. A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use.

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point, this improves the learner's performance on data outside of the training set. Past that point, however, improving the learner's fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.
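A minimal sketch of the idea, using gradient descent on a least-squares model with synthetic data and a hypothetical "patience" stopping rule: training halts once the validation error has not improved for a fixed number of iterations.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical noisy linear data, split into train and validation sets.
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=200)
Xtr, ytr, Xva, yva = X[:150], y[:150], X[150:], y[150:]

w = np.zeros(5)
lr, patience, best, since_best = 0.01, 10, np.inf, 0
for step in range(10000):
    grad = 2 * Xtr.T @ (Xtr @ w - ytr) / len(ytr)  # gradient of the training MSE
    w -= lr * grad
    val = np.mean((Xva @ w - yva) ** 2)            # validation error this iteration
    if val < best:
        best, since_best = val, 0
    else:
        since_best += 1
    if since_best >= patience:                     # stop before overfitting sets in
        break
print(step, best)
```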

Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately. Early versions of MTL were called "hints".

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, and bioinformatics.

Regularization (mathematics)

In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is a process that converts the answer to a problem into a simpler one. It is often used to obtain results for ill-posed problems or to prevent overfitting.

Kernel method

In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). These methods involve using linear classifiers to solve nonlinear problems. The general task of pattern analysis is to find and study general types of relations in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature-vector representations via a user-specified feature map; in contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over all pairs of data points computed using inner products. The feature map in kernel machines may be infinite-dimensional, but only a finite-dimensional matrix of user input is required, according to the representer theorem. Kernel machines are slow to compute for datasets larger than a couple of thousand examples without parallel processing.
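As a hedged illustration on hypothetical data, a kernel SVM needs only a user-specified kernel to solve a problem that no linear classifier can separate:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Hypothetical nonlinear problem: the label is the XOR of the coordinate signs.
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] * X[:, 1])

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)  # only the kernel is specified
print(clf.score(X, y))                        # a linear kernel would fail here
```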

Augmented Lagrangian methods are a certain class of algorithms for solving constrained optimization problems. They have similarities to penalty methods in that they replace a constrained optimization problem by a series of unconstrained problems and add a penalty term to the objective, but the augmented Lagrangian method adds yet another term designed to mimic a Lagrange multiplier. The augmented Lagrangian is related to, but not identical with, the method of Lagrange multipliers.

Hinge loss

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).

In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods.
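For instance, scikit-learn's ElasticNet exposes this combination through its l1_ratio parameter; the data and parameter values in this sketch are hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]  # sparse hypothetical ground truth
y = X @ w_true + 0.1 * rng.normal(size=100)

# alpha scales the total penalty; l1_ratio mixes the L1 (lasso) and L2 (ridge) parts.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.count_nonzero(model.coef_))  # the L1 part drives many coefficients to zero
```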

Within Bayesian statistics for machine learning, kernel methods arise from the assumption of an inner product space or similarity structure on inputs. For some such methods, such as support vector machines (SVMs), the original formulation and its regularization were not Bayesian in nature. It is helpful to understand them from a Bayesian perspective. Because the kernels are not necessarily positive semidefinite, the underlying structure may not be inner product spaces, but instead more general reproducing kernel Hilbert spaces. In Bayesian probability kernel methods are a key component of Gaussian processes, where the kernel function is known as the covariance function. Kernel methods have traditionally been used in supervised learning problems where the input space is usually a space of vectors while the output space is a space of scalars. More recently these methods have been extended to problems that deal with multiple outputs such as in multi-task learning.

In statistical learning theory, a representer theorem is any of several related results stating that a minimizer of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of kernel products evaluated on the input points in the training-set data.

Spectral regularization is any of a class of regularization techniques used in machine learning to control the impact of noise and prevent overfitting. Spectral regularization can be used in a broad range of applications, from deblurring images to classifying emails into a spam folder and a non-spam folder. For instance, in the email classification example, spectral regularization can be used to reduce the impact of noise and prevent overfitting when a machine learning system is being trained on a labeled set of emails to learn how to tell a spam and a non-spam email apart.

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in $\mathbb{R}^d$, discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song, Arthur Gretton, and Bernhard Schölkopf.

Kernel methods are a well-established tool to analyze the relationship between input data and the corresponding output of a function. Kernels encapsulate the properties of functions in a computationally efficient way and allow algorithms to easily swap functions of varying complexity.

Loss functions for classification

In machine learning and mathematical optimization, loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems. Given $\mathcal{X}$ as the space of all possible inputs, and $\mathcal{Y} = \{-1, 1\}$ as the set of labels, a typical goal of classification algorithms is to find a function $f \colon \mathcal{X} \to \mathcal{Y}$ which best predicts a label $y$ for a given input $x$. However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same $x$ to generate different $y$. As a result, the goal of the learning problem is to minimize expected loss, defined as

$$I[f] = \int_{\mathcal{X} \times \mathcal{Y}} V(f(x), y)\, p(x, y)\, dx\, dy.$$

In the field of statistical learning theory, matrix regularization generalizes notions of vector regularization to cases where the object to be learned is a matrix. The purpose of regularization is to enforce conditions, for example sparsity or smoothness, that can produce stable predictive functions. For example, in the more common vector framework, Tikhonov regularization optimizes over

$$\min_{x} \|Ax - y\|^2 + \lambda \|x\|^2$$

to find a vector $x$ that is a stable solution to the regression problem.

Manifold regularization

In machine learning, manifold regularization is a technique for using the shape of a dataset to constrain the functions that should be learned on that dataset. In many machine learning problems, the data to be learned do not cover the entire input space. For example, a facial recognition system may not need to classify any possible image, but only the subset of images that contain faces. The technique of manifold learning assumes that the relevant subset of data comes from a manifold, a mathematical structure with useful properties. The technique also assumes that the function to be learned is smooth: data with different labels are not likely to be close together, and so the labeling function should not change quickly in areas where there are likely to be many data points. Because of this assumption, a manifold regularization algorithm can use unlabeled data to inform where the learned function is allowed to change quickly and where it is not, using an extension of the technique of Tikhonov regularization. Manifold regularization algorithms can extend supervised learning algorithms in semi-supervised learning and transductive learning settings, where unlabeled data are available. The technique has been used for applications including medical imaging, geographical imaging, and object recognition.

Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.
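A minimal sketch, assuming the kernel formulation used earlier in this article and hypothetical data: for the squared loss, the regularized coefficients have the closed form $c = (\mathbf{K} + \lambda n I)^{-1} y$.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel: K(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, (30, 1))                  # hypothetical inputs
y = np.sin(3 * X).ravel() + 0.1 * rng.normal(size=30)

lam, n = 0.1, len(X)
K = rbf_kernel(X, X)
c = np.linalg.solve(K + lam * n * np.eye(n), y)  # closed-form RLS coefficients
print(np.mean((K @ c - y) ** 2))                 # training MSE of the fitted function
```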

Weak supervision is a paradigm in machine learning, the relevance and notability of which increased with the advent of large language models due to the large amount of data required to train them. It is characterized by using a combination of a small amount of human-labeled data, followed by a large amount of unlabeled data. In other words, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled or imprecisely labeled. Intuitively, the unlabeled data can be seen as an exam, and the labeled data as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam. Technically, it could be viewed as performing clustering and then labeling the clusters with the labeled data, pushing the decision boundary away from high-density regions, or learning an underlying one-dimensional manifold where the data reside.