Sequential minimal optimization

Class: Optimization algorithm for training support vector machines
Worst-case performance: O(n³)

Last updated July 02, 2023

Sequential minimal optimization (SMO) is an algorithm for solving the quadratic programming (QP) problem that arises during the training of support-vector machines (SVMs). It was invented by John Platt in 1998 at Microsoft Research.^[1] SMO is widely used for training support-vector machines and is implemented by the popular LIBSVM tool.^[2]^[3] The publication of the SMO algorithm in 1998 generated considerable excitement in the SVM community, as previously available methods for SVM training were much more complex and required expensive third-party QP solvers.^[4]

Optimization problem

Consider a binary classification problem with a dataset (x₁, y₁), ..., (x_n, y_n), where x_i is an input vector and y_i ∈ {-1, +1} is a binary label corresponding to it. A soft-margin support vector machine is trained by solving a quadratic programming problem, which is expressed in the dual form as follows:

$$\max_{\alpha}\sum_{i=1}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}y_{i}y_{j}K(x_{i},x_{j})\alpha_{i}\alpha_{j},$$

subject to:

$$0\leq \alpha_{i}\leq C,\quad \text{for } i=1,2,\ldots,n,$$

$$\sum_{i=1}^{n}y_{i}\alpha_{i}=0$$

where C is an SVM hyperparameter and K(x_i, x_j) is the kernel function, both supplied by the user; and the variables $\alpha _{i}$ are Lagrange multipliers.
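The dual objective and its constraints can be evaluated directly for a candidate vector of multipliers. The following is a minimal sketch, assuming NumPy and a linear kernel chosen purely for illustration; the function names `linear_kernel_matrix`, `dual_objective` and `is_feasible` are hypothetical helpers, not part of any library.

```python
import numpy as np

def linear_kernel_matrix(X):
    """Gram matrix K[i, j] = <x_i, x_j> (linear kernel, an illustrative choice)."""
    return X @ X.T

def dual_objective(alpha, y, K):
    """sum_i alpha_i - 1/2 * sum_ij y_i y_j K_ij alpha_i alpha_j."""
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v

def is_feasible(alpha, y, C, tol=1e-9):
    """Check 0 <= alpha_i <= C for all i and sum_i y_i alpha_i = 0."""
    in_box = np.all(alpha >= -tol) and np.all(alpha <= C + tol)
    balanced = abs(alpha @ y) <= tol
    return bool(in_box and balanced)

# Two points on opposite sides of the origin, one per class.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
K = linear_kernel_matrix(X)
alpha = np.array([0.5, 0.5])
print(dual_objective(alpha, y, K), is_feasible(alpha, y, C=1.0))
```

Precomputing the full Gram matrix is only practical for small n; it is used here to keep the correspondence with the formula obvious.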

Algorithm

SMO is an iterative algorithm for solving the optimization problem described above. SMO breaks this problem into a series of smallest possible sub-problems, which are then solved analytically. Because of the linear equality constraint involving the Lagrange multipliers $\alpha_{i}$, the smallest possible problem involves two such multipliers. Then, for any two multipliers $\alpha_{1}$ and $\alpha_{2}$, the constraints are reduced to:

$$0\leq \alpha_{1},\alpha_{2}\leq C,$$

$$y_{1}\alpha_{1}+y_{2}\alpha_{2}=k,$$

and this reduced problem can be solved analytically: one needs to find the optimum of a one-dimensional quadratic function over a line segment. Here $k$ is the negative of the sum over the remaining terms in the equality constraint, which is fixed in each iteration.
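For reference, the closed-form update for the pair can be written out as follows. This is a sketch following the form given in Platt's paper,^[1] and it assumes $\eta > 0$. The symbols $f$, $b$, $E_i$, $\eta$, $L$ and $H$ are not defined in the text above and are introduced here only for this derivation: $f$ is the current decision function with bias $b$, $E_i$ is the prediction error on example $i$, and $[L, H]$ is the segment of feasible values for $\alpha_2$.

```latex
% Current decision function and prediction errors
f(x) = \sum_{j=1}^{n} \alpha_j y_j K(x_j, x) + b, \qquad E_i = f(x_i) - y_i

% Curvature of the objective along the constraint line
\eta = K(x_1, x_1) + K(x_2, x_2) - 2 K(x_1, x_2)

% Unconstrained optimum for the second multiplier (b cancels in E_1 - E_2)
\alpha_2^{\text{new}} = \alpha_2 + \frac{y_2 (E_1 - E_2)}{\eta}

% Ends of the feasible segment implied by the box and equality constraints
(L, H) =
\begin{cases}
\bigl(\max(0, \alpha_2 - \alpha_1),\ \min(C, C + \alpha_2 - \alpha_1)\bigr) & \text{if } y_1 \neq y_2, \\
\bigl(\max(0, \alpha_1 + \alpha_2 - C),\ \min(C, \alpha_1 + \alpha_2)\bigr) & \text{if } y_1 = y_2
\end{cases}

% Clip to the segment, then recover the first multiplier
\alpha_2^{\text{clipped}} = \min\bigl(H, \max(L, \alpha_2^{\text{new}})\bigr), \qquad
\alpha_1^{\text{new}} = \alpha_1 + y_1 y_2 \bigl(\alpha_2 - \alpha_2^{\text{clipped}}\bigr)
```

Substituting the last line into $y_1\alpha_1 + y_2\alpha_2$ and using $y_1^2 = 1$ shows that the value $k$ is unchanged by the update.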

The algorithm proceeds as follows:

1. Find a Lagrange multiplier $\alpha_{1}$ that violates the Karush–Kuhn–Tucker (KKT) conditions for the optimization problem.
2. Pick a second multiplier $\alpha_{2}$ and optimize the pair $(\alpha_{1},\alpha_{2})$.
3. Repeat steps 1 and 2 until convergence.
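The three steps above can be sketched in code as follows. This is a simplified variant, not Platt's full algorithm: the second multiplier is picked uniformly at random rather than by a heuristic, the kernel matrix `K` is assumed to be precomputed, and the names `simplified_smo`, `max_passes` and `max_sweeps` are hypothetical choices for this illustration. The bias `b` follows the convention f(x) = Σ_j α_j y_j K(x_j, x) + b.

```python
import numpy as np

def simplified_smo(K, y, C, tol=1e-3, max_passes=10, max_sweeps=10_000, seed=0):
    """Simplified SMO on a precomputed Gram matrix K; returns (alpha, b)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    passes = sweeps = 0
    # Stop after max_passes consecutive sweeps without any update
    # (max_sweeps is only a safety cap on total work).
    while passes < max_passes and sweeps < max_sweeps:
        sweeps += 1
        changed = 0
        for i in range(n):
            E_i = (alpha * y) @ K[:, i] + b - y[i]
            # Step 1: skip i unless it violates the KKT conditions beyond tol.
            if not ((y[i] * E_i < -tol and alpha[i] < C) or
                    (y[i] * E_i > tol and alpha[i] > 0)):
                continue
            # Step 2: pick a second index j != i at random (no heuristic).
            j = int(rng.integers(n - 1))
            if j >= i:
                j += 1
            E_j = (alpha * y) @ K[:, j] + b - y[j]
            a_i, a_j = alpha[i], alpha[j]
            # Feasible segment [L, H] for alpha_j.
            if y[i] != y[j]:
                L, H = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
            else:
                L, H = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
            eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
            if L >= H or eta <= 0:
                continue
            # Analytic optimum along the constraint line, clipped to [L, H].
            new_j = float(np.clip(a_j + y[j] * (E_i - E_j) / eta, L, H))
            if abs(new_j - a_j) < 1e-8:
                continue
            new_i = a_i + y[i] * y[j] * (a_j - new_j)
            # Update the bias so the KKT conditions hold for the changed pair.
            b1 = b - E_i - y[i] * (new_i - a_i) * K[i, i] - y[j] * (new_j - a_j) * K[i, j]
            b2 = b - E_j - y[i] * (new_i - a_i) * K[i, j] - y[j] * (new_j - a_j) * K[j, j]
            if 0 < new_i < C:
                b = b1
            elif 0 < new_j < C:
                b = b2
            else:
                b = 0.5 * (b1 + b2)
            alpha[i], alpha[j] = new_i, new_j
            changed += 1
        # Step 3: repeat; count consecutive sweeps with no change.
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Usage on a small linearly separable set with a linear kernel.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T
alpha, b = simplified_smo(K, y, C=1.0)
f = K @ (alpha * y) + b  # decision values on the training points
```

Random selection of the second multiplier keeps the sketch short but gives no convergence-rate guarantees; production solvers such as LIBSVM use more careful working-set selection.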

When all the Lagrange multipliers satisfy the KKT conditions (within a user-defined tolerance), the problem has been solved. Although this algorithm is guaranteed to converge, heuristics are used to choose the pair of multipliers so as to accelerate the rate of convergence. This is critical for large data sets since there are $n(n-1)/2$ possible choices for $\alpha_{i}$ and $\alpha_{j}$.
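The stopping test can be made concrete. For this dual problem the KKT conditions say that y_i f(x_i) ≥ 1 when α_i = 0, y_i f(x_i) = 1 when 0 < α_i < C, and y_i f(x_i) ≤ 1 when α_i = C, where f(x) = Σ_j α_j y_j K(x_j, x) + b. The sketch below checks these within a tolerance; the function name `kkt_violators`, the bias `b` and the default `tol` are assumptions made for illustration.

```python
import numpy as np

def kkt_violators(alpha, b, y, K, C, tol=1e-3):
    """Indices whose multipliers violate the KKT conditions by more than tol."""
    f = K @ (alpha * y) + b          # decision values on the training points
    r = y * f - 1.0                  # signed distance from the margin, y_i f(x_i) - 1
    too_small = (r < -tol) & (alpha < C)   # inside the margin, but alpha_i could still grow
    too_large = (r > tol) & (alpha > 0)    # beyond the margin, but alpha_i is still positive
    return np.flatnonzero(too_small | too_large)

# Two points, one per class, with a linear kernel.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
K = X @ X.T
at_optimum = kkt_violators(np.array([0.5, 0.5]), 0.0, y, K, C=1.0)
at_start = kkt_violators(np.zeros(2), 0.0, y, K, C=1.0)
```

An empty result means the problem is solved to within `tol`; a non-empty one gives candidates for the first multiplier of the next pair.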

Related work

The first approach to splitting large SVM learning problems into a series of smaller optimization tasks was proposed by Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik.^[5] It is known as the "chunking algorithm". The algorithm starts with a random subset of the data, solves this problem, and iteratively adds examples which violate the optimality conditions. One disadvantage of this algorithm is that it must solve QP problems whose size scales with the number of support vectors. On real-world sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm.^[1]

In 1997, E. Osuna, R. Freund, and F. Girosi proved a theorem which suggests a whole new set of QP algorithms for SVMs.^[6] By virtue of this theorem, a large QP problem can be broken down into a series of smaller QP sub-problems. A sequence of QP sub-problems that always adds at least one violator of the Karush–Kuhn–Tucker (KKT) conditions is guaranteed to converge. The chunking algorithm obeys the conditions of the theorem, and hence will converge.^[1] The SMO algorithm can be considered a special case of the Osuna algorithm, where the size of the optimization is two and both Lagrange multipliers are replaced at every step with new multipliers that are chosen via good heuristics.^[1]

The SMO algorithm is closely related to a family of optimization algorithms called Bregman methods or row-action methods. These methods solve convex programming problems with linear constraints. They are iterative methods where each step projects the current primal point onto each constraint.^[1]

References

[1] Platt, John (1998). "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines" (PDF). CiteSeerX 10.1.1.43.4376.
[2] Chang, Chih-Chung; Lin, Chih-Jen (2011). "LIBSVM: A library for support vector machines". ACM Transactions on Intelligent Systems and Technology. 2 (3). doi:10.1145/1961189.1961199. S2CID 961425.
[3] Zanni, Luca (2006). "Parallel Software for Training Large Scale Support Vector Machines on Multiprocessor Systems" (PDF).
[4] Rifkin, Ryan (2002). Everything Old is New Again: a Fresh Look at Historical Approaches in Machine Learning (Ph.D. Thesis). Massachusetts Institute of Technology. p. 18. hdl:1721.1/17549.
[5] Boser, B. E.; Guyon, I. M.; Vapnik, V. N. (1992). "A training algorithm for optimal margin classifiers". Proceedings of the fifth annual workshop on Computational learning theory - COLT '92. p. 144. CiteSeerX 10.1.1.21.3818. doi:10.1145/130385.130401. ISBN 978-0897914970. S2CID 207165665.
[6] Osuna, E.; Freund, R.; Girosi, F. (1997). "An improved training algorithm for support vector machines". Neural Networks for Signal Processing [1997] VII. Proceedings of the 1997 IEEE Workshop. pp. 276–285. CiteSeerX 10.1.1.392.7405. doi:10.1109/NNSP.1997.622408. ISBN 978-0-7803-4256-9. S2CID 5667586.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
