Structured support vector machine

Last updated January 22, 2022

The structured support-vector machine is a machine learning algorithm that generalizes the Support-Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structured SVM allows training of a classifier for general structured output labels.

As an example, a sample instance might be a natural language sentence, and the output label is an annotated parse tree. Training a classifier consists of showing pairs of correct sample and output label pairs. After training, the structured SVM model allows one to predict for new sample instances the corresponding output label; that is, given a natural language sentence, the classifier can produce the most likely parse tree.

Training

For a set of $\ell$ training instances $({\boldsymbol {x}}_{n},y_{n})\in {\mathcal {X}}\times {\mathcal {Y}}$ , $n=1,\dots ,\ell$ from a sample space ${\mathcal {X}}$ and label space ${\mathcal {Y}}$ , the structured SVM minimizes the following regularized risk function.

{\underset {\boldsymbol {w}}{\min }}\quad \|{\boldsymbol {w}}\|^{2}+C\sum _{n=1}^{\ell }{\underset {y\in {\mathcal {Y}}}{\max }}\left(0,\Delta (y_{n},y)+\langle {\boldsymbol {w}},\Psi ({\boldsymbol {x}}_{n},y)\rangle -\langle {\boldsymbol {w}},\Psi ({\boldsymbol {x}}_{n},y_{n})\rangle \right)

The function is convex in ${\boldsymbol {w}}$ because the maximum of a set of affine functions is convex. The function ${\displaystyle \Delta$ measures a distance in label space and is an arbitrary function (not necessarily a metric) satisfying $\Delta (y,z)\geq 0$ and $\Delta (y,y)=0\;\;\forall y,z\in {\mathcal {Y}}$ . The function ${\displaystyle \Psi$ is a feature function, extracting some feature vector from a given sample and label. The design of this function depends very much on the application.

Because the regularized risk function above is non-differentiable, it is often reformulated in terms of a quadratic program by introducing one slack variable $\xi _{n}$ for each sample, each representing the value of the maximum. The standard structured SVM primal formulation is given as follows.

{\begin{array}{cl}{\underset {{\boldsymbol {w}},{\boldsymbol {\xi }}}{\min }}&\|{\boldsymbol {w}}\|^{2}+C\sum _{n=1}^{\ell }\xi _{n}\\{\textrm {s.t.}}&\langle {\boldsymbol {w}},\Psi ({\boldsymbol {x}}_{n},y_{n})\rangle -\langle {\boldsymbol {w}},\Psi ({\boldsymbol {x}}_{n},y)\rangle +\xi _{n}\geq \Delta (y_{n},y),\qquad n=1,\dots ,\ell ,\quad \forall y\in {\mathcal {Y}}\end{array}}

Inference

At test time, only a sample ${\boldsymbol {x}}\in {\mathcal {X}}$ is known, and a prediction function $f:{\mathcal {X}}\to {\mathcal {Y}}$ maps it to a predicted label from the label space ${\mathcal {Y}}$ . For structured SVMs, given the vector ${\boldsymbol {w}}$ obtained from training, the prediction function is the following.

f({\boldsymbol {x}})={\underset {y\in {\mathcal {Y}}}{\textrm {argmax}}}\quad \langle {\boldsymbol {w}},\Psi ({\boldsymbol {x}},y)\rangle

Therefore, the maximizer over the label space is the predicted label. Solving for this maximizer is the so-called inference problem and similar to making a maximum a-posteriori (MAP) prediction in probabilistic models. Depending on the structure of the function $\Psi$ , solving for the maximizer can be a hard problem.

Separation

The above quadratic program involves a very large, possibly infinite number of linear inequality constraints. In general, the number of inequalities is too large to be optimized over explicitly. Instead the problem is solved by using delayed constraint generation where only a finite and small subset of the constraints is used. Optimizing over a subset of the constraints enlarges the feasible set and will yield a solution that provides a lower bound on the objective. To test whether the solution ${\boldsymbol {w}}$ violates constraints of the complete set inequalities, a separation problem needs to be solved. As the inequalities decompose over the samples, for each sample $({\boldsymbol {x}}_{n},y_{n})$ the following problem needs to be solved.

y_{n}^{*}={\underset {y\in {\mathcal {Y}}}{\textrm {argmax}}}\left(\Delta (y_{n},y)+\langle {\boldsymbol {w}},\Psi ({\boldsymbol {x}}_{n},y)\rangle -\langle {\boldsymbol {w}},\Psi ({\boldsymbol {x}}_{n},y_{n})\rangle -\xi _{n}\right)

The right hand side objective to be maximized is composed of the constant $-\langle {\boldsymbol {w}},\Psi ({\boldsymbol {x}}_{n},y_{n})\rangle -\xi _{n}$ and a term dependent on the variables optimized over, namely $\Delta (y_{n},y)+\langle {\boldsymbol {w}},\Psi ({\boldsymbol {x}}_{n},y)\rangle$ . If the achieved right hand side objective is smaller or equal to zero, no violated constraints for this sample exist. If it is strictly larger than zero, the most violated constraint with respect to this sample has been identified. The problem is enlarged by this constraint and resolved. The process continues until no violated inequalities can be identified.

If the constants are dropped from the above problem, we obtain the following problem to be solved.

y_{n}^{*}={\underset {y\in {\mathcal {Y}}}{\textrm {argmax}}}\left(\Delta (y_{n},y)+\langle {\boldsymbol {w}},\Psi ({\boldsymbol {x}}_{n},y)\rangle \right)

This problem looks very similar to the inference problem. The only difference is the addition of the term $\Delta (y_{n},y)$ . Most often, it is chosen such that it has a natural decomposition in label space. In that case, the influence of $\Delta$ can be encoded into the inference problem and solving for the most violating constraint is equivalent to solving the inference problem.

Related Research Articles

In quantum mechanics, bra–ket notation, or Dirac notation, is used ubiquitously to denote quantum states. The notation uses angle brackets, and , and a vertical bar , to construct "bras" and "kets".

In quantum mechanics, the uncertainty principle is any of a variety of mathematical inequalities asserting a fundamental limit to the accuracy with which the values for certain pairs of physical quantities of a particle, such as position, x, and momentum, p, can be predicted from initial conditions.

Distributions, also known as Schwartz distributions or generalized functions, are objects that generalize the classical notion of functions in mathematical analysis. Distributions make it possible to differentiate functions whose derivatives do not exist in the classical sense. In particular, any locally integrable function has a distributional derivative. Distributions are widely used in the theory of partial differential equations, where it may be easier to establish the existence of distributional solutions than classical solutions, or where appropriate classical solutions may not exist. Distributions are also important in physics and engineering where many problems naturally lead to differential equations whose solutions or initial conditions are distributions, such as the Dirac delta function.

Wave function Mathematical description of the quantum state of a system

A wave function in quantum physics is a mathematical description of the quantum state of an isolated quantum system. The wave function is a complex-valued probability amplitude, and the probabilities for the possible results of measurements made on the system can be derived from it. The most common symbols for a wave function are the Greek letters $ψ$ and $Ψ$ .

Quantum decoherence is the loss of quantum coherence. In quantum mechanics, particles such as electrons are described by a wave function, a mathematical representation of the quantum state of a system; a probabilistic interpretation of the wave function is used to explain various quantum effects. As long as there exists a definite phase relation between different states, the system is said to be coherent. A definite phase relationship is necessary to perform quantum computing on quantum information encoded in quantum states. Coherence is preserved under the laws of quantum physics.

In physics, an operator is a function over a space of physical states onto another space of physical states. The simplest example of the utility of operators is the study of symmetry. Because of this, they are very useful tools in classical mechanics. Operators are even more important in quantum mechanics, where they form an intrinsic part of the formulation of the theory.

In atomic physics, the fine structure describes the splitting of the spectral lines of atoms due to electron spin and relativistic corrections to the non-relativistic Schrödinger equation. It was first measured precisely for the hydrogen atom by Albert A. Michelson and Edward W. Morley in 1887, laying the basis for the theoretical treatment by Arnold Sommerfeld, introducing the fine-structure constant.

The Schwinger–Dyson equations (SDEs) or Dyson–Schwinger equations, named after Julian Schwinger and Freeman Dyson, are general relations between Green functions in quantum field theories (QFTs). They are also referred to as the Euler–Lagrange equations of quantum field theories, since they are the equations of motion corresponding to the Green's function. They form a set of infinitely many functional differential equations, all coupled to each other, sometimes referred to as the infinite tower of SDEs.

In theoretical physics, a gravitational anomaly is an example of a gauge anomaly: it is an effect of quantum mechanics — usually a one-loop diagram—that invalidates the general covariance of a theory of general relativity combined with some other fields. The adjective "gravitational" is derived from the symmetry of a gravitational theory, namely from general covariance. A gravitational anomaly is generally synonymous with diffeomorphism anomaly, since general covariance is symmetry under coordinate reparametrization; i.e. diffeomorphism.

In quantum field theory, the LSZ reduction formula is a method to calculate S-matrix elements from the time-ordered correlation functions of a quantum field theory. It is a step of the path that starts from the Lagrangian of some quantum field theory and leads to prediction of measurable quantities. It is named after the three German physicists Harry Lehmann, Kurt Symanzik and Wolfhart Zimmermann.

In quantum mechanics, a two-state system is a quantum system that can exist in any quantum superposition of two independent quantum states. The Hilbert space describing such a system is two-dimensional. Therefore, a complete basis spanning the space will consist of two independent states. Any two-state system can also be seen as a qubit.

In quantum mechanics, the position operator is the operator that corresponds to the position observable of a particle.

Photon polarization is the quantum mechanical description of the classical polarized sinusoidal plane electromagnetic wave. An individual photon can be described as having right or left circular polarization, or a superposition of the two. Equivalently, a photon can be described as having horizontal or vertical linear polarization, or a superposition of the two.

In many-body theory, the term Green's function is sometimes used interchangeably with correlation function, but refers specifically to correlators of field operators or creation and annihilation operators.

Linear Programming Boosting (LPBoost) is a supervised classifier from the boosting family of classifiers. LPBoost maximizes a margin between training samples of different classes and hence also belongs to the class of margin-maximizing supervised classification algorithms. Consider a classification function

Quantum walks are quantum analogues of classical random walks. In contrast to the classical random walk, where the walker occupies definite states and the randomness arises due to stochastic transitions between states, in quantum walks randomness arises through: (1) quantum superposition of states, (2) non-random, reversible unitary evolution and (3) collapse of the wave function due to state measurements.

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).

In probability theory and statistics, the normal-inverse-Wishart distribution is a multivariate four-parameter family of continuous probability distributions. It is the conjugate prior of a multivariate normal distribution with unknown mean and covariance matrix.

In the ADM formulation of general relativity one splits spacetime into spatial slices and time, the basic variables are taken to be the induced metric, $, on the spatial slice, and its conjugate momentum variable related to the extrinsic curvature,,. These are the metric canonical coordinates.$

The theory of causal fermion systems is an approach to describe fundamental physics. It provides a unification of the weak, the strong and the electromagnetic forces with gravity at the level of classical field theory. Moreover, it gives quantum mechanics as a limiting case and has revealed close connections to quantum field theory. Therefore, it is a candidate for a unified physical theory. Instead of introducing physical objects on a preexisting spacetime manifold, the general concept is to derive spacetime as well as all the objects therein as secondary objects from the structures of an underlying causal fermion system. This concept also makes it possible to generalize notions of differential geometry to the non-smooth setting. In particular, one can describe situations when spacetime no longer has a manifold structure on the microscopic scale. As a result, the theory of causal fermion systems is a proposal for quantum geometry and an approach to quantum gravity.

References

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann and Yasemin Altun (2005), Large Margin Methods for Structured and Interdependent Output Variables, JMLR, Vol. 6, pages 1453-1484.
Thomas Finley and Thorsten Joachims (2008), Training Structural SVMs when Exact Inference is Intractable, ICML 2008.
Sunita Sarawagi and Rahul Gupta (2008), Accurate Max-margin Training for Structured Output Spaces, ICML 2008.
Gökhan BakIr, Ben Taskar, Thomas Hofmann, Bernhard Schölkopf, Alex Smola and SVN Vishwanathan (2007), Predicting Structured Data, MIT Press.
Vojtěch Franc and Bogdan Savchynskyy Discriminative Learning of Max-Sum Classifiers, Journal of Machine Learning Research, 9(Jan):67—104, 2008, Microtome Publishing
Kevin Murphy Machine Learning, MIT Press

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.