Kernel perceptron

In machine learning, the kernel perceptron is a variant of the popular perceptron learning algorithm that can learn kernel machines, i.e. non-linear classifiers that employ a kernel function to compute the similarity of unseen samples to training samples. The algorithm was invented in 1964, [1] making it the first kernel classification learner. [2]

Preliminaries

The perceptron algorithm

The perceptron algorithm is an online learning algorithm that operates by a principle called "error-driven learning". It iteratively improves a model by running it on training samples, then updating the model whenever it finds it has made an incorrect classification with respect to a supervised signal. The model learned by the standard perceptron algorithm is a linear binary classifier: a vector of weights w (and optionally an intercept term b, omitted here for simplicity) that is used to classify a sample vector x as class "one" or class "minus one" according to

ŷ = sgn(wTx)

where a zero is arbitrarily mapped to one or minus one. (The "hat" on ŷ denotes an estimated value.)

In pseudocode, the perceptron algorithm is given by:

Initialize w to an all-zero vector of length p, the number of predictors (features).
For some fixed number of iterations, or until some stopping criterion is met:
For each training example xi with ground truth label yi ∈ {-1, 1}:
Let ŷ = sgn(wTxi).
If ŷ ≠ yi, update w ← w + yixi.
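
As a concrete illustration, the following is a minimal sketch of this training loop in Python with NumPy (the function and argument names are illustrative, not a standard API); labels are assumed to be in {-1, 1}, and the sign at zero is mapped to one, matching the arbitrary choice above.

    import numpy as np

    def perceptron_train(X, y, epochs=10):
        """X: (n, p) array of training samples; y: length-n array of labels in {-1, 1}."""
        n, p = X.shape
        w = np.zeros(p)                             # weights start as an all-zero vector
        for _ in range(epochs):                     # fixed number of passes over the data
            for i in range(n):
                y_hat = 1 if w @ X[i] >= 0 else -1  # sgn(wTxi), with zero mapped to one
                if y_hat != y[i]:                   # misclassification: error-driven update
                    w = w + y[i] * X[i]
        return w

    def perceptron_predict(w, x):
        return 1 if w @ x >= 0 else -1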

Kernel methods

By contrast with the linear models learned by the perceptron, a kernel method [3] is a classifier that stores a subset of its training examples xi, associates with each a weight αi, and makes decisions for new samples x' by evaluating

ŷ = sgn(Σi αi yi K(xi, x')).

Here, K is some kernel function. Formally, a kernel function is a positive-semidefinite kernel (see Mercer's condition), representing an inner product between samples in a high-dimensional space, as if the samples had been expanded to include additional features by a function Φ: K(x, x') = Φ(x) · Φ(x'). Intuitively, it can be thought of as a similarity function between samples, so the kernel machine establishes the class of a new sample by weighted comparison to the training set. Each function x' ↦ K(xi, x') serves as a basis function in the classification.
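
To make the feature-map view concrete, the following sketch (an illustration assuming 2-dimensional inputs, not taken from the source) compares an explicit degree-2 polynomial feature map Φ with the equivalent kernel K(x, z) = (x · z)²; both give the same inner product, but the kernel never constructs Φ(x).

    import numpy as np

    def phi(x):
        # Explicit feature map for 2-D inputs: Φ(x) = (x1², √2·x1·x2, x2²)
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    def poly2_kernel(x, z):
        # Degree-2 homogeneous polynomial kernel: K(x, z) = (x · z)²
        return (x @ z) ** 2

    x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(phi(x) @ phi(z))      # ≈ 1.0: inner product in the expanded feature space
    print(poly2_kernel(x, z))   # 1.0: same value, computed without ever forming Φ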

Algorithm

To derive a kernelized version of the perceptron algorithm, we must first formulate it in dual form, starting from the observation that the weight vector w can be expressed as a linear combination of the n training samples. The equation for the weight vector is

w = Σi αi yi xi

where αi is the number of times xi was misclassified, forcing an update w ← w + yixi. Using this result, we can formulate the dual perceptron algorithm, which loops through the samples as before, making predictions, but instead of storing and updating a weight vector w, it updates a "mistake counter" vector α. We must also rewrite the prediction formula to get rid of w:

ŷ = sgn(wTx) = sgn(Σi αi yi (xi · x))

Plugging these two equations into the training loop turns it into the dual perceptron algorithm.
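
As a small sanity check (a sketch with illustrative names, assuming labels in {-1, 1}), the dual prediction rule can be compared against the primal one obtained by reconstructing w from the mistake counters:

    import numpy as np

    def dual_predict(alpha, X, y, x_new):
        # ŷ = sgn(Σi αi yi (xi · x_new)), using only dot products with training samples
        score = np.sum(alpha * y * (X @ x_new))
        return 1 if score >= 0 else -1

    def primal_predict(alpha, X, y, x_new):
        # Equivalent prediction via the reconstructed weight vector w = Σi αi yi xi
        w = (alpha * y) @ X
        return 1 if w @ x_new >= 0 else -1

Both functions return the same label for any α, since the two score expressions are algebraically identical.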

Finally, we can replace the dot product in the dual perceptron by an arbitrary kernel function, to get the effect of a feature map Φ without computing Φ(x) explicitly for any samples. Doing this yields the kernel perceptron algorithm: [4]

Initialize α to an all-zeros vector of length n, the number of training samples.
For some fixed number of iterations, or until some stopping criterion is met:
For each training example xj, yj:
Let ŷ = sgn(Σi αi yi K(xi, xj)).
If ŷ ≠ yj, perform an update by incrementing the mistake counter:
αj ← αj + 1
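
A minimal NumPy sketch of this procedure follows (the function names, the kernel argument, and the epoch count are illustrative assumptions rather than a fixed API); the Gram matrix of pairwise kernel values is precomputed so that each prediction is a weighted sum over the training set.

    import numpy as np

    def kernel_perceptron_train(X, y, kernel, epochs=10):
        """X: (n, p) training samples; y: labels in {-1, 1}; kernel: function K(x, x')."""
        n = X.shape[0]
        alpha = np.zeros(n)                               # mistake counters, one per sample
        K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix K[i, j] = K(xi, xj)
        for _ in range(epochs):
            for j in range(n):
                score = np.sum(alpha * y * K[:, j])       # Σi αi yi K(xi, xj)
                y_hat = 1 if score >= 0 else -1
                if y_hat != y[j]:
                    alpha[j] += 1                         # increment the mistake counter for xj
        return alpha

    def kernel_perceptron_predict(alpha, X, y, kernel, x_new):
        score = sum(a * yi * kernel(xi, x_new) for a, yi, xi in zip(alpha, y, X))
        return 1 if score >= 0 else -1

For example, passing kernel=lambda x, z: np.exp(-np.sum((x - z) ** 2)) trains with a Gaussian (RBF) kernel, while kernel=lambda x, z: x @ z recovers the ordinary dual perceptron.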

Variants and extensions

One problem with the kernel perceptron, as presented above, is that it does not learn sparse kernel machines. Initially, all the αi are zero, so evaluating the decision function to get ŷ requires no kernel evaluations at all, but each update increments a single αi, making the evaluation increasingly costly. Moreover, when the kernel perceptron is used in an online setting, the number of non-zero αi, and thus the evaluation cost, grows linearly in the number of examples presented to the algorithm.

The forgetron variant of the kernel perceptron was suggested to deal with this problem. It maintains an active set of examples with non-zero αi, removing ("forgetting") examples from the active set when it exceeds a pre-determined budget and "shrinking" (lowering the weight of) old examples as new ones are promoted to non-zero αi. [5]
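
The forgetron's exact removal and shrinking schedule is specified in the cited paper; purely as an illustration of the budget idea (a simplified caricature, not the forgetron's actual rule), one could cap the number of stored support examples and discard the oldest one whenever the budget is exceeded:

    # Simplified budget maintenance (illustrative only, not the forgetron's actual rule).
    # support is a list of (x_i, y_i, alpha_i) triples with non-zero alpha_i.
    def add_to_budget(support, new_example, budget):
        support.append(new_example)   # promote the newly misclassified example
        if len(support) > budget:
            support.pop(0)            # "forget" the oldest stored example
        return support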

Another problem with the kernel perceptron is that it does not regularize, making it vulnerable to overfitting. The NORMA online kernel learning algorithm can be regarded as a generalization of the kernel perceptron algorithm with regularization. [6] The sequential minimal optimization (SMO) algorithm used to learn support vector machines can also be regarded as a generalization of the kernel perceptron. [6]

The voted perceptron algorithm of Freund and Schapire also extends to the kernelized case, [7] giving generalization bounds comparable to the kernel SVM. [2]


References

  1. Aizerman, M. A.; Braverman, Emmanuel M.; Rozoner, L. I. (1964). "Theoretical foundations of the potential function method in pattern recognition learning". Automation and Remote Control. 25: 821–837. Cited in Guyon, Isabelle; Boser, B.; Vapnik, Vladimir (1993). Automatic capacity tuning of very large VC-dimension classifiers. Advances in Neural Information Processing Systems. CiteSeerX 10.1.1.17.7215.
  2. Bordes, Antoine; Ertekin, Seyda; Weston, Jason; Bottou, Léon (2005). "Fast kernel classifiers with online and active learning". JMLR. 6: 1579–1619.
  3. Schölkopf, Bernhard; Smola, Alexander J. (2002). Learning with Kernels. Cambridge, MA: MIT Press. ISBN 0-262-19475-9.
  4. Shawe-Taylor, John; Cristianini, Nello (2004). Kernel Methods for Pattern Analysis. Cambridge University Press. pp. 241–242.
  5. Dekel, Ofer; Shalev-Shwartz, Shai; Singer, Yoram (2008). "The forgetron: A kernel-based perceptron on a budget" (PDF). SIAM Journal on Computing. 37 (5): 1342–1372. CiteSeerX 10.1.1.115.568. doi:10.1137/060666998.
  6. Kivinen, Jyrki; Smola, Alexander J.; Williamson, Robert C. (2004). "Online learning with kernels". IEEE Transactions on Signal Processing. 52 (8): 2165–2176. CiteSeerX 10.1.1.578.5680. doi:10.1109/TSP.2004.830991.
  7. Freund, Y.; Schapire, R. E. (1999). "Large margin classification using the perceptron algorithm" (PDF). Machine Learning. 37 (3): 277–296. doi:10.1023/A:1007662407062.