Neighbourhood components analysis

Last updated December 19, 2024

Neighbourhood components analysis is a supervised learning method for classifying multivariate data into distinct classes according to a given distance metric over the data. Functionally, it serves the same purposes as the K-nearest neighbors algorithm and makes direct use of a related concept termed stochastic nearest neighbours.

Definition

Neighbourhood components analysis aims at "learning" a distance metric by finding a linear transformation of input data such that the average leave-one-out (LOO) classification performance is maximized in the transformed space. The key insight to the algorithm is that a matrix $A$ corresponding to the transformation can be found by defining a differentiable objective function for $A$ , followed by the use of an iterative solver such as conjugate gradient descent. One of the benefits of this algorithm is that the number of classes $k$ can be determined as a function of $A$ , up to a scalar constant. This use of the algorithm, therefore, addresses the issue of model selection.

Explanation

In order to define $A$ , we define an objective function describing classification accuracy in the transformed space and try to determine $A^{*}$ such that this objective function is maximized.

$A^{*}={\mbox{argmax}}_{A}f(A)$

Leave-one-out (LOO) classification

Consider predicting the class label of a single data point by consensus of its $k$ -nearest neighbours with a given distance metric. This is known as leave-one-out classification. However, the set of nearest-neighbours $C_{i}$ can be quite different after passing all the points through a linear transformation. Specifically, the set of neighbours for a point can undergo discrete changes in response to smooth changes in the elements of $A$ , implying that any objective function $f(\cdot )$ based on the neighbours of a point will be piecewise-constant, and hence not differentiable.

Solution

We can resolve this difficulty by using an approach inspired by stochastic gradient descent. Rather than considering the $k$ -nearest neighbours at each transformed point in LOO-classification, we'll consider the entire transformed data set as stochastic nearest neighbours. We define these using a softmax function of the squared Euclidean distance between a given LOO-classification point and each other point in the transformed space:

$p_{ij}={\begin{cases}{\frac {e^{-||Ax_{i}-Ax_{j}||^{2}}}{\sum _{k\neq i}e^{-||Ax_{i}-Ax_{k}||^{2}}}},&{\mbox{if }}j\neq i\\0,&{\mbox{if }}j=i\end{cases}}$

The probability of correctly classifying data point $i$ is the probability of classifying the points of each of its neighbours with the same class $C_{i}$ :

$p_{i}=\sum _{j\in C_{i}}p_{ij}\quad$ where $p_{ij}$ is the probability of classifying neighbour $j$ of point $i$ .

Define the objective function using LOO classification, this time using the entire data set as stochastic nearest neighbours:

$f(A)=\sum _{i}\sum _{j\in C_{i}}p_{ij}=\sum _{i}p_{i}$

Note that under stochastic nearest neighbours, the consensus class for a single point $i$ is the expected value of a point's class in the limit of an infinite number of samples drawn from the distribution over its neighbours $j\in C_{i}$ i.e.: $P(Class(X_{i})=Class(X_{j}))=p_{ij}$ . Thus the predicted class is an affine combination of the classes of every other point, weighted by the softmax function for each $j\in C_{j}$ where $C_{j}$ is now the entire transformed data set.

This choice of objective function is preferable as it is differentiable with respect to $A$ (denote $x_{ij}=x_{i}-x_{j}$ ):

${\frac {\partial f}{\partial A}}=-2A\sum _{i}\sum _{j\in C_{i}}p_{ij}\left(x_{ij}x_{ij}^{T}-\sum _{k}p_{ik}x_{ik}x_{ik}^{T}\right)$

$=2A\sum _{i}\left(p_{i}\sum _{k}p_{ik}x_{ik}x_{ik}^{T}-\sum _{j\in C_{i}}p_{ij}x_{ij}x_{ij}^{T}\right)$

Obtaining a gradient for $A$ means that it can be found with an iterative solver such as conjugate gradient descent. Note that in practice, most of the innermost terms of the gradient evaluate to insignificant contributions due to the rapidly diminishing contribution of distant points from the point of interest. This means that the inner sum of the gradient can be truncated, resulting in reasonable computation times even for large data sets.

Alternative formulation

"Maximizing $f(\cdot )$ is equivalent to minimizing the $L_{1}$ -distance between the predicted class distribution and the true class distribution (ie: where the $p_{i}$ induced by $A$ are all equal to 1). A natural alternative is the KL-divergence, which induces the following objective function and gradient:" (Goldberger 2005)

$g(A)=\sum _{i}\log \left(\sum _{j\in C_{i}}p_{ij}\right)=\sum _{i}\log(p_{i})$

${\frac {\partial g}{\partial A}}=2A\sum _{i}\left(\sum _{k}p_{ik}x_{ik}x_{ik}^{T}-{\frac {\sum _{j\in C_{i}}p_{ij}x_{ij}x_{ij}^{T}}{\sum _{j\in C_{i}}p_{ij}}}\right)$

In practice, optimization of $A$ using this function tends to give similar performance results as with the original.

History and background

Neighbourhood components analysis was developed by Jacob Goldberger, Sam Roweis, Ruslan Salakhudinov, and Geoff Hinton at the University of Toronto's department of computer science in 2004.

Related Research Articles

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories, SVMs are one of the most studied models, being based on statistical learning frameworks of VC theory proposed by Vapnik and Chervonenkis (1974).

The method of least squares is a parameter estimation method in regression analysis based on minimizing the sum of the squares of the residuals made in the results of each individual equation.

In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations (iterations).

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.

Maximum Variance Unfolding (MVU), also known as Semidefinite Embedding (SDE), is an algorithm in computer science that uses semidefinite programming to perform non-linear dimensionality reduction of high-dimensional vectorial input data.

In machine learning, backpropagation is a gradient estimation method commonly used for training a neural network to compute its parameter updates.

Functional data analysis (FDA) is a branch of statistics that analyses data providing information about curves, surfaces or anything else varying over a continuum. In its most general form, under an FDA framework, each sample element of functional data is considered to be a random function. The physical continuum over which these functions are defined is often time, but may also be spatial location, wavelength, probability, etc. Intrinsically, functional data are infinite dimensional. The high intrinsic dimensionality of these data brings challenges for theory as well as computation, where these challenges vary with how the functional data were sampled. However, the high or infinite dimensional structure of the data is a rich source of information and there are many interesting challenges for research and data analysis.

<span class="mw-page-title-main">Geometry processing</span>

Geometry processing is an area of research that uses concepts from applied mathematics, computer science and engineering to design efficient algorithms for the acquisition, reconstruction, analysis, manipulation, simulation and transmission of complex 3D models. As the name implies, many of the concepts, data structures, and algorithms are directly analogous to signal processing and image processing. For example, where image smoothing might convolve an intensity signal with a blur kernel formed using the Laplace operator, geometric smoothing might be achieved by convolving a surface geometry with a blur kernel formed using the Laplace-Beltrami operator.

In numerical linear algebra, the Jacobi eigenvalue algorithm is an iterative method for the calculation of the eigenvalues and eigenvectors of a real symmetric matrix. It is named after Carl Gustav Jacob Jacobi, who first proposed the method in 1846, but only became widely used in the 1950s with the advent of computers.

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

In the field of mathematical modeling, a radial basis function network is an artificial neural network that uses radial basis functions as activation functions. The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters. Radial basis function networks have many uses, including function approximation, time series prediction, classification, and system control. They were first formulated in a 1988 paper by Broomhead and Lowe, both researchers at the Royal Signals and Radar Establishment.

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box–Cox transformed regressors ( $).$

In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., prediction of prices in the financial international markets. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.

<span class="mw-page-title-main">Diffusion</span> Transport of dissolved species from the highest to the lowest concentration region

Diffusion is the net movement of anything generally from a region of higher concentration to a region of lower concentration. Diffusion is driven by a gradient in Gibbs free energy or chemical potential. It is possible to diffuse "uphill" from a region of lower concentration to a region of higher concentration, as in spinodal decomposition. Diffusion is a stochastic process due to the inherent randomness of the diffusing entity and can be used to model many real-life stochastic scenarios. Therefore, diffusion and the corresponding mathematical models are used in several fields beyond physics, such as statistics, probability theory, information theory, neural networks, finance, and marketing.

The system size expansion, also known as van Kampen's expansion or the Ω-expansion, is a technique pioneered by Nico van Kampen used in the analysis of stochastic processes. Specifically, it allows one to find an approximation to the solution of a master equation with nonlinear transition rates. The leading order term of the expansion is given by the linear noise approximation, in which the master equation is approximated by a Fokker–Planck equation with linear coefficients determined by the transition rates and stoichiometry of the system.

Augmented Lagrangian methods are a certain class of algorithms for solving constrained optimization problems. They have similarities to penalty methods in that they replace a constrained optimization problem by a series of unconstrained problems and add a penalty term to the objective, but the augmented Lagrangian method adds yet another term designed to mimic a Lagrange multiplier. The augmented Lagrangian is related to, but not identical with, the method of Lagrange multipliers.

In statistics, Ward's method is a criterion applied in hierarchical cluster analysis. Ward's minimum variance method is a special case of the objective function approach originally presented by Joe H. Ward, Jr. Ward suggested a general agglomerative hierarchical clustering procedure, where the criterion for choosing the pair of clusters to merge at each step is based on the optimal value of an objective function. This objective function could be "any function that reflects the investigator's purpose." Many of the standard clustering procedures are contained in this very general class. To illustrate the procedure, Ward used the example where the objective function is the error sum of squares, and this example is known as Ward's method or more precisely Ward's minimum variance method.

Extension neural network is a pattern recognition method found by M. H. Wang and C. P. Hung in 2003 to classify instances of data sets. Extension neural network is composed of artificial neural network and extension theory concepts. It uses the fast and adaptive learning capability of neural network and correlation estimation property of extension theory by calculating extension distance.
ENN was used in:

A capsule neural network (CapsNet) is a machine learning system that is a type of artificial neural network (ANN) that can be used to better model hierarchical relationships. The approach is an attempt to more closely mimic biological neural organization.

(Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques are able to achieve convergence rates that are impossible to achieve with methods that treat the objective as an infinite sum, as in the classical Stochastic approximation setting.

References

J. Goldberger, G. Hinton, S. Roweis, R. Salakhutdinov. (2005) Neighbourhood Components Analysis. Advances in Neural Information Processing Systems. 17, 513–520, 2005.

External links

Software

The MLPACK library contains a C++ implementation
nca (C++)
scikit-learn's "NeighborhoodComponentsAnalysis" implementation (Python)

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.