Kernel-independent component analysis

In statistics, kernel-independent component analysis (kernel ICA) is an efficient algorithm for independent component analysis that estimates source components by optimizing a generalized-variance contrast function based on representations in a reproducing kernel Hilbert space. [1] [2] These contrast functions use the notion of mutual information as a measure of statistical independence.

Independent component analysis

In signal processing, independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that the subcomponents are non-Gaussian signals and that they are statistically independent from each other. ICA is a special case of blind source separation. A common example application is the "cocktail party problem" of listening in on one person's speech in a noisy room.

Reproducing kernel Hilbert space

In functional analysis, a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions $f$ and $g$ in the RKHS are close in norm, i.e., $\|f - g\|$ is small, then $f$ and $g$ are also pointwise close, i.e., $|f(x) - g(x)|$ is small for all $x$. The reverse need not be true.
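
Because point evaluation at $x$ is represented by an element $K_x = k(\cdot, x)$ of the space, this pointwise closeness follows in one line from the reproducing property together with the Cauchy–Schwarz inequality (a standard derivation, shown here only for context):

    $|f(x) - g(x)| = |\langle f - g, K_x \rangle| \le \|f - g\| \, \sqrt{k(x, x)}.$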

In probability theory, two events are independent, statistically independent, or stochastically independent if the occurrence of one does not affect the probability of occurrence of the other. Similarly, two random variables are independent if the realization of one does not affect the probability distribution of the other.

Main idea

Kernel ICA is based on the idea that correlations between two random variables can be represented in a reproducing kernel Hilbert space (RKHS), denoted by $\mathcal{F}$, associated with a feature map $L_x : \mathcal{F} \to \mathbb{R}$ defined for a fixed $x \in \mathbb{R}$. The $\mathcal{F}$-correlation between two random variables $X$ and $Y$ is defined as

    $\rho_{\mathcal{F}}(X, Y) = \max_{f, g \in \mathcal{F}} \operatorname{corr}(f(X), g(Y)),$

where the functions $f, g : \mathbb{R} \to \mathbb{R}$ range over $\mathcal{F}$ and

    $\operatorname{corr}(f(X), g(Y)) := \frac{\operatorname{cov}(f(X), g(Y))}{(\operatorname{var} f(X))^{1/2} \, (\operatorname{var} g(Y))^{1/2}}$

for fixed $f, g \in \mathcal{F}$. [1] Note that the reproducing property implies that $f(x) = \langle L_x, f \rangle$ for fixed $x \in \mathbb{R}$ and $f \in \mathcal{F}$. [3] It follows then that the $\mathcal{F}$-correlation between two independent random variables is zero.
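
As a concrete illustration of how a finite-sample $\mathcal{F}$-correlation can be estimated, the sketch below computes a regularized kernel canonical correlation from centered Gaussian-kernel Gram matrices, in the spirit of the estimator of Bach and Jordan. [1] It is a minimal sketch, not the reference implementation: the function names, the bandwidth sigma, and the regularization constant kappa are illustrative choices.

    import numpy as np

    def centered_gram(x, sigma=1.0):
        # Gaussian-kernel Gram matrix of a one-dimensional sample, centered in feature space.
        d2 = (x[:, None] - x[None, :]) ** 2
        K = np.exp(-d2 / (2.0 * sigma ** 2))
        n = len(x)
        H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
        return H @ K @ H

    def f_correlation(x, y, sigma=1.0, kappa=1e-2):
        # Regularized estimate of the first kernel canonical correlation between
        # the samples x and y; this plays the role of the F-correlation contrast.
        n = len(x)
        Kx, Ky = centered_gram(x, sigma), centered_gram(y, sigma)
        Rx = np.linalg.solve(Kx + n * kappa * np.eye(n), Kx)  # (Kx + n*kappa*I)^(-1) Kx
        Ry = np.linalg.solve(Ky + n * kappa * np.eye(n), Ky)
        # The first canonical correlation is the largest singular value of Rx @ Ry.
        return np.linalg.svd(Rx @ Ry, compute_uv=False)[0]

    rng = np.random.default_rng(0)
    s = rng.standard_normal(300)
    print(f_correlation(s, rng.standard_normal(300)))  # typically lower: independent samples
    print(f_correlation(s, s ** 2))                    # typically higher: dependent samples

The population $\mathcal{F}$-correlation is zero under independence, so the estimate is expected to be markedly smaller in the first case than in the second, which is the property the contrast function exploits.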

This notion of $\mathcal{F}$-correlations is used for defining contrast functions that are optimized in the Kernel ICA algorithm. Specifically, if $\mathbf{X} := (x_{ij}) \in \mathbb{R}^{n \times m}$ is a prewhitened data matrix, that is, the sample mean of each column is zero and the sample covariance of the rows is the $m \times m$ identity matrix, Kernel ICA estimates an $m \times m$ orthogonal matrix $\mathbf{A}$ so as to minimize finite-sample $\mathcal{F}$-correlations between the columns of $\mathbf{S} := \mathbf{X} \mathbf{A}'$.
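
For intuition, the two-source sketch below carries out this optimization in its simplest possible form: it prewhitens the mixed data and then searches over rotation angles for the orthogonal transform that minimizes the estimated $\mathcal{F}$-correlation between the recovered components, reusing the f_correlation helper from the previous sketch. The grid search over a single angle is an illustrative simplification; the algorithm of Bach and Jordan instead uses gradient-based optimization over orthogonal matrices. [1]

    import numpy as np

    rng = np.random.default_rng(1)
    n = 300
    # Two independent non-Gaussian sources, mixed by an unknown matrix.
    S = np.column_stack([rng.uniform(-1.0, 1.0, n), rng.laplace(size=n)])
    X = S @ np.array([[1.0, 0.6], [0.4, 1.0]]).T

    # Prewhitening: zero column means, (approximately) identity sample covariance.
    Xc = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    Xw = Xc @ eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

    # Search over rotations of the whitened data for the angle whose two
    # components have the smallest estimated F-correlation.
    def rotation(theta):
        c, s_ = np.cos(theta), np.sin(theta)
        return np.array([[c, s_], [-s_, c]])

    best = min(np.linspace(0.0, np.pi / 2, 90),
               key=lambda t: f_correlation(Xw @ rotation(t)[:, 0], Xw @ rotation(t)[:, 1]))
    S_hat = Xw @ rotation(best)  # estimated sources, up to scaling and permutation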

Related Research Articles

In mathematics, the Cauchy–Schwarz inequality, also known as the Cauchy–Bunyakovsky–Schwarz inequality, is a useful inequality encountered in many different settings, such as linear algebra, analysis, probability theory, vector algebra and other areas. It is considered to be one of the most important inequalities in all of mathematics.
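
For reference, for any two elements $u$ and $v$ of an inner product space the inequality states that

    $|\langle u, v \rangle| \le \|u\| \, \|v\|,$

with equality exactly when $u$ and $v$ are linearly dependent.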

In probability and statistics, a multivariate random variable or random vector is a list of mathematical variables each of whose value is unknown, either because the value has not yet occurred or because there is imperfect knowledge of its value. The individual variables in a random vector are grouped together because they are all part of a single mathematical system; often they represent different properties of an individual statistical unit. For example, while a given person has a specific age, height and weight, the representation of these features of an unspecified person from within a group would be a random vector. Normally each element of a random vector is a real number.

Multivariate normal distribution

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.

In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.
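
In symbols, for random variables $X$ and $Y$ with means $\mu_X$, $\mu_Y$ and standard deviations $\sigma_X$, $\sigma_Y$:

    $\operatorname{cov}(X, Y) = \operatorname{E}[(X - \mu_X)(Y - \mu_Y)], \qquad \operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}.$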

Covariance matrix

In probability theory and statistics, a covariance matrix, also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix, is a matrix whose element in the i, j position is the covariance between the i-th and j-th elements of a random vector. A random vector is a random variable with multiple dimensions. Each element of the vector is a scalar random variable. Each element has either a finite number of observed empirical values or a finite or infinite number of potential values. The potential values are specified by a theoretical joint probability distribution.
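
For a random vector $\mathbf{X} = (X_1, \ldots, X_n)^{\mathsf{T}}$, the covariance matrix $K$ therefore has entries

    $K_{ij} = \operatorname{cov}(X_i, X_j) = \operatorname{E}\big[(X_i - \operatorname{E}[X_i])(X_j - \operatorname{E}[X_j])\big].$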

Cross-correlation

In signal processing, cross-correlation is a measure of similarity of two series as a function of the displacement of one relative to the other. This is also known as a sliding dot product or sliding inner-product. It is commonly used for searching a long signal for a shorter, known feature. It has applications in pattern recognition, single particle analysis, electron tomography, averaging, cryptanalysis, and neurophysiology. The cross-correlation is similar in nature to the convolution of two functions. In an autocorrelation, which is the cross-correlation of a signal with itself, there will always be a peak at a lag of zero, and its size will be the signal energy.
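
For two real-valued discrete-time sequences $f$ and $g$, the cross-correlation at displacement (lag) $n$ can be written as

    $(f \star g)[n] = \sum_{m} f[m] \, g[m + n].$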

In the theory of stochastic processes, the Karhunen–Loève theorem, also known as the Kosambi–Karhunen–Loève theorem, is a representation of a stochastic process as an infinite linear combination of orthogonal functions, analogous to a Fourier series representation of a function on a bounded interval. The transformation is also known as the Hotelling transform and eigenvector transform, and is closely related to the principal component analysis (PCA) technique widely used in image processing and in data analysis in many fields.
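
For a zero-mean, square-integrable process $X_t$ on an interval $[a, b]$, the expansion takes the form

    $X_t = \sum_{k=1}^{\infty} Z_k \, e_k(t),$

where the $e_k$ are orthonormal eigenfunctions of the covariance operator of the process and the coefficients $Z_k$ are pairwise uncorrelated random variables.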

Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately. Early versions of MTL were called "hints".

In linear algebra, the Gram matrix of a set of vectors $v_1, \ldots, v_n$ in an inner product space is the Hermitian matrix of inner products, whose entries are given by $G_{ij} = \langle v_i, v_j \rangle$.

Kernel method

In machine learning, kernel methods are a class of algorithms for pattern analysis, whose best known member is the support vector machine (SVM). The general task of pattern analysis is to find and study general types of relations in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map: in contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over pairs of data points in raw representation.
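
A standard small example of this "kernel trick": for $x, y \in \mathbb{R}^2$, the kernel $k(x, y) = (x^{\mathsf{T}} y)^2$ equals the inner product of the explicit feature vectors $\varphi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$, so similarities in the three-dimensional feature space can be evaluated without ever forming $\varphi$:

    $k(x, y) = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = \langle \varphi(x), \varphi(y) \rangle.$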

Principal component regression

In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). Typically, it considers regressing the outcome on a set of covariates based on a standard linear regression model, but uses PCA for estimating the unknown regression coefficients in the model.

In machine learning, a subfield of computer science, learning with errors (LWE) is the problem of inferring a linear $n$-ary function $f$ over a finite ring from given samples $y_i = f(x_i)$, some of which may be erroneous. The LWE problem is conjectured to be hard to solve, and thus to be useful in cryptography.

In machine learning, kernel methods arise from the assumption of an inner product space or similarity structure on inputs. For some such methods, such as support vector machines (SVMs), the original formulation and its regularization were not Bayesian in nature. It is helpful to understand them from a Bayesian perspective. Because the kernels are not necessarily positive semidefinite, the underlying structure may not be inner product spaces, but instead more general reproducing kernel Hilbert spaces. In Bayesian probability, kernel methods are a key component of Gaussian processes, where the kernel function is known as the covariance function. Kernel methods have traditionally been used in supervised learning problems where the input space is usually a space of vectors while the output space is a space of scalars. More recently these methods have been extended to problems that deal with multiple outputs such as in multi-task learning.

In statistical learning theory, a representer theorem is any of several related results stating that a minimizer of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of kernel products evaluated on the input points in the training set.
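
Concretely, for training inputs $x_1, \ldots, x_n$ and kernel $k$, such a minimizer admits a representation of the form

    $f^{*}(\cdot) = \sum_{i=1}^{n} \alpha_i \, k(\cdot, x_i)$

for some coefficients $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$.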

Functional principal component analysis (FPCA) is a statistical method for investigating the dominant modes of variation of functional data. Using this method, a random function is represented in the eigenbasis, which is an orthonormal basis of the Hilbert space L2 that consists of the eigenfunctions of the autocovariance operator. FPCA represents functional data in the most parsimonious way, in the sense that when using a fixed number of basis functions, the eigenfunction basis explains more variation than any other basis expansion. FPCA can be applied for representing random functions, or in functional regression and classification.

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in $\mathbb{R}^n$, discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song, Arthur Gretton, and Bernhard Schölkopf.
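
The basic object is the kernel mean embedding: for a kernel $k$ with RKHS $\mathcal{H}$, a distribution $P$ is mapped to the element

    $\mu_P = \operatorname{E}_{X \sim P}[k(\cdot, X)] \in \mathcal{H},$

and the distance between two embeddings, $\operatorname{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}$ (the maximum mean discrepancy), gives a way to compare distributions.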

Kernel methods are a well-established tool to analyze the relationship between input data and the corresponding output of a function. Kernels encapsulate the properties of functions in a computationally efficient way and allow algorithms to easily swap functions of varying complexity.

Low-rank matrix approximations are essential tools in the application of kernel methods to large-scale learning problems.

In statistics, functional correlation is a dimensionality reduction technique used to quantify the correlation and dependence between two variables when the data is functional. Several approaches have been developed to quantify the relation between two functional variables.

References

  1. Bach, Francis R.; Jordan, Michael I. (2003). "Kernel independent component analysis" (PDF). The Journal of Machine Learning Research. 3: 1–48. doi:10.1162/153244303768966085.
  2. Bach, Francis R.; Jordan, Michael I. (2003). Kernel independent component analysis (PDF). IEEE International Conference on Acoustics, Speech, and Signal Processing. 4. pp. IV-876–9. doi:10.1109/icassp.2003.1202783. ISBN 978-0-7803-7663-2.
  3. Saitoh, Saburou (1988). Theory of Reproducing Kernels and Its Applications. Longman. ISBN 978-0582035645.