This article needs additional citations for verification .(October 2011) |
Part of a series on |
Machine learning and data mining |
---|
In signal processing, independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that at most one subcomponent is Gaussian and that the subcomponents are statistically independent from each other. [1] ICA was invented by Jeanny Hérault and Christian Jutten in 1985. [2] ICA is a special case of blind source separation. A common example application of ICA is the "cocktail party problem" of listening in on one person's speech in a noisy room. [3]
Independent component analysis attempts to decompose a multivariate signal into independent non-Gaussian signals. As an example, sound is usually a signal that is composed of the numerical addition, at each time t, of signals from several sources. The question then is whether it is possible to separate these contributing sources from the observed total signal. When the statistical independence assumption is correct, blind ICA separation of a mixed signal gives very good results. [5] It is also used for signals that are not supposed to be generated by mixing for analysis purposes.
A simple application of ICA is the "cocktail party problem", where the underlying speech signals are separated from a sample data consisting of people talking simultaneously in a room. Usually the problem is simplified by assuming no time delays or echoes. Note that a filtered and delayed signal is a copy of a dependent component, and thus the statistical independence assumption is not violated.
Mixing weights for constructing the observed signals from the components can be placed in an matrix. An important thing to consider is that if sources are present, at least observations (e.g. microphones if the observed signal is audio) are needed to recover the original signals. When there are an equal number of observations and source signals, the mixing matrix is square (). Other cases of underdetermined () and overdetermined () have been investigated.
The success of ICA separation of mixed signals relies on two assumptions and three effects of mixing source signals. Two assumptions:
Three effects of mixing source signals:
Those principles contribute to the basic establishment of ICA. If the signals extracted from a set of mixtures are independent and have non-Gaussian distributions or have low complexity, then they must be source signals. [6] [7]
ICA finds the independent components (also called factors, latent variables or sources) by maximizing the statistical independence of the estimated components. We may choose one of many ways to define a proxy for independence, and this choice governs the form of the ICA algorithm. The two broadest definitions of independence for ICA are
The Minimization-of-Mutual information (MMI) family of ICA algorithms uses measures like Kullback-Leibler Divergence and maximum entropy. The non-Gaussianity family of ICA algorithms, motivated by the central limit theorem, uses kurtosis and negentropy. [8]
Typical algorithms for ICA use centering (subtract the mean to create a zero mean signal), whitening (usually with the eigenvalue decomposition), and dimensionality reduction as preprocessing steps in order to simplify and reduce the complexity of the problem for the actual iterative algorithm. Whitening and dimension reduction can be achieved with principal component analysis or singular value decomposition. Whitening ensures that all dimensions are treated equally a priori before the algorithm is run. Well-known algorithms for ICA include infomax, FastICA, JADE, and kernel-independent component analysis, among others. In general, ICA cannot identify the actual number of source signals, a uniquely correct ordering of the source signals, nor the proper scaling (including sign) of the source signals.
ICA is important to blind signal separation and has many practical applications. It is closely related to (or even a special case of) the search for a factorial code of the data, i.e., a new vector-valued representation of each data vector such that it gets uniquely encoded by the resulting code vector (loss-free coding), but the code components are statistically independent.
Linear independent component analysis can be divided into noiseless and noisy cases, where noiseless ICA is a special case of noisy ICA. Nonlinear ICA should be considered as a separate case.
The data are represented by the observed random vector and the hidden components as the random vector The task is to transform the observed data using a linear static transformation as into a vector of maximally independent components measured by some function of independence.
The components of the observed random vector are generated as a sum of the independent components , :
weighted by the mixing weights .
The same generative model can be written in vector form as , where the observed random vector is represented by the basis vectors . The basis vectors form the columns of the mixing matrix and the generative formula can be written as , where .
Given the model and realizations (samples) of the random vector , the task is to estimate both the mixing matrix and the sources . This is done by adaptively calculating the vectors and setting up a cost function which either maximizes the non-gaussianity of the calculated or minimizes the mutual information. In some cases, a priori knowledge of the probability distributions of the sources can be used in the cost function.
The original sources can be recovered by multiplying the observed signals with the inverse of the mixing matrix , also known as the unmixing matrix. Here it is assumed that the mixing matrix is square (). If the number of basis vectors is greater than the dimensionality of the observed vectors, , the task is overcomplete but is still solvable with the pseudo inverse.
With the added assumption of zero-mean and uncorrelated Gaussian noise , the ICA model takes the form .
The mixing of the sources does not need to be linear. Using a nonlinear mixing function with parameters the nonlinear ICA model is .
The independent components are identifiable up to a permutation and scaling of the sources. [9] This identifiability requires that:
A special variant of ICA is binary ICA in which both signal sources and monitors are in binary form and observations from monitors are disjunctive mixtures of binary independent sources. The problem was shown to have applications in many domains including medical diagnosis, multi-cluster assignment, network tomography and internet resource management.
Let be the set of binary variables from monitors and be the set of binary variables from sources. Source-monitor connections are represented by the (unknown) mixing matrix , where indicates that signal from the i-th source can be observed by the j-th monitor. The system works as follows: at any time, if a source is active () and it is connected to the monitor () then the monitor will observe some activity (). Formally we have:
where is Boolean AND and is Boolean OR. Noise is not explicitly modelled, rather, can be treated as independent sources.
The above problem can be heuristically solved [10] by assuming variables are continuous and running FastICA on binary observation data to get the mixing matrix (real values), then apply round number techniques on to obtain the binary values. This approach has been shown to produce a highly inaccurate result.[ citation needed ]
Another method is to use dynamic programming: recursively breaking the observation matrix into its sub-matrices and run the inference algorithm on these sub-matrices. The key observation which leads to this algorithm is the sub-matrix of where corresponds to the unbiased observation matrix of hidden components that do not have connection to the -th monitor. Experimental results from [11] show that this approach is accurate under moderate noise levels.
The Generalized Binary ICA framework [12] introduces a broader problem formulation which does not necessitate any knowledge on the generative model. In other words, this method attempts to decompose a source into its independent components (as much as possible, and without losing any information) with no prior assumption on the way it was generated. Although this problem appears quite complex, it can be accurately solved with a branch and bound search tree algorithm or tightly upper bounded with a single multiplication of a matrix with a vector.
Signal mixtures tend to have Gaussian probability density functions, and source signals tend to have non-Gaussian probability density functions. Each source signal can be extracted from a set of signal mixtures by taking the inner product of a weight vector and those signal mixtures where this inner product provides an orthogonal projection of the signal mixtures. The remaining challenge is finding such a weight vector. One type of method for doing so is projection pursuit. [13] [14]
Projection pursuit seeks one projection at a time such that the extracted signal is as non-Gaussian as possible. This contrasts with ICA, which typically extracts M signals simultaneously from M signal mixtures, which requires estimating a M × M unmixing matrix. One practical advantage of projection pursuit over ICA is that fewer than M signals can be extracted if required, where each source signal is extracted from M signal mixtures using an M-element weight vector.
We can use kurtosis to recover the multiple source signal by finding the correct weight vectors with the use of projection pursuit.
The kurtosis of the probability density function of a signal, for a finite sample, is computed as
where is the sample mean of , the extracted signals. The constant 3 ensures that Gaussian signals have zero kurtosis, Super-Gaussian signals have positive kurtosis, and Sub-Gaussian signals have negative kurtosis. The denominator is the variance of , and ensures that the measured kurtosis takes account of signal variance. The goal of projection pursuit is to maximize the kurtosis, and make the extracted signal as non-normal as possible.
Using kurtosis as a measure of non-normality, we can now examine how the kurtosis of a signal extracted from a set of M mixtures varies as the weight vector is rotated around the origin. Given our assumption that each source signal is super-gaussian we would expect:
For multiple source mixture signals, we can use kurtosis and Gram-Schmidt Orthogonalization (GSO) to recover the signals. Given M signal mixtures in an M-dimensional space, GSO project these data points onto an (M-1)-dimensional space by using the weight vector. We can guarantee the independence of the extracted signals with the use of GSO.
In order to find the correct value of , we can use gradient descent method. We first of all whiten the data, and transform into a new mixture , which has unit variance, and . This process can be achieved by applying Singular value decomposition to ,
Rescaling each vector , and let . The signal extracted by a weighted vector is . If the weight vector w has unit length, then the variance of y is also 1, that is . The kurtosis can thus be written as:
The updating process for is:
where is a small constant to guarantee that converges to the optimal solution. After each update, we normalize , and set , and repeat the updating process until convergence. We can also use another algorithm to update the weight vector .
Another approach is using negentropy [8] [15] instead of kurtosis. Using negentropy is a more robust method than kurtosis, as kurtosis is very sensitive to outliers. The negentropy methods are based on an important property of Gaussian distribution: a Gaussian variable has the largest entropy among all continuous random variables of equal variance. This is also the reason why we want to find the most nongaussian variables. A simple proof can be found in Differential entropy.
y is a Gaussian random variable of the same covariance matrix as x
An approximation for negentropy is
A proof can be found in the original papers of Comon; [16] [8] it has been reproduced in the book Independent Component Analysis by Aapo Hyvärinen, Juha Karhunen, and Erkki Oja [17] This approximation also suffers from the same problem as kurtosis (sensitivity to outliers). Other approaches have been developed. [18]
A choice of and are
Infomax ICA [19] is essentially a multivariate, parallel version of projection pursuit. Whereas projection pursuit extracts a series of signals one at a time from a set of M signal mixtures, ICA extracts M signals in parallel. This tends to make ICA more robust than projection pursuit. [20]
The projection pursuit method uses Gram-Schmidt orthogonalization to ensure the independence of the extracted signal, while ICA use infomax and maximum likelihood estimate to ensure the independence of the extracted signal. The Non-Normality of the extracted signal is achieved by assigning an appropriate model, or prior, for the signal.
The process of ICA based on infomax in short is: given a set of signal mixtures and a set of identical independent model cumulative distribution functions(cdfs) , we seek the unmixing matrix which maximizes the joint entropy of the signals , where are the signals extracted by . Given the optimal , the signals have maximum entropy and are therefore independent, which ensures that the extracted signals are also independent. is an invertible function, and is the signal model. Note that if the source signal model probability density function matches the probability density function of the extracted signal , then maximizing the joint entropy of also maximizes the amount of mutual information between and . For this reason, using entropy to extract independent signals is known as infomax.
Consider the entropy of the vector variable , where is the set of signals extracted by the unmixing matrix . For a finite set of values sampled from a distribution with pdf , the entropy of can be estimated as:
The joint pdf can be shown to be related to the joint pdf of the extracted signals by the multivariate form:
where is the Jacobian matrix. We have , and is the pdf assumed for source signals , therefore,
therefore,
We know that when , is of uniform distribution, and is maximized. Since
where is the absolute value of the determinant of the unmixing matrix . Therefore,
so,
since , and maximizing does not affect , so we can maximize the function
to achieve the independence of the extracted signal.
If there are M marginal pdfs of the model joint pdf are independent and use the commonly super-gaussian model pdf for the source signals , then we have
In the sum, given an observed signal mixture , the corresponding set of extracted signals and source signal model , we can find the optimal unmixing matrix , and make the extracted signals independent and non-gaussian. Like the projection pursuit situation, we can use gradient descent method to find the optimal solution of the unmixing matrix.
Maximum likelihood estimation (MLE) is a standard statistical tool for finding parameter values (e.g. the unmixing matrix ) that provide the best fit of some data (e.g., the extracted signals ) to a given a model (e.g., the assumed joint probability density function (pdf) of source signals). [20]
The ML "model" includes a specification of a pdf, which in this case is the pdf of the unknown source signals . Using ML ICA, the objective is to find an unmixing matrix that yields extracted signals with a joint pdf as similar as possible to the joint pdf of the unknown source signals .
MLE is thus based on the assumption that if the model pdf and the model parameters are correct then a high probability should be obtained for the data that were actually observed. Conversely, if is far from the correct parameter values then a low probability of the observed data would be expected.
Using MLE, we call the probability of the observed data for a given set of model parameter values (e.g., a pdf and a matrix ) the likelihood of the model parameter values given the observed data.
We define a likelihood function of :
This equals to the probability density at , since .
Thus, if we wish to find a that is most likely to have generated the observed mixtures from the unknown source signals with pdf then we need only find that which maximizes the likelihood. The unmixing matrix that maximizes equation is known as the MLE of the optimal unmixing matrix.
It is common practice to use the log likelihood, because this is easier to evaluate. As the logarithm is a monotonic function, the that maximizes the function also maximizes its logarithm . This allows us to take the logarithm of equation above, which yields the log likelihood function
If we substitute a commonly used high-Kurtosis model pdf for the source signals then we have
This matrix that maximizes this function is the maximum likelihood estimation.
The early general framework for independent component analysis was introduced by Jeanny Hérault and Bernard Ans from 1984, [21] further developed by Christian Jutten in 1985 and 1986, [2] [22] [23] and refined by Pierre Comon in 1991, [16] and popularized in his paper of 1994. [8] In 1995, Tony Bell and Terry Sejnowski introduced a fast and efficient ICA algorithm based on infomax, a principle introduced by Ralph Linsker in 1987. An interesting link between ML and Infomax approaches can be found in. [24] A quite comprehensive tutorial on ML approach has been published by J-F.Cardoso in 1998. [25]
There are many algorithms available in the literature which do ICA. A largely used one, including in industrial applications, is the FastICA algorithm, developed by Hyvärinen and Oja, [26] which uses the negentropy as cost function, already proposed 7 years before by Pierre Comon in this context. [8] Other examples are rather related to blind source separation where a more general approach is used. For example, one can drop the independence assumption and separate mutually correlated signals, thus, statistically "dependent" signals. Sepp Hochreiter and Jürgen Schmidhuber showed how to obtain non-linear ICA or source separation as a by-product of regularization (1999). [27] Their method does not require a priori knowledge about the number of independent sources.
ICA can be extended to analyze non-physical signals. For instance, ICA has been applied to discover discussion topics on a bag of news list archives.
Some ICA applications are listed below: [6]
ICA can be applied through the following software:
{{cite book}}
: CS1 maint: multiple names: authors list (link)In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.
Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing.
Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess (PR) capabilities but their primary function is to distinguish and create emergent patterns. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.
In probability theory and statistics, a covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector.
In statistics, the logistic model is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression estimates the parameters of a logistic model. In binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.
In statistics, particularly in hypothesis testing, the Hotelling's T-squared distribution (T2), proposed by Harold Hotelling, is a multivariate probability distribution that is tightly related to the F-distribution and is most notable for arising as the distribution of a set of sample statistics that are natural generalizations of the statistics underlying the Student's t-distribution. The Hotelling's t-squared statistic (t2) is a generalization of Student's t-statistic that is used in multivariate hypothesis testing.
In applied statistics, total least squares is a type of errors-in-variables regression, a least squares data modeling technique in which observational errors on both dependent and independent variables are taken into account. It is a generalization of Deming regression and also of orthogonal regression, and can be applied to both linear and non-linear models.
In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations (iterations).
Estimation theory is a branch of statistics that deals with estimating the values of parameters based on measured empirical data that has a random component. The parameters describe an underlying physical setting in such a way that their value affects the distribution of the measured data. An estimator attempts to approximate the unknown parameters using the measurements. In estimation theory, two approaches are generally considered:
In mathematics, matrix calculus is a specialized notation for doing multivariable calculus, especially over spaces of matrices. It collects the various partial derivatives of a single function with respect to many variables, and/or of a multivariate function with respect to a single variable, into vectors and matrices that can be treated as single entities. This greatly simplifies operations such as finding the maximum or minimum of a multivariate function and solving systems of differential equations. The notation used here is commonly used in statistics and engineering, while the tensor index notation is preferred in physics.
FastICA is an efficient and popular algorithm for independent component analysis invented by Aapo Hyvärinen at Helsinki University of Technology. Like most ICA algorithms, FastICA seeks an orthogonal rotation of prewhitened data, through a fixed-point iteration scheme, that maximizes a measure of non-Gaussianity of the rotated components. Non-gaussianity serves as a proxy for statistical independence, which is a very strong condition and requires infinite data to verify. FastICA can also be alternatively derived as an approximative Newton iteration.
In directional statistics, the von Mises–Fisher distribution, is a probability distribution on the -sphere in . If the distribution reduces to the von Mises distribution on the circle.
The sensitivity index or discriminability index or detectability index is a dimensionless statistic used in signal detection theory. A higher index indicates that the signal can be more readily detected.
In the field of multivariate statistics, kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are performed in a reproducing kernel Hilbert space.
In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). PCR is a form of reduced rank regression.More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.
In probability theory, the family of complex normal distributions, denoted or , characterizes complex random variables whose real and imaginary parts are jointly normal. The complex normal family has three parameters: location parameter μ, covariance matrix , and the relation matrix . The standard complex normal is the univariate distribution with , , and .
In probability theory and statistics, the generalized chi-squared distribution is the distribution of a quadratic form of a multinormal variable, or a linear combination of different normal variables and squares of normal variables. Equivalently, it is also a linear sum of independent noncentral chi-square variables and a normal variable. There are several other such generalizations for which the same term is sometimes used; some of them are special cases of the family discussed here, for example the gamma distribution.
In probability theory, a logit-normal distribution is a probability distribution of a random variable whose logit has a normal distribution. If Y is a random variable with a normal distribution, and t is the standard logistic function, then X = t(Y) has a logit-normal distribution; likewise, if X is logit-normally distributed, then Y = logit(X)= log (X/(1-X)) is normally distributed. It is also known as the logistic normal distribution, which often refers to a multinomial logit version (e.g.).
Joint Approximation Diagonalization of Eigen-matrices (JADE) is an algorithm for independent component analysis that separates observed mixed signals into latent source signals by exploiting fourth order moments. The fourth order moments are a measure of non-Gaussianity, which is used as a proxy for defining independence between the source signals. The motivation for this measure is that Gaussian distributions possess zero excess kurtosis, and with non-Gaussianity being a canonical assumption of ICA, JADE seeks an orthogonal rotation of the observed mixed vectors to estimate source vectors which possess high values of excess kurtosis.
In statistics, the class of vector generalized linear models (VGLMs) was proposed to enlarge the scope of models catered for by generalized linear models (GLMs). In particular, VGLMs allow for response variables outside the classical exponential family and for more than one parameter. Each parameter can be transformed by a link function. The VGLM framework is also large enough to naturally accommodate multiple responses; these are several independent responses each coming from a particular statistical distribution with possibly different parameter values.