K-SVD

In applied mathematics, k-SVD is a dictionary learning algorithm for creating a dictionary for sparse representations, via a singular value decomposition approach. k-SVD is a generalization of the k-means clustering method, and it works by iteratively alternating between sparse coding the input data based on the current dictionary and updating the atoms in the dictionary to better fit the data. It is structurally related to the expectation–maximization (EM) algorithm. [1] [2] k-SVD is widely used in applications such as image processing, audio processing, biology, and document analysis.

k-SVD algorithm

k-SVD is a generalization of k-means, as follows. k-means clustering can also be regarded as a method of sparse representation: it finds the best possible codebook to represent the data samples $\{y_i\}_{i=1}^{M}$ by nearest neighbor, by solving

\[ \min_{D,X} \|Y - DX\|_F^2 \quad \text{subject to } \forall i,\ x_i = e_k \text{ for some } k, \]

which is nearly equivalent to

\[ \min_{D,X} \|Y - DX\|_F^2 \quad \text{subject to } \forall i,\ \|x_i\|_0 = 1, \]

which is k-means that allows "weights".

The letter $F$ denotes the Frobenius norm. The sparse representation term $x_i = e_k$ forces the k-means algorithm to use only one atom (column) of the dictionary $D$. To relax this constraint, the target of the k-SVD algorithm is to represent each signal as a linear combination of atoms in $D$.
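As a concrete illustration, the following toy sketch (NumPy; the data and codebook are synthetic and the variable names are assumptions for this example) writes the k-means assignment step as sparse coding $Y \approx DX$ in which every column $x_i$ is a canonical basis vector $e_k$:

    # Toy illustration: k-means assignment as one-atom sparse coding.
    import numpy as np

    rng = np.random.default_rng(0)
    n, K, N = 2, 4, 10
    D = rng.standard_normal((n, K))        # codebook: K centroids as columns
    Y = rng.standard_normal((n, N))        # N data samples as columns

    # Nearest-neighbor assignment: pick the centroid minimizing ||y_i - d_k||_2.
    dists = np.linalg.norm(Y[:, :, None] - D[:, None, :], axis=0)  # (N, K) distance table
    assign = dists.argmin(axis=1)

    X = np.zeros((K, N))
    X[assign, np.arange(N)] = 1.0          # x_i = e_k: exactly one nonzero per column
    residual = np.linalg.norm(Y - D @ X)   # the Frobenius-norm objective being minimized

Allowing the single nonzero coefficient to take an arbitrary value instead of 1 gives the "weighted" variant with $\|x_i\|_0 = 1$.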

The k-SVD algorithm follows the construction flow of the k-means algorithm. However, in contrast to k-means, in order to achieve a linear combination of atoms in $D$, the sparsity constraint is relaxed so that the number of nonzero entries of each column $x_i$ can be more than 1, but less than a number $T_0$.

So, the objective function becomes

\[ \min_{D,X} \|Y - DX\|_F^2 \quad \text{subject to } \forall i,\ \|x_i\|_0 \le T_0, \]

or, in another objective form,

\[ \min_{D,X} \sum_i \|x_i\|_0 \quad \text{subject to } \forall i,\ \|y_i - D x_i\|_2^2 \le \epsilon. \]

In the k-SVD algorithm, $D$ is first fixed and the best coefficient matrix $X$ is found. As finding the truly optimal $X$ is hard, an approximation pursuit method is used. Any algorithm such as orthogonal matching pursuit (OMP) can be used for the calculation of the coefficients, as long as it can supply a solution with a fixed and predetermined number of nonzero entries $T_0$.
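For illustration, the sparse coding stage can be sketched with scikit-learn's orthogonal matching pursuit; the dimensions and variable names below are assumptions for this example, and any pursuit algorithm with a fixed sparsity level $T_0$ could be substituted:

    # Sparse coding stage: solve min ||y_i - D x_i||_2 s.t. ||x_i||_0 <= T0 for each column y_i.
    import numpy as np
    from sklearn.linear_model import orthogonal_mp

    rng = np.random.default_rng(0)
    n, K, N, T0 = 20, 50, 100, 3            # signal dimension, dictionary size, samples, sparsity
    D = rng.standard_normal((n, K))
    D /= np.linalg.norm(D, axis=0)           # k-SVD assumes unit-norm atoms
    Y = rng.standard_normal((n, N))

    X = orthogonal_mp(D, Y, n_nonzero_coefs=T0)   # coefficient matrix of shape (K, N)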

After the sparse coding task, the next step is to search for a better dictionary $D$. However, finding the whole dictionary all at once is impractical, so the process updates only one column of $D$ at a time, while fixing $X$. The update of the $k$-th column $d_k$ is done by rewriting the penalty term as

\[ \|Y - DX\|_F^2 = \Big\|Y - \sum_{j=1}^{K} d_j x_T^j\Big\|_F^2 = \Big\|\Big(Y - \sum_{j \ne k} d_j x_T^j\Big) - d_k x_T^k\Big\|_F^2 = \|E_k - d_k x_T^k\|_F^2, \]

where $x_T^k$ denotes the $k$-th row of $X$.

By decomposing the multiplication $DX$ into a sum of $K$ rank-1 matrices, we can assume the other $K-1$ terms are fixed while the $k$-th one remains unknown. After this step, we could solve the minimization problem by approximating the term $E_k$ with a rank-1 matrix using singular value decomposition, then updating $d_k$ with it. However, the new solution for the vector $x_T^k$ is not guaranteed to be sparse.

To cure this problem, define $\omega_k$ as

\[ \omega_k = \{\, i \mid 1 \le i \le N,\ x_T^k(i) \ne 0 \,\}, \]

which points to the examples $\{y_i\}$ that use atom $d_k$ (equivalently, the entries of $x_T^k$ that are nonzero). Then, define $\Omega_k$ as a matrix of size $N \times |\omega_k|$, with ones on the $(\omega_k(i), i)$-th entries and zeros otherwise. When multiplying $x_R^k = x_T^k \Omega_k$, this shrinks the row vector by discarding the zero entries. Similarly, the multiplication $Y_R^k = Y \Omega_k$ is the subset of the examples that currently use the atom $d_k$. The same effect applies to $E_R^k = E_k \Omega_k$.

So the minimization problem as mentioned before becomes

\[ \|E_k \Omega_k - d_k x_T^k \Omega_k\|_F^2 = \|E_R^k - d_k x_R^k\|_F^2 \]

and can be solved by directly using SVD. The SVD decomposes $E_R^k$ into $U \Delta V^T$. The solution for $d_k$ is the first column of $U$, and the coefficient vector $x_R^k$ is the first column of $V$ multiplied by $\Delta(1,1)$. After updating the whole dictionary, the process turns back to iteratively solving for $X$, then iteratively solving for $D$.
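The dictionary-update stage described above can be sketched as follows (NumPy; the function name and the setup are assumptions chosen for illustration): for each atom, the residual matrix is restricted to the signals that actually use it, and its best rank-1 approximation is taken from an SVD.

    # One k-SVD dictionary-update sweep, given Y (n x N), D (n x K), X (K x N).
    import numpy as np

    def ksvd_dictionary_update(Y, D, X):
        D, X = D.copy(), X.copy()
        K = D.shape[1]
        for k in range(K):
            omega_k = np.flatnonzero(X[k, :])          # indices of signals that use atom d_k
            if omega_k.size == 0:
                continue                               # unused atom: leave it (or re-initialize)
            # Error with the k-th atom's contribution removed, restricted to omega_k.
            E_kR = (Y - D @ X + np.outer(D[:, k], X[k, :]))[:, omega_k]
            # Best rank-1 approximation of E_kR via SVD.
            U, s, Vt = np.linalg.svd(E_kR, full_matrices=False)
            D[:, k] = U[:, 0]                          # new atom: first left singular vector
            X[k, omega_k] = s[0] * Vt[0, :]            # new coefficients: scaled first right singular vector
        return D, X

Alternating this update with a sparse coding stage (for example, the OMP sketch above) gives one full k-SVD iteration.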

Limitations

Choosing an appropriate "dictionary" for a dataset is a non-convex problem, and k-SVD operates by an iterative update that is not guaranteed to find the global optimum. [2] However, this is common to other algorithms for this purpose, and k-SVD works fairly well in practice. [2]


References

  1. Aharon, Michal; Elad, Michael; Bruckstein, Alfred (2006), "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation" (PDF), IEEE Transactions on Signal Processing, 54 (11): 4311–4322, Bibcode:2006ITSP...54.4311A, doi:10.1109/TSP.2006.881199, S2CID 7477309
  2. Rubinstein, R.; Bruckstein, A. M.; Elad, M. (2010), "Dictionaries for Sparse Representation Modeling", Proceedings of the IEEE, 98 (6): 1045–1057, CiteSeerX 10.1.1.160.527, doi:10.1109/JPROC.2010.2040551, S2CID 2176046