K-SVD

Last updated

In applied mathematics, k-SVD is a dictionary learning algorithm for creating a dictionary for sparse representations, via a singular value decomposition approach. k-SVD is a generalization of the k-means clustering method, and it works by iteratively alternating between sparse coding the input data based on the current dictionary, and updating the atoms in the dictionary to better fit the data. It is structurally related to the expectation–maximization (EM) algorithm. [1] [2] k-SVD can be found widely in use in applications such as image processing, audio processing, biology, and document analysis.

Contents

k-SVD algorithm

k-SVD is a kind of generalization of k-means, as follows. The k-means clustering can be also regarded as a method of sparse representation. That is, finding the best possible codebook to represent the data samples by nearest neighbor, by solving

which is nearly equivalent to

which is k-means that allows "weights".

The letter F denotes the Frobenius norm. The sparse representation term enforces k-means algorithm to use only one atom (column) in dictionary . To relax this constraint, the target of the k-SVD algorithm is to represent the signal as a linear combination of atoms in .

The k-SVD algorithm follows the construction flow of the k-means algorithm. However, in contrast to k-means, in order to achieve a linear combination of atoms in , the sparsity term of the constraint is relaxed so that the number of nonzero entries of each column can be more than 1, but less than a number .

So, the objective function becomes

or in another objective form

In the k-SVD algorithm, the is first fixed and the best coefficient matrix is found. As finding the truly optimal is hard, we use an approximation pursuit method. Any algorithm such as OMP, the orthogonal matching pursuit can be used for the calculation of the coefficients, as long as it can supply a solution with a fixed and predetermined number of nonzero entries .

After the sparse coding task, the next is to search for a better dictionary . However, finding the whole dictionary all at a time is impossible, so the process is to update only one column of the dictionary each time, while fixing . The update of the -th column is done by rewriting the penalty term as

where denotes the k-th row of X.

By decomposing the multiplication into sum of rank 1 matrices, we can assume the other terms are assumed fixed, and the -th remains unknown. After this step, we can solve the minimization problem by approximate the term with a matrix using singular value decomposition, then update with it. However, the new solution for the vector is not guaranteed to be sparse.

To cure this problem, define as

which points to examples that use atom (also the entries of that is nonzero). Then, define as a matrix of size , with ones on the entries and zeros otherwise. When multiplying , this shrinks the row vector by discarding the zero entries. Similarly, the multiplication is the subset of the examples that are current using the atom. The same effect can be seen on .

So the minimization problem as mentioned before becomes

and can be done by directly using SVD. SVD decomposes into . The solution for is the first column of U, the coefficient vector as the first column of . After updating the whole dictionary, the process then turns to iteratively solve X, then iteratively solve D.

Limitations

Choosing an appropriate "dictionary" for a dataset is a non-convex problem, and k-SVD operates by an iterative update which does not guarantee to find the global optimum. [2] However, this is common to other algorithms for this purpose, and k-SVD works fairly well in practice. [2] [ better source needed ]

See also

Related Research Articles

In computational mathematics, an iterative method is a mathematical procedure that uses an initial value to generate a sequence of improving approximate solutions for a class of problems, in which the n-th approximation is derived from the previous ones.

Big <i>O</i> notation Describes limiting behavior of a function

Big O notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity. Big O is a member of a family of notations invented by German mathematicians Paul Bachmann, Edmund Landau, and others, collectively called Bachmann–Landau notation or asymptotic notation. The letter O was chosen by Bachmann to stand for Ordnung, meaning the order of approximation.

<span class="mw-page-title-main">Quantum harmonic oscillator</span> Important, well-understood quantum mechanical model

The quantum harmonic oscillator is the quantum-mechanical analog of the classical harmonic oscillator. Because an arbitrary smooth potential can usually be approximated as a harmonic potential at the vicinity of a stable equilibrium point, it is one of the most important model systems in quantum mechanics. Furthermore, it is one of the few quantum-mechanical systems for which an exact, analytical solution is known.

In control theory and signal processing, a linear, time-invariant system is said to be minimum-phase if the system and its inverse are causal and stable.

<span class="mw-page-title-main">Reproducing kernel Hilbert space</span> In functional analysis, a Hilbert space

In functional analysis, a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions and in the RKHS are close in norm, i.e., is small, then and are also pointwise close, i.e., is small for all . The converse does not need to be true. Informally, this can be shown by looking at the supremum norm: the sequence of functions converges pointwise, but does not converge uniformly i.e. does not converge with respect to the supremum norm.

In numerical linear algebra, the Jacobi method is an iterative algorithm for determining the solutions of a strictly diagonally dominant system of linear equations. Each diagonal element is solved for, and an approximate value is plugged in. The process is then iterated until it converges. This algorithm is a stripped-down version of the Jacobi transformation method of matrix diagonalization. The method is named after Carl Gustav Jacob Jacobi.

In mathematics, a locally integrable function is a function which is integrable on every compact subset of its domain of definition. The importance of such functions lies in the fact that their function space is similar to Lp spaces, but its members are not required to satisfy any growth restriction on their behavior at the boundary of their domain : in other words, locally integrable functions can grow arbitrarily fast at the domain boundary, but are still manageable in a way similar to ordinary integrable functions.

Linear Programming Boosting (LPBoost) is a supervised classifier from the boosting family of classifiers. LPBoost maximizes a margin between training samples of different classes and hence also belongs to the class of margin-maximizing supervised classification algorithms. Consider a classification function

<span class="mw-page-title-main">Elliptic boundary value problem</span>

In mathematics, an elliptic boundary value problem is a special kind of boundary value problem which can be thought of as the stable state of an evolution problem. For example, the Dirichlet problem for the Laplacian gives the eventual distribution of heat in a room several hours after the heating is turned on.

In applied mathematics, discontinuous Galerkin methods (DG methods) form a class of numerical methods for solving differential equations. They combine features of the finite element and the finite volume framework and have been successfully applied to hyperbolic, elliptic, parabolic and mixed form problems arising from a wide range of applications. DG methods have in particular received considerable interest for problems with a dominant first-order part, e.g. in electrodynamics, fluid mechanics and plasma physics. Indeed, the solutions of such problems may involve strong gradients (and even discontinuities) so that classical finite element methods fail, while finite volume methods are restricted to low order approximations.

In mathematics, the class of Muckenhoupt weightsAp consists of those weights ω for which the Hardy–Littlewood maximal operator is bounded on Lp(). Specifically, we consider functions f on Rn and their associated maximal functions M( f ) defined as

In probability theory, a Markov kernel is a map that in the general theory of Markov processes plays the role that the transition matrix does in the theory of Markov processes with a finite state space.

<span class="mw-page-title-main">Total variation denoising</span> Noise removal process during image processing

In signal processing, particularly image processing, total variation denoising, also known as total variation regularization or total variation filtering, is a noise removal process (filter). It is based on the principle that signals with excessive and possibly spurious detail have high total variation, that is, the integral of the image gradient magnitude is high. According to this principle, reducing the total variation of the signal—subject to it being a close match to the original signal—removes unwanted detail whilst preserving important details such as edges. The concept was pioneered by L. I. Rudin, S. Osher, and E. Fatemi in 1992 and so is today known as the ROF model.

In mathematics, the method of steepest descent or saddle-point method is an extension of Laplace's method for approximating an integral, where one deforms a contour integral in the complex plane to pass near a stationary point, in roughly the direction of steepest descent or stationary phase. The saddle-point approximation is used with integrals in the complex plane, whereas Laplace’s method is used with real integrals.

In coding theory, generalized minimum-distance (GMD) decoding provides an efficient algorithm for decoding concatenated codes, which is based on using an errors-and-erasures decoder for the outer code.

In numerical mathematics, hierarchical matrices (H-matrices) are used as data-sparse approximations of non-sparse matrices. While a sparse matrix of dimension can be represented efficiently in units of storage by storing only its non-zero entries, a non-sparse matrix would require units of storage, and using this type of matrices for large problems would therefore be prohibitively expensive in terms of storage and computing time. Hierarchical matrices provide an approximation requiring only units of storage, where is a parameter controlling the accuracy of the approximation. In typical applications, e.g., when discretizing integral equations, preconditioning the resulting systems of linear equations, or solving elliptic partial differential equations, a rank proportional to with a small constant is sufficient to ensure an accuracy of . Compared to many other data-sparse representations of non-sparse matrices, hierarchical matrices offer a major advantage: the results of matrix arithmetic operations like matrix multiplication, factorization or inversion can be approximated in operations, where

<span class="mw-page-title-main">Matrix completion</span>

Matrix completion is the task of filling in the missing entries of a partially observed matrix, which is equivalent to performing data imputation in statistics. A wide range of datasets are naturally organized in matrix form. One example is the movie-ratings matrix, as appears in the Netflix problem: Given a ratings matrix in which each entry represents the rating of movie by customer , if customer has watched movie and is otherwise missing, we would like to predict the remaining entries in order to make good recommendations to customers on what to watch next. Another example is the document-term matrix: The frequencies of words used in a collection of documents can be represented as a matrix, where each entry corresponds to the number of times the associated term appears in the indicated document.

In the mathematical theory of probability, the drift-plus-penalty method is used for optimization of queueing networks and other stochastic systems.

Sparse dictionary learning is a representation learning method which aims at finding a sparse representation of the input data in the form of a linear combination of basic elements as well as those basic elements themselves. These elements are called atoms and they compose a dictionary. Atoms in the dictionary are not required to be orthogonal, and they may be an over-complete spanning set. This problem setup also allows the dimensionality of the signals being represented to be higher than the one of the signals being observed. The above two properties lead to having seemingly redundant atoms that allow multiple representations of the same signal but also provide an improvement in sparsity and flexibility of the representation.

The variational multiscale method (VMS) is a technique used for deriving models and numerical methods for multiscale phenomena. The VMS framework has been mainly applied to design stabilized finite element methods in which stability of the standard Galerkin method is not ensured both in terms of singular perturbation and of compatibility conditions with the finite element spaces.

References

  1. Michal Aharon; Michael Elad; Alfred Bruckstein (2006), "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation" (PDF), IEEE Transactions on Signal Processing, 54 (11): 4311–4322, Bibcode:2006ITSP...54.4311A, doi:10.1109/TSP.2006.881199, S2CID   7477309
  2. 1 2 3 Rubinstein, R., Bruckstein, A.M., and Elad, M. (2010), "Dictionaries for Sparse Representation Modeling", Proceedings of the IEEE, 98 (6): 1045–1057, CiteSeerX   10.1.1.160.527 , doi:10.1109/JPROC.2010.2040551, S2CID   2176046 {{citation}}: CS1 maint: multiple names: authors list (link)