Structured sparsity regularization

Structured sparsity regularization is a class of methods, and an area of research in statistical learning theory, that extend and generalize sparsity regularization learning methods. [1] Both sparsity and structured sparsity regularization methods seek to exploit the assumption that the output variable (i.e., response, or dependent variable) to be learned can be described by a reduced number of variables in the input space (i.e., the domain, space of features or explanatory variables). Sparsity regularization methods focus on selecting the input variables that best describe the output. Structured sparsity regularization methods generalize and extend sparsity regularization methods by allowing for optimal selection over structures like groups or networks of input variables. [2] [3]

Common motivations for the use of structured sparsity methods are model interpretability, high-dimensional learning (where the dimensionality of the input space may be higher than the number of observations), and reduction of computational complexity. [4] Moreover, structured sparsity methods allow one to incorporate prior assumptions on the structure of the input variables, such as overlapping groups, [2] non-overlapping groups, and acyclic graphs. [3] Examples of uses of structured sparsity methods include face recognition, [5] magnetic resonance image (MRI) processing, [6] socio-linguistic analysis in natural language processing, [7] and analysis of genetic expression in breast cancer. [8]

Sparsity regularization

Consider the linear kernel regularized empirical risk minimization problem with a loss function $V(y_i, f(x_i))$ and the $\ell_0$ "norm" as the regularization penalty:

$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n V(y_i, \langle w, x_i \rangle) + \lambda \|w\|_0,$$

where $x_i \in \mathbb{R}^d$, $\lambda > 0$, and $\|w\|_0$ denotes the $\ell_0$ "norm", defined as the number of nonzero entries of the vector $w$. The coefficient vector $w$ is said to be sparse if $\|w\|_0 = s < d$, which means that the output can be described by a small subset of the input variables.

More generally, assume a dictionary $\phi_j : X \to \mathbb{R}$, with $j = 1, \dots, p$, is given, such that the target function $f(x)$ of a learning problem can be written as:

$$f(x) = \sum_{j=1}^p \phi_j(x) w_j, \quad \forall x \in X,$$

The $\ell_0$ norm, as the number of non-zero components of $w$, is defined as

$$\|f\|_0 = \|w\|_0 = |\{ j \mid w_j \neq 0,\ j \in \{1, \dots, p\} \}|,$$

where $|A|$ is the cardinality of set $A$.

$f$ is said to be sparse if $\|f\|_0 = s < p$.

However, while using the $\ell_0$ norm for regularization favors sparser solutions, it is computationally difficult to use, and in addition it is not convex. A computationally more feasible alternative is the $\ell_1$ norm, which has been shown to still favor sparse solutions and is additionally convex. [4]
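
As a concrete illustration, the following minimal sketch (not taken from the cited references; the data, dimensions, and regularization strength are arbitrary choices) fits an $\ell_1$-penalized linear model with scikit-learn and counts the nonzero coefficients, showing how the convex $\ell_1$ penalty recovers a sparse solution:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 50, 200                       # fewer observations than input variables
w_true = np.zeros(d)
w_true[:5] = rng.normal(size=5)      # only 5 relevant input variables
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)

w_hat = Lasso(alpha=0.1).fit(X, y).coef_   # l1-penalized empirical risk minimization

print("||w_true||_0 =", np.count_nonzero(w_true))   # the l0 "norm": number of nonzeros
print("||w_hat||_0  =", np.count_nonzero(w_hat))    # the l1 penalty yields a sparse estimate
```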

Structured sparsity regularization

Structured sparsity regularization extends and generalizes the variable selection problem that characterizes sparsity regularization. [2] [3] Consider the above regularized empirical risk minimization problem with a general kernel and associated feature map $\phi_j : X \to \mathbb{R}$, with $j = 1, \dots, p$.

The regularization term $\lambda \|w\|_1 = \lambda \sum_{j=1}^p |w_j|$ penalizes each component $w_j$ independently, which means that the algorithm will suppress input variables independently of each other.

In several situations we may want to impose more structure in the regularization process, so that, for example, input variables are suppressed according to predefined groups. Structured sparsity regularization methods make it possible to impose such structure by adding structure to the norms defining the regularization term.

Structures and norms

Non-overlapping groups: group Lasso

The non-overlapping group case is the most basic instance of structured sparsity. In it, an a priori partition of the coefficient vector $w$ into $G$ non-overlapping groups is assumed. Let $w_g$ be the vector of coefficients in group $g$; we can then define a regularization term and its group norm as

$$\lambda R(w) = \lambda \sum_{g=1}^G \|w_g\|_g,$$

where $\|w_g\|_g = \sqrt{\sum_{j=1}^{|G_g|} (w_g^j)^2}$ is the group $\ell_2$ norm, $G_g$ is group $g$, and $w_g^j$ is the $j$-th component of group $G_g$.

The above norm is also referred to as group Lasso. [2] This regularizer will force entire coefficient groups towards zero, rather than individual coefficients. As the groups are non-overlapping, the set of non-zero coefficients can be obtained as the union of the groups that were not set to zero, and conversely for the set of zero coefficients.
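
As an illustration, the following short sketch (the coefficient values and the grouping are hypothetical) computes the group Lasso penalty for a given non-overlapping partition of the coefficients:

```python
import numpy as np

def group_lasso_norm(w, groups):
    """Sum of the Euclidean (l2) norms of the coefficients in each group
    of a non-overlapping partition."""
    return sum(np.linalg.norm(w[g]) for g in groups)

w = np.array([0.0, 0.0, 0.0, 1.5, -2.0, 0.3])
groups = [[0, 1, 2], [3, 4], [5]]      # a priori partition of the coefficients
print(group_lasso_norm(w, groups))     # the first (all-zero) group contributes nothing
```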

Overlapping groups

Overlapping groups is the structured sparsity case where a variable can belong to more than one group $g$. This case is often of interest as it can represent a more general class of relationships among variables than non-overlapping groups can, such as tree structures or other types of graphs. [3] [8]

There are two types of overlapping group sparsity regularization approaches, which are used to model different types of input variable relationships:

Intersection of complements: group Lasso

The intersection of complements approach is used in cases when we want to select only those input variables that have positive coefficients in all groups they belong to. Consider again the group Lasso for a regularized empirical risk minimization problem:

$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n V(y_i, \langle w, x_i \rangle) + \lambda \sum_{g=1}^G \|w_g\|_g,$$

where $\|w_g\|_g$ is the group $\ell_2$ norm, $G_g$ is group $g$, and $w_g^j$ is the $j$-th component of group $G_g$.

As in the non-overlapping groups case, the group Lasso regularizer will potentially set entire groups of coefficients to zero. Selected variables are those with nonzero coefficients. However, since in this case groups may overlap, the set of selected variables is obtained as the intersection of the complements of those groups that are set to zero.

This intersection of complements selection criterion implies the modeling choice that we allow some coefficients within a particular group to be set to zero, while others within the same group may remain positive. In other words, coefficients within a group may differ depending on the several group memberships that each variable within the group may have.
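
The following sketch (with a hypothetical coefficient vector and grouping) illustrates this selection rule: groups whose coefficients are all zero are discarded, and the selected variables are those remaining in the intersection of the complements of the zeroed groups:

```python
import numpy as np

def selected_by_intersection_of_complements(w, groups, tol=1e-12):
    """Selected variables = intersection of the complements of the groups
    whose coefficients were all driven to zero by the group Lasso."""
    selected = set(range(len(w)))
    for g in groups:
        if np.all(np.abs(w[g]) < tol):   # this group was set to zero
            selected -= set(g)           # drop all of its variables
    return sorted(selected)

w = np.array([0.0, 0.0, 0.4, 0.7])
groups = [[0, 1], [1, 2], [2, 3]]        # overlapping groups
print(selected_by_intersection_of_complements(w, groups))   # [2, 3]
```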

Union of groups: latent group Lasso

A different approach is to consider the union of groups for variable selection. This approach captures the modeling situation where variables can be selected as long as they belong to at least one group with positive coefficients. This modeling perspective implies that we want to preserve group structure.

The formulation of the union of groups approach is also referred to as latent group Lasso, and requires modifying the group norm considered above and introducing the following regularizer: [3]

$$R(w) = \inf\left\{ \sum_{g=1}^G \|\bar{w}_g\|_2 \ : \ w = \sum_{g=1}^G \tilde{w}_g \right\},$$

where $\bar{w}_g \in \mathbb{R}^{|G_g|}$ is the vector of coefficients of group $g$, and $\tilde{w}_g \in \mathbb{R}^d$ is a vector with coefficients $\bar{w}_g^j$ for all variables $j$ in group $G_g$ and zero in all others, i.e., $\tilde{w}_g^j = \bar{w}_g^j$ if $j$ is in group $G_g$ and $\tilde{w}_g^j = 0$ otherwise.

This regularizer can be interpreted as effectively replicating variables that belong to more than one group, therefore conserving group structure. As intended by the union of groups approach, requiring $w = \sum_{g=1}^G \tilde{w}_g$ produces a vector of weights $w$ that effectively sums up the weights of all variables across all groups they belong to.
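
The replication idea can be sketched as follows (a hypothetical construction, not code from the cited references): each overlapping group receives its own copy of its columns, a standard non-overlapping group Lasso is then run on the expanded design matrix, and the latent per-group coefficient vectors are summed back into a single $w$:

```python
import numpy as np

def replicate_columns(X, groups):
    """Duplicate the columns of X so that each (possibly overlapping) group gets
    its own copy; a standard non-overlapping group Lasso on the expanded design
    matrix then corresponds to the latent group Lasso on the original problem."""
    X_dup = np.hstack([X[:, g] for g in groups])
    new_groups, start = [], 0
    for g in groups:                                  # bookkeeping: now-disjoint groups
        new_groups.append(list(range(start, start + len(g))))
        start += len(g)
    return X_dup, new_groups

def recover_w(v_dup, groups, d):
    """Sum the latent per-group coefficient vectors back into a single w of length d."""
    w, start = np.zeros(d), 0
    for g in groups:
        w[g] += v_dup[start:start + len(g)]
        start += len(g)
    return w

X = np.arange(12.0).reshape(3, 4)
groups = [[0, 1], [1, 2, 3]]                      # variable 1 belongs to both groups
X_dup, new_groups = replicate_columns(X, groups)
print(X_dup.shape, new_groups)                    # (3, 5) [[0, 1], [2, 3, 4]]

v_dup = np.array([0.0, 2.0, 1.0, 0.0, -1.0])      # latent coefficients, one block per group
print(recover_w(v_dup, groups, d=4))              # [ 0.  3.  0. -1.]
```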

Issues with group Lasso regularization and alternative approaches

The objective function using group lasso consists of an error function, which is generally required to be convex but not necessarily strongly convex, and a group regularization term. An issue with this objective function is that it is convex but not necessarily strongly convex, and thus generally does not lead to unique solutions. [9]

An example of a way to fix this is to introduce the squared $\ell_2$ norm of the weight vector as an additional regularization term, while keeping the regularization term from the group lasso approach. [9] If the coefficient of the squared $\ell_2$ norm term is greater than $0$, then, because the squared $\ell_2$ norm term is strongly convex, the resulting objective function will also be strongly convex. [9] Provided that the coefficient is suitably small but still positive, the weight vector minimizing the resulting objective function is generally very close to a weight vector that minimizes the objective function that would result from removing the squared $\ell_2$ norm term altogether; the latter scenario corresponds to the group Lasso approach. [9] Thus this approach allows for simpler optimization while maintaining sparsity. [9]
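
A minimal sketch of the resulting objective (the quadratic loss, the hyperparameters `lam` and `mu`, and the data are illustrative assumptions) makes the construction explicit: a convex error term, the group Lasso penalty, and a small squared $\ell_2$ term that makes the whole objective strongly convex:

```python
import numpy as np

def objective(w, X, y, groups, lam, mu):
    """Group Lasso objective plus a small squared l2 term: for any mu > 0 the
    squared-norm term is strongly convex, so the whole objective is strongly
    convex and its minimizer is unique."""
    error = 0.5 * np.mean((X @ w - y) ** 2)                    # convex quadratic loss
    group_penalty = lam * sum(np.linalg.norm(w[g]) for g in groups)
    ridge_penalty = 0.5 * mu * np.dot(w, w)                    # strongly convex addition
    return error + group_penalty + ridge_penalty

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(20, 6)), rng.normal(size=20), rng.normal(size=6)
groups = [[0, 1, 2], [3, 4, 5]]
print(objective(w, X, y, groups, lam=0.1, mu=1e-3))
```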

Norms based on the structure over input variables

See: Submodular set function

Besides the norms discussed above, other norms used in structured sparsity methods include hierarchical norms and norms defined on grids. These norms arise from submodular functions and allow the incorporation of prior assumptions on the structure of the input variables. In the context of hierarchical norms, this structure can be represented as a directed acyclic graph over the variables while in the context of grid-based norms, the structure can be represented using a grid. [10] [11] [12] [13] [14] [15]

Hierarchical norms

See: Unsupervised learning

Unsupervised learning methods are often used to learn the parameters of latent variable models. Latent variable models are statistical models in which, in addition to the observed variables, there also exists a set of latent variables that is not observed. Often in such models, "hierarchies" are assumed between the variables of the system; this system of hierarchies can be represented using directed acyclic graphs.

Hierarchies of latent variables have emerged as a natural structure in several applications, notably to model text documents. [11] Hierarchical models using Bayesian non-parametric methods have been used to learn topic models, [10] which are statistical models for discovering the abstract "topics" that occur in a collection of documents. Hierarchies have also been considered in the context of kernel methods. [13] Hierarchical norms have been applied to bioinformatics, [12] computer vision and topic models. [14]

Norms defined on grids

If the structure assumed over the variables is in the form of a 1D, 2D or 3D grid, then submodular functions based on overlapping groups can be considered as norms, leading to stable sets equal to rectangular or convex shapes. [13] Such methods have applications in computer vision. [15]

Algorithms for computation

Best subset selection problem

The problem of choosing the best subset of input variables can be naturally formulated under a penalization framework as: [4]

$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n V(y_i, \langle w, x_i \rangle) + \lambda \|w\|_0,$$

where $\|w\|_0$ denotes the $\ell_0$ "norm", defined as the number of nonzero entries of the vector $w$.

Although this formulation makes sense from a modeling perspective, it is computationally infeasible, as it is equivalent to an exhaustive search evaluating all possible subsets of variables. [4]
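
The combinatorial nature of the problem can be seen in the following sketch (the helper name `best_subset` and the synthetic data are hypothetical), which performs the exhaustive search by evaluating a least-squares fit on every candidate subset:

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Exhaustively evaluate every subset of at most k variables by least squares.
    The number of candidate subsets grows combinatorially with the number of
    input variables, which is why this formulation is infeasible beyond small problems."""
    d = X.shape[1]
    best_err, best_S = np.inf, ()
    for size in range(1, k + 1):
        for S in itertools.combinations(range(d), size):
            cols = list(S)
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            err = np.mean((X[:, cols] @ w - y) ** 2)
            if err < best_err:
                best_err, best_S = err, S
    return best_S, best_err

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
y = X[:, 2] - 2.0 * X[:, 5] + 0.01 * rng.normal(size=30)
print(best_subset(X, y, k=2))    # expected to recover the subset (2, 5)
```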

Two main approaches for solving the optimization problem are: 1) greedy methods, such as step-wise regression in statistics, or matching pursuit in signal processing; and 2) convex relaxation formulation approaches and proximal gradient optimization methods.

Convex relaxation

A natural approximation of the best subset selection problem is $\ell_1$ norm regularization: [4]

$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n V(y_i, \langle w, x_i \rangle) + \lambda \|w\|_1,$$

Such a scheme is called basis pursuit or the Lasso, which replaces the $\ell_0$ "norm" with the convex but non-differentiable $\ell_1$ norm.

Proximal gradient methods

Proximal gradient methods, also called forward-backward splitting, are optimization methods useful for minimizing functions with a convex and differentiable component, and a convex potentially non-differentiable component.

As such, proximal gradient methods are useful for solving sparsity and structured sparsity regularization problems [9] of the following form:

$$\min_{w \in \mathbb{R}^d} F(w) + R(w),$$

where $F(w)$ is a convex and differentiable loss function, such as the quadratic loss, and $R(w)$ is a convex, potentially non-differentiable regularizer, such as the $\ell_1$ norm.
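
For the lasso case, a minimal sketch of such a method is the iterative soft-thresholding algorithm (ISTA): a gradient step on the differentiable quadratic loss followed by the proximity operator of the $\ell_1$ norm. The step size and the synthetic data below are illustrative choices:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximity operator of t * ||.||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """Proximal gradient (forward-backward splitting) for the lasso objective
    (1/2n)||Xw - y||^2 + lam * ||w||_1: a gradient step on the smooth quadratic
    loss followed by the proximity operator of the non-differentiable l1 norm."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n                        # forward (gradient) step
        w = soft_threshold(w - step * grad, step * lam)     # backward (proximal) step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
w_true = np.zeros(10)
w_true[[1, 4]] = [2.0, -1.5]
y = X @ w_true
print(np.round(ista(X, y, lam=0.05), 2))    # approximately recovers w_true, up to shrinkage
```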

Connections to other areas of machine learning

Connection to multiple kernel learning

Structured sparsity regularization can be applied in the context of multiple kernel learning. [16] Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm.

In the algorithms mentioned above, a whole space was taken into consideration at once and was partitioned into groups, i.e. subspaces. A complementary point of view is to consider the case in which distinct spaces are combined to obtain a new one. It is useful to discuss this idea considering finite dictionaries. Finite dictionaries with linearly independent elements - these elements are also known as atoms - refer to finite sets of linearly independent basis functions, the linear combinations of which define hypothesis spaces. Finite dictionaries can be used to define specific kernels, as will be shown. [16] Assume for this example that rather than only one dictionary, several finite dictionaries are considered.

For simplicity, the case in which there are only two dictionaries $A = \{a_j : j = 1, \dots, n\}$ and $B = \{b_j : j = 1, \dots, m\}$, where $n$ and $m$ are integers, will be considered. The atoms in $A$ as well as the atoms in $B$ are assumed to be linearly independent. Let $D = A \cup B$ be the union of the two dictionaries. Consider the linear space of functions $H$ given by linear combinations of the form

$$f(x) = \sum_{j=1}^n \alpha_j a_j(x) + \sum_{j=1}^m \beta_j b_j(x), \qquad x \in X,$$

for some coefficient vectors $\alpha \in \mathbb{R}^n$ and $\beta \in \mathbb{R}^m$. Assume the atoms in $D$ to still be linearly independent, or equivalently, that the map $(\alpha, \beta) \mapsto f$ is one-to-one. The functions in the space $H$ can be seen as the sums of two components, one in the space $H_A$, the linear combinations of atoms in $A$, and one in $H_B$, the linear combinations of the atoms in $B$.

One choice of norm on this space is $\|f\| = \|\alpha\|_2 + \|\beta\|_2$. Note that we can now view $H$ as a function space in which $H_A$ and $H_B$ are subspaces. In view of the linear independence assumption, $H_A$ can be identified with $\mathbb{R}^n$ and $H_B$ with $\mathbb{R}^m$, respectively. The norm mentioned above can be seen as the group norm in $H$ associated to the subspaces $H_A$, $H_B$, providing a connection to structured sparsity regularization.

Here, $H_A$, $H_B$ and $H$ can be seen to be the reproducing kernel Hilbert spaces with corresponding feature maps $\Phi_A : X \to \mathbb{R}^n$, given by $\Phi_A(x) = (a_1(x), \dots, a_n(x))$, $\Phi_B : X \to \mathbb{R}^m$, given by $\Phi_B(x) = (b_1(x), \dots, b_m(x))$, and $\Phi_D : X \to \mathbb{R}^{n+m}$, given by the concatenation of $\Phi_A$ and $\Phi_B$, respectively.

In the structured sparsity regularization approach to this scenario, the relevant groups of variables which the group norms consider correspond to the subspaces $H_A$ and $H_B$. This approach promotes setting the groups of coefficients corresponding to these subspaces to zero, as opposed to only individual coefficients, yielding sparse multiple kernel learning.
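
The following sketch (the two feature maps and all data are hypothetical) makes the construction concrete: two explicit dictionaries are realized as feature maps, their concatenation plays the role of the combined feature map $\Phi_D$, and a group norm over the two coefficient blocks penalizes each kernel's block as a whole:

```python
import numpy as np

def phi_A(X):
    return X            # atoms of the first dictionary: the raw input variables

def phi_B(X):
    return X ** 2       # atoms of the second dictionary: their squares

def mkl_group_norm(c, blocks):
    """Group norm over the per-dictionary blocks of coefficients."""
    return sum(np.linalg.norm(c[b]) for b in blocks)

X = np.random.default_rng(0).normal(size=(10, 3))
Phi = np.hstack([phi_A(X), phi_B(X)])           # concatenated feature map (combined kernel)
blocks = [[0, 1, 2], [3, 4, 5]]                 # one block of coefficients per dictionary
c = np.array([0.0, 0.0, 0.0, 1.0, -0.5, 0.2])   # the first kernel is switched off entirely
y_hat = Phi @ c                                 # predictions from the combined feature map
print(y_hat.shape, mkl_group_norm(c, blocks))
```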

The above reasoning directly generalizes to any finite number of dictionaries, or feature maps. It can be extended to feature maps inducing infinite-dimensional hypothesis spaces. [16]

When sparse multiple kernel learning is useful

Considering sparse multiple kernel learning is useful in several situations including the following:

  • Data fusion: When each kernel corresponds to a different kind of modality/feature.
  • Nonlinear variable selection: Consider kernels depending on only one dimension of the input.

Generally, sparse multiple kernel learning is particularly useful when there are many kernels and model selection and interpretability are important. [16]

Additional uses and applications

Structured sparsity regularization methods have been used in a number of settings where it is desired to impose an a priori input variable structure on the regularization process. Some such applications are:

  • Face recognition [5]
  • Magnetic resonance image (MRI) processing [6]
  • Socio-linguistic analysis in natural language processing [7]
  • Analysis of genetic expression in breast cancer [8]


References

  1. Rosasco, Lorenzo; Poggio, Tomaso (December 2014). A Regularization Tour of Machine Learning, MIT-9.520 Lecture Notes.
  2. Yuan, M.; Lin, Y. (2006). "Model selection and estimation in regression with grouped variables". J. R. Stat. Soc. B. 68 (1): 49–67. CiteSeerX 10.1.1.79.2062. doi:10.1111/j.1467-9868.2005.00532.x. S2CID 6162124.
  3. Obozinski, G.; Jacob, L.; Vert, J.-P. (2011). "Group lasso with overlaps: the latent group lasso approach". arXiv:1110.0413 [stat.ML].
  4. Rosasco, L. Lecture 10 of the Lecture Notes for 9.520: Statistical Learning Theory and Applications. Massachusetts Institute of Technology, Fall 2014. Available at https://www.mit.edu/~9.520/fall14/slides/class18/class18_sparsity.pdf
  5. Jia, Kui; et al. (2012). "Robust and Practical Face Recognition via Structured Sparsity". In Andrew Fitzgibbon; Svetlana Lazebnik; Pietro Perona; Yoichi Sato; Cordelia Schmid (eds.). Computer Vision – ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part IV.
  6. Chen, Chen; et al. (2012). "Compressive Sensing MRI with Wavelet Tree Sparsity". Proceedings of the 26th Annual Conference on Neural Information Processing Systems. Vol. 25. Curran Associates. pp. 1115–1123.
  7. Eisenstein, Jacob; et al. (2011). "Discovering Sociolinguistic Associations with Structured Sparsity". Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics.
  8. Jacob, Laurent; et al. (2009). "Group Lasso with Overlap and Graph Lasso". Proceedings of the 26th International Conference on Machine Learning.
  9. Villa, S.; Rosasco, L.; Mosci, S.; Verri, A. (2012). "Proximal methods for the latent group lasso penalty". arXiv:1209.0368 [math.OC].
  10. Blei, D.; Ng, A.; Jordan, M. (2003). "Latent Dirichlet Allocation". J. Mach. Learn. Res. 3: 993–1022.
  11. Bengio, Y. (2009). "Learning deep architectures for AI". Foundations and Trends in Machine Learning. 2 (1).
  12. Kim, S.; Xing, E. (2010). "Tree-guided group Lasso for multi-task regression with structured sparsity". Proc. ICML.
  13. Jenatton, Rodolphe; Audibert, Jean-Yves; Bach, Francis (2011). "Structured Variable Selection with Sparsity-Inducing Norms". Journal of Machine Learning Research. 12: 2777–2824. arXiv:0904.3523.
  14. Jenatton, R.; Mairal, J.; Obozinski, G.; Bach, F. (2010). "Proximal methods for sparse hierarchical dictionary learning". Proc. ICML.
  15. Jenatton, R.; Obozinski, G.; Bach, F. (2009). "Structured sparse principal component analysis". Proc. AISTATS.
  16. Rosasco, Lorenzo; Poggio, Tomaso (Fall 2015). "Chapter 6". MIT 9.520 Course Notes, Fall 2015.