Bregman method

Last updated

The Bregman method is an iterative algorithm to solve certain convex optimization problems involving regularization. [1] The original version is due to Lev M. Bregman, who published it in 1967. [2]

Contents

The algorithm is a row-action method accessing constraint functions one by one and the method is particularly suited for large optimization problems where constraints can be efficiently enumerated[ citation needed ]. The algorithm works particularly well for regularizers such as the norm, where it converges very quickly because of an error cancellation effect. [3]

Algorithm

In order to be able to use the Bregman method, one must frame the problem of interest as finding , where is a regularizing function such as . [3]

The Bregman distance is defined as where belongs to the subdifferential of at (which we denoted ). [3] [4] One performs the iteration , with a constant to be chosen by the user (and the minimization performed by an ordinary convex optimization algorithm), [3] or , with chosen each time to be a member of . [4]

The algorithm starts with a pair of primal and dual variables. Then, for each constraint a generalized projection onto its feasible set is performed, updating both the constraint's dual variable and all primal variables for which there are non-zero coefficients in the constraint functions gradient. In case the objective is strictly convex and all constraint functions are convex, the limit of this iterative projection converges to the optimal primal dual pair.[ citation needed ]

In the case of a basis pursuit-type problem , the Bregman method is equivalent to ordinary gradient descent on the dual problem . [5] An exact regularization-type effect also occurs in this case; if exceeds a certain threshold, the optimum value of is precisely the optimum solution of . [3] [5]

Applications

The Bregman method or its generalizations can be applied to:

Generalizations and drawbacks

The method has links to the method of multipliers and dual ascent method (through the so-called Bregman alternating direction method of multipliers, [10] [7] generalizing the alternating direction method of multipliers [8] ) and multiple generalizations exist.

One drawback of the method is that it is only provably convergent if the objective function is strictly convex. In case this can not be ensured, as for linear programs or non-strictly convex quadratic programs, additional methods such as proximal gradient methods have been developed.[ citation needed ] In the case of the Rudin-Osher-Fatemi model of image denoising[ clarification needed ], the Bregman method provably converges. [11]

Some generalizations of the Bregman method include:

Linearized Bregman

In the Linearized Bregman method, one linearizes the intermediate objective functions by replacing the second term with (which approximates the second term near ) and adding the penalty term for a constant . The result is much more computationally tractable, especially in basis pursuit-type problems. [4] [5] In the case of a generic basis pursuit problem , one can express the iteration as for each component , where we define . [4]

Sometimes, when running the Linearized Bregman method, there are periods of "stagnation" where the residual[ clarification needed ] is almost constant. To alleviate this issue, one can use the Linearized Bregman method with kicking, where one essentially detects the beginning of a stagnation period, then predicts and skips to the end of it. [4] [5]

Since Linearized Bregman is mathematically equivalent to gradient descent, it can be accelerated with methods to accelerate gradient descent, such as line search, L-BGFS, Barzilai-Borwein steps, or the Nesterov method; the last has been proposed as the accelerated linearized Bregman method. [5] [9]

Split Bregman

The Split Bregman method solves problems of the form , where and are both convex, [4] particularly problems of the form . [6] We start by rewriting it as the constrained optimization problem , then relax it into where is a constant. By defining , one reduces the problem to one that can be solved with the ordinary Bregman algorithm. [4] [6]

The Split Bregman method has been generalized to optimization over complex numbers using Wirtinger derivatives. [1]

Related Research Articles

<span class="mw-page-title-main">Support vector machine</span> Set of methods for supervised statistical learning

In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues SVMs are one of the most robust prediction methods, being based on statistical learning frameworks or VC theory proposed by Vapnik and Chervonenkis (1974). Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class it belongs to. A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use.

Convex optimization is a subfield of mathematical optimization that studies the problem of minimizing convex functions over convex sets. Many classes of convex optimization problems admit polynomial-time algorithms, whereas mathematical optimization is in general NP-hard.

<span class="mw-page-title-main">Regularization (mathematics)</span> Technique to make a model more generalizable and transferable

In mathematics, statistics, finance, computer science, particularly in machine learning and inverse problems, regularization is a process that changes the result answer to be "simpler". It is often used to obtain results for ill-posed problems or to prevent overfitting.

The Frank–Wolfe algorithm is an iterative first-order optimization algorithm for constrained convex optimization. Also known as the conditional gradient method, reduced gradient algorithm and the convex combination algorithm, the method was originally proposed by Marguerite Frank and Philip Wolfe in 1956. In each iteration, the Frank–Wolfe algorithm considers a linear approximation of the objective function, and moves towards a minimizer of this linear function.

In mathematical optimization theory, duality or the duality principle is the principle that optimization problems may be viewed from either of two perspectives, the primal problem or the dual problem. If the primal is a minimization problem then the dual is a maximization problem. Any feasible solution to the primal (minimization) problem is at least as large as any feasible solution to the dual (maximization) problem. Therefore, the solution to the primal is an upper bound to the solution of the dual, and the solution of the dual is a lower bound to the solution of the primal. This fact is called weak duality.

<span class="mw-page-title-main">Ellipsoid method</span> Iterative method for minimizing convex functions

In mathematical optimization, the ellipsoid method is an iterative method for minimizing convex functions. When specialized to solving feasible linear optimization problems with rational data, the ellipsoid method is an algorithm which finds an optimal solution in a number of steps that is polynomial in the input size.

<span class="mw-page-title-main">Autoencoder</span> Neural network that learns efficient data encoding in an unsupervised manner

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction.

Penalty methods are a certain class of algorithms for solving constrained optimization problems.

Compressed sensing is a signal processing technique for efficiently acquiring and reconstructing a signal, by finding solutions to underdetermined linear systems. This is based on the principle that, through optimization, the sparsity of a signal can be exploited to recover it from far fewer samples than required by the Nyquist–Shannon sampling theorem. There are two conditions under which recovery is possible. The first one is sparsity, which requires the signal to be sparse in some domain. The second one is incoherence, which is applied through the isometric property, which is sufficient for sparse signals.

<span class="mw-page-title-main">Total variation denoising</span>

In signal processing, particularly image processing, total variation denoising, also known as total variation regularization or total variation filtering, is a noise removal process (filter). It is based on the principle that signals with excessive and possibly spurious detail have high total variation, that is, the integral of the absolute image gradient is high. According to this principle, reducing the total variation of the signal—subject to it being a close match to the original signal—removes unwanted detail whilst preserving important details such as edges. The concept was pioneered by L. I. Rudin, S. Osher, and E. Fatemi in 1992 and so is today known as the ROF model.

In applied mathematics and statistics, basis pursuit denoising (BPDN) refers to a mathematical optimization problem of the form

Augmented Lagrangian methods are a certain class of algorithms for solving constrained optimization problems. They have similarities to penalty methods in that they replace a constrained optimization problem by a series of unconstrained problems and add a penalty term to the objective; the difference is that the augmented Lagrangian method adds yet another term, designed to mimic a Lagrange multiplier. The augmented Lagrangian is related to, but not identical with the method of Lagrange multipliers.

<span class="mw-page-title-main">Step detection</span>

In statistics and signal processing, step detection is the process of finding abrupt changes in the mean level of a time series or signal. It is usually considered as a special case of the statistical method known as change detection or change point detection. Often, the step is small and the time series is corrupted by some kind of noise, and this makes the problem challenging because the step may be hidden by the noise. Therefore, statistical and/or signal processing algorithms are often required.

The Landweber iteration or Landweber algorithm is an algorithm to solve ill-posed linear inverse problems, and it has been extended to solve non-linear problems that involve constraints. The method was first proposed in the 1950s by Louis Landweber, and it can be now viewed as a special case of many other more general methods.

<span class="mw-page-title-main">Proximal gradient method</span>

Proximal gradient methods are a generalized form of projection used to solve non-differentiable convex optimization problems.

Proximal gradientmethods for learning is an area of research in optimization and statistical learning theory which studies algorithms for a general class of convex regularization problems where the regularization penalty may not be differentiable. One such example is regularization of the form

<span class="mw-page-title-main">Multiple kernel learning</span> Set of machine learning methods

Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select for an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.

Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.

Structured sparsity regularization is a class of methods, and an area of research in statistical learning theory, that extend and generalize sparsity regularization learning methods. Both sparsity and structured sparsity regularization methods seek to exploit the assumption that the output variable to be learned can be described by a reduced number of variables in the input space . Sparsity regularization methods focus on selecting the input variables that best describe the output. Structured sparsity regularization methods generalize and extend sparsity regularization methods, by allowing for optimal selection over structures like groups or networks of input variables in .

References

  1. 1 2 3 4 Xiong, Kai; Zhao, Guanghui; Shi, Guangming; Wang, Yingbin (2019-09-12). "A Convex Optimization Algorithm for Compressed Sensing in a Complex Domain: The Complex-Valued Split Bregman Method". Sensors (published 18 Oct 2019). 19 (20): 4540. Bibcode:2019Senso..19.4540X. doi: 10.3390/s19204540 . PMC   6832202 . PMID   31635423.
  2. Bregman L. "A Relaxation Method of Finding a Common Point of Convex Sets and its Application to Problems of Optimization". Dokl. Akad. Nauk SSSR, v. 171, No. 5, 1966, p.p. 1019-1022. (English translation: Soviet Math. Dokl., v. 7, 1966, p.p. 1578-1581)
  3. 1 2 3 4 5 6 7 8 9 10 11 Yin, Wotao (8 Dec 2009). "The Bregman Methods: Review and New Results" (PDF). Archived (PDF) from the original on 2010-06-13. Retrieved 16 Apr 2021.
  4. 1 2 3 4 5 6 7 8 Bush, Jacqueline (10 Jun 2011). "University of California, Santa Barbara Senior Thesis: Bregman Algorithms" (PDF). University of California Santa Barbara. Archived (PDF) from the original on 2016-11-30. Retrieved 16 Apr 2021.
  5. 1 2 3 4 5 6 Yin, Wotao (28 May 2009). "Analysis and Generalizations of the Linearized Bregman Method" (PDF). SIAM Journal on Imaging Sciences. 3 (4): 856–877. doi:10.1137/090760350 . Retrieved 16 Apr 2021.
  6. 1 2 3 Goldstein, Tom; Osher, Stanley (2 Jun 2008). "The Split Bregman Method for L1-Regularized Problems". SIAM J. Imaging Sci. 2 (2): 323–343. doi:10.1137/080725891 . Retrieved 22 Apr 2021.
  7. 1 2 Jiang, Chunzhi (May 2015). "Comparison of Variable Penalty ADMM with Split Bregman Method on Hyperspectral Imaging Problems". Archived from the original on 2020-03-23. Retrieved 20 Apr 2021.
  8. 1 2 3 4 Boyd, Stephen; Parikh, Neal; Chu, Eric; Peleato, Borja; Eckstein, Jonathan (19 Nov 2010). "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers". Foundations and Trends in Machine Learning. 3: 1–122. CiteSeerX   10.1.1.722.981 . doi:10.1561/2200000016.
  9. 1 2 Huang, Bo; Ma, Shiqian; Goldfarb, Donald (27 Jun 2011). "Accelerated Linearized Bregman Method". Journal of Scientific Computing. Plenum Press (published 1 Feb 2013). 54 (2–3): 428–453. arXiv: 1106.5413 . doi:10.1007/s10915-012-9592-9. ISSN   0885-7474. S2CID   14781930.
  10. Wang, Huahua; Banerjee, Arindam (13 Jun 2013). "Bregman alternating direction method of multipliers". NIPS'14: Proceedings of the 27th International Conference on Neural Information Processing Systems. 2: 2816–2824. arXiv: 1306.3203 .
  11. Jia, Rong-Qing (3 Oct 2008). "Convergence analysis of the Bregman method for the variational model of image denoising" (PDF). Applied and Computational Harmonic Analysis (published Nov 2009). 27 (3): 367–379. doi: 10.1016/j.acha.2009.05.002 . Retrieved 22 Apr 2021.