Bayesian optimization

Last updated

Bayesian optimization is a sequential design strategy for global optimization of black-box functions [1] [2] [3] that does not assume any functional forms. It is usually employed to optimize expensive-to-evaluate functions.

Contents

History

The term is generally attributed to Jonas Mockus  [ lt ] and is coined in his work from a series of publications on global optimization in the 1970s and 1980s. [4] [5] [1]

Strategy

Bayesian optimization of a function (black) with Gaussian processes (purple). Three acquisition functions (blue) are shown at the bottom. GpParBayesAnimationSmall.gif
Bayesian optimization of a function (black) with Gaussian processes (purple). Three acquisition functions (blue) are shown at the bottom.

Bayesian optimization is typically used on problems of the form , where is a set of points, , which rely upon less than 20 dimensions (), and whose membership can easily be evaluated. Bayesian optimization is particularly advantageous for problems where is difficult to evaluate due to its computational cost. The objective function, , is continuous and takes the form of some unknown structure, referred to as a "black box". Upon its evaluation, only is observed and its derivatives are not evaluated. [7]

Since the objective function is unknown, the Bayesian strategy is to treat it as a random function and place a prior over it. The prior captures beliefs about the behavior of the function. After gathering the function evaluations, which are treated as data, the prior is updated to form the posterior distribution over the objective function. The posterior distribution, in turn, is used to construct an acquisition function (often also referred to as infill sampling criteria) that determines the next query point.

There are several methods used to define the prior/posterior distribution over the objective function. The most common two methods use Gaussian processes in a method called kriging. Another less expensive method uses the Parzen-Tree Estimator to construct two distributions for 'high' and 'low' points, and then finds the location that maximizes the expected improvement. [8]

Standard Bayesian optimization relies upon each being easy to evaluate, and problems that deviate from this assumption are known as exotic Bayesian optimization problems. Optimization problems can become exotic if it is known that there is noise, the evaluations are being done in parallel, the quality of evaluations relies upon a tradeoff between difficulty and accuracy, the presence of random environmental conditions, or if the evaluation involves derivatives. [7]

Acquisition functions

Examples of acquisition functions include

and hybrids of these. [9] They all trade-off exploration and exploitation so as to minimize the number of function queries. As such, Bayesian optimization is well suited for functions that are expensive to evaluate.

Solution methods

The maximum of the acquisition function is typically found by resorting to discretization or by means of an auxiliary optimizer. Acquisition functions are typically well-behaved [ citation needed ] and are maximized using a numerical optimization technique, such as Newton's Method or quasi-Newton methods like the Broyden–Fletcher–Goldfarb–Shanno algorithm.

Applications

The approach has been applied to solve a wide range of problems, [10] including learning to rank, [11] computer graphics and visual design, [12] [13] [14] robotics, [15] [16] [17] [18] sensor networks, [19] [20] automatic algorithm configuration, [21] [22] automatic machine learning toolboxes, [23] [24] [25] reinforcement learning, [26] planning, visual attention, architecture configuration in deep learning, static program analysis, experimental particle physics, [27] [28] quality-diversity optimization, [29] [30] [31] chemistry, material design, and drug development. [7] [32] [33]

Bayesian Optimization has been applied in the field of facial recognition. [34] The performance of the Histogram of Oriented Gradients (HOG) algorithm, a popular feature extraction method, heavily relies on its parameter settings. Optimizing these parameters can be challenging but crucial for achieving high accuracy. [34] A novel approach to optimize the HOG algorithm parameters and image size for facial recognition using a Tree-structured Parzen Estimator (TPE) based Bayesian optimization technique has been proposed. [34] This optimized approach has the potential to be adapted for other computer vision applications and contributes to the ongoing development of hand-crafted parameter-based feature extraction algorithms in computer vision. [34]

See also

Related Research Articles

<span class="mw-page-title-main">Supervised learning</span> A paradigm in machine learning

Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data on expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

<span class="mw-page-title-main">Neural network (machine learning)</span> Computational model used in machine learning, based on connected, hierarchical functions

In machine learning, a neural network is a model inspired by the structure and function of biological neural networks in animal brains.

<span class="mw-page-title-main">Mathematical optimization</span> Study of mathematical algorithms for optimization problems

Mathematical optimization or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfields: discrete optimization and continuous optimization. Optimization problems arise in all quantitative disciplines from computer science and engineering to operations research and economics, and the development of solution methods has been of interest in mathematics for centuries.

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues SVMs are one of the most studied models, being based on statistical learning frameworks of VC theory proposed by Vapnik and Chervonenkis (1974).

In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

Global optimization is a branch of applied mathematics and numerical analysis that attempts to find the global minima or maxima of a function or a set of functions on a given set. It is usually described as a minimization problem because the maximization of the real-valued function is equivalent to the minimization of the function .

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.

<span class="mw-page-title-main">Multi-armed bandit</span> Resource problem in machine learning

In probability theory and machine learning, the multi-armed bandit problem is a problem in which a decision maker iteratively selects one of multiple fixed choices when the properties of each choice are only partially known at the time of allocation, and may become better understood as time passes. A fundamental aspect of bandit problems is that choosing an arm does not affect the properties of the arm or other arms.

Meta learning is a subfield of machine learning where automatic learning algorithms are applied to metadata about machine learning experiments. As of 2017, the term had not found a standard interpretation, however the main goal is to use such metadata to understand how automatic learning can become flexible in solving learning problems, hence to improve the performance of existing learning algorithms or to learn (induce) the learning algorithm itself, hence the alternative term learning to learn.

A surrogate model is an engineering method used when an outcome of interest cannot be easily measured or computed, so an approximate mathematical model of the outcome is used instead. Most engineering design problems require experiments and/or simulations to evaluate design objective and constraint functions as a function of design variables. For example, in order to find the optimal airfoil shape for an aircraft wing, an engineer simulates the airflow around the wing for different shape variables. For many real-world problems, however, a single simulation can take many minutes, hours, or even days to complete. As a result, routine tasks such as design optimization, design space exploration, sensitivity analysis and "what-if" analysis become impossible since they require thousands or even millions of simulation evaluations.

Multi-objective optimization or Pareto optimization is an area of multiple-criteria decision making that is concerned with mathematical optimization problems involving more than one objective function to be optimized simultaneously. Multi-objective is a type of vector optimization that has been applied in many fields of science, including engineering, economics and logistics where optimal decisions need to be taken in the presence of trade-offs between two or more conflicting objectives. Minimizing cost while maximizing comfort while buying a car, and maximizing performance whilst minimizing fuel consumption and emission of pollutants of a vehicle are examples of multi-objective optimization problems involving two and three objectives, respectively. In practical problems, there can be more than three objectives.

Group method of data handling (GMDH) is a family of inductive algorithms for computer-based mathematical modeling of multi-parametric datasets that features fully automatic structural and parametric optimization of models.

Harold Joseph Kushner is an American applied mathematician and a Professor Emeritus of Applied Mathematics at Brown University. He is known for his work on the theory of stochastic stability, the theory of non-linear filtering, and for the development of numerical methods for stochastic control problems such as the Markov chain approximation method. He is commonly cited as the first person to study Bayesian optimization, based on work he published in 1964.

Random search (RS) is a family of numerical optimization methods that do not require the gradient of the problem to be optimized, and RS can hence be used on functions that are not continuous or differentiable. Such optimization methods are also known as direct-search, derivative-free, or black-box methods.

Geometric feature learning is a technique combining machine learning and computer vision to solve visual tasks. The main goal of this method is to find a set of representative features of geometric form to represent an object by collecting geometric features from images and learning them using efficient machine learning methods. Humans solve visual tasks and can give fast response to the environment by extracting perceptual information from what they see. Researchers simulate humans' ability of recognizing objects to solve computer vision problems. For example, M. Mata et al.(2002) applied feature learning techniques to the mobile robot navigation tasks in order to avoid obstacles. They used genetic algorithms for learning features and recognizing objects (figures). Geometric feature learning methods can not only solve recognition problems but also predict subsequent actions by analyzing a set of sequential input sensory images, usually some extracting features of images. Through learning, some hypothesis of the next action are given and according to the probability of each hypothesis give a most probable action. This technique is widely used in the area of artificial intelligence.

Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks. A survey from May 2020 exposes the fact that practitioners report a dire need for better protecting machine learning systems in industrial applications.

The following outline is provided as an overview of and topical guide to machine learning:

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process.

Multi-task optimization is a paradigm in the optimization literature that focuses on solving multiple self-contained tasks simultaneously. The paradigm has been inspired by the well-established concepts of transfer learning and multi-task learning in predictive analytics.

Probabilistic numerics is an active field of study at the intersection of applied mathematics, statistics, and machine learning centering on the concept of uncertainty in computation. In probabilistic numerics, tasks in numerical analysis such as finding numerical solutions for integration, linear algebra, optimization and simulation and differential equations are seen as problems of statistical, probabilistic, or Bayesian inference.

References

  1. 1 2 Močkus, J. (1989). Bayesian Approach to Global Optimization. Dordrecht: Kluwer Academic. ISBN   0-7923-0115-3.
  2. Garnett, Roman (2023). Bayesian Optimization. Cambridge University Press. ISBN   978-1-108-42578-0.
  3. Hennig, P.; Osborne, M. A.; Kersting, H. P. (2022). Probabilistic Numerics (PDF). Cambridge University Press. pp. 243–278. ISBN   978-1107163447.
  4. Močkus, Jonas (1975). "On bayesian methods for seeking the extremum". Optimization Techniques IFIP Technical Conference Novosibirsk, July 1–7, 1974. Lecture Notes in Computer Science. Vol. 27. pp. 400–404. doi: 10.1007/3-540-07165-2_55 . ISBN   978-3-540-07165-5.
  5. Močkus, Jonas (1977). "On Bayesian Methods for Seeking the Extremum and their Application". IFIP Congress: 195–200.
  6. Wilson, Samuel (2019-11-22), ParBayesianOptimization R package , retrieved 2019-12-12
  7. 1 2 3 Frazier, Peter I. (2018-07-08). "A Tutorial on Bayesian Optimization". arXiv: 1807.02811 [stat.ML].
  8. J. S. Bergstra, R. Bardenet, Y. Bengio, B. Kégl: Algorithms for Hyper-Parameter Optimization. Advances in Neural Information Processing Systems: 2546–2554 (2011)
  9. Matthew W. Hoffman, Eric Brochu, Nando de Freitas: Portfolio Allocation for Bayesian Optimization. Uncertainty in Artificial Intelligence: 327–336 (2011)
  10. Eric Brochu, Vlad M. Cora, Nando de Freitas: A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. CoRR abs/1012.2599 (2010)
  11. Eric Brochu, Nando de Freitas, Abhijeet Ghosh: Active Preference Learning with Discrete Choice Data. Advances in Neural Information Processing Systems: 409-416 (2007)
  12. Eric Brochu, Tyson Brochu, Nando de Freitas: A Bayesian Interactive Optimization Approach to Procedural Animation Design. Symposium on Computer Animation 2010: 103–112
  13. Yuki Koyama, Issei Sato, Daisuke Sakamoto, Takeo Igarashi: Sequential Line Search for Efficient Visual Design Optimization by Crowds. ACM Transactions on Graphics, Volume 36, Issue 4, pp.48:1–48:11 (2017). DOI: https://doi.org/10.1145/3072959.3073598
  14. Yuki Koyama, Issei Sato, Masataka Goto: Sequential Gallery for Interactive Visual Design Optimization. ACM Transactions on Graphics, Volume 39, Issue 4, pp.88:1–88:12 (2020). DOI: https://doi.org/10.1145/3386569.3392444
  15. Daniel J. Lizotte, Tao Wang, Michael H. Bowling, Dale Schuurmans: Automatic Gait Optimization with Gaussian Process Regression Archived 2017-08-12 at the Wayback Machine . International Joint Conference on Artificial Intelligence: 944–949 (2007)
  16. Ruben Martinez-Cantin, Nando de Freitas, Eric Brochu, Jose Castellanos and Arnaud Doucet. A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots. Volume 27, Issue 2, pp 93–103 (2009)
  17. Scott Kuindersma, Roderic Grupen, and Andrew Barto. Variable Risk Control via Stochastic Optimization. International Journal of Robotics Research, volume 32, number 7, pp 806–825 (2013)
  18. Roberto Calandra, André Seyfarth, Jan Peters, and Marc P. Deisenroth Bayesian optimization for learning gaits under uncertainty. Ann. Math. Artif. Intell. Volume 76, Issue 1, pp 5-23 (2016) DOI:10.1007/s10472-015-9463-9
  19. Niranjan Srinivas, Andreas Krause, Sham M. Kakade, Matthias W. Seeger: Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting. IEEE Transactions on Information Theory 58(5):3250–3265 (2012)
  20. Roman Garnett, Michael A. Osborne, Stephen J. Roberts: Bayesian optimization for sensor set selection [ dead link ]. ACM/IEEE International Conference on Information Processing in Sensor Networks: 209–219 (2010)
  21. Frank Hutter, Holger Hoos, and Kevin Leyton-Brown (2011). Sequential model-based optimization for general algorithm configuration, Learning and Intelligent Optimization
  22. J. Snoek, H. Larochelle, R. P. Adams Practical Bayesian Optimization of Machine Learning Algorithms. Advances in Neural Information Processing Systems: 2951-2959 (2012)
  23. J. Bergstra, D. Yamins, D. D. Cox (2013). Hyperopt: A Python Library for Optimizing the Hyperparameters of Machine Learning Algorithms. Proc. SciPy 2013.
  24. Chris Thornton, Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. KDD 2013: 847–855
  25. Jasper Snoek, Hugo Larochelle and Ryan Prescott Adams. Practical Bayesian Optimization of Machine Learning Algorithms. Advances in Neural Information Processing Systems, 2012
  26. Berkenkamp, Felix (2019). Safe Exploration in Reinforcement Learning: Theory and Applications in Robotics (Doctoral Thesis thesis). ETH Zurich. doi:10.3929/ethz-b-000370833. hdl:20.500.11850/370833.
  27. Philip Ilten, Mike Williams, Yunjie Yang. Event generator tuning using Bayesian optimization. 2017 JINST 12 P04028. DOI: 10.1088/1748-0221/12/04/P04028
  28. Evaristo Cisbani et al. AI-optimized detector design for the future Electron-Ion Collider: the dual-radiator RICH case 2020 JINST 15 P05009. DOI: 10.1088/1748-0221/15/05/P05009
  29. Kent, Paul; Gaier, Adam; Mouret, Jean-Baptiste; Branke, Juergen (2023-07-19). "BOP-Elites, a Bayesian Optimisation Approach to Quality Diversity Search with Black-Box descriptor functions". arXiv: 2307.09326 [math.OC]. Preprint: Arxiv.
  30. Kent, Paul; Branke, Juergen (2023-07-12). "Bayesian Quality Diversity Search with Interactive Illumination". Proceedings of the Genetic and Evolutionary Computation Conference (PDF). GECCO '23. New York, NY, USA: Association for Computing Machinery. pp. 1019–1026. doi:10.1145/3583131.3590486. ISBN   979-8-4007-0119-1. S2CID   259833672.
  31. Gaier, Adam; Asteroth, Alexander; Mouret, Jean-Baptiste (2018-09-01). "Data-Efficient Design Exploration through Surrogate-Assisted Illumination". Evolutionary Computation. 26 (3): 381–410. arXiv: 1806.05865 . doi: 10.1162/evco_a_00231 . ISSN   1063-6560. PMID   29883202. S2CID   47003986.
  32. Gomez-Bombarelli et al. Automatic Chemical Design using a Data-Driven Continuous Representation of Molecules. ACS Central Science, Volume 4, Issue 2, 268-276 (2018)
  33. Griffiths et al. Constrained Bayesian Optimization for Automatic Chemical Design using Variational Autoencoders Chemical Science: 11, 577-586 (2020)
  34. 1 2 3 4 Mohammed Mehdi Bouchene: Bayesian Optimization of Histogram of Oriented Gradients (Hog) Parameters for Facial Recognition. SSRN (2023)