Hyperparameter optimization

Last updated

In machine learning, hyperparameter optimization [1] or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process, which must be configured before the process starts. [2]

Contents

Hyperparameter optimization determines the set of hyperparameters that yields an optimal model which minimizes a predefined loss function on a given data set. [3] The objective function takes a set of hyperparameters and returns the associated loss. [3] Cross-validation is often used to estimate this generalization performance, and therefore choose the set of values for hyperparameters that maximize it. [4]

Approaches

Grid search across different values of two hyperparameters. For each hyperparameter, 10 different values are considered, so a total of 100 different combinations are evaluated and compared. Blue contours indicate regions with strong results, whereas red ones show regions with poor results. Hyperparameter Optimization using Grid Search.svg
Grid search across different values of two hyperparameters. For each hyperparameter, 10 different values are considered, so a total of 100 different combinations are evaluated and compared. Blue contours indicate regions with strong results, whereas red ones show regions with poor results.

The traditional method for hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set [5] or evaluation on a hold-out validation set. [6]

Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually set bounds and discretization may be necessary before applying grid search.

For example, a typical soft-margin SVM classifier equipped with an RBF kernel has at least two hyperparameters that need to be tuned for good performance on unseen data: a regularization constant C and a kernel hyperparameter γ. Both parameters are continuous, so to perform grid search, one selects a finite set of "reasonable" values for each, say

Grid search then trains an SVM with each pair (C, γ) in the Cartesian product of these two sets and evaluates their performance on a held-out validation set (or by internal cross-validation on the training set, in which case multiple SVMs are trained per pair). Finally, the grid search algorithm outputs the settings that achieved the highest score in the validation procedure.

Grid search suffers from the curse of dimensionality, but is often embarrassingly parallel because the hyperparameter settings it evaluates are typically independent of each other. [4]

Random search across different combinations of values for two hyperparameters. In this example, 100 different random choices are evaluated. The green bars show that more individual values for each hyperparameter are considered compared to a grid search. Hyperparameter Optimization using Random Search.svg
Random search across different combinations of values for two hyperparameters. In this example, 100 different random choices are evaluated. The green bars show that more individual values for each hyperparameter are considered compared to a grid search.

Random Search replaces the exhaustive enumeration of all combinations by selecting them randomly. This can be simply applied to the discrete setting described above, but also generalizes to continuous and mixed spaces. A benefit over grid search is that random search can explore many more values than grid search could for continuous hyperparameters. It can outperform Grid search, especially when only a small number of hyperparameters affects the final performance of the machine learning algorithm. [4] In this case, the optimization problem is said to have a low intrinsic dimensionality. [7] Random Search is also embarrassingly parallel, and additionally allows the inclusion of prior knowledge by specifying the distribution from which to sample. Despite its simplicity, random search remains one of the important base-lines against which to compare the performance of new hyperparameter optimization methods.

Methods such as Bayesian optimization smartly explore the space of potential choices of hyperparameters by deciding which combination to explore next based on previous observations. Hyperparameter Optimization using Tree-Structured Parzen Estimators.svg
Methods such as Bayesian optimization smartly explore the space of potential choices of hyperparameters by deciding which combination to explore next based on previous observations.

Bayesian optimization

Bayesian optimization is a global optimization method for noisy black-box functions. Applied to hyperparameter optimization, Bayesian optimization builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set. By iteratively evaluating a promising hyperparameter configuration based on the current model, and then updating it, Bayesian optimization aims to gather observations revealing as much information as possible about this function and, in particular, the location of the optimum. It tries to balance exploration (hyperparameters for which the outcome is most uncertain) and exploitation (hyperparameters expected close to the optimum). In practice, Bayesian optimization has been shown [8] [9] [10] [11] to obtain better results in fewer evaluations compared to grid search and random search, due to the ability to reason about the quality of experiments before they are run.

Gradient-based optimization

For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using gradient descent. The first usage of these techniques was focused on neural networks. [12] Since then, these methods have been extended to other models such as support vector machines [13] or logistic regression. [14]

A different approach in order to obtain a gradient with respect to hyperparameters consists in differentiating the steps of an iterative optimization algorithm using automatic differentiation. [15] [16] [17] [18] A more recent work along this direction uses the implicit function theorem to calculate hypergradients and proposes a stable approximation of the inverse Hessian. The method scales to millions of hyperparameters and requires constant memory. [19]

In a different approach, [20] a hypernetwork is trained to approximate the best response function. One of the advantages of this method is that it can handle discrete hyperparameters as well. Self-tuning networks [21] offer a memory efficient version of this approach by choosing a compact representation for the hypernetwork. More recently, Δ-STN [22] has improved this method further by a slight reparameterization of the hypernetwork which speeds up training. Δ-STN also yields a better approximation of the best-response Jacobian by linearizing the network in the weights, hence removing unnecessary nonlinear effects of large changes in the weights.

Apart from hypernetwork approaches, gradient-based methods can be used to optimize discrete hyperparameters also by adopting a continuous relaxation of the parameters. [23] Such methods have been extensively used for the optimization of architecture hyperparameters in neural architecture search.

Evolutionary optimization

Evolutionary optimization is a methodology for the global optimization of noisy black-box functions. In hyperparameter optimization, evolutionary optimization uses evolutionary algorithms to search the space of hyperparameters for a given algorithm. [9] Evolutionary hyperparameter optimization follows a process inspired by the biological concept of evolution:

  1. Create an initial population of random solutions (i.e., randomly generate tuples of hyperparameters, typically 100+)
  2. Evaluate the hyperparameter tuples and acquire their fitness function (e.g., 10-fold cross-validation accuracy of the machine learning algorithm with those hyperparameters)
  3. Rank the hyperparameter tuples by their relative fitness
  4. Replace the worst-performing hyperparameter tuples with new ones generated via crossover and mutation
  5. Repeat steps 2-4 until satisfactory algorithm performance is reached or is no longer improving.

Evolutionary optimization has been used in hyperparameter optimization for statistical machine learning algorithms, [9] automated machine learning, typical neural network [24] and deep neural network architecture search, [25] [26] as well as training of the weights in deep neural networks. [27]

Population-based

Population Based Training (PBT) learns both hyperparameter values and network weights. Multiple learning processes operate independently, using different hyperparameters. As with evolutionary methods, poorly performing models are iteratively replaced with models that adopt modified hyperparameter values and weights based on the better performers. This replacement model warm starting is the primary differentiator between PBT and other evolutionary methods. PBT thus allows the hyperparameters to evolve and eliminates the need for manual hypertuning. The process makes no assumptions regarding model architecture, loss functions or training procedures.

PBT and its variants are adaptive methods: they update hyperparameters during the training of the models. On the contrary, non-adaptive methods have the sub-optimal strategy to assign a constant set of hyperparameters for the whole training. [28]

Early stopping-based

Successive halving for eight arbitrary hyperparameter configurations. The approach starts with eight models with different configurations and consecutively applies successive halving until only one model remains. Successive-halving-for-eight-arbitrary-hyperparameter-configurations.png
Successive halving for eight arbitrary hyperparameter configurations. The approach starts with eight models with different configurations and consecutively applies successive halving until only one model remains.

A class of early stopping-based hyperparameter optimization algorithms is purpose built for large search spaces of continuous and discrete hyperparameters, particularly when the computational cost to evaluate the performance of a set of hyperparameters is high. Irace implements the iterated racing algorithm, that focuses the search around the most promising configurations, using statistical tests to discard the ones that perform poorly. [29] [30] Another early stopping hyperparameter optimization algorithm is successive halving (SHA), [31] which begins as a random search but periodically prunes low-performing models, thereby focusing computational resources on more promising models. Asynchronous successive halving (ASHA) [32] further improves upon SHA's resource utilization profile by removing the need to synchronously evaluate and prune low-performing models. Hyperband [33] is a higher level early stopping-based algorithm that invokes SHA or ASHA multiple times with varying levels of pruning aggressiveness, in order to be more widely applicable and with fewer required inputs.

Others

RBF [34] and spectral [35] approaches have also been developed.

Issues with hyperparameter optimization

When hyperparameter optimization is done, the set of hyperparameters are often fitted on a training set and selected based on the generalization performance, or score, of a validation set. However, this procedure is at risk of overfitting the hyperparameters to the validation set. Therefore, the generalization performance score of the validation set (which can be several sets in the case of a cross-validation procedure) cannot be used to simultaneously estimate the generalization performance of the final model. In order to do so, the generalization performance has to be evaluated on a set independent (which has no intersection) of the set (or sets) used for the optimization of the hyperparameters, otherwise the performance might give a value which is too optimistic (too large). This can be done on a second test set, or through an outer cross-validation procedure called nested cross-validation, which allows an unbiased estimation of the generalization performance of the model, taking into account the bias due to the hyperparameter optimization.

See also

Related Research Articles

<span class="mw-page-title-main">Supervised learning</span> Paradigm in machine learning

Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data to expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

<span class="mw-page-title-main">Neural network (machine learning)</span> Computational model used in machine learning, based on connected, hierarchical functions

In machine learning, a neural network is a model inspired by the structure and function of biological neural networks in animal brains.

The inductive bias of a learning algorithm is the set of assumptions that the learner uses to predict outputs of given inputs that it has not encountered. Inductive bias is anything which makes the algorithm learn one pattern instead of another pattern. Learning is the process of apprehending useful knowledge by observing and interacting with the world. It involves searching a space of solutions for one expected to provide a better explanation of the data or to achieve higher rewards. But in many cases, there are multiple solutions which are equally good. An inductive bias allows a learning algorithm to prioritize one solution over another, independent of the observed data.

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point, this improves the learner's performance on data outside of the training set. Past that point, however, improving the learner's fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.

In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from input data. These input data used to build the model are usually divided into multiple data sets. In particular, three data sets are commonly used in different stages of the creation of the model: training, validation, and test sets.

Meta-learning is a subfield of machine learning where automatic learning algorithms are applied to metadata about machine learning experiments. As of 2017, the term had not found a standard interpretation, however the main goal is to use such metadata to understand how automatic learning can become flexible in solving learning problems, hence to improve the performance of existing learning algorithms or to learn (induce) the learning algorithm itself, hence the alternative term learning to learn.

A surrogate model is an engineering method used when an outcome of interest cannot be easily measured or computed, so an approximate mathematical model of the outcome is used instead. Most engineering design problems require experiments and/or simulations to evaluate design objective and constraint functions as a function of design variables. For example, in order to find the optimal airfoil shape for an aircraft wing, an engineer simulates the airflow around the wing for different shape variables. For many real-world problems, however, a single simulation can take many minutes, hours, or even days to complete. As a result, routine tasks such as design optimization, design space exploration, sensitivity analysis and "what-if" analysis become impossible since they require thousands or even millions of simulation evaluations.

There are many types of artificial neural networks (ANN).

In machine learning, a hyperparameter is a parameter that can be set in order to define any configurable part of a model's learning process. Hyperparameters can be classified as either model hyperparameters or algorithm hyperparameters. These are named hyperparameters in contrast to parameters, which are characteristics that the model learns from the data.

Bayesian optimization is a sequential design strategy for global optimization of black-box functions, that does not assume any functional forms. It is usually employed to optimize expensive-to-evaluate functions. With the rise of artificial intelligence innovation in the 21st century, Bayesian optimizations have found prominent use in machine learning problems, for optimizing hyperparameter values.

<span class="mw-page-title-main">Symbolic regression</span> Type of regression analysis

Symbolic regression (SR) is a type of regression analysis that searches the space of mathematical expressions to find the model that best fits a given dataset, both in terms of accuracy and simplicity.

The following outline is provided as an overview of and topical guide to machine learning:

Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems. It is the combination of automation and ML.

Neural architecture search (NAS) is a technique for automating the design of artificial neural networks (ANN), a widely used model in the field of machine learning. NAS has been used to design networks that are on par with or outperform hand-designed architectures. Methods for NAS can be categorized according to the search space, search strategy and performance estimation strategy used:

Multi-task optimization is a paradigm in the optimization literature that focuses on solving multiple self-contained tasks simultaneously. The paradigm has been inspired by the well-established concepts of transfer learning and multi-task learning in predictive analytics.

<span class="mw-page-title-main">Learning curve (machine learning)</span>

In machine learning, a learning curve plots the optimal value of a model's loss function for a training set against this loss function evaluated on a validation data set with same parameters as produced the optimal function. Synonyms include error curve, experience curve, improvement curve and generalization curve.

In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In the adaptive control literature, the learning rate is commonly referred to as gain.

<span class="mw-page-title-main">Federated learning</span> Decentralized machine learning

Federated learning is a machine learning technique focusing on settings in which multiple entities collaboratively train a model while ensuring that their data remains decentralized. This stands in contrast to machine learning settings in which data is centrally stored. One of the primary defining characteristics of federated learning is data heterogeneity. Due to the decentralized nature of the clients' data, there is no guarantee that data samples held by each client are independently and identically distributed.

Probabilistic numerics is an active field of study at the intersection of applied mathematics, statistics, and machine learning centering on the concept of uncertainty in computation. In probabilistic numerics, tasks in numerical analysis such as finding numerical solutions for integration, linear algebra, optimization and simulation and differential equations are seen as problems of statistical, probabilistic, or Bayesian inference.

References

  1. Matthias Feurer and Frank Hutter. Hyperparameter optimization. In: AutoML: Methods, Systems, Challenges, pages 3–38.
  2. Yang, Li (2020). "On hyperparameter optimization of machine learning algorithms: Theory and practice". Neurocomputing. 415: 295–316. arXiv: 2007.15745 . doi:10.1016/j.neucom.2020.07.061.
  3. 1 2 Claesen, Marc; Bart De Moor (2015). "Hyperparameter Search in Machine Learning". arXiv: 1502.02127 [cs.LG].
  4. 1 2 3 Bergstra, James; Bengio, Yoshua (2012). "Random Search for Hyper-Parameter Optimization" (PDF). Journal of Machine Learning Research. 13: 281–305.
  5. Chin-Wei Hsu, Chih-Chung Chang and Chih-Jen Lin (2010). A practical guide to support vector classification. Technical Report, National Taiwan University.
  6. Chicco D (December 2017). "Ten quick tips for machine learning in computational biology". BioData Mining. 10 (35): 35. doi: 10.1186/s13040-017-0155-3 . PMC   5721660 . PMID   29234465.
  7. Ziyu, Wang; Frank, Hutter; Masrour, Zoghi; David, Matheson; Nando, de Feitas (2016). "Bayesian Optimization in a Billion Dimensions via Random Embeddings". Journal of Artificial Intelligence Research. 55: 361–387. arXiv: 1301.1942 . doi:10.1613/jair.4806. S2CID   279236.
  8. Hutter, Frank; Hoos, Holger; Leyton-Brown, Kevin (2011), "Sequential Model-Based Optimization for General Algorithm Configuration", Learning and Intelligent Optimization (PDF), Lecture Notes in Computer Science, vol. 6683, pp. 507–523, CiteSeerX   10.1.1.307.8813 , doi:10.1007/978-3-642-25566-3_40, ISBN   978-3-642-25565-6, S2CID   6944647
  9. 1 2 3 Bergstra, James; Bardenet, Remi; Bengio, Yoshua; Kegl, Balazs (2011), "Algorithms for hyper-parameter optimization" (PDF), Advances in Neural Information Processing Systems
  10. Snoek, Jasper; Larochelle, Hugo; Adams, Ryan (2012). "Practical Bayesian Optimization of Machine Learning Algorithms" (PDF). Advances in Neural Information Processing Systems. arXiv: 1206.2944 . Bibcode:2012arXiv1206.2944S.
  11. Thornton, Chris; Hutter, Frank; Hoos, Holger; Leyton-Brown, Kevin (2013). "Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms" (PDF). Knowledge Discovery and Data Mining. arXiv: 1208.3719 . Bibcode:2012arXiv1208.3719T.
  12. Larsen, Jan; Hansen, Lars Kai; Svarer, Claus; Ohlsson, M (1996). "Design and regularization of neural networks: The optimal use of a validation set" (PDF). Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop. pp. 62–71. CiteSeerX   10.1.1.415.3266 . doi:10.1109/NNSP.1996.548336. ISBN   0-7803-3550-3. S2CID   238874.
  13. Olivier Chapelle; Vladimir Vapnik; Olivier Bousquet; Sayan Mukherjee (2002). "Choosing multiple parameters for support vector machines" (PDF). Machine Learning. 46: 131–159. doi: 10.1023/a:1012450327387 .
  14. Chuong B; Chuan-Sheng Foo; Andrew Y Ng (2008). "Efficient multiple hyperparameter learning for log-linear models" (PDF). Advances in Neural Information Processing Systems. 20.
  15. Domke, Justin (2012). "Generic Methods for Optimization-Based Modeling" (PDF). Aistats. 22. Archived from the original (PDF) on 2014-01-24. Retrieved 2017-12-09.
  16. Maclaurin, Dougal; Duvenaud, David; Adams, Ryan P. (2015). "Gradient-based Hyperparameter Optimization through Reversible Learning". arXiv: 1502.03492 [stat.ML].
  17. Franceschi, Luca; Donini, Michele; Frasconi, Paolo; Pontil, Massimiliano (2017). "Forward and Reverse Gradient-Based Hyperparameter Optimization" (PDF). Proceedings of the 34th International Conference on Machine Learning. arXiv: 1703.01785 . Bibcode:2017arXiv170301785F.
  18. Shaban, A., Cheng, C. A., Hatch, N., & Boots, B. (2019, April). Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 1723-1732). PMLR.
  19. Lorraine, J., Vicol, P., & Duvenaud, D. (2018). Optimizing Millions of Hyperparameters by Implicit Differentiation. arXiv preprint arXiv:1911.02590.
  20. Lorraine, J., & Duvenaud, D. (2018). Stochastic hyperparameter optimization through hypernetworks. arXiv preprint arXiv:1802.09419.
  21. MacKay, M., Vicol, P., Lorraine, J., Duvenaud, D., & Grosse, R. (2019). Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. arXiv preprint arXiv:1903.03088.
  22. Bae, J., & Grosse, R. B. (2020). Delta-stn: Efficient bilevel optimization for neural networks using structured response jacobians. Advances in Neural Information Processing Systems, 33, 21725-21737.
  23. Liu, H., Simonyan, K., & Yang, Y. (2018). Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055.
  24. Kousiouris G, Cuccinotta T, Varvarigou T (2011). "The effects of scheduling, workload type and consolidation scenarios on virtual machine performance and their prediction through optimized artificial neural networks". Journal of Systems and Software. 84 (8): 1270–1291. doi:10.1016/j.jss.2011.04.013. hdl: 11382/361472 .
  25. Miikkulainen R, Liang J, Meyerson E, Rawal A, Fink D, Francon O, Raju B, Shahrzad H, Navruzyan A, Duffy N, Hodjat B (2017). "Evolving Deep Neural Networks". arXiv: 1703.00548 [cs.NE].
  26. Jaderberg M, Dalibard V, Osindero S, Czarnecki WM, Donahue J, Razavi A, Vinyals O, Green T, Dunning I, Simonyan K, Fernando C, Kavukcuoglu K (2017). "Population Based Training of Neural Networks". arXiv: 1711.09846 [cs.LG].
  27. Such FP, Madhavan V, Conti E, Lehman J, Stanley KO, Clune J (2017). "Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning". arXiv: 1712.06567 [cs.NE].
  28. Li, Ang; Spyra, Ola; Perel, Sagi; Dalibard, Valentin; Jaderberg, Max; Gu, Chenjie; Budden, David; Harley, Tim; Gupta, Pramod (2019-02-05). "A Generalized Framework for Population Based Training". arXiv: 1902.01894 [cs.AI].
  29. López-Ibáñez, Manuel; Dubois-Lacoste, Jérémie; Pérez Cáceres, Leslie; Stützle, Thomas; Birattari, Mauro (2016). "The irace package: Iterated Racing for Automatic Algorithm Configuration". Operations Research Perspective. 3 (3): 43–58. doi: 10.1016/j.orp.2016.09.002 . hdl: 10419/178265 .
  30. Birattari, Mauro; Stützle, Thomas; Paquete, Luis; Varrentrapp, Klaus (2002). "A Racing Algorithm for Configuring Metaheuristics". Gecco 2002: 11–18.
  31. Jamieson, Kevin; Talwalkar, Ameet (2015-02-27). "Non-stochastic Best Arm Identification and Hyperparameter Optimization". arXiv: 1502.07943 [cs.LG].
  32. Li, Liam; Jamieson, Kevin; Rostamizadeh, Afshin; Gonina, Ekaterina; Hardt, Moritz; Recht, Benjamin; Talwalkar, Ameet (2020-03-16). "A System for Massively Parallel Hyperparameter Tuning". arXiv: 1810.05934v5 [cs.LG].
  33. Li, Lisha; Jamieson, Kevin; DeSalvo, Giulia; Rostamizadeh, Afshin; Talwalkar, Ameet (2020-03-16). "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization". Journal of Machine Learning Research. 18: 1–52. arXiv: 1603.06560 .
  34. Diaz, Gonzalo; Fokoue, Achille; Nannicini, Giacomo; Samulowitz, Horst (2017). "An effective algorithm for hyperparameter optimization of neural networks". arXiv: 1705.08520 [cs.AI].
  35. Hazan, Elad; Klivans, Adam; Yuan, Yang (2017). "Hyperparameter Optimization: A Spectral Approach". arXiv: 1706.00764 [cs.LG].