Learning rate

Last updated June 07, 2023

In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.^[1] Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In the adaptive control literature, the learning rate is commonly referred to as gain.^[2]

In setting a learning rate, there is a trade-off between the rate of convergence and overshooting. While the descent direction is usually determined from the gradient of the loss function, the learning rate determines how big a step is taken in that direction. A too high learning rate will make the learning jump over minima but a too low learning rate will either take too long to converge or get stuck in an undesirable local minimum.^[3]

In order to achieve faster convergence, prevent oscillations and getting stuck in undesirable local minima the learning rate is often varied during training either in accordance to a learning rate schedule or by using an adaptive learning rate.^[4] The learning rate and its adjustments may also differ per parameter, in which case it is a diagonal matrix that can be interpreted as an approximation to the inverse of the Hessian matrix in Newton's method.^[5] The learning rate is related to the step length determined by inexact line search in quasi-Newton methods and related optimization algorithms.^[6]^[7]

Learning rate schedule

Initial rate can be left as system default or can be selected using a range of techniques.^[8] A learning rate schedule changes the learning rate during learning and is most often changed between epochs/iterations. This is mainly done with two parameters: decay and momentum. There are many different learning rate schedules but the most common are time-based, step-based and exponential.^[4]

Decay serves to settle the learning in a nice place and avoid oscillations, a situation that may arise when a too high constant learning rate makes the learning jump back and forth over a minimum, and is controlled by a hyperparameter.

Momentum is analogous to a ball rolling down a hill; we want the ball to settle at the lowest point of the hill (corresponding to the lowest error). Momentum both speeds up the learning (increasing the learning rate) when the error cost gradient is heading in the same direction for a long time and also avoids local minima by 'rolling over' small bumps. Momentum is controlled by a hyperparameter analogous to a ball's mass which must be chosen manually—too high and the ball will roll over minima which we wish to find, too low and it will not fulfil its purpose. The formula for factoring in the momentum is more complex than for decay but is most often built in with deep learning libraries such as Keras.

Time-based learning schedules alter the learning rate depending on the learning rate of the previous time iteration. Factoring in the decay the mathematical formula for the learning rate is:

$\eta _{n+1}={\frac {\eta _{n}}{1+dn}}$

where $\eta$ is the learning rate, $d$ is a decay parameter and $n$ is the iteration step.

Step-based learning schedules changes the learning rate according to some predefined steps. The decay application formula is here defined as:

$\eta _{n}=\eta _{0}d^{\left\lfloor {\frac {1+n}{r}}\right\rfloor }$

where $\eta _{n}$ is the learning rate at iteration $n$ , $\eta _{0}$ is the initial learning rate, $d$ is how much the learning rate should change at each drop (0.5 corresponds to a halving) and $r$ corresponds to the drop rate, or how often the rate should be dropped (10 corresponds to a drop every 10 iterations). The floor function ( $\lfloor \dots \rfloor$ ) here drops the value of its input to 0 for all values smaller than 1.

Exponential learning schedules are similar to step-based, but instead of steps, a decreasing exponential function is used. The mathematical formula for factoring in the decay is:

$\eta _{n}=\eta _{0}e^{-dn}$

where $d$ is a decay parameter.

Adaptive learning rate

The issue with learning rate schedules is that they all depend on hyperparameters that must be manually chosen for each given learning session and may vary greatly depending on the problem at hand or the model used. To combat this, there are many different types of adaptive gradient descent algorithms such as Adagrad, Adadelta, RMSprop, and Adam ^[9] which are generally built into deep learning libraries such as Keras.^[10]

Related Research Articles

Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem. For large numbers of local optima, SA can find the global optima. It is often used when the search space is discrete. For problems where finding an approximate global optimum is more important than finding a precise local optimum in a fixed amount of time, simulated annealing may be preferable to exact algorithms such as gradient descent or branch and bound.

In mathematics, gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a local maximum of that function; the procedure is then known as gradient ascent. It is particularly useful in machine learning for minimizing the cost or loss function. Gradient descent should not be confused with local search algorithms, although both are iterative methods for optimization.

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point, this improves the learner's performance on data outside of the training set. Past that point, however, improving the learner's fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.

In computer science and operations research, the ant colony optimization algorithm (ACO) is a probabilistic technique for solving computational problems which can be reduced to finding good paths through graphs. Artificial ants stand for multi-agent methods inspired by the behavior of real ants. The pheromone-based communication of biological ants is often the predominant paradigm used. Combinations of artificial ants and local search algorithms have become a method of choice for numerous optimization tasks involving some sort of graph, e.g., vehicle routing and internet routing.

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.

In calculus, Newton's method (also called Newton–Raphson) is an iterative method for finding the roots of a differentiable function $F$ , which are solutions to the equation $F (x) = 0$ . As such, Newton's method can be applied to the derivative $f'$ of a twice-differentiable function $f$ to find the roots of the derivative (solutions to $f'(x) = 0$ ), also known as the critical points of $f$ . These solutions may be minima, maxima, or saddle points; see section "Several variables" in Critical point (mathematics) and also section "Geometric interpretation" in this article. This is relevant in optimization, which aims to find (global) minima of the function $f$ .

In (unconstrained) mathematical optimization, a backtracking line search is a line search method to determine the amount to move along a given search direction. Its use requires that the objective function is differentiable and that its gradient is known.

The Frank–Wolfe algorithm is an iterative first-order optimization algorithm for constrained convex optimization. Also known as the conditional gradient method, reduced gradient algorithm and the convex combination algorithm, the method was originally proposed by Marguerite Frank and Philip Wolfe in 1956. In each iteration, the Frank–Wolfe algorithm considers a linear approximation of the objective function, and moves towards a minimizer of this linear function.

In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., stock price prediction. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.

Natural evolution strategies (NES) are a family of numerical optimization algorithms for black box problems. Similar in spirit to evolution strategies, they iteratively update the (continuous) parameters of a search distribution by following the natural gradient towards higher expected fitness.

In machine learning, a hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters are derived via training.

An artificial neural network's learning rule or learning process is a method, mathematical logic or algorithm which improves the network's performance and/or training time. Usually, this rule is applied repeatedly over the network. It is done by updating the weights and bias levels of a network when a network is simulated in a specific data environment. A learning rule may accept existing conditions of the network and will compare the expected result and actual result of the network to give new and improved values for weights and bias. Depending on the complexity of actual model being simulated, the learning rule of the network can be as simple as an XOR gate or mean squared error, or as complex as the result of a system of differential equations.

Spectral regularization is any of a class of regularization techniques used in machine learning to control the impact of noise and prevent overfitting. Spectral regularization can be used in a broad range of applications, from deblurring images to classifying emails into a spam folder and a non-spam folder. For instance, in the email classification example, spectral regularization can be used to reduce the impact of noise and prevent overfitting when a machine learning system is being trained on a labeled set of emails to learn how to tell a spam and a non-spam email apart.

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters are learned.

Batch normalization is a method used to make training of artificial neural networks faster and more stable through normalization of the layers' inputs by re-centering and re-scaling. It was proposed by Sergey Ioffe and Christian Szegedy in 2015.

Stochastic gradient Langevin dynamics (SGLD) is an optimization and sampling technique composed of characteristics from Stochastic gradient descent, a Robbins–Monro optimization algorithm, and Langevin dynamics, a mathematical extension of molecular dynamics models. Like stochastic gradient descent, SGLD is an iterative optimization algorithm which uses minibatching to create a stochastic gradient estimator, as used in SGD to optimize a differentiable objective function. Unlike traditional SGD, SGLD can be used for Bayesian learning as a sampling method. SGLD may be viewed as Langevin dynamics applied to posterior distributions, but the key difference is that the likelihood gradient terms are minibatched, like in SGD. SGLD, like Langevin dynamics, produces samples from a posterior distribution of parameters based on available data. First described by Welling and Teh in 2011, the method has applications in many contexts which require optimization, and is most notably applied in machine learning problems.

Federated learning is a machine learning technique that trains an algorithm via multiple independent sessions, each using its own dataset. This approach stands in contrast to traditional centralized machine learning techniques where local datasets are merged into one training session, as well as to approaches that assume that local data samples are identically distributed.

In mathematical optimization, oracle complexity is a standard theoretical framework to study the computational requirements for solving classes of optimization problems. It is suitable for analyzing iterative algorithms which proceed by computing local information about the objective function at various points. The framework has been used to provide tight worst-case guarantees on the number of required iterations, for several important classes of optimization problems.

In mathematics, mirror descent is an iterative optimization algorithm for finding a local minimum of a differentiable function.

(Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques are able to achieve convergence rates that are impossible to achieve with methods that treat the objective as an infinite sum, as in the classical Stochastic approximation setting.

References

↑ Murphy, Kevin P. (2012). Machine Learning: A Probabilistic Perspective. Cambridge: MIT Press. p. 247. ISBN 978-0-262-01802-9.
↑ Delyon, Bernard (2000). "Stochastic Approximation with Decreasing Gain: Convergence and Asymptotic Theory". Unpublished Lecture Notes. Université de Rennes. CiteSeerX 10.1.1.29.4428 .
↑ Buduma, Nikhil; Locascio, Nicholas (2017). Fundamentals of Deep Learning : Designing Next-Generation Machine Intelligence Algorithms. O'Reilly. p. 21. ISBN 978-1-4919-2558-4.
1 2 Patterson, Josh; Gibson, Adam (2017). "Understanding Learning Rates". Deep Learning : A Practitioner's Approach. O'Reilly. pp. 258–263. ISBN 978-1-4919-1425-0.
↑ Ruder, Sebastian (2017). "An Overview of Gradient Descent Optimization Algorithms". arXiv: 1609.04747 [cs.LG].
↑ Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Boston: Kluwer. p. 25. ISBN 1-4020-7553-7.
↑ Dixon, L. C. W. (1972). "The Choice of Step Length, a Crucial Factor in the Performance of Variable Metric Algorithms". Numerical Methods for Non-linear Optimization. London: Academic Press. pp. 149–170. ISBN 0-12-455650-7.
↑ Smith, Leslie N. (4 April 2017). "Cyclical Learning Rates for Training Neural Networks". arXiv: 1506.01186 [cs.CV].
↑ Murphy, Kevin (2021). Probabilistic Machine Learning: An Introduction. Probabilistic Machine Learning: An Introduction. MIT Press. Retrieved 10 April 2021.
↑ Brownlee, Jason (22 January 2019). "How to Configure the Learning Rate When Training Deep Learning Neural Networks". Machine Learning Mastery. Retrieved 4 January 2021.

External links

de Freitas, Nando (February 12, 2015). "Optimization". Deep Learning Lecture 6. University of Oxford – via YouTube.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Murphy, Kevin P. (2012). Machine Learning: A Probabilistic Perspective. Cambridge: MIT Press. p. 247. ISBN 978-0-262-01802-9.

[2] Delyon, Bernard (2000). "Stochastic Approximation with Decreasing Gain: Convergence and Asymptotic Theory". Unpublished Lecture Notes. Université de Rennes. CiteSeerX 10.1.1.29.4428 .

[3] Buduma, Nikhil; Locascio, Nicholas (2017). Fundamentals of Deep Learning : Designing Next-Generation Machine Intelligence Algorithms. O'Reilly. p. 21. ISBN 978-1-4919-2558-4.

[variablelearningrate-4] 1 2 Patterson, Josh; Gibson, Adam (2017). "Understanding Learning Rates". Deep Learning : A Practitioner's Approach. O'Reilly. pp. 258–263. ISBN 978-1-4919-1425-0.

[5] Ruder, Sebastian (2017). "An Overview of Gradient Descent Optimization Algorithms". arXiv: 1609.04747 [cs.LG].

[6] Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Boston: Kluwer. p. 25. ISBN 1-4020-7553-7.

[7] Dixon, L. C. W. (1972). "The Choice of Step Length, a Crucial Factor in the Performance of Variable Metric Algorithms". Numerical Methods for Non-linear Optimization. London: Academic Press. pp. 149–170. ISBN 0-12-455650-7.

[8] Smith, Leslie N. (4 April 2017). "Cyclical Learning Rates for Training Neural Networks". arXiv: 1506.01186 [cs.CV].

[9] Murphy, Kevin (2021). Probabilistic Machine Learning: An Introduction. Probabilistic Machine Learning: An Introduction. MIT Press. Retrieved 10 April 2021.

[10] Brownlee, Jason (22 January 2019). "How to Configure the Learning Rate When Training Deep Learning Neural Networks". Machine Learning Mastery. Retrieved 4 January 2021.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

Learning rate

Contents

Learning rate schedule

Adaptive learning rate

See also

Related Research Articles

References

Further reading

External links