Online machine learning

Last updated November 26, 2024

In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques, which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring out-of-core algorithms. It is also used in situations where the algorithm must dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., prediction of prices in international financial markets. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.

Introduction

In the setting of supervised learning, a function $f:X\to Y$ is to be learned, where $X$ is thought of as a space of inputs and $Y$ as a space of outputs, that predicts well on instances that are drawn from a joint probability distribution $p(x,y)$ on $X\times Y$ . In reality, the learner never knows the true distribution $p(x,y)$ over instances. Instead, the learner usually has access to a training set of examples $(x_{1},y_{1}),\ldots ,(x_{n},y_{n})$ . In this setting, the loss function is given as $V:Y\times Y\to \mathbb {R}$ , such that $V(f(x),y)$ measures the difference between the predicted value $f(x)$ and the true value $y$ . The ideal goal is to select a function $f\in {\mathcal {H}}$ , where ${\mathcal {H}}$ is a space of functions called a hypothesis space, so that some notion of total loss is minimized. Depending on the type of model (statistical or adversarial), one can devise different notions of loss, which lead to different learning algorithms.

Statistical view of online learning

In statistical learning models, the training samples $(x_{i},y_{i})$ are assumed to have been drawn from the true distribution $p(x,y)$ and the objective is to minimize the expected "risk" $I[f]=\mathbb {E} [V(f(x),y)]=\int V(f(x),y)\,dp(x,y)\ .$ A common paradigm in this situation is to estimate a function ${\hat {f}}$ through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms such as regularized least squares and support vector machines. A purely online model in this category would learn based on just the new input $(x_{t+1},y_{t+1})$ , the current best predictor $f_{t}$ and some extra stored information (which is usually expected to have storage requirements independent of training data size). For many formulations, for example nonlinear kernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used where $f_{t+1}$ is permitted to depend on $f_{t}$ and all previous data points $(x_{1},y_{1}),\ldots ,(x_{t},y_{t})$ . In this case, the space requirements are no longer guaranteed to be constant, since all previous data points must be stored, but the solution may take less time to compute with the addition of a new data point than with batch learning techniques.

A common strategy to overcome the above issues is to learn using mini-batches, which process a small batch of $b\geq 1$ data points at a time. This can be considered pseudo-online learning when $b$ is much smaller than the total number of training points. Mini-batch techniques are used with repeated passes over the training data to obtain optimized out-of-core versions of machine learning algorithms, for example, stochastic gradient descent. When combined with backpropagation, this is currently the de facto method for training artificial neural networks.
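As a rough illustration of the mini-batch idea, the sketch below runs repeated passes of mini-batch gradient steps for the square loss in plain Python. The helper names `minibatches` and `minibatch_sgd`, the constant step size `gamma`, and the toy dataset are assumptions made for this example and are not taken from the text.

```python
def minibatches(points, b):
    """Yield consecutive chunks of at most b training points."""
    for start in range(0, len(points), b):
        yield points[start:start + b]


def minibatch_sgd(points, b=2, epochs=300, gamma=0.1):
    """Repeated passes over the data, one averaged gradient step per mini-batch."""
    d = len(points[0][0])
    w = [0.0] * d
    for _ in range(epochs):
        for batch in minibatches(points, b):
            grad = [0.0] * d
            for x, y in batch:
                err = sum(wa * xa for wa, xa in zip(w, x)) - y
                grad = [g + err * xa for g, xa in zip(grad, x)]
            # Step along the batch-averaged gradient of the square loss.
            # The constant factor 2 is absorbed into gamma (an assumption).
            w = [wa - gamma * g / len(batch) for wa, g in zip(w, grad)]
    return w


# Hypothetical toy data generated exactly by w = (2, -1).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
w_mb = minibatch_sgd(data)
print([round(v, 4) for v in w_mb])
```

With $b=1$ this reduces to one gradient step per point, and with $b$ equal to the dataset size it becomes ordinary batch gradient descent.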

Example: linear least squares

The simple example of linear least squares is used to explain a variety of ideas in online learning. The ideas are general enough to be applied to other settings, for example, with other convex loss functions.

Batch learning

Consider the setting of supervised learning with $f$ being a linear function to be learned: $f(x_{j})=\langle w,x_{j}\rangle =w\cdot x_{j}$ where $x_{j}\in \mathbb {R} ^{d}$ is a vector of inputs (data points) and $w\in \mathbb {R} ^{d}$ is a linear filter vector. The goal is to compute the filter vector $w$ . To this end, a square loss function $V(f(x_{j}),y_{j})=(f(x_{j})-y_{j})^{2}=(\langle w,x_{j}\rangle -y_{j})^{2}$ is used to compute the vector $w$ that minimizes the empirical loss $I_{n}[w]=\sum _{j=1}^{n}V(\langle w,x_{j}\rangle ,y_{j})=\sum _{j=1}^{n}(x_{j}^{\mathsf {T}}w-y_{j})^{2}$ where $y_{j}\in \mathbb {R} .$

Let $X$ be the $i\times d$ data matrix and $y\in \mathbb {R} ^{i}$ the column vector of target values after the arrival of the first $i$ data points. Assuming that the covariance matrix $\Sigma _{i}=X^{\mathsf {T}}X$ is invertible (otherwise it is preferable to proceed in a similar fashion with Tikhonov regularization), the best solution $f^{*}(x)=\langle w^{*},x\rangle$ to the linear least squares problem is given by $w^{*}=(X^{\mathsf {T}}X)^{-1}X^{\mathsf {T}}y=\Sigma _{i}^{-1}\sum _{j=1}^{i}x_{j}y_{j}.$

Now, calculating the covariance matrix $\Sigma _{i}=\sum _{j=1}^{i}x_{j}x_{j}^{\mathsf {T}}$ takes time $O(id^{2})$ , inverting the $d\times d$ matrix takes time $O(d^{3})$ , while the rest of the multiplication takes time $O(d^{2})$ , giving a total time of $O(id^{2}+d^{3})$ . When there are $n$ total points in the dataset, recomputing the solution after the arrival of every datapoint $i=1,\ldots ,n$ with the naive approach has a total complexity of $O(n^{2}d^{2}+nd^{3})$ . Note that if the matrix $\Sigma _{i}$ is stored, updating it at each step requires only adding $x_{i+1}x_{i+1}^{\mathsf {T}}$ , which takes $O(d^{2})$ time, reducing the total time to $O(nd^{2}+nd^{3})=O(nd^{3})$ , but with an additional storage space of $O(d^{2})$ to store $\Sigma _{i}$ .^[1]
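The stored-covariance variant described above can be sketched as follows: $\Sigma _{i}$ and $\sum _{j}x_{j}y_{j}$ are kept as running sums, and the $d\times d$ system is re-solved after each new point. The function names, the small Gaussian-elimination helper, and the toy data are assumptions made for this illustration. The sketch also assumes $\Sigma _{i}$ is invertible as soon as $i\geq d$, which holds for the data used here but not in general.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def batch_least_squares_stream(points):
    """Yield w* after each new point, keeping Sigma_i and sum_j x_j y_j as running sums."""
    d = len(points[0][0])
    sigma = [[0.0] * d for _ in range(d)]  # Sigma_i = sum_j x_j x_j^T
    xty = [0.0] * d                        # sum_j x_j y_j
    for i, (x, y) in enumerate(points, start=1):
        for a in range(d):                 # O(d^2) rank-one update
            xty[a] += x[a] * y
            for c in range(d):
                sigma[a][c] += x[a] * x[c]
        if i >= d:                         # assumed invertible from here on
            yield solve(sigma, xty)        # O(d^3) solve at every step


# Hypothetical toy data generated exactly by w = (2, -1).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
for w in batch_least_squares_stream(data):
    print([round(v, 6) for v in w])
```

The $O(d^{3})$ solve inside the loop is what dominates the $O(nd^{3})$ total cost.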

Online learning: recursive least squares

The recursive least squares (RLS) algorithm considers an online approach to the least squares problem. It can be shown that by initialising $\textstyle w_{0}=0\in \mathbb {R} ^{d}$ and $\textstyle \Gamma _{0}=I\in \mathbb {R} ^{d\times d}$ , the solution of the linear least squares problem given in the previous section can be computed, up to the regularising effect of the initial matrix $\Gamma _{0}$ , by the following iteration: $\Gamma _{i}=\Gamma _{i-1}-{\frac {\Gamma _{i-1}x_{i}x_{i}^{\mathsf {T}}\Gamma _{i-1}}{1+x_{i}^{\mathsf {T}}\Gamma _{i-1}x_{i}}}$ $w_{i}=w_{i-1}-\Gamma _{i}x_{i}\left(x_{i}^{\mathsf {T}}w_{i-1}-y_{i}\right)$ The above iteration algorithm can be proved using induction on $i$ .^[2] The proof also shows that $\Gamma _{i}=(\Gamma _{0}^{-1}+\Sigma _{i})^{-1}$ , which approaches $\Sigma _{i}^{-1}$ as the data term $\Sigma _{i}$ comes to dominate $\Gamma _{0}^{-1}$ . RLS can also be viewed in the context of adaptive filters (see recursive least squares filter).

The complexity for $n$ steps of this algorithm is $O(nd^{2})$ , which is a factor of $d$ faster than the corresponding batch learning complexity of $O(nd^{3})$ . The storage requirements at every step $i$ here are to store the matrix $\Gamma _{i}$ , which is constant at $O(d^{2})$ . For the case when $\Sigma _{i}$ is not invertible, consider the regularised version of the problem loss function $\sum _{j=1}^{n}\left(x_{j}^{\mathsf {T}}w-y_{j}\right)^{2}+\lambda \left\|w\right\|_{2}^{2}$ . Then, one can show that the same algorithm works with $\Gamma _{0}=(\lambda I)^{-1}$ , and the iterations proceed to give $\Gamma _{i}=(\Sigma _{i}+\lambda I)^{-1}$ .^[1]
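A minimal sketch of the RLS iteration in plain Python is given below. It initialises $\Gamma _{0}=\lambda ^{-1}I$ , a choice made in this sketch so that $\Gamma _{i}=(\Sigma _{i}+\lambda I)^{-1}$ holds exactly at every step. The function name `rls`, the parameter `lam`, and the data are illustrative assumptions. The code relies on $\Gamma _{i}$ being symmetric, so that $\Gamma x x^{\mathsf {T}}\Gamma =(\Gamma x)(\Gamma x)^{\mathsf {T}}$ .

```python
def rls(points, lam=1.0):
    """Recursive least squares with w_0 = 0 and Gamma_0 = (lam * I)^{-1}."""
    d = len(points[0][0])
    gamma = [[1.0 / lam if a == b else 0.0 for b in range(d)] for a in range(d)]
    w = [0.0] * d
    for x, y in points:
        # gx = Gamma_{i-1} x_i
        gx = [sum(gamma[a][b] * x[b] for b in range(d)) for a in range(d)]
        denom = 1.0 + sum(xa * ga for xa, ga in zip(x, gx))
        # Gamma_i = Gamma_{i-1} - (Gamma_{i-1} x x^T Gamma_{i-1}) / (1 + x^T Gamma_{i-1} x)
        gamma = [[gamma[a][b] - gx[a] * gx[b] / denom for b in range(d)]
                 for a in range(d)]
        # gx = Gamma_i x_i, used in the weight update
        gx = [sum(gamma[a][b] * x[b] for b in range(d)) for a in range(d)]
        err = sum(wa * xa for wa, xa in zip(w, x)) - y
        w = [wa - ga * err for wa, ga in zip(w, gx)]
    return w, gamma


# Hypothetical toy data generated exactly by w = (2, -1); a small lam keeps
# the ridge solution close to the unregularised one.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
w_rls, gamma_rls = rls(data, lam=1e-6)
print([round(v, 4) for v in w_rls])
```

Each step costs $O(d^{2})$ operations and no linear system is solved, which is where the saving over the batch recomputation comes from.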

Stochastic gradient descent

When the update $w_{i}=w_{i-1}-\Gamma _{i}x_{i}\left(x_{i}^{\mathsf {T}}w_{i-1}-y_{i}\right)$ is replaced by $w_{i}=w_{i-1}-\gamma _{i}x_{i}\left(x_{i}^{\mathsf {T}}w_{i-1}-y_{i}\right)=w_{i-1}-\gamma _{i}\nabla V(\langle w_{i-1},x_{i}\rangle ,y_{i})$ , that is, when the matrix $\Gamma _{i}\in \mathbb {R} ^{d\times d}$ is replaced by the scalar $\gamma _{i}\in \mathbb {R}$ , this becomes the stochastic gradient descent algorithm (the constant factor $2$ in the gradient of the square loss is absorbed into $\gamma _{i}$ ). In this case, the complexity for $n$ steps of this algorithm reduces to $O(nd)$ . The storage requirements at every step $i$ are constant at $O(d)$ .

However, the step size $\gamma _{i}$ needs to be chosen carefully to solve the expected risk minimization problem, as detailed above. By choosing a decaying step size $\gamma _{i}\approx {\frac {1}{\sqrt {i}}},$ one can prove the convergence of the average iterate ${\textstyle {\overline {w}}_{n}={\frac {1}{n}}\sum _{i=1}^{n}w_{i}}$ . This setting is a special case of stochastic optimization, a well-known problem in optimization.^[1]
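The following sketch shows one pass of stochastic gradient descent for the square loss with the decaying step size $\gamma _{i}=1/{\sqrt {i}}$ , tracking both the last iterate and the running average ${\overline {w}}_{n}$ . The function name, the synthetic noise-free data stream, and the random seed are assumptions made for the example. It illustrates the recursion and is not a proof of convergence.

```python
import math
import random


def sgd_least_squares(points, step=lambda i: 1.0 / math.sqrt(i)):
    """One pass of SGD for the square loss; returns the last and the averaged iterate."""
    d = len(points[0][0])
    w = [0.0] * d
    w_bar = [0.0] * d
    for i, (x, y) in enumerate(points, start=1):
        err = sum(wa * xa for wa, xa in zip(w, x)) - y
        gamma = step(i)
        # w_i = w_{i-1} - gamma_i * x_i * (x_i^T w_{i-1} - y_i)
        w = [wa - gamma * err * xa for wa, xa in zip(w, x)]
        # running mean of w_1, ..., w_i without storing the iterates
        w_bar = [b + (wa - b) / i for b, wa in zip(w_bar, w)]
    return w, w_bar


# Hypothetical stream: inputs uniform on [-1, 1]^2, targets from w = (2, -1).
rng = random.Random(0)
w_true = [2.0, -1.0]
stream = []
for _ in range(2000):
    x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    stream.append((x, sum(a * b for a, b in zip(w_true, x))))

w_last, w_avg = sgd_least_squares(stream)
print([round(v, 3) for v in w_last], [round(v, 3) for v in w_avg])
```

Only the current weight vector and its running mean are stored, which matches the $O(d)$ storage stated above.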

Incremental stochastic gradient descent

In practice, one can perform multiple stochastic gradient passes (also called cycles or epochs) over the data. The algorithm thus obtained is called the incremental gradient method and corresponds to the iteration $w_{i}=w_{i-1}-\gamma _{i}\nabla V(\langle w_{i-1},x_{t_{i}}\rangle ,y_{t_{i}})$ The main difference with the stochastic gradient method is that here a sequence $t_{i}$ is chosen to decide which training point is visited in the $i$ -th step. Such a sequence can be stochastic or deterministic. The number of iterations is then decoupled from the number of points (each point can be considered more than once). The incremental gradient method can be shown to provide a minimizer to the empirical risk.^[3] Incremental techniques can be advantageous when considering objective functions made up of a sum of many terms, e.g. an empirical error corresponding to a very large dataset.^[1]
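A small sketch of the incremental gradient method follows. The index sequence $t_{i}$ is either a fixed cyclic order or a fresh shuffle per epoch. The function name, the constant step size, the number of epochs, and the toy data are assumptions chosen so that the example converges; they are not prescribed by the text.

```python
import random


def incremental_gradient(points, epochs=200, gamma=0.1, shuffle=False, seed=0):
    """Multiple passes over a finite training set for the square loss."""
    rng = random.Random(seed)
    d = len(points[0][0])
    w = [0.0] * d
    order = list(range(len(points)))      # the sequence t_i within one epoch
    for _ in range(epochs):
        if shuffle:
            rng.shuffle(order)            # stochastic choice of t_i
        for t in order:                   # deterministic (cyclic) if not shuffled
            x, y = points[t]
            err = sum(wa * xa for wa, xa in zip(w, x)) - y
            w = [wa - gamma * err * xa for wa, xa in zip(w, x)]
    return w


# Hypothetical toy data generated exactly by w = (2, -1).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
w_inc = incremental_gradient(data)
print([round(v, 6) for v in w_inc])
```

Here the number of updates is `epochs * len(points)`, so each training point is visited many times.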

Kernel methods

Kernels can be used to extend the above algorithms to non-parametric models (or models where the parameters form an infinite dimensional space). The corresponding procedure will no longer be truly online and instead involves storing all the data points, but is still faster than the brute force method. This discussion is restricted to the case of the square loss, though it can be extended to any convex loss. It can be shown by an easy induction ^[1] that if $X_{i}$ is the data matrix and $w_{i}$ is the output after $i$ steps of the SGD algorithm, then $w_{i}=X_{i}^{\mathsf {T}}c_{i}$ , where $c_{i}=((c_{i})_{1},(c_{i})_{2},...,(c_{i})_{i})\in \mathbb {R} ^{i}$ and the sequence $c_{i}$ satisfies the recursion: $c_{0}=0$ , $(c_{i})_{j}=(c_{i-1})_{j},j=1,2,...,i-1$ and $(c_{i})_{i}=\gamma _{i}{\Big (}y_{i}-\sum _{j=1}^{i-1}(c_{i-1})_{j}\langle x_{j},x_{i}\rangle {\Big )}$ Notice that here $\langle x_{j},x_{i}\rangle$ is just the standard kernel on $\mathbb {R} ^{d}$ , and the predictor is of the form $f_{i}(x)=\langle w_{i-1},x\rangle =\sum _{j=1}^{i-1}(c_{i-1})_{j}\langle x_{j},x\rangle .$

Now, if a general kernel $K$ is introduced instead and the predictor is taken to be $f_{i}(x)=\sum _{j=1}^{i-1}(c_{i-1})_{j}K(x_{j},x)$ , then the same proof will also show that the predictor minimising the least squares loss is obtained by changing the above recursion to $(c_{i})_{i}=\gamma _{i}{\Big (}y_{i}-\sum _{j=1}^{i-1}(c_{i-1})_{j}K(x_{j},x_{i}){\Big )}$ The above expression requires storing all the data for updating $c_{i}$ . The total time complexity for the recursion when evaluating for the $n$ -th datapoint is $O(n^{2}dk)$ , where $k$ is the cost of evaluating the kernel on a single pair of points.^[1] Thus, the use of the kernel has allowed the movement from a finite dimensional parameter space $\textstyle w_{i}\in \mathbb {R} ^{d}$ to a possibly infinite dimensional feature space represented by a kernel $K$ , by instead performing the recursion on the space of parameters $\textstyle c_{i}\in \mathbb {R} ^{i}$ , whose dimension is the same as the size of the training dataset. In general, this is a consequence of the representer theorem.^[1]
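The coefficient recursion can be sketched in a few lines. The function names, the Gaussian kernel with its `width` parameter, and the data are assumptions for illustration. With the linear kernel $K(a,b)=\langle a,b\rangle$ the sketch reproduces the predictions of linear SGD, which gives a simple sanity check.

```python
import math


def gaussian_kernel(a, b, width=1.0):
    """A hypothetical choice of K; any positive-definite kernel could be used."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * width ** 2))


def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def kernel_sgd(points, kernel, step=lambda i: 1.0 / math.sqrt(i)):
    """Kernelised SGD for the square loss; stores every point seen so far."""
    xs, c = [], []
    for i, (x, y) in enumerate(points, start=1):
        # f_i(x_i) = sum_{j<i} (c_{i-1})_j K(x_j, x_i)
        pred = sum(cj * kernel(xj, x) for cj, xj in zip(c, xs))
        c.append(step(i) * (y - pred))    # new coefficient (c_i)_i
        xs.append(x)                      # older coefficients are unchanged

    def predict(x):
        return sum(cj * kernel(xj, x) for cj, xj in zip(c, xs))

    return c, predict


pts = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
coef, f_lin = kernel_sgd(pts, linear_kernel, step=lambda i: 0.5)
_, f_rbf = kernel_sgd(pts, gaussian_kernel)
print(coef, f_lin([2.0, 1.0]), f_rbf([2.0, 1.0]))
```

The list of coefficients grows by one entry per data point, which is the storage cost noted above.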

Online convex optimization

Online convex optimization (OCO) ^[4] is a general framework for decision making which leverages convex optimization to allow for efficient algorithms. The framework is that of repeated game playing as follows:

For $t=1,2,...,T$

Learner receives input $x_{t}$
Learner outputs $w_{t}$ from a fixed convex set $S$
Nature sends back a convex loss function $v_{t}:S\rightarrow \mathbb {R}$ .
Learner suffers loss $v_{t}(w_{t})$ and updates its model

The goal is to minimize regret, or the difference between cumulative loss and the loss of the best fixed point $u\in S$ in hindsight. As an example, consider the case of online least squares linear regression. Here, the weight vectors come from the convex set $S=\mathbb {R} ^{d}$ , and nature sends back the convex loss function $v_{t}(w)=(\langle w,x_{t}\rangle -y_{t})^{2}$ . Note here that $y_{t}$ is implicitly sent with $v_{t}$ .
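To make the notion of regret concrete, the sketch below plays the protocol for online least squares in one dimension and compares the learner's cumulative loss with that of the best fixed point in hindsight. The learner shown, a plain gradient step with a hypothetical step size `eta`, is just one possible strategy, and the function name and data stream are assumptions for the example.

```python
def online_least_squares_1d(stream, eta=0.1):
    """Play the OCO protocol with S = R and v_t(w) = (w * x_t - y_t)^2."""
    w, cumulative = 0.0, 0.0
    for x, y in stream:
        cumulative += (w * x - y) ** 2       # learner suffers v_t(w_t)
        w -= eta * 2.0 * x * (w * x - y)     # one possible update rule
    # Best fixed point in hindsight for the square loss in one dimension.
    u = sum(x * y for x, y in stream) / sum(x * x for x, _ in stream)
    best = sum((u * x - y) ** 2 for x, y in stream)
    return cumulative, best, cumulative - best


# Hypothetical stream with y_t = 3 * x_t, so the best fixed point is u = 3.
stream = [(1.0, 3.0), (2.0, 6.0), (1.0, 3.0), (2.0, 6.0)]
cum_loss, best_loss, regret = online_least_squares_1d(stream)
print(cum_loss, best_loss, regret)
```

The comparator $u$ is computed only after the whole sequence has been seen, which is what "in hindsight" refers to.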

Some online prediction problems, however, cannot fit in the framework of OCO. For example, in online classification, the prediction domain and the loss functions are not convex. In such scenarios, two simple techniques for convexification are used: randomisation and surrogate loss functions.

Some simple online convex optimisation algorithms are:

Follow the leader (FTL)

The simplest learning rule to try is to select (at the current step) the hypothesis that has the least loss over all past rounds. This algorithm is called Follow the leader, and the choice in round $t$ is simply given by: $w_{t}=\mathop {\operatorname {arg\,min} } _{w\in S}\sum _{i=1}^{t-1}v_{i}(w)$ This method can thus be viewed as a greedy algorithm. For the case of online quadratic optimization (where the loss function is $v_{t}(w)=\left\|w-x_{t}\right\|_{2}^{2}$ ), one can show a regret bound that grows as $\log(T)$ . However, similar bounds cannot be obtained for the FTL algorithm for other important families of models like online linear optimization. To do so, one modifies FTL by adding regularisation.
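For the quadratic loss $v_{t}(w)=\left\|w-x_{t}\right\|_{2}^{2}$ the FTL minimiser is the mean of the points seen so far, so the rule can be sketched with a running sum. The function name, the arbitrary first play $w_{1}=0$ , and the alternating test sequence are assumptions of this example.

```python
import math


def ftl_quadratic_regret(xs):
    """Follow the leader for v_t(w) = ||w - x_t||^2; returns the regret."""
    d = len(xs[0])
    total = [0.0] * d          # running sum x_1 + ... + x_{t-1}
    w = [0.0] * d              # first play is arbitrary (assumed to be 0)
    cumulative = 0.0
    for t, x in enumerate(xs, start=1):
        cumulative += sum((wa - xa) ** 2 for wa, xa in zip(w, x))
        total = [s + xa for s, xa in zip(total, x)]
        # argmin_w sum_{i<=t} ||w - x_i||^2 is the mean of x_1..x_t
        w = [s / t for s in total]
    mean = [s / len(xs) for s in total]   # best fixed point in hindsight
    best = sum(sum((ma - xa) ** 2 for ma, xa in zip(mean, x)) for x in xs)
    return cumulative - best


# Hypothetical sequence alternating between +1 and -1 in one dimension.
T = 100
xs = [[1.0] if t % 2 == 0 else [-1.0] for t in range(T)]
regret_ftl = ftl_quadratic_regret(xs)
print(regret_ftl, math.log(T))
```

On this sequence the regret stays small compared with $T$ , in line with the logarithmic growth stated above.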

Follow the regularised leader (FTRL)

This is a natural modification of FTL that is used to stabilise the FTL solutions and obtain better regret bounds. A regularisation function $R:S\to \mathbb {R}$ is chosen and learning is performed in round $t$ as follows: $w_{t}=\mathop {\operatorname {arg\,min} } _{w\in S}\sum _{i=1}^{t-1}v_{i}(w)+R(w)$ As a special example, consider the case of online linear optimisation, i.e. where nature sends back loss functions of the form $v_{t}(w)=\langle w,z_{t}\rangle$ . Also, let $S=\mathbb {R} ^{d}$ . Suppose the regularisation function ${\textstyle R(w)={\frac {1}{2\eta }}\left\|w\right\|_{2}^{2}}$ is chosen for some positive number $\eta$ . Then, one can show that the regret minimising iteration becomes $w_{t+1}=-\eta \sum _{i=1}^{t}z_{i}=w_{t}-\eta z_{t}$ Note that this can be rewritten as $w_{t+1}=w_{t}-\eta \nabla v_{t}(w_{t})$ , which looks exactly like online gradient descent.
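The closed form of the iterate follows from the first-order optimality condition of the regularised objective, which is strongly convex and differentiable when $S=\mathbb {R} ^{d}$ . A short worked derivation, using only the symbols defined above:

```latex
\begin{aligned}
w_{t+1} &= \operatorname*{arg\,min}_{w\in\mathbb{R}^{d}}\;
           \sum_{i=1}^{t}\langle w,z_{i}\rangle+\frac{1}{2\eta}\left\|w\right\|_{2}^{2},\\
0 &= \sum_{i=1}^{t}z_{i}+\frac{1}{\eta}\,w_{t+1}
\quad\Longrightarrow\quad
w_{t+1}=-\eta\sum_{i=1}^{t}z_{i}
       =\Big(-\eta\sum_{i=1}^{t-1}z_{i}\Big)-\eta z_{t}
       =w_{t}-\eta z_{t}.
\end{aligned}
```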

If $S$ is instead some convex subset of $\mathbb {R} ^{d}$ , the iterate needs to be projected onto $S$ , leading to the modified update rule $w_{t+1}=\Pi _{S}(-\eta \sum _{i=1}^{t}z_{i})=\Pi _{S}(-\eta \theta _{t+1})$ , where $\theta _{t+1}=\sum _{i=1}^{t}z_{i}$ . This algorithm is known as lazy projection, as the vector $\theta _{t+1}$ accumulates the gradients. It is also known as Nesterov's dual averaging algorithm. In this scenario of linear loss functions and quadratic regularisation, the regret is bounded by $O({\sqrt {T}})$ , and thus the average regret goes to $0$ as desired.
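A sketch of lazy projection for linear losses is given below, taking $S$ to be a Euclidean ball so that the projection has a closed form. In this sketch $\theta$ accumulates the vectors $z_{t}$ and the learner plays the projection of $-\eta \theta$ . The choice of $S$ , the function names, the step size $\eta =1/{\sqrt {2T}}$ , and the loss sequence are all assumptions made for the example.

```python
import math


def project_l2_ball(v, radius=1.0):
    """Euclidean projection onto {w : ||w||_2 <= radius}."""
    norm = math.sqrt(sum(a * a for a in v))
    return list(v) if norm <= radius else [a * radius / norm for a in v]


def lazy_projection(zs, eta, radius=1.0):
    """Return the plays w_1, ..., w_T for linear losses v_t(w) = <w, z_t>."""
    d = len(zs[0])
    theta = [0.0] * d                     # accumulated gradients
    w = [0.0] * d
    plays = []
    for z in zs:
        plays.append(w)
        theta = [th + za for th, za in zip(theta, z)]
        w = project_l2_ball([-eta * th for th in theta], radius)
    return plays


def regret_linear(plays, zs, radius=1.0):
    """Regret against the best fixed point of the ball in hindsight."""
    learner = sum(sum(wa * za for wa, za in zip(w, z)) for w, z in zip(plays, zs))
    total = [sum(col) for col in zip(*zs)]
    best = -radius * math.sqrt(sum(s * s for s in total))  # min over the ball
    return learner - best


# Hypothetical loss sequence: the same unit vector in every round.
T = 100
zs = [[1.0, 0.0] for _ in range(T)]
eta = 1.0 / math.sqrt(2.0 * T)
plays = lazy_projection(zs, eta)
regret_lazy = regret_linear(plays, zs)
print(regret_lazy, math.sqrt(2.0 * T))
```

Because only the accumulated vector is projected, the unprojected sum keeps growing even once the plays sit on the boundary of $S$ .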

Online subgradient descent (OSD)

The above gives a regret bound for linear loss functions $v_{t}(w)=\langle w,z_{t}\rangle$ . To generalise the algorithm to any convex loss function, a subgradient $z_{t}\in \partial v_{t}(w_{t})$ of $v_{t}$ is used as a linear approximation to $v_{t}$ near $w_{t}$ , leading to the online subgradient descent algorithm:

Initialise parameter $\eta$ , $w_{1}=0$ , $\theta _{1}=0$

For $t=1,2,...,T$

Predict using $w_{t}$ , receive $v_{t}$ from nature.
Choose $z_{t}\in \partial v_{t}(w_{t})$
If $S=\mathbb {R} ^{d}$ , update as $w_{t+1}=w_{t}-\eta z_{t}$
If $S\subset \mathbb {R} ^{d}$ , project cumulative gradients onto $S$ i.e. $w_{t+1}=\Pi _{S}(-\eta \theta _{t+1}),\theta _{t+1}=\theta _{t}+z_{t}$

One can use the OSD algorithm to derive $O({\sqrt {T}})$ regret bounds for the online version of SVMs for classification, which use the hinge loss $v_{t}(w)=\max\{0,1-y_{t}(w\cdot x_{t})\}$ .
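The unconstrained case $S=\mathbb {R} ^{d}$ of online subgradient descent with the hinge loss can be sketched as follows. When the margin is below $1$ the vector $-y_{t}x_{t}$ is a subgradient, and otherwise $0$ is used. The function name, the step size, and the linearly separable toy stream are assumptions of this example.

```python
def osd_hinge(stream, eta=0.1):
    """Online subgradient descent on v_t(w) = max(0, 1 - y_t <w, x_t>), S = R^d."""
    d = len(stream[0][0])
    w = [0.0] * d
    cumulative = 0.0
    for x, y in stream:
        margin = y * sum(wa * xa for wa, xa in zip(w, x))
        cumulative += max(0.0, 1.0 - margin)   # learner suffers v_t(w_t)
        if margin < 1.0:
            # z_t = -y_t x_t is a subgradient here, so w_{t+1} = w_t + eta y_t x_t
            w = [wa + eta * y * xa for wa, xa in zip(w, x)]
        # otherwise z_t = 0 is a valid subgradient and w is unchanged
    return w, cumulative


# Hypothetical separable stream with labels in {-1, +1}, repeated ten times.
base = [([1.0, 1.0], 1.0), ([-1.0, -1.0], -1.0), ([2.0, 0.0], 1.0), ([0.0, -2.0], -1.0)]
w_svm, hinge_total = osd_hinge(base * 10)
print([round(v, 3) for v in w_svm], round(hinge_total, 3))
```

At a margin of exactly $1$ any point between $-y_{t}x_{t}$ and $0$ is a subgradient. The sketch picks $0$ there.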

Other algorithms

Quadratically regularised FTRL algorithms lead to lazily projected gradient algorithms as described above. To use the above for arbitrary convex functions and regularisers, one uses online mirror descent. The optimal regularization in hindsight can be derived for linear loss functions; this leads to the AdaGrad algorithm. For the Euclidean regularisation, one can show a regret bound of $O({\sqrt {T}})$ , which can be improved further to $O(\log T)$ for strongly convex and exp-concave loss functions.

Continual learning

Continual learning means constantly improving the learned model by processing continuous streams of information.^[5] Continual learning capabilities are essential for software systems and autonomous agents interacting in an ever changing real world. However, continual learning is a challenge for machine learning and neural network models since the continual acquisition of incrementally available information from non-stationary data distributions generally leads to catastrophic forgetting.

Interpretations of online learning

The paradigm of online learning has different interpretations depending on the choice of the learning model, each of which has distinct implications about the predictive quality of the sequence of functions $f_{1},f_{2},\ldots ,f_{n}$ . The prototypical stochastic gradient descent algorithm is used for this discussion. As noted above, its recursion is given by $w_{t}=w_{t-1}-\gamma _{t}\nabla V(\langle w_{t-1},x_{t}\rangle ,y_{t})$

The first interpretation considers the stochastic gradient descent method as applied to the problem of minimizing the expected risk $I[w]$ defined above.^[6] Indeed, in the case of an infinite stream of data, since the examples $(x_{1},y_{1}),(x_{2},y_{2}),\ldots$ are assumed to be drawn i.i.d. from the distribution $p(x,y)$ , the sequence of gradients of $V(\cdot ,\cdot )$ in the above iteration is an i.i.d. sample of stochastic estimates of the gradient of the expected risk $I[w]$ , and therefore one can apply complexity results for the stochastic gradient descent method to bound the deviation $I[w_{t}]-I[w^{\ast }]$ , where $w^{\ast }$ is the minimizer of $I[w]$ .^[7] This interpretation is also valid in the case of a finite training set; although with multiple passes through the data the gradients are no longer independent, complexity results can still be obtained in special cases.

The second interpretation applies to the case of a finite training set and considers the SGD algorithm as an instance of incremental gradient descent method.^[3] In this case, one instead looks at the empirical risk: $I_{n}[w]={\frac {1}{n}}\sum _{i=1}^{n}V(\langle w,x_{i}\rangle ,y_{i})\ .$ Since the gradients of $V(\cdot ,\cdot )$ in the incremental gradient descent iterations are also stochastic estimates of the gradient of $I_{n}[w]$ , this interpretation is also related to the stochastic gradient descent method, but applied to minimize the empirical risk as opposed to the expected risk. Since this interpretation concerns the empirical risk and not the expected risk, multiple passes through the data are readily allowed and actually lead to tighter bounds on the deviations $I_{n}[w_{t}]-I_{n}[w_{n}^{\ast }]$ , where $w_{n}^{\ast }$ is the minimizer of $I_{n}[w]$ .

Implementations

Vowpal Wabbit: Open-source fast out-of-core online learning system which is notable for supporting a number of machine learning reductions, importance weighting and a selection of different loss functions and optimisation algorithms. It uses the hashing trick for bounding the size of the set of features independent of the amount of training data.
scikit-learn: Provides out-of-core implementations of algorithms for
- Classification: Perceptron, SGD classifier, Naive Bayes classifier.
- Regression: SGD Regressor, Passive Aggressive regressor.
- Clustering: Mini-batch k-means.
- Feature extraction: Mini-batch dictionary learning, Incremental PCA.

References

[1] L. Rosasco, T. Poggio, Machine Learning: a Regularization Approach, MIT-9.520 Lectures Notes, Manuscript, Dec. 2015. Chapter 7 - Online Learning
[2] Kushner, Harold J.; Yin, G. George (2003). Stochastic Approximation and Recursive Algorithms with Applications (Second ed.). New York: Springer. pp. 8–12. ISBN 978-0-387-21769-7.
[3] Bertsekas, D. P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optimization for Machine Learning, 85.
[4] Hazan, Elad (2015). Introduction to Online Convex Optimization (PDF). Foundations and Trends in Optimization.
[5] Parisi, German I.; Kemker, Ronald; Part, Jose L.; Kanan, Christopher; Wermter, Stefan (2019). "Continual lifelong learning with neural networks: A review". Neural Networks. 113: 54–71. arXiv:1802.07569. doi:10.1016/j.neunet.2019.01.012. ISSN 0893-6080.
[6] Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks. Cambridge University Press. ISBN 978-0-521-65263-6.
[7] Stochastic Approximation Algorithms and Applications, Harold J. Kushner and G. George Yin, New York: Springer-Verlag, 1997. ISBN 0-387-94916-X; 2nd ed., titled Stochastic Approximation and Recursive Algorithms and Applications, 2003, ISBN 0-387-00894-2.

External links

6.883: Online Methods in Machine Learning: Theory and Applications. Alexander Rakhlin. MIT
