Delta rule

Last updated October 26, 2023

In machine learning, the delta rule is a gradient descent learning rule for updating the weights of the inputs to artificial neurons in a single-layer neural network.^[1] It is a special case of the more general backpropagation algorithm. For a neuron $j$ with activation function $g(x)$ , the delta rule for neuron $j$ 's $i$ -th weight $w_{ji}$ is given by

While the delta rule is similar to the perceptron's update rule, the derivation is different. The perceptron uses the Heaviside step function as the activation function $g(h)$ , and that means that $g'(h)$ does not exist at zero, and is equal to zero elsewhere, which makes the direct application of the delta rule impossible.

Derivation of the delta rule

The delta rule is derived by attempting to minimize the error in the output of the neural network through gradient descent. The error for a neural network with $j$ outputs can be measured as

E=\sum _{j}{\tfrac {1}{2}}\left(t_{j}-y_{j}\right)^{2}.

In this case, we wish to move through "weight space" of the neuron (the space of all possible values of all of the neuron's weights) in proportion to the gradient of the error function with respect to each weight. In order to do that, we calculate the partial derivative of the error with respect to each weight. For the $i$ th weight, this derivative can be written as

{\frac {\partial E}{\partial w_{ji}}}.

Because we are only concerning ourselves with the $j$ -th neuron, we can substitute the error formula above while omitting the summation:

{\frac {\partial E}{\partial w_{ji}}}={\frac {\partial }{\partial w_{ji}}}\left[{\frac {1}{2}}\left(t_{j}-y_{j}\right)^{2}\right]

Next we use the chain rule to split this into two derivatives:

{\frac {\partial E}{\partial w_{ji}}}={\frac {\partial \left({\frac {1}{2}}\left(t_{j}-y_{j}\right)^{2}\right)}{\partial y_{j}}}{\frac {\partial y_{j}}{\partial w_{ji}}}

To find the left derivative, we simply apply the power rule and the chain rule:

{\frac {\partial E}{\partial w_{ji}}}=-\left(t_{j}-y_{j}\right){\frac {\partial y_{j}}{\partial w_{ji}}}

To find the right derivative, we again apply the chain rule, this time differentiating with respect to the total input to $j$ , $h_{j}$ :

{\frac {\partial E}{\partial w_{ji}}}=-\left(t_{j}-y_{j}\right){\frac {\partial y_{j}}{\partial h_{j}}}{\frac {\partial h_{j}}{\partial w_{ji}}}

Note that the output of the $j$ th neuron, $y_{j}$ , is just the neuron's activation function $g$ applied to the neuron's input $h_{j}$ . We can therefore write the derivative of $y_{j}$ with respect to $h_{j}$ simply as $g$ 's first derivative:

{\frac {\partial E}{\partial w_{ji}}}=-\left(t_{j}-y_{j}\right)g'(h_{j}){\frac {\partial h_{j}}{\partial w_{ji}}}

Next we rewrite $h_{j}$ in the last term as the sum over all $k$ weights of each weight $w_{jk}$ times its corresponding input $x_{k}$ :

{\frac {\partial E}{\partial w_{ji}}}=-\left(t_{j}-y_{j}\right)g'(h_{j})\;{\frac {\partial }{\partial w_{ji}}}\!\!\left[\sum _{i}x_{i}w_{ji}\right]

Because we are only concerned with the $i$ th weight, the only term of the summation that is relevant is $x_{i}w_{ji}$ . Clearly,

{\frac {\partial (x_{i}w_{ji})}{\partial w_{ji}}}=x_{i}.

giving us our final equation for the gradient:

{\frac {\partial E}{\partial w_{ji}}}=-\left(t_{j}-y_{j}\right)g'(h_{j})x_{i}

As noted above, gradient descent tells us that our change for each weight should be proportional to the gradient. Choosing a proportionality constant $\alpha$ and eliminating the minus sign to enable us to move the weight in the negative direction of the gradient to minimize error, we arrive at our target equation:

\Delta w_{ji}=\alpha (t_{j}-y_{j})g'(h_{j})x_{i}.

Related Research Articles

In calculus, the chain rule is a formula that expresses the derivative of the composition of two differentiable functions $f$ and $g$ in terms of the derivatives of $f$ and $g$ . More precisely, if $is the function such that for every x, then the chain rule is, in Lagrange's notation,$

In mathematics, the derivative shows the sensitivity of change of a function's output with respect to the input. Derivatives are a fundamental tool of calculus. For example, the derivative of the position of a moving object with respect to time is the object's velocity: this measures how quickly the position of the object changes when time advances.

<span class="mw-page-title-main">Gradient</span> Multivariate derivative (mathematics)

In vector calculus, the gradient of a scalar-valued differentiable function $of several variables is the vector field whose value at a point is the "direction and rate of fastest increase". The gradient transforms like a vector under change of basis of the space of variables of . If the gradient of a function is non-zero at a point, the direction of the gradient is the direction in which the function increases most quickly from, and the magnitude of the gradient is the rate of increase in that direction, the greatest absolute directional derivative. Further, a point where the gradient is the zero vector is known as a stationary point. The gradient thus plays a fundamental role in optimization theory, where it is used to maximize a function by gradient ascent. In coordinate-free terms, the gradient of a function may be defined by:$

In mathematical physics, the Dirac delta distribution, also known as the unit impulse, is a generalized function or distribution over the real numbers, whose value is zero everywhere except at zero, and whose integral over the entire real line is equal to one.

<span class="mw-page-title-main">Taylor's theorem</span> Approximation of a function by a truncated power series

In calculus, Taylor's theorem gives an approximation of a $-times differentiable function around a given point by a polynomial of degree, called the -th-order Taylor polynomial . For a smooth function, the Taylor polynomial is the truncation at the order of the Taylor series of the function. The first-order Taylor polynomial is the linear approximation of the function, and the second-order Taylor polynomial is often referred to as the quadratic approximation . There are several versions of Taylor's theorem, some giving explicit estimates of the approximation error of the function by its Taylor polynomial.$

In mathematics, a partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant. Partial derivatives are used in vector calculus and differential geometry.

In calculus, the product rule is a formula used to find the derivatives of products of two or more functions. For two functions, it may be stated in Lagrange's notation as

In the calculus of variations, a field of mathematical analysis, the functional derivative relates a change in a functional to a change in a function on which the functional depends.

In mathematics, the Hodge star operator or Hodge star is a linear map defined on the exterior algebra of a finite-dimensional oriented vector space endowed with a nondegenerate symmetric bilinear form. Applying the operator to an element of the algebra produces the Hodge dual of the element. This map was introduced by W. V. D. Hodge.

In mathematics, the covariant derivative is a way of specifying a derivative along tangent vectors of a manifold. Alternatively, the covariant derivative is a way of introducing and working with a connection on a manifold by means of a differential operator, to be contrasted with the approach given by a principal connection on the frame bundle – see affine connection. In the special case of a manifold isometrically embedded into a higher-dimensional Euclidean space, the covariant derivative can be viewed as the orthogonal projection of the Euclidean directional derivative onto the manifold's tangent space. In this case the Euclidean derivative is broken into two parts, the extrinsic normal component and the intrinsic covariant derivative component.

A directional derivative is a concept in multivariable calculus that measures the rate at which a function changes in a particular direction at a given point.

The Gauss–Newton algorithm is used to solve non-linear least squares problems, which is equivalent to minimizing a sum of squared function values. It is an extension of Newton's method for finding a minimum of a non-linear function. Since a sum of squares must be nonnegative, the algorithm can be viewed as using Newton's method to iteratively approximate zeroes of the components of the sum, and thus minimizing the sum. In this sense, the algorithm is also an effective method for solving overdetermined systems of equations. It has the advantage that second derivatives, which can be challenging to compute, are not required.

As a machine-learning algorithm, backpropagation performs a backward pass to adjust a neural network model's parameters, aiming to minimize the mean squared error (MSE). In a multi-layered network, backpropagation uses the following steps:

Propagate training data through the model from input to predicted output by computing the successive hidden layers' outputs and finally the final layer's output.
Adjust the model weights to reduce the error relative to the weights.
1. The error is typically the squared difference between prediction and target.
2. For each weight, the slope or derivative of the error is found, and the weight adjusted by a negative multiple of this derivative, so as to go downslope toward the minimum-error configuration.
3. This derivative is easy to calculate for final layer weights, and possible to calculate for one layer given the next layer's derivatives. Starting at the end, then, the derivatives are calculated layer by layer toward the beginning -- thus "backpropagation".
Repeatedly update the weights until they converge or the model has undergone enough iterations.

A feedforward neural network (FNN) is one of the two broad types of artificial neural network, characterized by direction of the flow of information between its layers. Its flow is uni-directional, meaning that the information in the model flows in only one direction—forward—from the input nodes, through the hidden nodes and to the output nodes, without any cycles or loops, in contrast to recurrent neural networks, which have a bi-directional flow. Modern feedforward networks are trained using the backpropagation method and are colloquially referred to as the "vanilla" neural networks.

In differential geometry, the four-gradient $is the four-vector analogue of the gradient from vector calculus.$

In differential geometry, a tensor density or relative tensor is a generalization of the tensor field concept. A tensor density transforms as a tensor field when passing from one coordinate system to another, except that it is additionally multiplied or weighted by a power W of the Jacobian determinant of the coordinate transition function or its absolute value. A tensor density with a single index is called a vector density. A distinction is made among (authentic) tensor densities, pseudotensor densities, even tensor densities and odd tensor densities. Sometimes tensor densities with a negative weight W are called tensor capacity. A tensor density can also be regarded as a section of the tensor product of a tensor bundle with a density bundle.

In mathematics, the derivative is a fundamental construction of differential calculus and admits many possible generalizations within the fields of mathematical analysis, combinatorics, algebra, geometry, etc.

A multilayer perceptron (MLP) is a misnomer for a modern feedforward artificial neural network, consisting of fully connected neurons with a nonlinear kind of activation function, organized in at least three layers, notable for being able to distinguish data that is not linearly separable. It is a misnomer because the original perceptron used a Heaviside step function, instead of a nonlinear kind of activation function.

In calculus, the Leibniz integral rule for differentiation under the integral sign states that for an integral of the form

Batch normalization is a method used to make training of artificial neural networks faster and more stable through normalization of the layers' inputs by re-centering and re-scaling. It was proposed by Sergey Ioffe and Christian Szegedy in 2015.

References

↑ Russell, Ingrid. "The Delta Rule". University of Hartford. Archived from the original on 4 March 2016. Retrieved 5 November 2012.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Russell, Ingrid. "The Delta Rule". University of Hartford. Archived from the original on 4 March 2016. Retrieved 5 November 2012.

[1]

Delta rule

Contents

Derivation of the delta rule

See also

Related Research Articles

References