Learning rule

Last updated October 28, 2024

An artificial neural network's learning rule or learning process is a method, mathematical logic or algorithm that improves the network's performance and/or training time. The rule is usually applied repeatedly over the network, updating its weights and biases as the network is simulated in a specific data environment.^[1] A learning rule may take the existing state of the network (its weights and biases) and compare the expected result with the actual result to produce new, improved values for the weights and biases.^[2] Depending on the complexity of the model being simulated, the learning rule of the network can be as simple as an XOR gate or a mean squared error computation, or as complex as the result of a system of differential equations.

Background

Many learning methods in machine learning work similarly and build on one another, which makes it difficult to sort them into clear categories. They can nevertheless be broadly grouped into four categories, although the boundaries are not sharp and a given method may belong to more than one category:^[3]

Hebbian - Neocognitron, Brain-state-in-a-box^[4]
Gradient Descent - ADALINE, Hopfield Network, Recurrent Neural Network
Competitive - Learning Vector Quantisation, Self-Organising Feature Map, Adaptive Resonance Theory
Stochastic - Boltzmann Machine, Cauchy Machine

Although these learning rules may appear to be based on similar ideas, they differ in subtle ways, since each is a generalisation or application of an earlier rule. It therefore makes sense to study them separately, according to their origins and intents.

Hebbian Learning

Hebbian learning was developed by Donald Hebb in 1949 to describe biological neuron firing. In the mid-1950s it was also applied to computer simulations of neural networks.

$\Delta w_{i}=\eta x_{i}y$

where $\eta$ is the learning rate, $x_{i}$ is the i-th input to the neuron, $w_{i}$ is the weight on that input, and $y$ is the output of the neuron. It has been shown that Hebb's rule in its basic form is unstable. Oja's rule and BCM theory are other learning rules built on top of or alongside Hebb's rule in the study of biological neurons.
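The instability of the basic rule can be illustrated with a minimal sketch. It assumes a linear neuron whose output is $y=\sum_i w_i x_i$ (the source does not fix an activation), and the patterns, initial weights, and helper names `hebbian_step` and `norm` are illustrative choices only. Because nothing in the update ever shrinks a weight, the weight norm keeps growing as the patterns are presented repeatedly.

```python
import math

def hebbian_step(w, x, eta):
    """One plain Hebbian update: w_i <- w_i + eta * x_i * y."""
    # Assumption: a linear neuron, so the output is the weighted sum.
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * xi * y for wi, xi in zip(w, x)]

def norm(w):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(wi * wi for wi in w))

# Illustrative input patterns and starting weights (not from the source).
patterns = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
w = [0.1, 0.1]
initial_norm = norm(w)

norms = []
for epoch in range(20):
    for x in patterns:
        w = hebbian_step(w, x, eta=0.1)
    norms.append(norm(w))  # record the weight norm after each epoch

final_norm = norms[-1]
print("initial norm:", initial_norm)
print("final norm:  ", final_norm)
```

Rules such as Oja's rule add a normalising term to counter this unbounded growth.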

Perceptron Learning Rule (PLR)

The perceptron learning rule originates from the Hebbian assumption and was used by Frank Rosenblatt in his perceptron in 1958. The net input (the weighted sum of the inputs) is passed to the activation (transfer) function, and the function's output is used for adjusting the weights. The learning signal is the difference between the desired response and the actual response of a neuron. The step function is often used as the activation function, and the outputs are generally restricted to -1, 0, or 1.

The weights are updated with

$w_{i,\text{new}}=w_{i,\text{old}}+\eta (t-o)x_{i}$

where $t$ is the target value, $o$ is the output of the perceptron, $x_{i}$ is the i-th input, and $\eta$ is the learning rate.

The algorithm converges to the correct classification if: ^[5]

the training data is linearly separable*
$\eta$ is sufficiently small (though smaller $\eta$ generally means a longer learning time and more epochs)

*A single-layer perceptron with this learning rule cannot handle inputs that are not linearly separable, so the XOR problem cannot be solved using this rule alone.^[6]
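The update rule and the separability condition above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: it uses a 0/1 step activation, a bias term updated like a weight on a constant input of 1, zero initial weights, and $\eta = 0.5$ (chosen so the arithmetic stays exact). The names `train_perceptron` and `predict` are hypothetical.

```python
def step(net):
    """Threshold activation (assumption: outputs are 0 or 1)."""
    return 1 if net >= 0 else 0

def predict(w, b, x):
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_perceptron(samples, eta=0.5, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i until an epoch has no errors."""
    w = [0.0] * len(samples[0][0])
    b = 0.0  # bias, treated as a weight on a constant input of 1
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            o = predict(w, b, x)
            if o != t:
                errors += 1
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                b += eta * (t - o)
        if errors == 0:
            return w, b, True   # every sample classified correctly
    return w, b, False          # gave up after max_epochs

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and, b_and, and_converged = train_perceptron(AND)
w_xor, b_xor, xor_converged = train_perceptron(XOR)

print("AND converged:", and_converged)
print("XOR converged:", xor_converged)
```

AND is linearly separable, so training stops with every sample classified correctly. XOR is not, so no epoch can ever be error-free and the loop runs out of epochs.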

Backpropagation

Seppo Linnainmaa is said to have developed the backpropagation algorithm in 1970,^[7] but the origins of the algorithm go back to the 1960s, with many contributors. It is a generalisation of the least mean squares algorithm in the linear perceptron and of the delta learning rule.

It implements a gradient descent search through the space of possible network weights, iteratively reducing the error between the target values and the network outputs.
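As a rough illustration of this gradient descent search, the sketch below trains a small network with one hidden layer on the XOR data. Everything specific here is an assumption made for the example: sigmoid activations, a squared-error loss, batch updates, three hidden units, $\eta = 0.5$, and the helper names `forward`, `mean_loss`, and `train`. It only demonstrates that the error is reduced; it does not claim that any particular run reaches a perfect XOR solution.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Return hidden activations h and network output o for input x."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    o = sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)
    return h, o

def mean_loss(data, W1, b1, W2, b2):
    """Mean of 0.5 * (t - o)^2 over the data set."""
    return sum(0.5 * (t - forward(x, W1, b1, W2, b2)[1]) ** 2
               for x, t in data) / len(data)

def train(data, n_hidden=3, eta=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    initial = mean_loss(data, W1, b1, W2, b2)
    n = len(data)
    for _ in range(epochs):
        # Accumulate gradients over the whole batch.
        gW1 = [[0.0] * n_in for _ in range(n_hidden)]
        gb1 = [0.0] * n_hidden
        gW2 = [0.0] * n_hidden
        gb2 = 0.0
        for x, t in data:
            h, o = forward(x, W1, b1, W2, b2)
            delta_o = (o - t) * o * (1 - o)           # output-layer error term
            for j in range(n_hidden):
                # Propagate the error back through weight W2[j].
                delta_h = delta_o * W2[j] * h[j] * (1 - h[j])
                gW2[j] += delta_o * h[j]
                gb1[j] += delta_h
                for i in range(n_in):
                    gW1[j][i] += delta_h * x[i]
            gb2 += delta_o
        # Gradient descent step on the mean loss.
        for j in range(n_hidden):
            W2[j] -= eta * gW2[j] / n
            b1[j] -= eta * gb1[j] / n
            for i in range(n_in):
                W1[j][i] -= eta * gW1[j][i] / n
        b2 -= eta * gb2 / n
    final = mean_loss(data, W1, b1, W2, b2)
    return (W1, b1, W2, b2), initial, final

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
params, initial_loss, final_loss = train(XOR)
print("loss before training:", initial_loss)
print("loss after training: ", final_loss)
```

The error term for each hidden unit is computed from the output error term and the connecting weight, which is the step that distinguishes backpropagation from a single-layer rule.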

Widrow-Hoff Learning (Delta Learning Rule)

The Widrow-Hoff rule is similar to the perceptron learning rule but has a different origin. It was developed for use in the ADALINE network, which differs from the perceptron mainly in how it is trained. In ADALINE the weights are adjusted according to the weighted sum of the inputs (the net), whereas in the perceptron they are adjusted according to the thresholded output, that is, the sign of the weighted sum, which is restricted to values such as 0, -1, or +1.

Delta rule (DR) is similar to the Perceptron Learning Rule (PLR), with some differences:

Error (δ) in DR is not restricted to having values of 0, 1, or -1 (as in PLR), but may have any value
DR can be derived for any differentiable output/activation function f, whereas PLR only works for a threshold output function

Sometimes the term delta rule is reserved for the case where the Widrow-Hoff rule is applied to binary targets, but the two terms are often used interchangeably. The delta rule is considered to be a special case of the back-propagation algorithm.
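A minimal sketch of the Widrow-Hoff update is given below, assuming an identity output function so that the error is computed directly from the net, $\delta = t - \text{net}$, and each weight changes by $\eta\,\delta\,x_{i}$. The target function $2x_{1}-x_{2}+0.5$, the input grid, and the name `train_adaline` are illustrative assumptions.

```python
def train_adaline(samples, eta=0.1, epochs=200):
    """Widrow-Hoff (LMS) training: the error uses the net, not a thresholded output."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, t in samples:
            net = sum(wi * xi for wi, xi in zip(w, x)) + b
            delta = t - net  # real-valued error, not limited to -1, 0, or 1
            w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
            b += eta * delta
    return w, b

# Illustrative noise-free data from t = 2*x1 - x2 + 0.5 on a 3x3 grid.
grid = [-1.0, 0.0, 1.0]
samples = [((x1, x2), 2.0 * x1 - x2 + 0.5) for x1 in grid for x2 in grid]

w, b = train_adaline(samples)
print("weights:", w)
print("bias:   ", b)
```

Because the error is real-valued, the weights keep moving toward the values that generated the targets even when a thresholded output would already be correct.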

The delta rule also closely resembles the Rescorla-Wagner model of Pavlovian conditioning.^[8]

Competitive Learning

Competitive learning is considered a variant of Hebbian learning, but it is special enough to be discussed separately. Competitive learning works by increasing the specialization of each node in the network. It is well suited to finding clusters within data.

Models and algorithms based on the principle of competitive learning include vector quantization and self-organizing maps (Kohonen maps).
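The idea of nodes specialising on clusters can be sketched with a simple winner-take-all scheme, in the spirit of vector quantization. This is one illustrative variant under stated assumptions: each node holds a prototype vector, only the prototype nearest to the current input is moved toward it, and the prototypes are initialised from two of the data points. The data and the names `nearest` and `competitive_learning` are hypothetical.

```python
def nearest(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda k: sum((p - xi) ** 2
                                 for p, xi in zip(prototypes[k], x)))

def competitive_learning(data, prototypes, eta=0.2, epochs=20):
    """Winner-take-all updates: only the winning prototype moves toward x."""
    protos = [list(p) for p in prototypes]
    for _ in range(epochs):
        for x in data:
            k = nearest(protos, x)
            protos[k] = [p + eta * (xi - p) for p, xi in zip(protos[k], x)]
    return protos

# Two illustrative clusters, around (0, 0) and around (5, 5).
cluster_a = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2)]
cluster_b = [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.2)]
data = cluster_a + cluster_b

# Assumption: initialise one prototype from each end of the data list.
protos = competitive_learning(data, [data[0], data[-1]])
print("prototype 0:", protos[0])
print("prototype 1:", protos[1])
```

Each prototype ends up inside the cluster whose points it keeps winning, which is the specialisation described above. With a poor initialisation a prototype may never win, a limitation this sketch does not address.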

References

[1] Haykin, Simon (16 July 1998). "Chapter 2: Learning Processes". Neural Networks: A comprehensive foundation (2nd ed.). Prentice Hall. pp. 50–104. ISBN 978-8178083001. Retrieved 2 May 2012.
[2] Russell, S.; Norvig, P. (1995). "Chapter 18: Learning from Examples". Artificial Intelligence: A Modern Approach (3rd ed.). Prentice Hall. pp. 693–859. ISBN 0-13-103805-2. Retrieved 20 Nov 2013.
[3] Rajasekaran, Sundaramoorthy; Pai, G. A. Vijayalakshmi (2003). Neural networks, fuzzy logic, and genetic algorithms: synthesis and applications (Eastern economy ed.). New Delhi: Prentice-Hall of India. ISBN 81-203-2186-3. OCLC 56960832.
[4] Golden, Richard M. (1986-03-01). "The "Brain-State-in-a-Box" neural model is a gradient descent algorithm". Journal of Mathematical Psychology. 30 (1): 73–80. doi:10.1016/0022-2496(86)90043-X. ISSN 0022-2496.
[5] Sivanandam, S. N.; Deepa, S. N. (2007). Principles of soft computing (1st ed.). New Delhi: Wiley India. ISBN 978-81-265-1075-7. OCLC 760996382.
[6] Minsky, Marvin; Papert, Seymour (1969). Perceptrons; an introduction to computational geometry. Cambridge, Mass.: MIT Press. ISBN 0-262-13043-2. OCLC 5034.
[7] Schmidhuber, Juergen (January 2015). "Deep Learning in Neural Networks: An Overview". Neural Networks. 61: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637. S2CID 11715509.
[8] Rescorla, Robert (2008-03-31). "Rescorla-Wagner model". Scholarpedia. 3 (3): 2237. Bibcode:2008SchpJ...3.2237R. doi:10.4249/scholarpedia.2237. ISSN 1941-6016.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
