Generalized Hebbian algorithm

Last updated December 13, 2024

The generalized Hebbian algorithm, also known in the literature as Sanger's rule, is a linear feedforward neural network for unsupervised learning with applications primarily in principal components analysis. First defined in 1989,^[1] it is similar to Oja's rule in its formulation and stability, except it can be applied to networks with multiple outputs. The name originates because of the similarity between the algorithm and a hypothesis made by Donald Hebb ^[2] about the way in which synaptic strengths in the brain are modified in response to experience, i.e., that changes are proportional to the correlation between the firing of pre- and post-synaptic neurons.^[3]

Theory

Consider a problem of learning a linear code for some data. Each data is a multi-dimensional vector $x\in \mathbb {R} ^{n}$ , and can be (approximately) represented as a linear sum of linear code vectors $w_{1},\dots ,w_{m}$ . When $m=n$ , it is possible to exactly represent the data. If $m<n$ , it is possible to approximately represent the data. To minimize the L2 loss of representation, $w_{1},\dots ,w_{m}$ should be the highest principal component vectors.

The generalized Hebbian algorithm is an iterative algorithm to find the highest principal component vectors, in an algorithmic form that resembles unsupervised Hebbian learning in neural networks.

Consider a one-layered neural network with $n$ input neurons and $m$ output neurons $y_{1},\dots ,y_{m}$ . The linear code vectors are the connection strengths, that is, $w_{ij}$ is the synaptic weight or connection strength between the $j$ -th input and $i$ -th output neurons.

The generalized Hebbian algorithm learning rule is of the form

\,\Delta w_{ij}~=~\eta y_{i}\left(x_{j}-\sum _{k=1}^{i}w_{kj}y_{k}\right)

where $\eta$ is the learning rate parameter.^[4]

Derivation

In matrix form, Oja's rule can be written

\,{\frac {{\text{d}}w(t)}{{\text{d}}t}}~=~w(t)Q-\mathrm {diag} [w(t)Qw(t)^{\mathrm {T} }]w(t)

,

and the Gram-Schmidt algorithm is

\,\Delta w(t)~=~-\mathrm {lower} [w(t)w(t)^{\mathrm {T} }]w(t)

,

where $w (t)$ is any matrix, in this case representing synaptic weights, $Q = η x x T$ is the autocorrelation matrix, simply the outer product of inputs, $diag$ is the function that diagonalizes a matrix, and $lower$ is the function that sets all matrix elements on or above the diagonal equal to 0. We can combine these equations to get our original rule in matrix form,

\,\Delta w(t)~=~\eta (t)\left(\mathbf {y} (t)\mathbf {x} (t)^{\mathrm {T} }-\mathrm {LT} [\mathbf {y} (t)\mathbf {y} (t)^{\mathrm {T} }]w(t)\right)

,

where the function $LT$ sets all matrix elements above the diagonal equal to 0, and note that our output $y (t) = w (t) x (t)$ is a linear neuron.^[1]

Stability and Principal Components Analysis

^[5]

Oja's rule is the special case where $m=1$ .^[6] One can think of the generalized Hebbian algorithm as iterating Oja's rule.

With Oja's rule, $w_{1}$ is learned, and it has the same direction as the largest principal component vector is learned, with length determined by $E[x_{j}]=E[w_{1j}y_{1}]$ for all $j$ , where the expectation is taken over all input-output pairs. In other words, the length of the vector $w_{1}$ is such that we have an autoencoder, with the latent code $y_{1}=\sum _{i}w_{1i}x_{i}$ , such that $E[\|x-y_{1}w_{1}\|^{2}]$ is minimized.

When $m=2$ , the first neuron in the hidden layer of the autoencoder still learns as described, since it is unaffected by the second neuron. So, after the first neuron and its vector $w_{1}$ has converged, the second neuron is effectively running another Oja's rule on the modified input vectors, defined by $x'=x-y_{1}w_{1}$ , which we know is the input vector with the first principal component removed. Therefore, the second neuron learns to code for the second principal component.

By induction, this results in finding the top- $m$ principal components for arbitrary $m$ .

Applications

The generalized Hebbian algorithm is used in applications where a self-organizing map is necessary, or where a feature or principal components analysis can be used. Examples of such cases include artificial intelligence and speech and image processing.

Its importance comes from the fact that learning is a single-layer process—that is, a synaptic weight changes only depending on the response of the inputs and outputs of that layer, thus avoiding the multi-layer dependence associated with the backpropagation algorithm. It also has a simple and predictable trade-off between learning speed and accuracy of convergence as set by the learning rate parameter $η$ .^[5]

As an example, (Olshausen and Field, 1996)^[7] performed the generalized Hebbian algorithm on 8-by-8 patches of photos of natural scenes, and found that it results in Fourier-like features. The features are the same as the principal components found by principal components analysis, as expected, and that, the features are determined by the $64\times 64$ variance matrix of the samples of 8-by-8 patches. In other words, it is determined by the second-order statistics of the pixels in images. They criticized this as insufficient to capture higher-order statistics which are necessary to explain the Gabor-like features of simple cells in the primary visual cortex.

Related Research Articles

<span class="mw-page-title-main">Principal component analysis</span> Method of data analysis

Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing.

A self-organizing map (SOM) or self-organizing feature map (SOFM) is an unsupervised machine learning technique used to produce a low-dimensional representation of a higher-dimensional data set while preserving the topological structure of the data. For example, a data set with $variables measured in observations could be represented as clusters of observations with similar values for the variables. These clusters then could be visualized as a two-dimensional "map" such that observations in proximal clusters have more similar values than observations in distal clusters. This can make high-dimensional data easier to visualize and analyze.$

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.

Unsupervised learning is a framework in machine learning where, in contrast to supervised learning, algorithms learn patterns exclusively from unlabeled data. Other frameworks in the spectrum of supervisions include weak- or semi-supervision, where a small portion of the data is tagged, and self-supervision. Some researchers consider self-supervised learning a form of unsupervised learning.

Nonlinear dimensionality reduction, also known as manifold learning, is any of various related techniques that aim to project high-dimensional data, potentially existing across non-linear manifolds which cannot be adequately captured by linear decomposition methods, onto lower-dimensional latent manifolds, with the goal of either visualizing the data in the low-dimensional space, or learning the mapping itself. The techniques described below can be understood as generalizations of linear decomposition methods used for dimensionality reduction, such as singular value decomposition and principal component analysis.

An artificial neuron is a mathematical function conceived as a model of a biological neuron in a neural network. The artificial neuron is the elementary unit of an artificial neural network.

Hebbian theory is a neuropsychological theory claiming that an increase in synaptic efficacy arises from a presynaptic cell's repeated and persistent stimulation of a postsynaptic cell. It is an attempt to explain synaptic plasticity, the adaptation of brain neurons during the learning process. It was introduced by Donald Hebb in his 1949 book The Organization of Behavior. The theory is also called Hebb's rule, Hebb's postulate, and cell assembly theory. Hebb states it as follows:

Let us assume that the persistence or repetition of a reverberatory activity tends to induce lasting cellular changes that add to its stability. ... When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.

In signal processing, independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that at most one subcomponent is Gaussian and that the subcomponents are statistically independent from each other. ICA was invented by Jeanny Hérault and Christian Jutten in 1985. ICA is a special case of blind source separation. A common example application of ICA is the "cocktail party problem" of listening in on one person's speech in a noisy room.

In computer science, learning vector quantization (LVQ) is a prototype-based supervised classification algorithm. LVQ is the supervised counterpart of vector quantization systems.

A Hopfield network is a form of recurrent neural network, or a spin glass system, that can serve as a content-addressable memory. The Hopfield network, named for John Hopfield, consists of a single layer of neurons, where each neuron is connected to every other neuron except itself. These connections are bidirectional and symmetric, meaning the weight of the connection from neuron i to neuron j is the same as the weight from neuron j to neuron i. Patterns are associatively recalled by fixing certain inputs, and dynamically evolve the network to minimize an energy function, towards local energy minimum states that correspond to stored patterns. Patterns are associatively learned by a Hebbian learning algorithm.

In machine learning, backpropagation is a gradient estimation method commonly used for training a neural network to compute its parameter updates.

FastICA is an efficient and popular algorithm for independent component analysis invented by Aapo Hyvärinen at Helsinki University of Technology. Like most ICA algorithms, FastICA seeks an orthogonal rotation of prewhitened data, through a fixed-point iteration scheme, that maximizes a measure of non-Gaussianity of the rotated components. Non-gaussianity serves as a proxy for statistical independence, which is a very strong condition and requires infinite data to verify. FastICA can also be alternatively derived as an approximative Newton iteration.

In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). These methods involve using linear classifiers to solve nonlinear problems. The general task of pattern analysis is to find and study general types of relations in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map: in contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over all pairs of data points computed using inner products. The feature map in kernel machines is infinite dimensional but only requires a finite dimensional matrix from user-input according to the Representer theorem. Kernel machines are slow to compute for datasets larger than a couple of thousand examples without parallel processing.

Oja's learning rule, or simply Oja's rule, named after Finnish computer scientist Erkki Oja, is a model of how neurons in the brain or in artificial neural networks change connection strength, or learn, over time. It is a modification of the standard Hebb's Rule that, through multiplicative normalization, solves all stability problems and generates an algorithm for principal components analysis. This is a computational form of an effect which is believed to happen in biological neurons.

In neuroscience and computer science, synaptic weight refers to the strength or amplitude of a connection between two nodes, corresponding in biology to the amount of influence the firing of one neuron has on another. The term is typically used in artificial and biological neural network research.

<span class="mw-page-title-main">Biological neuron model</span> Mathematical descriptions of the properties of certain cells in the nervous system

Biological neuron models, also known as spiking neuron models, are mathematical descriptions of the conduction of electrical signals in neurons. Neurons are electrically excitable cells within the nervous system, able to fire electric signals, called action potentials, across a neural network. These mathematical models describe the role of the biophysical and geometrical characteristics of neurons on the conduction of electrical activity.

In neuroethology and the study of learning, anti-Hebbian learning describes a particular class of learning rule by which synaptic plasticity can be controlled. These rules are based on a reversal of Hebb's postulate, and therefore can be simplistically understood as dictating reduction of the strength of synaptic connectivity between neurons following a scenario in which a neuron directly contributes to production of an action potential in another neuron.

Competitive learning is a form of unsupervised learning in artificial neural networks, in which nodes compete for the right to respond to a subset of the input data. A variant of Hebbian learning, competitive learning works by increasing the specialization of each node in the network. It is well suited to finding clusters within data.

An artificial neural network's learning rule or learning process is a method, mathematical logic or algorithm which improves the network's performance and/or training time. Usually, this rule is applied repeatedly over the network. It is done by updating the weight and bias levels of a network when it is simulated in a specific data environment. A learning rule may accept existing conditions of the network, and will compare the expected result and actual result of the network to give new and improved values for the weights and biases. Depending on the complexity of the model being simulated, the learning rule of the network can be as simple as an XOR gate or mean squared error, or as complex as the result of a system of differential equations.

The spike response model (SRM) is a spiking neuron model in which spikes are generated by either a deterministic or a stochastic threshold process. In the SRM, the membrane voltage $V$ is described as a linear sum of the postsynaptic potentials (PSPs) caused by spike arrivals to which the effects of refractoriness and adaptation are added. The threshold is either fixed or dynamic. In the latter case it increases after each spike. The SRM is flexible enough to account for a variety of neuronal firing pattern in response to step current input. The SRM has also been used in the theory of computation to quantify the capacity of spiking neural networks; and in the neurosciences to predict the subthreshold voltage and the firing times of cortical neurons during stimulation with a time-dependent current stimulation. The name Spike Response Model points to the property that the two important filters $and of the model can be interpreted as the response of the membrane potential to an incoming spike (response kernel , the PSP) and to an outgoing spike (response kernel , also called refractory kernel). The SRM has been formulated in continuous time and in discrete time. The SRM can be viewed as a generalized linear model (GLM) or as an (integrated version of) a generalized integrate-and-fire model with adaptation.$

References

1 2 Sanger, Terence D. (1989). "Optimal unsupervised learning in a single-layer linear feedforward neural network" (PDF). Neural Networks. 2 (6): 459–473. CiteSeerX 10.1.1.128.6893 . doi:10.1016/0893-6080(89)90044-0 . Retrieved 2007-11-24.
↑ Hebb, D.O. (1949). The Organization of Behavior. New York: Wiley & Sons. ISBN 9781135631918.
↑ Hertz, John; Anders Krough; Richard G. Palmer (1991). Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley Publishing Company. ISBN 978-0201515602.
↑ Gorrell, Genevieve (2006), "Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing.", EACL, CiteSeerX 10.1.1.102.2084
1 2 Haykin, Simon (1998). Neural Networks: A Comprehensive Foundation (2 ed.). Prentice Hall. ISBN 978-0-13-273350-2.
↑ Oja, Erkki (November 1982). "Simplified neuron model as a principal component analyzer". Journal of Mathematical Biology. 15 (3): 267–273. doi:10.1007/BF00275687. PMID 7153672. S2CID 16577977. BF00275687.
↑ Olshausen, Bruno A.; Field, David J. (June 1996). "Emergence of simple-cell receptive field properties by learning a sparse code for natural images". Nature. 381 (6583): 607–609. doi:10.1038/381607a0. ISSN 1476-4687. PMID 8637596.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Sanger89-1] 1 2 Sanger, Terence D. (1989). "Optimal unsupervised learning in a single-layer linear feedforward neural network" (PDF). Neural Networks. 2 (6): 459–473. CiteSeerX 10.1.1.128.6893 . doi:10.1016/0893-6080(89)90044-0 . Retrieved 2007-11-24.

[Hebb_1949-2] Hebb, D.O. (1949). The Organization of Behavior. New York: Wiley & Sons. ISBN 9781135631918.

[Hertz,_Krough,_and_Palmer,_1991-3] Hertz, John; Anders Krough; Richard G. Palmer (1991). Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley Publishing Company. ISBN 978-0201515602.

[4] Gorrell, Genevieve (2006), "Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing.", EACL, CiteSeerX 10.1.1.102.2084

[Haykin98-5] 1 2 Haykin, Simon (1998). Neural Networks: A Comprehensive Foundation (2 ed.). Prentice Hall. ISBN 978-0-13-273350-2.

[Oja82-6] Oja, Erkki (November 1982). "Simplified neuron model as a principal component analyzer". Journal of Mathematical Biology. 15 (3): 267–273. doi:10.1007/BF00275687. PMID 7153672. S2CID 16577977. BF00275687.

[7] Olshausen, Bruno A.; Field, David J. (June 1996). "Emergence of simple-cell receptive field properties by learning a sparse code for natural images". Nature. 381 (6583): 607–609. doi:10.1038/381607a0. ISSN 1476-4687. PMID 8637596.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

v t e Hebbian learning
True Hebbian learning	Contrastive Hebbian learning Generalized Hebbian algorithm Oja's rule
Related concepts	Engram (neuropsychology) Feedforward neural network Backpropagation Leabra GeneRec