Modern Hopfield network

Last updated

Modern Hopfield networks [1] [2] (also known as Dense Associative Memories [3] ) are generalizations of the classical Hopfield networks that break the linear scaling relationship between the number of input features and the number of stored memories. This is achieved by introducing stronger non-linearities (either in the energy function or neurons’ activation functions) leading to super-linear [3] (even an exponential [4] ) memory storage capacity as a function of the number of feature neurons. The network still requires a sufficient number of hidden neurons. [5]

Contents

The key theoretical idea behind the modern Hopfield networks is to use an energy function and an update rule that is more sharply peaked around the stored memories in the space of neuron’s configurations compared to the classical Hopfield network. [3]

Classical Hopfield networks

Hopfield networks [6] [7] are recurrent neural networks with dynamical trajectories converging to fixed point attractor states and described by an energy function. The state of each model neuron is defined by a time-dependent variable , which can be chosen to be either discrete or continuous. A complete model describes the mathematics of how the future state of activity of each neuron depends on the known present or previous activity of all the neurons.

In the original Hopfield model of associative memory, [6] the variables were binary, and the dynamics were described by a one-at-a-time update of the state of the neurons. An energy function quadratic in the was defined, and the dynamics consisted of changing the activity of each single neuron only if doing so would lower the total energy of the system. This same idea was extended to the case of being a continuous variable representing the output of neuron , and being a monotonic function of an input current. The dynamics became expressed as a set of first-order differential equations for which the "energy" of the system always decreased. [7] The energy in the continuous case has one term which is quadratic in the (as in the binary model), and a second term which depends on the gain function (neuron's activation function). While having many desirable properties of associative memory, both of these classical systems suffer from a small memory storage capacity, which scales linearly with the number of input features. [6]

Discrete variables

A simple example [3] of the Modern Hopfield network can be written in terms of binary variables that represent the active and inactive state of the model neuron .In this formula the weights represent the matrix of memory vectors (index enumerates different memories, and index enumerates the content of each memory corresponding to the -th feature neuron), and the function is a rapidly growing non-linear function. The update rule for individual neurons (in the asynchronous case) can be written in the following form which states that in order to calculate the updated state of the -th neuron the network compares two energies: the energy of the network with the -th neuron in the ON state and the energy of the network with the -th neuron in the OFF state, given the states of the remaining neuron. The updated state of the -th neuron selects the state that has the lowest of the two energies. [3]

In the limiting case when the non-linear energy function is quadratic these equations reduce to the familiar energy function and the update rule for the classical binary Hopfield network. [6]

The memory storage capacity of these networks can be calculated for random binary patterns. For the power energy function the maximal number of memories that can be stored and retrieved from this network without errors is given by [3] For an exponential energy function the memory storage capacity is exponential in the number of feature neurons [4]

Continuous variables

Fig.1 An example of a continuous Modern Hopfield network with
N
f
=
5
{\textstyle N_{f}=5}
feature neurons and
N
mem
=
11
{\displaystyle N_{\text{mem}}=11}
memory (hidden) neurons with symmetric synaptic connections between them. Modern Hopfield Network.png
Fig.1 An example of a continuous Modern Hopfield network with feature neurons and memory (hidden) neurons with symmetric synaptic connections between them.

Modern Hopfield networks or Dense Associative Memories can be best understood in continuous variables and continuous time. [1] [5] Consider the network architecture, shown in Fig.1, and the equations for the neurons' state evolution [5]

where the currents of the feature neurons are denoted by , and the currents of the memory neurons are denoted by ( stands for hidden neurons). There are no synaptic connections among the feature neurons or the memory neurons. A matrix denotes the strength of synapses from a feature neuron to the memory neuron . The synapses are assumed to be symmetric, so that the same value characterizes a different physical synapse from the memory neuron to the feature neuron . The outputs of the memory neurons and the feature neurons are denoted by and , which are non-linear functions of the corresponding currents. In general these outputs can depend on the currents of all the neurons in that layer so that and . It is convenient to define these activation function as derivatives of the Lagrangian functions for the two groups of neurons

This way the specific form of the equations for neuron's states is completely defined once the Lagrangian functions are specified. Finally, the time constants for the two groups of neurons are denoted by and , is the input current to the network that can be driven by the presented data.

Fig.2 Effective theory on the feature neurons for various common choices of the Lagrangian functions. Model A reduces to the models studied in depending on the choice of the activation function, model B reduces to the model studied in, model C reduces to the model of. Effective theory of Modern Hopfield Networks.png
Fig.2 Effective theory on the feature neurons for various common choices of the Lagrangian functions. Model A reduces to the models studied in depending on the choice of the activation function, model B reduces to the model studied in, model C reduces to the model of.

General systems of non-linear differential equations can have many complicated behaviors that can depend on the choice of the non-linearities and the initial conditions. For Hopfield networks, however, this is not the case - the dynamical trajectories always converge to a fixed point attractor state. This property is achieved because these equations are specifically engineered so that they have an underlying energy function [5]

The terms grouped into square brackets represent a Legendre transform of the Lagrangian function with respect to the states of the neurons. If the Hessian matrices of the Lagrangian functions are positive semi-definite, the energy function is guaranteed to decrease on the dynamical trajectory [5]

This property makes it possible to prove that the system of dynamical equations describing temporal evolution of neurons' activities will eventually reach a fixed point attractor state.

In certain situations one can assume that the dynamics of hidden neurons equilibrates at a much faster time scale compared to the feature neurons, . In this case the steady state solution of the second equation in the system ( 1 ) can be used to express the currents of the hidden units through the outputs of the feature neurons. This makes it possible to reduce the general theory ( 1 ) to an effective theory for feature neurons only. The resulting effective update rules and the energies for various common choices of the Lagrangian functions are shown in Fig.2. In the case of log-sum-exponential Lagrangian function the update rule (if applied once) for the states of the feature neurons is the attention mechanism [1] commonly used in many modern AI systems (see Ref. [5] for the derivation of this result from the continuous time formulation).

Relationship to classical Hopfield network with continuous variables

Classical formulation of continuous Hopfield networks [6] can be understood [5] as a special limiting case of the Modern Hopfield networks with one hidden layer. Continuous Hopfield Networks for neurons with graded response are typically described [6] by the dynamical equations

and the energy function

where , and is the inverse of the activation function . This model is a special limit of the class of models that is called models A, [5] with the following choice of the Lagrangian functions

that, according to the definition ( 2 ), leads to the activation functions

If we integrate out the hidden neurons the system of equations ( 1 ) reduces to the equations on the feature neurons ( 5 ) with , and the general expression for the energy ( 3 ) reduces to the effective energy

While the first two terms in equation ( 6 ) are the same as those in equation ( 9 ), the third terms look superficially different. In equation ( 9 ) it is a Legendre transform of the Lagrangian for the feature neurons, while in ( 6 ) the third term is an integral of the inverse activation function. Nevertheless, these two expressions are in fact equivalent, since the derivatives of a function and its Legendre transform are inverse functions of each other. The easiest way to see that these two terms are equal explicitly is to differentiate each one with respect to . The results of these differentiations for both expressions are equal to . Thus, the two expressions are equal up to an additive constant. This completes the proof [5] that the classical Hopfield network with continuous states [6] is a special limiting case of the modern Hopfield network ( 1 ) with energy ( 3 ).

General formulation

Fig.3 The connectivity diagram of the fully-connected modern Hopfield network consisting of five neurons. The synaptic weights are described by a symmetric matrix
W
I
J
{\displaystyle W_{IJ}}
. HAM Full Connect.png
Fig.3 The connectivity diagram of the fully-connected modern Hopfield network consisting of five neurons. The synaptic weights are described by a symmetric matrix .

Biological neural networks have a large degree of heterogeneity in terms of different cell types. This section describes a mathematical model of a fully connected Modern Hopfield network assuming the extreme degree of heterogeneity: every single neuron is different. [8] Specifically, an energy function and the corresponding dynamical equations are described assuming that each neuron has its own activation function and kinetic time scale. The network is assumed to be fully connected, so that every neuron is connected to every other neuron using a symmetric matrix of weights , indices and enumerate different neurons in the network, see Fig.3. The easiest way to mathematically formulate this problem is to define the architecture through a Lagrangian function that depends on the activities of all the neurons in the network. The activation function for each neuron is defined as a partial derivative of the Lagrangian with respect to that neuron's activity

From the biological perspective one can think about as an axonal output of the neuron . In the simplest case, when the Lagrangian is additive for different neurons, this definition results in the activation that is a non-linear function of that neuron's activity. For non-additive Lagrangians this activation function can depend on the activities of a group of neurons. For instance, it can contain contrastive (softmax) or divisive normalization. The dynamical equations describing temporal evolution of a given neuron are given by [8]

This equation belongs to the class of models called firing rate models in neuroscience. Each neuron collects the axonal outputs from all the neurons, weights them with the synaptic coefficients and produces its own time-dependent activity . The temporal evolution has a time constant , which in general can be different for every neuron. This network has a global energy function [8]

where the first two terms represent the Legendre transform of the Lagrangian function with respect to the neurons' currents . The temporal derivative of this energy function can be computed on the dynamical trajectories leading to (see [8] for details)

The last inequality sign holds provided that the matrix (or its symmetric part) is positive semi-definite. If, in addition to this, the energy function is bounded from below the non-linear dynamical equations are guaranteed to converge to a fixed point attractor state. The advantage of formulating this network in terms of the Lagrangian functions is that it makes it possible to easily experiment with different choices of the activation functions and different architectural arrangements of neurons. For all those flexible choices the conditions of convergence are determined by the properties of the matrix and the existence of the lower bound on the energy function.

Hierarchical associative memory network

Fig.4 The connectivity diagram of the layered Hierarchical Associative Memory network. Each layer can have different number of neurons, different activation function, and different time scales. The feedforward weights and feedback weights are equal. Hierarchical Associative Memory.png
Fig.4 The connectivity diagram of the layered Hierarchical Associative Memory network. Each layer can have different number of neurons, different activation function, and different time scales. The feedforward weights and feedback weights are equal.

The neurons can be organized in layers so that every neuron in a given layer has the same activation function and the same dynamic time scale. If we assume that there are no horizontal connections between the neurons within the layer (lateral connections) and there are no skip-layer connections, the general fully connected network ( 11 ), ( 12 ) reduces to the architecture shown in Fig.4. It has layers of recurrently connected neurons with the states described by continuous variables and the activation functions , index enumerates the layers of the network, and index enumerates individual neurons in that layer. The activation functions can depend on the activities of all the neurons in the layer. Every layer can have a different number of neurons . These neurons are recurrently connected with the neurons in the preceding and the subsequent layers. The matrices of weights that connect neurons in layers and are denoted by (the order of the upper indices for weights is the same as the order of the lower indices, in the example above this means that the index enumerates neurons in the layer , and index enumerates neurons in the layer ). The feedforward weights and the feedback weights are equal. The dynamical equations for the neurons' states can be written as [8]

with boundary conditions

The main difference of these equations from the conventional feedforward networks is the presence of the second term, which is responsible for the feedback from higher layers. These top-down signals help neurons in lower layers to decide on their response to the presented stimuli. Following the general recipe it is convenient to introduce a Lagrangian function for the -th hidden layer, which depends on the activities of all the neurons in that layer. [8] The activation functions in that layer can be defined as partial derivatives of the Lagrangian

With these definitions the energy (Lyapunov) function is given by [8]

If the Lagrangian functions, or equivalently the activation functions, are chosen in such a way that the Hessians for each layer are positive semi-definite and the overall energy is bounded from below, this system is guaranteed to converge to a fixed point attractor state. The temporal derivative of this energy function is given by [8]

Thus, the hierarchical layered network is indeed an attractor network with the global energy function. This network is described by a hierarchical set of synaptic weights that can be learned for each specific problem.

Related Research Articles

In particle physics, the Dirac equation is a relativistic wave equation derived by British physicist Paul Dirac in 1928. In its free form, or including electromagnetic interactions, it describes all spin-1/2 massive particles, called "Dirac particles", such as electrons and quarks for which parity is a symmetry. It is consistent with both the principles of quantum mechanics and the theory of special relativity, and was the first theory to account fully for special relativity in the context of quantum mechanics. It was validated by accounting for the fine structure of the hydrogen spectrum in a completely rigorous way. It has become vital in the building of the Standard Model.

<span class="mw-page-title-main">Navier–Stokes equations</span> Equations describing the motion of viscous fluid substances

The Navier–Stokes equations are partial differential equations which describe the motion of viscous fluid substances. They were named after French engineer and physicist Claude-Louis Navier and the Irish physicist and mathematician George Gabriel Stokes. They were developed over several decades of progressively building the theories, from 1822 (Navier) to 1842–1850 (Stokes).

<span class="mw-page-title-main">Stress–energy tensor</span> Tensor describing energy momentum density in spacetime

The stress–energy tensor, sometimes called the stress–energy–momentum tensor or the energy–momentum tensor, is a tensor physical quantity that describes the density and flux of energy and momentum in spacetime, generalizing the stress tensor of Newtonian physics. It is an attribute of matter, radiation, and non-gravitational force fields. This density and flux of energy and momentum are the sources of the gravitational field in the Einstein field equations of general relativity, just as mass density is the source of such a field in Newtonian gravity.

<span class="mw-page-title-main">Noether's theorem</span> Statement relating differentiable symmetries to conserved quantities

Noether's theorem states that every continuous symmetry of the action of a physical system with conservative forces has a corresponding conservation law. This is the first of two theorems published by mathematician Emmy Noether in 1918. The action of a physical system is the integral over time of a Lagrangian function, from which the system's behavior can be determined by the principle of least action. This theorem only applies to continuous and smooth symmetries of physical space.

<span class="mw-page-title-main">Differential operator</span> Typically linear operator defined in terms of differentiation of functions

In mathematics, a differential operator is an operator defined as a function of the differentiation operator. It is helpful, as a matter of notation first, to consider differentiation as an abstract operation that accepts a function and returns another function.

In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

A Newtonian fluid is a fluid in which the viscous stresses arising from its flow are at every point linearly correlated to the local strain rate — the rate of change of its deformation over time. Stresses are proportional to the rate of change of the fluid's velocity vector.

<span class="mw-page-title-main">Legendre transformation</span> Mathematical transformation

In mathematics, the Legendre transformation, first introduced by Adrien-Marie Legendre in 1787 when studying the minimal surface problem, is an involutive transformation on real-valued functions that are convex on a real variable. Specifically, if a real-valued multivariable function is convex on one of its independent real variables, then the Legendre transform with respect to this variable is applicable to the function.

<span class="mw-page-title-main">Onsager reciprocal relations</span> Relations between flows and forces, or gradients, in thermodynamic systems

In thermodynamics, the Onsager reciprocal relations express the equality of certain ratios between flows and forces in thermodynamic systems out of equilibrium, but where a notion of local equilibrium exists.

In physics, the Hamilton–Jacobi equation, named after William Rowan Hamilton and Carl Gustav Jacob Jacobi, is an alternative formulation of classical mechanics, equivalent to other formulations such as Newton's laws of motion, Lagrangian mechanics and Hamiltonian mechanics.

Teleparallelism, was an attempt by Albert Einstein to base a unified theory of electromagnetism and gravity on the mathematical structure of distant parallelism, also referred to as absolute or teleparallelism. In this theory, a spacetime is characterized by a curvature-free linear connection in conjunction with a metric tensor field, both defined in terms of a dynamical tetrad field.

A Hopfield network is a form of recurrent neural network, or a spin glass system, that can serve as a content-addressable memory. The Hopfield network, named for John Hopfield, consists of a single layer of neurons, where each neuron is connected to every other neuron except itself. These connections are bidirectional and symmetric, meaning the weight of the connection from neuron i to neuron j is the same as the weight from neuron j to neuron i. Patterns are associatively recalled by fixing certain inputs, and dynamically evolve the network to minimize an energy function, towards local energy minimum states that correspond to stored patterns. Patterns are associatively learned by a Hebbian learning algorithm.

In physics, precisely in the study of the theory of general relativity and many alternatives to it, the post-Newtonian formalism is a calculational tool that expresses Einstein's (nonlinear) equations of gravity in terms of the lowest-order deviations from Newton's law of universal gravitation. This allows approximations to Einstein's equations to be made in the case of weak fields. Higher-order terms can be added to increase accuracy, but for strong fields, it may be preferable to solve the complete equations numerically. Some of these post-Newtonian approximations are expansions in a small parameter, which is the ratio of the velocity of the matter forming the gravitational field to the speed of light, which in this case is better called the speed of gravity. In the limit, when the fundamental speed of gravity becomes infinite, the post-Newtonian expansion reduces to Newton's law of gravity.

In mathematics — specifically, in stochastic analysis — the infinitesimal generator of a Feller process is a Fourier multiplier operator that encodes a great deal of information about the process.

In mathematics, the spectral theory of ordinary differential equations is the part of spectral theory concerned with the determination of the spectrum and eigenfunction expansion associated with a linear ordinary differential equation. In his dissertation, Hermann Weyl generalized the classical Sturm–Liouville theory on a finite closed interval to second order differential operators with singularities at the endpoints of the interval, possibly semi-infinite or infinite. Unlike the classical case, the spectrum may no longer consist of just a countable set of eigenvalues, but may also contain a continuous part. In this case the eigenfunction expansion involves an integral over the continuous part with respect to a spectral measure, given by the Titchmarsh–Kodaira formula. The theory was put in its final simplified form for singular differential equations of even degree by Kodaira and others, using von Neumann's spectral theorem. It has had important applications in quantum mechanics, operator theory and harmonic analysis on semisimple Lie groups.

<span class="mw-page-title-main">Lagrangian mechanics</span> Formulation of classical mechanics

In physics, Lagrangian mechanics is a formulation of classical mechanics founded on the stationary-action principle. It was introduced by the Italian-French mathematician and astronomer Joseph-Louis Lagrange in his presentation to the Turin Academy of Science in 1760 culminating in his 1788 grand opus, Mécanique analytique.

Lagrangian field theory is a formalism in classical field theory. It is the field-theoretic analogue of Lagrangian mechanics. Lagrangian mechanics is used to analyze the motion of a system of discrete particles each with a finite number of degrees of freedom. Lagrangian field theory applies to continua and fields, which have an infinite number of degrees of freedom.

In machine learning, a Hyper basis function network, or HyperBF network, is a generalization of radial basis function (RBF) networks concept, where the Mahalanobis-like distance is used instead of Euclidean distance measure. Hyper basis function networks were first introduced by Poggio and Girosi in the 1990 paper “Networks for Approximation and Learning”.

Vasiliev equations are formally consistent gauge invariant nonlinear equations whose linearization over a specific vacuum solution describes free massless higher-spin fields on anti-de Sitter space. The Vasiliev equations are classical equations and no Lagrangian is known that starts from canonical two-derivative Frønsdal Lagrangian and is completed by interactions terms. There is a number of variations of Vasiliev equations that work in three, four and arbitrary number of space-time dimensions. Vasiliev's equations admit supersymmetric extensions with any number of super-symmetries and allow for Yang–Mills gaugings. Vasiliev's equations are background independent, the simplest exact solution being anti-de Sitter space. It is important to note that locality is not properly implemented and the equations give a solution of certain formal deformation procedure, which is difficult to map to field theory language. The higher-spin AdS/CFT correspondence is reviewed in Higher-spin theory article.

In physics and mathematics, the Klein–Kramers equation or sometimes referred as Kramers–Chandrasekhar equation is a partial differential equation that describes the probability density function f of a Brownian particle in phase space (r, p). It is a special case of the Fokker–Planck equation.

References

  1. 1 2 3 4 Ramsauer, Hubert; et al. (2021). "Hopfield Networks is All You Need". International Conference on Learning Representations. arXiv: 2008.02217 .
  2. "Hopfield Networks is All You Need". hopfield-layers. 2020-08-25. Archived from the original on 26 Mar 2023. Retrieved 2023-05-04.
  3. 1 2 3 4 5 6 7 Krotov, Dmitry; Hopfield, John (2016). "Dense Associative Memory for Pattern Recognition". Neural Information Processing Systems. 29: 1172–1180. arXiv: 1606.01164 .
  4. 1 2 3 Demircigil, Mete; et al. (2017). "On a model of associative memory with huge storage capacity". Journal of Statistical Physics. 168 (2): 288–299. arXiv: 1702.01929 . Bibcode:2017JSP...168..288D. doi:10.1007/s10955-017-1806-y. S2CID   119317128.
  5. 1 2 3 4 5 6 7 8 9 10 Krotov, Dmitry; Hopfield, John (2021). "Large associative memory problem in neurobiology and machine learning". International Conference on Learning Representations. arXiv: 2008.06996 .
  6. 1 2 3 4 5 6 7 Hopfield, John (1982). "Neural networks and physical systems with emergent collective computational abilities". Proceedings of the National Academy of Sciences. 79 (8): 2554–2558. Bibcode:1982PNAS...79.2554H. doi: 10.1073/pnas.79.8.2554 . PMC   346238 . PMID   6953413.
  7. 1 2 Hopfield, John (1984). "Neurons with graded response have collective computational properties like those of two-state neurons". Proceedings of the National Academy of Sciences. 81 (10): 3088–3092. Bibcode:1984PNAS...81.3088H. doi: 10.1073/pnas.81.10.3088 . PMC   345226 . PMID   6587342.
  8. 1 2 3 4 5 6 7 8 9 Krotov, Dmitry (2021). "Hierarchical Associative Memory". arXiv: 2107.06446 [cs.NE].