Deep belief network

Last updated December 04, 2023

In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively a class of deep neural network, composed of multiple layers of latent variables ("hidden units"), with connections between the layers but not between units within each layer.^[1]

DBNs can be viewed as a composition of simple, unsupervised networks such as restricted Boltzmann machines (RBMs)^[1] or autoencoders,^[3] where each sub-network's hidden layer serves as the visible layer for the next. An RBM is an undirected, generative energy-based model with a "visible" input layer and a hidden layer and connections between but not within layers. This composition leads to a fast, layer-by-layer unsupervised training procedure, where contrastive divergence is applied to each sub-network in turn, starting from the "lowest" pair of layers (the lowest visible layer is a training set).

The observation^[2] that DBNs can be trained greedily, one layer at a time, led to one of the first effective deep learning algorithms.^[4]^: 6 DBNs have since been applied to real-world problems such as electroencephalography^[5] and drug discovery.^[6]^[7]^[8]

Training

Contrastive divergence (CD) is the training method for RBMs that Geoffrey Hinton proposed for training "Product of Experts" models.^[9] CD approximates the maximum likelihood method that would ideally be applied for learning the weights.^[10]^[11] In training a single RBM, weight updates are performed by gradient ascent on the log-likelihood via the following equation: $w_{ij}(t+1)=w_{ij}(t)+\eta {\frac {\partial \log(p(v))}{\partial w_{ij}}}$

where $p(v)$ is the probability of a visible vector, given by $p(v)={\frac {1}{Z}}\sum _{h}e^{-E(v,h)}$. $Z$ is the partition function (used for normalizing) and $E(v,h)$ is the energy function assigned to the state of the network. A lower energy indicates that the network is in a more "desirable" configuration. The gradient ${\frac {\partial \log(p(v))}{\partial w_{ij}}}$ has the simple form $\langle v_{i}h_{j}\rangle _{\text{data}}-\langle v_{i}h_{j}\rangle _{\text{model}}$, where $\langle \cdots \rangle _{p}$ denotes an average with respect to distribution $p$. The difficulty lies in estimating $\langle v_{i}h_{j}\rangle _{\text{model}}$, because this requires extended alternating Gibbs sampling. CD replaces this step by running alternating Gibbs sampling for only $n$ steps, starting from a training vector ($n=1$ performs well in practice). The sample obtained after $n$ steps is used in place of $\langle v_{i}h_{j}\rangle _{\text{model}}$. The CD procedure works as follows:^[10]

1. Initialize the visible units to a training vector.
2. Update the hidden units in parallel given the visible units: $p(h_{j}=1\mid {\textbf {V}})=\sigma (b_{j}+\sum _{i}v_{i}w_{ij})$. $\sigma$ is the sigmoid function and $b_{j}$ is the bias of $h_{j}$.
3. Update the visible units in parallel given the hidden units: $p(v_{i}=1\mid {\textbf {H}})=\sigma (a_{i}+\sum _{j}h_{j}w_{ij})$. $a_{i}$ is the bias of $v_{i}$. This is called the "reconstruction" step.
4. Re-update the hidden units in parallel given the reconstructed visible units using the same equation as in step 2.
5. Perform the weight update: $\Delta w_{ij}\propto \langle v_{i}h_{j}\rangle _{\text{data}}-\langle v_{i}h_{j}\rangle _{\text{reconstruction}}$.
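
The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration for $n=1$ (CD-1) on a single training vector, not a reference implementation. The function names, the learning rate `lr`, the bias updates, and the use of hidden-unit probabilities rather than sampled states when forming the correlation statistics are all assumptions added for the example; the procedure itself specifies only the weight update.

```python
import math
import random


def sigmoid(x):
    """Logistic function used for both conditional distributions."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * w_ij)."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    """p(v_i = 1 | h) = sigmoid(a_i + sum_j h_j * w_ij)."""
    return [sigmoid(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(a))]


def sample(probs, rng):
    """Draw independent binary states from Bernoulli probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_update(v0, W, a, b, lr, rng):
    """Apply one CD-1 update in place; v0 is the training vector (step 1)."""
    # Step 2: hidden units given the data.
    ph0 = hidden_probs(v0, W, b)
    h0 = sample(ph0, rng)
    # Step 3: reconstruction of the visible units.
    v1 = sample(visible_probs(h0, W, a), rng)
    # Step 4: hidden units given the reconstruction.
    ph1 = hidden_probs(v1, W, b)
    # Step 5: <v_i h_j>_data - <v_i h_j>_reconstruction, scaled by lr.
    # Hidden probabilities stand in for sampled states (assumption).
    for i in range(len(v0)):
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    # Bias updates follow the same pattern (assumption, not in the text).
    for i in range(len(a)):
        a[i] += lr * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])
    return v1


# Toy usage: 4 visible units, 2 hidden units, two training vectors.
rng = random.Random(0)
W = [[rng.gauss(0.0, 0.01) for _ in range(2)] for _ in range(4)]
a = [0.0] * 4
b = [0.0] * 2
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
for _ in range(200):
    for v in data:
        cd1_update(v, W, a, b, lr=0.1, rng=rng)
```

The nested loops keep the correspondence with the per-unit equations visible; a practical implementation would typically vectorize these sums and average the update over a mini-batch.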

Once an RBM is trained, another RBM is "stacked" atop it, taking its input from the final trained layer. The new visible layer is initialized to a training vector, and values for the units in the already-trained layers are assigned using the current weights and biases. The new RBM is then trained with the procedure above. This whole process is repeated until the desired stopping criterion is met.^[12]
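
The stacking step can be illustrated with a short, self-contained sketch. It assumes binary units, CD-1 for each layer, and a fixed number of epochs and layers as the stopping criterion; `train_rbm`, `train_dbn`, `layer_sizes`, and the hyperparameter values are hypothetical names and choices made for this example only.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b):
    """p(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    """p(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(a))]


def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def train_rbm(data, n_hidden, epochs, lr, rng):
    """Train one RBM with CD-1 and return its parameters (W, a, b)."""
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    a = [0.0] * n_visible
    b = [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            ph0 = hidden_probs(v0, W, b)
            v1 = sample(visible_probs(sample(ph0, rng), W, a), rng)
            ph1 = hidden_probs(v1, W, b)
            for i in range(n_visible):
                a[i] += lr * (v0[i] - v1[i])
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            for j in range(n_hidden):
                b[j] += lr * (ph0[j] - ph1[j])
    return W, a, b


def train_dbn(data, layer_sizes, epochs=50, lr=0.1, seed=0):
    """Greedily train a stack of RBMs, one per entry in layer_sizes."""
    rng = random.Random(seed)
    layers = []
    inputs = data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(inputs, n_hidden, epochs, lr, rng)
        layers.append((W, a, b))
        # Propagate the data upward through the layer just trained; its
        # sampled hidden states become the next RBM's visible vectors.
        inputs = [sample(hidden_probs(v, W, b), rng) for v in inputs]
    return layers


# Toy usage: 6 visible units, then hidden layers of 4 and 2 units.
data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0]]
dbn = train_dbn(data, layer_sizes=[4, 2])
```

Each trained layer's parameters are frozen before the next RBM is trained, which is what makes the procedure greedy: no later layer changes the weights of an earlier one.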

Although CD is only a crude approximation to maximum likelihood (it does not follow the gradient of any function), it is empirically effective.^[10]
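
To make concrete what CD approximates, the exact quantities can be computed by brute force for a very small RBM. The sketch below assumes the standard binary-RBM energy $E(v,h)=-\sum _{i}a_{i}v_{i}-\sum _{j}b_{j}h_{j}-\sum _{i,j}v_{i}w_{ij}h_{j}$, which is consistent with the conditional distributions used in the CD procedure but is not written out in this article; the function names and parameter values are illustrative.

```python
import itertools
import math


def energy(v, h, W, a, b):
    """Standard binary-RBM energy (assumed form): -a.v - b.h - v^T W h."""
    e = -sum(a[i] * v[i] for i in range(len(v)))
    e -= sum(b[j] * h[j] for j in range(len(h)))
    e -= sum(v[i] * W[i][j] * h[j]
             for i in range(len(v)) for j in range(len(h)))
    return e


def binary_states(n):
    """All 2**n binary vectors of length n."""
    return list(itertools.product([0, 1], repeat=n))


def partition_function(W, a, b):
    """Z = sum over every joint state (v, h) of exp(-E(v, h))."""
    return sum(math.exp(-energy(v, h, W, a, b))
               for v in binary_states(len(a))
               for h in binary_states(len(b)))


def p_visible(v, W, a, b):
    """p(v) = (1/Z) * sum_h exp(-E(v, h))."""
    z = partition_function(W, a, b)
    return sum(math.exp(-energy(v, h, W, a, b))
               for h in binary_states(len(b))) / z


def model_correlation(i, j, W, a, b):
    """Exact <v_i h_j>_model: the term CD replaces with a short Gibbs run."""
    z = partition_function(W, a, b)
    return sum(v[i] * h[j] * math.exp(-energy(v, h, W, a, b))
               for v in binary_states(len(a))
               for h in binary_states(len(b))) / z


# Toy usage: 3 visible and 2 hidden units with arbitrary parameters.
W = [[0.5, -0.3], [0.2, 0.1], [-0.4, 0.6]]
a = [0.1, -0.2, 0.0]
b = [0.3, -0.1]
total = sum(p_visible(v, W, a, b) for v in binary_states(3))
print("sum of p(v) over all visible vectors:", total)
print("<v_0 h_1>_model:", model_correlation(0, 1, W, a, b))
```

The sums run over $2^{n_{v}+n_{h}}$ joint states, where $n_{v}$ and $n_{h}$ are the numbers of visible and hidden units, so this is feasible only for a handful of units. That exponential cost is the reason the model term is estimated by sampling in practice.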

References

[1] Hinton G (2009). "Deep belief networks". Scholarpedia. 4 (5): 5947. Bibcode:2009SchpJ...4.5947H. doi:10.4249/scholarpedia.5947.
[2] Hinton GE, Osindero S, Teh YW (July 2006). "A fast learning algorithm for deep belief nets" (PDF). Neural Computation. 18 (7): 1527–54. CiteSeerX 10.1.1.76.1541. doi:10.1162/neco.2006.18.7.1527. PMID 16764513. S2CID 2309950.
[3] Bengio Y, Lamblin P, Popovici D, Larochelle H (2007). Greedy Layer-Wise Training of Deep Networks (PDF). NIPS.
[4] Bengio Y (2009). "Learning Deep Architectures for AI" (PDF). Foundations and Trends in Machine Learning. 2: 1–127. CiteSeerX 10.1.1.701.9550. doi:10.1561/2200000006.
[5] Movahedi F, Coyle JL, Sejdic E (May 2018). "Deep Belief Networks for Electroencephalography: A Review of Recent Contributions and Future Outlooks". IEEE Journal of Biomedical and Health Informatics. 22 (3): 642–652. doi:10.1109/jbhi.2017.2727218. PMC 5967386. PMID 28715343.
[6] Ghasemi, Pérez-Sánchez; Mehri, Pérez-Garrido (2018). "Neural network and deep-learning algorithms used in QSAR studies: merits and drawbacks". Drug Discovery Today. 23 (10): 1784–1790. doi:10.1016/j.drudis.2018.06.016. PMID 29936244. S2CID 49418479.
[7] Ghasemi, Pérez-Sánchez; Mehri, Fassihi (2016). "The Role of Different Sampling Methods in Improving Biological Activity Prediction Using Deep Belief Network". Journal of Computational Chemistry. 38 (10): 1–8. doi:10.1002/jcc.24671. PMID 27862046. S2CID 12077015.
[8] Gawehn E, Hiss JA, Schneider G (January 2016). "Deep Learning in Drug Discovery". Molecular Informatics. 35 (1): 3–14. doi:10.1002/minf.201501008. PMID 27491648. S2CID 10574953.
[9] Hinton GE (2002). "Training Product of Experts by Minimizing Contrastive Divergence" (PDF). Neural Computation. 14 (8): 1771–1800. CiteSeerX 10.1.1.35.8613. doi:10.1162/089976602760128018. PMID 12180402. S2CID 207596505.
[10] Hinton GE (2010). "A Practical Guide to Training Restricted Boltzmann Machines". Tech. Rep. UTML TR 2010-003.
[11] Fischer A, Igel C (2014). "Training Restricted Boltzmann Machines: An Introduction" (PDF). Pattern Recognition. 47 (1): 25–39. Bibcode:2014PatRe..47...25F. CiteSeerX 10.1.1.716.8647. doi:10.1016/j.patcog.2013.05.025. Archived from the original (PDF) on 2015-06-10. Retrieved 2017-07-02.
[12] Bengio Y (2009). "Learning Deep Architectures for AI" (PDF). Foundations and Trends in Machine Learning. 2 (1): 1–127. CiteSeerX 10.1.1.701.9550. doi:10.1561/2200000006. Archived from the original (PDF) on 2016-03-04. Retrieved 2017-07-02.

External links

"Deep Belief Networks". Deep Learning Tutorials.
"Deep Belief Network Example". Deeplearning4j Tutorials. Archived from the original on 2016-10-03. Retrieved 2015-02-22.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
