# Autoencoder

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). [1] The encoding is validated and refined by attempting to regenerate the input from the encoding. The autoencoder learns a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore insignificant data (“noise”).

Variants exist that aim to force the learned representations to assume useful properties. [2] Examples are regularized autoencoders (sparse, denoising and contractive), which are effective in learning representations for subsequent classification tasks, [3] and variational autoencoders, with applications as generative models. [4] Autoencoders are applied to many problems, from facial recognition [5] to acquiring the meaning of words. [6] [7]

## Basic architecture

An autoencoder has two main parts: an encoder that maps the input into the code, and a decoder that maps the code to a reconstruction of the input.

The simplest way to perform the copying task perfectly would be to duplicate the signal. Instead, autoencoders are typically forced to reconstruct the input approximately, preserving only the most relevant aspects of the data in the copy.

The idea of autoencoders has been popular for decades. The first applications date to the 1980s. [2] [8] [9] Their most traditional application was dimensionality reduction or feature learning, but the concept became widely used for learning generative models of data. [10] [11] Some of the most powerful AIs in the 2010s involved autoencoders stacked inside deep neural networks. [12]

The simplest form of an autoencoder is a feedforward, non-recurrent neural network similar to single layer perceptrons that participate in multilayer perceptrons (MLP) – employing an input layer and an output layer connected by one or more hidden layers. The output layer has the same number of nodes (neurons) as the input layer. Its purpose is to reconstruct its inputs (minimizing the difference between the input and the output) instead of predicting a target value ${\displaystyle Y}$ given inputs ${\displaystyle X}$. Therefore, autoencoders learn unsupervised.

An autoencoder consists of two parts, the encoder and the decoder, which can be defined as transitions ${\displaystyle \phi }$ and ${\displaystyle \psi ,}$ such that:

${\displaystyle \phi :{\mathcal {X}}\rightarrow {\mathcal {F}}}$
${\displaystyle \psi :{\mathcal {F}}\rightarrow {\mathcal {X}}}$
${\displaystyle \phi ,\psi ={\underset {\phi ,\psi }{\operatorname {arg\,min} }}\,\|{\mathcal {X}}-(\psi \circ \phi ){\mathcal {X}}\|^{2}}$

In the simplest case, given one hidden layer, the encoder stage of an autoencoder takes the input ${\displaystyle \mathbf {x} \in \mathbb {R} ^{d}={\mathcal {X}}}$ and maps it to ${\displaystyle \mathbf {h} \in \mathbb {R} ^{p}={\mathcal {F}}}$:

${\displaystyle \mathbf {h} =\sigma (\mathbf {Wx} +\mathbf {b} )}$

This image ${\displaystyle \mathbf {h} }$ is usually referred to as code, latent variables, or a latent representation. ${\displaystyle \sigma }$ is an element-wise activation function such as a sigmoid function or a rectified linear unit. ${\displaystyle \mathbf {W} }$ is a weight matrix and ${\displaystyle \mathbf {b} }$ is a bias vector. Weights and biases are usually initialized randomly, and then updated iteratively during training through backpropagation. After that, the decoder stage of the autoencoder maps ${\displaystyle \mathbf {h} }$ to the reconstruction ${\displaystyle \mathbf {x'} }$ of the same shape as ${\displaystyle \mathbf {x} }$:

${\displaystyle \mathbf {x'} =\sigma '(\mathbf {W'h} +\mathbf {b'} )}$

where ${\displaystyle \mathbf {\sigma '} ,\mathbf {W'} ,{\text{ and }}\mathbf {b'} }$ for the decoder may be unrelated to the corresponding ${\displaystyle \mathbf {\sigma } ,\mathbf {W} ,{\text{ and }}\mathbf {b} }$ for the encoder.

Autoencoders are trained to minimise reconstruction errors (such as squared errors), often referred to as the "loss":

${\displaystyle {\mathcal {L}}(\mathbf {x} ,\mathbf {x'} )=\|\mathbf {x} -\mathbf {x'} \|^{2}=\|\mathbf {x} -\sigma '(\mathbf {W'} (\sigma (\mathbf {Wx} +\mathbf {b} ))+\mathbf {b'} )\|^{2}}$

where the loss is usually averaged over the inputs ${\displaystyle \mathbf {x} }$ in the training set.

As mentioned before, autoencoder training is performed through backpropagation of the error, just like other feedforward neural networks.
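The encoder, decoder and loss above can be put together in a short NumPy sketch. It is illustrative only and rests on several assumptions that are not in the text: a sigmoid encoder, a linear (identity) decoder activation, plain full-batch gradient descent, and hypothetical names such as `train_autoencoder`, `W2` and `b2` (standing in for the decoder's weights and biases).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, p, lr=0.1, epochs=300, seed=0):
    """Fit h = sigmoid(W x + b), x' = W2 h + b2 to the rows of X.

    Assumptions: identity decoder activation, squared-error loss averaged
    over the training set, full-batch gradient descent.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    W = rng.normal(0.0, 0.1, size=(p, d))    # encoder weights
    b = np.zeros(p)                          # encoder bias
    W2 = rng.normal(0.0, 0.1, size=(d, p))   # decoder weights
    b2 = np.zeros(d)                         # decoder bias
    losses = []
    for _ in range(epochs):
        # Forward pass: encode, then decode.
        H = sigmoid(X @ W.T + b)             # codes, shape (m, p)
        Xr = H @ W2.T + b2                   # reconstructions, shape (m, d)
        err = Xr - X
        losses.append(np.mean(np.sum(err ** 2, axis=1)))
        # Backward pass: gradients of the averaged squared error.
        dXr = 2.0 * err / m
        dW2 = dXr.T @ H
        db2 = dXr.sum(axis=0)
        dA = (dXr @ W2) * H * (1.0 - H)      # through the sigmoid
        dW = dA.T @ X
        db = dA.sum(axis=0)
        # Gradient-descent update.
        W -= lr * dW
        b -= lr * db
        W2 -= lr * dW2
        b2 -= lr * db2
    return (W, b, W2, b2), losses

# Toy data: 200 samples in 8 dimensions, compressed to a 3-dimensional code.
X = np.random.default_rng(1).random((200, 8))
params, losses = train_autoencoder(X, p=3)
```

The recorded `losses` list makes it easy to check that the reconstruction error falls during training; real implementations would normally use mini-batches and an automatic-differentiation library instead of hand-written gradients.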

Should the feature space ${\displaystyle {\mathcal {F}}}$ have lower dimensionality than the input space ${\displaystyle {\mathcal {X}}}$, the feature vector ${\displaystyle \phi (x)}$ can be regarded as a compressed representation of the input ${\displaystyle x}$. This is the case of undercomplete autoencoders. If the hidden layers are larger than (overcomplete), or equal to, the input layer, or the hidden units are given enough capacity, an autoencoder can potentially learn the identity function and become useless. However, experimental results found that overcomplete autoencoders might still learn useful features. [13] In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. One way to do so is to exploit the model variants known as Regularized Autoencoders. [2]

## Variations

### Regularized autoencoders

Various techniques exist to prevent autoencoders from learning the identity function and to improve their ability to capture important information and learn richer representations.

#### Sparse autoencoder (SAE)

Learning representations in a way that encourages sparsity improves performance on classification tasks. [14] Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time (thus, sparse). [12] This constraint forces the model to respond to the unique statistical features of the training data.

Specifically, a sparse autoencoder is an autoencoder whose training criterion involves a sparsity penalty ${\displaystyle \Omega ({\boldsymbol {h}})}$ on the code layer ${\displaystyle {\boldsymbol {h}}}$.

${\displaystyle {\mathcal {L}}(\mathbf {x} ,\mathbf {x'} )+\Omega ({\boldsymbol {h}})}$

Recalling that ${\displaystyle {\boldsymbol {h}}=f({\boldsymbol {W}}{\boldsymbol {x}}+{\boldsymbol {b}})}$, the penalty encourages the model to activate (i.e. output value close to 1) specific areas of the network on the basis of the input data, while inactivating all other neurons (i.e. to have an output value close to 0). [15]

This sparsity can be achieved by formulating the penalty terms in different ways.

• One way is to exploit the Kullback-Leibler (KL) divergence. Let
${\displaystyle {\hat {\rho _{j}}}={\frac {1}{m}}\sum _{i=1}^{m}[h_{j}(x_{i})]}$
be the average activation of the hidden unit ${\displaystyle j}$ (averaged over the ${\displaystyle m}$ training examples). The notation ${\displaystyle h_{j}(x_{i})}$ makes explicit the input ${\displaystyle x_{i}}$ that produced the activation. To encourage most of the neurons to be inactive, ${\displaystyle {\hat {\rho _{j}}}}$ needs to be close to 0. Therefore, this method enforces the constraint ${\displaystyle {\hat {\rho _{j}}}=\rho }$ where ${\displaystyle \rho }$ is the sparsity parameter, a value close to zero. The penalty term ${\displaystyle \Omega ({\boldsymbol {h}})}$ takes a form that penalizes ${\displaystyle {\hat {\rho _{j}}}}$ for deviating significantly from ${\displaystyle \rho }$, exploiting the KL divergence:
${\displaystyle \sum _{j=1}^{s}KL(\rho ||{\hat {\rho _{j}}})=\sum _{j=1}^{s}\left[\rho \log {\frac {\rho }{\hat {\rho _{j}}}}+(1-\rho )\log {\frac {1-\rho }{1-{\hat {\rho _{j}}}}}\right]}$
where ${\displaystyle j}$ is summing over the ${\displaystyle s}$ hidden nodes in the hidden layer, and ${\displaystyle KL(\rho ||{\hat {\rho _{j}}})}$ is the KL-divergence between a Bernoulli random variable with mean ${\displaystyle \rho }$ and a Bernoulli random variable with mean ${\displaystyle {\hat {\rho _{j}}}}$. [15]
• Another way to achieve sparsity is by applying L1 or L2 regularization terms on the activation, scaled by a certain parameter ${\displaystyle \lambda }$. [18] For instance, in the case of L1 the loss function becomes
${\displaystyle {\mathcal {L}}(\mathbf {x} ,\mathbf {x'} )+\lambda \sum _{i}|h_{i}|}$
• A further proposed strategy to force sparsity is to manually zero all but the strongest hidden unit activations (k-sparse autoencoder). [19] The k-sparse autoencoder is based on a linear autoencoder (i.e. with linear activation function) and tied weights. The identification of the strongest activations can be achieved by sorting the activities and keeping only the first k values, or by using ReLU hidden units with thresholds that are adaptively adjusted until the k largest activities are identified. This selection acts like the previously mentioned regularization terms in that it prevents the model from reconstructing the input using too many neurons. [19]
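The three sparsity mechanisms listed above can be sketched in NumPy as follows. The function names, the default values of `rho` and `lam`, and the clipping constant `eps` are illustrative assumptions rather than part of any cited method.

```python
import numpy as np

def kl_sparsity_penalty(H, rho=0.05, eps=1e-8):
    """Sum over hidden units j of KL(rho || rho_hat_j).

    H holds hidden activations in (0, 1), one row per training example.
    """
    rho_hat = np.clip(H.mean(axis=0), eps, 1.0 - eps)  # average activation per unit
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

def l1_sparsity_penalty(h, lam=1e-3):
    """L1 penalty lambda * sum_i |h_i| on a code vector h."""
    return lam * np.sum(np.abs(h))

def k_sparse(h, k):
    """Keep the k largest activations of a 1-D code h and zero the rest."""
    out = np.zeros_like(h)
    keep = np.argsort(h)[-k:]      # indices of the k largest values
    out[keep] = h[keep]
    return out
```

For example, `k_sparse(np.array([0.1, 0.9, 0.3, 0.7]), 2)` keeps only the entries 0.9 and 0.7. The KL and L1 terms would be added to the reconstruction loss, whereas the k-sparse selection is applied to the code itself before decoding.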

#### Denoising autoencoder (DAE)

Denoising autoencoders (DAE) try to achieve a good representation by changing the reconstruction criterion. [2]

Indeed, DAEs take a partially corrupted input and are trained to recover the original undistorted input. In practice, the objective of denoising autoencoders is that of cleaning the corrupted input, or denoising. Two assumptions are inherent to this approach:

• Higher level representations are relatively stable and robust to the corruption of the input;
• To perform denoising well, the model needs to extract features that capture useful structure in the input distribution. [3]

In other words, denoising is advocated as a training criterion for learning to extract useful features that will constitute better higher level representations of the input. [3]

The training process of a DAE works as follows:

• The initial input ${\displaystyle x}$ is corrupted into ${\displaystyle {\boldsymbol {\tilde {x}}}}$ through stochastic mapping ${\displaystyle {\boldsymbol {\tilde {x}}}\thicksim q_{D}({\boldsymbol {\tilde {x}}}|{\boldsymbol {x}})}$.
• The corrupted input ${\displaystyle {\boldsymbol {\tilde {x}}}}$ is then mapped to a hidden representation with the same process of the standard autoencoder, ${\displaystyle {\boldsymbol {h}}=f_{\theta }({\boldsymbol {\tilde {x}}})=s({\boldsymbol {W}}{\boldsymbol {\tilde {x}}}+{\boldsymbol {b}})}$.
• From the hidden representation the model reconstructs ${\displaystyle {\boldsymbol {z}}=g_{\theta '}({\boldsymbol {h}})}$. [3]

The model's parameters ${\displaystyle \theta }$ and ${\displaystyle \theta '}$ are trained to minimize the average reconstruction error over the training data, specifically, minimizing the difference between ${\displaystyle {\boldsymbol {z}}}$ and the original uncorrupted input ${\displaystyle {\boldsymbol {x}}}$. [3] Note that each time a random example ${\displaystyle {\boldsymbol {x}}}$ is presented to the model, a new corrupted version is generated stochastically on the basis of ${\displaystyle q_{D}({\boldsymbol {\tilde {x}}}|{\boldsymbol {x}})}$.

The above-mentioned training process could be applied with any kind of corruption process. Some examples might be additive isotropic Gaussian noise, masking noise (a fraction of the input chosen at random for each example is forced to 0) or salt-and-pepper noise (a fraction of the input chosen at random for each example is set to its minimum or maximum value with uniform probability). [3]

The corruption of the input is performed only during training. After training, no corruption is added.
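The corruption processes named above, and the way the loss compares the reconstruction with the clean input, might be sketched as below. One simplification is assumed: each input element is corrupted independently with probability `fraction`, instead of corrupting an exact fraction of the elements. All function names are hypothetical.

```python
import numpy as np

def gaussian_noise(x, sigma, rng):
    """Additive isotropic Gaussian noise."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def masking_noise(x, fraction, rng):
    """Force randomly chosen elements to 0."""
    mask = rng.random(x.shape) < fraction
    return np.where(mask, 0.0, x)

def salt_and_pepper(x, fraction, rng, lo=0.0, hi=1.0):
    """Set randomly chosen elements to the minimum or maximum value."""
    corrupt = rng.random(x.shape) < fraction
    salt = rng.random(x.shape) < 0.5       # choose hi or lo with equal probability
    return np.where(corrupt, np.where(salt, hi, lo), x)

def denoising_loss(x, corrupt, reconstruct):
    """Squared error between the clean x and the reconstruction of corrupt(x)."""
    x_tilde = corrupt(x)                   # a fresh corruption on every call
    z = reconstruct(x_tilde)
    return np.sum((x - z) ** 2)
```

Here `reconstruct` stands for the composition of encoder and decoder. Because `corrupt` is called inside the loss, a new corrupted version is drawn every time an example is presented, and at test time the model is simply applied to the uncorrupted input.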

#### Contractive autoencoder (CAE)

A contractive autoencoder adds an explicit regularizer in its objective function that forces the model to learn an encoding robust to slight variations of input values. This regularizer corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. Since the penalty is applied to training examples only, this term forces the model to learn useful information about the training distribution. The final objective function has the following form:

${\displaystyle {\mathcal {L}}(\mathbf {x} ,\mathbf {x'} )+\lambda \sum _{i}||\nabla _{x}h_{i}||^{2}}$

The autoencoder is termed contractive because it is encouraged to map a neighborhood of input points to a smaller neighborhood of output points. [2]
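For a sigmoid encoder the contractive penalty has a simple closed form, because the Jacobian of ${\displaystyle {\boldsymbol {h}}}$ with respect to the input is the weight matrix with each row scaled by ${\displaystyle h_{j}(1-h_{j})}$. The sketch below assumes that sigmoid encoder and uses hypothetical function names; the finite-difference helper is included only to check the closed form.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of dh/dx for h = sigmoid(W x + b)."""
    h = sigmoid(W @ x + b)
    # J = diag(h * (1 - h)) @ W, so ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * ||W_j||^2
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

def numerical_jacobian(x, W, b, eps=1e-6):
    """Central-difference Jacobian of the encoder, used as a sanity check."""
    J = np.zeros((b.size, x.size))
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        J[:, i] = (sigmoid(W @ (x + e) + b) - sigmoid(W @ (x - e) + b)) / (2.0 * eps)
    return J
```

In training, this value would be multiplied by ${\displaystyle \lambda }$ and added to the reconstruction loss for each training example.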

DAE is connected to CAE: in the limit of small Gaussian input noise, DAEs make the reconstruction function resist small but finite-sized input perturbations, while CAEs make the extracted features resist infinitesimal input perturbations.

### Concrete autoencoder

The concrete autoencoder is designed for discrete feature selection. [20] A concrete autoencoder forces the latent space to consist only of a user-specified number of features. The concrete autoencoder uses a continuous relaxation of the categorical distribution to allow gradients to pass through the feature selector layer, which makes it possible to use standard backpropagation to learn an optimal subset of input features that minimize reconstruction loss.
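The continuous relaxation mentioned above is commonly implemented by adding Gumbel noise to the selector logits and applying a temperature-scaled softmax. The following sketch shows only that sampling step, not the full training procedure of [20]; the names `concrete_sample`, `log_alpha` and `temperature` are assumptions made for illustration.

```python
import numpy as np

def concrete_sample(log_alpha, temperature, rng):
    """Draw relaxed one-hot rows from a Concrete (Gumbel-softmax) distribution.

    log_alpha has shape (k, d): one row of logits per selected feature.
    """
    u = rng.uniform(low=1e-12, high=1.0, size=log_alpha.shape)
    gumbel = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    z = (log_alpha + gumbel) / temperature
    z = z - z.max(axis=-1, keepdims=True)           # numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
log_alpha = np.zeros((2, 6))                        # select k = 2 of d = 6 features
weights = concrete_sample(log_alpha, temperature=0.5, rng=rng)
x = np.arange(6, dtype=float)
selected = weights @ x                              # soft selection of 2 features
```

Each row of `weights` is a differentiable, approximately one-hot vector; as the temperature is lowered the rows approach exact one-hot selections of single input features.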

### Variational autoencoder (VAE)

Variational autoencoders (VAEs) belong to the family of variational Bayesian methods. Despite the architectural similarities with basic autoencoders, VAEs are architectures with different goals and a completely different mathematical formulation. In this case the latent space is described by a probability distribution rather than by a fixed vector.

Given an input dataset ${\displaystyle \mathbf {x} }$ characterized by an unknown probability function ${\displaystyle P(\mathbf {x} )}$ and a multivariate latent encoding vector ${\displaystyle \mathbf {z} }$, the objective is to model the data as a distribution ${\displaystyle p_{\theta }(\mathbf {x} )}$, with ${\displaystyle \theta }$ defined as the set of the network parameters so that ${\displaystyle p_{\theta }(\mathbf {x} )=\int _{\mathbf {z} }p_{\theta }(\mathbf {x,z} )d\mathbf {z} }$.
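The marginalization over ${\displaystyle \mathbf {z} }$ can be made concrete with a toy model in which the integral has a known answer. The sketch below is not a VAE: it assumes a one-dimensional latent with a standard normal prior and a linear-Gaussian decoder with hypothetical parameters `w` and `s2`, and it compares a Monte Carlo estimate of ${\displaystyle p_{\theta }(x)}$ with the closed form.

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def marginal_likelihood_mc(x, w, s2, n_samples, rng):
    """Estimate p(x) = integral of p(x | z) p(z) dz by sampling z from the prior."""
    z = rng.standard_normal(n_samples)          # z ~ N(0, 1)
    return np.mean(normal_pdf(x, w * z, s2))    # average of p(x | z) = N(x; w z, s2)

def marginal_likelihood_exact(x, w, s2):
    """For this toy model the marginal is Gaussian: x ~ N(0, w^2 + s2)."""
    return normal_pdf(x, 0.0, w ** 2 + s2)

rng = np.random.default_rng(0)
estimate = marginal_likelihood_mc(0.7, w=1.5, s2=0.5, n_samples=200_000, rng=rng)
exact = marginal_likelihood_exact(0.7, w=1.5, s2=0.5)
```

With a neural-network decoder no such closed form exists and naive sampling from the prior becomes inefficient in high dimensions, which is the motivation for the variational treatment.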

### Advantages of depth

Autoencoders are often trained with a single layer encoder and a single layer decoder, but using many-layered (deep) encoders and decoders offers many advantages. [2]

• Depth can exponentially reduce the computational cost of representing some functions. [2]
• Depth can exponentially decrease the amount of training data needed to learn some functions. [2]
• Experimentally, deep autoencoders yield better compression compared to shallow or linear autoencoders. [21]

### Training

Geoffrey Hinton developed the deep belief network technique for training many-layered deep autoencoders. His method involves treating each neighbouring set of two layers as a restricted Boltzmann machine so that pretraining approximates a good solution, then using backpropagation to fine-tune the results. [21]

Researchers have debated whether joint training (i.e. training the whole architecture together with a single global reconstruction objective to optimize) would be better for deep auto-encoders. [22] A 2015 study showed that joint training learns better data models along with more representative features for classification as compared to the layerwise method. [22] However, their experiments showed that the success of joint training depends heavily on the regularization strategies adopted. [22] [23]

## Applications

The two main applications of autoencoders are dimensionality reduction and information retrieval, [2] but modern variations have been applied to other tasks.

### Dimensionality reduction

Dimensionality reduction was one of the first deep learning applications. [2]

In his 2006 study, [21] Hinton pretrained a multi-layer autoencoder with a stack of RBMs and then used their weights to initialize a deep autoencoder with gradually smaller hidden layers until hitting a bottleneck of 30 neurons. The resulting 30 dimensions of the code yielded a smaller reconstruction error compared to the first 30 components of a principal component analysis (PCA), and learned a representation that was qualitatively easier to interpret, clearly separating data clusters. [2] [21]

Representing data in a lower-dimensional space can improve performance on tasks such as classification. [2] Indeed, the hallmark of dimensionality reduction is to place semantically related examples near each other. [25]

#### Principal component analysis

If linear activations are used, or only a single sigmoid hidden layer, then the optimal solution to an autoencoder is strongly related to principal component analysis (PCA). [26] [27] The weights of an autoencoder with a single hidden layer of size ${\displaystyle p}$ (where ${\displaystyle p}$ is less than the size of the input) span the same vector subspace as the one spanned by the first ${\displaystyle p}$ principal components, and the output of the autoencoder is an orthogonal projection onto this subspace. The autoencoder weights are not equal to the principal components, and are generally not orthogonal, yet the principal components may be recovered from them using the singular value decomposition. [28]
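The relationship with PCA can be checked numerically. The sketch below does not train anything: it constructs one optimal pair of linear encoder and decoder weights by mixing the top principal directions with an arbitrary invertible matrix `A` (an assumption made purely for illustration), and then shows that the reconstruction equals the PCA projection and that a singular value decomposition of the decoder weights recovers an orthonormal basis of the principal subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)                       # centre the data

p = 2
# PCA: the top-p right singular vectors of the centred data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_p = Vt[:p].T                               # (d, p), orthonormal columns
pca_recon = X @ V_p @ V_p.T                  # orthogonal projection onto the subspace

# A linear autoencoder with the same reconstruction but non-orthogonal weights.
A = np.array([[2.0, 1.0], [0.5, 3.0]])       # any invertible p x p matrix
W_enc = A @ V_p.T                            # encoder weights, (p, d)
W_dec = V_p @ np.linalg.inv(A)               # decoder weights, (d, p)
ae_recon = X @ W_enc.T @ W_dec.T

# SVD of the decoder weights gives an orthonormal basis of the same subspace.
Q, _, _ = np.linalg.svd(W_dec, full_matrices=False)
same_subspace = np.allclose(Q @ Q.T, V_p @ V_p.T)
```

The columns of `W_dec` are neither orthogonal nor equal to the principal components, yet `ae_recon` matches `pca_recon` and the projector built from `Q` matches the PCA projector.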

However, the potential of autoencoders resides in their non-linearity, allowing the model to learn more powerful generalizations compared to PCA, and to reconstruct the input with significantly lower information loss. [21]

### Information retrieval

Information retrieval benefits particularly from dimensionality reduction in that search can become more efficient in certain kinds of low dimensional spaces. Autoencoders were indeed applied to semantic hashing, proposed by Salakhutdinov and Hinton in 2007. [25] By training the algorithm to produce a low-dimensional binary code, all database entries could be stored in a hash table mapping binary code vectors to entries. This table would then support information retrieval by returning all entries with the same binary code as the query, or slightly less similar entries by flipping some bits from the query encoding.
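The lookup scheme described above can be sketched with an ordinary dictionary. The binary codes are assumed to have been produced already (for instance by thresholding the encoder output), and the function names and the `max_flips` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_table(codes):
    """Map each binary code (a tuple of 0/1 bits) to the entries that share it."""
    table = defaultdict(list)
    for entry, code in codes.items():
        table[tuple(code)].append(entry)
    return table

def query(table, code, max_flips=1):
    """Return entries whose code differs from `code` in at most max_flips bits."""
    code = tuple(code)
    results = []
    for r in range(max_flips + 1):
        for positions in combinations(range(len(code)), r):
            probe = list(code)
            for i in positions:
                probe[i] ^= 1                # flip one bit of the query code
            results.extend(table.get(tuple(probe), []))
    return results

codes = {"a": (0, 1, 1, 0), "b": (0, 1, 1, 1), "c": (1, 0, 0, 1)}
table = build_table(codes)
exact_matches = query(table, (0, 1, 1, 0), max_flips=0)
near_matches = query(table, (0, 1, 1, 0), max_flips=1)
```

Exact matches are returned first, followed by entries at increasing Hamming distance. The number of probes grows combinatorially with `max_flips`, so in practice only a few bits are flipped.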

### Anomaly detection

Another application for autoencoders is anomaly detection. [29] [30] [31] [32] [33] By learning to replicate the most salient features in the training data under some of the constraints described previously, the model is encouraged to learn to precisely reproduce the most frequently observed characteristics. When facing anomalies, the model should worsen its reconstruction performance. In most cases, only data with normal instances are used to train the autoencoder; in others, the frequency of anomalies is small compared to the observation set so that its contribution to the learned representation could be ignored. After training, the autoencoder will accurately reconstruct "normal" data, while failing to do so with unfamiliar anomalous data. [31] Reconstruction error (the error between the original data and its low dimensional reconstruction) is used as an anomaly score to detect anomalies. [31]
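Scoring by reconstruction error can be illustrated with a linear autoencoder, used here as a stand-in for a trained network because its optimal weights follow directly from a singular value decomposition. The synthetic data, the 99th-percentile threshold and the function names are all assumptions made for the sketch.

```python
import numpy as np

def fit_linear_autoencoder(X, p):
    """Return the data mean and an orthonormal basis of the top-p principal subspace."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:p].T

def anomaly_score(X, mean, V):
    """Squared reconstruction error of each row after encoding and decoding."""
    Xc = X - mean
    recon = Xc @ V @ V.T
    return np.sum((Xc - recon) ** 2, axis=1)

rng = np.random.default_rng(0)
# "Normal" data lies close to a 2-dimensional subspace of a 6-dimensional space.
B = rng.normal(size=(2, 6))
normal = rng.normal(size=(500, 2)) @ B + 0.05 * rng.normal(size=(500, 6))
# Anomalies are drawn without that structure.
anomalies = 3.0 * rng.normal(size=(20, 6))

mean, V = fit_linear_autoencoder(normal, p=2)
threshold = np.percentile(anomaly_score(normal, mean, V), 99)
flags = anomaly_score(anomalies, mean, V) > threshold
```

Points whose score exceeds the threshold are flagged as anomalous. The threshold trades false alarms against missed anomalies and would normally be chosen on held-out data.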

Recent literature has however shown that certain autoencoding models can, counterintuitively, be very good at reconstructing anomalous examples and consequently not able to reliably perform anomaly detection. [34] [35]

### Image processing

The characteristics of autoencoders are useful in image processing.

One example can be found in lossy image compression, where autoencoders outperformed other approaches and proved competitive against JPEG 2000. [36] [37]

Another useful application of autoencoders in image preprocessing is image denoising. [38] [39] [40]

Autoencoders found use in more demanding contexts such as medical imaging where they have been used for image denoising [41] as well as super-resolution. [42] [43] In image-assisted diagnosis, experiments have applied autoencoders for breast cancer detection [44] and for modelling the relation between the cognitive decline of Alzheimer's Disease and the latent features of an autoencoder trained with MRI. [45]

### Drug discovery

In 2019 molecules generated with variational autoencoders were validated experimentally in mice. [46] [47]

### Popularity prediction

Recently, a stacked autoencoder framework produced promising results in predicting popularity of social media posts, [48] which is helpful for online advertising strategies.

### Machine translation

Autoencoders have been applied to machine translation, an approach usually referred to as neural machine translation (NMT). [49] [50] Unlike traditional autoencoders, the output does not match the input; it is in another language. In NMT, texts are treated as sequences to be encoded into the learning procedure, while on the decoder side sequences in the target language(s) are generated. Language-specific autoencoders incorporate further linguistic features into the learning procedure, such as Chinese decomposition features. [51]

## References

1. Kramer, Mark A. (1991). "Nonlinear principal component analysis using autoassociative neural networks" (PDF). AIChE Journal. 37 (2): 233–243. doi:10.1002/aic.690370209.
2. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep Learning. MIT Press. ISBN   978-0262035613.
3. Vincent, Pascal; Larochelle, Hugo (2010). "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion". Journal of Machine Learning Research. 11: 3371–3408.
4. Welling, Max; Kingma, Diederik P. (2019). "An Introduction to Variational Autoencoders". Foundations and Trends in Machine Learning. 12 (4): 307–392. Bibcode:2019arXiv190602691K. doi:10.1561/2200000056. S2CID 174802445.
5. Hinton GE, Krizhevsky A, Wang SD. Transforming auto-encoders. In International Conference on Artificial Neural Networks 2011 Jun 14 (pp. 44-51). Springer, Berlin, Heidelberg.
6. Liou, Cheng-Yuan; Huang, Jau-Chi; Yang, Wen-Chie (2008). "Modeling word perception using the Elman network". Neurocomputing. 71 (16–18): 3150. doi:10.1016/j.neucom.2008.04.030.
7. Liou, Cheng-Yuan; Cheng, Wei-Chen; Liou, Jiun-Wei; Liou, Daw-Ran (2014). "Autoencoder for words". Neurocomputing. 139: 84–96. doi:10.1016/j.neucom.2013.09.055.
8. Schmidhuber, Jürgen (January 2015). "Deep learning in neural networks: An overview". Neural Networks. 61: 85–117. doi:10.1016/j.neunet.2014.09.003. PMID 25462637. S2CID 11715509.
9. Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and Helmholtz free energy. In Advances in neural information processing systems 6 (pp. 3-10).
10. Diederik P Kingma; Welling, Max (2013). "Auto-Encoding Variational Bayes". arXiv: [stat.ML].
11. Generating Faces with Torch, Boesen A., Larsen L. and Sonderby S.K., 2015
12. Domingos, Pedro (2015). "4". The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books. "Deeper into the Brain" subsection. ISBN   978-046506192-1.
13. Bengio, Y. (2009). "Learning Deep Architectures for AI" (PDF). Foundations and Trends in Machine Learning. 2 (8): 1795–7. doi:10.1561/2200000006. PMID 23946944.
14. Frey, Brendan; Makhzani, Alireza (2013-12-19). "k-Sparse Autoencoders". Bibcode:2013arXiv1312.5663M.
15. Ng, A. (2011). Sparse autoencoder. CS294A Lecture notes, 72(2011), 1-19.
16. Nair, Vinod; Hinton, Geoffrey E. (2009). "3D Object Recognition with Deep Belief Nets". Proceedings of the 22nd International Conference on Neural Information Processing Systems. NIPS'09. USA: Curran Associates Inc.: 1339–1347. ISBN 9781615679119.
17. Zeng, Nianyin; Zhang, Hong; Song, Baoye; Liu, Weibo; Li, Yurong; Dobaie, Abdullah M. (2018-01-17). "Facial expression recognition via learning deep sparse autoencoders". Neurocomputing. 273: 643–649. doi:10.1016/j.neucom.2017.08.043. ISSN   0925-2312.
18. Arpit, Devansh; Zhou, Yingbo; Ngo, Hung; Govindaraju, Venu (2015). "Why Regularized Auto-Encoders learn Sparse Representation?". arXiv: [stat.ML].
19. Makhzani, Alireza; Frey, Brendan (2013). "K-Sparse Autoencoders". arXiv: [cs.LG].
20. Abid, Abubakar; Balin, Muhammad Fatih; Zou, James (2019-01-27). "Concrete Autoencoders for Differentiable Feature Selection and Reconstruction". arXiv: [cs.LG].
21. Hinton, G. E.; Salakhutdinov, R.R. (2006-07-28). "Reducing the Dimensionality of Data with Neural Networks". Science. 313 (5786): 504–507. Bibcode:2006Sci...313..504H. doi:10.1126/science.1127647. PMID   16873662. S2CID   1658773.
22. Zhou, Yingbo; Arpit, Devansh; Nwogu, Ifeoma; Govindaraju, Venu (2014). "Is Joint Training Better for Deep Auto-Encoders?". arXiv: [stat.ML].
23. R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in AISTATS, 2009, pp. 448–455.
24. "Fashion MNIST". 2019-07-12.
25. Salakhutdinov, Ruslan; Hinton, Geoffrey (2009-07-01). "Semantic hashing". International Journal of Approximate Reasoning. Special Section on Graphical Models and Information Retrieval. 50 (7): 969–978. ISSN 0888-613X.
26. Bourlard, H.; Kamp, Y. (1988). "Auto-association by multilayer perceptrons and singular value decomposition". Biological Cybernetics. 59 (4–5): 291–294. doi:10.1007/BF00332918. PMID   3196773. S2CID   206775335.
27. Chicco, Davide; Sadowski, Peter; Baldi, Pierre (2014). "Deep autoencoder neural networks for gene ontology annotation predictions". Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB '14. p. 533. doi:10.1145/2649387.2649442. hdl:11311/964622. ISBN   9781450328944. S2CID   207217210.
28. Plaut, E (2018). "From Principal Subspaces to Principal Components with Linear Autoencoders". arXiv: [stat.ML].
29. Morales-Forero, A., & Bassetto, S. (2019, December). Case Study: A Semi-Supervised Methodology for Anomaly Detection and Diagnosis. In 2019 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM) (p. 4) (pp. 1031-1037). IEEE.
30. Sakurada, M., & Yairi, T. (2014, December). Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis (p. 4). ACM.
31. An, J., & Cho, S. (2015). Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE, 2, 1-18.
32. Zhou, C., & Paffenroth, R. C. (2017, August). Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 665-674). ACM.
33. Ribeiro, Manassés; Lazzaretti, André Eugênio; Lopes, Heitor Silvério (2018). "A study of deep convolutional auto-encoders for anomaly detection in videos". Pattern Recognition Letters. 105: 13–22. doi:10.1016/j.patrec.2017.07.016.
34. Nalisnick, Eric; Matsukawa, Akihiro; Teh, Yee Whye; Gorur, Dilan; Lakshminarayanan, Balaji (2019-02-24). "Do Deep Generative Models Know What They Don't Know?". arXiv: [stat.ML].
35. Xiao, Zhisheng; Yan, Qing; Amit, Yali (2020). "Likelihood Regret: An Out-of-Distribution Detection Score For Variational Auto-encoder". Advances in Neural Information Processing Systems. 33.
36. Theis, Lucas; Shi, Wenzhe; Cunningham, Andrew; Huszár, Ferenc (2017). "Lossy Image Compression with Compressive Autoencoders". arXiv: [stat.ML].
37. Balle, J; Laparra, V; Simoncelli, EP (April 2017). "End-to-end optimized image compression". International Conference on Learning Representations.
38. Cho, K. (2013, February). Simple sparsification improves sparse denoising autoencoders in denoising highly corrupted images. In International Conference on Machine Learning (pp. 432-440).
39. Cho, Kyunghyun (2013). "Boltzmann Machines and Denoising Autoencoders for Image Denoising". arXiv: [stat.ML].
40. Buades, A.; Coll, B.; Morel, J. M. (2005). "A Review of Image Denoising Algorithms, with a New One". Multiscale Modeling & Simulation. 4 (2): 490–530. doi:10.1137/040616024.
41. Gondara, Lovedeep (December 2016). "Medical Image Denoising Using Convolutional Denoising Autoencoders". 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). Barcelona, Spain: IEEE: 241–246. Bibcode:2016arXiv160804667G. doi:10.1109/ICDMW.2016.0041. ISBN 9781509059102. S2CID 14354973.
42. Zeng, Kun; Yu, Jun; Wang, Ruxin; Li, Cuihua; Tao, Dacheng (January 2017). "Coupled Deep Autoencoder for Single Image Super-Resolution". IEEE Transactions on Cybernetics. 47 (1): 27–37. doi:10.1109/TCYB.2015.2501373. ISSN   2168-2267. PMID   26625442. S2CID   20787612.
43. Tzu-Hsi, Song; Sanchez, Victor; Hesham, EIDaly; Nasir M., Rajpoot (2017). "Hybrid deep autoencoder with Curvature Gaussian for detection of various types of cells in bone marrow trephine biopsy images". 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017): 1040–1043. doi:10.1109/ISBI.2017.7950694. ISBN   978-1-5090-1172-8. S2CID   7433130.
44. Xu, Jun; Xiang, Lei; Liu, Qingshan; Gilmore, Hannah; Wu, Jianzhong; Tang, Jinghai; Madabhushi, Anant (January 2016). "Stacked Sparse Autoencoder (SSAE) for Nuclei Detection on Breast Cancer Histopathology Images". IEEE Transactions on Medical Imaging. 35 (1): 119–130. doi:10.1109/TMI.2015.2458702. PMID 26208307.
45. Martinez-Murcia, Francisco J.; Ortiz, Andres; Gorriz, Juan M.; Ramirez, Javier; Castillo-Barnes, Diego (2020). "Studying the Manifold Structure of Alzheimer's Disease: A Deep Learning Approach Using Convolutional Autoencoders". IEEE Journal of Biomedical and Health Informatics. 24 (1): 17–26. PMID 31217131. S2CID 195187846.
46. Zhavoronkov, Alex (2019). "Deep learning enables rapid identification of potent DDR1 kinase inhibitors". Nature Biotechnology. 37 (9): 1038–1040. doi:10.1038/s41587-019-0224-x. PMID   31477924. S2CID   201716327.
47. Gregory, Barber. "A Molecule Designed By AI Exhibits 'Druglike' Qualities". Wired.
48. De, Shaunak; Maity, Abhishek; Goel, Vritti; Shitole, Sanjay; Bhattacharya, Avik (2017). "Predicting the popularity of instagram posts for a lifestyle magazine using deep learning". 2017 2nd IEEE International Conference on Communication Systems, Computing and IT Applications (CSCITA). pp. 174–177. doi:10.1109/CSCITA.2017.8066548. ISBN   978-1-5090-4381-1. S2CID   35350962.
49. Cho, Kyunghyun; Bart van Merrienboer; Bahdanau, Dzmitry; Bengio, Yoshua (2014). "On the Properties of Neural Machine Translation: Encoder-Decoder Approaches". arXiv: [cs.CL].
50. Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V. (2014). "Sequence to Sequence Learning with Neural Networks". arXiv: [cs.CL].
51. Han, Lifeng; Kuang, Shaohui (2018). "Incorporating Chinese Radicals into Neural Machine Translation: Deeper Than Character Level". arXiv: [cs.CL].