Latent diffusion model

Original author(s) CompVis
Initial release December 20, 2021
Repository https://github.com/CompVis/latent-diffusion
Written in Python
License MIT

The Latent Diffusion Model (LDM) [1] is a diffusion model architecture developed by the CompVis (Computer Vision & Learning) [2] group at LMU Munich. [3]

Introduced in 2015, diffusion models (DMs) are trained to remove successive applications of Gaussian noise from training images. The LDM improves on the standard DM by performing diffusion modeling in a latent space, and by allowing self-attention and cross-attention conditioning.

LDMs are widely used in practical diffusion models. Stable Diffusion (SD) versions 1.1 through 2.1 were based on the LDM architecture. [4]

Version history

Diffusion models were introduced in 2015 as a method to learn a model that can sample from a highly complex probability distribution. They used techniques from non-equilibrium thermodynamics, especially diffusion. [5] The paper was accompanied by a software implementation in Theano. [6]

A 2019 paper proposed the noise conditional score network (NCSN), also known as score matching with Langevin dynamics (SMLD). [8] The paper was accompanied by a software package written in PyTorch and released on GitHub. [7]

A 2020 paper [9] proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by using variational inference. The paper was accompanied by a software package written in TensorFlow and released on GitHub. [10] It was reimplemented in PyTorch by lucidrains. [11] [12]

On December 20, 2021, the LDM paper was published on arXiv, [13] and both Stable Diffusion [14] and LDM [15] repositories were published on GitHub. However, they remained roughly the same. Substantial information concerning Stable Diffusion v1 was only added to GitHub on August 10, 2022. [16]

SD 1.1 to 1.4 were particular instantiations of the LDM architecture, released by CompVis in August 2022. There is no "version 1.0". SD 1.1 was an LDM trained on the laion2B-en dataset. SD 1.1 was finetuned to 1.2 on more aesthetic images. SD 1.2 was finetuned to 1.3, 1.4, and 1.5, with 10% of text-conditioning dropped, to improve classifier-free guidance. [17] [18] SD 1.5 was released by RunwayML in October 2022. [18]

Architecture

While the LDM can work for generating arbitrary data conditional on arbitrary data, for concreteness, we describe its operation in conditional text-to-image generation.

LDM consists of a variational autoencoder (VAE), a modified U-Net, and a text encoder.

The VAE encoder compresses the image from pixel space to a lower-dimensional latent space, capturing a more fundamental semantic meaning of the image. During forward diffusion, Gaussian noise is iteratively applied to the compressed latent representation. The U-Net, built on a ResNet backbone, then reverses the forward diffusion by denoising its output to recover a latent representation. Finally, the VAE decoder converts this representation back into pixel space to produce the final image. [4]

The denoising step can be conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to the denoising U-Net via a cross-attention mechanism. [4] For conditioning on text, a fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts into an embedding space. [3]
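The cross-attention step can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the CompVis implementation: the function names, the random projection matrices, and the latent sizes are assumptions chosen for the example. Only the 77 × 768 conditioning shape is meant to mirror the output of a CLIP ViT-L/14 text encoder. Queries come from the latent positions, while keys and values come from the conditioning tokens.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: latent positions attend to conditioning tokens."""
    q = latent_tokens @ w_q                    # (n_latent, d_head)
    k = cond_tokens @ w_k                      # (n_cond, d_head)
    v = cond_tokens @ w_v                      # (n_cond, d_head)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_latent, n_cond)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n_latent, d_latent = 16 * 16, 32               # illustrative latent grid (assumption)
n_cond, d_cond, d_head = 77, 768, 16           # 77 x 768 text embedding, toy head size

latent_tokens = rng.standard_normal((n_latent, d_latent))
cond_tokens = rng.standard_normal((n_cond, d_cond))
w_q = rng.standard_normal((d_latent, d_head)) / np.sqrt(d_latent)
w_k = rng.standard_normal((d_cond, d_head)) / np.sqrt(d_cond)
w_v = rng.standard_normal((d_cond, d_head)) / np.sqrt(d_cond)

out, weights = cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v)
```

Each latent position receives a weighted mixture of the conditioning tokens, which is how the prompt can influence every spatial location of the latent.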

Variational Autoencoder

To compress the image data, a variational autoencoder (VAE) is first trained on a dataset of images. The encoder part of the VAE takes an image as input and outputs a lower-dimensional latent representation of the image. This latent representation is then used as input to the U-Net. Once the model is trained, the encoder is used to encode images into latent representations, and the decoder is used to decode latent representations back into images.

Let the encoder and the decoder of the VAE be $E$ and $D$.

To encode an RGB image, its three channels are divided by the maximum value, resulting in a tensor $x$ of shape $(3, 512, 512)$ with all entries within the range $[0, 1]$. The encoded vector is $0.18215 \times E(2x - 1)$, with shape $(4, 64, 64)$, where 0.18215 is a hyperparameter that the original authors picked to whiten the encoded vector to roughly unit variance. Conversely, given a latent tensor $y$, the decoded image is $(1 + D(y / 0.18215)) / 2$, then clipped to the range $[0, 1]$. [19] [20]
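The scaling convention can be illustrated with a short NumPy sketch. It assumes the convention used in common Stable Diffusion code: pixel values are rescaled to [-1, 1] before encoding, and the latent is multiplied by 0.18215. The functions `toy_encoder` and `toy_decoder` are hypothetical stand-ins (8× average pooling and 8× nearest-neighbour upsampling) used only so that the example runs; they are not a trained VAE.

```python
import numpy as np

SCALE = 0.18215  # latent scaling factor quoted in the text

def encode_image(image_uint8, encoder):
    """Map an RGB uint8 image of shape (3, H, W) to a scaled latent."""
    x = image_uint8.astype(np.float64) / 255.0   # entries in [0, 1]
    return SCALE * encoder(2.0 * x - 1.0)        # the encoder sees [-1, 1]

def decode_latent(latent, decoder):
    """Undo the scaling, decode, and map back to [0, 1]."""
    x = (1.0 + decoder(latent / SCALE)) / 2.0
    return np.clip(x, 0.0, 1.0)

def toy_encoder(x):
    # Hypothetical stand-in: 8x average pooling plus one extra channel,
    # giving 4 latent channels at 1/8 of the spatial resolution.
    c, h, w = x.shape
    pooled = x.reshape(c, h // 8, 8, w // 8, 8).mean(axis=(2, 4))
    extra = pooled.mean(axis=0, keepdims=True)
    return np.concatenate([pooled, extra], axis=0)

def toy_decoder(z):
    # Hypothetical stand-in: drop the extra channel, upsample 8x.
    return z[:3].repeat(8, axis=1).repeat(8, axis=2)

rng = np.random.default_rng(0)
blocks = rng.integers(0, 256, size=(3, 64, 64), dtype=np.uint8)
image = blocks.repeat(8, axis=1).repeat(8, axis=2)   # shape (3, 512, 512)

latent = encode_image(image, toy_encoder)            # shape (4, 64, 64)
restored = decode_latent(latent, toy_decoder)        # shape (3, 512, 512)
```

Because the toy image is constant within each 8 × 8 block, the round trip is lossless here; a real VAE is lossy.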

U-Net

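The U-Net backbone takes the following kinds of inputs: a noisy latent image array, the diffusion timestep, and the embedded conditioning data (for example, the encoded text prompt).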

Each run through the U-Net backbone produces a predicted noise vector. This noise vector is scaled down and subtracted from the latent image array, resulting in a slightly less noisy latent image. The denoising is repeated according to a denoising schedule ("noise schedule"), and the output of the last step is processed by the VAE decoder into a finished image.
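As one concrete instance of "scaled down and subtracted", the DDPM ancestral sampling update of [9] can be written as follows. The symbols are assumptions introduced for this illustration: $z_t$ is the noisy latent at step $t$, $c$ is the conditioning, $\epsilon_\theta$ is the U-Net's noise prediction, and $\beta_t$ is the noise schedule. Other samplers use different update rules.

```latex
\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

z_{t-1} = \frac{1}{\sqrt{\alpha_t}}
          \left( z_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\,
                 \epsilon_\theta(z_t, t, c) \right)
          + \sigma_t\, \xi,
\qquad \xi \sim \mathcal{N}(0, I)
```

Here $\sigma_t^2 = \beta_t$ is one common choice, and no noise is added at the final step.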

A single cross-attention mechanism as it appears in a standard Transformer language model.
Block diagram for the full Transformer architecture. The stack on the right is a standard pre-LN Transformer decoder, which is essentially the same as the SpatialTransformer.

Similar to the standard U-Net, the U-Net backbone used in SD 1.5 is essentially composed of down-scaling layers followed by up-scaling layers. However, the backbone has additional modules that allow it to handle the time and conditioning embeddings. As an illustration, a single down-scaling layer in the backbone consists of a ResBlock, which adds in the time embedding, followed by a SpatialTransformer, which attends to the conditioning via cross-attention.

In pseudocode,

    def ResBlock(x, time):
        # Residual block that injects the timestep embedding.
        x_in = x
        time_embedding = feedforward_network(time)
        x = conv_layer_1(activate(normalize_1(x))) + time_embedding
        x = conv_layer_2(dropout(activate(normalize_2(x))))
        return x_in + x

    def SpatialTransformer(x, cond):
        # Attention block that injects the conditioning via cross-attention.
        x_in = x
        x = normalize(x)
        x = proj_in(x)
        x = cross_attention(x, cond)
        x = proj_out(x)
        return x_in + x

    def unet(x, time, cond):
        residual_channels = []
        # Down-scaling path: store each ResBlock output for the skip connections.
        for resblock, spatialtransformer in downscaling_layers:
            x = resblock(x, time)
            residual_channels.append(x)
            x = spatialtransformer(x, cond)

        x = middle_layer.resblock_1(x, time)
        x = middle_layer.spatialtransformer(x, cond)
        x = middle_layer.resblock_2(x, time)

        # Up-scaling path: concatenate the stored outputs in reverse order.
        for resblock, spatialtransformer in upscaling_layers:
            residual = residual_channels.pop()
            x = resblock(concatenate(x, residual), time)
            x = spatialtransformer(x, cond)
        return x

The detailed architecture may be found in [22] [23].

Training and inference

The LDM is trained by using a Markov chain to gradually add Gaussian noise to the latent representations of the training images. The model is then trained to reverse this process, starting with a noisy latent and gradually removing the noise until it recovers the original.

The model is trained to minimize the difference between the predicted noise and the actual noise added at each step. This is typically done using a mean squared error (MSE) loss function.
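The noise-prediction objective can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the CompVis training code: the linear schedule with 1000 steps and endpoints 1e-4 and 0.02 follows the DDPM paper [9], the names `make_schedule`, `add_noise`, and `noise_prediction_loss` are hypothetical, and `predict_noise` stands in for the U-Net. The forward process is applied in closed form, so a noisy latent at any step can be drawn directly from the clean latent.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Linear noise schedule (DDPM defaults, an assumption here).
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def add_noise(z0, t, alpha_bars, rng):
    # Closed-form forward process: jump from the clean latent z0 to step t.
    noise = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return zt, noise

def noise_prediction_loss(predict_noise, z0, t, alpha_bars, rng):
    # Mean squared error between the predicted and the actual noise.
    zt, noise = add_noise(z0, t, alpha_bars, rng)
    return np.mean((predict_noise(zt, t) - noise) ** 2)

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
z0 = rng.standard_normal((4, 64, 64))   # stand-in for an encoded training image
t = 500

# A predictor that always outputs zero, and an oracle that knows z0.
zero_model = lambda zt, step: np.zeros_like(zt)
oracle = lambda zt, step: (zt - np.sqrt(alpha_bars[step]) * z0) / np.sqrt(1.0 - alpha_bars[step])

loss_zero = noise_prediction_loss(zero_model, z0, t, alpha_bars, rng)
loss_oracle = noise_prediction_loss(oracle, z0, t, alpha_bars, rng)
```

The zero predictor scores a loss close to 1, the variance of the noise, while the oracle scores essentially 0. Training moves a real network from the first regime toward the second.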

Once the model is trained, it can be used to generate new images by simply running the reverse diffusion process starting from a random noise sample. The model gradually removes the noise from the sample, guided by the learned noise distribution, until it generates a final image.
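A minimal sampling loop might look like the following NumPy sketch. It assumes the DDPM ancestral sampler of [9] with variance equal to the schedule value at each step; Stable Diffusion is commonly run with other samplers. The name `sample`, the short 50-step schedule, and the oracle noise predictor (which stands in for a trained U-Net and assumes a data distribution concentrated on a single latent `target`) are all hypothetical and serve only to make the example checkable.

```python
import numpy as np

def sample(predict_noise, shape, betas, rng):
    """Run the reverse process from pure Gaussian noise down to step 0."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    z = rng.standard_normal(shape)                     # start from random noise
    for t in reversed(range(len(betas))):
        eps = predict_noise(z, t)
        # Scale the predicted noise and subtract it from the current latent.
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # Re-inject a little noise at every step except the last one.
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        z = mean + np.sqrt(betas[t]) * noise
    return z

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)                    # short illustrative schedule
alpha_bars = np.cumprod(1.0 - betas)
target = rng.standard_normal((4, 8, 8))                # the only "training latent"

def oracle(z, t):
    # Exact noise prediction when all data mass sits on `target`.
    return (z - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

result = sample(oracle, target.shape, betas, rng)
```

With this oracle the loop recovers `target`. In a full pipeline the returned latent would then be passed to the VAE decoder.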

See the diffusion model page for details.

References

  1. Rombach, Robin; Blattmann, Andreas; Lorenz, Dominik; Esser, Patrick; Ommer, Björn (2022). High-Resolution Image Synthesis With Latent Diffusion Models. The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022. pp. 10684–10695.
  2. "Home". Computer Vision & Learning Group. Retrieved 2024-09-05.
  3. "Stable Diffusion Repository on GitHub". CompVis - Machine Vision and Learning Research Group, LMU Munich. 17 September 2022. Archived from the original on January 18, 2023. Retrieved 17 September 2022.
  4. Alammar, Jay. "The Illustrated Stable Diffusion". jalammar.github.io. Archived from the original on November 1, 2022. Retrieved 2022-10-31.
  5. Sohl-Dickstein, Jascha; Weiss, Eric; Maheswaranathan, Niru; Ganguli, Surya (2015-06-01). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (PDF). Proceedings of the 32nd International Conference on Machine Learning. 37. PMLR: 2256–2265.
  6. Sohl-Dickstein, Jascha (2024-09-01), Sohl-Dickstein/Diffusion-Probabilistic-Models, retrieved 2024-09-07
  7. ermongroup/ncsn, ermongroup, 2019, retrieved 2024-09-07
  8. Song, Yang; Ermon, Stefano (2019). "Generative Modeling by Estimating Gradients of the Data Distribution". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc.
  9. Ho, Jonathan; Jain, Ajay; Abbeel, Pieter (2020). "Denoising Diffusion Probabilistic Models". Advances in Neural Information Processing Systems. 33. Curran Associates, Inc.: 6840–6851.
  10. Ho, Jonathan (Jun 20, 2020), hojonathanho/diffusion, retrieved 2024-09-07
  11. Wang, Phil (2024-09-07), lucidrains/denoising-diffusion-pytorch, retrieved 2024-09-07
  12. "The Annotated Diffusion Model". huggingface.co. Retrieved 2024-09-07.
  13. Rombach, Robin; Blattmann, Andreas; Lorenz, Dominik; Esser, Patrick; Ommer, Björn (2021-12-20), High-Resolution Image Synthesis with Latent Diffusion Models, doi:10.48550/arXiv.2112.10752, retrieved 2024-09-16
  14. "Update README.md · CompVis/stable-diffusion@17e64e3". GitHub. Retrieved 2024-09-07.
  15. "Update README.md · CompVis/latent-diffusion@17e64e3". GitHub. Retrieved 2024-09-07.
  16. "stable diffusion · CompVis/stable-diffusion@2ff270f". GitHub. Retrieved 2024-09-07.
  17. "CompVis (CompVis)". huggingface.co. 2023-08-23. Retrieved 2024-03-06.
  18. "runwayml/stable-diffusion-v1-5 · Hugging Face". huggingface.co. Archived from the original on September 21, 2023. Retrieved 2023-08-17.
  19. "Explanation of the 0.18215 factor in textual_inversion? · Issue #437 · huggingface/diffusers". GitHub. Retrieved 2024-09-19.
  20. "diffusion-nbs/Stable Diffusion Deep Dive.ipynb at master · fastai/diffusion-nbs". GitHub. Retrieved 2024-09-19.
  21. "latent-diffusion/ldm/modules/attention.py at main · CompVis/latent-diffusion". GitHub. Retrieved 2024-09-09.
  22. "U-Net for Stable Diffusion". U-Net for Stable Diffusion. Retrieved 2024-08-31.
  23. "Transformer for Stable Diffusion U-Net". Transformer for Stable Diffusion U-Net. Retrieved 2024-09-07.