Text-to-image personalization

Text-to-image personalization is a task in deep learning for computer graphics that augments pre-trained text-to-image generative models. In this task, a generative model trained on large-scale data (usually a foundation model) is adapted so that it can generate images of novel, user-provided concepts. [1] [2] These concepts are typically unseen during training and may represent specific objects (such as the user's pet) or more abstract categories (such as a new artistic style [3] or object relations [4]).

Text-to-image personalization methods typically bind the novel (personal) concept to new words in the vocabulary of the model. These words can then be used in future prompts to invoke the concept for subject-driven generation, [5] inpainting, style transfer, [6] and even correcting biases in the model. To do so, methods either optimize word embeddings, fine-tune the generative model itself, or combine both approaches.

Technology

Text-to-image personalization was first proposed in August 2022 by two concurrent works, Textual Inversion [7] and DreamBooth. [8]

In both cases, a user provides a few images (typically 3–5) of a concept, such as their own dog, together with a coarse descriptor of the concept class (such as the word "dog"). The model then learns to represent the subject through a reconstruction-based objective, in which prompts referring to the subject are expected to reconstruct images from the training set.

In Textual Inversion, the personalized concepts are introduced into the text-to-image model by adding new words to the vocabulary of the model. Typical text-to-image models represent words (and sometimes parts-of-words) as tokens, or indices in a predefined dictionary. During generation, an input prompt is converted into such tokens, each of which is converted into a ‘word-embedding’: a continuous vector representation which is learned for each token as part of the model's training. Textual Inversion proposes to optimize a new word-embedding vector for representing the novel concept. This new embedding vector can then be assigned to a user-chosen string, and invoked whenever the user's prompt contains this string. [7]
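The optimization described above can be illustrated with a deliberately small sketch. It is not the published method: the text encoder is replaced by an average of token embeddings, the frozen generative model by a fixed per-dimension scaling, and the user's images by a single made-up target vector. All names (`add_placeholder`, `generate`, `invert`, `FROZEN_WEIGHTS`) and the placeholder string `<my-dog>` are hypothetical. What the sketch does keep is the central property: only the new embedding vector is updated, while the existing embeddings and the model stay fixed.

```python
import random

DIM = 4
rng = random.Random(0)

# Toy vocabulary: token string -> row index in the embedding table.
vocab = {"a": 0, "photo": 1, "of": 2, "dog": 3}
embeddings = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in vocab]

# Assumption: a fixed per-dimension scaling stands in for the frozen model.
FROZEN_WEIGHTS = [1.0, 0.5, 2.0, 1.5]

def add_placeholder(token, init_word):
    """Add a new token whose vector starts as a copy of a coarse class word."""
    vocab[token] = len(embeddings)
    embeddings.append(list(embeddings[vocab[init_word]]))
    return vocab[token]

def generate(prompt):
    """Tokenize on whitespace, average the embeddings, apply the frozen model."""
    ids = [vocab[word] for word in prompt.split()]
    mean = [sum(embeddings[i][d] for i in ids) / len(ids) for d in range(DIM)]
    return [w * m for w, m in zip(FROZEN_WEIGHTS, mean)], len(ids)

def invert(prompt, new_id, target, steps=500, lr=1.0):
    """Gradient descent on the new embedding only; everything else is frozen."""
    for _ in range(steps):
        output, n = generate(prompt)
        for d in range(DIM):
            # Gradient of (w * mean - target)^2 with respect to the new vector,
            # which contributes v / n to the mean.
            grad = 2.0 * (output[d] - target[d]) * FROZEN_WEIGHTS[d] / n
            embeddings[new_id][d] -= lr * grad
    output, _ = generate(prompt)
    return sum((o - t) ** 2 for o, t in zip(output, target))

frozen_before = [list(row) for row in embeddings]
new_id = add_placeholder("<my-dog>", init_word="dog")
target = [0.3, -0.2, 0.8, 0.1]  # made-up stand-in for the user's image features
final_loss = invert("a photo of <my-dog>", new_id, target)
```

After the loop, the learned row can be reused in any prompt that contains the placeholder string, while the rows for "a", "photo", "of" and "dog" are unchanged.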

In DreamBooth, rather than optimizing a new word vector, the full generative model itself is fine-tuned. The user first selects an existing token, typically one which rarely appears in prompts. The subject itself is then represented by a string containing this token, followed by a coarse descriptor of the subject's class. A prompt describing the subject will then take the form: "A photo of <token> <class>" (e.g. "a photo of sks cat" when learning to represent a specific cat). The text-to-image model is then tuned so that prompts of this form will generate images of the subject. [8]
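The prompt format above is simple enough to sketch directly. The helper names (`subject_prompt`, `build_training_pairs`) and the image file names are hypothetical, and the snippet covers only how the training prompts are paired with the user's images, not the fine-tuning itself.

```python
def subject_prompt(token, class_word, template="a photo of {}"):
    """Build a prompt of the form 'a photo of <token> <class>'."""
    return template.format(f"{token} {class_word}")

def build_training_pairs(image_paths, token, class_word):
    """Pair every user-provided image with the same subject prompt."""
    prompt = subject_prompt(token, class_word)
    return [(prompt, path) for path in image_paths]

# Hypothetical file names for the few images the user provides.
pairs = build_training_pairs(
    ["cat_01.jpg", "cat_02.jpg", "cat_03.jpg"], token="sks", class_word="cat"
)
```

Each pair would then serve as one (prompt, image) example when the full model is tuned.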

Textual Inversion

The key idea in Textual Inversion is to add a new term to the vocabulary of the diffusion model that corresponds to the new (personalized) concept. Textual Inversion optimizes the vector embedding of that new term so that using it in an input text prompt generates images similar to the given example images of the concept. The resulting representation is extremely lightweight, on the order of kilobytes per concept, yet it succeeds in encoding detailed visual properties of the concept.

Extensions

Several approaches have been proposed to refine and improve on the original methods. These include the following.

  1. Low-Rank Adaptation (LoRA): an adapter-based technique for efficient fine-tuning of models. [9] In the case of text-to-image models, LoRA is typically used to modify the cross-attention layers of a diffusion model. [10]
  2. Perfusion: a low-rank update method that also locks the activations of the key matrix in the diffusion model's cross-attention layers to the concept's coarse class. [11]
  3. Extended Textual Inversion: a technique that learns an individual word embedding for each layer in the diffusion model's denoising network. [12]
  4. Encoder-based methods, which use another neural network to quickly personalize a model. [13] [14]
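As an illustration of the first item, the following is a minimal sketch of a low-rank update in the style of LoRA, written without any deep learning library. The frozen weight `W` is left untouched and only the two small factors `A` and `B` would be trained; the merged weight is `W + (alpha / rank) * B @ A`. The matrix sizes, `ALPHA`, and the helper names are assumptions chosen for the example, not values taken from any particular text-to-image model.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha):
    """Return W + (alpha / rank) * B @ A without modifying W."""
    rank = len(A)
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

rng = random.Random(0)
D_OUT, D_IN, RANK, ALPHA = 16, 16, 2, 4.0  # illustrative sizes only

W = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(D_OUT)]   # frozen
A = [[rng.gauss(0, 0.01) for _ in range(D_IN)] for _ in range(RANK)]  # trainable
B = [[0.0] * RANK for _ in range(D_OUT)]  # trainable, zero at initialization

merged_at_init = merge_lora(W, A, B, ALPHA)
full_params = D_OUT * D_IN            # parameters touched by full fine-tuning
lora_params = RANK * (D_IN + D_OUT)   # parameters touched by the low-rank update
```

Because `B` starts at zero, the merged weight initially equals `W`, so tuning starts from the behavior of the pre-trained model. The saving grows with layer size, since the number of trainable values scales with `rank * (d_in + d_out)` rather than `d_in * d_out`.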

Challenges and limitations

Text-to-image personalization methods must contend with several challenges. At their core is the goal of achieving high fidelity to the personal concept while maintaining high alignment between novel prompts containing the subject and the generated images (typically referred to as ‘editability’).

Another challenge is memory requirements. Initial implementations of personalization methods required more than 20 gigabytes of GPU memory, and more recent approaches have reported requirements of more than 40 gigabytes. [13] However, optimizations such as Flash Attention [15] have since reduced this requirement considerably.
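One way to make this trade-off concrete is to score both properties as cosine similarities between feature vectors. The sketch below assumes that images and prompts have already been mapped into a shared feature space by some joint image-text encoder; the encoder itself is not shown, and the function names `fidelity` and `editability` are hypothetical labels for the two averages rather than a standard API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def fidelity(generated, training):
    """Average similarity between generated images and the user's images."""
    return mean(cosine_similarity(g, t) for g in generated for t in training)

def editability(generated, prompt_features):
    """Average similarity between generated images and the novel prompt."""
    return mean(cosine_similarity(g, prompt_features) for g in generated)
```

A method that simply reproduces the training images would score high on the first measure and low on the second, which is the tension described above.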

Approaches that tune the entire generative model may also create checkpoints that are several gigabytes in size, making it difficult to share or store many models. Embedding-based approaches require only a few kilobytes, but typically struggle to preserve identity while maintaining editability. More recent approaches have proposed hybrid tuning goals that optimize both an embedding and a subset of network weights. These can reduce storage requirements to as little as 100 kilobytes while achieving quality comparable to full tuning methods. [11]
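The gap between these storage figures follows from simple arithmetic: stored size is roughly the number of values times the bytes per value. The numbers below are assumptions picked only to show the orders of magnitude (a 768-dimensional embedding in 32-bit floats and a one-billion-parameter model in 16-bit floats); they are not measurements of any specific model.

```python
def size_in_bytes(num_values, bytes_per_value):
    """Raw storage for a block of numeric values, ignoring file overhead."""
    return num_values * bytes_per_value

# Assumed sizes, for illustration only.
embedding_bytes = size_in_bytes(768, 4)               # one learned word embedding
checkpoint_bytes = size_in_bytes(1_000_000_000, 2)    # a fully tuned model

print(f"embedding:  {embedding_bytes / 1e3:.1f} kB")
print(f"checkpoint: {checkpoint_bytes / 1e9:.1f} GB")
```

Under these assumptions a single embedding occupies a few kilobytes while a fully tuned checkpoint occupies gigabytes, which is why methods that store only an embedding or a small weight update are much easier to share.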

Finally, optimization processes can be lengthy, requiring several minutes of tuning for each novel concept. Encoder and quick-tuning methods aim to reduce this to seconds or less. [16]

References

  1. Murphy, Brendan Paul (2022-10-12). "AI image generation is advancing at astronomical speeds. Can we still tell if a picture is fake?". The Conversation. Retrieved 2023-09-14.
  2. "「好きなキャラに近い絵をAIが量産」――ある概念を"単語"に圧縮し入力テキストに使える技術". ITmedia NEWS (in Japanese). Retrieved 2023-09-14.
  3. Baio, Andy (2022-11-01). "Invasive Diffusion: How one unwilling illustrator found herself turned into an AI model". Waxy.org. Retrieved 2023-09-14.
  4. Huang, Ziqi; Wu, Tianxing; Jiang, Yuming; Chan, Kelvin C. K.; Liu, Ziwei (2023). "ReVersion: Diffusion-Based Relation Inversion from Images". arXiv:2303.13495 [cs.CV].
  5. Ongweso Jr., Edward (2022-10-14). "People Are Now Making Fake Selfies With AI". Vice. Retrieved 2023-09-20.
  6. James, Dave (2022-12-27). "I thrashed the RTX 4090 for 8 hours straight training Stable Diffusion to paint like my uncle Hermann". PC Gamer. Retrieved 2023-09-20.
  7. Gal, Rinon; Alaluf, Yuval; Atzmon, Yuval; Patashnik, Or; Bermano, Amit Haim; Chechik, Gal; Cohen-Or, Daniel (2022-09-29). "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion". arXiv:2208.01618.
  8. Ruiz, Nataniel; Li, Yuanzhen; Jampani, Varun; Pritch, Yael; Rubinstein, Michael; Aberman, Kfir (2023). "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation". pp. 22500–22510. arXiv:2208.12242.
  9. Singh, Niharika (2023-02-18). "HuggingFace Publishes LoRA Scripts For Efficient Stable Diffusion Fine-Tuning". MarkTechPost. Retrieved 2023-09-14.
  10. Hu, Edward J.; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu (2021-10-06). "LoRA: Low-Rank Adaptation of Large Language Models". arXiv:2106.09685.
  11. Tewel, Yoad; Gal, Rinon; Chechik, Gal; Atzmon, Yuval (2023-07-23). "Key-Locked Rank One Editing for Text-to-Image Personalization". Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Proceedings. SIGGRAPH '23. New York, NY, USA: Association for Computing Machinery. pp. 1–11. doi:10.1145/3588432.3591506. ISBN 979-8-4007-0159-7. S2CID 258436985.
  12. Lorenzi, Daniele (2023-07-22). "Meet P+: A Rich Embeddings Space for Extended Textual Inversion in Text-to-Image Generation". MarkTechPost. Retrieved 2023-08-29.
  13. Gal, Rinon; Arar, Moab; Atzmon, Yuval; Bermano, Amit H.; Chechik, Gal; Cohen-Or, Daniel (2023-07-26). "Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models". ACM Transactions on Graphics. 42 (4): 150:1–150:13. arXiv:2302.12228. doi:10.1145/3592133. ISSN 0730-0301. S2CID 257364757.
  14. Wei, Yuxiang; Zhang, Yabo; Ji, Zhilong; Bai, Jinfeng; Zhang, Lei; Zuo, Wangmeng (2023). "ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation". arXiv:2302.13848 [cs.CV].
  15. Dao, Tri; Fu, Daniel Y.; Ermon, Stefano; Rudra, Atri; Ré, Christopher (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". arXiv:2205.14135 [cs.LG].
  16. Shi, Jing; Xiong, Wei; Lin, Zhe; Jung, Hyun Joon (2023). "InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning". arXiv:2304.03411 [cs.CV].