Text-to-image model

An image conditioned on the prompt "an astronaut riding a horse, by Hiroshige", generated by Stable Diffusion, a large-scale text-to-image model released in 2022

A text-to-image model is a machine learning model which takes an input natural language description and produces an image matching that description.

Text-to-image models began to be developed in the mid-2010s during the beginnings of the AI boom, as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models—such as OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, and Midjourney—began to be considered to approach the quality of real photographs and human-drawn art.

Text-to-image models generally combine a language model, which transforms the input text into a latent representation, and a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web. [1]
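
The following toy sketch illustrates this two-component pipeline in PyTorch. It is a minimal sketch only: the module names, sizes, and architectures are illustrative stand-ins rather than any particular published model. A text encoder maps a tokenized prompt to a latent vector, and an image generator maps that vector to image pixels.

    import torch
    import torch.nn as nn

    class ToyTextEncoder(nn.Module):
        """Maps a sequence of token ids to a single latent vector."""
        def __init__(self, vocab_size=1000, embed_dim=128, latent_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.LSTM(embed_dim, latent_dim, batch_first=True)

        def forward(self, token_ids):              # (batch, seq_len)
            _, (h, _) = self.rnn(self.embed(token_ids))
            return h[-1]                           # (batch, latent_dim)

    class ToyImageGenerator(nn.Module):
        """Maps a text latent to a small RGB image."""
        def __init__(self, latent_dim=256, image_size=32):
            super().__init__()
            self.image_size = image_size
            self.net = nn.Sequential(
                nn.Linear(latent_dim, 512), nn.ReLU(),
                nn.Linear(512, 3 * image_size * image_size), nn.Tanh(),
            )

        def forward(self, text_latent):            # (batch, latent_dim)
            x = self.net(text_latent)
            return x.view(-1, 3, self.image_size, self.image_size)

    prompt_tokens = torch.randint(0, 1000, (1, 8))   # stand-in for a tokenized prompt
    image = ToyImageGenerator()(ToyTextEncoder()(prompt_tokens))
    print(image.shape)                               # torch.Size([1, 3, 32, 32])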

History

Before the rise of deep learning, attempts to build text-to-image models were limited to collages assembled by arranging existing component images, such as those from a database of clip art. [2] [3]

The inverse task, image captioning, was more tractable, and a number of deep learning image captioning models preceded the first text-to-image models. [4]

The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences. [4] Images generated by alignDRAW were of low resolution (32×32 pixels, attained by resizing) and were considered to be 'low in diversity'. The model was able to generalize to objects not represented in the training data (such as a red school bus) and appropriately handled novel prompts such as "a stop sign is flying in blue skies", showing that it was not merely "memorizing" data from the training set. [4] [5]

Eight images generated from the text prompt "A stop sign is flying in blue skies." by AlignDRAW (2015). Enlarged to show detail. [6]

In 2016, Reed, Akata, Yan et al. became the first to use generative adversarial networks for the text-to-image task. [5] [7] With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like "an all black bird with a distinct thick, rounded bill". A model trained on the more diverse COCO (Common Objects in Context) dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details. [5] Later systems include VQGAN-CLIP, [8] XMC-GAN, and GauGAN2. [9]

DALL·E 2's (top, April 2022) and DALL·E 3's (bottom, September 2023) generated images for the prompt "A stop sign is flying in blue skies"

One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a transformer system announced in January 2021. [10] A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022, [11] followed by Stable Diffusion, which was publicly released in August 2022. [12] Also in August 2022, text-to-image personalization was introduced, allowing the model to be taught a new concept using a small set of images of an object that was not included in the training set of the text-to-image foundation model. This is achieved by textual inversion, namely, finding a new text term that corresponds to these images.
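
A heavily simplified sketch of the textual-inversion idea follows, assuming a toy stand-in generator rather than a real diffusion model: the pretrained weights stay frozen, and only a single new token embedding is optimized so that the model reproduces a handful of example images of the new concept.

    import torch
    import torch.nn as nn

    latent_dim = 256
    # Stand-in for a pretrained text-conditioned generator (frozen below).
    generator = nn.Sequential(nn.Linear(latent_dim, 3 * 32 * 32), nn.Tanh())
    for p in generator.parameters():
        p.requires_grad = False                 # foundation model is kept frozen

    # The only trainable weights: one new token embedding (a "pseudo-word").
    new_token_embedding = nn.Parameter(torch.randn(latent_dim) * 0.01)
    optimizer = torch.optim.Adam([new_token_embedding], lr=1e-2)

    # Stand-in for a small set (here 4) of example images of the new concept.
    example_images = torch.rand(4, 3 * 32 * 32) * 2 - 1

    for step in range(200):
        optimizer.zero_grad()
        generated = generator(new_token_embedding.expand(4, -1))
        loss = nn.functional.mse_loss(generated, example_images)
        loss.backward()
        optimizer.step()

    # After optimization, new_token_embedding acts as a text term that can be
    # inserted into prompts to refer to the learned concept.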

Following text-to-image models, language model-powered text-to-video platforms such as Runway, Make-A-Video, [13] Imagen Video, [14] Midjourney, [15] and Phenaki [16] can generate video from text and/or image prompts. [17]

Architecture and training

High-level architecture showing the state of AI art machine learning models, and notable models and applications as a clickable SVG image map

Text-to-image models have been built using a variety of architectures. The text encoding step may be performed with a recurrent neural network such as a long short-term memory (LSTM) network, though transformer models have since become a more popular option. For the image generation step, conditional generative adversarial networks (GANs) have been commonly used, with diffusion models also becoming a popular option in recent years. Rather than directly training a model to output a high-resolution image conditioned on a text embedding, a popular technique is to train a model to generate low-resolution images, and use one or more auxiliary deep learning models to upscale it, filling in finer details.
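
A minimal sketch of the cascaded approach mentioned above, using toy stand-in modules: a base generator produces a low-resolution image from a text embedding, and an auxiliary super-resolution network upscales it to fill in finer details.

    import torch
    import torch.nn as nn

    class ToyBaseGenerator(nn.Module):
        """Text embedding -> 3x64x64 low-resolution image."""
        def __init__(self, latent_dim=256):
            super().__init__()
            self.fc = nn.Linear(latent_dim, 3 * 64 * 64)

        def forward(self, text_emb):
            return torch.tanh(self.fc(text_emb)).view(-1, 3, 64, 64)

    class ToyUpscaler(nn.Module):
        """3x64x64 image -> 3x256x256 image (4x super-resolution)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
                nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Tanh(),
            )

        def forward(self, low_res):
            return self.net(low_res)

    text_emb = torch.randn(1, 256)             # stand-in for an encoded prompt
    low_res = ToyBaseGenerator()(text_emb)     # (1, 3, 64, 64)
    high_res = ToyUpscaler()(low_res)          # (1, 3, 256, 256)
    print(low_res.shape, high_res.shape)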

Text-to-image models are trained on large datasets of (text, image) pairs, often scraped from the web. With their 2022 Imagen model, Google Brain reported positive results from using a large language model trained separately on a text-only corpus (with its weights subsequently frozen), a departure from the theretofore standard approach. [18]
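
The frozen-text-encoder setup reported for Imagen can be sketched as follows; both modules here are illustrative stand-ins, and the key point is simply that the language model's weights are excluded from optimization.

    import torch
    import torch.nn as nn

    # Stand-ins for a pretrained language model and a generative image model.
    text_encoder = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
    image_model = nn.Linear(256, 3 * 64 * 64)

    # Freeze the text encoder: its weights are never passed to the optimizer,
    # so only the image model is trained on the (text, image) pairs.
    for p in text_encoder.parameters():
        p.requires_grad = False

    optimizer = torch.optim.Adam(image_model.parameters(), lr=1e-4)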

Datasets

Examples of images and captions from three public datasets which are commonly used to train text-to-image models

Training a text-to-image model requires a dataset of images paired with text captions. One dataset commonly used for this purpose is the COCO dataset. Released by Microsoft in 2014, COCO consists of around 123,000 images depicting a diversity of objects, with five captions per image generated by human annotators. Oxford 102 Flowers and CUB-200 Birds are smaller datasets of around 10,000 images each, restricted to flowers and birds, respectively. It is considered less difficult to train a high-quality text-to-image model with these datasets because of their narrow range of subject matter. [7]
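
A minimal sketch of a loader for (image, caption) training pairs in the style of the COCO captions annotations (a JSON file listing images and per-image captions); the file paths in the usage comment are hypothetical examples, and preprocessing such as resizing and tokenization is omitted.

    import json
    from pathlib import Path

    from PIL import Image
    from torch.utils.data import Dataset

    class CaptionedImageDataset(Dataset):
        def __init__(self, image_dir, annotation_file):
            self.image_dir = Path(image_dir)
            data = json.loads(Path(annotation_file).read_text())
            file_names = {img["id"]: img["file_name"] for img in data["images"]}
            # One training example per (image, caption) pair; COCO provides
            # roughly five captions per image.
            self.pairs = [(file_names[a["image_id"]], a["caption"])
                          for a in data["annotations"]]

        def __len__(self):
            return len(self.pairs)

        def __getitem__(self, idx):
            file_name, caption = self.pairs[idx]
            image = Image.open(self.image_dir / file_name).convert("RGB")
            return image, caption

    # Hypothetical example paths:
    # dataset = CaptionedImageDataset("train2014/", "annotations/captions_train2014.json")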

Quality evaluation

Evaluating and comparing the quality of text-to-image models is a problem involving assessing multiple desirable properties. A desideratum specific to text-to-image models is that generated images semantically align with the text captions used to generate them. A number of schemes have been devised for assessing these qualities, some automated and others based on human judgement. [7]

A common algorithmic metric for assessing image quality and diversity is the Inception Score (IS), which is based on the distribution of labels predicted by a pretrained Inceptionv3 image classification model when applied to a sample of images generated by the text-to-image model. The score is increased when the image classification model predicts a single label with high probability, a scheme intended to favour "distinct" generated images. Another popular metric is the related Fréchet inception distance, which compares the distribution of generated images and real training images according to features extracted by one of the final layers of a pretrained image classification model. [7]
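
Both metrics can be written compactly in code. The sketch below assumes that probs holds class probabilities predicted by an Inception v3 classifier for generated images, and that feats_real and feats_gen hold features extracted from a late layer of the same network for real and generated images; the random inputs are stand-ins so the example runs on its own.

    import numpy as np
    from scipy.linalg import sqrtm

    def inception_score(probs, eps=1e-12):
        """IS = exp( mean over images of KL( p(y|x) || p(y) ) )."""
        marginal = probs.mean(axis=0, keepdims=True)
        kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
        return float(np.exp(kl.mean()))

    def frechet_inception_distance(feats_real, feats_gen):
        """FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))."""
        mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_g = np.cov(feats_gen, rowvar=False)
        covmean = sqrtm(cov_r @ cov_g)
        if np.iscomplexobj(covmean):
            covmean = covmean.real          # drop tiny imaginary parts from sqrtm
        return float(((mu_r - mu_g) ** 2).sum()
                     + np.trace(cov_r + cov_g - 2 * covmean))

    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(1000), size=500)   # fake p(y|x) for 500 generated images
    feats_real = rng.normal(size=(500, 64))
    feats_gen = rng.normal(size=(500, 64))
    print(inception_score(probs), frechet_inception_distance(feats_real, feats_gen))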

Impact and applications

AI has the potential to bring about a societal transformation, which may include enabling the expansion of noncommercial niche genres (such as cyberpunk derivatives like solarpunk) by amateurs, novel entertainment, fast prototyping, [19] increased art-making accessibility, [19] and greater artistic output per unit of effort, expense, or time, [19] e.g., via generating drafts, refining drafts, and generating image components (inpainting). Generated images are sometimes used as sketches, [20] low-cost experiments, [21] inspiration, or illustrations of proof-of-concept-stage ideas. Additional functionality or improvements may also relate to post-generation manual editing (i.e., polishing), such as subsequent tweaking with an image editor. [21]

List of notable text-to-image models

Name               Release date         Developer          License
DALL-E             January 2021         OpenAI             Proprietary
DALL-E 2           April 2022           OpenAI             Proprietary
DALL-E 3           September 2023       OpenAI             Proprietary
Imagen                                  Google
Imagen 2           December 2023 [22]   Google
Parti              Unreleased           Google
Firefly            June 2023            Adobe Inc.
Midjourney         July 2022            Midjourney, Inc.
Stable Diffusion   August 2022          Stability AI       Creative ML OpenRAIL-M

Related Research Articles

WikiArt: User-generated website displaying artworks

WikiArt is a visual art wiki, active since 2010.

Feature learning: Set of learning techniques in machine learning

In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

Google Brain was a deep learning artificial intelligence research team under the umbrella of Google AI, a research division at Google dedicated to artificial intelligence. Formed in 2011, it combined open-ended machine learning research with information systems and large-scale computing resources. It created tools such as TensorFlow, which allow neural networks to be used by the public, and multiple internal AI research projects, and aimed to create research opportunities in machine learning and natural language processing. It was merged into former Google sister company DeepMind to form Google DeepMind in April 2023.

Multimodal learning, in the context of machine learning, is a type of deep learning using a combination of various modalities of data, such as text, audio, or images, in order to create a more robust model of the real-world phenomena in question. In contrast, singular modal learning would analyze text or imaging data independently. Multimodal machine learning combines these fundamentally different statistical analyses using specialized modeling strategies and algorithms, resulting in a model that comes closer to representing the real world.

Generative adversarial network: Deep learning method

A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative AI. The concept was initially developed by Ian Goodfellow and his colleagues in June 2014. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss.

Artificial intelligence art: Machine application of knowledge of human aesthetic expressions

Artificial intelligence art is any visual artwork created through the use of artificial intelligence (AI) programs such as text-to-image models. AI art began to gain popularity in the mid- to late-20th century through the boom of artificial intelligence.

Synthetic media is a catch-all term for the artificial production, manipulation, and modification of data and media by automated means, especially through the use of artificial intelligence algorithms, such as for the purpose of misleading people or changing an original meaning. Synthetic media refers to any form of media, including but not limited to images, videos, audio recordings, and text, that are generated or manipulated using artificial intelligence (AI) techniques. This technology enables the creation of highly realistic content that may be indistinguishable from authentic media produced by humans. Synthetic media as a field has grown rapidly since the creation of generative adversarial networks, primarily through the rise of deepfakes as well as music synthesis, text generation, human image synthesis, speech synthesis, and more. Though experts use the term "synthetic media", individual methods such as deepfakes and text synthesis are sometimes not referred to as such by the media but instead by their respective terminology. Significant attention arose towards the field of synthetic media starting in 2017, when Motherboard reported on the emergence of AI-altered pornographic videos that insert the faces of famous actresses. Potential hazards of synthetic media include the spread of misinformation, further loss of trust in institutions such as media and government, the mass automation of creative and journalistic jobs, and a retreat into AI-generated fantasy worlds. Synthetic media is an applied form of artificial imagination.

An energy-based model (EBM) (also called Canonical Ensemble Learning (CEL) or Learning via Canonical Ensemble (LCE)) is an application of the canonical ensemble formulation of statistical physics to learning from data. The approach appears prominently in generative models (GMs).

15.ai: Real-time text-to-speech tool using artificial intelligence

15.ai is a non-commercial freeware artificial intelligence web application that generates natural emotive high-fidelity text-to-speech voices from an assortment of fictional characters from a variety of media sources. Developed by a pseudonymous MIT researcher under the name 15, the project uses a combination of audio synthesis algorithms, speech synthesis deep neural networks, and sentiment analysis models to generate and serve emotive character voices faster than real-time, particularly those with a very small amount of trainable data.

The Fréchet inception distance (FID) is a metric used to assess the quality of images created by a generative model, such as a generative adversarial network (GAN). Unlike the earlier inception score (IS), which evaluates only the distribution of generated images, the FID compares the distribution of generated images with the distribution of a set of real images. The FID metric does not completely replace the IS metric: models that achieve the best (lowest) FID score tend to have greater sample variety, while models achieving the best (highest) IS score tend to have better quality within individual images.

An audio deepfake is a product of artificial intelligence used to create convincing speech sentences that sound like specific people saying things they did not say. This technology was initially developed for various applications to improve human life. For example, it can be used to produce audiobooks, and also to help people who have lost their voices to get them back. Commercially, it has opened the door to several opportunities. This technology can also create more personalized digital assistants and natural-sounding text-to-speech as well as speech translation services.

DALL-E: Image-generating deep-learning model

DALL·E, DALL·E 2, and DALL·E 3 are text-to-image models developed by OpenAI using deep learning methodologies to generate digital images from natural language descriptions, called "prompts."

A vision transformer (ViT) is a transformer designed for computer vision. A ViT breaks down an input image into a series of patches, serialises each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

Prompt engineering is the process of structuring an instruction that can be interpreted and understood by a generative AI model. A prompt is natural language text describing the task that an AI should perform.

Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or spectrum (vocoder). Deep neural networks (DNN) are trained using a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

Stable Diffusion: Image-generating machine learning model

Stable Diffusion is a deep learning, text-to-image model released in 2022 based on diffusion techniques. It is considered to be a part of the ongoing AI boom.

LAION: Non-profit German artificial intelligence organization

LAION is a German non-profit which makes open-sourced artificial intelligence models and datasets. It is best known for releasing a number of large datasets of images and captions scraped from the web which have been used to train a number of high-profile text-to-image models, including Stable Diffusion and Imagen.

A text-to-video model is a machine learning model which takes a natural language description as input and produces one or more videos matching that description.

Generative artificial intelligence: AI system capable of generating content in response to prompts

Generative artificial intelligence is artificial intelligence capable of generating text, images, videos, or other data using generative models, often in response to prompts. Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.

AI boom: Rapid progress in artificial intelligence

The AI boom, or AI spring, is an ongoing period of rapid progress in the field of artificial intelligence (AI). Prominent examples include protein folding prediction led by Google DeepMind and generative AI led by OpenAI.

References

  1. Vincent, James (May 24, 2022). "All these images were generated by Google's latest text-to-image AI". The Verge. Vox Media. Retrieved May 28, 2022.
  2. Agnese, Jorge; Herrera, Jonathan; Tao, Haicheng; Zhu, Xingquan (October 2019), A Survey and Taxonomy of Adversarial Neural Networks for Text-to-Image Synthesis, arXiv: 1910.09399
  3. Zhu, Xiaojin; Goldberg, Andrew B.; Eldawy, Mohamed; Dyer, Charles R.; Strock, Bradley (2007). "A text-to-picture synthesis system for augmenting communication" (PDF). AAAI. 7: 1590–1595.
  4. Mansimov, Elman; Parisotto, Emilio; Lei Ba, Jimmy; Salakhutdinov, Ruslan (November 2015). "Generating Images from Captions with Attention". ICLR. arXiv: 1511.02793.
  5. Reed, Scott; Akata, Zeynep; Logeswaran, Lajanugen; Schiele, Bernt; Lee, Honglak (June 2016). "Generative Adversarial Text to Image Synthesis" (PDF). International Conference on Machine Learning.
  6. Mansimov, Elman; Parisotto, Emilio; Ba, Jimmy Lei; Salakhutdinov, Ruslan (February 29, 2016). "Generating Images from Captions with Attention". International Conference on Learning Representations. arXiv: 1511.02793 .
  7. Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas (December 2021). "Adversarial text-to-image synthesis: A review". Neural Networks. 144: 187–209. arXiv: 2101.09983. doi: 10.1016/j.neunet.2021.07.019. PMID 34500257. S2CID 231698782.
  8. Rodriguez, Jesus. "🌅 Edge#229: VQGAN + CLIP". thesequence.substack.com. Retrieved 2022-10-10.
  9. Rodriguez, Jesus. "🎆🌆 Edge#231: Text-to-Image Synthesis with GANs". thesequence.substack.com. Retrieved 2022-10-10.
  10. Coldewey, Devin (5 January 2021). "OpenAI's DALL-E creates plausible images of literally anything you ask it to". TechCrunch.
  11. Coldewey, Devin (6 April 2022). "OpenAI's new DALL-E model draws anything — but bigger, better and faster than before". TechCrunch.
  12. "Stable Diffusion Public Release". Stability.Ai. Retrieved 2022-10-27.
  13. Kumar, Ashish (2022-10-03). "Meta AI Introduces 'Make-A-Video': An Artificial Intelligence System That Generates Videos From Text". MarkTechPost. Retrieved 2022-10-03.
  14. Edwards, Benj (2022-10-05). "Google's newest AI generator creates HD video from text prompts". Ars Technica. Retrieved 2022-10-25.
  15. Rodriguez, Jesus. "🎨 Edge#237: What is Midjourney?". thesequence.substack.com. Retrieved 2022-10-26.
  16. "Phenaki". phenaki.video. Retrieved 2022-10-03.
  17. Edwards, Benj (9 September 2022). "Runway teases AI-powered text-to-video editing using written prompts". Ars Technica. Retrieved 12 September 2022.
  18. Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Kamyar Seyed Ghasemipour, Seyed; Karagol Ayan, Burcu; Sara Mahdavi, S.; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; J Fleet, David; Norouzi, Mohammad (23 May 2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv: 2205.11487 [cs.CV].
  19. Elgan, Mike (1 November 2022). "How 'synthetic media' will transform business forever". Computerworld. Retrieved 9 November 2022.
  20. Roose, Kevin (21 October 2022). "A.I.-Generated Art Is Already Transforming Creative Work". The New York Times. Retrieved 16 November 2022.
  21. Leswing, Kif. "Why Silicon Valley is so excited about awkward drawings done by artificial intelligence". CNBC. Retrieved 16 November 2022.
  22. "Imagen 2 on Vertex AI is now generally available". Google Cloud Blog. Retrieved 2024-01-02.