Text-to-image model


An image conditioned on the prompt "an astronaut riding a horse, by Hiroshige", generated by Stable Diffusion 3.5, a version of the large-scale text-to-image model Stable Diffusion, which was first released in 2022

A text-to-image model is a machine learning model that takes a natural language description as input and produces an image matching that description.


Text-to-image models began to be developed in the mid-2010s during the beginnings of the AI boom, as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models—such as OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, and Midjourney—began to be considered to approach the quality of real photographs and human-drawn art.

Text-to-image models are generally latent diffusion models, which combine a language model that transforms the input text into a latent representation with a generative image model that produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web. [1]

History

Before the rise of deep learning, attempts to build text-to-image models were limited to collages assembled from existing component images, such as those in a database of clip art. [2] [3]

The inverse task, image captioning, was more tractable, and a number of image captioning deep learning models came prior to the first text-to-image models. [4]

The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences. [4] Images generated by alignDRAW were low in resolution (32×32 pixels, attained from resizing) and were considered to be "low in diversity". The model was able to generalize to objects not represented in the training data (such as a red school bus) and appropriately handled novel prompts such as "a stop sign is flying in blue skies", showing that it was not merely "memorizing" data from the training set. [4] [5]

Eight images generated from the text prompt "A stop sign is flying in blue skies." by AlignDRAW (2015). Enlarged to show detail. [6]

In 2016, Reed, Akata, Yan et al. became the first to use generative adversarial networks for the text-to-image task. [5] [7] With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like "an all black bird with a distinct thick, rounded bill". A model trained on the more diverse COCO (Common Objects in Context) dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details. [5] Later systems include VQGAN-CLIP, [8] XMC-GAN, and GauGAN2. [9]

DALL·E 2's (top, April 2022) and DALL·E 3's (bottom, September 2023) generated images for the prompt "A stop sign is flying in blue skies"

One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a transformer system announced in January 2021. [10] A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022, [11] followed by Stable Diffusion, which was publicly released in August 2022. [12] Also in August 2022, text-to-image personalization was introduced. It teaches the model a new concept from a small set of images of an object that was not included in the training set of the text-to-image foundation model. This is achieved by textual inversion, namely, finding a new text term that corresponds to these images.
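The idea behind textual inversion can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the frozen model is reduced to a fixed linear map `frozen_map`, the example images are reduced to small feature vectors, and `learn_concept_embedding` is a hypothetical helper name. The only point it shows is that the model stays fixed while a single new embedding vector is optimized to match the example images.

```python
def apply_map(frozen_map, vector):
    """Multiply the frozen matrix by a vector (the stand-in 'model')."""
    return [sum(w * x for w, x in zip(row, vector)) for row in frozen_map]


def learn_concept_embedding(frozen_map, image_features, steps=200, lr=0.1):
    """Fit one new embedding by gradient descent; frozen_map is never updated."""
    dim = len(frozen_map[0])
    embedding = [0.0] * dim
    n = len(image_features)
    for _ in range(steps):
        predicted = apply_map(frozen_map, embedding)
        grad = [0.0] * dim
        for features in image_features:
            error = [p - f for p, f in zip(predicted, features)]
            # Gradient of the mean squared error with respect to the embedding.
            for j in range(dim):
                grad[j] += 2.0 * sum(
                    frozen_map[i][j] * error[i] for i in range(len(frozen_map))
                ) / n
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding


# Hypothetical frozen model and features of two example images of a new object.
frozen_map = [[1.0, 0.0], [0.0, 2.0]]
example_features = [[1.0, 2.0], [1.2, 1.8]]
new_term_embedding = learn_concept_embedding(frozen_map, example_features)
```

In an actual system the frozen component would be the full text-to-image model and the loss would be its usual training objective; the sketch keeps only the "optimize the new term, freeze everything else" structure.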

Following other text-to-image models, language model-powered text-to-video platforms such as Runway, Make-A-Video, [13] Imagen Video, [14] Midjourney, [15] and Phenaki [16] can generate video from text and/or text/image prompts. [17]

Architecture and training

High-level architecture showing the state of AI art machine learning models, and notable models and applications as a clickable SVG image map

Text-to-image models have been built using a variety of architectures. The text encoding step may be performed with a recurrent neural network such as a long short-term memory (LSTM) network, though transformer models have since become a more popular option. For the image generation step, conditional generative adversarial networks (GANs) have been commonly used, with diffusion models also becoming a popular option in recent years. Rather than directly training a model to output a high-resolution image conditioned on a text embedding, a popular technique is to train a model to generate low-resolution images and use one or more auxiliary deep learning models to upscale them, filling in finer details.
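The staged design described above (encode the text, generate a low-resolution image, then upscale) can be sketched as a simple pipeline. This is a structural illustration only: `encode_text`, `generate_low_res`, and `upscale_2x` are hypothetical stand-ins with no learned parameters, and nearest-neighbour upscaling replaces the learned upscalers that would fill in finer details.

```python
def encode_text(prompt, dim=4):
    """Stand-in text encoder: fold character codes into a fixed-length vector."""
    vector = [0.0] * dim
    for i, ch in enumerate(prompt):
        vector[i % dim] += ord(ch) / 1000.0
    return vector


def generate_low_res(embedding, size=4):
    """Stand-in base generator: a size x size grid of values in [0, 1)."""
    return [
        [(embedding[(r + c) % len(embedding)] * (r + 1)) % 1.0 for c in range(size)]
        for r in range(size)
    ]


def upscale_2x(image):
    """Stand-in upscaler: nearest-neighbour doubling of width and height.

    A learned upscaling model would also add detail; this one only repeats pixels.
    """
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def text_to_image(prompt, upscale_stages=2):
    """Run the cascade: text embedding -> low-res image -> repeated upscaling."""
    image = generate_low_res(encode_text(prompt))
    for _ in range(upscale_stages):
        image = upscale_2x(image)
    return image


image = text_to_image("a stop sign is flying in blue skies")
```

With the default settings the base image is 4×4 and each of the two upscaling stages doubles both dimensions.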

Text-to-image models are trained on large datasets of (text, image) pairs, often scraped from the web. With their 2022 Imagen model, Google Brain reported positive results from using a large language model trained separately on a text-only corpus (with its weights subsequently frozen), a departure from the theretofore standard approach. [18]
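A frozen text encoder can be illustrated with a toy training step. The names and shapes below are assumptions made for illustration, not details of Imagen: `frozen_token_embeddings` stands in for a separately pretrained language model, `generator_weights` for the image model, and the loss is a plain mean squared error. The sketch shows only that gradient updates touch the generator and never the encoder.

```python
def encode(frozen_token_embeddings, token_ids):
    """Frozen text encoder: average the embedding vectors of the prompt tokens."""
    dim = len(frozen_token_embeddings[0])
    return [
        sum(frozen_token_embeddings[t][k] for t in token_ids) / len(token_ids)
        for k in range(dim)
    ]


def train_step(frozen_token_embeddings, generator_weights, token_ids, target, lr=0.1):
    """One gradient step on the generator; the encoder weights are only read."""
    embedding = encode(frozen_token_embeddings, token_ids)
    predicted = [sum(w * e for w, e in zip(row, embedding)) for row in generator_weights]
    error = [p - t for p, t in zip(predicted, target)]
    loss = sum(e * e for e in error) / len(error)
    # Gradient of the mean squared error with respect to each generator weight.
    updated = [
        [w - lr * 2.0 * error[i] * embedding[k] / len(error) for k, w in enumerate(row)]
        for i, row in enumerate(generator_weights)
    ]
    return updated, loss


# Hypothetical values: three token embeddings, a two-"pixel" target image.
frozen_token_embeddings = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4]]
generator_weights = [[0.0, 0.0], [0.0, 0.0]]
losses = []
for _ in range(50):
    generator_weights, loss = train_step(
        frozen_token_embeddings, generator_weights, [0, 2], [1.0, 0.5]
    )
    losses.append(loss)
```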

Datasets

Examples of images and captions from three public datasets which are commonly used to train text-to-image models

Training a text-to-image model requires a dataset of images paired with text captions. One dataset commonly used for this purpose is the COCO dataset. Released by Microsoft in 2014, COCO consists of around 123,000 images depicting a diversity of objects with five captions per image, generated by human annotators. Oxford-120 Flowers and CUB-200 Birds are smaller datasets of around 10,000 images each, restricted to flowers and birds, respectively. It is considered less difficult to train a high-quality text-to-image model with these datasets because of their narrow range of subject matter. [7]
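A captioned-image dataset of this kind can be represented as records holding one image and several captions, which are then expanded into individual (text, image) training pairs. The sketch below is a minimal illustration; the `CaptionedImage` type, the `iter_training_pairs` helper, and the file names and captions are all hypothetical and do not reflect the actual COCO file format.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class CaptionedImage:
    """One dataset record: an image and its human-written captions."""
    image_path: str
    captions: Tuple[str, ...]


def iter_training_pairs(records: Iterable[CaptionedImage]) -> Iterator[Tuple[str, str]]:
    """Yield one (caption, image_path) pair per caption of each record."""
    for record in records:
        for caption in record.captions:
            yield caption, record.image_path


# Hypothetical records, each with five captions as in the description above.
records = [
    CaptionedImage("images/0001.jpg", tuple(f"caption {i} for image 1" for i in range(5))),
    CaptionedImage("images/0002.jpg", tuple(f"caption {i} for image 2" for i in range(5))),
]
pairs = list(iter_training_pairs(records))
```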

Quality evaluation

Evaluating and comparing the quality of text-to-image models requires assessing multiple desirable properties. A desideratum specific to text-to-image models is that generated images semantically align with the text captions used to generate them. A number of schemes have been devised for assessing these qualities, some automated and others based on human judgement. [7]

A common algorithmic metric for assessing image quality and diversity is the Inception Score (IS), which is based on the distribution of labels predicted by a pretrained Inception v3 image classification model when applied to a sample of images generated by the text-to-image model. The score increases when the image classification model predicts a single label with high probability, a scheme intended to favour "distinct" generated images. Another popular metric is the related Fréchet inception distance (FID), which compares the distribution of generated images and real training images according to features extracted by one of the final layers of a pretrained image classification model. [7]
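Both metrics can be sketched in a few lines once the classifier outputs are available. The code below assumes the standard formulation of the Inception Score as the exponential of the mean KL divergence between each image's predicted label distribution and the marginal label distribution. For the Fréchet distance it makes a simplifying assumption that should be stated loudly: covariances are treated as diagonal, whereas the usual FID uses full covariance matrices of Inception features. The function names and the toy inputs are hypothetical.

```python
import math
from statistics import pstdev


def inception_score(label_probs):
    """exp(mean KL(p(y|x) || p(y))) over per-image label distributions."""
    n = len(label_probs)
    num_classes = len(label_probs[0])
    marginal = [sum(row[k] for row in label_probs) / n for k in range(num_classes)]
    mean_kl = 0.0
    for row in label_probs:
        # Terms with p == 0 contribute nothing to the KL divergence.
        mean_kl += sum(
            p * math.log(p / marginal[k]) for k, p in enumerate(row) if p > 0
        ) / n
    return math.exp(mean_kl)


def frechet_distance_diagonal(features_a, features_b):
    """Squared Frechet distance between Gaussians fitted per feature dimension.

    Simplification: covariances are assumed diagonal, so the trace term
    reduces to a sum of squared differences of standard deviations.
    """
    total = 0.0
    for k in range(len(features_a[0])):
        col_a = [f[k] for f in features_a]
        col_b = [f[k] for f in features_b]
        mean_gap = sum(col_a) / len(col_a) - sum(col_b) / len(col_b)
        std_gap = pstdev(col_a) - pstdev(col_b)
        total += mean_gap ** 2 + std_gap ** 2
    return total


# Confident and varied predictions score higher than uniform ones.
confident = inception_score([[1.0, 0.0], [0.0, 1.0]])
uncertain = inception_score([[0.5, 0.5], [0.5, 0.5]])

# Identical feature sets are at distance zero; a shifted set is farther away.
real = [[0.0, 0.0], [2.0, 0.0]]
shifted = [[3.0, 0.0], [5.0, 0.0]]
distance = frechet_distance_diagonal(real, shifted)
```

A higher Inception Score is better, while a lower Fréchet distance indicates generated features that are statistically closer to the real ones.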

Impact and applications

AI has the potential for societal transformation, which may include enabling amateurs to expand noncommercial niche genres (such as cyberpunk derivatives like solarpunk), novel entertainment, fast prototyping, [19] more accessible art-making, [19] and greater artistic output per unit of effort, expense, or time, [19] for example by generating drafts, draft refinements, and image components (inpainting). Generated images are sometimes used as sketches, [20] low-cost experiments, [21] inspiration, or illustrations of proof-of-concept-stage ideas. Additional functionality or improvements may also come from manual editing after generation (i.e., polishing), such as subsequent tweaking with an image editor. [21]

List of notable text-to-image models

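Name | Release date | Developer | License
--- | --- | --- | ---
DALL-E | January 2021 | OpenAI | Proprietary
DALL-E 2 | April 2022 | OpenAI | Proprietary
DALL-E 3 | September 2023 | OpenAI | Proprietary
Ideogram 2.0 | August 2024 | Ideogram | Proprietary
Imagen | April 2023 | Google | Proprietary
Imagen 2 | December 2023 [22] | Google | Proprietary
Imagen 3 | May 2024 | Google | Proprietary
Parti | Unreleased | Google | Proprietary
Firefly | March 2023 | Adobe Inc. | Proprietary
Midjourney | July 2022 | Midjourney, Inc. | Proprietary
Stable Diffusion | August 2022 | Stability AI | Stability AI Community License [note 1]
Flux | August 2024 | Black Forest Labs | Apache License [note 2]
Aurora | December 2024 | xAI | Proprietary
RunwayML | 2018 | Runway AI, Inc. | Proprietary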

Explanatory notes

  1. This license can be used by individuals and organizations with up to $1 million in annual revenue; organizations with annual revenue above $1 million need the Stability AI Enterprise License. Users retain all outputs regardless of revenue.
  2. Applies to the schnell model; the dev model uses a non-commercial license, while the pro model is proprietary (available only as an API).

References

  1. Vincent, James (May 24, 2022). "All these images were generated by Google's latest text-to-image AI". The Verge. Vox Media. Retrieved May 28, 2022.
  2. Agnese, Jorge; Herrera, Jonathan; Tao, Haicheng; Zhu, Xingquan (October 2019), A Survey and Taxonomy of Adversarial Neural Networks for Text-to-Image Synthesis, arXiv:1910.09399.
  3. Zhu, Xiaojin; Goldberg, Andrew B.; Eldawy, Mohamed; Dyer, Charles R.; Strock, Bradley (2007). "A text-to-picture synthesis system for augmenting communication" (PDF). AAAI. 7: 1590–1595.
  4. Mansimov, Elman; Parisotto, Emilio; Lei Ba, Jimmy; Salakhutdinov, Ruslan (November 2015). "Generating Images from Captions with Attention". ICLR. arXiv:1511.02793.
  5. Reed, Scott; Akata, Zeynep; Yan, Xinchen; Logeswaran, Lajanugen; Schiele, Bernt; Lee, Honglak (June 2016). "Generative Adversarial Text to Image Synthesis" (PDF). International Conference on Machine Learning. arXiv:1605.05396.
  6. Mansimov, Elman; Parisotto, Emilio; Ba, Jimmy Lei; Salakhutdinov, Ruslan (February 29, 2016). "Generating Images from Captions with Attention". International Conference on Learning Representations. arXiv:1511.02793.
  7. Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas (December 2021). "Adversarial text-to-image synthesis: A review". Neural Networks. 144: 187–209. arXiv:2101.09983. doi:10.1016/j.neunet.2021.07.019. PMID 34500257. S2CID 231698782.
  8. Rodriguez, Jesus (September 27, 2022). "🌅 Edge#229: VQGAN + CLIP". thesequence.substack.com. Retrieved October 10, 2022.
  9. Rodriguez, Jesus (October 4, 2022). "🎆🌆 Edge#231: Text-to-Image Synthesis with GANs". thesequence.substack.com. Retrieved October 10, 2022.
  10. Coldewey, Devin (January 5, 2021). "OpenAI's DALL-E creates plausible images of literally anything you ask it to". TechCrunch.
  11. Coldewey, Devin (April 6, 2022). "OpenAI's new DALL-E model draws anything — but bigger, better and faster than before". TechCrunch.
  12. "Stable Diffusion Public Release". Stability.Ai. Retrieved October 27, 2022.
  13. Kumar, Ashish (October 3, 2022). "Meta AI Introduces 'Make-A-Video': An Artificial Intelligence System That Generates Videos From Text". MarkTechPost. Retrieved October 3, 2022.
  14. Edwards, Benj (October 5, 2022). "Google's newest AI generator creates HD video from text prompts". Ars Technica. Retrieved October 25, 2022.
  15. Rodriguez, Jesus (October 25, 2022). "🎨 Edge#237: What is Midjourney?". thesequence.substack.com. Retrieved October 26, 2022.
  16. "Phenaki". phenaki.video. Retrieved October 3, 2022.
  17. Edwards, Benj (September 9, 2022). "Runway teases AI-powered text-to-video editing using written prompts". Ars Technica. Retrieved September 12, 2022.
  18. Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Kamyar Seyed Ghasemipour, Seyed; Karagol Ayan, Burcu; Sara Mahdavi, S.; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; J Fleet, David; Norouzi, Mohammad (May 23, 2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487 [cs.CV].
  19. Elgan, Mike (November 1, 2022). "How 'synthetic media' will transform business forever". Computerworld. Retrieved November 9, 2022.
  20. Roose, Kevin (October 21, 2022). "A.I.-Generated Art Is Already Transforming Creative Work". The New York Times. Retrieved November 16, 2022.
  21. Leswing, Kif. "Why Silicon Valley is so excited about awkward drawings done by artificial intelligence". CNBC. Retrieved November 16, 2022.
  22. "Imagen 2 on Vertex AI is now generally available". Google Cloud Blog. Retrieved January 2, 2024.