Stable Diffusion

Stable Diffusion
Original author(s): Runway, CompVis, and Stability AI
Developer(s): Stability AI
Initial release: August 22, 2022
Stable release: SDXL 1.0 (model) [1] / July 26, 2023
Written in: Python [2]
Operating system: Any that support CUDA kernels
Type: Text-to-image model
License: Creative ML OpenRAIL-M
Website: stability.ai/stable-image

Stable Diffusion is a deep learning, text-to-image model released in 2022 based on diffusion techniques. It is considered to be a part of the ongoing AI boom.

It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt. [3] Its development involved researchers from the CompVis Group at Ludwig Maximilian University of Munich and Runway, with a computational donation from Stability AI and training data from non-profit organizations. [4] [5] [6] [7]

Stable Diffusion is a latent diffusion model, a kind of deep generative artificial neural network. Its code and model weights have been open-sourced, [8] and it can run on most consumer hardware equipped with a modest GPU with at least 4 GB of VRAM. This marked a departure from previous proprietary text-to-image models such as DALL-E and Midjourney, which were accessible only via cloud services. [9] [10]

Development

Stable Diffusion originated from a project called Latent Diffusion, developed by researchers at Ludwig Maximilian University of Munich and Heidelberg University. [11] [12] Stability AI offered computational resources to support the project, and the model was officially released in August 2022 under the name Stable Diffusion. [12] During the launch, Emad Mostaque focused on promoting the program and his role as its chief evangelist. [12] However, the company's initial press releases did not extensively detail the major contributions of the original researchers, particularly Professor Björn Ommer of Ludwig Maximilian University, who led the project's initial research and development. [12]

The development of Stable Diffusion was funded and shaped by the start-up company Stability AI. [10] [13] [14] [15] The technical license for the model was released by the CompVis group at Ludwig Maximilian University of Munich. [10] Development was led by Patrick Esser of Runway and Robin Rombach of CompVis, who were among the researchers who had earlier invented the latent diffusion model architecture used by Stable Diffusion. [7] Stability AI also credited EleutherAI and LAION (a German nonprofit which assembled the dataset on which Stable Diffusion was trained) as supporters of the project. [7]

In October 2022, Stability AI raised US$101 million in a round led by Lightspeed Venture Partners and Coatue Management. [16]

Technology

Diagram of the latent diffusion architecture used by Stable Diffusion
The denoising process used by Stable Diffusion. The model generates images by iteratively denoising random noise until a configured number of steps has been reached, guided by the CLIP text encoder pretrained on concepts along with the attention mechanism, resulting in the desired image depicting a representation of the trained concept.

Architecture

Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM), developed by the CompVis group at LMU Munich. [17] [8] Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images, and can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of three parts: the variational autoencoder (VAE), the U-Net, and an optional text encoder. [18] The VAE encoder compresses the image from pixel space to a lower-dimensional latent space, capturing a more fundamental semantic meaning of the image. [17] Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. [18] The U-Net block, composed of a ResNet backbone, denoises the output of forward diffusion to recover a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. [18]
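As an illustration of the forward-diffusion step applied to a latent, the following minimal sketch assumes a standard DDPM-style noise schedule; the schedule values, timestep count, and tensor shapes are illustrative assumptions rather than Stable Diffusion's exact configuration.

```python
import torch

# Illustrative DDPM-style forward diffusion applied to a VAE latent.
# Schedule values and shapes are assumptions for demonstration only.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # per-step noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def add_noise(latent, t):
    """Noise a clean latent z_0 to timestep t: z_t = sqrt(a_bar)*z_0 + sqrt(1-a_bar)*eps."""
    eps = torch.randn_like(latent)
    a_bar = alphas_cumprod[t]
    z_t = a_bar.sqrt() * latent + (1.0 - a_bar).sqrt() * eps
    return z_t, eps                                  # eps is the denoiser's prediction target

z0 = torch.randn(1, 4, 64, 64)   # e.g. a 512x512 image becomes a 4x64x64 latent
z_t, eps = add_noise(z0, t=torch.tensor(500))
```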

The denoising step can be flexibly conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism. [18] For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts to an embedding space. [8] Researchers point to increased computational efficiency for training and generation as an advantage of LDMs. [7] [17]
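As a sketch of how a prompt becomes a conditioning tensor, the snippet below encodes text with the openly available CLIP ViT-L/14 model via Hugging Face Transformers; the model identifier and the 77-token padding are the commonly used public settings, stated here as assumptions rather than an official recipe.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# CLIP ViT-L/14 text encoder, as used to condition Stable Diffusion 1.x.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photograph of an astronaut riding a horse"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")

with torch.no_grad():
    # Sequence of embeddings (1, 77, 768) fed to the U-Net's cross-attention layers.
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state
```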

The name "diffusion" takes inspiration from thermodynamic diffusion, and an important link between this purely physical field and deep learning was made in 2015. [19] [20]

With 860 million parameters in the U-Net and 123 million in the text encoder, Stable Diffusion is considered relatively lightweight by 2022 standards, and unlike other diffusion models, it can run on consumer GPUs, [21] and even CPU-only if using the OpenVINO version of Stable Diffusion. [22]

SD XL

The XL version uses the same architecture, [23] except larger: a larger U-Net backbone, a larger cross-attention context, two text encoders instead of one, and training on multiple aspect ratios (not just the square aspect ratio used by previous versions).

The SD XL Refiner, released at the same time, has the same architecture as SD XL, but it was trained for adding fine details to preexisting images via text-conditional img2img.

SD 3.0

The 3.0 version [24] completely changes the backbone: instead of a U-Net, it uses a Rectified Flow Transformer, which implements the rectified flow method [25] [26] with a Transformer.

The Transformer architecture used for SD 3.0 has three "tracks", for original text encoding, transformed text encoding, and image encoding (in latent space). The transformed text encoding and image encoding are mixed during each transformer block.

The architecture is named "multimodal diffusion transformer" (MMDiT), where "multimodal" means that it mixes text and image encodings inside its operations. This differs from previous versions of DiT, in which the text encoding affects the image encoding but not vice versa.
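The rectified flow objective itself can be illustrated with a minimal sketch; this is a generic formulation of rectified flow, not the SD 3.0 training code, and the velocity-prediction model `v_model` is a hypothetical stand-in.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(v_model, x0, cond):
    """Generic rectified-flow training step: learn the constant velocity of the
    straight path between data x0 and Gaussian noise x1. Illustrative only."""
    x1 = torch.randn_like(x0)                 # pure-noise endpoint
    t = torch.rand(x0.shape[0], 1, 1, 1)      # uniform time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1             # point on the straight path
    target_velocity = x1 - x0                 # velocity along the path is constant
    pred_velocity = v_model(x_t, t, cond)     # model predicts the velocity
    return F.mse_loss(pred_velocity, target_velocity)
```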

Training data

Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from the web, in which 5 billion image-text pairs were classified based on language and filtered into separate datasets by resolution, a predicted likelihood of containing a watermark, and a predicted "aesthetic" score (e.g. subjective visual quality). [27] The dataset was created by LAION, a German non-profit which receives funding from Stability AI. [27] [28] The Stable Diffusion model was trained on three subsets of LAION-5B: laion2B-en, laion-high-resolution, and laion-aesthetics v2 5+. [27] A third-party analysis of the model's training data found that, in a smaller subset of 12 million images taken from the wider original dataset, approximately 47% of the sampled images came from 100 different domains, with Pinterest accounting for 8.5% of the subset, followed by websites such as WordPress, Blogspot, Flickr, DeviantArt and Wikimedia Commons.[ citation needed ] An investigation by Bayerischer Rundfunk showed that LAION's datasets, hosted on Hugging Face, contain large amounts of private and sensitive data. [29]

Training procedures

The model was initially trained on the laion2B-en and laion-high-resolution subsets, with the last few rounds of training done on LAION-Aesthetics v2 5+, a subset of 600 million captioned images which the LAION-Aesthetics Predictor V2 predicted that humans would, on average, give a score of at least 5 out of 10 when asked to rate how much they liked them. [30] [27] [31] The LAION-Aesthetics v2 5+ subset also excluded low-resolution images and images which LAION-5B-WatermarkDetection identified as carrying a watermark with greater than 80% probability. [27] Final rounds of training additionally dropped 10% of text conditioning to improve Classifier-Free Diffusion Guidance. [32]
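Classifier-free guidance relies on occasionally dropping the text condition during training so that the same network learns both conditional and unconditional denoising; at sampling time the two predictions are combined. The sketch below illustrates the idea; the function and variable names are illustrative, not the actual training code.

```python
import torch

def train_step_with_text_dropout(unet, z_t, t, text_emb, null_emb, p_drop=0.1):
    """Randomly replace the text conditioning with a 'null' embedding (here 10% of
    the time), so the model also learns unconditional denoising."""
    if torch.rand(()) < p_drop:
        text_emb = null_emb
    return unet(z_t, t, text_emb)

def guided_noise_prediction(unet, z_t, t, text_emb, null_emb, guidance_scale=7.5):
    """At sampling time, extrapolate from the unconditional toward the conditional
    prediction; larger scales follow the prompt more closely."""
    eps_uncond = unet(z_t, t, null_emb)
    eps_cond = unet(z_t, t, text_emb)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```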

The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, at a cost of $600,000. [33] [34] [35]

Limitations

Stable Diffusion has issues with degradation and inaccuracies in certain scenarios. Initial releases of the model were trained on a dataset consisting of 512×512 resolution images, meaning that the quality of generated images noticeably degrades when user specifications deviate from its "expected" 512×512 resolution; [36] the version 2.0 update of the Stable Diffusion model later introduced the ability to natively generate images at 768×768 resolution. [37] Another challenge is in generating human limbs due to poor data quality of limbs in the LAION database. [38] The model is insufficiently trained to understand human limbs and faces due to the lack of representative features in the database, and prompting the model to generate images of this kind can confound it. [39] Stable Diffusion XL (SDXL) version 1.0, released in July 2023, introduced native 1024×1024 resolution and improved generation for limbs and text. [40] [41]

Accessibility for individual developers can also be a problem. In order to customize the model for new use cases that are not included in the dataset, such as generating anime characters ("waifu diffusion"), [42] new data and further training are required. Fine-tuned adaptations of Stable Diffusion created through additional retraining have been used for a variety of different use-cases, from medical imaging [43] to algorithmically generated music. [44] However, this fine-tuning process is sensitive to the quality of new data; low-resolution images or resolutions different from the original data can not only fail to teach the model the new task but also degrade its overall performance. Even when the model is additionally trained on high-quality images, it is difficult for individuals to run models on consumer hardware. For example, the training process for waifu-diffusion requires a minimum of 30 GB of VRAM, [45] which exceeds the usual amount provided in consumer GPUs such as Nvidia's GeForce 30 series, which has only about 12 GB. [46]

The creators of Stable Diffusion acknowledge the potential for algorithmic bias, as the model was primarily trained on images with English descriptions. [34] As a result, generated images reinforce social biases and are from a western perspective, as the creators note that the model lacks data from other communities and cultures. The model gives more accurate results for prompts that are written in English in comparison to those written in other languages, with western or white cultures often being the default representation. [34]

End-user fine-tuning

To address the limitations of the model's initial training, end-users may opt to implement additional training to fine-tune generation outputs to match more specific use-cases, a process also referred to as personalization. There are three methods by which user-accessible fine-tuning can be applied to a Stable Diffusion model checkpoint: an embedding trained via textual inversion, [48] a hypernetwork, [49] and DreamBooth. [50]

Capabilities

The Stable Diffusion model supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output. [8] Existing images can be re-drawn by the model to incorporate new elements described by a text prompt (a process known as "guided image synthesis" [51] ) through its diffusion-denoising mechanism. [8] In addition, the model also allows the use of prompts to partially alter existing images via inpainting and outpainting, when used with an appropriate user interface that supports such features, of which numerous different open source implementations exist. [52]

Stable Diffusion is recommended to be run with 10 GB or more of VRAM; however, users with less VRAM may opt to load the weights in float16 precision instead of the default float32, trading model performance for lower VRAM usage. [36]
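For example, a checkpoint can be loaded in half precision with the Hugging Face diffusers library; this is a common community workflow shown as a sketch rather than an official recommendation, and the checkpoint name is one widely used public release.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the weights in float16 to roughly halve VRAM usage compared with float32.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
```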

Text to image generation

Demonstration of the effect of negative prompts on image generation
  • Top: no negative prompt
  • Centre: "green trees"
  • Bottom: "round stones, round rocks"

The text to image sampling script within Stable Diffusion, known as "txt2img", consumes a text prompt in addition to assorted option parameters covering sampling types, output image dimensions, and seed values. The script outputs an image file based on the model's interpretation of the prompt. [8] Generated images are tagged with an invisible digital watermark to allow users to identify an image as generated by Stable Diffusion, [8] although this watermark loses its efficacy if the image is resized or rotated. [53]

Each txt2img generation will involve a specific seed value which affects the output image. Users may opt to randomize the seed in order to explore different generated outputs, or use the same seed to obtain the same image output as a previously generated image. [36] Users are also able to adjust the number of inference steps for the sampler; a higher value takes longer, but a smaller value may result in visual defects. [36] Another configurable option, the classifier-free guidance scale value, allows the user to adjust how closely the output image adheres to the prompt. [32] More experimental use cases may opt for a lower scale value, while use cases aiming for more specific outputs may use a higher value. [36]

Additional txt2img features are provided by front-end implementations of Stable Diffusion, which allow users to modify the weight given to specific parts of the text prompt. Emphasis markers allow users to add or reduce emphasis on keywords by enclosing them in brackets. [54] An alternative method of adjusting the weight of parts of the prompt is the use of "negative prompts". Negative prompts are a feature included in some front-end implementations, including Stability AI's own DreamStudio cloud service, and allow the user to specify prompts which the model should avoid during image generation. The specified prompts may be undesirable image features that would otherwise be present within image outputs due to the positive prompts provided by the user, or due to how the model was originally trained, with mangled human hands being a common example. [52] [55]
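As an illustration, a diffusers-based txt2img call exposes the seed, step count, guidance scale, and negative prompt discussed above. This is a sketch assuming a pipeline `pipe` loaded as in the earlier example; the prompt text and parameter values are arbitrary examples.

```python
import torch

generator = torch.Generator("cuda").manual_seed(42)   # fixed seed for reproducible output

image = pipe(
    prompt="a castle on a hill at sunset, detailed oil painting",
    negative_prompt="mangled hands, blurry, low quality",  # features to steer away from
    num_inference_steps=30,   # more steps take longer; too few can leave visual defects
    guidance_scale=7.5,       # higher values follow the prompt more closely
    generator=generator,
).images[0]

image.save("castle.png")
```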

Image modification

Demonstration of img2img modification
  • Left: Original image created with Stable Diffusion 1.5
  • Right: Modified image created with Stable Diffusion XL 1.0

Stable Diffusion also includes another sampling script, "img2img", which consumes a text prompt, path to an existing image, and strength value between 0.0 and 1.0. The script outputs a new image based on the original image that also features elements provided within the text prompt. The strength value denotes the amount of noise added to the output image. A higher strength value produces more variation within the image but may produce an image that is not semantically consistent with the prompt provided. [8]
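A hedged sketch of the equivalent workflow with the diffusers img2img pipeline follows; the checkpoint and file names are illustrative assumptions.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("night_city.png").convert("RGB")

result = pipe(
    prompt="a glowing glass sphere over a night city",
    image=init_image,
    strength=0.6,        # 0.0 keeps the original image; 1.0 ignores it almost entirely
    guidance_scale=7.5,
).images[0]
result.save("night_city_modified.png")
```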

The ability of img2img to add noise to the original image makes it potentially useful for data anonymization and data augmentation, in which the visual features of image data are changed and anonymized. [56] The same process may also be useful for image upscaling, in which the resolution of an image is increased, with more detail potentially being added to the image. [56] Additionally, Stable Diffusion has been experimented with as a tool for image compression; compared to JPEG and WebP, image compression with Stable Diffusion faces limitations in preserving small text and faces. [57]

Additional use-cases for image modification via img2img are offered by numerous front-end implementations of the Stable Diffusion model. Inpainting involves selectively modifying a portion of an existing image delineated by a user-provided layer mask, which fills the masked space with newly generated content based on the provided prompt. [52] A dedicated model specifically fine-tuned for inpainting use-cases was created by Stability AI alongside the release of Stable Diffusion 2.0. [37] Conversely, outpainting extends an image beyond its original dimensions, filling the previously empty space with content generated based on the provided prompt. [52]
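A mask-based inpainting call can be sketched with the dedicated inpainting checkpoint mentioned above; the model and file names are examples, and the mask is assumed to be a black-and-white image where white marks the region to regenerate.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("portrait.png").convert("RGB")
mask = Image.open("portrait_mask.png").convert("RGB")   # white = area to repaint

result = pipe(
    prompt="a red scarf",
    image=image,
    mask_image=mask,
).images[0]
result.save("portrait_inpainted.png")
```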

A depth-guided model, named "depth2img", was introduced with the release of Stable Diffusion 2.0 on November 24, 2022; this model infers the depth of the provided input image, and generates a new output image based on both the text prompt and the depth information, which allows the coherence and depth of the original input image to be maintained in the generated output. [37]

ControlNet

ControlNet [58] is a neural network architecture designed to manage diffusion models by incorporating additional conditions. It duplicates the weights of neural network blocks into a "locked" copy and a "trainable" copy. The "trainable" copy learns the desired condition, while the "locked" copy preserves the original model. This approach ensures that training with small datasets of image pairs does not compromise the integrity of production-ready diffusion models. The "zero convolution" is a 1×1 convolution with both weight and bias initialized to zero. Before training, all zero convolutions produce zero output, preventing any distortion caused by ControlNet. No layer is trained from scratch; the process is still fine-tuning, keeping the original model secure. This method enables training on small-scale or even personal devices.
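The zero-convolution idea can be sketched as follows; this is a simplified illustration of the initialization and the locked/trainable split, not the full ControlNet implementation, and the class and variable names are ours.

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution with weight and bias initialized to zero, so the trainable
    branch contributes nothing until training moves it away from zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """Locked copy of a pretrained block plus a trainable copy fed the extra condition."""
    def __init__(self, pretrained_block: nn.Module, trainable_block: nn.Module, channels: int):
        super().__init__()
        self.locked = pretrained_block
        for p in self.locked.parameters():
            p.requires_grad = False          # preserve the original model
        self.trainable = trainable_block     # initialized as a copy of the pretrained block
        self.zero_out = zero_conv(channels)

    def forward(self, x, condition):
        # The zero convolution gates the trainable branch, so at the start of
        # training the output equals the original locked block's output.
        return self.locked(x) + self.zero_out(self.trainable(x + condition))
```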

Releases

Version number | Release date | Notes
1.1, 1.2, 1.3, 1.4 [59] | August 2022 | All released by CompVis. There is no "version 1.0". 1.1 gave rise to 1.2, and 1.2 gave rise to both 1.3 and 1.4. [60]
1.5 [61] | October 2022 | Initialized with the weights of 1.2, not 1.4. Released by RunwayML.
2.0 [62] | November 2022 | Retrained from scratch on a filtered dataset. [63]
2.1 [64] | December 2022 | Initialized with the weights of 2.0.
XL 1.0 [65] [23] | July 2023 | The XL 1.0 base model has 3.5 billion parameters, making it around 3.5x larger than previous versions. [66]
XL Turbo [67] | November 2023 | Distilled from XL 1.0 to run in fewer diffusion steps. [68]
3.0 [69] [24] | February 2024 (early preview) | A family of models, ranging from 800M to 8B parameters.

Key papers

  • Learning Transferable Visual Models From Natural Language Supervision (2021): the CLIP paper, describing the text encoder used to condition Stable Diffusion. [70]
  • SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations (2021): the method underlying guided image-to-image synthesis. [71]
  • High-Resolution Image Synthesis With Latent Diffusion Models (2022): the latent diffusion paper on which Stable Diffusion is based. [72]

Training cost

Stable Diffusion 1 was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, at a cost of about $600,000. [33] [34] [35]

Usage and controversy

Stable Diffusion claims no rights on generated images and freely gives users the rights of usage to any generated images from the model provided that the image content is not illegal or harmful to individuals. [73]

The images Stable Diffusion was trained on have been filtered without human input, leading to some harmful images and large amounts of private and sensitive information appearing in the training data. [29]

As visual styles and compositions are not subject to copyright, it is often interpreted that users of Stable Diffusion who generate images of artworks should not be considered to be infringing upon the copyright of visually similar works. [74] However, individuals depicted in generated images may be protected by personality rights if their likeness is used, [74] and intellectual property such as recognizable brand logos remains protected by copyright. Nonetheless, visual artists have expressed concern that widespread usage of image synthesis software such as Stable Diffusion may eventually lead to human artists, along with photographers, models, cinematographers, and actors, gradually losing commercial viability against AI-based competitors. [15]

Stable Diffusion is notably more permissive in the types of content users may generate, such as violent or sexually explicit imagery, than other commercial products based on generative AI. [75] Addressing concerns that the model may be used for abusive purposes, CEO of Stability AI Emad Mostaque argues that "[it is] peoples' responsibility as to whether they are ethical, moral, and legal in how they operate this technology", [10] and that putting the capabilities of Stable Diffusion into the hands of the public would result in the technology providing a net benefit, in spite of the potential negative consequences. [10] In addition, Mostaque argues that the intention behind the open availability of Stable Diffusion is to end corporate control and dominance over such technologies by companies that have previously developed only closed AI systems for image synthesis. [10] [75] This is reflected by the fact that any restrictions Stability AI places on the content that users may generate can easily be bypassed due to the availability of the source code. [76]

Controversy around photorealistic sexualized depictions of underage characters has been raised, due to such images generated by Stable Diffusion being shared on websites such as Pixiv. [77]

Litigation

In January 2023, three artists, Sarah Andersen, Kelly McKernan, and Karla Ortiz, filed a copyright infringement lawsuit against Stability AI, Midjourney, and DeviantArt, claiming that these companies have infringed the rights of millions of artists by training AI tools on five billion images scraped from the web without the consent of the original artists. [78] The same month, Stability AI was also sued by Getty Images for using its images in the training data. [14]

In July 2023, U.S. District Judge William Orrick indicated that he was inclined to dismiss most of the lawsuit filed by Andersen, McKernan, and Ortiz, but allowed them to file a new complaint. [79]

License

Unlike models like DALL-E, Stable Diffusion makes its source code available, [80] [8] along with the model (pretrained weights). It applies the Creative ML OpenRAIL-M license, a form of Responsible AI License (RAIL), to the model (M). [81] The license prohibits certain use cases, including crime, libel, harassment, doxing, "exploiting ... minors", giving medical advice, automatically creating legal obligations, producing legal evidence, and "discriminating against or harming individuals or groups based on ... social behavior or ... personal or personality characteristics ... [or] legally protected characteristics or categories". [82] [83] The user owns the rights to their generated output images, and is free to use them commercially. [84]

References

  1. "Announcing SDXL 1.0". stability.ai. Archived from the original on July 26, 2023.
  2. Ryan O'Connor (August 23, 2022). "How to Run Stable Diffusion Locally to Generate Images". Archived from the original on October 13, 2023. Retrieved May 4, 2023.
  3. "Diffuse The Rest - a Hugging Face Space by huggingface". huggingface.co. Archived from the original on September 5, 2022. Retrieved September 5, 2022.
  4. "Leaked deck raises questions over Stability AI's Series A pitch to investors". sifted.eu. Archived from the original on June 29, 2023. Retrieved June 20, 2023.
  5. "Revolutionizing image generation by AI: Turning text into images". www.lmu.de. Archived from the original on September 17, 2022. Retrieved June 21, 2023.
  6. Mostaque, Emad (November 2, 2022). "Stable Diffusion came from the Machine Vision & Learning research group (CompVis) @LMU_Muenchen". Twitter. Archived from the original on July 20, 2023. Retrieved June 22, 2023.
  7. "Stable Diffusion Launch Announcement". Stability.Ai. Archived from the original on September 5, 2022. Retrieved September 6, 2022.
  8. "Stable Diffusion Repository on GitHub". CompVis - Machine Vision and Learning Research Group, LMU Munich. September 17, 2022. Archived from the original on January 18, 2023. Retrieved September 17, 2022.
  9. "The new killer app: Creating AI art will absolutely crush your PC". PCWorld. Archived from the original on August 31, 2022. Retrieved August 31, 2022.
  10. Vincent, James (September 15, 2022). "Anyone can use this AI art generator — that's the risk". The Verge. Archived from the original on January 21, 2023. Retrieved September 30, 2022.
  11. Cai, Kenrick. "The AI Founder Taking Credit For Stable Diffusion's Success Has A History Of Exaggeration". Forbes.
  12. Growcoot, Matt (June 5, 2023). "'So Many Things Don't Add Up': Stability AI Founder Accused of Exaggerations". PetaPixel.
  13. "The AI Founder Taking Credit For Stable Diffusion's Success Has A History Of Exaggeration". www.forbes.com. Archived from the original on June 21, 2023. Retrieved June 20, 2023.
  14. Korn, Jennifer (January 17, 2023). "Getty Images suing the makers of popular AI art tool for allegedly stealing photos". CNN. Archived from the original on March 1, 2023. Retrieved January 22, 2023.
  15. Heikkilä, Melissa (September 16, 2022). "This artist is dominating AI-generated art. And he's not happy about it". MIT Technology Review. Archived from the original on January 14, 2023. Retrieved September 26, 2022.
  16. Wiggers, Kyle (October 17, 2022). "Stability AI, the startup behind Stable Diffusion, raises $101M". Techcrunch. Archived from the original on October 17, 2022. Retrieved October 17, 2022.
  17. Rombach; Blattmann; Lorenz; Esser; Ommer (June 2022). High-Resolution Image Synthesis with Latent Diffusion Models (PDF). International Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA. pp. 10684–10695. arXiv: 2112.10752. Archived (PDF) from the original on January 20, 2023. Retrieved September 17, 2022.
  18. Alammar, Jay. "The Illustrated Stable Diffusion". jalammar.github.io. Archived from the original on November 1, 2022. Retrieved October 31, 2022.
  19. David, Foster. "8. Diffusion Models". Generative Deep Learning (2 ed.). O'Reilly.
  20. Sohl-Dickstein, Jascha; Weiss, Eric A.; Maheswaranathan, Niru; Ganguli, Surya (March 12, 2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics". arXiv: 1503.03585.
  21. "Stable diffusion pipelines". huggingface.co. Archived from the original on June 25, 2023. Retrieved June 22, 2023.
  22. "Text-to-Image Generation with Stable Diffusion and OpenVINO™". openvino.ai. Intel . Retrieved February 10, 2024.
  23. Podell, Dustin; English, Zion; Lacey, Kyle; Blattmann, Andreas; Dockhorn, Tim; Müller, Jonas; Penna, Joe; Rombach, Robin (July 4, 2023), SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, arXiv: 2307.01952, retrieved March 6, 2024
  24. Esser, Patrick; Kulal, Sumith; Blattmann, Andreas; Entezari, Rahim; Müller, Jonas; Saini, Harry; Levi, Yam; Lorenz, Dominik; Sauer, Axel (March 5, 2024), Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, arXiv: 2403.03206, retrieved March 6, 2024
  25. Liu, Xingchao; Gong, Chengyue; Liu, Qiang (September 7, 2022), Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow, arXiv: 2209.03003, retrieved March 6, 2024
  26. "Rectified Flow — Rectified Flow". www.cs.utexas.edu. Retrieved March 6, 2024.
  27. Baio, Andy (August 30, 2022). "Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion's Image Generator". Waxy.org. Archived from the original on January 20, 2023. Retrieved November 2, 2022.
  28. "This artist is dominating AI-generated art. And he's not happy about it". MIT Technology Review. Archived from the original on January 14, 2023. Retrieved November 2, 2022.
  29. Brunner, Katharina; Harlan, Elisa (July 7, 2023). "We Are All Raw Material for AI". Bayerischer Rundfunk (BR). Archived from the original on September 12, 2023. Retrieved September 12, 2023.
  30. Schuhmann, Christoph (November 2, 2022), CLIP+MLP Aesthetic Score Predictor, archived from the original on June 8, 2023, retrieved November 2, 2022
  31. "LAION-Aesthetics | LAION". laion.ai. Archived from the original on August 26, 2022. Retrieved September 2, 2022.
  32. Ho, Jonathan; Salimans, Tim (July 25, 2022). "Classifier-Free Diffusion Guidance". arXiv: 2207.12598 [cs.LG].
  33. Mostaque, Emad (August 28, 2022). "Cost of construction". Twitter. Archived from the original on September 6, 2022. Retrieved September 6, 2022.
  34. "CompVis/stable-diffusion-v1-4 · Hugging Face". huggingface.co. Archived from the original on January 11, 2023. Retrieved November 2, 2022.
  35. Wiggers, Kyle (August 12, 2022). "A startup wants to democratize the tech behind DALL-E 2, consequences be damned". TechCrunch. Archived from the original on January 19, 2023. Retrieved November 2, 2022.
  36. "Stable Diffusion with 🧨 Diffusers". huggingface.co. Archived from the original on January 17, 2023. Retrieved October 31, 2022.
  37. "Stable Diffusion 2.0 Release". stability.ai. Archived from the original on December 10, 2022.
  38. "LAION". laion.ai. Archived from the original on October 16, 2023. Retrieved October 31, 2022.
  39. "Generating images with Stable Diffusion". Paperspace Blog. August 24, 2022. Archived from the original on October 31, 2022. Retrieved October 31, 2022.
  40. "Announcing SDXL 1.0". Stability AI. Archived from the original on July 26, 2023. Retrieved August 21, 2023.
  41. Edwards, Benj (July 27, 2023). "Stability AI releases Stable Diffusion XL, its next-gen image synthesis model". Ars Technica. Archived from the original on August 21, 2023. Retrieved August 21, 2023.
  42. "hakurei/waifu-diffusion · Hugging Face". huggingface.co. Archived from the original on October 8, 2023. Retrieved October 31, 2022.
  43. Chambon, Pierre; Bluethgen, Christian; Langlotz, Curtis P.; Chaudhari, Akshay (October 9, 2022). "Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains". arXiv: 2210.04133 [cs.CV].
  44. Seth Forsgren; Hayk Martiros. "Riffusion - Stable diffusion for real-time music generation". Riffusion. Archived from the original on December 16, 2022.
  45. Mercurio, Anthony (October 31, 2022), Waifu Diffusion, archived from the original on October 31, 2022, retrieved October 31, 2022
  46. Smith, Ryan. "NVIDIA Quietly Launches GeForce RTX 3080 12GB: More VRAM, More Power, More Money". www.anandtech.com. Archived from the original on August 27, 2023. Retrieved October 31, 2022.
  47. Dave James (October 28, 2022). "I thrashed the RTX 4090 for 8 hours straight training Stable Diffusion to paint like my uncle Hermann". PC Gamer . Archived from the original on November 9, 2022.
  48. Gal, Rinon; Alaluf, Yuval; Atzmon, Yuval; Patashnik, Or; Bermano, Amit H.; Chechik, Gal; Cohen-Or, Daniel (August 2, 2022). "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion". arXiv: 2208.01618 [cs.CV].
  49. "NovelAI Improvements on Stable Diffusion". NovelAI. October 11, 2022. Archived from the original on October 27, 2022.
  50. Yuki Yamashita (September 1, 2022). "愛犬の合成画像を生成できるAI 文章で指示するだけでコスプレ 米Googleが開発" [AI that can generate synthetic images of your dog, cosplaying from a simple text instruction, developed by Google]. ITmedia Inc. (in Japanese). Archived from the original on August 31, 2022.
  51. Meng, Chenlin; He, Yutong; Song, Yang; Song, Jiaming; Wu, Jiajun; Zhu, Jun-Yan; Ermon, Stefano (August 2, 2021). "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations". arXiv: 2108.01073 [cs.CV].
  52. "Stable Diffusion web UI". GitHub. November 10, 2022. Archived from the original on January 20, 2023. Retrieved September 27, 2022.
  53. invisible-watermark, Shield Mountain, November 2, 2022, archived from the original on October 18, 2022, retrieved November 2, 2022
  54. "stable-diffusion-tools/emphasis at master · JohannesGaessler/stable-diffusion-tools". GitHub. Archived from the original on October 2, 2022. Retrieved November 2, 2022.
  55. "Stable Diffusion v2.1 and DreamStudio Updates 7-Dec 22". stability.ai. Archived from the original on December 10, 2022.
  56. Luzi, Lorenzo; Siahkoohi, Ali; Mayer, Paul M.; Casco-Rodriguez, Josue; Baraniuk, Richard (October 21, 2022). "Boomerang: Local sampling on image manifolds using diffusion models". arXiv: 2210.12100 [cs.CV].
  57. Bühlmann, Matthias (September 28, 2022). "Stable Diffusion Based Image Compression". Medium. Archived from the original on November 2, 2022. Retrieved November 2, 2022.
  58. Zhang, Lvmin (February 10, 2023). "Adding Conditional Control to Text-to-Image Diffusion Models". arXiv: 2302.05543 [cs.CV].
  59. "CompVis/stable-diffusion-v1-4 · Hugging Face". huggingface.co. Archived from the original on January 11, 2023. Retrieved August 17, 2023.
  60. "CompVis (CompVis)". huggingface.co. August 23, 2023. Retrieved March 6, 2024.
  61. "runwayml/stable-diffusion-v1-5 · Hugging Face". huggingface.co. Archived from the original on September 21, 2023. Retrieved August 17, 2023.
  62. "stabilityai/stable-diffusion-2 · Hugging Face". huggingface.co. Archived from the original on September 21, 2023. Retrieved August 17, 2023.
  63. "stabilityai/stable-diffusion-2-base · Hugging Face". huggingface.co. Retrieved January 1, 2024.
  64. "stabilityai/stable-diffusion-2-1 · Hugging Face". huggingface.co. Archived from the original on September 21, 2023. Retrieved August 17, 2023.
  65. "stabilityai/stable-diffusion-xl-base-1.0 · Hugging Face". huggingface.co. Archived from the original on October 8, 2023. Retrieved August 17, 2023.
  66. "Announcing SDXL 1.0". Stability AI. Retrieved January 1, 2024.
  67. "stabilityai/sdxl-turbo · Hugging Face". huggingface.co. Retrieved January 1, 2024.
  68. "Adversarial Diffusion Distillation". Stability AI. Retrieved January 1, 2024.
  69. "Stable Diffusion 3". Stability AI. Retrieved March 5, 2024.
  70. Radford, Alec; Kim, Jong Wook; Hallacy, Chris; Ramesh, Aditya; Goh, Gabriel; Agarwal, Sandhini; Sastry, Girish; Askell, Amanda; Mishkin, Pamela (February 26, 2021), Learning Transferable Visual Models From Natural Language Supervision, arXiv: 2103.00020 , retrieved March 6, 2024
  71. Meng, Chenlin; He, Yutong; Song, Yang; Song, Jiaming; Wu, Jiajun; Zhu, Jun-Yan; Ermon, Stefano (January 4, 2022), SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, arXiv: 2108.01073 , retrieved March 6, 2024
  72. Rombach, Robin; Blattmann, Andreas; Lorenz, Dominik; Esser, Patrick; Ommer, Björn (2022). "High-Resolution Image Synthesis With Latent Diffusion Models". International Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684–10695. arXiv: 2112.10752.
  73. "LICENSE.md · stabilityai/stable-diffusion-xl-base-1.0 at main". huggingface.co. July 26, 2023. Retrieved January 1, 2024.
  74. "高性能画像生成AI「Stable Diffusion」無料リリース。「kawaii」までも理解し創造する画像生成AI" [High-performance image-generation AI "Stable Diffusion" released for free: an image-generation AI that understands and creates even "kawaii"]. Automaton Media (in Japanese). August 24, 2022. Archived from the original on December 8, 2022. Retrieved October 4, 2022.
  75. Ryo Shimizu (August 26, 2022). "Midjourneyを超えた? 無料の作画AI「 #StableDiffusion 」が「AIを民主化した」と断言できる理由" [Has it surpassed Midjourney? Why the free image-generation AI "#StableDiffusion" can be said to have "democratized AI"]. Business Insider Japan (in Japanese). Archived from the original on December 10, 2022. Retrieved October 4, 2022.
  76. Cai, Kenrick. "Startup Behind AI Image Generator Stable Diffusion Is In Talks To Raise At A Valuation Up To $1 Billion". Forbes. Archived from the original on September 30, 2023. Retrieved October 31, 2022.
  77. "Illegal trade in AI child sex abuse images exposed". BBC News. June 27, 2023. Archived from the original on September 21, 2023. Retrieved September 26, 2023.
  78. Vincent, James (January 16, 2023). "AI art tools Stable Diffusion and Midjourney targeted with copyright lawsuit". The Verge. Archived from the original on March 9, 2023. Retrieved January 16, 2023.
  79. Brittain, Blake (July 19, 2023). "US judge finds flaws in artists' lawsuit against AI companies". Reuters. Archived from the original on September 6, 2023. Retrieved August 6, 2023.
  80. "Stable Diffusion Public Release". Stability.Ai. Archived from the original on August 30, 2022. Retrieved August 31, 2022.
  81. "From RAIL to Open RAIL: Topologies of RAIL Licenses". Responsible AI Licenses (RAIL). August 18, 2022. Archived from the original on July 27, 2023. Retrieved February 20, 2023.
  82. "Ready or not, mass video deepfakes are coming". The Washington Post. August 30, 2022. Archived from the original on August 31, 2022. Retrieved August 31, 2022.
  83. "License - a Hugging Face Space by CompVis". huggingface.co. Archived from the original on September 4, 2022. Retrieved September 5, 2022.
  84. Katsuo Ishida (August 26, 2022). "言葉で指示した画像を凄いAIが描き出す「Stable Diffusion」 ~画像は商用利用も可能" ["Stable Diffusion", an amazing AI that draws images as instructed in words; images may also be used commercially]. Impress Corporation (in Japanese). Archived from the original on November 14, 2022. Retrieved October 4, 2022.