Sora (text-to-video model)

Sora
Developer(s) OpenAI
Platform OpenAI
Type Text-to-video model
Website openai.com/sora

Sora is an upcoming generative artificial intelligence model developed by OpenAI that specializes in text-to-video generation. The model accepts textual descriptions, known as prompts, from users and generates short video clips corresponding to those descriptions. Prompts can specify artistic styles, fantastical imagery, or real-world scenarios. When depicting real-world scenarios, user input may be required to ensure factual accuracy, as the model can otherwise add features erroneously. Sora has been praised for its ability to produce videos with high levels of visual detail, including intricate camera movements and characters that exhibit a range of emotions. The model can also extend existing short videos by generating new content that seamlessly precedes or follows the original clip. [1] [2] [3] As of April 2024, it is unreleased and not yet available to the public. [4]

History

Several other text-to-video generating models had been created prior to Sora, including Meta's Make-A-Video, Runway's Gen-2, and Google's Lumiere, the last of which, as of February 2024, is also still in its research phase. [5] OpenAI, the company behind Sora, had released DALL·E 3, the third of its DALL-E text-to-image models, in September 2023. [6]

The team that developed Sora named it after the Japanese word for sky to signify its "limitless creative potential". [1] On February 15, 2024, OpenAI first previewed Sora by releasing multiple high-definition video clips that it had created, including an SUV driving down a mountain road, an animation of a "short fluffy monster" next to a candle, two people walking through Tokyo in the snow, and fake historical footage of the California gold rush, and stated that the model could generate videos up to one minute long. [5] The company then shared a technical report, which highlighted the methods used to train the model. [2] [7] OpenAI CEO Sam Altman also posted a series of tweets, responding to Twitter users' prompts with Sora-generated videos.

OpenAI has stated that it plans to make Sora available to the public, but has not specified when, saying only that it would not be soon. [5] [4] The company provided limited access to a small "red team", including experts in misinformation and bias, to perform adversarial testing on the model. [6] The company also shared Sora with a small group of creative professionals, including video makers and artists, to seek feedback on its usefulness in creative fields. [8]

Capabilities and limitations

A video generated by Sora of a person lying in bed with a cat, containing several mistakes

The technology behind Sora is an adaptation of the technology behind DALL·E 3. According to OpenAI, Sora is a diffusion transformer [9] – a denoising latent diffusion model with a Transformer as the denoiser. A video is generated in latent space by denoising 3D "patches" and then transformed into standard space by a video decompressor. Training data is augmented through re-captioning: a video-to-text model creates detailed captions for the training videos. [7]
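
OpenAI's report omits implementation details, but the diffusion-transformer idea can be illustrated with a minimal, hypothetical PyTorch sketch: a Transformer operates on a sequence of noisy latent 3D patches (space-time tokens) and predicts denoised values. All class names, shapes, and hyperparameters below are illustrative assumptions, not Sora's actual architecture.

```python
import torch
import torch.nn as nn

class TinyVideoDenoiser(nn.Module):
    """Toy diffusion-transformer denoiser over latent space-time patches."""
    def __init__(self, patch_dim=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=patch_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.step_embed = nn.Linear(1, patch_dim)  # embeds the diffusion timestep

    def forward(self, noisy_patches, t):
        # noisy_patches: (batch, num_patches, patch_dim) latent 3D patches
        h = noisy_patches + self.step_embed(t.view(-1, 1, 1).float())
        return self.encoder(h)  # prediction used to denoise the patches

model = TinyVideoDenoiser()
# A toy latent video: 8 frames x 16 spatial patches, each a 256-dim token.
latents = torch.randn(1, 8 * 16, 256)
t = torch.tensor([500])  # current diffusion timestep
print(model(latents, t).shape)  # torch.Size([1, 128, 256])
```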

OpenAI trained the model using publicly available videos as well as copyrighted videos licensed for the purpose, but did not reveal the number or the exact sources of the videos. [1] Upon Sora's announcement, OpenAI acknowledged some of the model's shortcomings, including difficulty simulating complex physics, understanding causality, and differentiating left from right. [10] One example shows a group of wolf pups seemingly multiplying and converging, creating a hard-to-follow scenario. [11] OpenAI also stated that, in adherence to the company's existing safety practices, Sora will reject text prompts requesting sexual, violent, hateful, or celebrity imagery, as well as content featuring pre-existing intellectual property. [6]

Tim Brooks, a researcher on Sora, stated that the model figured out how to create 3D graphics from its dataset alone, while Bill Peebles, also a Sora researcher, said that the model automatically created different video angles without being prompted. [5] According to OpenAI, Sora-generated videos are tagged with C2PA metadata to indicate that they were AI-generated. [1]

Reception

Will Douglas Heaven of the MIT Technology Review called the demonstration videos "impressive", but noted that they must have been cherry-picked and may not be representative of Sora's typical output. [8] American academic Oren Etzioni expressed concerns over the technology's ability to create online disinformation for political campaigns. [1] For Wired, Steven Levy similarly wrote that it had the potential to become "a misinformation train wreck" and opined that its preview clips were "impressive" but "not perfect" and that it "show[ed] an emergent grasp of cinematic grammar" due to its unprompted shot changes. Levy added, "[i]t will be a very long time, if ever, before text-to-video threatens actual filmmaking." [5] Lisa Lacy of CNET called its example videos "remarkably realistic – except perhaps when a human face appears close up or when sea creatures are swimming". [6]

Filmmaker Tyler Perry announced he would be putting a planned $800 million expansion of his Atlanta studio on hold, expressing concern about Sora's potential impact on the film industry. [12] [13]

Related Research Articles

Music and artificial intelligence is the development of music software programs which use AI to generate music. As with applications in other fields, AI in music also simulates mental tasks. A prominent feature is the capability of an AI algorithm to learn based on past data, such as in computer accompaniment technology, wherein the AI is capable of listening to a human performer and performing accompaniment. Artificial intelligence also drives interactive composition technology, wherein a computer composes music in response to a live performance. There are other AI applications in music that cover not only music composition, production, and performance but also how music is marketed and consumed. Several music player programs have also been developed to use voice recognition and natural language processing technology for music voice control. Current research includes the application of AI in music composition, performance, theory and digital sound processing.

Multimodal learning, in the context of machine learning, is a type of deep learning using a combination of various modalities of data, such as text, audio, or images, in order to create a more robust model of the real-world phenomena in question. In contrast, singular modal learning would analyze text or imaging data independently. Multimodal machine learning combines these fundamentally different statistical analyses using specialized modeling strategies and algorithms, resulting in a model that comes closer to representing the real world.
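
As a toy illustration of the fusion idea described above, the following sketch concatenates embeddings from two modalities and passes them to a joint classifier. The embedding sizes and the two-class task are arbitrary assumptions; real multimodal systems use far more sophisticated modeling strategies.

```python
import torch
import torch.nn as nn

# Stand-ins for encoder outputs from two modalities (dimensions assumed).
text_emb = torch.randn(8, 128)    # batch of 8 text embeddings
image_emb = torch.randn(8, 256)   # batch of 8 image embeddings

# Late fusion: concatenate per-example embeddings, then classify jointly.
fusion_head = nn.Sequential(
    nn.Linear(128 + 256, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)
logits = fusion_head(torch.cat([text_emb, image_emb], dim=-1))
print(logits.shape)  # torch.Size([8, 2])
```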

OpenAI – Artificial intelligence research organization

OpenAI is a U.S.-based artificial intelligence (AI) research organization founded in December 2015, researching artificial intelligence with the goal of developing "safe and beneficial" artificial general intelligence, which it defines as "highly autonomous systems that outperform humans at most economically valuable work". As one of the leading organizations of the AI boom, it has developed several large language models and advanced image generation models, and has previously released open-source models. Its release of ChatGPT has been credited with starting the AI boom.

Artificial intelligence art – Machine application of knowledge of human aesthetic expressions

Artificial intelligence art is any visual artwork created through the use of artificial intelligence (AI) programs such as text-to-image models. AI art began to gain popularity in the mid- to late-20th century through the boom of artificial intelligence.

Synthetic media is a catch-all term for the artificial production, manipulation, and modification of data and media by automated means, especially through the use of artificial intelligence algorithms, such as for the purpose of misleading people or changing an original meaning. Synthetic media refers to any form of media, including but not limited to images, videos, audio recordings, and text, that is generated or manipulated using artificial intelligence (AI) techniques. This technology enables the creation of highly realistic content that may be indistinguishable from authentic media produced by humans. Synthetic media as a field has grown rapidly since the creation of generative adversarial networks, primarily through the rise of deepfakes as well as music synthesis, text generation, human image synthesis, speech synthesis, and more. Though experts use the term "synthetic media," individual methods such as deepfakes and text synthesis are sometimes not referred to as such by the media but instead by their respective terminology. Significant attention arose towards the field of synthetic media starting in 2017, when Motherboard reported on the emergence of AI-altered pornographic videos inserting the faces of famous actresses. Potential hazards of synthetic media include the spread of misinformation, further loss of trust in institutions such as media and government, the mass automation of creative and journalistic jobs, and a retreat into AI-generated fantasy worlds. Synthetic media is an applied form of artificial imagination.

Generative Pre-trained Transformer 3 (GPT-3) is a large language model released by OpenAI in 2020. Like its predecessor, GPT-2, it is a decoder-only transformer model, a deep neural network architecture that supersedes recurrence- and convolution-based architectures with a technique known as "attention". This attention mechanism allows the model to selectively focus on the segments of input text it predicts to be most relevant. GPT-3 uses a 2048-token context window and float16 (16-bit) precision, and has a then-unprecedented 175 billion parameters, requiring 350 GB of storage space as each parameter occupies 2 bytes; it has demonstrated strong "zero-shot" and "few-shot" learning abilities on many tasks.
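
The attention mechanism and the storage arithmetic mentioned above can be shown in a short sketch. The tensor sizes are toy values apart from the 2048-token context; this is scaled dot-product attention in general form, not GPT-3's full multi-head implementation.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Each position attends to all others: a softmax over scaled
    # query-key similarities weights a mix of the value vectors.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

# Toy tensors: a 2048-token context with a small per-head dimension.
q = k = v = torch.randn(1, 2048, 64)
print(attention(q, k, v).shape)  # torch.Size([1, 2048, 64])

# Storage arithmetic from the text: 175 billion float16 parameters.
print(175e9 * 2 / 1e9)  # 350.0 (gigabytes, at 2 bytes per parameter)
```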

GPT-2 – 2019 text-generating language model

Generative Pre-trained Transformer 2 (GPT-2) is a large language model by OpenAI and the second in their foundational series of GPT models. GPT-2 was pre-trained on a dataset of 8 million web pages. It was partially released in February 2019, followed by the full release of the 1.5-billion-parameter model on November 5, 2019.

DALL-E – Image-generating deep-learning model

DALL·E, DALL·E 2, and DALL·E 3 are text-to-image models developed by OpenAI using deep learning methodologies to generate digital images from natural language descriptions, called "prompts."

Artbreeder – Art website

Artbreeder, formerly known as Ganbreeder, is a collaborative, machine learning-based art website. Using the models StyleGAN and BigGAN, the website allows users to generate and modify images of faces, landscapes, and paintings, among other categories.

Prompt engineering is the process of structuring an instruction that can be interpreted and understood by a generative AI model. A prompt is natural language text describing the task that an AI should perform.

Midjourney – Image-generating machine learning model

Midjourney is a generative artificial intelligence program and service created and hosted by the San Francisco–based independent research lab Midjourney, Inc. Midjourney generates images from natural language descriptions, called prompts, similar to OpenAI's DALL-E and Stability AI's Stable Diffusion. It is one of the technologies of the AI boom.

Stable Diffusion – Image-generating machine learning model

Stable Diffusion is a deep-learning text-to-image model released in 2022, based on diffusion techniques. It is considered to be part of the ongoing AI boom.

Text-to-image model – Machine learning model

A text-to-image model is a machine learning model which takes a natural language description as input and produces an image matching that description.

LAION – Non-profit German artificial intelligence organization

LAION is a German non-profit which develops open-source artificial intelligence models and datasets. It is best known for releasing a number of large datasets of images and captions scraped from the web, which have been used to train a number of high-profile text-to-image models, including Stable Diffusion and Imagen.

In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure. The goal of diffusion models is to learn a diffusion process that generates the probability distribution of a given dataset, from which new samples can then be drawn. They learn the latent structure of a dataset by modeling the way in which data points diffuse through their latent space.
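
The forward (noising) half of this process has a simple closed form under the standard DDPM formulation, sketched below; the schedule values are conventional illustrative choices, and the reverse process would train a network to undo these corruption steps.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # per-step noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retention

def q_sample(x0, t):
    # Closed-form forward process: sample x_t given clean data x_0.
    noise = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise

x0 = torch.randn(4, 3, 8, 8)               # a toy batch of "images"
x_noisy = q_sample(x0, torch.tensor(500))  # heavily corrupted by step 500
```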

A text-to-video model is a machine learning model which takes a natural language description as input and produces a video or multiple videos matching that description.

Riffusion – Music-generating machine learning model

Riffusion is a neural network, designed by Seth Forsgren and Hayk Martiros, that generates music using images of sound rather than audio. It was created as a fine-tuning of Stable Diffusion, an existing open-source model for generating images from text prompts, on spectrograms. This results in a model which uses text prompts to generate image files, which can be put through an inverse Fourier transform and converted into audio files. While these files are only several seconds long, the model can also use latent space between outputs to interpolate different files together. This is accomplished using a functionality of the Stable Diffusion model known as img2img.
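
Because the model outputs only a magnitude spectrogram image, the inversion step described above must also recover phase; a standard way to do this (sketched here, not necessarily Riffusion's exact pipeline) is the Griffin-Lim algorithm, with assumed STFT parameters and a random array standing in for a generated spectrogram.

```python
import numpy as np
import librosa

# Stand-in for a model-generated magnitude spectrogram "image":
# rows are frequency bins (n_fft=2048 -> 1025 bins), columns are frames.
spectrogram = np.abs(np.random.randn(1025, 256)).astype(np.float32)

# Griffin-Lim iteratively estimates phase and inverts the STFT,
# recovering a time-domain waveform from magnitudes alone.
audio = librosa.griffinlim(spectrogram, n_iter=32, hop_length=512)
print(audio.shape)  # roughly (frames - 1) * hop_length samples
```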

Generative artificial intelligence – AI system capable of generating content in response to prompts

Generative artificial intelligence is artificial intelligence capable of generating text, images, videos, or other data using generative models, often in response to prompts. Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.

AI boom – Rapid progress in artificial intelligence

The AI boom, or AI spring, is an ongoing period of rapid progress in the field of artificial intelligence (AI). Prominent examples include protein folding prediction led by Google DeepMind and generative AI led by OpenAI.

Runway AI, Inc. is an American company headquartered in New York City that specializes in generative artificial intelligence research and technologies. The company is primarily focused on creating products and models for generating videos, images, and various multimedia content. It is most notable for developing the first commercial text-to-video generative AI models Gen-1 and Gen-2 and co-creating the research for the popular image generation AI system Stable Diffusion.

References

  1. Metz, Cade (February 15, 2024). "OpenAI Unveils A.I. That Instantly Generates Eye-Popping Videos". The New York Times. Archived from the original on February 15, 2024. Retrieved February 15, 2024.
  2. Brooks, Tim; Peebles, Bill; Holmes, Connor; DePue, Will; Guo, Yufei; Jing, Li; Schnurr, David; Taylor, Joe; Luhman, Troy; Luhman, Eric; Ng, Clarence Wing Yin; Wang, Ricky; Ramesh, Aditya (February 15, 2024). "Video generation models as world simulators". OpenAI. Archived from the original on February 16, 2024. Retrieved February 16, 2024.
  3. Roth, Emma (February 15, 2024). "OpenAI introduces Sora, its text-to-video AI model". The Verge. Retrieved February 21, 2024.
  4. Yang, Angela (February 15, 2024). "OpenAI teases 'Sora,' its new text-to-video AI model". NBC News. Archived from the original on February 15, 2024. Retrieved February 16, 2024.
  5. Levy, Steven (February 15, 2024). "OpenAI's Sora Turns AI Prompts Into Photorealistic Videos". Wired. Archived from the original on February 15, 2024. Retrieved February 16, 2024.
  6. Lacy, Lisa (February 15, 2024). "Meet Sora, OpenAI's Text-to-Video Generator". CNET. Archived from the original on February 16, 2024. Retrieved February 16, 2024.
  7. Edwards, Benj (February 16, 2024). "OpenAI collapses media reality with Sora, a photorealistic AI video generator". Ars Technica. Archived from the original on February 17, 2024. Retrieved February 17, 2024.
  8. Heaven, Will Douglas (February 15, 2024). "OpenAI teases an amazing new generative video model called Sora". MIT Technology Review. Archived from the original on February 15, 2024. Retrieved February 15, 2024.
  9. Peebles, William; Xie, Saining (2023). "Scalable Diffusion Models with Transformers". 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4172–4182. arXiv:2212.09748. doi:10.1109/ICCV51070.2023.00387. ISBN 979-8-3503-0718-4. ISSN 2380-7504. S2CID 254854389.
  10. Pequeño IV, Antonio (February 15, 2024). "OpenAI Reveals 'Sora': AI Video Model Capable Of Realistic Text-To-Video Prompts". Forbes. Archived from the original on February 15, 2024. Retrieved February 15, 2024.
  11. "Sora Review | New AI Video Generator From OpenAI". February 18, 2024. Retrieved February 20, 2024.
  12. Kilkenny, Katie (February 23, 2024). "Tyler Perry Puts $800M Studio Expansion on Hold After Seeing OpenAI's Sora: "Jobs Are Going to Be Lost"". The Hollywood Reporter. Retrieved February 26, 2024.
  13. Talha, Rashid (February 23, 2024). "Sora Release new footage of Text to video generator that cost $800M worth in 4 april 2024". Sora Ai APK (Text to Video generator). Archived from the original on March 25, 2024. Retrieved April 7, 2024.