Sora (text-to-video model)

Sora
Developer(s) OpenAI
Initial release December 9, 2024
Platform OpenAI
Type Text-to-video model
Website sora.com

Sora is a text-to-video model developed by OpenAI. The model generates short video clips based on user prompts, and can also extend existing short videos. Sora was released publicly for ChatGPT Plus and ChatGPT Pro users in December 2024. [1] [2]


History

Several other text-to-video generation models had been created prior to Sora, including Meta's Make-A-Video, Runway's Gen-2, and Google's Lumiere, the last of which, as of February 2024, was also still in its research phase. [3] OpenAI, the company behind Sora, had released DALL·E 3, the third of its DALL-E text-to-image models, in September 2023. [4]

The team that developed Sora named it after the Japanese word for sky to signify its "limitless creative potential". [5] On February 15, 2024, OpenAI first previewed Sora by releasing multiple high-definition clips it had created, including an SUV driving down a mountain road, an animation of a "short fluffy monster" next to a candle, two people walking through Tokyo in the snow, and fake historical footage of the California gold rush. The company stated that the model could generate videos up to one minute long. [3] It then shared a technical report highlighting the methods used to train the model. [6] [7] OpenAI CEO Sam Altman also posted a series of tweets responding to Twitter users' prompts with Sora-generated videos.

In November 2024, an API key granting Sora access was leaked by a group of testers on Hugging Face, who posted a manifesto protesting what they described as OpenAI's use of Sora for "art washing". OpenAI revoked the access three hours after the leak was made public and stated that "hundreds of artists" had shaped the model's development and that "participation is voluntary." [8]

On December 9, 2024, OpenAI made Sora available to the public for ChatGPT Pro and ChatGPT Plus users. Prior to this, the company had provided limited access to a small "red team", including experts in misinformation and bias, to perform adversarial testing on the model. [4] The company also shared Sora with a small group of creative professionals, including video makers and artists, to seek feedback on its usefulness in creative fields. [9]

Capabilities and limitations

A video generated by Sora of someone lying in a bed with a cat on it, containing several mistakes

The technology behind Sora is an adaptation of the technology behind DALL-E 3. According to OpenAI, Sora is a diffusion transformer [10] – a denoising latent diffusion model with a Transformer as the denoiser. A video is generated in latent space by denoising 3D "patches", then transformed to standard space by a video decompressor. Re-captioning is used to augment training data by applying a video-to-text model to create detailed captions for videos. [7]
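The "3D patches" mentioned above can be illustrated with a toy sketch: a video latent is cut into spacetime blocks, each of which is flattened into one token of the sequence a diffusion transformer denoises. This is a minimal illustration of the general patchification idea (not OpenAI's code), and the shapes and patch sizes below are assumptions chosen for the example.

```python
import numpy as np

def patchify(latent, pt=2, ph=4, pw=4):
    """Split a latent video (T, H, W, C) into 3D spacetime patches and
    flatten each patch into one token vector."""
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Arrange into a grid of (pt, ph, pw, C) blocks...
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # ...group the grid axes together, then the within-patch axes...
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # ...and flatten each block into a single token.
    return x.reshape(-1, pt * ph * pw * C)  # (num_tokens, token_dim)

# A hypothetical latent video: 8 frames of a 16x16 grid with 4 channels.
latent = np.random.randn(8, 16, 16, 4)
tokens = patchify(latent)
print(tokens.shape)  # (64, 128): 4*4*4 patches, each 2*4*4*4 values
```

A transformer denoiser would then operate on this `(num_tokens, token_dim)` sequence, and the inverse reshapes recover the video latent before the decompressor maps it back to pixels.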

OpenAI trained the model using publicly available videos as well as copyrighted videos licensed for the purpose, but did not reveal the number or the exact source of the videos. [5] Upon its release, OpenAI acknowledged some of Sora's shortcomings, including its struggle to simulate complex physics, understand causality, and distinguish left from right. [11] One example shows a group of wolf pups seemingly multiplying and converging, creating a hard-to-follow scenario. [12] OpenAI also stated that, in adherence to the company's existing safety practices, Sora would restrict text prompts for sexual, violent, hateful, or celebrity imagery, as well as content featuring pre-existing intellectual property. [4]

Tim Brooks, a researcher on Sora, stated that the model figured out how to create 3D graphics from its dataset alone, while Bill Peebles, also a Sora researcher, said that the model automatically created different video angles without being prompted. [3] According to OpenAI, Sora-generated videos are tagged with C2PA metadata to indicate that they were AI-generated. [5]

Reception

Will Douglas Heaven of the MIT Technology Review called the demonstration videos "impressive", but noted that they must have been cherry-picked and may not be representative of Sora's typical output. [9] American academic Oren Etzioni expressed concerns over the technology's ability to create online disinformation for political campaigns. [5] For Wired, Steven Levy similarly wrote that it had the potential to become "a misinformation train wreck" and opined that its preview clips were "impressive" but "not perfect" and that it "show[ed] an emergent grasp of cinematic grammar" due to its unprompted shot changes. Levy added, "[i]t will be a very long time, if ever, before text-to-video threatens actual filmmaking." [3] Lisa Lacy of CNET called its example videos "remarkably realistic – except perhaps when a human face appears close up or when sea creatures are swimming". [4]

Filmmaker Tyler Perry announced he would be putting a planned $800 million expansion of his Atlanta studio on hold, expressing concern about Sora's potential impact on the film industry. [13] [14]


References

  1. "Sora | OpenAI". openai.com. Retrieved December 9, 2024.
  2. Wang, Gerui. "How Sora And AI Videos Transform Media: Strengths And Challenges". Forbes. Retrieved December 24, 2024.
  3. Levy, Steven (February 15, 2024). "OpenAI's Sora Turns AI Prompts Into Photorealistic Videos". Wired. Archived from the original on February 15, 2024. Retrieved February 16, 2024.
  4. Lacy, Lisa (February 15, 2024). "Meet Sora, OpenAI's Text-to-Video Generator". CNET. Archived from the original on February 16, 2024. Retrieved February 16, 2024.
  5. Metz, Cade (February 15, 2024). "OpenAI Unveils A.I. That Instantly Generates Eye-Popping Videos". The New York Times. Archived from the original on February 15, 2024. Retrieved February 15, 2024.
  6. Brooks, Tim; Peebles, Bill; Holmes, Connor; DePue, Will; Guo, Yufei; Jing, Li; Schnurr, David; Taylor, Joe; Luhman, Troy; Luhman, Eric; Ng, Clarence Wing Yin; Wang, Ricky; Ramesh, Aditya (February 15, 2024). "Video generation models as world simulators". OpenAI. Archived from the original on February 16, 2024. Retrieved February 16, 2024.
  7. Edwards, Benj (February 16, 2024). "OpenAI collapses media reality with Sora, a photorealistic AI video generator". Ars Technica. Archived from the original on February 17, 2024. Retrieved February 17, 2024.
  8. Spangler, Todd (November 27, 2024). "OpenAI Shuts Down Sora Access After Artists Released Video-Generation Tool in Protest: 'We Are Not Your PR Puppets'". Variety. Retrieved December 2, 2024.
  9. Heaven, Will Douglas (February 15, 2024). "OpenAI teases an amazing new generative video model called Sora". MIT Technology Review. Archived from the original on February 15, 2024. Retrieved February 15, 2024.
  10. Peebles, William; Xie, Saining (2023). "Scalable Diffusion Models with Transformers". 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4172–4182. arXiv:2212.09748. doi:10.1109/ICCV51070.2023.00387. ISBN 979-8-3503-0718-4. ISSN 2380-7504. S2CID 254854389. Archived from the original on February 17, 2024. Retrieved February 17, 2024.
  11. Pequeño IV, Antonio (February 15, 2024). "OpenAI Reveals 'Sora': AI Video Model Capable Of Realistic Text-To-Video Prompts". Forbes. Archived from the original on February 15, 2024. Retrieved February 15, 2024.
  12. "Sora-generated video of wolves playing with some video issues". ABC News Australia. Retrieved May 16, 2024.
  13. Kilkenny, Katie (February 23, 2024). "Tyler Perry Puts $800M Studio Expansion on Hold After Seeing OpenAI's Sora: "Jobs Are Going to Be Lost"". The Hollywood Reporter. Archived from the original on February 26, 2024. Retrieved February 26, 2024.
  14. Edwards, Benj (February 23, 2024). "Tyler Perry puts $800 million studio expansion on hold because of OpenAI's Sora". Ars Technica. Archived from the original on February 26, 2024. Retrieved February 26, 2024.