Text-to-video model

A video generated using OpenAI's unreleased Sora text-to-video model, using the prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

A text-to-video model is a machine learning model that uses a natural language description as input to produce a video relevant to the input text. [1] Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models. [2]

Models

There are different models, including open-source models. CogVideo, which takes Chinese-language input, [3] is the earliest text-to-video model to be developed, with 9.4 billion parameters; a demo version of its open-source code was first presented on GitHub in 2022. [4] That year, Meta Platforms released a partial text-to-video model called "Make-A-Video", [5] [6] [7] and Google Brain (later Google DeepMind) introduced Imagen Video, a text-to-video model based on a 3D U-Net. [8] [9] [10] [11] [12]

In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation. [13] The VideoFusion model decomposes the diffusion process into two components: base noise, which is shared across frames to promote temporal coherence, and per-frame residual noise. By utilizing a pre-trained image diffusion model as a base generator, the model efficiently generated high-quality and coherent videos. Fine-tuning the pre-trained model on video data addressed the domain gap between image and video data, enhancing its ability to produce realistic and consistent video sequences. [14] In the same month, Adobe introduced its Firefly generative AI model. [15]
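The shared-noise idea can be illustrated with a short, self-contained sketch. This is not VideoFusion's actual implementation; the mixing coefficient `alpha`, the tensor shapes, and the function name are illustrative assumptions. Each frame's noise is a weighted combination of one base tensor common to all frames and an independent per-frame residual.

```python
import numpy as np

def decomposed_video_noise(num_frames, frame_shape, alpha=0.8, seed=0):
    """Sample per-frame noise as a mix of one 'base' tensor shared by all frames
    and independent per-frame 'residual' tensors, so frames share a common
    component (encouraging temporal coherence) while still differing.
    `alpha` controls the shared fraction and is an illustrative choice."""
    rng = np.random.default_rng(seed)
    base = rng.standard_normal(frame_shape)                     # shared across frames
    residual = rng.standard_normal((num_frames, *frame_shape))  # one per frame
    # Weight the two parts so each frame's noise keeps roughly unit variance.
    return np.sqrt(alpha) * base + np.sqrt(1.0 - alpha) * residual

eps = decomposed_video_noise(num_frames=16, frame_shape=(64, 64, 3))
print(eps.shape)  # (16, 64, 64, 3)
```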

In January 2024, Google announced development of a text-to-video model named Lumiere, which is anticipated to integrate advanced video editing capabilities. [16] Matthias Niessner and Lourdes Agapito at AI company Synthesia work on developing 3D neural rendering techniques that can synthesise realistic video by using 2D and 3D neural representations of shape, appearance, and motion for controllable video synthesis of avatars. [17] In June 2024, Luma Labs launched its Dream Machine video tool. [18] [19] That same month, Kuaishou extended its Kling AI text-to-video model to international users. [20] In July 2024, TikTok owner ByteDance released Jimeng AI in China through its subsidiary Faceu Technology. [21] In September 2024, the Chinese AI company MiniMax debuted its video-01 model, joining established Chinese AI model companies such as Zhipu AI, Baichuan, and Moonshot AI. [22]

Alternative approaches and models include Text2Video-Zero, [23] Google's Phenaki, Hour One, Colossyan, [3] Runway's Gen-3 Alpha, [24] [25] and OpenAI's Sora, [26] which as of August 2024 was unreleased and available only to alpha testers. [27] Several additional text-to-video models, such as Plug-and-Play, Text2LIVE, and Tune-A-Video, have emerged. [28] Google is also preparing to launch a video generation tool named Veo for YouTube Shorts in 2025. [29] FLUX.1 developer Black Forest Labs has announced a forthcoming state-of-the-art (SOTA) text-to-video model. [30]

Architecture and Training

Several architectures have been used to create text-to-video models. Similar to text-to-image models, these can be trained using recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks, which have been used in pixel-transformation models and stochastic video generation models that aid consistency and realism, respectively. [31] Transformer models are an alternative to RNNs. Generative adversarial networks (GANs), variational autoencoders (VAEs), which can aid in the prediction of human motion, [32] and diffusion models have also been used to develop the image generation aspects of these models. [33]
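As a toy illustration of the recurrent approach described above, the following PyTorch sketch unrolls a text embedding into a sequence of frame latents with an LSTM and decodes each latent into a flattened RGB frame. The layer sizes, conditioning scheme, and output resolution are illustrative assumptions, not details of any published model.

```python
import torch
from torch import nn

class TextConditionedFramePredictor(nn.Module):
    """Toy recurrent text-to-video generator: an LSTM unrolls a text embedding
    into a sequence of frame latents, and a linear decoder maps each latent to a
    flattened RGB frame."""

    def __init__(self, text_dim=256, hidden_dim=512, frame_pixels=3 * 64 * 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=text_dim, hidden_size=hidden_dim, batch_first=True)
        self.decode = nn.Linear(hidden_dim, frame_pixels)

    def forward(self, text_emb, num_frames=16):
        # Feed the same text embedding at every time step as a simple conditioning signal.
        steps = text_emb.unsqueeze(1).repeat(1, num_frames, 1)   # (batch, frames, text_dim)
        hidden, _ = self.lstm(steps)                             # (batch, frames, hidden_dim)
        frames = torch.sigmoid(self.decode(hidden))              # pixel values in [0, 1]
        return frames.view(text_emb.size(0), num_frames, 3, 64, 64)

model = TextConditionedFramePredictor()
video = model(torch.randn(2, 256))   # two "prompt embeddings" -> two 16-frame clips
print(video.shape)                   # torch.Size([2, 16, 3, 64, 64])
```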

Text-video datasets used to train models include, but are not limited to, WebVid-10M, HDVILA-100M, CCV, ActivityNet, and Panda-70M. [34] [35] These datasets contain millions of original videos of interest, generated videos, captioned videos, and textual metadata that help train models for accuracy. Text prompt datasets used to train models include, but are not limited to, PromptSource, DiffusionDB, and VidProM. [34] [35] These datasets provide the range of text inputs needed to teach models how to interpret a variety of textual prompts.
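Conceptually, each training example in such a corpus pairs a caption with a video. A minimal sketch of that pairing follows; the file names and captions are invented for illustration only.

```python
from dataclasses import dataclass

@dataclass
class TextVideoPair:
    """One training example: a caption paired with the video file it describes."""
    caption: str
    video_path: str

# A tiny in-memory stand-in for a text-video corpus; real datasets such as
# WebVid-10M contain millions of such pairs plus metadata (duration, resolution, source).
dataset = [
    TextVideoPair("a dog running on a beach at sunset", "clips/000001.mp4"),
    TextVideoPair("timelapse of clouds over a mountain range", "clips/000002.mp4"),
]

for example in dataset:
    print(example.caption, "->", example.video_path)
```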

The video generation process involves synchronizing the text inputs with video frames, ensuring alignment and consistency throughout the sequence. [35] This predictive process is subject to a decline in quality as the length of the video increases, owing to resource limitations. [35]
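One common way to check text-video alignment is to compare a text embedding against per-frame embeddings produced by a pretrained encoder such as CLIP. The sketch below assumes such embeddings are already available and simply scores each frame by cosine similarity; it illustrates the idea rather than any specific model's evaluation pipeline.

```python
import numpy as np

def text_frame_alignment(text_emb, frame_embs):
    """Cosine similarity between one text embedding and each frame embedding.
    In practice the embeddings would come from a pretrained encoder (e.g. CLIP);
    here random arrays stand in for them."""
    text = text_emb / np.linalg.norm(text_emb)
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return frames @ text   # one alignment score per frame

rng = np.random.default_rng(0)
scores = text_frame_alignment(rng.standard_normal(512), rng.standard_normal((16, 512)))
print(scores.shape)  # (16,) -- falling scores later in the clip indicate drift from the prompt
```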

Limitations

Despite rapid advances in the performance of text-to-video models, a primary limitation is that they are computationally expensive, which limits their capacity to produce high-quality, lengthy outputs. [36] [37] Additionally, these models require large amounts of specific training data to generate high-quality and coherent outputs, which raises the issue of accessibility. [37] [36]

Moreover, models may misinterpret textual prompts, resulting in video outputs that deviate from the intended meaning. This can occur due to limitations in capturing semantic context embedded in text, which affects the model’s ability to align generated video with the user’s intended message. [37] [35] Various models, including Make-A-Video, Imagen Video, Phenaki, CogVideo, GODIVA, and NUWA, are currently being tested and refined to enhance their alignment capabilities and overall performance in text-to-video generation. [37]

Ethics

The deployment of Text-to-Video models raises ethical considerations related to content generation. These models have the potential to create inappropriate or unauthorized content, including explicit material, graphic violence, misinformation, and likenesses of real individuals without consent. [38] Ensuring that AI-generated content complies with established standards for safe and ethical usage is essential, as content generated by these models may not always be easily identified as harmful or misleading. The ability of AI to recognize and filter out NSFW or copyrighted content remains an ongoing challenge, with implications for both creators and audiences. [38]

Impacts and Applications

Text-to-video models offer a broad range of applications that may benefit various fields, from education and marketing to the creative industries. These models can streamline content creation for training videos, movie previews, gaming assets, and visualizations, making it easier to generate high-quality, dynamic content. [39] These capabilities can lower production costs and make polished video content accessible to individual creators.

Comparison of existing models

| Model/Product | Company | Year released | Status | Key features | Capabilities | Pricing | Video length | Supported languages |
|---|---|---|---|---|---|---|---|---|
| Synthesia | Synthesia | 2019 | Released | AI avatars, multilingual support for 60+ languages, customization options [40] | Specialized in realistic AI avatars for corporate training and marketing [40] | Subscription-based, starting around $30/month | Varies based on subscription | 60+ |
| InVideo AI | InVideo | 2021 | Released | AI-powered video creation, large stock library, AI talking avatars [40] | Tailored for social media content with platform-specific templates [40] | Free plan available, paid plans starting at $16/month | Varies depending on content type | Multiple (not specified) |
| Fliki | Fliki AI | 2022 | Released | Text-to-video with AI avatars and voices, extensive language and voice support [40] | Supports 65+ AI avatars and 2,000+ voices in 70 languages [40] | Free plan available, paid plans starting at $30/month | Varies based on subscription | 70+ |
| Runway Gen-2 | Runway AI | 2023 | Released | Multimodal video generation from text, images, or videos [41] | High-quality visuals, various modes like stylization and storyboard [41] | Free trial, paid plans (details not specified) | Up to 16 seconds | Multiple (not specified) |
| Pika Labs | Pika Labs | 2024 | Beta | Dynamic video generation, camera and motion customization [42] | User-friendly, focused on natural dynamic generation [42] | Currently free during beta | Flexible, supports longer videos with frame continuation | Multiple (not specified) |
| Runway Gen-3 Alpha | Runway AI | 2024 | Alpha | Enhanced visual fidelity, photorealistic humans, fine-grained temporal control [43] | Ultra-realistic video generation with precise key-framing and industry-level customization [43] | Free trial available, custom pricing for enterprises | Up to 10 seconds per clip, extendable | Multiple (not specified) |
| OpenAI Sora | OpenAI | 2024 (expected) | Alpha | Deep language understanding, high-quality cinematic visuals, multi-shot videos [44] | Capable of creating detailed, dynamic, and emotionally expressive videos; still under development with safety measures [44] | Pricing not yet disclosed | Expected to generate longer videos; duration specifics TBD | Multiple (not specified) |

Related Research Articles

Music and artificial intelligence (AI) is the development of music software programs which use AI to generate music. As with applications in other fields, AI in music also simulates mental tasks. A prominent feature is the capability of an AI algorithm to learn based on past data, such as in computer accompaniment technology, wherein the AI is capable of listening to a human performer and performing accompaniment. Artificial intelligence also drives interactive composition technology, wherein a computer composes music in response to a live performance. There are other AI applications in music that cover not only music composition, production, and performance but also how music is marketed and consumed. Several music player programs have also been developed to use voice recognition and natural language processing technology for music voice control. Current research includes the application of AI in music composition, performance, theory and digital sound processing.

Google Brain was a deep learning artificial intelligence research team that served as the sole AI branch of Google before being incorporated under the newer umbrella of Google AI, a research division at Google dedicated to artificial intelligence. Formed in 2011, it combined open-ended machine learning research with information systems and large-scale computing resources. It created tools such as TensorFlow, which allow neural networks to be used by the public, and multiple internal AI research projects, and aimed to create research opportunities in machine learning and natural language processing. It was merged into former Google sister company DeepMind to form Google DeepMind in April 2023.

Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval, text-to-image generation, aesthetic ranking, and image captioning.

Generative adversarial network

A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative artificial intelligence. The concept was initially developed by Ian Goodfellow and his colleagues in June 2014. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss.

StyleGAN

The Style Generative Adversarial Network, or StyleGAN for short, is an extension to the GAN architecture introduced by Nvidia researchers in December 2018, and made source available in February 2019.

Artificial intelligence art

Artificial intelligence art is visual artwork created or enhanced through the use of artificial intelligence (AI) programs.

An energy-based model (EBM) is an application of canonical ensemble formulation from statistical physics for learning from data. The approach prominently appears in generative artificial intelligence.

The Fréchet inception distance (FID) is a metric used to assess the quality of images created by a generative model, like a generative adversarial network (GAN) or a diffusion model.
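In its usual formulation, FID fits a multivariate Gaussian to the Inception-network features of real and generated samples and measures the Fréchet distance between the two fitted distributions:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
+ \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

where (μ_r, Σ_r) and (μ_g, Σ_g) are the feature means and covariances of the real and generated data, respectively; lower values indicate generated samples whose feature statistics are closer to the real data.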

Audio deepfake technology, also referred to as voice cloning or deepfake audio, is an application of artificial intelligence designed to generate speech that convincingly mimics specific individuals, often synthesizing phrases or sentences they have never spoken. Initially developed with the intent to enhance various aspects of human life, it has practical applications such as generating audiobooks and assisting individuals who have lost their voices due to medical conditions. Additionally, it has commercial uses, including the creation of personalized digital assistants, natural-sounding text-to-speech systems, and advanced speech translation services.

DALL-E

DALL-E, DALL-E 2, and DALL-E 3 are text-to-image models developed by OpenAI using deep learning methodologies to generate digital images from natural language descriptions known as "prompts".

Stable Diffusion

Stable Diffusion is a deep learning, text-to-image model released in 2022 based on diffusion techniques. The generative artificial intelligence technology is the premier product of Stability AI and is considered to be a part of the ongoing artificial intelligence boom.

Text-to-image model

A text-to-image model is a machine learning model which takes an input natural language description and produces an image matching that description.

In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly to the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality.
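As a concrete example of the forward process, denoising diffusion probabilistic models corrupt a data point x_0 with Gaussian noise according to a fixed variance schedule β_1, ..., β_T, so that the marginal distribution of the noised sample at step t is:

```latex
q(x_t \mid x_0) = \mathcal{N}\!\left( x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I \right),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)
```

The reverse process is then trained to undo this corruption one step at a time, and sampling runs the learned reverse process starting from pure noise.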

Generative artificial intelligence

Generative artificial intelligence is a subset of artificial intelligence that uses generative models to produce text, images, videos, or other forms of data. These models learn the underlying patterns and structures of their training data and use them to produce new data based on the input, which often comes in the form of natural language prompts.

In the 2020s, the rapid advancement of deep learning-based generative artificial intelligence models raised questions about whether copyright infringement occurs when such models are trained or used. This includes text-to-image models such as Stable Diffusion and large language models such as ChatGPT. As of 2023, there were several pending U.S. lawsuits challenging the use of copyrighted data to train AI models, with defendants arguing that this falls under fair use.

Text-to-image personalization is a task in deep learning for computer graphics that augments pre-trained text-to-image generative models. In this task, a generative model that was trained on large-scale data is adapted such that it can generate images of novel, user-provided concepts. These concepts are typically unseen during training, and may represent specific objects or more abstract categories.

Runway AI, Inc. is an American company headquartered in New York City that specializes in generative artificial intelligence research and technologies. The company is primarily focused on creating products and models for generating videos, images, and various multimedia content. It is most notable for developing the commercial text-to-video and video generative AI models Gen-1, Gen-2 and Gen-3 Alpha.

Gaussian splatting

Gaussian splatting is a volume rendering technique that deals with the direct rendering of volume data without converting the data into surface or line primitives. The technique was originally introduced as splatting by Lee Westover in the early 1990s.

Sora (text-to-video model)

Sora is an upcoming text-to-video model developed by OpenAI. The model generates short video clips based on user prompts, and can also extend existing short videos. As of December 2024, it is unreleased and not yet available to the public. OpenAI has provided no official Sora release date.

ComfyUI

ComfyUI is an open source, node-based program that allows users to generate images from a series of text prompts. It uses freely available diffusion models such as Stable Diffusion as the base model for its image capabilities, combined with other tools such as ControlNet and LCM Low-rank adaptation, with each tool represented by a node in the program.

References

  1. Artificial Intelligence Index Report 2023 (PDF) (Report). Stanford Institute for Human-Centered Artificial Intelligence. p. 98. Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.
  2. Melnik, Andrew; Ljubljanac, Michal; Lu, Cong; Yan, Qi; Ren, Weiming; Ritter, Helge (6 May 2024). "Video Diffusion Models: A Survey". arXiv: 2405.03150 [cs.CV].
  3. Wodecki, Ben (11 August 2023). "Text-to-Video Generative AI Models: The Definitive List". AI Business. Informa. Retrieved 18 November 2024.
  4. CogVideo, THUDM, 12 October 2022, retrieved 12 October 2022
  5. Davies, Teli (29 September 2022). "Make-A-Video: Meta AI's New Model For Text-To-Video Generation". Weights & Biases. Retrieved 12 October 2022.
  6. Monge, Jim Clyde (3 August 2022). "This AI Can Create Video From Text Prompt". Medium. Retrieved 12 October 2022.
  7. "Meta's Make-A-Video AI creates videos from text". www.fonearena.com. Retrieved 12 October 2022.
  8. "google: Google takes on Meta, introduces own video-generating AI". The Economic Times . 6 October 2022. Retrieved 12 October 2022.
  9. Monge, Jim Clyde (3 August 2022). "This AI Can Create Video From Text Prompt". Medium. Retrieved 12 October 2022.
  10. "Nuh-uh, Meta, we can do text-to-video AI, too, says Google". The Register . Retrieved 12 October 2022.
  11. "Papers with Code - See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction". paperswithcode.com. Retrieved 12 October 2022.
  12. "Papers with Code - Text-driven Video Prediction". paperswithcode.com. Retrieved 12 October 2022.
  13. Luo, Zhengxiong; Chen, Dayou; Zhang, Yingya; Huang, Yan; Wang, Liang; Shen, Yujun; Zhao, Deli; Zhou, Jingren; Tan, Tieniu (2023). "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation". arXiv: 2303.08320 [cs.CV].
  14. Luo, Zhengxiong; Chen, Dayou; Zhang, Yingya; Huang, Yan; Wang, Liang; Shen, Yujun; Zhao, Deli; Zhou, Jingren; Tan, Tieniu (2023). "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation". arXiv: 2303.08320 [cs.CV].
  15. "Adobe launches Firefly Video model and enhances image, vector and design models. Adobe Newsroom". Adobe Inc. 10 October 2024. Retrieved 18 November 2024.
  16. Yirka, Bob (26 January 2024). "Google announces the development of Lumiere, an AI-based next-generation text-to-video generator". Tech Xplore. Retrieved 18 November 2024.
  17. "Text to Speech for Videos". Synthesia.io. Retrieved 17 October 2023.
  18. Nuñez, Michael (12 June 2024). "Luma AI debuts 'Dream Machine' for realistic video generation, heating up AI media race". VentureBeat. Retrieved 18 November 2024.
  19. Fink, Charlie. "Apple Debuts Intelligence, Mistral Raises $600 Million, New AI Text-To-Video". Forbes. Retrieved 18 November 2024.
  20. Franzen, Carl (12 June 2024). "What you need to know about Kling, the AI video generator rival to Sora that's wowing creators". VentureBeat. Retrieved 18 November 2024.
  21. "ByteDance joins OpenAI's Sora rivals with AI video app launch". Reuters. 6 August 2024. Retrieved 18 November 2024.
  22. "Chinese ai "tiger" minimax launches text-to-video-generating model to rival OpenAI's sora". Yahoo! Finance. 2 September 2024. Retrieved 18 November 2024.
  23. Text2Video-Zero, Picsart AI Research (PAIR), 12 August 2023, retrieved 12 August 2023
  24. Kemper, Jonathan (1 July 2024). "Runway's Sora competitor Gen-3 Alpha now available". THE DECODER. Retrieved 18 November 2024.
  25. "Generative AI's Next Frontier Is Video". Bloomberg.com. 20 March 2023. Retrieved 18 November 2024.
  26. "OpenAI teases 'Sora,' its new text-to-video AI model". NBC News. 15 February 2024. Retrieved 18 November 2024.
  27. Kelly, Chris (25 June 2024). "Toys R Us creates first brand film to use OpenAI's text-to-video tool". Marketing Dive. Informa . Retrieved 18 November 2024.
  28. Jin, Jiayao; Wu, Jianhang; Xu, Zhoucheng; Zhang, Hang; Wang, Yaxin; Yang, Jielong (4 August 2023). "Text to Video: Enhancing Video Generation Using Diffusion Models and Reconstruction Network". 2023 2nd International Conference on Computing, Communication, Perception and Quantum Technology (CCPQT). IEEE. pp. 108–114. doi:10.1109/CCPQT60491.2023.00024. ISBN   979-8-3503-4269-7.
  29. Forlini, Emily Dreibelbis (18 September 2024). "Google's veo text-to-video AI generator is coming to YouTube shorts". PC Magazine . Retrieved 18 November 2024.
  30. "Announcing Black Forest Labs". Black Forest Labs. 1 August 2024. Retrieved 18 November 2024.
  31. Bhagwatkar, Rishika; Bachu, Saketh; Fitter, Khurshed; Kulkarni, Akshay; Chiddarwar, Shital (17 December 2020). "A Review of Video Generation Approaches". 2020 International Conference on Power, Instrumentation, Control and Computing (PICC). IEEE. pp. 1–5. doi:10.1109/PICC51425.2020.9362485. ISBN   978-1-7281-7590-4.
  32. Kim, Taehoon; Kang, ChanHee; Park, JaeHyuk; Jeong, Daun; Yang, ChangHee; Kang, Suk-Ju; Kong, Kyeongbo (3 January 2024). "Human Motion Aware Text-to-Video Generation with Explicit Camera Control". 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE. pp. 5069–5078. doi:10.1109/WACV57701.2024.00500. ISBN   979-8-3503-1892-0.
  33. Singh, Aditi (9 May 2023). "A Survey of AI Text-to-Image and AI Text-to-Video Generators". 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC). IEEE. pp. 32–36. arXiv: 2311.06329 . doi:10.1109/AIRC57904.2023.10303174. ISBN   979-8-3503-4824-8.
  34. Miao, Yibo; Zhu, Yifan; Dong, Yinpeng; Yu, Lijia; Zhu, Jun; Gao, Xiao-Shan (8 September 2024). "T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models". arXiv: 2407.05965 [cs.CV].
  35. Zhang, Ji; Mei, Kuizhi; Wang, Xiao; Zheng, Yu; Fan, Jianping (August 2018). "From Text to Video: Exploiting Mid-Level Semantics for Large-Scale Video Classification". 2018 24th International Conference on Pattern Recognition (ICPR). IEEE. pp. 1695–1700. doi:10.1109/ICPR.2018.8545513. ISBN 978-1-5386-3788-3.
  36. Bhagwatkar, Rishika; Bachu, Saketh; Fitter, Khurshed; Kulkarni, Akshay; Chiddarwar, Shital (17 December 2020). "A Review of Video Generation Approaches". 2020 International Conference on Power, Instrumentation, Control and Computing (PICC). IEEE. pp. 1–5. doi:10.1109/PICC51425.2020.9362485. ISBN 978-1-7281-7590-4.
  37. Singh, Aditi (9 May 2023). "A Survey of AI Text-to-Image and AI Text-to-Video Generators". 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC). IEEE. pp. 32–36. arXiv: 2311.06329. doi:10.1109/AIRC57904.2023.10303174. ISBN 979-8-3503-4824-8.
  38. Miao, Yibo; Zhu, Yifan; Dong, Yinpeng; Yu, Lijia; Zhu, Jun; Gao, Xiao-Shan (8 September 2024). "T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models". arXiv: 2407.05965 [cs.CV].
  39. Singh, Aditi (9 May 2023). "A Survey of AI Text-to-Image and AI Text-to-Video Generators". 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC). IEEE. pp. 32–36. arXiv: 2311.06329 . doi:10.1109/AIRC57904.2023.10303174. ISBN   979-8-3503-4824-8.
  40. "Top AI Video Generation Models of 2024". Deepgram. Retrieved 30 August 2024.
  41. "Runway Research | Gen-2: Generate novel videos with text, images or video clips". runwayml.com. Retrieved 30 August 2024.
  42. Sharma, Shubham (26 December 2023). "Pika Labs' text-to-video AI platform opens to all: Here's how to use it". VentureBeat. Retrieved 30 August 2024.
  43. "Runway Research | Introducing Gen-3 Alpha: A New Frontier for Video Generation". runwayml.com. Retrieved 30 August 2024.
  44. "Sora | OpenAI". openai.com. Retrieved 30 August 2024.