Text-to-video model

A text-to-video model is a machine learning model that takes a natural language description as input and produces a video, or multiple videos, matching that description. [1]

Video prediction, the task of making objects move realistically against a stable background, can be performed with a recurrent neural network in a sequence-to-sequence model, paired with a convolutional neural network that encodes and decodes each frame pixel by pixel, [2] creating video with deep learning. [3] Conditional generative models that produce video from existing text information can be built and evaluated with variational autoencoders (VAEs) and generative adversarial networks (GANs).
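The encode-recur-decode pipeline described above can be sketched in a few lines. This is a minimal toy illustration, not any published architecture: the frame sizes and dimensions are arbitrary, the "encoder" and "decoder" are stand-in linear maps rather than trained CNNs, and the recurrence is a plain tanh update rather than an LSTM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): 8x8 grayscale frames, 16-dim hidden state.
H = W = 8
D = H * W          # flattened frame size
S = 16             # hidden/latent size

# Randomly initialised weights stand in for trained parameters.
W_enc = rng.normal(scale=0.1, size=(S, D))   # per-frame encoder (stand-in for a CNN)
W_in  = rng.normal(scale=0.1, size=(S, S))   # input-to-hidden map
W_rec = rng.normal(scale=0.1, size=(S, S))   # recurrent transition
W_dec = rng.normal(scale=0.1, size=(D, S))   # per-frame decoder (stand-in for a CNN)

def predict_next_frame(frames):
    """Encode each context frame, run a simple tanh recurrence over the
    sequence, then decode the final hidden state into a predicted frame."""
    h = np.zeros(S)
    for frame in frames:
        z = W_enc @ frame.reshape(-1)        # encode frame -> latent vector
        h = np.tanh(W_rec @ h + W_in @ z)    # recurrent state update
    return (W_dec @ h).reshape(H, W)         # decode hidden state -> frame

context = [rng.random((H, W)) for _ in range(5)]   # 5 context frames
nxt = predict_next_frame(context)
print(nxt.shape)   # (8, 8)
```

In a real model the linear maps would be convolutional networks trained end to end, and the predicted frame would be fed back in to generate a sequence.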

Models

Several text-to-video models exist, including open-source ones. The demo version of CogVideo is an early text-to-video model "of 9.4 billion parameters", with its code available on GitHub. [4] Meta Platforms has a partial text-to-video [note 1] model called "Make-A-Video". [5] [6] [7] Google Brain has released a research paper introducing Imagen Video, a text-to-video model built on a 3D U-Net. [8] [9] [10] [11] [12]
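The core operation that distinguishes a 3D U-Net from its 2D counterpart is convolution over time as well as space, so each output value mixes information from neighbouring frames. A minimal sketch of that operation, with illustrative sizes and implemented as cross-correlation (the deep-learning convention) rather than the full U-Net from the cited paper:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d(video, kernel):
    """'Valid' 3D convolution (no padding) of a (T, H, W) video with a
    (kT, kH, kW) kernel: the basic building block a 3D U-Net stacks to
    mix information across time as well as space."""
    windows = sliding_window_view(video, kernel.shape)   # (T', H', W', kT, kH, kW)
    return np.einsum('thwijk,ijk->thw', windows, kernel)

video = np.random.default_rng(1).random((16, 32, 32))    # 16 frames of 32x32 pixels
kernel = np.full((3, 3, 3), 1 / 27)                      # spatio-temporal box filter
out = conv3d(video, kernel)
print(out.shape)   # (14, 30, 30): each output mixes 3 frames and a 3x3 patch
```

A learned model would use many such kernels per layer (with channels, padding, and down/upsampling), but the temporal axis in the kernel is what lets the network model motion.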

In March 2023, a landmark research paper by Alibaba was published that applied many of the principles found in latent image diffusion models to video generation. [13] [14] Services such as Kaiber and Reemix have since adopted similar approaches to video generation in their respective products.
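The shared principle behind latent diffusion models is a fixed forward process that gradually noises a compressed latent, with a network trained to invert it. A minimal sketch of that forward step under standard DDPM-style assumptions (the schedule, step count, and latent shape here are illustrative, not taken from the cited paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical linear noise schedule over 1000 diffusion steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

def noise_latent(z0, t):
    """Forward diffusion: q(z_t | z_0) = N(sqrt(a_bar_t) * z_0, (1 - a_bar_t) * I).
    In latent video diffusion, z0 is a compressed spatio-temporal latent
    rather than raw pixels; a denoiser is trained to predict eps and so
    invert this step at sampling time."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

z0 = rng.standard_normal((8, 4, 4))   # toy latent: 8 frames of 4x4 latent maps
zt, eps = noise_latent(z0, t=500)
print(zt.shape)   # (8, 4, 4)
```

Generation then runs the learned reverse process from pure noise, conditioned on the text prompt, and decodes the final latent back into video frames.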

Matthias Niessner and Lourdes Agapito at the AI company Synthesia work on developing 3D neural rendering techniques that can synthesise realistic video, using 2D and 3D neural representations of shape, appearance, and motion for controllable video synthesis of avatars. [15]

Alternative approaches to text-to-video models exist. [16]

Footnotes

  1. It can also generate videos from images, interpolate video between two images, and produce variations of existing videos.

References

  1. Artificial Intelligence Index Report 2023 (PDF) (Report). Stanford Institute for Human-Centered Artificial Intelligence. p. 98. Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.
  2. "Leading India" (PDF).
  3. Narain, Rohit (2021-12-29). "Smart Video Generation from Text Using Deep Neural Networks". Retrieved 2022-10-12.
  4. CogVideo, THUDM, 2022-10-12, retrieved 2022-10-12
  5. Davies, Teli (2022-09-29). "Make-A-Video: Meta AI's New Model For Text-To-Video Generation". Weights & Biases. Retrieved 2022-10-12.
  6. Monge, Jim Clyde (2022-08-03). "This AI Can Create Video From Text Prompt". Medium. Retrieved 2022-10-12.
  7. "Meta's Make-A-Video AI creates videos from text". www.fonearena.com. Retrieved 2022-10-12.
  8. "google: Google takes on Meta, introduces own video-generating AI - The Economic Times". m.economictimes.com. Retrieved 2022-10-12.
  9. Monge, Jim Clyde (2022-08-03). "This AI Can Create Video From Text Prompt". Medium. Retrieved 2022-10-12.
  10. "Nuh-uh, Meta, we can do text-to-video AI, too, says Google". www.theregister.com. Retrieved 2022-10-12.
  11. "Papers with Code - See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction". paperswithcode.com. Retrieved 2022-10-12.
  12. "Papers with Code - Text-driven Video Prediction". paperswithcode.com. Retrieved 2022-10-12.
  13. "Home - DAMO Academy". damo.alibaba.com. Retrieved 2023-08-12.
  14. Luo, Zhengxiong; Chen, Dayou; Zhang, Yingya; Huang, Yan; Wang, Liang; Shen, Yujun; Zhao, Deli; Zhou, Jingren; Tan, Tieniu (2023). "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation". arXiv: 2303.08320 [cs.CV].
  15. "Text to Speech for Videos". Retrieved 2023-10-17.
  16. Text2Video-Zero, Picsart AI Research (PAIR), 2023-08-12, retrieved 2023-08-12