VideoPoet

VideoPoet
Developer(s): Google
Initial release: February 8, 2024
Type: Large language model

VideoPoet is a large language model developed by Google Research in 2023 for video generation. [1] [2] [3] [4] It can be asked to animate still images. [5] The model accepts text, images, and videos as inputs, and can generate video content from any of these input formats. [4] VideoPoet was publicly announced on December 19, 2023. [1] It uses an autoregressive language model, generating output tokens one at a time.
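
VideoPoet's model and code are not publicly available, so the following is only a minimal illustrative sketch of the pattern described above: conditioning inputs from different modalities are mapped into one shared token sequence, and video tokens are then predicted autoregressively. Every name and number in the sketch (the toy `next_token` function, the vocabulary size, the token values) is hypothetical and not part of Google's system.

```python
# Hypothetical sketch of autoregressive any-to-any generation.
# VideoPoet's actual code is not public; this only illustrates conditioning on a
# shared token sequence and predicting video tokens one at a time.
import random

VIDEO_VOCAB_SIZE = 32   # assumed size of a discrete video-token vocabulary
NUM_VIDEO_TOKENS = 12   # number of video tokens to generate in this toy example

def next_token(prefix):
    """Stand-in for a trained autoregressive model: one token given the prefix."""
    rng = random.Random(sum(prefix))         # deterministic toy behaviour
    return rng.randrange(VIDEO_VOCAB_SIZE)

def generate_video_tokens(text_tokens, image_tokens):
    """Concatenate conditioning tokens, then append predicted video tokens."""
    sequence = list(text_tokens) + list(image_tokens)
    video_tokens = []
    for _ in range(NUM_VIDEO_TOKENS):
        tok = next_token(sequence)
        sequence.append(tok)
        video_tokens.append(tok)
    return video_tokens                      # would then be decoded into frames

print(generate_video_tokens(text_tokens=[5, 9, 2], image_tokens=[17, 3]))
```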

Related Research Articles

Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals. Such machines may be called AIs.

Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, artificial neural networks have been able to surpass many previous approaches in performance.

A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.

<span class="mw-page-title-main">Deep learning</span> Branch of machine learning

Deep learning is a subset of machine learning methods based on neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be supervised, semi-supervised, or unsupervised.
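
As a rough illustration of what "multiple layers" means (not any particular system's implementation), the toy sketch below stacks two small fully connected layers with a nonlinearity between them; the sizes and random weights are made up for illustration.

```python
# Toy illustration of "depth": two stacked layers, each a linear map followed
# by a nonlinearity. Real deep networks use many more layers and learned
# weights; the weights below are random placeholders.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))    # layer 1: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))    # layer 2: 4 hidden units -> 2 outputs

def forward(x):
    h = relu(W1 @ x)            # layer 1 builds an internal representation
    return W2 @ h               # layer 2 computes the output from it

print(forward(np.array([1.0, -0.5, 2.0])))
```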

Google Brain was a deep learning artificial intelligence research team under the umbrella of Google AI, a research division at Google dedicated to artificial intelligence. Formed in 2011, it combined open-ended machine learning research with information systems and large-scale computing resources. It created tools such as TensorFlow, which allow neural networks to be used by the public, and multiple internal AI research projects, and aimed to create research opportunities in machine learning and natural language processing. It was merged into former Google sister company DeepMind to form Google DeepMind in April 2023.

<span class="mw-page-title-main">Google DeepMind</span> Artificial intelligence division

Google DeepMind Technologies Limited is a British-American artificial intelligence research laboratory which serves as a subsidiary of Google. Founded in the UK in 2010, it was acquired by Google in 2014 and merged with Google AI's Google Brain division to become Google DeepMind in April 2023. The company is based in London, with research centres in Canada, France, Germany, and the United States.

Multimodal learning, in the context of machine learning, is a type of deep learning using multiple modalities of data, such as text, audio, or images.

<span class="mw-page-title-main">OpenAI</span> Artificial intelligence research organization

OpenAI is an American artificial intelligence (AI) research organization founded in December 2015. Its mission is to develop "safe and beneficial" artificial general intelligence, which it defines as "highly autonomous systems that outperform humans at most economically valuable work". As a leading organization in the ongoing AI boom, OpenAI has developed several large language models, advanced image generation models, and previously, released open-source models. Its release of ChatGPT has been credited with catalyzing widespread interest in AI.

<span class="mw-page-title-main">Artificial intelligence art</span> Machine application of knowledge of human aesthetic expressions

Artificial intelligence art is any visual artwork created through the use of an artificial intelligence (AI) program.

<span class="mw-page-title-main">Transformer (deep learning architecture)</span> Machine learning algorithm used for natural-language processing

A transformer is a deep learning architecture developed by Google, based on the multi-head attention mechanism proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector by looking it up in a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. The transformer builds on the softmax-based attention mechanism proposed by Bahdanau et al. in 2014 for machine translation, and on the Fast Weight Controller, a similar architecture proposed in 1992.
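
As a rough illustration of the attention mechanism described above (not Google's implementation), the sketch below computes scaled dot-product attention over a few token vectors: each token's output is a softmax-weighted mixture of all value vectors, so more relevant tokens contribute more. Shapes and values are invented; real transformers add learned projections and many heads in parallel.

```python
# Illustrative scaled dot-product attention over a few token vectors.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # how strongly each token attends to the others
    return weights @ V                   # weighted mixture of value vectors

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))         # 5 tokens, embedding dimension 8
out = attention(tokens, tokens, tokens)  # self-attention: Q, K, V from the same tokens
print(out.shape)                         # (5, 8): one contextualized vector per token
```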

<span class="mw-page-title-main">Timeline of computing 2020–present</span> Historical timeline

This article presents a detailed timeline of events in the history of computing from 2020 to the present. For narratives explaining the overall developments, see the history of computing.

<span class="mw-page-title-main">DALL-E</span> Image-generating deep-learning model

DALL·E, DALL·E 2, and DALL·E 3 are text-to-image models developed by OpenAI using deep learning methodologies to generate digital images from natural language descriptions known as "prompts".

Prompt engineering is the process of structuring an instruction that can be interpreted and understood by a generative AI model. A prompt is natural language text describing the task that an AI should perform.

<span class="mw-page-title-main">Stable Diffusion</span> Image-generating machine learning model

Stable Diffusion is a deep learning text-to-image model released in 2022, based on diffusion techniques. The generative artificial intelligence technology is the premier product of Stability AI and is considered to be a part of the ongoing artificial intelligence boom.

<span class="mw-page-title-main">Text-to-image model</span> Machine learning model

A text-to-image model is a machine learning model which takes an input natural language description and produces an image matching that description.

A text-to-video model is a machine learning model that takes a natural language description as input and produces a video, or multiple videos, matching that description.

<span class="mw-page-title-main">Hallucination (artificial intelligence)</span> Confident unjustified claim by AI

In the field of artificial intelligence (AI), a hallucination or artificial hallucination is a response generated by AI which contains false or misleading information presented as fact. This term draws a loose analogy with human psychology, where hallucination typically involves false percepts. However, there is a key difference: AI hallucination is associated with unjustified responses or beliefs rather than perceptual experiences.

A large language model (LLM) is a computational model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. Based on language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a computationally intensive self-supervised and semi-supervised training process. LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.
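
The "repeatedly predicting the next token" loop can be made concrete with a toy example. In the sketch below, bigram counts from a tiny made-up corpus stand in for an LLM's learned distribution over the next word; a real model learns a far richer distribution over a vocabulary of many thousands of tokens.

```python
# Toy illustration of next-token prediction. Bigram counts from a tiny corpus
# stand in for the learned distribution of a real large language model.
import random
from collections import defaultdict, Counter

corpus = "the model predicts the next word and the next word follows the model".split()

# Count which word follows which (a crude stand-in for a trained model).
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word(prev, rng):
    counts = bigrams[prev]
    words, weights = zip(*counts.items())
    return rng.choices(words, weights=weights)[0]

def generate(start, length, seed=0):
    """Autoregressive generation: append one sampled word at a time."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        out.append(next_word(out[-1], rng))
    return " ".join(out)

print(generate("the", 6))
```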

<span class="mw-page-title-main">Generative artificial intelligence</span> AI system capable of generating content in response to prompts

Generative artificial intelligence is artificial intelligence capable of generating text, images, videos, or other data using generative models, often in response to prompts. Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.

<span class="mw-page-title-main">Gemini (language model)</span> Large language model developed by Google

Google Gemini is a family of multimodal large language models developed by Google DeepMind, serving as the successor to LaMDA and PaLM 2. Comprising Gemini Ultra, Gemini Pro, Gemini Flash, and Gemini Nano, it was announced on December 6, 2023, positioned as a competitor to OpenAI's GPT-4. It powers the chatbot of the same name.

References

  1. Krithika, K. L. (December 20, 2023). "Google Unveils VideoPoet, a New LLM for Video Generation". Analytics India Magazine. Retrieved April 29, 2024.
  2. Kondratyuk, Dan; Yu, Lijun; Gu, Xiuye; Lezama, José; Huang, Jonathan; Hornung, Rachel; Adam, Hartwig; Akbari, Hassan; Alon, Yair; Birodkar, Vighnesh; Cheng, Yong; Chiu, Ming-Chang; Dillon, Josh; Essa, Irfan; Gupta, Agrim; Hahn, Meera; Hauth, Anja; Hendon, David; Martinez, Alonso; Minnen, David; Ross, David; Schindler, Grant; Sirotenko, Mikhail; Sohn, Kihyuk; Somandepalli, Krishna; Wang, Huisheng; Yan, Jimmy; Yang, Ming-Hsuan; Yang, Xuan; Seybold, Bryan; Jiang, Lu (December 21, 2023). "VideoPoet: A Large Language Model for Zero-Shot Video Generation". arXiv: 2312.14125 [cs.CV].
  3. "Google has introduced VideoPOET breaking new ground in coherent video generation". Gizmochina. December 21, 2023.
  4. 1 2 "VideoPoet". Google Research. Retrieved April 29, 2024.
  5. Franzen, Carl (December 20, 2023). "Google's new multimodal AI video generator VideoPoet looks incredible". VentureBeat.