VideoPoet: Google’s Amazing New AI for Video Generation

Google Research has recently unveiled VideoPoet, a new artificial intelligence (AI) model that can generate stunning videos from various inputs. VideoPoet is a large language model (LLM), a type of AI that is usually used for generating text and code. However, the Google Research team trained it to produce videos instead, using a massive dataset of 270 million videos and more than 1 billion text-and-image pairs from the internet and other sources.

How VideoPoet works

VideoPoet is based on the transformer architecture, a neural network design that allows for efficient and flexible learning from sequential data. It converts the input data into text embeddings, visual tokens, and audio tokens, and then uses these as “conditions” for generating the output video. For example, if the input is a text prompt, it will create a video that matches the text description.
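To make the idea of token-based conditioning concrete, here is a minimal, purely illustrative Python sketch of autoregressive generation over discrete tokens. Every name in it (the dummy model, the vocabulary size, the token values) is a placeholder; VideoPoet's actual tokenizers and model are not public.

```python
import random

class DummyNextTokenModel:
    """Stand-in for a trained autoregressive LLM over a discrete token vocabulary."""
    def __init__(self, vocab_size=8192):
        self.vocab_size = vocab_size

    def predict_next(self, sequence):
        # A real model would score every token in the vocabulary given the
        # sequence so far; here we sample uniformly just for illustration.
        return random.randrange(self.vocab_size)

def generate_video_tokens(model, condition_tokens, num_output_tokens):
    """Autoregressively extend the conditioning prefix, one token at a time."""
    sequence = list(condition_tokens)                  # text/image/audio tokens act as the prefix
    for _ in range(num_output_tokens):
        sequence.append(model.predict_next(sequence))  # next visual token given everything so far
    return sequence[len(condition_tokens):]            # only the newly generated tokens

# Toy usage: "condition" on three prompt tokens and generate sixteen video tokens.
tokens = generate_video_tokens(DummyNextTokenModel(), condition_tokens=[5, 42, 7], num_output_tokens=16)
print(len(tokens))  # 16
```

In the real system the generated tokens would be decoded back into video frames by a separate decoder; this sketch stops at the token level.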

It is different from most existing video generation models, which use diffusion-based methods. These methods start with a pre-trained image model that produces high-fidelity images for individual frames and then fine-tunes the model to improve temporal consistency across video frames. However, diffusion-based models often struggle with producing coherent large motions and tend to generate artifacts or glitches when the motion is too large or complex.

Why VideoPoet is better

VideoPoet, on the other hand, can generate larger and more consistent motion across longer, 16-frame videos without compromising the quality or realism of the video. It can also simulate different camera motions and different visual and aesthetic styles, and even generate new audio to match the video. Moreover, it can handle a range of inputs, including text, images, and videos, and use them as prompts for generating new videos.

It is an all-in-one solution for video creation, integrating all of these video generation capabilities within a single LLM. This eliminates the need for multiple specialized components and offers a seamless, versatile experience for users.

The Google Research team has demonstrated VideoPoet's impressive results in their preprint research paper and their blog post. They also compared VideoPoet with other video generation models, such as Show-1, VideoCrafter, and Phenaki. They showed video clips generated by VideoPoet and the competing models to human raters, who preferred VideoPoet's clips in most cases.

According to the Google Research blog post: “On average people selected 24–35% of examples from VideoPoet as following prompts better than a competing model vs. 8–11% for competing models. Raters also preferred 41–54% of examples from VideoPoet for more interesting motion than 11–21% for other models.”

What’s next for VideoPoet

VideoPoet (image source: Google Research)

VideoPoet is designed to produce videos in portrait orientation by default, or “vertical video”, catering to the mobile video market popularized by Snap and TikTok. However, the Google Research team envisions expanding VideoPoet’s capabilities to support “any-to-any” generation tasks, such as text-to-audio and audio-to-video, further pushing the boundaries of what’s possible in video and audio generation.

The only drawback is that VideoPoet is not yet available for public use. The Google Research team has not announced when it will be released or how it will be integrated with Google's products and services. Until then, we can only wait to see how it compares to other tools on the market.

What is the transformer architecture?

The transformer architecture is a type of neural network capable of processing sequential data such as text, audio, video, and images.

Unlike recurrent or convolutional networks, the transformer does not rely on loops or filters; instead, it uses a mechanism known as attention to learn the dependencies between input and output elements. The transformer is made up of two components: an encoder and a decoder.
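Before looking at the encoder and decoder, here is a rough NumPy sketch of the attention computation itself (scaled dot-product attention); the array sizes are arbitrary and chosen only for demonstration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to all keys and returns a weighted mix of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                         # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax over the keys
    return weights @ V                                      # weighted sum of the values

# Toy example: 4 query positions, 6 key/value positions, 8-dimensional embeddings.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```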

The encoder converts an input sequence into a series of continuous representations, and the decoder generates an output sequence from the encoder’s output and previous outputs. The transformer is capable of performing a variety of tasks, including machine translation, natural language understanding, speech recognition, and computer vision.
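To see how the encoder and decoder fit together in code, here is a small PyTorch sketch using the built-in nn.Transformer module with toy dimensions; a real model would add token embeddings, positional encodings, and attention masks on top of this.

```python
import torch
import torch.nn as nn

# Tiny encoder-decoder transformer with illustrative, not realistic, sizes.
model = nn.Transformer(d_model=32, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(1, 10, 32)   # encoder input: 10 source positions, 32-dim embeddings
tgt = torch.randn(1, 7, 32)    # decoder input: 7 previously generated positions
out = model(src, tgt)          # decoder output, conditioned on the encoder's memory
print(out.shape)               # torch.Size([1, 7, 32])
```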