VideoPoet: A Large Language Model for Zero-Shot Video Generation

A new generation of video generation models has emerged recently, many of them producing visually striking results. One of the current bottlenecks in video generation is the ability to produce coherent large motions. Even state-of-the-art models often generate only small motions or, when producing larger motions, exhibit noticeable artifacts.


VideoPoet is a large language model (LLM) built to investigate the use of language models in video generation. It can perform a wide range of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio.
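
As a rough illustration of how one language model could serve several of these tasks, the sketch below formats each request as a single token prefix that starts with a task tag, followed by the conditioning inputs. This is not VideoPoet's actual input format: the tag syntax, the special tokens, the function name `build_prefix`, and the integer token IDs are all assumptions made up for this example.

```python
from typing import Dict, List

# Task names follow the list in the text; the tag format itself is hypothetical.
TASKS = (
    "text_to_video",
    "image_to_video",
    "stylization",
    "inpainting",
    "outpainting",
    "video_to_audio",
)

# Hypothetical special tokens marking the start of the sequence and the
# boundary after which the model would begin generating output tokens.
BOS, SEP = "<bos>", "<sep>"


def build_prefix(task: str, conditioning: Dict[str, List[int]]) -> List[str]:
    """Build a task-tagged token prefix for a single sequence model.

    `conditioning` maps a modality name (e.g. "text", "video") to a list of
    already-tokenized integer IDs. Modalities are emitted in sorted order so
    that the prefix layout is deterministic.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    prefix = [BOS, f"<task:{task}>"]
    for modality in sorted(conditioning):
        prefix.append(f"<{modality}>")
        prefix.extend(str(token) for token in conditioning[modality])
        prefix.append(f"</{modality}>")
    prefix.append(SEP)
    return prefix


if __name__ == "__main__":
    # Toy token IDs; a real system would obtain them from learned tokenizers.
    print(build_prefix("text_to_video", {"text": [5, 9]}))
    print(build_prefix("stylization", {"video": [1, 2], "text": [7]}))
```

The point of the design is that switching tasks changes only the prefix, not the model: the same autoregressive decoder would continue the sequence after `<sep>` regardless of which task tag it was given.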


One noteworthy observation is that nearly all leading video generation models are diffusion-based; Imagen Video is one example. LLMs, on the other hand, are widely regarded as the de facto standard because of their strong learning capabilities across modalities, including language, code, and audio (e.g., AudioPaLM).


Unlike other models in this field, VideoPoet integrates many video generation capabilities within a single LLM instead of relying on separately trained components that each specialize in one task. For text-guided stylization, for example, the model takes in a video representing motion and depth and then paints content on top of it to produce the stylized look.



VideoPoet demonstrates that LLMs can achieve highly competitive video generation quality across a range of tasks, particularly in producing interesting, high-quality motion within videos. These results suggest a promising future for LLMs in video generation. Future work could extend the framework toward “any-to-any” generation, for example by adding text-to-audio, audio-to-video, and video captioning, among many others.