Title: Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
Authors: Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra
Published: 17th November 2023 (Friday) @ 18:59:04
Link: http://arxiv.org/abs/2311.10709v2
Abstract
We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions (adjusted noise schedules for diffusion, and multi-stage training) that enable us to directly generate high quality and high resolution videos, without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality compared to all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorizing approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% over prior work.
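A minimal sketch of the two-step factorization described above, assuming generic text-to-image and (text + image)-to-video models passed in as callables; the interfaces below are hypothetical placeholders for illustration, not the authors' actual implementation.

```python
from typing import Callable, List

# Stand-in types for illustration only (hypothetical, not from the paper).
Image = object          # a single generated image / first frame
Video = List[Image]     # a video represented as a list of frames


def generate_video(
    prompt: str,
    text_to_image: Callable[[str], Image],
    image_to_video: Callable[[str, Image], Video],
) -> Video:
    # Step 1: generate an image conditioned on the text prompt.
    first_frame = text_to_image(prompt)

    # Step 2: generate the video conditioned on both the text prompt and
    # the generated image, which serves as an explicit conditioning frame.
    return image_to_video(prompt, first_frame)
```

Because the video model is conditioned on an explicit image, the same second step can animate a user-provided image from a text prompt, which is the image-animation use case mentioned at the end of the abstract.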