Video generation models as world simulators
Excerpt
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
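The "spacetime patches" idea in the abstract can be illustrated with a small sketch. The report does not state patch sizes or latent shapes, so everything concrete below (the `(T, H, W, C)` latent layout, the patch sizes `t` and `p`, and the helper name `spacetime_patchify`) is an illustrative assumption, not Sora's actual implementation:

```python
import numpy as np

def spacetime_patchify(latent, t=2, p=4):
    """Cut a video latent of shape (T, H, W, C) into non-overlapping
    t x p x p spacetime blocks and flatten each block into one token.
    Shapes and patch sizes are hypothetical, for illustration only."""
    T, H, W, C = latent.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    # Split each spatiotemporal axis into (num_patches, patch_size).
    x = latent.reshape(T // t, t, H // p, p, W // p, p, C)
    # Group the patch-index axes together, then the within-patch axes.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per token: (num_tokens, t * p * p * C).
    return x.reshape(-1, t * p * p * C)

# Example: an 8-frame, 16x16, 4-channel latent yields
# (8/2) * (16/4) * (16/4) = 64 tokens of dimension 2*4*4*4 = 128.
tokens = spacetime_patchify(np.zeros((8, 16, 16, 4)))
```

A transformer then attends over this token sequence; because the tokenization is just a reshape, the same scheme handles images (T equal to the temporal patch size) and videos of varying duration, resolution, and aspect ratio by producing variable-length sequences.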
References

- Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhudinov. "Unsupervised learning of video representations using LSTMs." International Conference on Machine Learning. PMLR, 2015.
- Chiappa, Silvia, et al. "Recurrent environment simulators." arXiv preprint arXiv:1704.02254 (2017).
- Ha, David, and Jürgen Schmidhuber. "World models." arXiv preprint arXiv:1803.10122 (2018).
- Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." Advances in Neural Information Processing Systems 29 (2016).
- Tulyakov, Sergey, et al. "MoCoGAN: Decomposing motion and content for video generation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
- Clark, Aidan, Jeff Donahue, and Karen Simonyan. "Adversarial video generation on complex datasets." arXiv preprint arXiv:1907.06571 (2019).
- Brooks, Tim, et al. "Generating long videos of dynamic scenes." Advances in Neural Information Processing Systems 35 (2022): 31769-31781.
- Yan, Wilson, et al. "VideoGPT: Video generation using VQ-VAE and transformers." arXiv preprint arXiv:2104.10157 (2021).
- Wu, Chenfei, et al. "NÜWA: Visual synthesis pre-training for neural visual world creation." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
- Ho, Jonathan, et al. "Imagen Video: High definition video generation with diffusion models." arXiv preprint arXiv:2210.02303 (2022).
- Blattmann, Andreas, et al. "Align your latents: High-resolution video synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
- Gupta, Agrim, et al. "Photorealistic video generation with diffusion models." arXiv preprint arXiv:2312.06662 (2023).
- Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
- Brown, Tom, et al. "Language models are few-shot learners." Advances in Neural Information Processing Systems 33 (2020): 1877-1901.
- Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
- Arnab, Anurag, et al. "ViViT: A video vision transformer." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
- He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
- Dehghani, Mostafa, et al. "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution." arXiv preprint arXiv:2307.06304 (2023).
- Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
- Kingma, Diederik P., and Max Welling. "Auto-encoding variational Bayes." arXiv preprint arXiv:1312.6114 (2013).
- Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
- Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
- Nichol, Alexander Quinn, and Prafulla Dhariwal. "Improved denoising diffusion probabilistic models." International Conference on Machine Learning. PMLR, 2021.
- Dhariwal, Prafulla, and Alexander Quinn Nichol. "Diffusion models beat GANs on image synthesis." Advances in Neural Information Processing Systems. 2021.
- Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." Advances in Neural Information Processing Systems 35 (2022): 26565-26577.
- Peebles, William, and Saining Xie. "Scalable diffusion models with transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
- Chen, Mark, et al. "Generative pretraining from pixels." International Conference on Machine Learning. PMLR, 2020.
- Ramesh, Aditya, et al. "Zero-shot text-to-image generation." International Conference on Machine Learning. PMLR, 2021.
- Yu, Jiahui, et al. "Scaling autoregressive models for content-rich text-to-image generation." arXiv preprint arXiv:2206.10789 (2022).
- Betker, James, et al. "Improving image generation with better captions." https://cdn.openai.com/papers/dall-e-3.pdf (2023).
- Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with CLIP latents." arXiv preprint arXiv:2204.06125 (2022).
- Meng, Chenlin, et al. "SDEdit: Guided image synthesis and editing with stochastic differential equations." arXiv preprint arXiv:2108.01073 (2021).