Title: MERLOT: Multimodal Neural Script Knowledge Models
Authors: Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi
Published: 4th June 2021 (Friday) @ 17:57:39
Link: http://arxiv.org/abs/2106.02636v3

Abstract

As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech — in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%, even those that make heavy use of auxiliary supervised data (like object bounding boxes). Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.
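
The frame-level (spatial) objective described in the abstract can be pictured as a CLIP-style contrastive match between each video frame and the transcript words spoken around it. The sketch below is only an illustration of that idea, not the authors' released code: the function name, the temperature value, and the assumption that frame and transcript-segment embeddings arrive pre-pooled as (N, D) tensors are choices made here for brevity.

# Minimal sketch (assumed setup, not MERLOT's implementation) of a
# frame-to-transcript contrastive matching loss.
import torch
import torch.nn.functional as F

def frame_transcript_contrastive_loss(frame_emb, text_emb, temperature=0.05):
    """frame_emb, text_emb: (N, D) embeddings, where row i of each tensor
    comes from the same (frame, transcript-segment) pair."""
    frame_emb = F.normalize(frame_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = frame_emb @ text_emb.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(frame_emb.size(0), device=logits.device)
    # Symmetric InfoNCE: each frame should retrieve its own segment and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))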


MERLOT achieves video-language alignment in which the text is temporally aligned with the video. Unlike VideoBERT, which is trained on curated instructional cooking videos, MERLOT is trained on a large-scale dataset of YouTube videos that is less curated and more diverse, with the corresponding text obtained through automatic speech recognition (ASR). The model uses a transformer network trained in a purely self-supervised way, with a contrastive objective between local text tokens and frame visual tokens, a masked language modeling objective, and a temporal reordering objective. At the time, the model demonstrated impressive capabilities on question-answering tasks, particularly visual commonsense reasoning. First, it can transfer the knowledge learned from videos to answer questions about what will happen next given a static image, which demonstrates how video models are useful for understanding the visual world. Second, it can answer particularly difficult questions about videos across a wide set of datasets and benchmarks. The main limitation of MERLOT is that it lacks the ability to generate text, which prevents it from demonstrating advanced visual reasoning capabilities. — Summary of MERLOT from An Introduction to Vision-Language Modeling
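
The temporal reordering objective mentioned in the summary can likewise be sketched as a pairwise ordering classifier over scrambled frame representations: the joint encoder sees frames out of order, and a small head predicts which of two frames originally came first. This is a hedged illustration rather than the released implementation; the PairwiseOrderHead module, its hidden sizes, and the choice to score every frame pair are assumptions made for the sketch.

# Minimal sketch (assumed setup, not MERLOT's implementation) of a
# pairwise temporal-reordering loss over contextualized frame states.
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseOrderHead(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # Scores the concatenation of two frame representations; 2-way output:
        # class 0 = "first argument precedes second", class 1 = the reverse.
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, frame_states, true_positions):
        """frame_states: (T, D) contextualized frame representations in scrambled order;
        true_positions: (T,) tensor of the frames' original temporal indices."""
        losses = []
        for i, j in itertools.combinations(range(frame_states.size(0)), 2):
            pair = torch.cat([frame_states[i], frame_states[j]], dim=-1)
            label = (true_positions[i] > true_positions[j]).long()  # 1 if frame i came after frame j
            losses.append(F.cross_entropy(self.scorer(pair).unsqueeze(0),
                                          label.unsqueeze(0)))
        return torch.stack(losses).mean()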

See also