Title: MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Authors: Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi
Published: 7th January 2022 (Friday) @ 19:00:21
Link: http://arxiv.org/abs/2201.02639v4
Abstract
As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time, through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets state-of-the-art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600; outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining, even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
Code and weights: https://rowanzellers.com/merlotreserve/
Our model differs from past work that learns from audio-image pairs [54, 71], from subtitled videos [105, 128], or from static images with literal descriptions [106, 21, 92]. Instead, we learn joint representations from all modalities of a video, using each modality to teach the others. We do this at scale, training on over 20 million YouTube videos.
We introduce a new contrastive masked span learning objective to learn script knowledge across modalities. It generalizes and outperforms a variety of previously proposed approaches (e.g. [29, 106, 92, 128]), while enabling audio to be used as signal. The idea is outlined in Figure 1: the model must figure out which span of text (or audio) was MASKed out of a video sequence. We combine our objective with a second contrastive learning approach, tailored to learning visual recognition from scratch: the model must also match each video frame to a contextualized representation of the video's transcript [128]. Through ablations, we show that our framework enables rapid pretraining of a model and readily scales to "large" transformer sizes (of 644M parameters).
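One way to read the span-matching objective is as a symmetric contrastive (InfoNCE-style) loss between the joint encoder's hidden states at MASK positions and independently computed embeddings of the candidate text/audio spans. Below is a minimal sketch under that reading; the tensor shapes, encoder split, and temperature are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_span_loss(mask_repr, span_repr, temperature=0.05):
    """mask_repr: [B, D] joint-encoder hidden states at the MASK positions.
       span_repr: [B, D] independent-encoder embeddings of the true masked-out
       text or audio spans (one per MASK, same order)."""
    mask_repr = F.normalize(mask_repr, dim=-1)
    span_repr = F.normalize(span_repr, dim=-1)
    logits = mask_repr @ span_repr.t() / temperature      # [B, B] cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: each MASK must retrieve its true span among in-batch
    # negatives, and each span must retrieve its MASK position.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The second objective (matching frames to a contextualized transcript representation) can reuse the same symmetric loss, with frame embeddings and transcript representations in place of mask_repr and span_repr.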
Experimental results show that Reserve learns powerful representations, useful even for tasks posed over only a few of the studied modalities. For example, when finetuned on Visual Commonsense Reasoning [126] (a vision+language task with no audio), it sets a new state-of-the-art, outperforming models trained on supervised image-caption pairs by over 5%. It does even better on video tasks: fine-tuning without audio, it outperforms prior work on TVQA [75] by a margin of over 7% (and given TVQA audio, performance increases even further). Finally, audio enables 91.1% accuracy on Kinetics-600 [19]. These performance improvements do not come at the expense of efficiency: our largest model uses one-fifth the FLOPs of a VisualBERT.
Our objective also enables out-of-the-box prediction in zero-shot settings. We evaluate on four diverse benchmarks: Situated Reasoning (STAR) [119], EPIC-Kitchens [26], LSMDC-FiB [96], and MSR-VTT QA [120]. These benchmarks require visual reasoning, with emphasis on temporality, future prediction, and social and physical understanding. With no fine-tuning or supervision, our model obtains competitive performance on each. Of note, it nearly doubles the zero-shot state-of-the-art accuracy of [123] on MSR-VTT QA, and it outperforms supervised approaches (like ClipBERT [74]) on STAR.
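As a rough illustration of how a span-matching head supports such zero-shot multiple-choice prediction (joint_encoder and span_encoder are hypothetical stand-ins; the real pipeline differs in details): render the question as a prompt with a MASK where the answer belongs, embed each candidate answer independently, and pick the candidate most similar to the model's prediction at the MASK.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_answer(joint_encoder, span_encoder, frames, prompt_with_mask, candidates):
    """Return the index of the candidate answer whose embedding best matches
       the joint encoder's representation at the MASK position."""
    mask_repr = joint_encoder(frames, prompt_with_mask)    # [D] hidden state at MASK
    cand_repr = span_encoder(candidates)                    # [num_candidates, D]
    sims = F.normalize(cand_repr, dim=-1) @ F.normalize(mask_repr, dim=-1)
    return int(sims.argmax())
```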
Analysis suggests why audio helps. For instance, predicting audio rewards models for recognizing dynamic state changes (like popcorn being cooked) and human communication dynamics (what people's emotions are, and towards whom). Our model picks up on these phenomena as pretraining progresses. These signals are often orthogonal to what snippets of text provide, which motivates learning from both modalities.
Contributions:
- MERLOT Reserve, a model for multimodal script knowledge fusing vision, audio, and text
- A new contrastive span matching objective, enabling our model to learn from text and audio self-supervision
- Experiments, ablations, and analysis that demonstrate strong multimodal video representations
Improves over MERLOT (Multimodal Neural Script Knowledge Models) by adding the audio modality, which MERLOT lacks.