Title: MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Authors: Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi
Published: 7th January 2022 (Friday) @ 19:00:21
Link: http://arxiv.org/abs/2201.02639v4
Abstract
As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time, through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets state-of-the-art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600; outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining, even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
Code and weights: https://rowanzellers.com/merlotreserve/
Our model differs from past work that learns from audio-image pairs [54, 71], from subtitled videos [105, 128], or from static images with literal descriptions [106, 21, 92]. Instead, we learn joint representations from all modalities of a video, using each modality to teach the others. We do this at scale, training on over 20 million YouTube videos.
We introduce a new contrastive masked span learning objective to learn script knowledge across modalities. It generalizes and outperforms a variety of previously proposed approaches (e.g. [29, 106, 92, 128]), while enabling audio to be used as signal. The idea is outlined in Figure 1: the model must figure out which span of text (or audio) was MASKed out of a video sequence. We combine our objective with a second contrastive learning approach, tailored to learning visual recognition from scratch: the model must also match each video frame to a contextualized representation of the video's transcript [128]. Through ablations, we show that our framework enables rapid pretraining of a model and readily scales to "large" transformer sizes (of 644M parameters).
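One way to read the span-matching objective is as a symmetric contrastive (InfoNCE-style) loss between the joint encoder's hidden states at MASK positions and independently computed embeddings of the candidate text/audio spans. Below is a minimal sketch under that reading; the tensor shapes, encoder split, and temperature are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_span_loss(mask_repr, span_repr, temperature=0.05):
    """mask_repr: [B, D] joint-encoder hidden states at the MASK positions.
       span_repr: [B, D] independent-encoder embeddings of the true masked-out
       text or audio spans (one per MASK, same order)."""
    mask_repr = F.normalize(mask_repr, dim=-1)
    span_repr = F.normalize(span_repr, dim=-1)
    logits = mask_repr @ span_repr.t() / temperature      # [B, B] cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: each MASK must retrieve its true span among in-batch
    # negatives, and each span must retrieve its MASK position.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The second objective (matching frames to a contextualized transcript representation) can reuse the same symmetric loss, with frame embeddings and transcript representations in place of mask_repr and span_repr.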
Experimental results show that Reserve learns powerful representations, useful even for tasks posed over only a few of the studied modalities. For example, when finetuned on Visual Commonsense Reasoning [126] (a vision+language task with no audio), it sets a new state-of-the-art, outperforming models trained on supervised image-caption pairs by over 5%. It does even better on video tasks: fine-tuning without audio, it outperforms prior work on TVQA [75] by a margin of over 7% (and given TVQA audio, performance increases even further). Finally, audio enables 91.1% accuracy on Kinetics-600 [19]. These performance improvements do not come at the expense of efficiency: our largest model uses one-fifth the FLOPs of a VisualBERT.
Our objective also enables out-of-the-box prediction in zero-shot settings. We evaluate on four diverse benchmarks: Situated Reasoning (STAR) [119], EPIC-Kitchens [26], LSMDC-FiB [96], and MSR-VTT QA [120]. These benchmarks require visual reasoning, with emphasis on temporality, future prediction, and social and physical understanding. With no fine-tuning or supervision, our model obtains competitive performance on each. Of note, it nearly doubles the zero-shot state-of-the-art accuracy of [123] on MSR-VTT QA, and it outperforms supervised approaches (like ClipBERT [74]) on STAR.
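As a rough illustration of how a span-matching head supports such zero-shot multiple-choice prediction (joint_encoder and span_encoder are hypothetical stand-ins; the real pipeline differs in details): render the question as a prompt with a MASK where the answer belongs, embed each candidate answer independently, and pick the candidate most similar to the model's prediction at the MASK.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_answer(joint_encoder, span_encoder, frames, prompt_with_mask, candidates):
    """Return the index of the candidate answer whose embedding best matches
       the joint encoder's representation at the MASK position."""
    mask_repr = joint_encoder(frames, prompt_with_mask)    # [D] hidden state at MASK
    cand_repr = span_encoder(candidates)                    # [num_candidates, D]
    sims = F.normalize(cand_repr, dim=-1) @ F.normalize(mask_repr, dim=-1)
    return int(sims.argmax())
```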
Analysis suggests why audio helps. For instance, predicting audio rewards models for recognizing dynamic state changes (like popcorn being cooked) and human communication dynamics (what people's emotions are, and towards whom). Our model picks up on these phenomena as pretraining progresses. These signals are often orthogonal to what snippets of text provide, which motivates learning from both modalities.
Contributions:
- MERLOT Reserve, a model for multimodal script knowledge fusing vision, audio, and text
- A new contrastive span matching objective, enabling our model to learn from text and audio self-supervision
- Experiments, ablations, and analysis that demonstrate strong multimodal video representations
Improves over MERLOT (Multimodal Neural Script Knowledge Models) by adding the audio modality, which MERLOT lacks.