Title: MERLOT: Multimodal Neural Script Knowledge Models
Authors: Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi
Published: 4th June 2021 (Friday) @ 17:57:39
Link: http://arxiv.org/abs/2106.02636v3
Abstract
As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech, in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%, even those that make heavy use of auxiliary supervised data (like object bounding boxes). Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.
Multimodal Event Representation Learning Over Time (MERLOT)
Summary of MERLOT from An Introduction to Vision-Language Modeling
MERLOT achieves video-language alignment where the text is temporally aligned with the video. Contrary to VideoBERT, which is trained on curated instructional cooking videos, MERLOT is trained on a large-scale dataset of YouTube videos that is less curated and also more diverse, and where the corresponding text is obtained by ASR. The model uses a transformer network trained in a purely self-supervised way, with a contrastive objective between local text tokens and frame visual tokens, a masked language modeling objective, and a temporal reordering objective. The model demonstrated at the time impressive capabilities on question answering tasks, particularly visual commonsense reasoning. First, it is able to transfer the knowledge it has learned from videos to answer questions about what is going to happen next from an image, which demonstrates how video models are useful for understanding the visual world. Second, it is able to answer particularly difficult questions from videos on a wide set of datasets and benchmarks. The main limitation of MERLOT is that it lacks the ability to generate text, which prevents it from demonstrating advanced visual reasoning capabilities.
MERLOT outperforms strong baselines like CLIP [89] and UNITER [22], which independently match images to text and thus cannot reason over long-term contexts as effectively. This capacity for temporal coherence emerges during pretraining: analysis of MERLOT's attention patterns (Figure 11) shows that regions attend to captions that are distant in time (and vice versa), allowing it to perform cross-modal coreference and piece together a holistic view of situations.
Ablations of MERLOT show:
- Pretraining works better when we train on videos rather than still images, aided crucially by our strategy of corrupting highly visual words in the masked language modeling task
- Using a diverse set of videos covering many aspects of everyday situations improves downstream performance compared to curated instructional video corpora [107, 80] which both cover a smaller slice of the visual world (confirming hypotheses from past work [47])
- MERLOT's performance does not saturate even after many epochs of training on the pretraining corpus we curated, YT-Temporal-180M, as it continues to improve simply with more pretraining.
Temporal Ordering and Forecasting (§2.3)
There has been a large body of work on analyzing 'what happens next' in videos [58]. Some modeling choices include using pixels [34, 113], graphs [11], Euclidean distance using sensors [3], or studying cycle consistency across time [32]. In addition to extrapolation, past work has studied deshuffling objectives in videos [82, 115], though this has mostly been limited to the visual modality. In contrast to these papers, our goal is learning multimodal script knowledge representations: using both language and vision as complementary views into the world, instead of just tracking what changes on-screen.
YT-Temporal-180M
- 6 million YT videos
- starting from 27M candidate videos
- including instructional videos from HowTo100M [80], lifestyle vlogs of everyday events from the VLOG dataset [35], and YouTube's auto-suggested videos for popular topics like 'science' or 'home improvement'
- Emphasis on breadth (diversity), in contrast to previous work that focuses on procedural videos
- YouTube API:
- Video (identified by ID)
- ASR transcript
- Discard:
- Belong to 'ungrounded' categories (like video game commentaries; emphasis on grounded/real-world videos)
- Are too long (over a fixed cutoff of minutes)
- Have no English ASR transcript
- Have thumbnails that are unlikely to contain objects - judged by a 'lightweight image classifier' (a filtering sketch follows this list)
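Below is a minimal sketch of this filtering pipeline. The metadata fields, helper names, and the 20-minute duration cutoff are assumed placeholders for illustration, not values taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VideoMeta:
    """Hypothetical per-video metadata gathered via the YouTube API."""
    video_id: str
    duration_minutes: float
    has_english_asr: bool          # English word-level ASR transcript available
    grounded_category: bool        # False for e.g. video-game commentary
    thumbnail_has_objects: bool    # output of the lightweight thumbnail classifier


MAX_DURATION_MINUTES = 20  # assumed cutoff; the notes only mention a limit in minutes


def keep_video(v: VideoMeta) -> bool:
    """Apply the discard rules listed above to one candidate video."""
    return (
        v.has_english_asr
        and v.duration_minutes <= MAX_DURATION_MINUTES
        and v.grounded_category
        and v.thumbnail_has_objects
    )


def filter_candidates(candidates: List[VideoMeta]) -> List[VideoMeta]:
    """~27M candidate videos in, ~6M kept videos out."""
    return [v for v in candidates if keep_video(v)]
```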
Video Representations used for MERLOT
They use only a single image frame to represent a given video segment.
Each video V might contain thousands of frames. In this work, we represent a video V as a sequence of consecutive video segments {s_1, ..., s_N}. Each segment s_t consists of:
- an image frame I_t, extracted from the middle timestep of the segment,
- the words w_t spoken during the segment, with a total length of L tokens.
To split the videos into segments, we byte-pair-encode (BPE; [97, 88]) each video transcript and align tokens with YouTube's word-level timestamps. This enables us to split the videos into segments of L BPE tokens each (Appendix A.4); our final dataset has 180 million segments of this form.
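A minimal sketch of this segmentation step, assuming a generic `bpe_encode` callable and YouTube-style word-level timestamps; the default segment length of 32 tokens and the greedy packing are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TimedWord:
    """One ASR word with YouTube-style word-level timing (seconds)."""
    text: str
    start: float
    end: float


def make_segments(words: List[TimedWord],
                  bpe_encode: Callable[[str], List[int]],
                  seg_len: int = 32) -> List[Tuple[List[int], float]]:
    """Greedily pack BPE tokens into segments of `seg_len` tokens.

    Returns, for each segment, (token_ids, middle_timestamp); the middle
    timestamp marks where the representative frame would be extracted.
    Trailing tokens shorter than `seg_len` are dropped in this sketch.
    """
    segments = []
    tokens, starts, ends = [], [], []
    for w in words:
        ids = bpe_encode(w.text)
        tokens.extend(ids)
        starts.extend([w.start] * len(ids))   # each BPE token inherits its word's timing
        ends.extend([w.end] * len(ids))
        while len(tokens) >= seg_len:
            seg_tokens = tokens[:seg_len]
            mid_time = (starts[0] + ends[seg_len - 1]) / 2.0
            segments.append((seg_tokens, mid_time))
            tokens, starts, ends = tokens[seg_len:], starts[seg_len:], ends[seg_len:]
    return segments
```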
MERLOT
Figure 2: Left: MERLOT learns to match contextualized captions with their corresponding video frames. Right: the same image encoding is provided, along with (masked) word embeddings, into a joint vision-language Transformer model; it then unmasks grounded words (like 'saw' in this example) and puts scrambled video frames back into the correct order.
Image Encoder
- for the sake of pre-training efficiency we use a grid-based hybrid ResNet/Vision Transformer
- encoder uses a ResNet-50 backbone, followed by a 12-layer, 768-dimensional Vision Transformer
- additional modifications that improve efficiency:
- we trained on smaller, widescreen images of size 192×352 (because most YouTube videos are widescreen) using a patch size of 16×16 pixels
- we mirror [31]'s alteration of removing the C5 block in ResNet-50 (C5 is the final residual stage; dropping it leaves a stride-16 feature map and saves compute)
- we save compute further by average-pooling the final-layer region cells using a kernel size of 2 × 2
- our image encoder requires 40 gigaFLOPs for a forward pass, which is 2% of the 2 teraFLOPs required for a Faster R-CNN.
Summary: given an image of size W × H, the image encoder will output a W/32 × H/32 feature map, along with two CLS hidden states: one for pooling a global representation of the image, and another for pretraining (Task 1).
See In Defense of Grid Features for Visual Question Answering for info on grid features
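A rough PyTorch sketch of a grid-based hybrid ResNet/ViT encoder matching the description above (ResNet-50 without C5, 2×2 average pooling, a 12-layer 768-dimensional Transformer, two CLS tokens). Position embeddings, initialization, and other details are omitted; this is an illustration under those assumptions, not the MERLOT implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class HybridImageEncoder(nn.Module):
    """Sketch of a grid-based hybrid ResNet/Vision Transformer encoder."""

    def __init__(self, dim: int = 768, layers: int = 12, heads: int = 12):
        super().__init__()
        backbone = resnet50(weights=None)
        # keep stages up to C4 (layer3); C5 (layer4) is removed, as in [31]
        self.stem = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
            backbone.layer1, backbone.layer2, backbone.layer3,
        )  # stride-16 feature map with 1024 channels
        self.pool = nn.AvgPool2d(2)           # 2x2 average pooling -> overall stride 32
        self.proj = nn.Linear(1024, dim)      # project grid cells to transformer width
        self.cls = nn.Parameter(torch.zeros(2, dim))  # two CLS tokens (global + Task 1)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, 192, 352) -> stem: (B, 1024, 12, 22) -> pool: (B, 1024, 6, 11)
        grid = self.pool(self.stem(images))
        B, C, H, W = grid.shape
        tokens = self.proj(grid.flatten(2).transpose(1, 2))   # (B, H*W, dim)
        cls = self.cls.unsqueeze(0).expand(B, -1, -1)         # (B, 2, dim)
        return self.vit(torch.cat([cls, tokens], dim=1))      # (B, 2 + H*W, dim)
```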
Joint Vision-Language Encoder
The joint encoder is a 12-layer, 768-dimensional Transformer [112], mirroring the RoBERTa base architecture [72]; we initialize it with pretrained RoBERTa weights.
- Position embeddings differ across segments, so the model can distinguish images and captions from different timesteps
- Captions begin with a CLS token
- Feature maps begin with (another) CLS token (an input-layout sketch follows this list)
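A rough sketch of how the joint encoder's input might be assembled, with per-segment position embeddings so frames and captions from different timesteps are distinguishable. The module name, sizes, and exact token ordering are assumptions for illustration, not the MERLOT implementation.

```python
import torch
import torch.nn as nn


class JointInputBuilder(nn.Module):
    """Lays out per-segment image grids and caption embeddings for a joint encoder."""

    def __init__(self, dim: int = 768, max_segments: int = 16, max_caption_len: int = 32):
        super().__init__()
        self.img_seg_pos = nn.Embedding(max_segments, dim)  # e.g. [image_t]
        self.txt_seg_pos = nn.Embedding(max_segments, dim)
        self.txt_tok_pos = nn.Embedding(max_caption_len, dim)
        self.cls_img = nn.Parameter(torch.zeros(dim))       # CLS for each feature map
        self.cls_txt = nn.Parameter(torch.zeros(dim))       # CLS for each caption

    def forward(self, image_feats: torch.Tensor, caption_embs: torch.Tensor) -> torch.Tensor:
        # image_feats:  (B, T, G, dim)  -- G grid cells per frame
        # caption_embs: (B, T, L, dim)  -- L caption (word-embedding) tokens per segment
        B, T, G, D = image_feats.shape
        L = caption_embs.shape[2]
        seq = []
        for t in range(T):
            t_idx = torch.full((B,), t, dtype=torch.long)
            img = image_feats[:, t] + self.img_seg_pos(t_idx)[:, None]
            txt = (caption_embs[:, t] + self.txt_seg_pos(t_idx)[:, None]
                   + self.txt_tok_pos(torch.arange(L))[None])
            seq += [self.cls_img.expand(B, 1, D), img,
                    self.cls_txt.expand(B, 1, D), txt]
        return torch.cat(seq, dim=1)  # (B, T * (2 + G + L), dim)
```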
Pretraining Tasks & Objectives
- Contrastive frame-transcript matching
- Project both the text-only encoder's encoding of the caption and the image encoding to a 768-dimensional, L2-normalized vector
- Negative samples come from the batch (whether or not they originate from the same video)
- Caption-level information alone is often insufficient for contrastive matching, so the text-only encoder is forced to use bidirectional context (left and right, since it is RoBERTa-based) from the surrounding transcript (a loss sketch follows this list)
- Attention Masked Language Modeling
- People ramble in videos, so with uniform random masking, ungrounded filler words like 'umm' often get masked, which is not useful for representation learning
- they introduce attention masking
- use attention weights from a language-only transformer as a heuristic for which words are grounded
- 50% of the time, we mask out a random token; the other 50% of the time, we mask out one of the top 20% most-attended-to tokens
- Apply SpanBERT-style masking [54]: randomly corrupt the following or preceding tokens, with an average span length of 0.5 tokens in each direction; this makes it harder for models to over-rely on BPE artifacts (a masking-selection sketch follows this list)
- Temporal Reordering allows the model to order each 'shuffled' frame conditioned on frames provided in the correct order (if any)
- We have the model order the image frames in a video, forcing it to explicitly learn temporal reasoning and giving it an interface to measure such temporal reasoning.
- Protocol:
- 40% of the time, we randomly pick an integer i between 2 and N (number of segments provided to the joint encoder)
- Scramble i video frames chosen at random, by replacing the segment-level position embedding (e.g. [image_t]) of each chosen frame with a random and unique position embedding (e.g. [image_unk_0])
- These random position embeddings are learned, and separate from the âunshuffledâ position embeddings.
- reordering loss:
- we extract hidden states from each frame at the CLS token position
- For each pair of frames, we concatenate their hidden states h_ti and h_tj and pass the result through a two-layer MLP, predicting whether t_i < t_j or t_i > t_j
- Optimized with cross-entropy (a pairwise-loss sketch follows this list)
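A minimal sketch of the contrastive frame-transcript matching objective described above: both inputs are assumed to already be projected to 768 dimensions, negatives come from the rest of the batch, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F


def contrastive_frame_transcript_loss(img_cls: torch.Tensor,
                                      txt_cls: torch.Tensor,
                                      temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style matching of N frame/caption pairs.

    img_cls, txt_cls: (N, 768) pooled representations (already projected).
    The temperature value is an assumed placeholder.
    """
    img = F.normalize(img_cls, dim=-1)       # L2-normalize to unit vectors
    txt = F.normalize(txt_cls, dim=-1)
    logits = img @ txt.t() / temperature     # (N, N) similarity matrix
    targets = torch.arange(img.shape[0], device=img.device)
    # negatives are all other frame/caption pairs in the batch,
    # whether or not they come from the same video
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```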
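A rough sketch of the attention-masking heuristic for the masked language modeling objective: half of the masked positions are uniformly random, half are drawn from the most-attended-to tokens (per a language-only transformer). The overall masking rate, the aggregation of attention into a per-token score, and the omission of the SpanBERT-style neighbor corruption are simplifying assumptions.

```python
import torch


def choose_mask_positions(attn_to_tokens: torch.Tensor,
                          mask_frac: float = 0.15,
                          top_frac: float = 0.20) -> torch.Tensor:
    """Pick which token positions to mask, biased toward grounded (attended) words.

    attn_to_tokens: (L,) total attention each token receives from a
    language-only transformer (an assumed precomputed score).
    Returns a boolean mask over the L token positions.
    """
    L = attn_to_tokens.shape[0]
    n_mask = max(1, int(mask_frac * L))             # assumed overall masking rate
    top_k = max(1, int(top_frac * L))               # top ~20% most-attended tokens
    top_candidates = attn_to_tokens.topk(top_k).indices
    mask = torch.zeros(L, dtype=torch.bool)
    for _ in range(n_mask):
        if torch.rand(()) < 0.5:                    # 50%: any random token
            idx = torch.randint(L, (1,)).item()
        else:                                       # 50%: a highly attended token
            idx = top_candidates[torch.randint(top_k, (1,))].item()
        mask[idx] = True
    return mask
```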
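A minimal sketch of the pairwise temporal-reordering head: for each pair of frames, the frame-level CLS hidden states are concatenated and a two-layer MLP predicts which frame comes first, trained with cross-entropy. The MLP hidden size is illustrative, not the exact MERLOT value.

```python
import torch
import torch.nn as nn


class ReorderHead(nn.Module):
    """Predicts the relative order of frame pairs from their CLS hidden states."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, frame_cls: torch.Tensor, times: torch.Tensor) -> torch.Tensor:
        """frame_cls: (T, dim) CLS hidden state per frame; times: (T,) true timesteps.

        Returns the cross-entropy loss over all ordered frame pairs (i, j), i != j.
        """
        T = frame_cls.shape[0]
        i, j = torch.meshgrid(torch.arange(T), torch.arange(T), indexing="ij")
        keep = i != j
        i, j = i[keep], j[keep]
        pair = torch.cat([frame_cls[i], frame_cls[j]], dim=-1)   # (P, 2*dim)
        logits = self.mlp(pair)                                  # (P, 2)
        labels = (times[i] < times[j]).long()                    # 1 if t_i < t_j
        return nn.functional.cross_entropy(logits, labels)
```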
Results
Table 3: Comparison with state-of-the-art methods on video reasoning tasks. MERLOT outperforms state-of-the-art methods on 12 downstream tasks involving short and long videos.
| Task | Split | Vid. Length | ActBERT [127] | ClipBERT [67] | SOTA | MERLOT |
|---|---|---|---|---|---|---|
| MSRVTT-QA | Test | Short | – | 37.4 | 41.5 [118] | 43.1 |
| MSR-VTT-MC | Test | Short | 88.2 | – | 88.2 [127] | 90.9 |
| TGIF-Action | Test | Short | – | 82.8 | 82.8 [67] | 94.0 |
| TGIF-Transition | Test | Short | – | 87.8 | 87.8 [67] | 96.2 |
| TGIF-Frame QA | Test | Short | – | 60.3 | 60.3 [67] | 69.5 |
| LSMDC-FiB QA | Test | Short | 48.6 | – | 48.6 [127] | 52.9 |
| LSMDC-MC | Test | Short | – | – | 73.5 [121] | 81.7 |
| ActivityNetQA | Test | Long | – | – | 38.9 [118] | 41.4 |
| Drama-QA | Val | Long | – | – | 81.0 [56] | 81.4 |
| TVQA | Test | Long | – | – | 76.2 [56] | 78.7 |
| TVQA+ | Test | Long | – | – | 76.2 [56] | 80.9 |
| VLEP | Test | Long | – | – | 67.5 [66] | 68.4 |
Ablations
Table 4: Ablation study on the validation sets of VCR question answering (Q → A) and TVQA+, in accuracy (%). We put a ✓ next to the configurations we chose for MERLOT.
Ablations are insanely well done and helpful
Diverse video data is important: even when subsampled to the scale of HowTo100M, YT-Temporal-180M still outperforms HowTo100M
See also
- Follow-up work that incorporates audio: MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound