Title: Efficient Pre-training for Localized Instruction Generation of Videos
Authors: Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
Published: 27th November 2023 (Monday) @ 16:07:37
Link: http://arxiv.org/abs/2311.15964v4
Abstract
Procedural videos, exemplified by recipe demonstrations, are instrumental in conveying step-by-step instructions. However, understanding such videos is challenging as it involves the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance but demands significant computational resources. Furthermore, transcripts contain irrelevant content and differ in style from human-written instructions. To mitigate these issues, we propose a novel technique, Sieve-&-Swap, to automatically generate high-quality training data for the recipe domain: (i) Sieve: filters irrelevant transcripts and (ii) Swap: acquires high-quality text by replacing transcripts with human-written instructions from a text-only recipe dataset. The resulting dataset is three orders of magnitude smaller than current web-scale datasets but enables efficient training of large-scale models. Alongside Sieve-&-Swap, we propose Procedure Transformer (ProcX), a model for end-to-end step localization and instruction generation for procedural videos. When pre-trained on our curated dataset, this model achieves state-of-the-art performance on YouCook2 and Tasty while using a fraction of the training data. We have released code and dataset.
Quick Notes
Studies by Chafe et al. [6] and Einhorn et al. [8] highlight the distinct attributes of written and spoken language: written language is more concise and varied, while spoken language tends to be more protracted and repetitive. This creates a substantial domain gap. In this work, we propose Sieve & Swap, a technique that fuses an instructional video dataset and a recipe text dataset, resulting in a smaller multimodal dataset for effective procedure learning.
Main Contributions:
- Specifically, we combine a subset of HowTo100M [22] of instructional cooking videos with the RecipeNLG [4] collection of text-only cooking recipes. We develop a novel method to retrieve relevant and high-quality reference descriptions for procedural learning, which involves extracting human-written instructional sentences from RecipeNLG.
- We employ sentence embeddings to filter irrelevant sentences from video transcripts based on text similarity, and replace the relevant transcript sentences with instructional sentences from RecipeNLG.
- The resulting Sieve & Swap dataset is characterized by less noise, providing high-quality pre-training data for the task of human-style instruction generation.
- The Sieve & Swap dataset is smaller by three orders of magnitude (~48K videos) than the pre-training datasets employed in previous studies (~15M videos) [20, 37], but contains both segments with temporal boundaries and human-written instructions.
- Despite the smaller scale, models pre-trained on the Sieve & Swap dataset outperform the same models pre-trained on raw transcripts, as we show in Figure 1b.
- Propose Procedure Transformer (ProcX), an improved instruction localization and description model, that we pre-train on the Sieve & Swap dataset to achieve state-of-the-art performance on YouCook2 and Tasty.
Fig. 1: (a) Sieve & Swap at a glance. We first remove irrelevant video ASR (Automatic Speech Recognition) segments, then substitute the raw transcripts with recipe steps retrieved from a recipe database. The resulting dataset is used for pre-training. Previous work uses raw and noisier transcripts, thus requiring larger amounts of data for effective pre-training. (b) With Sieve & Swap we generate a smaller but better pre-training dataset. Compared to using a larger number of raw segments (right), models achieve higher performance with a fraction of the data (middle). Note how the improvement from no pre-training (left) is steeper with Sieve & Swap. The plotted metric is the generated instruction coherency metric (SODA-C [9]), models were fine-tuned and tested on YouCook2 [42].
Related Work
We operate in the cooking domain as it offers a diversity of both natural language and visual activities. Due to task complexity, prior work addresses the problem with two-stage training, where events are first localized and captions are later generated. Wang et al. [34] introduced a single-stage model (PDVC) to jointly identify temporal boundaries and generate procedural instructions.
Unlike prior work [17, 41], we sieve noisy ASR segments and replace conversational speech with human-written text without visual training. Our method uses five times less text than [17], which uses all ASR-transcribed text. Moreover, the text-replacement approach is crucial for human-style instruction generation.
Task: The Goal
Our task is to output the start/end times of each instruction, together with its textual description, given only the untrimmed video as input. This task has been termed "Procedure Segmentation and Summarization" (PSS) in [3] and overlaps with Dense Video Captioning (DVC). In both cases the goal is to detect and describe activities in a video; however, key distinctions are that DVC doesn't involve instructions and that it allows overlapping segments.
In our task, instructions are sequential parts of a single procedure (e.g., a recipe), and the focus is to provide a textual sequence summarizing the procedure.
In fact, popular datasets for DVC such as ActivityNet Captions [13] collect non-procedural videos, i.e., captions describe what happens in the video, but they do not form the textual sequence of a procedure, which is the focus of our task.
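To make the task's output concrete, here is a minimal illustrative sketch (not from the paper) of the expected output structure: a sequence of timed steps that must be ordered and non-overlapping, in contrast to DVC, where segments may overlap.

```python
from dataclasses import dataclass

@dataclass
class Step:
    start: float  # step start time in seconds
    end: float    # step end time in seconds
    text: str     # generated instruction

def is_valid_procedure(steps: list[Step]) -> bool:
    """PSS output: steps form a sequential, non-overlapping procedure,
    unlike DVC, where segments are allowed to overlap."""
    return all(s.start < s.end for s in steps) and all(
        a.end <= b.start for a, b in zip(steps, steps[1:])
    )
```

The `Step` name and the validity check are hypothetical helpers used only to illustrate the task definition.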
Sieve & Swap Protocol
Overview
- Source Datasets. We utilize all cooking videos from HowTo100M (~338K) and all the cooking recipes with steps from RecipeNLG [4] (~2M).
- HowTo100M is the largest dataset for pre-training video models
- RecipeNLG was collected for generating structured text sequences
- Keep videos with a maximum duration of 10 minutes
- Require a minimum of five videos per cooking category (e.g., "make a bean salad") obtained from video metadata
- This set of videos is the corpus V with ~241K videos, each paired with a title containing a high-level goal or a cooking recipe title.
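The metadata-based pre-filtering above can be sketched as follows (illustrative code; the dict schema with `duration` and `category` keys is an assumption, not the paper's actual data format):

```python
from collections import Counter

MAX_DURATION_S = 600  # keep videos of at most 10 minutes

def sieve_by_metadata(videos: list[dict]) -> list[dict]:
    """Keep short videos whose cooking category has at least five videos.
    Each video dict is assumed to carry 'duration' (seconds) and 'category'."""
    short = [v for v in videos if v["duration"] <= MAX_DURATION_S]
    counts = Counter(v["category"] for v in short)
    return [v for v in short if counts[v["category"]] >= 5]
```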
Protocol:
- Sieve irrelevant videos and recipes by pairing them based on their title and text content
- Apply Sieve & Swap on ASR segments to
- discard irrelevant ASR text (e.g., greetings)
- refine segments by substituting ASR transcripts with steps from the recipe database (RecipeNLG)
Sieving Videos. The source dataset contains a large number of noisy videos and requires large computational resources to process. We filter the video dataset by pairing each video v with a set of textual recipes based on (i) word overlap between video and recipe titles and (ii) word overlap between the ASR transcript and the recipe instructions.
We filter further by comparing the content of the transcripts and the recipe steps. First, we tokenize the transcripts and recipe steps, lemmatize the tokens, and select only content words. Then, we compute the token-IoU and token-recall for each video-recipe pair. Finally, we keep the pairs with token-recall ≥ 0.3 and token-IoU ≥ 0.1, which yields a refined video set of ~52K videos. Later, we split the videos into a train and a validation set: videos with token-IoU ≥ 0.2 form the validation set (~3K videos), whereas the remaining ~48K videos with 0.1 ≤ token-IoU < 0.2 form the training set. Analogously, we obtain a refined recipe set. We use these thresholds to balance storage/computation resources against the resulting noise; these hyper-parameters can be tuned to increase the size of the dataset. Figure 2 (top) illustrates this part of the dataset creation.
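The token-IoU and token-recall statistics used for this filtering step can be sketched as follows (illustrative code, assuming tokenization, lemmatization, and content-word selection have already been applied):

```python
def token_iou_recall(transcript_tokens: list[str],
                     recipe_tokens: list[str]) -> tuple[float, float]:
    """Set-level overlap between transcript and recipe content words.
    token-IoU  = |intersection| / |union|
    token-recall = |intersection| / |recipe tokens|"""
    t, r = set(transcript_tokens), set(recipe_tokens)
    inter = len(t & r)
    iou = inter / len(t | r) if t | r else 0.0
    recall = inter / len(r) if r else 0.0
    return iou, recall
```

A video-recipe pair would then be kept when both values clear their respective thresholds.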
ASR Sieving and Swapping. For each individual ASR transcript segment a, we retrieve the nearest recipe step q ∈ S, where S is the set of steps of a recipe in the refined recipe set (see Eq. 2), based on the cosine similarity between their text embeddings (extracted with MPNet [26, 30]). If the similarity is above a threshold, we swap a with q.
- The number of videos in the dataset remains the same after swapping
- however, the number of video segments drops from 2.75M to 0.51M
- With this procedure (depicted at the bottom of Figure 2), non-instructional parts of transcripts are removed since our text database contains only recipe steps
In contrast to prior methods, selecting transcripts together with their temporal boundaries helps to bridge the gap between the pre-training and downstream tasks and filters out irrelevant transcripts.
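The swap step can be sketched as below. This is a minimal illustration assuming the MPNet embeddings are precomputed; the threshold value of 0.7 is a hypothetical placeholder, not the paper's actual setting.

```python
import numpy as np

def sieve_and_swap(asr_embs: np.ndarray, step_embs: np.ndarray,
                   steps: list[str], thresh: float = 0.7) -> list[tuple[int, str]]:
    """For each ASR segment embedding, retrieve the nearest recipe step by
    cosine similarity; swap the segment's text if similar enough, else drop it.
    Returns (segment index, replacement step text) pairs for kept segments."""
    a = asr_embs / np.linalg.norm(asr_embs, axis=1, keepdims=True)
    s = step_embs / np.linalg.norm(step_embs, axis=1, keepdims=True)
    sims = a @ s.T                       # (num_segments, num_steps) cosine sims
    best = sims.argmax(axis=1)           # nearest step per segment
    keep = sims.max(axis=1) >= thresh    # sieve: discard low-similarity segments
    return [(i, steps[j]) for i, (j, k) in enumerate(zip(best, keep)) if k]
```

Dropping segments below the threshold is what shrinks the segment count (2.75M to 0.51M in the paper) while leaving the number of videos unchanged.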
Procedure Transformer (ProcX)
Extends PDVC, i.e., End-to-End Dense Video Captioning with Parallel Decoding [34]
Set-based Localization and Captioning:
- The PDVC architecture has been inspired by the DETR model [5]
- generates N event proposals in parallel, each with a caption produced by an LSTM
- Like DETR, it matches proposals to ground-truth events with the Hungarian algorithm and optimizes a set of losses over the matched pairs
- see Kuhn's "The Hungarian Method for the Assignment Problem" for background on the Hungarian algorithm
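A minimal sketch of the Hungarian (optimal assignment) matching that DETR-style models such as PDVC use to pair predicted proposals with ground-truth events, via SciPy's implementation. The cost values here are purely illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows = predicted proposals, cols = ground-truth steps.
# In practice, entries combine localization and classification costs.
cost = np.array([
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.6],
    [0.5, 0.9, 0.1],
])

# Hungarian algorithm: one-to-one matching that minimizes total cost.
rows, cols = linear_sum_assignment(cost)
# The losses are then computed only over the matched (row, col) pairs.
```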