Title: Principles of Visual Tokens for Efficient Video Understanding
Authors: Xinyue Hao, Gen Li, Shreyank N Gowda, Robert B Fisher, Jonathan Huang, Anurag Arnab, Laura Sevilla-Lara
Published: 20th November 2024 (Wednesday) @ 14:09:47
Link: http://arxiv.org/abs/2411.13626v1
Abstract
Video understanding has made huge strides in recent years, relying largely on the power of the transformer architecture. As this architecture is notoriously expensive and video is highly redundant, research into improving efficiency has become particularly relevant. This has led to many creative solutions, including token merging and token selection. While most methods succeed in reducing the cost of the model and maintaining accuracy, an interesting pattern arises: most methods do not outperform the random sampling baseline. In this paper we take a closer look at this phenomenon and make several observations. First, we develop an oracle for the value of tokens which exposes a clear Pareto distribution where most tokens have remarkably low value, and just a few carry most of the perceptual information. Second, we analyze why this oracle is extremely hard to learn, as it does not consistently coincide with visual cues. Third, we observe that easy videos need fewer tokens to maintain accuracy. We build on these and further insights to propose a lightweight video model we call LITE that can select a small number of tokens effectively, outperforming state-of-the-art and existing baselines across datasets (Kinetics400 and Something-Something-V2) in the challenging trade-off of computation (GFLOPs) vs accuracy.
- Using a subset of "good" tokens can be better than using all tokens. First, we design an oracle that estimates the value of each token for a particular task; concretely, we take Action Classification as our testbed. This oracle is created such that the value of each input token corresponds to its gradient [32]. We refer to it as an oracle because it uses the ground-truth label of the class. Given this oracle, we can now sample a subset of the tokens according to their gradient value, keeping those with the highest values.
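A minimal sketch of such a gradient-based token oracle, assuming a toy linear classifier over mean-pooled token embeddings (the paper's actual backbone and saliency definition [32] may differ; `token_oracle_scores` and all shapes here are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def token_oracle_scores(tokens, W, label):
    """Score each token by gradient-times-input saliency w.r.t. the
    ground-truth class (a hypothetical stand-in for the paper's oracle).

    tokens: (N, D) token embeddings for one video
    W:      (C, D) weights of a toy linear classifier on the mean-pooled video
    label:  ground-truth class index -- this is what makes it an "oracle"
    """
    n = tokens.shape[0]
    logits = W @ tokens.mean(axis=0)     # (C,) logits from mean-pooled tokens
    p = softmax(logits)
    y = np.zeros_like(p)
    y[label] = 1.0
    grad = W.T @ (p - y) / n             # dLoss/dtoken_i (shared under mean pooling)
    return np.abs(tokens @ grad)         # gradient-x-input varies per token

def keep_top_tokens(tokens, scores, keep_frac=0.25):
    """Keep only the highest-scoring fraction of tokens."""
    k = max(1, int(round(keep_frac * len(scores))))
    idx = np.argsort(scores)[::-1][:k]
    return tokens[np.sort(idx)]          # preserve original token order
```

In this sketch the gradient is identical across tokens (a consequence of mean pooling), so the per-token variation comes from the gradient-times-input product, one common saliency formulation.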
- A small number of tokens carries most of the "value"
- Value of tokens is Pareto distributed
- This explains the "random drop" baseline: since most tokens carry almost no value, dropping tokens at random loses little information
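The Pareto claim can be illustrated with synthetic token values (the shape parameter below is illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw 10,000 synthetic token "values" from a heavy-tailed Pareto
# distribution (alpha ~ 1.16 gives the classic 80/20 shape).
values = rng.pareto(1.16, size=10_000)

values_sorted = np.sort(values)[::-1]
top20 = values_sorted[: len(values) // 5].sum() / values.sum()
print(f"share of total value held by the top 20% of tokens: {top20:.2f}")
```

Under such a distribution the top fifth of tokens typically carries the bulk of the total value, which is why removing tokens uniformly at random is nearly free: most draws are tiny.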
- Filtering/selecting tokens based on human-aligned cues (foreground objects, high-attention regions, etc.) all perform worse than the random baseline
- Best method: train a Lightweight Token Elector network (LITE) to select tokens
- MLP network
- from abstract: âoutperforming state-of-the-art and existing baselines across datasets (Kinetics400 and Something-Something-V2) in the challenging trade-off of computation (GFLOPs) vs accuracyâ
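The selector might be sketched as a tiny per-token MLP scorer (the architecture, sizes, and selection rule below are assumptions for illustration; the paper trains LITE to approximate the oracle):

```python
import numpy as np

class TinyTokenScorer:
    """Hypothetical per-token MLP: embedding (D,) -> scalar keep-score.
    Unlike the gradient oracle, no ground-truth label is needed at inference."""

    def __init__(self, dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=dim ** -0.5, size=(dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=hidden ** -0.5, size=(hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, tokens):
        h = np.maximum(tokens @ self.w1 + self.b1, 0.0)  # ReLU hidden layer
        return (h @ self.w2 + self.b2).squeeze(-1)       # (N,) per-token scores

def select_tokens(tokens, scorer, keep_frac=0.25):
    """Keep the top-scoring fraction of tokens, preserving order."""
    scores = scorer(tokens)
    k = max(1, int(round(keep_frac * len(tokens))))
    idx = np.sort(np.argsort(scores)[::-1][:k])
    return tokens[idx]
```

The selected subset would then be fed to the (expensive) transformer backbone, so the GFLOPs saved scale with the fraction of tokens dropped.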
- There are easier and harder videos: thinner tails in the token-value histogram indicate easier-to-classify videos, which require fewer tokens to maintain accuracy
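One way to operationalize "easy videos need fewer tokens" (a hypothetical scheme, not necessarily the paper's mechanism): pick the smallest token budget whose cumulative value covers a target fraction of the total, so skewed (easy) videos get small budgets and flat (hard) videos get large ones.

```python
import numpy as np

def adaptive_budget(scores, coverage=0.9):
    """Smallest number of tokens whose summed value reaches `coverage`
    of the total value; thin-tailed (easy) videos need fewer tokens."""
    s = np.sort(scores)[::-1]
    cum = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cum, coverage) + 1)

easy = np.array([8.0, 2.0, 0.5, 0.25, 0.25])  # value concentrated in a few tokens
hard = np.ones(5)                              # value spread evenly
```

With these toy scores, `adaptive_budget(easy)` is far smaller than `adaptive_budget(hard)`, matching the observation that easier videos tolerate more aggressive token dropping.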