Title: Principles of Visual Tokens for Efficient Video Understanding
Authors: Xinyue Hao, Gen Li, Shreyank N Gowda, Robert B Fisher, Jonathan Huang, Anurag Arnab, Laura Sevilla-Lara
Published: 20th November 2024 (Wednesday) @ 14:09:47
Link: http://arxiv.org/abs/2411.13626v1
Abstract
Video understanding has made huge strides in recent years, relying largely on the power of the transformer architecture. As this architecture is notoriously expensive and video is highly redundant, research into improving efficiency has become particularly relevant. This has led to many creative solutions, including token merging and token selection. While most methods succeed in reducing the cost of the model and maintaining accuracy, an interesting pattern arises: most methods do not outperform the random sampling baseline. In this paper we take a closer look at this phenomenon and make several observations. First, we develop an oracle for the value of tokens which exposes a clear Pareto distribution where most tokens have remarkably low value, and just a few carry most of the perceptual information. Second, we analyze why this oracle is extremely hard to learn, as it does not consistently coincide with visual cues. Third, we observe that easy videos need fewer tokens to maintain accuracy. We build on these and further insights to propose a lightweight video model we call LITE that can select a small number of tokens effectively, outperforming state-of-the-art and existing baselines across datasets (Kinetics400 and Something-Something-V2) in the challenging trade-off of computation (GFLOPs) vs accuracy.
- Using a subset of "good" tokens can be better than using all tokens. First, we design an oracle that estimates the value of each token for a particular task; concretely, we take Action Classification as our testbed. This oracle is created such that the value of each input token corresponds to its gradient [32]. We refer to it as an oracle because it uses the ground-truth label of the class. Given this oracle, we can now sample a subset of the tokens according to their gradient value, keeping those with the highest values.
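A minimal sketch of such a gradient-based token oracle, assuming a toy linear classifier over mean-pooled token embeddings (the paper's actual backbone and saliency definition [32] may differ; `token_oracle_scores` and all shapes here are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def token_oracle_scores(tokens, W, label):
    """Score each token by gradient-times-input saliency w.r.t. the
    ground-truth class (a hypothetical stand-in for the paper's oracle).

    tokens: (N, D) token embeddings for one video
    W:      (C, D) weights of a toy linear classifier on the mean-pooled video
    label:  ground-truth class index -- this is what makes it an "oracle"
    """
    n = tokens.shape[0]
    logits = W @ tokens.mean(axis=0)     # (C,) logits from mean-pooled tokens
    p = softmax(logits)
    y = np.zeros_like(p)
    y[label] = 1.0
    grad = W.T @ (p - y) / n             # dLoss/dtoken_i (shared under mean pooling)
    return np.abs(tokens @ grad)         # gradient-x-input varies per token

def keep_top_tokens(tokens, scores, keep_frac=0.25):
    """Keep only the highest-scoring fraction of tokens."""
    k = max(1, int(round(keep_frac * len(scores))))
    idx = np.argsort(scores)[::-1][:k]
    return tokens[np.sort(idx)]          # preserve original token order
```

In this sketch the gradient is identical across tokens (a consequence of mean pooling), so the per-token variation comes from the gradient-times-input product, one common saliency formulation.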
- A small number of tokens carries most of the "value"
- Value of tokens is Pareto distributed
- This explains the "random drop" baseline: since most tokens carry almost no value, dropping tokens at random loses little information
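The Pareto claim can be illustrated with synthetic token values (the shape parameter below is illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw 10,000 synthetic token "values" from a heavy-tailed Pareto
# distribution (alpha ~ 1.16 gives the classic 80/20 shape).
values = rng.pareto(1.16, size=10_000)

values_sorted = np.sort(values)[::-1]
top20 = values_sorted[: len(values) // 5].sum() / values.sum()
print(f"share of total value held by the top 20% of tokens: {top20:.2f}")
```

Under such a distribution the top fifth of tokens typically carries the bulk of the total value, which is why removing tokens uniformly at random is nearly free: most draws are tiny.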
- Filtering/selecting tokens based on human-aligned cues (foreground objects, high-attention regions, etc.) all perform worse than the random baseline
- Best method: train a Lightweight Token Elector network (LITE) to select tokens
- MLP network
- from abstract: âoutperforming state-of-the-art and existing baselines across datasets (Kinetics400 and Something-Something-V2) in the challenging trade-off of computation (GFLOPs) vs accuracyâ
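The selector might be sketched as a tiny per-token MLP scorer (the architecture, sizes, and selection rule below are assumptions for illustration; the paper trains LITE to approximate the oracle):

```python
import numpy as np

class TinyTokenScorer:
    """Hypothetical per-token MLP: embedding (D,) -> scalar keep-score.
    Unlike the gradient oracle, no ground-truth label is needed at inference."""

    def __init__(self, dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=dim ** -0.5, size=(dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=hidden ** -0.5, size=(hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, tokens):
        h = np.maximum(tokens @ self.w1 + self.b1, 0.0)  # ReLU hidden layer
        return (h @ self.w2 + self.b2).squeeze(-1)       # (N,) per-token scores

def select_tokens(tokens, scorer, keep_frac=0.25):
    """Keep the top-scoring fraction of tokens, preserving order."""
    scores = scorer(tokens)
    k = max(1, int(round(keep_frac * len(tokens))))
    idx = np.sort(np.argsort(scores)[::-1][:k])
    return tokens[idx]
```

The selected subset would then be fed to the (expensive) transformer backbone, so the GFLOPs saved scale with the fraction of tokens dropped.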
- There are easier and harder videos: thinner tails in the token-value histogram indicate easier-to-classify videos, which require fewer tokens to maintain accuracy
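One way to operationalize "easy videos need fewer tokens" (a hypothetical scheme, not necessarily the paper's mechanism): pick the smallest token budget whose cumulative value covers a target fraction of the total, so skewed (easy) videos get small budgets and flat (hard) videos get large ones.

```python
import numpy as np

def adaptive_budget(scores, coverage=0.9):
    """Smallest number of tokens whose summed value reaches `coverage`
    of the total value; thin-tailed (easy) videos need fewer tokens."""
    s = np.sort(scores)[::-1]
    cum = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cum, coverage) + 1)

easy = np.array([8.0, 2.0, 0.5, 0.25, 0.25])  # value concentrated in a few tokens
hard = np.ones(5)                              # value spread evenly
```

With these toy scores, `adaptive_budget(easy)` is far smaller than `adaptive_budget(hard)`, matching the observation that easier videos tolerate more aggressive token dropping.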