🪴 Anil's Garden

CAT: Content-Adaptive Image Tokenization

19 Dec 2025 · 1 min read

paper
annotated

Title: CAT: Content-Adaptive Image Tokenization
Authors: Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou
Published: 6th January 2025 (Monday) @ 16:28:47
Link: http://arxiv.org/abs/2501.03120v1

Abstract

Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT demonstrates robust performance in image reconstruction. We also utilize its variable-length latent representations to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same flops and boosts the inference throughput by 18.5%.
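
To make the token-allocation idea concrete for myself, here is a minimal sketch of how a per-image compression ratio translates into a latent token count. It is an illustration under stated assumptions, not the paper's method: the abstract says complexity is predicted by an LLM from captions, whereas this sketch simply takes a complexity score in [0, 1] as given. The function names, the ratio set (32, 16, 8), and the thresholds are all hypothetical.

```python
# Hypothetical sketch: the ratio set and thresholds below are illustrative
# assumptions, not values taken from the CAT paper.

def pick_compression_ratio(complexity, ratios=(32, 16, 8), thresholds=(0.33, 0.66)):
    """Map a complexity score in [0, 1] to a spatial downsampling factor.

    Simpler images (low score) get the larger factor, hence fewer tokens.
    """
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must lie in [0, 1]")
    for ratio, threshold in zip(ratios, thresholds):
        if complexity < threshold:
            return ratio
    # Scores at or above the last threshold get the smallest factor.
    return ratios[-1]


def latent_token_count(height, width, ratio):
    """Number of latent positions when each side is downsampled by `ratio`."""
    if height % ratio or width % ratio:
        raise ValueError("image sides must be divisible by the ratio")
    return (height // ratio) * (width // ratio)


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        ratio = pick_compression_ratio(score)
        tokens = latent_token_count(256, 256, ratio)
        print(f"complexity={score:.1f} -> {ratio}x -> {tokens} tokens")
```

Under these assumed settings a 256×256 image costs 64, 256, or 1024 latent tokens depending on its score, which is where any compute saving for simple images would come from.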

Found thanks to this tweet from Simone Scardapane, retweeted by Ishan Misra.

Made me think of "Principles of Visual Tokens for Efficient Video Understanding".
