Title: Visual Prompt Tuning
Authors: Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, Ser-Nam Lim
Published: 23rd March 2022 (Wednesday) @ 01:17:16
Link: http://arxiv.org/abs/2203.12119v2

Abstract

The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small amount (less than 1% of model parameters) of trainable parameters in the input space while keeping the model backbone frozen. Via extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains compared to other parameter-efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while reducing per-task storage cost.


Very much builds on top of Prefix-Tuning: Optimizing Continuous Prompts for Generation (arXived in Jan. '21; this paper in March '22)

Visual-Prompt Tuning (VPT; §3.2; p. 5)

There are two methods:

  1. VPT-Shallow - first Transformer layer only
  2. VPT-Deep - prompts introduced at every Transformer layer

Quoting section from paper:

Given a pre-trained Transformer model, we introduce a set of $p$ continuous embeddings of dimension $d$, i.e., prompts, in the input space after the Embed layer. Only the task-specific prompts are updated during fine-tuning, while the Transformer backbone is kept frozen. Depending on the number of Transformer layers involved, our approach has two variants, VPT-Shallow and VPT-Deep, as shown in Fig. 2.

VPT-Shallow. Prompts are inserted into the first Transformer layer $L_1$ only. Each prompt token is a learnable $d$-dimensional vector. A collection of $p$ prompts is denoted as $P = \{p^k \in \mathbb{R}^d \mid 1 \le k \le p\}$; the shallow-prompted ViT is:

$$[x_1, Z_1, E_1] = L_1([x_0, P, E_0])$$
$$[x_i, Z_i, E_i] = L_i([x_{i-1}, Z_{i-1}, E_{i-1}]), \quad i = 2, 3, \dots, N$$
$$y = \mathrm{Head}(x_N)$$

where $Z_i \in \mathbb{R}^{p \times d}$ represents the features computed by the $i$-th Transformer layer, and $[x_i, Z_i, E_i] \in \mathbb{R}^{(1+p+m) \times d}$ (following the paper's §3.1 notation, $E_i$ are the $m$ patch embeddings and $x_i$ the [CLS] token at layer $i$). In the paper's equations, color-coding distinguishes the learnable parameters ($P$ and the head) from the frozen ones. Notably for ViT, $x_i$ is invariant to the location of prompts since they are inserted after positional encoding, e.g., $[x_0, P, E_0]$ and $[P, x_0, E_0]$ are mathematically equivalent. This also applies to VPT-Deep.
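
To make the shallow variant concrete, here is a minimal PyTorch-style sketch (not the authors' released implementation): the backbone attributes `embed` and `blocks`, the prompt count, and the head size are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class VPTShallow(nn.Module):
    """Sketch of VPT-Shallow: p learnable prompt tokens are prepended after
    the (position-encoded) patch embeddings; only the prompts and the
    classification head receive gradients."""

    def __init__(self, backbone, num_prompts=50, embed_dim=768, num_classes=100):
        super().__init__()
        self.backbone = backbone                # assumed to expose .embed and .blocks
        for param in self.backbone.parameters():
            param.requires_grad = False         # freeze the pre-trained ViT

        # P in R^{p x d}; the paper reports xavier uniform initialization
        self.prompts = nn.Parameter(torch.empty(1, num_prompts, embed_dim))
        nn.init.xavier_uniform_(self.prompts)

        self.head = nn.Linear(embed_dim, num_classes)   # task-specific head

    def forward(self, x):
        tokens = self.backbone.embed(x)                 # [x_0, E_0]: (B, 1 + m, d)
        B = tokens.shape[0]
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        prompts = self.prompts.expand(B, -1, -1)        # (B, p, d)

        # [x_0, P, E_0] goes into layer 1; remaining layers run unchanged
        z = torch.cat([cls_tok, prompts, patches], dim=1)
        for blk in self.backbone.blocks:
            z = blk(z)

        return self.head(z[:, 0])                       # y = Head(x_N)
```

Only `self.prompts` and `self.head` would be handed to the optimizer; everything inside `backbone` stays frozen.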

VPT-Deep. Prompts are introduced at every Transformer layer's input space. For the $(i{+}1)$-th layer $L_{i+1}$, we denote the collection of input learnable prompts as $P_i = \{p_i^k \in \mathbb{R}^d \mid 1 \le k \le p\}$. The deep-prompted ViT is formulated as:

$$[x_i, \_\,, E_i] = L_i([x_{i-1}, P_{i-1}, E_{i-1}]), \quad i = 1, 2, \dots, N$$
$$y = \mathrm{Head}(x_N)$$

i.e., the outputs at the prompt positions of each layer are discarded and replaced by the next layer's own learnable prompts.
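
A corresponding sketch for the deep variant, under the same illustrative assumptions as above, swaps in a fresh set of prompts before every layer:

```python
import torch
import torch.nn as nn

class VPTDeep(nn.Module):
    """Sketch of VPT-Deep: a separate set of p learnable prompts per layer.
    The prompt-position outputs of layer i are dropped and replaced by P_i
    before layer i+1, matching the [x_i, _, E_i] formulation above."""

    def __init__(self, backbone, num_prompts=50, embed_dim=768, num_classes=100):
        super().__init__()
        self.backbone = backbone                # assumed to expose .embed and .blocks
        for param in self.backbone.parameters():
            param.requires_grad = False

        num_layers = len(self.backbone.blocks)  # N
        self.prompts = nn.Parameter(torch.empty(num_layers, num_prompts, embed_dim))
        nn.init.xavier_uniform_(self.prompts)
        self.head = nn.Linear(embed_dim, num_classes)
        self.num_prompts = num_prompts

    def forward(self, x):
        tokens = self.backbone.embed(x)                  # (B, 1 + m, d)
        B = tokens.shape[0]
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]

        z = None
        for i, blk in enumerate(self.backbone.blocks):
            prompts_i = self.prompts[i].expand(B, -1, -1)    # P for this layer
            if z is None:
                z = torch.cat([cls_tok, prompts_i, patches], dim=1)
            else:
                # drop the previous layer's prompt outputs, splice in new prompts
                z = torch.cat([z[:, :1], prompts_i, z[:, 1 + self.num_prompts:]], dim=1)
            z = blk(z)

        return self.head(z[:, 0])
```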

Storing Visual Prompts. VPT is beneficial in the presence of multiple downstream tasks. We only need to store the learned prompts and the classification head for each task and re-use the original copy of the pre-trained Transformer model, significantly reducing the storage cost. For instance, a ViT-Base backbone has roughly 86 million (M) parameters, with $d = 768$ and $N = 12$ layers; with (for example) $p = 50$ prompts, VPT-Shallow adds $p \times d \approx 0.038$M parameters and VPT-Deep adds $N \times p \times d \approx 0.46$M, i.e., only about 0.04% and 0.5% of the ViT-Base parameters, respectively.
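
As a quick sanity check on those figures, the arithmetic under the stated ViT-Base assumptions works out as:

```python
# Back-of-the-envelope parameter counts for the storage argument above.
# Assumes ViT-Base: ~86M backbone parameters, d = 768, N = 12 layers,
# and p = 50 prompts per insertion point (illustrative example).
backbone_params = 86_000_000
d, N, p = 768, 12, 50

shallow_extra = p * d        # VPT-Shallow: one prompt set   = 38,400  (~0.038M)
deep_extra    = N * p * d    # VPT-Deep: one set per layer   = 460,800 (~0.46M)

print(f"shallow: {shallow_extra / backbone_params:.2%} of the backbone")  # ~0.04%
print(f"deep:    {deep_extra / backbone_params:.2%} of the backbone")     # ~0.5%
```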