Title: Visual Prompt Tuning
Authors: Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, Ser-Nam Lim
Published: 23rd March 2022 (Wednesday) @ 01:17:16
Link: http://arxiv.org/abs/2203.12119v2

Abstract

The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small amount (less than 1% of model parameters) of trainable parameters in the input space while keeping the model backbone frozen. Via extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains compared to other parameter-efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while reducing per-task storage cost.


Very much builds on top of Prefix-Tuning: Optimizing Continuous Prompts for Generation (arXived in Jan. '21; this paper in March '22)

Visual-Prompt Tuning (VPT; §3.2; p. 5)

There are two methods:

  1. VPT-Shallow - first Transformer layer only
  2. VPT-Deep - prompts introduced at every Transformer layer

Quoting section from paper:

Given a pre-trained Transformer model, we introduce a set of $p$ continuous embeddings of dimension $d$, i.e., prompts, in the input space after the Embed layer. Only the task-specific prompts are updated during fine-tuning, while the Transformer backbone is kept frozen. Depending on the number of Transformer layers involved, our approach has two variants, VPT-Shallow and VPT-Deep, as shown in Fig. 2.

VPT-Shallow. Prompts are inserted into the first Transformer layer $L_1$ only. Each prompt token is a learnable $d$-dimensional vector. A collection of $p$ prompts is denoted as $P = \{p^k \in \mathbb{R}^d \mid 1 \le k \le p\}$; the shallow-prompted ViT is:

$$[x_1, Z_1, E_1] = L_1([x_0, P, E_0])$$
$$[x_i, Z_i, E_i] = L_i([x_{i-1}, Z_{i-1}, E_{i-1}]), \quad i = 2, 3, \dots, N$$
$$y = \mathrm{Head}(x_N)$$

where $Z_i \in \mathbb{R}^{p \times d}$ represents the features computed by the $i$-th Transformer layer, and $[x_i, Z_i, E_i] \in \mathbb{R}^{(1+p+m) \times d}$ (following the paper's §3.1 notation, $E_i$ are the $m$ patch embeddings and $x_i$ the [CLS] token at layer $i$). In the paper's equations, color-coding distinguishes the learnable parameters ($P$ and the head) from the frozen ones. Notably for ViT, $x_i$ is invariant to the location of prompts since they are inserted after positional encoding, e.g., $[x_0, P, E_0]$ and $[P, x_0, E_0]$ are mathematically equivalent. This also applies to VPT-Deep.
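
To make the shallow variant concrete, here is a minimal PyTorch-style sketch (not the authors' released implementation): the backbone attributes `embed` and `blocks`, the prompt count, and the head size are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class VPTShallow(nn.Module):
    """Sketch of VPT-Shallow: p learnable prompt tokens are prepended after
    the (position-encoded) patch embeddings; only the prompts and the
    classification head receive gradients."""

    def __init__(self, backbone, num_prompts=50, embed_dim=768, num_classes=100):
        super().__init__()
        self.backbone = backbone                # assumed to expose .embed and .blocks
        for param in self.backbone.parameters():
            param.requires_grad = False         # freeze the pre-trained ViT

        # P in R^{p x d}; the paper reports xavier uniform initialization
        self.prompts = nn.Parameter(torch.empty(1, num_prompts, embed_dim))
        nn.init.xavier_uniform_(self.prompts)

        self.head = nn.Linear(embed_dim, num_classes)   # task-specific head

    def forward(self, x):
        tokens = self.backbone.embed(x)                 # [x_0, E_0]: (B, 1 + m, d)
        B = tokens.shape[0]
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        prompts = self.prompts.expand(B, -1, -1)        # (B, p, d)

        # [x_0, P, E_0] goes into layer 1; remaining layers run unchanged
        z = torch.cat([cls_tok, prompts, patches], dim=1)
        for blk in self.backbone.blocks:
            z = blk(z)

        return self.head(z[:, 0])                       # y = Head(x_N)
```

Only `self.prompts` and `self.head` would be handed to the optimizer; everything inside `backbone` stays frozen.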

VPT-Deep. Prompts are introduced at every Transformer layer's input space. For the $(i{+}1)$-th layer $L_{i+1}$, we denote the collection of input learnable prompts as $P_i = \{p_i^k \in \mathbb{R}^d \mid 1 \le k \le p\}$. The deep-prompted ViT is formulated as:

$$[x_i, \_\,, E_i] = L_i([x_{i-1}, P_{i-1}, E_{i-1}]), \quad i = 1, 2, \dots, N$$
$$y = \mathrm{Head}(x_N)$$

i.e., the outputs at the prompt positions of each layer are discarded and replaced by the next layer's own learnable prompts.
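
A corresponding sketch for the deep variant, under the same illustrative assumptions as above, swaps in a fresh set of prompts before every layer:

```python
import torch
import torch.nn as nn

class VPTDeep(nn.Module):
    """Sketch of VPT-Deep: a separate set of p learnable prompts per layer.
    The prompt-position outputs of layer i are dropped and replaced by P_i
    before layer i+1, matching the [x_i, _, E_i] formulation above."""

    def __init__(self, backbone, num_prompts=50, embed_dim=768, num_classes=100):
        super().__init__()
        self.backbone = backbone                # assumed to expose .embed and .blocks
        for param in self.backbone.parameters():
            param.requires_grad = False

        num_layers = len(self.backbone.blocks)  # N
        self.prompts = nn.Parameter(torch.empty(num_layers, num_prompts, embed_dim))
        nn.init.xavier_uniform_(self.prompts)
        self.head = nn.Linear(embed_dim, num_classes)
        self.num_prompts = num_prompts

    def forward(self, x):
        tokens = self.backbone.embed(x)                  # (B, 1 + m, d)
        B = tokens.shape[0]
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]

        z = None
        for i, blk in enumerate(self.backbone.blocks):
            prompts_i = self.prompts[i].expand(B, -1, -1)    # P for this layer
            if z is None:
                z = torch.cat([cls_tok, prompts_i, patches], dim=1)
            else:
                # drop the previous layer's prompt outputs, splice in new prompts
                z = torch.cat([z[:, :1], prompts_i, z[:, 1 + self.num_prompts:]], dim=1)
            z = blk(z)

        return self.head(z[:, 0])
```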

Storing Visual Prompts. VPT is beneficial in the presence of multiple downstream tasks. We only need to store the learned prompts and the classification head for each task and re-use the original copy of the pre-trained Transformer model, significantly reducing the storage cost. For instance, a ViT-Base backbone has roughly 86 million (M) parameters, with $d = 768$ and $N = 12$ layers; with (for example) $p = 50$ prompts, VPT-Shallow adds $p \times d \approx 0.038$M parameters and VPT-Deep adds $N \times p \times d \approx 0.46$M, i.e., only about 0.04% and 0.5% of the ViT-Base parameters, respectively.
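
As a quick sanity check on those figures, the arithmetic under the stated ViT-Base assumptions works out as:

```python
# Back-of-the-envelope parameter counts for the storage argument above.
# Assumes ViT-Base: ~86M backbone parameters, d = 768, N = 12 layers,
# and p = 50 prompts per insertion point (illustrative example).
backbone_params = 86_000_000
d, N, p = 768, 12, 50

shallow_extra = p * d        # VPT-Shallow: one prompt set   = 38,400  (~0.038M)
deep_extra    = N * p * d    # VPT-Deep: one set per layer   = 460,800 (~0.46M)

print(f"shallow: {shallow_extra / backbone_params:.2%} of the backbone")  # ~0.04%
print(f"deep:    {deep_extra / backbone_params:.2%} of the backbone")     # ~0.5%
```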