Title: Autoregressive Image Generation using Residual Quantization
Authors: Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, Wook-Shin Han
Published: 3rd March 2022 (Thursday) @ 11:44:46
Link: http://arxiv.org/abs/2203.01941v2
Abstract
For autoregressive (AR) modeling of high-resolution images, vector quantization (VQ) represents an image as a sequence of discrete codes. A short sequence length is important for an AR model to reduce its computational costs to consider long-range interactions of codes. However, we postulate that previous VQ cannot shorten the code sequence and generate high-fidelity images together in terms of the rate-distortion trade-off. In this study, we propose the two-stage framework, which consists of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer, to effectively generate high-resolution images. Given a fixed codebook size, RQ-VAE can precisely approximate a feature map of an image and represent the image as a stacked map of discrete codes. Then, RQ-Transformer learns to predict the quantized feature vector at the next position by predicting the next stack of codes. Thanks to the precise approximation of RQ-VAE, we can represent a 256×256 image as an 8×8 resolution feature map, and RQ-Transformer can efficiently reduce the computational costs. Consequently, our framework outperforms the existing AR models on various benchmarks of unconditional and conditional image generation. Our approach also has a significantly faster sampling speed than previous AR models to generate high-quality images.
- Question: what is “exposure bias”? Answer: see Sequence Level Training with Recurrent Neural Networks (Ranzato et al., 2015).
Notes
Sharing the codebook $\mathcal{C}$ across quantization layers allows RQ with depth $D$ to partition the input space into up to $K^D$ clusters, cf. $K$ clusters for vector quantization with the same number of codes and embeddings (and hence the same memory footprint).
We remark that RQ can more precisely approximate a vector than VQ when their codebook sizes are the same.
While VQ partitions the entire vector space $\mathbb{R}^{n_z}$ into $K$ clusters, RQ with depth $D$ partitions the vector space into $K^D$ clusters at most. That is, RQ with depth $D$ has the same partition capacity as VQ with $K^D$ codes. Thus, we can increase $D$ for RQ to replace VQ with an exponentially growing codebook.
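A minimal NumPy sketch of this shared-codebook residual quantization (names like `residual_quantize` are mine, not from the paper's released code). The $D$ rounds of nearest-neighbor lookup on the running residual are what give the $K^D$ partition capacity from a single size-$K$ codebook:

```python
import numpy as np

def residual_quantize(z, codebook, depth):
    """Approximate vector z with `depth` codes from one shared codebook.

    z:        (n,) feature vector to approximate
    codebook: (K, n) code embeddings, shared across all quantization layers
    depth:    number of quantization rounds D
    Returns the D code indices and the cumulative approximation of z.
    """
    residual = z.copy()
    approx = np.zeros_like(z)
    codes = []
    for _ in range(depth):
        # nearest code embedding to the current residual
        k = int(np.argmin(((codebook - residual) ** 2).sum(axis=1)))
        codes.append(k)
        approx += codebook[k]              # refine the approximation
        residual = residual - codebook[k]  # quantize what remains next round
    return codes, approx

# toy usage: K=256 codes, n=64 dims, D=4 -> up to 256**4 clusters
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))
z = rng.normal(size=64)
codes, z_hat = residual_quantize(z, codebook, depth=4)
print(codes, np.linalg.norm(z - z_hat))
```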
Modelling the residual-quantized codes constitutes an autoregressive factorization, as described in §3.2.1 AR Modeling for Codes with Depth $D$:
RQ-VAE
- extracts a code map $\mathbf{M} \in [K]^{H \times W \times D}$ of an image, whose spatial positions follow the raster-scan order [34]
- rearranges the spatial indices of $\mathbf{M}$ in raster-scan order into a 2D array of codes $\mathbf{S} \in [K]^{T \times D}$, where $T = HW$. That is, $\mathbf{S}_t$, the $t$-th row of $\mathbf{S}$, contains $D$ codes as $\mathbf{S}_t = (S_{t1}, \ldots, S_{tD})$
Regarding $\mathbf{S}$ as discrete latent variables of an image, AR models learn $p(\mathbf{S})$, which is autoregressively factorized as $p(\mathbf{S}) = \prod_{t=1}^{T} \prod_{d=1}^{D} p(S_{td} \mid \mathbf{S}_{<t},\, S_{t,<d})$.
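A quick sketch of this rearrangement and the code-visiting order the factorization implies (shapes as above; the loop body is only illustrative):

```python
import numpy as np

# toy code map M of shape (H, W, D) with codes in [0, K)
H, W, D, K = 8, 8, 4, 256
rng = np.random.default_rng(0)
M = rng.integers(0, K, size=(H, W, D))

# raster-scan rearrangement: S has shape (T, D) with T = H * W,
# and S[t] stacks the D codes of the t-th spatial position
S = M.reshape(H * W, D)
t = 10
assert np.array_equal(S[t], M[t // W, t % W])  # row-major = raster scan

# the AR factorization visits codes position by position, depth by depth:
# p(S) = prod_t prod_d p(S[t, d] | S[<t, :], S[t, <d])
for t in range(H * W):
    for d in range(D):
        context = (S[:t, :], S[t, :d])  # everything generated before S[t, d]
        # ... predict S[t, d] from context ...
```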
§3.2.2 RQ-Transformer Architecture: RQ-Transformer consists of a spatial transformer and a depth transformer
- The spatial transformer: a stack of masked self-attention blocks that extracts a context vector summarizing the information at previous positions
- its input is defined as $\mathbf{u}_t = \mathrm{PE}_T(t) + \sum_{d=1}^{D} \mathbf{e}(S_{t-1,d})$ for $t > 1$, where $\mathrm{PE}_T(t)$ is a positional embedding for spatial position $t$ in the raster-scan order. Note that the sum of the $D$ code embeddings recovers the quantized feature vector at position $t-1$.
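A NumPy sketch of building these inputs, assuming the formula above; the handling of $\mathbf{u}_1$ is my stand-in (the paper uses a learned start-of-sequence embedding there):

```python
import numpy as np

# toy dimensions: T spatial positions, depth D, K codes, embedding size n
T, D, K, n = 64, 4, 256, 128
rng = np.random.default_rng(0)
code_emb = rng.normal(size=(K, n))   # e(.), shared code embedding table
pos_emb_T = rng.normal(size=(T, n))  # PE_T(t), spatial positional embeddings
S = rng.integers(0, K, size=(T, D))  # code array from RQ-VAE

# u_t = PE_T(t) + sum_d e(S[t-1, d]) for t > 1
u = np.zeros((T, n))
u[0] = pos_emb_T[0]  # stand-in; the paper uses a learned start token for u_1
for t in range(1, T):
    # summing the D code embeddings reconstructs the quantized feature z_hat
    u[t] = pos_emb_T[t] + code_emb[S[t - 1]].sum(axis=0)
```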
Note: in the LaTeX above I put { inside a text field, like \text{\{}, because Obsidian was not displaying it correctly (it rendered { as e; the closing bracket } was fine).