Title: Autoregressive Image Generation using Residual Quantization
Authors: Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, Wook-Shin Han
Published: 3rd March 2022 (Thursday) @ 11:44:46
Link: http://arxiv.org/abs/2203.01941v2
Abstract
For autoregressive (AR) modeling of high-resolution images, vector quantization (VQ) represents an image as a sequence of discrete codes. A short sequence length is important for an AR model to reduce its computational costs to consider long-range interactions of codes. However, we postulate that previous VQ cannot shorten the code sequence and generate high-fidelity images together in terms of the rate-distortion trade-off. In this study, we propose the two-stage framework, which consists of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer, to effectively generate high-resolution images. Given a fixed codebook size, RQ-VAE can precisely approximate a feature map of an image and represent the image as a stacked map of discrete codes. Then, RQ-Transformer learns to predict the quantized feature vector at the next position by predicting the next stack of codes. Thanks to the precise approximation of RQ-VAE, we can represent a 256×256 image as an 8×8 resolution feature map, and RQ-Transformer can efficiently reduce the computational costs. Consequently, our framework outperforms the existing AR models on various benchmarks of unconditional and conditional image generation. Our approach also has a significantly faster sampling speed than previous AR models to generate high-quality images.
- Question: what is “exposure bias”? Answer: see Sequence Level Training with Recurrent Neural Networks (Ranzato et al., 2015).
Notes
Sharing the codebook $\mathcal{C}$ across quantization layers allows RQ with depth $D$ to partition the input space into up to $K^D$ clusters, cf. $K$ clusters for vector quantization with the same number of codes and embeddings (and hence the same memory footprint).
We remark that RQ can more precisely approximate a vector than VQ when their codebook sizes are the same.
While VQ partitions the entire vector space $\mathbb{R}^{n_z}$ into $K$ clusters, RQ with depth $D$ partitions the vector space into $K^D$ clusters at most. That is, RQ with depth $D$ has the same partition capacity as VQ with $K^D$ codes. Thus, we can increase $D$ for RQ to replace VQ with an exponentially growing codebook.
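A minimal NumPy sketch of this shared-codebook residual quantization (names like `residual_quantize` are mine, not from the paper's released code). The $D$ rounds of nearest-neighbor lookup on the running residual are what give the $K^D$ partition capacity from a single size-$K$ codebook:

```python
import numpy as np

def residual_quantize(z, codebook, depth):
    """Approximate vector z with `depth` codes from one shared codebook.

    z:        (n,) feature vector to approximate
    codebook: (K, n) code embeddings, shared across all quantization layers
    depth:    number of quantization rounds D
    Returns the D code indices and the cumulative approximation of z.
    """
    residual = z.copy()
    approx = np.zeros_like(z)
    codes = []
    for _ in range(depth):
        # nearest code embedding to the current residual
        k = int(np.argmin(((codebook - residual) ** 2).sum(axis=1)))
        codes.append(k)
        approx += codebook[k]              # refine the approximation
        residual = residual - codebook[k]  # quantize what remains next round
    return codes, approx

# toy usage: K=256 codes, n=64 dims, D=4 -> up to 256**4 clusters
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))
z = rng.normal(size=64)
codes, z_hat = residual_quantize(z, codebook, depth=4)
print(codes, np.linalg.norm(z - z_hat))
```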
Modelling the residual-quantized codes constitutes an autoregressive factorization, as described in §3.2.1 AR Modeling for Codes with Depth $D$:
RQ-VAE
- extracts a code map $\mathbf{M} \in [K]^{H \times W \times D}$ of an image, whose spatial positions follow the raster-scan order [34]
- rearranges the spatial indices of $\mathbf{M}$ in raster-scan order into a 2D array of codes $\mathbf{S} \in [K]^{T \times D}$, where $T = HW$. That is, $\mathbf{S}_t$, the $t$-th row of $\mathbf{S}$, contains $D$ codes as $\mathbf{S}_t = (S_{t1}, \ldots, S_{tD})$
Regarding $\mathbf{S}$ as discrete latent variables of an image, AR models learn $p(\mathbf{S})$, which is autoregressively factorized as $p(\mathbf{S}) = \prod_{t=1}^{T} \prod_{d=1}^{D} p(S_{td} \mid \mathbf{S}_{<t},\, S_{t,<d})$.
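A quick sketch of this rearrangement and the code-visiting order the factorization implies (shapes as above; the loop body is only illustrative):

```python
import numpy as np

# toy code map M of shape (H, W, D) with codes in [0, K)
H, W, D, K = 8, 8, 4, 256
rng = np.random.default_rng(0)
M = rng.integers(0, K, size=(H, W, D))

# raster-scan rearrangement: S has shape (T, D) with T = H * W,
# and S[t] stacks the D codes of the t-th spatial position
S = M.reshape(H * W, D)
t = 10
assert np.array_equal(S[t], M[t // W, t % W])  # row-major = raster scan

# the AR factorization visits codes position by position, depth by depth:
# p(S) = prod_t prod_d p(S[t, d] | S[<t, :], S[t, <d])
for t in range(H * W):
    for d in range(D):
        context = (S[:t, :], S[t, :d])  # everything generated before S[t, d]
        # ... predict S[t, d] from context ...
```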
§3.2.2 RQ-Transformer Architecture: RQ-Transformer consists of a spatial transformer and a depth transformer
- The spatial transformer: a stack of masked self-attention blocks that extracts a context vector summarizing the information at previous positions
- its input is defined as $\mathbf{u}_t = \mathrm{PE}_T(t) + \sum_{d=1}^{D} \mathbf{e}(S_{t-1,d})$ for $t > 1$, where $\mathrm{PE}_T(t)$ is a positional embedding for spatial position $t$ in the raster-scan order. Note that the sum of the $D$ code embeddings recovers the quantized feature vector at position $t-1$.
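A NumPy sketch of building these inputs, assuming the formula above; the handling of $\mathbf{u}_1$ is my stand-in (the paper uses a learned start-of-sequence embedding there):

```python
import numpy as np

# toy dimensions: T spatial positions, depth D, K codes, embedding size n
T, D, K, n = 64, 4, 256, 128
rng = np.random.default_rng(0)
code_emb = rng.normal(size=(K, n))   # e(.), shared code embedding table
pos_emb_T = rng.normal(size=(T, n))  # PE_T(t), spatial positional embeddings
S = rng.integers(0, K, size=(T, D))  # code array from RQ-VAE

# u_t = PE_T(t) + sum_d e(S[t-1, d]) for t > 1
u = np.zeros((T, n))
u[0] = pos_emb_T[0]  # stand-in; the paper uses a learned start token for u_1
for t in range(1, T):
    # summing the D code embeddings reconstructs the quantized feature z_hat
    u[t] = pos_emb_T[t] + code_emb[S[t - 1]].sum(axis=0)
```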
Note: in the LaTeX above I put { inside a text field, like \text{\{}, because Obsidian was not displaying it correctly (it rendered { as e; the closing bracket } was fine).