Title: Neural Discrete Representation Learning
Authors: Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu
Published: 2nd November 2017 (Thursday) @ 21:14:44
Link: http://arxiv.org/abs/1711.00937v2

Abstract

Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of “posterior collapse” — where the latents are ignored when they are paired with a powerful autoregressive decoder — typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.



Summary / Takeaways

Vector-Quantised Variational AutoEncoder (VQ-VAE)

Our model, the Vector Quantised Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static [Source: Abstract]

Language is inherently discrete; similarly, speech is typically represented as a sequence of symbols. Images can often be described concisely by language.

  • Encode image with CNN
  • Take the CNN output vectors and “snap” each one to the nearest codeword vector from a codebook
    • codewords are trainable embedding vectors
    • closeness is defined by the L2 (Euclidean) norm
    • each latent is then represented by the index of its codeword in the codebook
  • Feed the selected codeword vectors into the decoder
  • Discretisation step is not differentiable
  • Use “straight-through” gradient
  • Tricky part: during the backward pass, copy the gradients at the decoder input (the codeword vectors) straight across into the encoder output (see the sketch after this list)
    • Intuition for why this works: Codeword vectors are close to vectors from encoder

    Note that there is no real gradient defined for equation 2, however we approximate the gradient similar to the straight-through estimator [3] and just copy gradients from decoder input zq(x) to encoder output ze(x).

  • sg is the stop-gradient operator, equivalent to calling .detach() on a vector in PyTorch → the vector contributes no gradient in the backward pass
  • Reminder on VAEs: autoencoders trained with a KL-divergence term against a Gaussian prior added to the reconstruction loss (e.g. MSE for images); see the loss sketch below
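
A minimal PyTorch sketch of the quantisation step, the straight-through gradient, and the two stop-gradient loss terms described above. This is illustrative code, not the official implementation; the names (`vector_quantise`, `codebook`) and the default commitment weight `beta` are assumptions (the paper reports β = 0.25 working well, but the value here is purely illustrative).

```python
# Illustrative sketch only (assumed names), not the official VQ-VAE code.
import torch
import torch.nn.functional as F

def vector_quantise(z_e: torch.Tensor, codebook: torch.Tensor, beta: float = 0.25):
    """z_e: encoder outputs, shape (N, D); codebook: trainable codewords, shape (K, D)."""
    # Nearest codeword under the L2 norm for every encoder output vector.
    distances = torch.cdist(z_e, codebook) ** 2        # (N, K) squared distances
    indices = distances.argmin(dim=1)                  # codebook indices, shape (N,)
    z_q = codebook[indices]                            # quantised vectors, shape (N, D)

    # Codebook loss pulls codewords towards encoder outputs; the commitment loss
    # keeps the encoder committed to its chosen codeword. sg(.) becomes .detach().
    codebook_loss = F.mse_loss(z_q, z_e.detach())
    commitment_loss = beta * F.mse_loss(z_e, z_q.detach())

    # Straight-through estimator: the forward pass uses z_q, the backward pass
    # copies the decoder's gradient at z_q straight into z_e (argmin has no gradient).
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, indices, codebook_loss + commitment_loss
```

Given a codebook tensor of shape (K, D), `vector_quantise(z_e, codebook)` returns the quantised latents to feed into the decoder, the discrete indices, and the auxiliary loss to add to the reconstruction loss.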
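
For the VAE reminder above, a minimal loss sketch, assuming a Gaussian posterior parameterised by `mu` and `logvar` (illustrative names, not from the paper):

```python
# Standard VAE objective: reconstruction loss plus KL divergence to a unit Gaussian prior.
import torch
import torch.nn.functional as F

def vae_loss(x_recon, x, mu, logvar):
    recon = F.mse_loss(x_recon, x, reduction="sum")               # e.g. MSE for images
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(N(mu, sigma^2) || N(0, I))
    return recon + kl
```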

Code / Implementations

  1. VQ-VAE PyTorch Implementation by Zalando Research
  2. DeepMind official implementation (written in DeepMind’s framework, Sonnet)

Resources

Q&A

Explain the meaning of annealing in this context (it’s a quote from Neural Discrete Representation Learning by Aaron van den Oord and co):

Finally, our approach also relates to work in image compression with neural networks. Theis et. al. [36] use scalar quantisation to compress activations for lossy image compression before arithmetic encoding. Other authors [1] propose a method for similar compression model with vector quantisation.

The authors propose a continuous relaxation of vector quantisation which is annealed over time to obtain a hard clustering. In their experiments they first train an autoencoder, afterwards vector quantisation is applied to the activations of the encoder, and finally the whole network is fine tuned using the soft-to-hard relaxation with a small learning rate.

Answer [ChatGPT]

In this context, annealing refers to gradually reducing the temperature parameter in a process known as simulated annealing. Simulated annealing is an optimization technique used to find the global minimum of a function by starting with a high temperature that allows for exploration of the search space and gradually decreasing the temperature to encourage convergence to the global minimum.

In the context of the paper by van den Oord et al., annealing is used to obtain a hard clustering from a continuous relaxation of vector quantization. The continuous relaxation provides a smooth approximation of the hard clustering that is gradually made more precise over time as the annealing process reduces the temperature parameter. The end result of the annealing process is a hard clustering, which is used for image compression by quantizing the activations of the encoder. The entire network is then fine-tuned using the soft-to-hard relaxation with a small learning rate to optimize the compression performance.