In this blog post, I'll take you through two important concepts behind modern Audio AI models such as Google's AudioLM and VALL-E, Meta's AudioGen and MusicGen, Microsoft's NaturalSpeech 2, Suno's Bark, Kyutai's Moshi and Hibiki, and many more: Neural Audio Codecs and (Residual) Vector Quantization.
If you don't mind a short primer/refresher (depending on your prior knowledge) on data compression (needed before delving into the actual topics of this blog post), then just read this blog post from the beginning. Otherwise, if you are already confident with concepts like codecs and bitrate, feel free to skip over to Neural Audio Codecs.
Introduction
Have you ever wondered how multimedia files (music, videos, etc.) are efficiently stored on your PC? Have you got any clue as to how they can be transmitted in real time over the internet, for instance during videocalls? Even if you don't know the details, you've probably guessed there must be some sort of compression going on at some point. If there were no compression involved, your files would be pretty damn large (ever tried to extract audio tracks from a CD?) and internet traffic would be much heavier than it currently is.
Let's take audio files as an example. Analog audio is converted to digital form by means of Pulse-code modulation (PCM), which simply amounts to sampling the amplitude of the analog signal at regular intervals or, equivalently, with a certain sampling rate, and then quantizing said values to the nearest value within a discrete range, for example 24-bit integers. Now suppose we want to digitize the song Ode to My Family by The Cranberries (great song, I know), which has a duration of 4 minutes and 32 seconds, or equivalently 272 seconds. Employing a sampling rate of 44.1 kHz, our digitized audio will consist of 272 * 44100 = 11995200 samples, each encoded as a 24-bit integer. In this case, the size of the resulting file would be 11995200 * 24 = 287884800 bits = 35985600 bytes = over 34 MiB 🤯.
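If you want to double-check that arithmetic, here is a quick sanity check in Python (assuming a single mono channel, which is what the calculation above implicitly does):

```python
DURATION_S = 4 * 60 + 32       # Ode to My Family: 4 minutes and 32 seconds
SAMPLE_RATE_HZ = 44_100        # samples per second
BIT_DEPTH = 24                 # bits per sample
CHANNELS = 1                   # mono, matching the calculation above

n_samples = DURATION_S * SAMPLE_RATE_HZ * CHANNELS
size_bytes = n_samples * BIT_DEPTH // 8
print(n_samples)                  # 11995200
print(size_bytes)                 # 35985600
print(size_bytes / 1024 / 1024)   # ~34.3 MiB
```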
Can you imagine fitting songs that big into the kind of pocket-size music players we used to have 15+ years ago, which only had a few hundred MiB of storage? Similarly, can you imagine streaming 30-50 MiB of data for each song you listen to nowadays on Spotify when you're on a limited internet data plan of a handful of gigabytes? In practice, thanks to compression, standard music tracks seldom exceed 2-3 MiB in size and are sometimes even smaller than 1 MiB.
Now we know for a fact that multimedia files get compressed before being stored on a device or transmitted over the internet, so let's see how the compression takes place. As it turns out, we have software tools called codecs (portmanteau of coder/decoder) which serve precisely this purpose. You might not realize it, but you have been dealing with codecs all along: do MP3 and JPEG ring a bell? The former is a popular audio codec, whereas the latter is commonly used to compress images.
Compression parameters
When thinking about compression, a couple of questions arise naturally:
- How small can I make a file (without making it absolute trash)?
- How does the compressed file compare to the original file?
The two parameters that help us answer the questions above are bitrate and perceptual quality.
The bitrate refers to the number of bits required to encode a "unit" of data. For instance, in the case of audio codecs, said unit of data corresponds to 1 second of audio, hence the bitrate is expressed in bits per second (bps). In the case of image codecs, instead, a unit corresponds to 1 pixel, so the bitrate is expressed in bits per pixel (bpp). Perceptual quality, on the other hand, can be measured either with objective metrics (such as PESQ and STOI for audio) or via subjective evaluations involving human experts. A good codec aims to minimize the bitrate while maximizing the perceptual quality of the compressed data.
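To see how bitrate translates into file size, here is a rough, hypothetical back-of-the-envelope example: the same 272-second track from before, encoded at 128 kbps (a common MP3 setting; real MP3 file sizes vary with the encoder and its settings):

```python
DURATION_S = 272          # the same track as in the PCM example above
BITRATE_BPS = 128_000     # 128 kbps, a common (hypothetical) MP3 setting

size_bytes = DURATION_S * BITRATE_BPS / 8
print(size_bytes / 1024 / 1024)   # roughly 4 MiB, down from ~34 MiB uncompressed
```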
Codecs categorization
Codecs can be categorized along two orthogonal dimensions:
- lossy vs lossless
- generic vs content-aware (I made this term up)
Lossy codecs, as the name suggests, are codecs that give up part of the original information to achieve a larger compression rate. Examples of lossy codecs are MP3 for audio and JPEG for images. Lossless codecs, on the other hand, are codecs that can retain all the original information while still being able to shrink the data a little bit. Examples of lossless codecs are FLAC for audio and PNG for images.
Hands-on: proof that FLAC is a lossless audio codec (requires `ffmpeg`)
Step 1: Get a WAV file off the internet (or find one on your machine)

```sh
# Any WAV file will do: replace the URL below with one of your choosing,
# or just copy a local WAV file to original.wav
wget -O original.wav "https://example.com/some-song.wav"
```
Step 2: Compress it using `ffmpeg` and the FLAC codec

```sh
ffmpeg -i original.wav -c:a flac compressed.flac
```
Step 3: Check that the compressed file is indeed smaller than the original one

```sh
ls -lh original.wav compressed.flac
```
Step 4: Use `ffmpeg` to decompress the FLAC file back to WAV

```sh
ffmpeg -i compressed.flac decompressed.wav
```
Step 5: Compare the original and decompressed files

```sh
# Hash only the decoded audio streams, ignoring container metadata:
# the two commands should print identical MD5 values
ffmpeg -i original.wav -map 0:a -f md5 - 2>/dev/null
ffmpeg -i decompressed.wav -map 0:a -f md5 - 2>/dev/null
```
Note: the reason why we don't run `diff` on the audio files directly is that their metadata might differ. Instead, we compare the MD5 checksums of their contents (i.e. the actual audio tracks), which is what we're really interested in.
Generic codecs aim to reduce the size of the data without making any assumptions about the nature of their inputs. Content-aware codecs, instead, rely on additional assumptions about the input data that allow them to achieve a better tradeoff between bitrate and perceptual quality. For example, Speex is an audio codec specifically designed and tuned to encode/decode human speech, hence it might not work very well for, say, music.
Neural Audio Codecs
In order to achieve the best possible tradeoff between compression rate and perceptual quality of the reconstructed data, traditional audio codecs require careful design using hand-engineered Signal Processing techniques. In situations like this, it is natural to wonder whether we can have a neural network learn to perform such a complex task from data. As you have probably guessed, the answer is yes 😉
Neural audio codecs are neural networks that try to learn how to reconstruct an audio signal given a compressed representation of it. If you're familiar with Autoencoders, this problem formulation won't sound new to you. Unsurprisingly, like autoencoders, neural audio codecs are made of an encoder and a decoder. The encoder takes a raw audio waveform as input and outputs a compressed representation of it. The decoder, on the other hand, is fed that same compressed representation and is tasked with reconstructing the original audio waveform. Although quite simplistic, this description of how a neural audio codec works reveals an important fact about how such models can be trained. In particular, you might have noticed that no human supervision (i.e. labels) is needed to train neural audio codecs! As a matter of fact, the network is essentially asked to learn the identity function: given an audio waveform $x$, output a reconstructed version $\hat{x}$ such that $\hat{x} = x$ (in practice, we make it so that the two are as close as possible).
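To make this concrete, here is a minimal, purely illustrative PyTorch sketch of such an encoder/decoder pair trained with a plain reconstruction loss. Everything here (the `TinyCodec` module, layer sizes, hyperparameters) is made up for the sake of the example; real codecs like SoundStream and EnCodec add quantization plus adversarial and perceptual losses on top:

```python
import torch
import torch.nn as nn

class TinyCodec(nn.Module):
    """A heavily simplified, hypothetical neural 'codec': a 1D convolutional autoencoder."""

    def __init__(self, latent_dim: int = 64):
        super().__init__()
        # Each stride-4 convolution reduces the temporal resolution by 4x,
        # so the latent sequence is 16x shorter than the input waveform.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=8, stride=4, padding=2), nn.ELU(),
            nn.Conv1d(32, latent_dim, kernel_size=8, stride=4, padding=2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 32, kernel_size=8, stride=4, padding=2), nn.ELU(),
            nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2),
        )

    def forward(self, waveform):
        latents = self.encoder(waveform)        # (batch, latent_dim, T / 16)
        reconstruction = self.decoder(latents)  # (batch, 1, T)
        return reconstruction, latents

model = TinyCodec()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# A dummy batch of 1-second mono waveforms at 16 kHz: no labels anywhere,
# the input waveform itself is the reconstruction target.
x = torch.randn(8, 1, 16_000)
x_hat, _ = model(x)
loss = nn.functional.mse_loss(x_hat, x)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Note how the loss compares the network's output against its own input: the waveform is both the training example and the training target, which is exactly the "no labels needed" point from above.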
Soooo… basically we're out here racking our brains to train a sophisticated neural architecture to… learn the identity function?! 😅 That's right, but the point is that we're not interested in the network's output, but rather in the encoder's output, namely the learned compressed representation of the input waveform. But how do we obtain that, and what does it even look like? In this blog post, we'll learn about a procedure called Residual Vector Quantization, adopted by state-of-the-art neural audio codecs such as Google's SoundStream, Meta's EnCodec, and Kyutai's Mimi.
(Residual) Vector Quantization
Broadly speaking, quantization is the process through which a continuous representation of a signal is mapped to a discrete space. For instance, as I mentioned at the very beginning of this blog post, Pulse-code modulation is a form of quantization.
More specifically, vector quantization (VQ) is a method for quantizing a vector of real numbers by means of a so-called codebook, that is, a fixed-size collection of vectors that can be used to approximate any other vector. Formally, given a vector $x$ to be quantized and a codebook $C = \{c_1, c_2, \dots, c_N\}$, we can obtain a quantized version $\hat{x}$ of $x$ as:

$$\hat{x} = \underset{c \in C}{\operatorname{arg\,max}} \; \operatorname{sim}(x, c)$$
where $\operatorname{sim}(a, b)$ is a function measuring the similarity between vectors $a$ and $b$. For example, a very simple $\operatorname{sim}$ could be the Cosine similarity between $a$ and $b$:

$$\operatorname{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
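In code, plain VQ boils down to a nearest-neighbour lookup over the codebook. Here is a tiny, illustrative sketch using the cosine similarity above (the `vector_quantize` helper and the random codebook are made up for this example):

```python
import torch
import torch.nn.functional as F

def vector_quantize(x: torch.Tensor, codebook: torch.Tensor) -> tuple[torch.Tensor, int]:
    """Return the most similar codebook vector (by cosine similarity) and its index."""
    sims = F.cosine_similarity(x.unsqueeze(0), codebook, dim=-1)  # one similarity per codebook entry
    index = int(sims.argmax())
    return codebook[index], index

# Toy example: a random codebook with N=16 vectors of dimension D=8.
torch.manual_seed(0)
codebook = torch.randn(16, 8)
x = torch.randn(8)

x_hat, idx = vector_quantize(x, codebook)
print(idx)      # the index is all we need to store/transmit
print(x_hat)    # the (lossy) approximation of x
```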
Since the codebook has a fixed size $N$, it is inevitable for potentially quite different vectors $x_1$ and $x_2$ to be approximated with the same codebook vector (Pigeonhole principle), leading to information loss after quantization. This is where Residual Vector Quantization (RVQ) shines: what if we refined the approximation by first computing the approximation error, which is itself a vector, then approximating the approximation error (😵‍💫) by means of a second codebook? And what if we repeated this process again and again? 🤯
Now that we understand the intuition behind RVQ, let's get a little formal 🤵‍♂️ Given:
- a vector $x$ to be quantized
- a set of codebooks $C_1$, $C_2$, …, $C_K$
we can compute:

$$
\begin{aligned}
\hat{r}_1 &= Q_1(x) \\
\hat{r}_2 &= Q_2(x - \hat{r}_1) \\
\hat{r}_3 &= Q_3(x - \hat{r}_1 - \hat{r}_2) \\
&\;\;\vdots \\
\hat{r}_K &= Q_K(x - \hat{r}_1 - \hat{r}_2 - \dots - \hat{r}_{K-1})
\end{aligned}
$$

where $Q_i(\cdot)$ denotes plain vector quantization using codebook $C_i$.
If you're more of a visual learner, the same process is illustrated in the picture below:
Finally, we can obtain an approximation for $x$ as:

$$\hat{x} = \hat{r}_1 + \hat{r}_2 + \dots + \hat{r}_K = \sum_{i=1}^{K} \hat{r}_i$$
This last bit of insight isn't necessarily obvious, so let's devote a little more time to it. In particular, let's focus on the very last expression:

$$\hat{r}_K = Q_K(x - \hat{r}_1 - \hat{r}_2 - \dots - \hat{r}_{K-1})$$
Since quantization is really nothing fancier than an approximation, we can rewrite it as:

$$\hat{r}_K \approx x - \hat{r}_1 - \hat{r}_2 - \dots - \hat{r}_{K-1}$$
Now all we need to do is add $\hat{r}_1 + \hat{r}_2 + \dots + \hat{r}_{K-1}$ to both sides of the equation, which results in:

$$\hat{r}_1 + \hat{r}_2 + \dots + \hat{r}_{K-1} + \hat{r}_K \approx x \quad\Longleftrightarrow\quad \hat{x} \approx x$$
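Putting the whole recursion into code, here is an illustrative sketch of an RVQ encoder/decoder pair (all helper names are hypothetical; for simplicity it picks nearest neighbours by Euclidean distance, the similarity measure typically used in practice, instead of the cosine similarity from the earlier example):

```python
import torch

def nearest_code(x: torch.Tensor, codebook: torch.Tensor) -> int:
    """Index of the closest codebook vector (negative squared distance as the similarity)."""
    return int(((codebook - x) ** 2).sum(dim=-1).argmin())

def rvq_encode(x: torch.Tensor, codebooks: list[torch.Tensor]) -> list[int]:
    """Quantize x with the first codebook, then keep quantizing whatever is left over."""
    indices, residual = [], x
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        indices.append(idx)
        residual = residual - codebook[idx]   # the approximation error so far
    return indices

def rvq_decode(indices: list[int], codebooks: list[torch.Tensor]) -> torch.Tensor:
    """Approximate x as the sum of the selected codebook vectors, one per codebook."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

torch.manual_seed(0)
K, N, D = 8, 1024, 128                        # number of codebooks, codebook size, vector dimension
codebooks = [torch.randn(N, D) for _ in range(K)]

x = torch.randn(D)
codes = rvq_encode(x, codebooks)              # K integers, one index per codebook
x_hat = rvq_decode(codes, codebooks)
print(codes)
print(torch.norm(x - x_hat) / torch.norm(x))  # relative error (small only once codebooks are trained)
```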
So while in the case of VQ the probability of two vectors $x_1$ and $x_2$ being approximated with the same codebook vector was $1/N$, said probability shrinks to $1/N^K$ in the case of RVQ. At this point you might wonder: couldn't you simply use plain VQ with a larger codebook? By employing a codebook with $N^K$ vectors, the probability of "collision" would also be $1/N^K$. Seems reasonable, right? Well, let me explain why RVQ is still a better choice than VQ with a larger codebook in this case.
Suppose all our vectors, i.e. the vectors to be quantized as well as the codebook vectors, have $D$ elements each. If we were to use a single codebook of size $N^K$, we would be dealing with $N^K \cdot D$ different numbers that need to be stored in memory. With RVQ, on the other hand, we would need only $K \cdot N \cdot D$ different numbers. Also, remember I said this RVQ wizardry is used by state-of-the-art neural audio codecs? What if I told you these "numbers" are nothing but the learnable parameters of a neural network? Let's pick some arbitrary yet reasonable values for $N$, $K$ and $D$: let's say $N = 1024$, $K = 8$ and $D = 128$. If we decide to go for plain VQ, we would have to learn $N^K \cdot D = 1024^8 \cdot 128 \approx 1.5 \times 10^{26}$ parameters. Using RVQ, on the other hand, we would have to learn just $K \cdot N \cdot D = 8 \cdot 1024 \cdot 128 \approx 10^6$ parameters.
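Here is a quick sanity check of those numbers, using the values picked above:

```python
N, K, D = 1024, 8, 128           # codebook size, number of codebooks, vector dimension

plain_vq_params = N ** K * D     # one giant codebook with N**K entries
rvq_params = K * N * D           # K codebooks with N entries each

print(f"{plain_vq_params:.2e}")  # ~1.55e+26
print(f"{rvq_params:,}")         # 1,048,576 (about a million)
```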
Neural Audio Codecs in state-of-the-art audio AI models
Now that we've learned how Neural Audio Codecs and RVQ work, you might be left with one last doubt: what's the connection between them and state-of-the-art AI models for tasks like Text-to-Speech and Speech-to-Speech translation? Sure, they can compress audio efficiently with minimal loss in quality, but how is that relevant?
As it turns out, RVQ-based Neural Audio Codecs can serve as the equivalent of text tokenizers for audio, with a small difference owing to their residual nature: while text tokenizers turn text sequences into integers representing indices of tokens in a vocabulary, RVQ-based Neural Audio Codecs turn audio sequences into sequences of vectors of $K$ integers each, where $K$ is the number of codebooks. The $i$-th element of each vector represents the index of $\hat{r}_i$ in the codebook $C_i$. Here are plausible outputs of some hypothetical text tokenizer and RVQ-based codec (just the encoder part):

$$\operatorname{tokenize}(\text{text}) = (t_1, t_2, \dots, t_T)$$

$$\operatorname{encode}(\text{audio}) = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,S} \\ \vdots & \vdots & \ddots & \vdots \\ a_{K,1} & a_{K,2} & \cdots & a_{K,S} \end{pmatrix}$$
where $T$ and $S$ are the sequence lengths of the text and the audio after the tokenization/quantization process, each $t_j$ is an index into the tokenizer's vocabulary, and each $a_{i,s}$ is an index into codebook $C_i$.
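In tensor terms, the two outputs might look as follows (shapes are the point here; the token IDs, vocabulary and codebook sizes are all made up):

```python
import torch

# A hypothetical text tokenizer output: one vocabulary index per token.
text_tokens = torch.tensor([101, 7592, 2088, 2003, 102])   # shape (T,) with T = 5

# A hypothetical RVQ encoder output: one index per codebook for each audio frame.
K, S = 8, 6                                                 # 8 codebooks, 6 audio frames
audio_tokens = torch.randint(0, 1024, (K, S))               # shape (K, S)

print(text_tokens.shape)    # torch.Size([5])
print(audio_tokens.shape)   # torch.Size([8, 6])
```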
So what can we do with audio tokenizers? Well, exactly the same things we can do with regular tokenizers, for instance training language models on their output. AudioLM picks up on this idea to train an audio language model in a purely self-supervised fashion, achieving remarkable performance on audio and speech continuation given a short prompt. On the other hand, VALL-E performs speech synthesis via text-conditioned audio language modeling.
Wrapping up
Although they were originally designed to push the boundaries of audio compression while still retaining high perceptual quality, Neural Audio Codecs also serve as a crucial building block for modern audio AI models by providing a way to discretize audio into learnable, token-like representations. This tokenization capability has enabled breakthrough models like AudioLM and VALL-E to treat audio generation similarly to how language models handle text generation, opening up exciting possibilities in speech synthesis, audio continuation, speech-to-speech translation, and other audio-related tasks.