Mirco Ravanelli discusses discrete tokens for multimodal LLMs as part of the Conversational AI Reading Group.

Talk available: https://www.youtube.com/watch?v=2-Dqzg3fuVE

Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.


Response from Shital Shah (co-authored Phi-4 Technical Report):

Scaling laws assume that the quality of tokens remains mostly the same as you scale. However, in real-world large-scale datasets, this is not true. When there is an upper bound on quality training tokens, there is an upper bound on scaling. But what about synthetic data?

With current synthetic data techniques, one issue is that they don’t add a lot of new entropy to the original pre-training data. Pre-training data is synthesized from spending centuries of human-FLOPs. Prompt-based synthetic generation can only produce data in the neighborhood of existing data. This creates an entropy bottleneck: there is simply not enough entropy per token to gain as you move down the tail of organic data or rely on prompt-based synthetic data.

A possible solution is to spend more compute time during testing to generate synthetic data with higher entropy content. The entropy per token in a given dataset seems to be related to the FLOPs spent on generating that data. Human data was generated from a vast amount of compute spent by humans over millennia. Our pre-training data is the equivalent of fossil fuel, and that data is running out.

While human-FLOPs are in limited supply, GPU-FLOPs through techniques like test-time compute (TTC) can allow us to generate synthetic data with high entropy, offering a way to overcome this bottleneck. However, the bad news is that we will need more compute than predicted by scaling laws. So, can’t we just rely solely on test-time compute?

Merely scaling inference compute won’t be sufficient. A weak model can spend an inordinate amount of inference compute and still fail to solve a hard problem. There seems to be an intricate, intertwined dance between training and inference compute, with each improving the other. Imagine a cycle of training a model, generating high-entropy synthetic data by scaling inference compute, and then using that data to continue training. This is the self-improving recipe.

Humans operate in a similar way: we consume previously generated data and use it to create new data for the next generation. One critical element in this process is embodiment, which enables the transfer of entropy from our environment. Spend thousands of years of human-FLOPs in this way, and you get the pre-training data that we currently use!

Source: https://x.com/sytelus/status/1857102074070352290, linked in a Reddit thread.

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Core questions (a minimal sketch of the SSL → clustering pipeline these questions refer to follows the list):

  1. Which layers should we cluster?
  2. What is the optimal number of clusters?
  3. Which datasets are we using for clustering?
  4. What is the best approach to train the decoder (vocoder)?
  5. How should we initialize the embeddings effectively?
  6. Can we extract universal tokens for both discriminative and generative tasks?
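A hedged sketch of the pipeline behind these questions: extract hidden states from one SSL layer, fit k-means on a clustering dataset, then map each frame to its nearest centroid. The layer index, cluster count, and checkpoint below are illustrative assumptions, not the choices made in the paper:

```python
# Illustrative SSL -> k-means -> discrete tokens pipeline (all values are assumptions).
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import AutoFeatureExtractor, HubertModel

LAYER = 9          # which layer to cluster (question 1)
N_CLUSTERS = 500   # how many clusters (question 2)

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
ssl_model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def layer_features(waveform, sample_rate=16000):
    """Frame-level features from the chosen SSL layer, shape (frames, dim)."""
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        out = ssl_model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER].squeeze(0)

# Fit k-means on features pooled from a clustering dataset (question 3).
kmeans = MiniBatchKMeans(n_clusters=N_CLUSTERS, batch_size=1024)
# for waveform in clustering_dataset:               # hypothetical iterable of 16 kHz audio
#     kmeans.partial_fit(layer_features(waveform).numpy())

# Tokenize: each frame becomes the index of its nearest centroid.
# tokens = kmeans.predict(layer_features(waveform).numpy())    # (frames,) int IDs
```

Questions 4 and 5 concern how the decoder (vocoder) is trained on top of these token IDs and how its embedding table is initialized (e.g. from the k-means centroids).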

Ravanelli discusses the main points from the paper (these are already broken down in 👉 How Should We Extract Discrete Audio Tokens from Self-Supervised Models?).

DASB - Discrete Audio and Speech Benchmark

  • ESPnet has its own benchmark

| Tokenizer | Type |
| --- | --- |
| Discrete HuBERT | Semantic |
| Discrete WavLM | Semantic |
| Discrete Wav2Vec2 | Semantic |
| EnCodec | Compression |
| DAC | Compression |
| SpeechTokenizer | Hybrid |

| Task | Type |
| --- | --- |
| Automatic Speech Recognition (ASR) | Discriminative |
| Speaker Identification/Verification (SID, SV) | Discriminative |
| Emotion Recognition (ER) | Discriminative |
| Intent Classification (IC) | Discriminative |
| Keyword Spotting (KS) | Discriminative |
| Speech Enhancement (SE) | Generative |
| Speech Separation (SS) | Generative |
| Text-to-Speech (TTS) | Generative |

Take-home messages from Benchmarking Results on Discriminative Tasks

  • Semantic tokens outperform compression tokens in most discriminative tasks
  • Exception is speaker recognition, where EnCodec excels
  • Big gap compared to continuous baselines
    • Mirco: We should ask ourselves whether tokens are really the way to go, or whether continuous approaches are better suited to injecting information into multimodal LLMs

Take-home messages from Benchmarking Results on Generative Tasks

  • Semantic tokens show the best performance for generative tasks as well
  • Again, a big gap compared to continuous baselines
  • The trick is to learn a very good vocoder (see the sketch below)
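A hedged, minimal sketch of what a token-to-waveform decoder (vocoder) does: embed the discrete token IDs, then upsample to audio rate with transposed convolutions. Real vocoders used in these benchmarks are far larger and are trained with adversarial and spectral losses; every value below is an illustrative assumption.

```python
# Toy token-to-waveform decoder: embedding lookup + transposed-conv upsampling.
import torch
import torch.nn as nn

class TinyTokenVocoder(nn.Module):
    def __init__(self, n_tokens=500, dim=256, upsample_factors=(8, 8, 5)):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, dim)
        layers, ch = [], dim
        for f in upsample_factors:  # total upsampling: 8 * 8 * 5 = 320 samples per token frame
            layers += [
                nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * f, stride=f, padding=f // 2),
                nn.LeakyReLU(0.1),
            ]
            ch //= 2
        layers += [nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, tokens):                   # tokens: (batch, frames) integer IDs
        x = self.embed(tokens).transpose(1, 2)   # (batch, dim, frames)
        return self.net(x).squeeze(1)            # (batch, samples) waveform

# wav = TinyTokenVocoder()(torch.randint(0, 500, (1, 100)))  # roughly 32k samples
```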

Ranking aggregation for models at medium bitrate (lower rank is better; a sketch of the aggregation follows the bullets below)

| Model | Disc. | Gen. | Comb. |
| --- | --- | --- | --- |
| Discrete HuBERT | 2.66 | 3.62 | 3.11 |
| Discrete WavLM | 2.00 | 2.75 | 1.94 |
| Discrete Wav2Vec2 | 3.33 | 2.68 | 3.41 |
| EnCodec | 4.11 | 3.93 | 4.23 |
| DAC | 5.55 | 4.06 | 4.64 |
| SpeechTokenizer | 3.44 | 3.81 | 3.64 |

  • Semantic tokens are very computationally expensive
  • They don’t preserve speaker identity well
  • The performance drop is significant compared to continuous representations
  • DASB is implemented in SpeechBrain
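The combined column in the ranking table above can be thought of as ranking tokenizers within each task and averaging the ranks. The sketch below assumes a simple mean-of-ranks aggregation with made-up per-task scores; the exact procedure and metrics used in DASB may differ.

```python
# Hedged sketch of rank aggregation across tasks (mean of per-task ranks, 1 = best).
# Scores are hypothetical, assumed higher-is-better, and ties are ignored for simplicity.
import numpy as np

models = ["Discrete HuBERT", "Discrete WavLM", "EnCodec"]
scores = np.array([
    [0.91, 0.62, 0.75],   # rows: models
    [0.93, 0.70, 0.78],   # cols: tasks
    [0.85, 0.81, 0.66],
])

# Rank models within each task (1 = best), then average ranks across tasks.
ranks = scores.shape[0] - scores.argsort(axis=0).argsort(axis=0)
mean_rank = ranks.mean(axis=1)
for name, r in zip(models, mean_rank):
    print(f"{name}: mean rank {r:.2f}")
```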

Future Directions

  • We are still very far from Universal Audio Tokens - audio/speech tokens which are suitable across a range of tasks

Ideas:

How can we learn “interpretable” audio tokens?

Text example:

Text: “The City of Montréal” Tokenized: "[The] [City] [of] [Mont] [ré] [al]"

Audio case not so trivial.

Interpretable codebook: every token is connected to some local property, e.g. some codebooks model the higher spectral components.

Not only does this make error analysis easier, but by forcing the codes/codebooks to align with interpretable features, you also impose a regularisation that can help generalisation.
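One speculative way to read this idea as code: a standard vector-quantization layer whose selected code is additionally asked to predict a hand-crafted local feature (here, per-band energies), nudging each code toward a human-readable acoustic property. This is a sketch of the idea only, not the method of any particular paper; all names, shapes, and loss weights are assumptions.

```python
# VQ layer with a hypothetical "interpretability" regularizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterpretableVQ(nn.Module):
    def __init__(self, n_codes=256, dim=128, n_bands=8):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.band_head = nn.Linear(dim, n_bands)  # maps a code to per-band energies

    def forward(self, z, band_energy):
        # z: (batch, frames, dim) encoder output
        # band_energy: (batch, frames, n_bands) hand-crafted target feature
        codes = self.codebook.weight                              # (n_codes, dim)
        dist = (z.unsqueeze(2) - codes).pow(2).sum(-1)            # (batch, frames, n_codes)
        idx = dist.argmin(dim=-1)                                 # discrete tokens
        q = self.codebook(idx)
        q_st = z + (q - z).detach()                               # straight-through estimator
        vq_loss = F.mse_loss(q, z.detach()) + 0.25 * F.mse_loss(z, q.detach())
        # Regularizer: the selected code should explain the frame's band energies.
        interp_loss = F.mse_loss(self.band_head(q), band_energy)
        return q_st, idx, vq_loss + interp_loss

# q, tokens, loss = InterpretableVQ()(torch.randn(2, 50, 128), torch.randn(2, 50, 8))
```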

See the NeurIPS 2024 and ICML 2024 papers.

At NeurIPS, saw approaches that are part autoregressive, part diffusion-based.

Survey paper and extended benchmark are in progress.

Questions & Discussion

Critiques

  • Shahab: How do you differentiate multimodal audio-text from multimodal image-text? Why is image tokenization more effective?
    • It’s equally critical. People in CV complain about similar issues. Most models outsource the image part and do diffusion on top of that.
    • A best paper award at NeurIPS went to a better way of tokenising images: tokens correspond to different resolutions instead of patches.
    • Lot of room for improvement in the image setting as well
  • Yi Zhu: For SID and SV, comparing semantic tokens and compression tokens, is the deficiency of semantic tokens relative to compression tokens due to residual vector quantisation or to the training objectives?
    • Semantic tokens capture high-level information, whereas compression tokens are lower level (as I thought)
    • Compression tokens are trained to reconstruct (audio) signals
    • There are hybrid tokens too
    • Saw a paper that addresses this (at NeurIPS I think)
  • Ankur Bhatia: Why are SSL → clustering → discrete tokens called semantic tokens, given that tasks like emotion recognition require acoustic information as well?
    • Mirco doesn’t like the term semantic tokens (he only uses it since it is the standard in the literature)
  • Mudit Batra: What is SSL continuous baseline?
    • missed this
  • Mudit Batra: regarding the model architecture, are attention weights being calculated across all layers and again being added with primary embedding?
    • missed this
  • Lin Zhang: We could fuse compression and semantic tokens
  • Lin Zhang: You plan to expand DASB to more audio tasks. There are also ESPnet, Codec SUPERB, and (another one; I couldn’t hear). What are the differences?
    • Will have more tokenizers, tasks etc.
    • There is a team working to combine all those benchmarks and figure out what they have in common (common trends)
    • Pooneh: We are adding more tokenisers (scalar quantisation, single-layer); we are also adding music and audio tasks.
    • Pooneh: Have tried to train on the internal representations as opposed to the decoder output
      • Mirco: Could be more informative for MM LLMs since this internal (latent) repr. is what is used in LLMs (text modality)
  • Yoshiki Masuyama: Model size and architecture matter (HuBERT/WavLM use transformers, whereas EnCodec uses a conv net); modelling capacity affects the results.
    • Yoshiki: SpeechTokenizer uses a conv net + LSTM and distills WavLM/HuBERT
    • Codecs not trained for
    • Have to use architectures which are much smaller than SSL models designed for the specific spoken language modelling task.
    • 👉 Scaling up the compression models is the way to go (like previous points: multitask, scaled up)
    • “I definitely think compression tokens are the way to go”
  • For TTS, low-level layers are preferred
    • TTS prefers low-level info
    • (a year or two ago) people were doing TTS estimating the spectrogram directly
  • Yadav Hemant: Why is the information needed for ASR called high-level, while for SV it is called low-level?
    • standard discussion
  • Heitor Guimarães: Have you tried to exploit the discretisation performed during the training of HuBERT / WavLM?
    • In the main works, people train a new k-means (coming up with different cluster counts, datasets, and layer choices)
  • Heitor Guimarães: Have you
    • Normally want to stay as aligned as possible with the literature
    • Mirco is not a big fan of semantic tokens: with compression-based tokens everything, including the discrete representation, is learned from scratch, whereas the semantic-token pipeline is not differentiable.

Notes & Thoughts

  • Continuous representations: cannot be trained with a negative log-likelihood (cross-entropy) objective the way discrete tokens can (see the sketch below)
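A small illustration of this note: with discrete tokens, a language model can be trained with standard negative log-likelihood (cross-entropy over a finite vocabulary), whereas continuous representations have no finite vocabulary, so a regression or density-based objective (e.g. MSE, diffusion) has to be used instead. Shapes and values below are arbitrary.

```python
# Discrete vs. continuous training targets (illustrative shapes only).
import torch
import torch.nn.functional as F

vocab_size, dim, T = 500, 256, 10

# Discrete targets: NLL / cross-entropy over token IDs.
logits = torch.randn(1, T, vocab_size)          # model output
tokens = torch.randint(0, vocab_size, (1, T))   # ground-truth audio token IDs
nll = F.cross_entropy(logits.transpose(1, 2), tokens)

# Continuous targets: no finite vocabulary, so cross-entropy does not apply;
# a regression loss (or a learned density / diffusion objective) is used instead.
pred = torch.randn(1, T, dim)
target = torch.randn(1, T, dim)                 # continuous SSL features
mse = F.mse_loss(pred, target)
```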

Team Members