Mirco Ravanelli discusses discrete tokens for multimodal LLMs as part of the Conversational AI Reading Group.

Talk available: https://www.youtube.com/watch?v=2-Dqzg3fuVE

Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.


Response from Shital Shah (co-authored Phi-4 Technical Report):

Scaling laws assume that the quality of tokens remains mostly the same as you scale. However, in real-world large-scale datasets, this is not true. When there is an upper bound on quality training tokens, there is an upper bound on scaling. But what about synthetic data?

With current synthetic data techniques, one issue is that they don’t add a lot of new entropy to the original pre-training data. Pre-training data is synthesized from spending centuries of human-FLOPs. Prompt-based synthetic generation can only produce data in the neighborhood of existing data. This creates an entropy bottleneck: there is simply not enough entropy per token to gain as you move down the tail of organic data or rely on prompt-based synthetic data.

A possible solution is to spend more compute time during testing to generate synthetic data with higher entropy content. The entropy per token in a given dataset seems to be related to the FLOPs spent on generating that data. Human data was generated from a vast amount of compute spent by humans over millennia. Our pre-training data is the equivalent of fossil fuel, and that data is running out.

While human-FLOPs are in limited supply, GPU-FLOPs through techniques like test-time compute (TTC) can allow us to generate synthetic data with high entropy, offering a way to overcome this bottleneck. However, the bad news is that we will need more compute than predicted by scaling laws. So, can’t we just rely solely on test-time compute?

Merely scaling inference compute won’t be sufficient. A weak model can spend an inordinate amount of inference compute and still fail to solve a hard problem. There seems to be an intricate, intertwined dance between training and inference compute, with each improving the other. Imagine a cycle of training a model, generating high-entropy synthetic data by scaling inference compute, and then using that data to continue training. This is the self-improving recipe.

Humans operate in a similar way: we consume previously generated data and use it to create new data for the next generation. One critical element in this process is embodiment, which enables the transfer of entropy from our environment. Spend thousands of years of human-FLOPs in this way, and you get the pre-training data that we currently use!

Source: https://x.com/sytelus/status/1857102074070352290, linked in a Reddit thread.

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Core questions (a minimal sketch of the SSL → clustering pipeline these questions refer to follows the list):

  1. Which layers should we cluster?
  2. What is the optimal number of clusters?
  3. Which datasets are we using for clustering?
  4. What is the best approach to train the decoder (vocoder)?
  5. How should we initialize the embeddings effectively?
  6. Can we extract universal tokens for both discriminative and generative tasks?
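A hedged sketch of the pipeline behind these questions: extract hidden states from one SSL layer, fit k-means on a clustering dataset, then map each frame to its nearest centroid. The layer index, cluster count, and checkpoint below are illustrative assumptions, not the choices made in the paper:

```python
# Illustrative SSL -> k-means -> discrete tokens pipeline (all values are assumptions).
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import AutoFeatureExtractor, HubertModel

LAYER = 9          # which layer to cluster (question 1)
N_CLUSTERS = 500   # how many clusters (question 2)

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
ssl_model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def layer_features(waveform, sample_rate=16000):
    """Frame-level features from the chosen SSL layer, shape (frames, dim)."""
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        out = ssl_model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER].squeeze(0)

# Fit k-means on features pooled from a clustering dataset (question 3).
kmeans = MiniBatchKMeans(n_clusters=N_CLUSTERS, batch_size=1024)
# for waveform in clustering_dataset:               # hypothetical iterable of 16 kHz audio
#     kmeans.partial_fit(layer_features(waveform).numpy())

# Tokenize: each frame becomes the index of its nearest centroid.
# tokens = kmeans.predict(layer_features(waveform).numpy())    # (frames,) int IDs
```

Questions 4 and 5 concern how the decoder (vocoder) is trained on top of these token IDs and how its embedding table is initialized (e.g. from the k-means centroids).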

Ravanelli discusses the main points from the paper (these are already broken down in 👉 How Should We Extract Discrete Audio Tokens from Self-Supervised Models?).

DASB - Discrete Audio and Speech Benchmark

  • ESPnet has its own benchmark

| Tokenizer | Type |
| --- | --- |
| Discrete HuBERT | Semantic |
| Discrete WavLM | Semantic |
| Discrete Wav2Vec2 | Semantic |
| EnCodec | Compression |
| DAC | Compression |
| SpeechTokenizer | Hybrid |

| Task | Type |
| --- | --- |
| Automatic Speech Recognition (ASR) | Discriminative |
| Speaker Identification/Verification (SID, SV) | Discriminative |
| Emotion Recognition (ER) | Discriminative |
| Intent Classification (IC) | Discriminative |
| Keyword Spotting (KS) | Discriminative |
| Speech Enhancement (SE) | Generative |
| Speech Separation (SS) | Generative |
| Text-to-Speech (TTS) | Generative |

Take-home messages from Benchmarking Results on Discriminative Tasks

  • Semantic tokens outperform compression tokens in most discriminative tasks
  • Exception is speaker recognition, where EnCodec excels
  • Big gap compared to continuous baselines
    • Mirco: We should ask ourselves whether tokens are really the way to go, or whether continuous approaches are better suited to injecting information into multimodal LLMs

Take-home messages from Benchmarking Results on Generative Tasks

  • Semantic tokens show the best performance for generative tasks as well
  • Again, a big gap compared to continuous baselines
  • The trick is to learn a very good vocoder (see the sketch below)
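A hedged, minimal sketch of what a token-to-waveform decoder (vocoder) does: embed the discrete token IDs, then upsample to audio rate with transposed convolutions. Real vocoders used in these benchmarks are far larger and are trained with adversarial and spectral losses; every value below is an illustrative assumption.

```python
# Toy token-to-waveform decoder: embedding lookup + transposed-conv upsampling.
import torch
import torch.nn as nn

class TinyTokenVocoder(nn.Module):
    def __init__(self, n_tokens=500, dim=256, upsample_factors=(8, 8, 5)):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, dim)
        layers, ch = [], dim
        for f in upsample_factors:  # total upsampling: 8 * 8 * 5 = 320 samples per token frame
            layers += [
                nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * f, stride=f, padding=f // 2),
                nn.LeakyReLU(0.1),
            ]
            ch //= 2
        layers += [nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, tokens):                   # tokens: (batch, frames) integer IDs
        x = self.embed(tokens).transpose(1, 2)   # (batch, dim, frames)
        return self.net(x).squeeze(1)            # (batch, samples) waveform

# wav = TinyTokenVocoder()(torch.randint(0, 500, (1, 100)))  # roughly 32k samples
```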

Ranking aggregation for models at medium bitrate (lower rank is better; a sketch of the aggregation follows the bullets below)

| Model | Disc. | Gen. | Comb. |
| --- | --- | --- | --- |
| Discrete HuBERT | 2.66 | 3.62 | 3.11 |
| Discrete WavLM | 2.00 | 2.75 | 1.94 |
| Discrete Wav2Vec2 | 3.33 | 2.68 | 3.41 |
| EnCodec | 4.11 | 3.93 | 4.23 |
| DAC | 5.55 | 4.06 | 4.64 |
| SpeechTokenizer | 3.44 | 3.81 | 3.64 |

  • Semantic tokens are very computationally expensive
  • They don’t preserve speaker identity well
  • The performance drop is significant compared to continuous representations
  • DASB is implemented in SpeechBrain
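The combined column in the ranking table above can be thought of as ranking tokenizers within each task and averaging the ranks. The sketch below assumes a simple mean-of-ranks aggregation with made-up per-task scores; the exact procedure and metrics used in DASB may differ.

```python
# Hedged sketch of rank aggregation across tasks (mean of per-task ranks, 1 = best).
# Scores are hypothetical, assumed higher-is-better, and ties are ignored for simplicity.
import numpy as np

models = ["Discrete HuBERT", "Discrete WavLM", "EnCodec"]
scores = np.array([
    [0.91, 0.62, 0.75],   # rows: models
    [0.93, 0.70, 0.78],   # cols: tasks
    [0.85, 0.81, 0.66],
])

# Rank models within each task (1 = best), then average ranks across tasks.
ranks = scores.shape[0] - scores.argsort(axis=0).argsort(axis=0)
mean_rank = ranks.mean(axis=1)
for name, r in zip(models, mean_rank):
    print(f"{name}: mean rank {r:.2f}")
```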

Future Directions

  • We are still very far from Universal Audio Tokens - audio/speech tokens which are suitable across a range of tasks

Ideas:

How can we learn “interpretable” audio tokens?

Text example:

Text: “The City of Montréal” Tokenized: "[The] [City] [of] [Mont] [ré] [al]"

Audio case not so trivial.

Interpretable codebook: every token is connected to some local property, e.g. some codebooks model the higher spectral components.

Not only does this make error analysis easier, but by forcing the codes/codebooks to align with interpretable features, you also impose a regularisation that can help generalisation.
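One speculative way to read this idea as code: a standard vector-quantization layer whose selected code is additionally asked to predict a hand-crafted local feature (here, per-band energies), nudging each code toward a human-readable acoustic property. This is a sketch of the idea only, not the method of any particular paper; all names, shapes, and loss weights are assumptions.

```python
# VQ layer with a hypothetical "interpretability" regularizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterpretableVQ(nn.Module):
    def __init__(self, n_codes=256, dim=128, n_bands=8):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.band_head = nn.Linear(dim, n_bands)  # maps a code to per-band energies

    def forward(self, z, band_energy):
        # z: (batch, frames, dim) encoder output
        # band_energy: (batch, frames, n_bands) hand-crafted target feature
        codes = self.codebook.weight                              # (n_codes, dim)
        dist = (z.unsqueeze(2) - codes).pow(2).sum(-1)            # (batch, frames, n_codes)
        idx = dist.argmin(dim=-1)                                 # discrete tokens
        q = self.codebook(idx)
        q_st = z + (q - z).detach()                               # straight-through estimator
        vq_loss = F.mse_loss(q, z.detach()) + 0.25 * F.mse_loss(z, q.detach())
        # Regularizer: the selected code should explain the frame's band energies.
        interp_loss = F.mse_loss(self.band_head(q), band_energy)
        return q_st, idx, vq_loss + interp_loss

# q, tokens, loss = InterpretableVQ()(torch.randn(2, 50, 128), torch.randn(2, 50, 8))
```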

See the NeurIPS 2024 and ICML 2024 papers.

At NeurIPS, saw approaches that are part autoregressive, part diffusion-based.

Survey paper and extended benchmark are in progress.

Questions & Discussion

Critiques

  • Shahab: How do you differentiate multimodal audio-text from multimodal image-text? Why is image tokenization more effective?
    • It’s equally critical. People in CV complain about similar issues. Most models outsource the image part and do diffusion on top of that.
    • A best paper award at NeurIPS went to a better way of tokenising images: tokens correspond to different resolutions instead of patches.
    • Lot of room for improvement in the image setting as well
  • Yi Zhu: For SID and SV, comparing semantic tokens and compression tokens, is the deficiency of semantic tokens relative to compression tokens due to residual vector quantisation or to the training objectives?
    • Semantic tokens capture high-level information, whereas compression tokens are lower level (as I thought)
    • Compression tokens are trained to reconstruct (audio) signals
    • There are hybrid tokens too
    • Saw a paper that addresses this (at NeurIPS I think)
  • Ankur Bhatia: Why are SSL → clustering → discrete tokens called semantic tokens, given that tasks like emotion recognition require acoustic information as well?
    • Mirco doesn’t like the term semantic tokens (he only uses it since it is the standard in the literature)
  • Mudit Batra: What is SSL continuous baseline?
    • missed this
  • Mudit Batra: regarding the model architecture, are attention weights being calculated across all layers and again being added with primary embedding?
    • missed this
  • Lin Zhang: We could fuse compression and semantic tokens
  • Lin Zhang: You plan to expand DASB to more audio tasks. There are also ESPnet, Codec SUPERB, and (another one; I couldn’t hear). What are the differences?
    • Will have more tokenizers, tasks etc.
    • There is a team working to combine all those benchmarks and figure out what they have in common (common trends)
    • Pooneh: We are adding more tokenisers (scalar quantisation, single-layer); we are also adding music and audio tasks.
    • Pooneh: Have tried to train on the internal representations as opposed to the decoder output
      • Mirco: Could be more informative for MM LLMs since this internal (latent) repr. is what is used in LLMs (text modality)
  • Yoshiki Masuyama: Model size and architecture matter (HuBERT/WavLM use transformers, whereas EnCodec uses a conv net); modelling capacity affects the results.
    • Yoshiki: SpeechTokenizer uses a conv net + LSTM and distills WavLM/HuBERT
    • Codecs not trained for
    • Have to use architectures which are much smaller than SSL models designed for the specific spoken language modelling task.
    • 👉 Scaling up the compression models is the way to go (like previous points: multitask, scaled up)
    • “I definitely think compression tokens are the way to go”
  • For TTS, low-level layers are preferred
    • TTS prefers low-level info
    • (a year or two ago) people were doing TTS estimating the spectrogram directly
  • Yadav Hemant: Why is the information needed for ASR called high-level, while for SV it is called low-level?
    • standard discussion
  • Heitor Guimarães: Have you tried to exploit the discretisation performed during the training of HuBERT / WavLM?
    • In the main works, people train a new k-means (coming up with different cluster counts, datasets, and layer choices)
  • Heitor Guimarães: Have you
    • Normally want to stay as aligned as possible with the literature
    • Mirco is not a big fan of semantic tokens: with compression-based tokens everything, including the discrete representation, is learned from scratch, whereas the semantic-token pipeline is not differentiable.

Notes & Thoughts

  • Continuous representations: cannot be trained with a negative log-likelihood (cross-entropy) objective the way discrete tokens can (see the sketch below)
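A small illustration of this note: with discrete tokens, a language model can be trained with standard negative log-likelihood (cross-entropy over a finite vocabulary), whereas continuous representations have no finite vocabulary, so a regression or density-based objective (e.g. MSE, diffusion) has to be used instead. Shapes and values below are arbitrary.

```python
# Discrete vs. continuous training targets (illustrative shapes only).
import torch
import torch.nn.functional as F

vocab_size, dim, T = 500, 256, 10

# Discrete targets: NLL / cross-entropy over token IDs.
logits = torch.randn(1, T, vocab_size)          # model output
tokens = torch.randint(0, vocab_size, (1, T))   # ground-truth audio token IDs
nll = F.cross_entropy(logits.transpose(1, 2), tokens)

# Continuous targets: no finite vocabulary, so cross-entropy does not apply;
# a regression loss (or a learned density / diffusion objective) is used instead.
pred = torch.randn(1, T, dim)
target = torch.randn(1, T, dim)                 # continuous SSL features
mse = F.mse_loss(pred, target)
```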

Team Members