Mirco Ravanelli discusses discrete tokens for multimodal LLMs as part of the Conversational AI Reading Group.
Talk available: https://www.youtube.com/watch?v=2-Dqzg3fuVE
Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.
- Famous Bengio paper from 2003 arguing for LMs with neural networks: A Neural Probabilistic Language Model
- Ilya Sutskever (at NeurIPS last week) does not believe in scaling up exclusively - "pre-training as we know it will end"
- See his talk at NeurIPS 2024: Ilya Sutskever full talk at NeurIPS 2024 Vancouver 15122024
Response from Shital Shah (co-authored Phi-4 Technical Report):
Scaling laws assume that the quality of tokens remains mostly the same as you scale. However, in real-world large-scale datasets, this is not true. When there is an upper bound on quality training tokens, there is an upper bound on scaling. But what about synthetic data?
With current synthetic data techniques, one issue is that they don't add a lot of new entropy to the original pre-training data. Pre-training data is synthesized from spending centuries of human-FLOPs. Prompt-based synthetic generation can only produce data in the neighborhood of existing data. This creates an entropy bottleneck: there is simply not enough entropy per token to gain as you move down the tail of organic data or rely on prompt-based synthetic data.
A possible solution is to spend more test-time compute to generate synthetic data with higher entropy content. The entropy per token in a given dataset seems to be related to the FLOPs spent on generating that data. Human data was generated from a vast amount of compute spent by humans over millennia. Our pre-training data is the equivalent of fossil fuel, and that data is running out.
While human-FLOPs are in limited supply, GPU-FLOPs through techniques like test-time compute (TTC) can allow us to generate synthetic data with high entropy, offering a way to overcome this bottleneck. However, the bad news is that we will need more compute than predicted by scaling laws. So, can't we just rely solely on test-time compute?
Merely scaling inference compute won't be sufficient. A weak model can spend an inordinate amount of inference compute and still fail to solve a hard problem. There seems to be an intricate, intertwined dance between training and inference compute, with each improving the other. Imagine a cycle of training a model, generating high-entropy synthetic data by scaling inference compute, and then using that data to continue training. This is the self-improving recipe.
Humans operate in a similar way: we consume previously generated data and use it to create new data for the next generation. One critical element in this process is embodiment, which enables the transfer of entropy from our environment. Spend thousands of years of human-FLOPs in this way, and you get the pre-training data that we currently use!
Source: https://x.com/sytelus/status/1857102074070352290. Linked on Reddit thread.
How Should We Extract Discrete Audio Tokens from Self-Supervised Models?
Core questions:
- Which layers should we cluster?
- What is the optimal number of clusters?
- Which datasets are we using for clustering?
- What is the best approach to train the decoder (vocoder)?
- How should we initialize the embeddings effectively?
- Can we extract universal tokens for both discriminative and generative tasks?
Ravanelli discusses the main points from the paper (already broken these down in How Should We Extract Discrete Audio Tokens from Self-Supervised Models?).
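To make this concrete, below is a minimal sketch of the SSL → k-means tokenization pipeline discussed in the paper. It assumes HuggingFace transformers and scikit-learn; the checkpoint, layer index, number of clusters, and the random "clustering data" are illustrative stand-ins, not the paper's settings.

```python
# Minimal SSL -> k-means "semantic token" sketch (illustrative settings only).
import numpy as np
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import AutoFeatureExtractor, AutoModel

MODEL = "facebook/hubert-base-ls960"
LAYER = 6          # which layer to cluster is one of the core questions above
N_CLUSTERS = 100   # the optimal codebook size is another open question

extractor = AutoFeatureExtractor.from_pretrained(MODEL)
ssl_model = AutoModel.from_pretrained(MODEL).eval()

def layer_features(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Return (n_frames, dim) hidden states from the chosen SSL layer."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = ssl_model(**inputs, output_hidden_states=True).hidden_states
    return hidden[LAYER].squeeze(0).cpu().numpy()

# Stand-in for a real clustering dataset: 30 s of random audio.
feats = layer_features(np.random.randn(30 * 16000).astype(np.float32))

# Fit the codebook, then map every ~20 ms frame to a discrete token ID.
kmeans = MiniBatchKMeans(n_clusters=N_CLUSTERS, random_state=0).fit(feats)
tokens = kmeans.predict(feats)
print(tokens[:20])  # integer IDs in [0, N_CLUSTERS)
```

In a real setup, the k-means would be fitted on features from a clustering corpus (e.g. LibriSpeech), and a separately trained vocoder would resynthesize audio from the token stream.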
DASB - Discrete Audio and Speech Benchmark
- ESPNet has their own benchmark
Tokenizer | Type |
---|---|
Discrete HuBERT | Semantic |
Discrete WavLM | Semantic |
Discrete Wav2Vec2 | Semantic |
EnCodec | Compression |
DAC | Compression |
SpeechTokenizer | Hybrid |
Task | Type |
---|---|
Automatic Speech Recognition (ASR) | Discriminative |
Speaker Identification/Verification (SID, SV) | Discriminative |
Emotion Recognition (ER) | Discriminative |
Intent Classification (IC) | Discriminative |
Keyword Spotting (KS) | Discriminative |
Speech Enhancement (SE) | Generative |
Speech Separation (SS) | Generative |
Text-to-Speech (TTS) | Generative |
Take-home messages from Benchmarking Results on Discriminative Tasks
- Semantic tokens outperform compression tokens in most discriminative tasks
- Exception is speaker recognition, where EnCodec excels
- Big gap compared to continuous baselines
- Mirco: We should ask ourselves whether tokens are the way to go, rather than continuous approaches, for injecting information into multimodal LLMs
Take-home messages from Benchmarking Results on Generative Tasks
- Semantic tokens show the best performance for generative tasks as well
- Big gap - again - against continuous baselines
- The trick is to learn a very good vocoder
Ranking aggregation for models (medium bitrate)
Model | Disc. | Gen. | Comb. |
---|---|---|---|
Discrete HuBERT | 2.66 | 3.62 | 3.11 |
Discrete WavLM | 2.00 | 2.75 | 1.94 |
Discrete Wav2Vec2 | 3.33 | 2.68 | 3.41 |
EnCodec | 4.11 | 3.93 | 4.23 |
DAC | 5.55 | 4.06 | 4.64 |
SpeechTokenizer | 3.44 | 3.81 | 3.64 |
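A minimal sketch of how such a ranking aggregation can be computed: rank the tokenizers on each task, then average the ranks. The per-task scores below are invented for illustration only; DASB's exact metrics and aggregation scheme may differ.

```python
# Toy rank aggregation over made-up per-task scores (higher = better).
import pandas as pd

scores = pd.DataFrame(
    {"ASR": [0.90, 0.92, 0.88], "ER": [0.60, 0.65, 0.58], "TTS": [3.2, 3.5, 3.0]},
    index=["Discrete HuBERT", "Discrete WavLM", "EnCodec"],
)

ranks = scores.rank(ascending=False, axis=0)  # rank 1 = best on each task
print(ranks.mean(axis=1).sort_values())       # lower average rank = better overall
```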
- Semantic tokens are very computationally expensive
- Don't preserve speaker identity well
- The performance drop is significant compared to continuous representations
- DASB is implemented in SpeechBrain
Future Directions
- We are still very far from Universal Audio Tokens - audio/speech tokens which are suitable across a range of tasks
Ideas:
- Massive multi-task learning (similar to PASE, below)
- compression tokenizers are mainly trained on a single task or two tasks
- his team explored this in Multi-task self-supervised learning for Robust Speech Recognition (PASE)
- see also: Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
- Hierarchical codebooks with dynamic allocation (more details for music)
- Perceptual loss optimisation
- Better multi-scale processing
How can we learn "interpretable" audio tokens?
Text example:
Text: "The City of Montréal", tokenized:
"[The] [City] [of] [Mont] [ré] [al]"
The audio case is not so trivial.
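For comparison, the text case can be reproduced with any off-the-shelf subword tokenizer; a minimal sketch assuming HuggingFace transformers (the exact splits depend on the tokenizer's vocabulary, so the pieces shown above are only indicative):

```python
# Subword tokenization of the text example; output varies with the vocabulary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tok.tokenize("The City of Montréal"))
# e.g. ['The', 'City', 'of', 'Mont', '##ré', '##al'] (vocabulary-dependent)
```

Each text token has a readable surface form; an audio token is just a codebook index with no obvious human-interpretable meaning, which is what makes the audio case harder.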
Interpretable codebook: Every token is connected to some local properties e.g. some codebooks model higher spectral components.
Not only does this make error analysis easier; by forcing the codes/codebooks to align with interpretable features, you also impose a regularisation that can help generalisation.
See NeurIPS 2024 and ICML 2024 papers.
At NeurIPS, saw models that are part autoregressive, part diffusion-based.
Survey paper and extended benchmark are in progress.
Questions & Discussion
- Shahab: How do you differentiate MM audio-text with MM image-text? Why is image tokenization more effective?
- It's equally critical. People in CV complain about similar issues. Most models outsource the image part and do diffusion on top of that.
- Best paper award at NeurIPS proposed better way of tokenising images: Tokens correspond to different resolutions instead of patches.
- Lot of room for improvement in the image setting as well
- Yi Zhu: For SID and SV, comparing semantic tokens with compression tokens, is the deficiency of semantic tokens relative to compression tokens due to residual vector quantisation or to the training objectives?
- Semantic tokens capture high-level information, whereas compression tokens are lower level (what I thought)
- Compression tokens are trained to reconstruct (audio) signals
- There are hybrid tokens too
- Saw a paper that addresses this (at NeurIPS I think)
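For context on the RVQ part of the question, here is a toy sketch of residual vector quantization, the mechanism behind EnCodec/DAC-style compression tokens. The number of stages, codebook size, and feature dimension are illustrative, and the codebooks are random rather than learned.

```python
# Toy residual vector quantization (RVQ): each stage quantizes the residual
# left by the previous stage, yielding one token per stage per frame.
import torch

torch.manual_seed(0)
N_STAGES, CODEBOOK_SIZE, DIM = 4, 256, 64
codebooks = [torch.randn(CODEBOOK_SIZE, DIM) for _ in range(N_STAGES)]  # random, not learned

def rvq_encode(frames: torch.Tensor):
    """frames: (n_frames, DIM) encoder outputs. Returns (token IDs, reconstruction)."""
    residual = frames.clone()
    ids, recon = [], torch.zeros_like(frames)
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # distance to every codeword
        idx = dists.argmin(dim=1)           # nearest codeword per frame
        chosen = cb[idx]
        ids.append(idx)
        recon = recon + chosen
        residual = residual - chosen        # the next stage refines what is left
    return torch.stack(ids, dim=1), recon

frames = torch.randn(100, DIM)              # stand-in for codec encoder outputs
tokens, recon = rvq_encode(frames)
print(tokens.shape, torch.mean((frames - recon) ** 2).item())  # (100, 4), residual error
```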
- Ankur Bhatia: Why are SSL → clustering → discrete tokens called "semantic tokens", when tasks like emotion recognition require acoustic information as well?
- Don't like the term semantic tokens (Mirco only uses it since it is the standard in the literature)
- Mudit Batra: What is SSL continuous baseline?
- missed this
- Mudit Batra: regarding the model architecture, are attention weights being calculated across all layers and again being added with primary embedding?
- missed this
- Lin Zhang: We could fuse compression and semantic tokens
- This is a thing: "hybrid tokens", e.g. SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
- performance is not that good (see DASB discriminative tasks)
- His bet: need to focus on compression tokens and train them in a richer way than just reconstruction
- Lin Zhang: You plan to expand DASB to more audio tasks. ESPNet, Codec SUPERB and (another one; I couldn't hear). What are the differences?
- Will have more tokenizers, tasks etc.
- Have a team to combine all those benchmarks to figure out what these benchmarks have in common. Common trends
- Pooneh: We are adding more tokenisers (scalar quantization, single-layer); we are also adding music and audio tasks.
- Pooneh: Have tried to train on the internal representations as opposed to the decoder output
- Mirco: Could be more informative for MM LLMs since this internal (latent) repr. is what is used in LLMs (text modality)
- Yoshiki Masuyama: Model size and architecture differ (HuBERT/WavLM use transformers vs EnCodec uses a conv net); this affects modelling capacity.
- Yoshiki: SpeechTokenizer uses a conv net + LSTM and distills WavLM/HuBERT
- Codecs are not trained for these tasks
- They have to use architectures which are much smaller than the SSL models designed for the specific spoken language modelling task
- Scaling up the compression models is the way to go (like previous points: multi-task, scaled up)
- "I definitely think compression tokens are the way to go"
- For TTS, low-level layers are preferred
- TTS prefers low-level info
- (a year or two ago) people were doing TTS estimating the spectrogram directly
- Yadav Hemant: Why is the information used for the ASR task called high-level, while for SV it's called low-level?
- standard discussion
- Heitor Guimarães: Have you tried to exploit the discretisation performed during the training of HuBERT/WavLM?
- In the main works, people train a new k-means (coming up with different numbers of clusters, datasets, and layer choices)
- Heitor Guimarães: Have you …
- Normally want to stay as aligned as possible with the literature
- Not a super big fan of the semantic tokens, because compression-based tokens learn everything: you learn a discrete representation from scratch, while the semantic-token pipeline (SSL → clustering) is not differentiable.
Notes & Thoughts
- Continuous representations: cannot use a negative log-likelihood (cross-entropy) objective, since there is no finite vocabulary to define token probabilities over
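A minimal sketch of this last point: discrete tokens allow a standard negative log-likelihood (cross-entropy) objective over a finite codebook, while continuous representations need a regression-style loss instead. All shapes and tensors below are illustrative.

```python
# NLL/cross-entropy needs a finite vocabulary of token IDs; continuous targets
# (e.g. SSL features or spectrogram frames) need regression losses such as MSE.
import torch
import torch.nn.functional as F

B, T, VOCAB, DIM = 2, 50, 1024, 768
logits = torch.randn(B, T, VOCAB)                # LM head over a discrete audio codebook
token_targets = torch.randint(0, VOCAB, (B, T))  # discrete audio tokens
nll = F.cross_entropy(logits.reshape(-1, VOCAB), token_targets.reshape(-1))

pred_feats = torch.randn(B, T, DIM)              # head predicting continuous features
cont_targets = torch.randn(B, T, DIM)
mse = F.mse_loss(pred_feats, cont_targets)

print(f"NLL (discrete): {nll:.3f} | MSE (continuous): {mse:.3f}")
```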