Title: SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
Authors: Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu
Published: 31st August 2023 (Thursday) @ 12:53:09
Link: http://arxiv.org/abs/2308.16692v2
Abstract
Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we established the first benchmark, SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts the Encoder-Decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across different RVQ layers. Furthermore, we construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. Also, USLM outperforms VALL-E in zero-shot Text-to-Speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.
- Update from the repo: [2024/3] 🔥 We released a checkpoint of SpeechTokenizer with Snake activation trained on LibriSpeech and Common Voice.
- Note: Snake activation comes from High-Fidelity Audio Compression with Improved RVQGAN
- Codebases:
- SpeechTokenizer
- USLM
- SLMTokBench - nothing there as of 2025-04 (about a year and a half after publication) 👉 the implementation for SLMTokBench is at gyt1145028706/CodecEvaluation
- “Sorry for the late update, you can check https://github.com/gyt1145028706/CodecEvaluation. This is an implementation from out team member.” - comment on Issue #1 from 0nutation on SLMTokBench
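For orientation, a minimal usage sketch of the SpeechTokenizer codebase, written from memory of its README; the API names (`load_from_checkpoint`, `encode`, `decode`, `sample_rate`) and the file paths are assumptions to verify against the repo:

```python
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer

config_path = "config.json"              # placeholder paths
ckpt_path = "SpeechTokenizer.pt"

model = SpeechTokenizer.load_from_checkpoint(config_path, ckpt_path)
model.eval()

wav, sr = torchaudio.load("example.wav")
if sr != model.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)
wav = wav.unsqueeze(0)                   # (batch, channels, time)

with torch.no_grad():
    codes = model.encode(wav)            # (n_q, batch, frames)

semantic_tokens = codes[:1]              # RVQ-1: content / semantic information
acoustic_tokens = codes[1:]              # remaining RVQ layers: residual acoustic detail (timbre etc.)

with torch.no_grad():
    resynth = model.decode(codes)        # waveform reconstructed from all RVQ layers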
- Text Alignment Evaluation (§2.1): per Appendix A, relies on the variational contrastive log-ratio upper bound (vCLUB; Cheng et al., 2020) of the speech-text mutual information (see the sketch after this list)
- specifically, a held-out test set is used to obtain an unbiased sampled estimate of the (upper bound of the) mutual information
- “we first establish an embedding matrix, which can be either randomly initialized or derived from the k-means centroid matrix or vector quantization codebooks obtained during the discretization process”
- Question: come back and better understand the contrastive variational upper bound of the mutual information (MI; speech-text mutual information)
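A minimal sketch of the vCLUB estimator as I understand it (Cheng et al., 2020): train a variational network q_theta(text token | speech token) by maximum likelihood on a training split, then evaluate the sampled upper bound on a held-out split. The classifier architecture and per-token granularity below are my simplifications, not SLMTokBench's exact recipe:

```python
import torch
import torch.nn as nn

class VariationalQ(nn.Module):
    """q_theta(text token | speech token): embedding lookup + linear classifier.
    The embedding can be randomly initialized or taken from k-means centroids /
    VQ codebooks, as the paper describes."""
    def __init__(self, n_speech_tokens, n_text_tokens, dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_speech_tokens, dim)
        self.proj = nn.Linear(dim, n_text_tokens)

    def log_q(self, x, y):
        # log q_theta(y | x) for paired samples, used as the training objective
        logits = self.proj(self.embed(x))                              # (N, vocab)
        return torch.log_softmax(logits, dim=-1).gather(-1, y.unsqueeze(-1)).squeeze(-1)

def vclub_estimate(model, x, y):
    """Sampled vCLUB estimate on a held-out set:
    I_hat = mean_i[ log q(y_i | x_i) ] - mean_{i,j}[ log q(y_j | x_i) ]."""
    with torch.no_grad():
        logits = torch.log_softmax(model.proj(model.embed(x)), dim=-1)  # (N, vocab)
        positive = logits.gather(-1, y.unsqueeze(-1)).mean()            # paired samples
        negative = logits[:, y].mean()                                  # all (x_i, y_j) pairs
    return (positive - negative).item()

# toy usage (in SLMTokBench x would be discretized speech tokens, y text tokens)
x = torch.randint(0, 1024, (256,))
y = torch.randint(0, 5000, (256,))
q = VariationalQ(n_speech_tokens=1024, n_text_tokens=5000)
# ... train q by maximizing q.log_q(x_train, y_train).mean() on a training split ...
print(vclub_estimate(q, x, y))
```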
- Information Preservation Evaluation (§2.2):
- Content preservation is evaluated by computing the WER through transcribing the resynthesized speech using the Whisper en-medium model (Radford et al., 2023).
- Timbre preservation is evaluated by utilizing WavLM-TDNN (Chen et al., 2022) to calculate speaker similarity between the synthesized and ground-truth speech.
- We randomly sample 300 speech samples from the LibriSpeech test set for evaluation.
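A hedged sketch of these two metrics; the Whisper checkpoint name ("medium.en"), the use of a HuggingFace WavLM x-vector model as a stand-in for the paper's WavLM-TDNN, and the file paths are all my assumptions:

```python
import torch
import torchaudio
import whisper
from jiwer import wer
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

asr = whisper.load_model("medium.en")

def content_wer(resynth_path, reference_text):
    # content preservation: WER of the resynthesized speech against the reference text
    hyp = asr.transcribe(resynth_path)["text"]
    return wer(reference_text.lower(), hyp.lower())

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
sv_model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv").eval()

def speaker_similarity(path_a, path_b):
    # timbre preservation: cosine similarity between speaker embeddings
    embs = []
    for path in (path_a, path_b):
        wav, sr = torchaudio.load(path)
        wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)
        inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
        with torch.no_grad():
            embs.append(sv_model(**inputs).embeddings.squeeze(0))
    return torch.cosine_similarity(embs[0], embs[1], dim=0).item()
```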
- Semantic vs Acoustic tokens (§2.3):
- HuBERT L9 units used to represent semantic tokens
- EnCodec codes to represent acoustic tokens
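For context, a sketch of how these two token families are commonly extracted (the checkpoints, cluster count, and bandwidth are my assumptions, not necessarily the paper's exact setup):

```python
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import AutoProcessor, EncodecModel, HubertModel

wav, sr = torchaudio.load("example.wav")
wav16 = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0, keepdim=True)  # (1, T)

# --- semantic tokens: HuBERT layer-9 features discretized by k-means ---
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
with torch.no_grad():
    hidden = hubert(wav16, output_hidden_states=True).hidden_states[9]  # (1, frames, 768)
feats = hidden[0].numpy()
# in practice the k-means codebook (e.g. 500 or 1000 units) is fit offline on a large
# corpus; fitting on a single utterance here only keeps the sketch self-contained
kmeans = KMeans(n_clusters=min(50, len(feats)), n_init=10).fit(feats)
semantic_tokens = kmeans.predict(feats)                                 # (frames,) unit ids

# --- acoustic tokens: EnCodec RVQ codes ---
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
encodec = EncodecModel.from_pretrained("facebook/encodec_24khz").eval()
wav24 = torchaudio.functional.resample(wav, sr, 24000).mean(dim=0)
inputs = processor(raw_audio=wav24.numpy(), sampling_rate=24000, return_tensors="pt")
with torch.no_grad():
    acoustic_tokens = encodec.encode(
        inputs["input_values"], inputs["padding_mask"], bandwidth=6.0   # 8 RVQ codebooks
    ).audio_codes                                                       # (chunks, batch, n_q, frames)
```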
- SpeechTokenizer (§3)
- Based on the RVQ-GAN encoder-decoder architecture of EnCodec; EnCodec's 2-layer LSTM is replaced with a 2-layer BiLSTM (toy sketch below)
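A toy illustration of that substitution; the dimensions (and the halved hidden size so the BiLSTM output keeps the same width) are my assumptions, the paper only states the LSTM-to-BiLSTM change:

```python
import torch
import torch.nn as nn

dim = 1024                                   # assumed model dimension
# EnCodec-style context module: 2-layer unidirectional LSTM
lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
# SpeechTokenizer-style: 2-layer BiLSTM (forward + backward outputs concatenated)
bilstm = nn.LSTM(dim, dim // 2, num_layers=2, batch_first=True, bidirectional=True)

x = torch.randn(1, 50, dim)                  # (batch, frames, channels)
y_uni, _ = lstm(x)                           # (1, 50, 1024)
y_bi, _ = bilstm(x)                          # (1, 50, 1024) = 512 forward + 512 backward
```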
Training Regime (Training Objectives)
Semantic Distillation of HuBERT into the first codebook
- Two distillation targets are used for the first codebook (sketch after this list):
- continuous - HuBERT L9 features, matched via a cosine-similarity loss
- discrete - HuBERT pseudo-labels, matched via a cross-entropy loss
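A hedged sketch of the two distillation objectives. The paper's exact formulation may differ (e.g. how the cosine similarity is aggregated over time and dimensions), so the per-frame losses and all dimensions/names below are only illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_CODEC, D_TEACHER, N_UNITS, T = 1024, 768, 500, 150   # assumed sizes

proj = nn.Linear(D_CODEC, D_TEACHER)        # projection from codec space to teacher space
unit_head = nn.Linear(D_CODEC, N_UNITS)     # classifier for the discrete variant

q1 = torch.randn(1, T, D_CODEC)             # output of the first RVQ layer
teacher_feat = torch.randn(1, T, D_TEACHER) # continuous target: HuBERT L9 features
teacher_units = torch.randint(0, N_UNITS, (1, T))  # discrete target: HuBERT pseudo-labels

# continuous distillation: push RVQ-1 representations toward the teacher features
cont_loss = 1.0 - F.cosine_similarity(proj(q1), teacher_feat, dim=-1).mean()

# discrete distillation: cross-entropy against HuBERT pseudo-labels
disc_loss = F.cross_entropy(unit_head(q1).transpose(1, 2), teacher_units)
```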
GAN training objectives
- Reconstruction loss:
- time domain: minimize the L1 distance between $x$ and $\hat{x}$, i.e. $\ell_t = \lVert x - \hat{x} \rVert_1$
- frequency domain: linearly combine the L1 and L2 losses over the mel-spectrogram using several time scales
- Formally, $\ell_f = \sum_{i \in e} \lVert S_i(x) - S_i(\hat{x}) \rVert_1 + \lVert S_i(x) - S_i(\hat{x}) \rVert_2$
- where $S_i$ is a 64-bin mel-spectrogram computed with a normalized STFT with window size $2^i$ and hop length $2^i/4$
- $e = 5, \dots, 11$ is the set of scales
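A minimal sketch of these reconstruction terms (EnCodec-style), assuming 16 kHz audio and torchaudio mel-spectrograms; the STFT normalization, the exact L2 term, and the averaging over scales are simplifications:

```python
import torch
import torch.nn.functional as F
import torchaudio

def reconstruction_losses(x, x_hat, sample_rate=16000, scales=range(5, 12)):
    # time domain: L1 between waveforms
    l_t = (x - x_hat).abs().mean()

    # frequency domain: L1 + L2 over 64-bin mel-spectrograms at window sizes 2^i
    l_f = 0.0
    for i in scales:
        # note: at the smallest scales, 64 mel bins exceed the number of STFT bins
        # and torchaudio emits a warning; harmless for this sketch
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=2 ** i, win_length=2 ** i,
            hop_length=2 ** i // 4, n_mels=64, normalized=True,
        )
        s_x, s_hat = mel(x), mel(x_hat)
        l_f = l_f + (s_x - s_hat).abs().mean() + F.mse_loss(s_x, s_hat)
    return l_t, l_f / len(scales)

# toy usage
x, x_hat = torch.randn(1, 16000), torch.randn(1, 16000)
l_t, l_f = reconstruction_losses(x, x_hat)
```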
- Discriminative loss:
- same discriminators as HiFi-Codec Yang et al. (2023) - three discriminators:
- a multi-scale STFT-based (MS-STFT) discriminator
- a multi-period discriminator (MPD)
- a multi-scale discriminator (MSD)
- The adversarial loss is used to promote perceptual quality
- The adversarial loss is defined as a hinge loss over the logits of the discriminator, averaged over multiple discriminators and over time
Let $K$ denote the number of discriminators; the adversarial loss for the generator is constructed as $\ell_g = \frac{1}{K} \sum_{k=1}^{K} \max\left(0, 1 - D_k(\hat{x})\right)$. For the discriminators it is defined as $\ell_d = \frac{1}{K} \sum_{k=1}^{K} \max\left(0, 1 - D_k(x)\right) + \max\left(0, 1 + D_k(\hat{x})\right)$.
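A compact sketch of these hinge-style adversarial losses; `real_logits` / `fake_logits` stand for the per-discriminator outputs (MS-STFT, MPD, MSD) collected into lists, which is an assumption about how the discriminators are wired:

```python
import torch

def generator_adv_loss(fake_logits):
    # l_g = (1/K) * sum_k mean_t max(0, 1 - D_k(x_hat))
    return sum(torch.relu(1.0 - f).mean() for f in fake_logits) / len(fake_logits)

def discriminator_adv_loss(real_logits, fake_logits):
    # l_d = (1/K) * sum_k [ mean_t max(0, 1 - D_k(x)) + mean_t max(0, 1 + D_k(x_hat)) ]
    return sum(
        torch.relu(1.0 - r).mean() + torch.relu(1.0 + f).mean()
        for r, f in zip(real_logits, fake_logits)
    ) / len(real_logits)

# toy usage with three "discriminators"
real = [torch.randn(1, 100) for _ in range(3)]
fake = [torch.randn(1, 100) for _ in range(3)]
l_g = generator_adv_loss(fake)
l_d = discriminator_adv_loss(real, fake)
```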
Additionally, a feature matching loss for the generator is computed as follows: $\ell_{feat} = \frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{\lVert D_k^l(x) - D_k^l(\hat{x}) \rVert_1}{\operatorname{mean}\left(\lVert D_k^l(x) \rVert_1\right)}$, where the mean is computed over all dimensions and $L$ is the number of layers in the discriminators.
RVQ Commitment Loss: a commitment loss is added between the pre-quantized value and its quantized value, with no gradient computed for the quantized value. It is defined as $\ell_w = \sum_{i=1}^{N_q} \lVert z_i - z_{q_i} \rVert_2^2$, where $z_i$ and $z_{q_i}$ denote the current residual and the nearest entry in the corresponding codebook, respectively.
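A sketch of the feature-matching and commitment terms above; `real_feats` / `fake_feats` are nested lists of intermediate discriminator feature maps, and the residual/codebook handling is reduced to a single RVQ stage for illustration:

```python
import torch

def feature_matching_loss(real_feats, fake_feats):
    # l_feat = (1/(K*L)) * sum_{k,l} ||D_k^l(x) - D_k^l(x_hat)||_1 / mean(||D_k^l(x)||_1)
    loss, count = 0.0, 0
    for real_layers, fake_layers in zip(real_feats, fake_feats):
        for fr, ff in zip(real_layers, fake_layers):
            loss = loss + (fr - ff).abs().mean() / fr.abs().mean().clamp(min=1e-8)
            count += 1
    return loss / count

def rvq_commitment_loss(residual, quantized):
    # pull the encoder residual toward its (detached) codebook entry;
    # no gradient flows back to the quantized value
    return ((residual - quantized.detach()) ** 2).mean()

# toy usage: 2 discriminators x 3 layers each, and one RVQ stage
real_feats = [[torch.randn(1, 16, 50) for _ in range(3)] for _ in range(2)]
fake_feats = [[torch.randn(1, 16, 50) for _ in range(3)] for _ in range(2)]
l_feat = feature_matching_loss(real_feats, fake_feats)
l_w = rvq_commitment_loss(torch.randn(1, 50, 256), torch.randn(1, 50, 256))
```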
Generally, the generator is trained to optimize the following loss: $\mathcal{L}_G = \lambda_{distill}\,\mathcal{L}_{distill} + \lambda_t\,\ell_t + \lambda_f\,\ell_f + \lambda_g\,\ell_g + \lambda_{feat}\,\ell_{feat} + \lambda_w\,\ell_w$
where the $\lambda$ coefficients are hyper-parameters used to balance each loss term.
- We employ HuBERT (Hsu et al., 2021) as our semantic teacher in this study, as HuBERT is demonstrated to encompass substantial content information (Mohamed et al., 2022).
Questions / Follow up
- Come back and better understand the contrastive variational upper bound of the mutual information (MI; speech-text mutual information)
- Review notes from the unit HiFi-GAN paper: Speech Resynthesis from Discrete Disentangled Self-Supervised Representations