Title: SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
Authors: Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu
Published: 31st August 2023 (Thursday) @ 12:53:09
Link: http://arxiv.org/abs/2308.16692v2

Abstract

Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we established the first benchmark, SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts the Encoder-Decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across different RVQ layers. Furthermore, we construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. Also, USLM outperforms VALL-E in zero-shot Text-to-Speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.



  • Text Alignment Evaluation (§2.1): Referring to Appendix A: relies on the variational contrastive log-ratio upper bound (vCLUB; Cheng et al., 2020) of the speech-text mutual information
    • specifically, a held-out test set is used to provide an unbiased estimate of the vCLUB upper bound on the mutual information
    • “we first establish an embedding matrix, which can be either randomly initialized or derived from the k-means centroid matrix or vector quantization codebooks obtained during the discretization process”
      • question: I remember a paper whose experiments showed that initializing from the k-means centroids hurt performance… Was it TWIST?
    • question: Come back and better understand the contrastive variational upper bound of the mutual information (MI; speech-text mutual information); the estimator is written out after this list
  • Information Preservation Evaluation (§2.2): (a sketch of this evaluation loop follows the list)
    • Content preservation is evaluated by computing the WER after transcribing the resynthesized speech with the Whisper medium.en model (Radford et al., 2023).
    • Timbre preservation is evaluated by using WavLM-TDNN (Chen et al., 2022) to calculate speaker similarity between the synthesized and ground-truth speech.
    • 300 speech samples are randomly drawn from the LibriSpeech test set for evaluation.
  • Semantic vs Acoustic tokens (§2.3): (token extraction sketched below)
    • HuBERT L9 units are used to represent semantic tokens
    • EnCodec codes are used to represent acoustic tokens
  • SpeechTokenizer (§3)
    • Based on the RVQ-GAN encoder-decoder architecture of EnCodec, but substitutes EnCodec’s two-layer LSTM with a two-layer BLSTM
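On the vCLUB follow-up above: Cheng et al. (2020) fit a variational approximation $q_\theta(y \mid x)$ to $p(y \mid x)$ and, on samples $\{(x_i, y_i)\}_{i=1}^{N}$, estimate the upper bound as

$\hat{I}_{vCLUB} = \frac{1}{N} \sum_{i=1}^{N} \log q_\theta(y_i \mid x_i) - \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \log q_\theta(y_j \mid x_i)$

i.e. matched pairs should score higher under $q_\theta$ than mismatched pairs. My understanding of the SLMTokBench setup: $x$ is the embedded speech tokens, $y$ the text, $q_\theta$ is fit on a training split, and the bound is estimated on the test split, which is what makes the estimate unbiased.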
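A minimal sketch of the §2.2 information-preservation metrics, assuming the openai-whisper and jiwer packages; `speaker_embed` is a hypothetical stand-in for a WavLM-TDNN speaker-verification embedder, not the paper's code:

```python
import torch
import torch.nn.functional as F
import whisper  # openai-whisper
from jiwer import wer

asr = whisper.load_model("medium.en")

def speaker_embed(wav_path: str) -> torch.Tensor:
    """Hypothetical helper: return a speaker embedding for the utterance,
    e.g. from a WavLM-TDNN speaker-verification model."""
    raise NotImplementedError

def content_preservation(ref_text: str, resynth_path: str) -> float:
    # WER between the ground-truth transcript and Whisper's transcription
    # of the resynthesized speech (lower is better).
    hyp = asr.transcribe(resynth_path)["text"]
    return wer(ref_text.lower(), hyp.lower())

def timbre_preservation(gt_path: str, resynth_path: str) -> float:
    # Cosine similarity between speaker embeddings of the ground-truth
    # and resynthesized speech (higher is better).
    e1, e2 = speaker_embed(gt_path), speaker_embed(resynth_path)
    return F.cosine_similarity(e1, e2, dim=-1).item()
```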
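Similarly, a sketch of extracting the two §2.3 token types, using torchaudio's HuBERT pipeline and Meta's encodec package; the 500-cluster k-means is an assumed (commonly used) setting, not a number from the paper:

```python
import torch
import torchaudio
from sklearn.cluster import KMeans
from encodec import EncodecModel
from encodec.utils import convert_audio

# --- semantic tokens: HuBERT layer-9 features + k-means units ---
bundle = torchaudio.pipelines.HUBERT_BASE  # expects 16 kHz input
hubert = bundle.get_model().eval()

def hubert_l9(wav_16k: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        feats, _ = hubert.extract_features(wav_16k)  # per-layer outputs
    return feats[8]  # layer 9 (1-indexed), shape [B, T, 768]

# Fit k-means on pooled layer-9 frames, then assign unit ids per frame
# (500 clusters is an assumption).
kmeans = KMeans(n_clusters=500, n_init="auto")

# --- acoustic tokens: EnCodec RVQ codes ---
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)  # 8 codebooks at 24 kHz

def encodec_codes(wav: torch.Tensor, sr: int) -> torch.Tensor:
    wav = convert_audio(wav, sr, codec.sample_rate, codec.channels)
    with torch.no_grad():
        frames = codec.encode(wav.unsqueeze(0))
    return torch.cat([codes for codes, _ in frames], dim=-1)  # [B, n_q, T]
```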

Training Regime (Training Objectives)

Semantic Distillation of HuBERT into the first codebook

  • Semantic distillation of HuBERT into the first codebook takes both forms (a loss sketch follows this list):
    • continuous: distill the L9 features via a cosine-similarity loss
    • discrete: match the HuBERT pseudo-labels via a cross-entropy loss
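A minimal sketch of the two distillation losses, assuming the teacher features are already time-aligned with the first RVQ layer's output; the dimension-wise log-sigmoid-of-cosine form follows my reading of the paper's continuous objective, and `proj` / the prediction head are illustrative names:

```python
import torch
import torch.nn.functional as F

def continuous_distill_loss(q1, teacher, proj):
    # q1: [B, T, D'] quantized output of the first RVQ layer
    # teacher: [B, T, D] HuBERT L9 features, time-aligned with q1
    # proj: linear projection mapping q1 into the teacher's feature space
    x = proj(q1)
    # cosine similarity per feature dimension, taken across the time axis
    cos = F.cosine_similarity(x, teacher, dim=1)  # [B, D]
    return -F.logsigmoid(cos).mean()

def discrete_distill_loss(logits, pseudo_labels):
    # logits: [B, T, V] predictions over the HuBERT pseudo-label vocabulary
    # pseudo_labels: [B, T] k-means unit ids produced by the teacher
    return F.cross_entropy(logits.transpose(1, 2), pseudo_labels)
```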

GAN training objectives

  • Reconstruction loss: (a code sketch follows this list)
    • time domain: minimize the L1 distance between $x$ and $\hat{x}$, i.e. $\ell_t = \lVert x - \hat{x} \rVert_1$.
    • frequency domain: linearly combine the L1 and L2 losses over the mel-spectrogram using several time scales.
      • Formally, $\ell_f = \sum_{i \in e} \lVert \mathcal{S}_i(x) - \mathcal{S}_i(\hat{x}) \rVert_1 + \lVert \mathcal{S}_i(x) - \mathcal{S}_i(\hat{x}) \rVert_2$
        • where $\mathcal{S}_i$ is a 64-bin mel-spectrogram using a normalized STFT with window size $2^i$ and hop length $2^i/4$
        • $e = \{5, \dots, 11\}$ is the set of scales
  • Discriminative loss:
    • same discriminators as HiFi-Codec (Yang et al., 2023) - three discriminators:
      • a multi-scale STFT-based (MS-STFT) discriminator
      • a multi-period discriminator (MPD)
      • a multi-scale discriminator (MSD)
    • The adversarial loss is used to promote perceptual quality.
    • It is defined as a hinge loss over the logits of the discriminators, averaged over the multiple discriminators and over time.
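A sketch of the reconstruction terms as written above; the 64 mel bins, window $2^i$, hop $2^i/4$, and scales $e = \{5, \dots, 11\}$ follow the EnCodec recipe this setup mirrors, and the 16 kHz sample rate is an assumption:

```python
import torch
import torchaudio

SCALES = range(5, 12)  # e = {5, ..., 11}
mels = [
    torchaudio.transforms.MelSpectrogram(
        sample_rate=16_000,  # assumed sample rate
        n_fft=2**i, win_length=2**i, hop_length=2**i // 4, n_mels=64,
    )
    for i in SCALES
]

def time_loss(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    # l_t: L1 distance in the time domain
    return (x - x_hat).abs().mean()

def freq_loss(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    # l_f: L1 + L2 (here RMS as the L2 term) over multi-scale mels
    loss = x.new_zeros(())
    for mel in mels:
        s_x, s_hat = mel(x), mel(x_hat)
        loss = loss + (s_x - s_hat).abs().mean() + ((s_x - s_hat) ** 2).mean().sqrt()
    return loss
```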

Let $K$ denote the number of discriminators. The adversarial loss for the generator is constructed as $\ell_g = \frac{1}{K} \sum_{k=1}^{K} \max\left(0,\, 1 - D_k(\hat{x})\right)$. For the discriminators, $\ell_d$ is defined as $\ell_d = \frac{1}{K} \sum_{k=1}^{K} \max\left(0,\, 1 - D_k(x)\right) + \max\left(0,\, 1 + D_k(\hat{x})\right)$.
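In code, these hinge losses look roughly like the following sketch, where each list element holds one discriminator's logits:

```python
import torch

def generator_adv_loss(fake_logits: list[torch.Tensor]) -> torch.Tensor:
    # l_g = (1/K) * sum_k max(0, 1 - D_k(x_hat)), averaged over time
    return torch.stack([torch.relu(1 - d).mean() for d in fake_logits]).mean()

def discriminator_loss(real_logits, fake_logits) -> torch.Tensor:
    # l_d = (1/K) * sum_k [max(0, 1 - D_k(x)) + max(0, 1 + D_k(x_hat))]
    losses = [
        torch.relu(1 - dr).mean() + torch.relu(1 + df).mean()
        for dr, df in zip(real_logits, fake_logits)
    ]
    return torch.stack(losses).mean()
```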

Additionally, a feature matching loss for the generator is computed as follows: $\ell_{feat} = \frac{1}{K L} \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{\lVert D_k^{l}(x) - D_k^{l}(\hat{x}) \rVert_1}{\operatorname{mean}\left(\lVert D_k^{l}(x) \rVert_1\right)}$

where the mean is computed over all dimensions and $L$ is the number of layers in the discriminators.

RVQ Commitment Loss

We add a commitment loss between the pre-quantized value and its quantized value, with no gradient computed for the quantized value. The RVQ commitment loss is defined as $\ell_w = \sum_{i=1}^{N_q} \lVert z_i - z_{q_i} \rVert_2^2$, where $z_i$ and $z_{q_i}$ denote the current residual and the nearest entry in the corresponding codebook, respectively.
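And a sketch of the feature-matching and commitment terms; in PyTorch the stop-gradient on the quantized value is `detach()`:

```python
import torch

def feature_matching_loss(real_feats, fake_feats) -> torch.Tensor:
    # real_feats/fake_feats: per-discriminator lists of per-layer feature maps
    loss, count = 0.0, 0
    for rf_k, ff_k in zip(real_feats, fake_feats):
        for r, f in zip(rf_k, ff_k):
            loss = loss + (r - f).abs().mean() / r.abs().mean().clamp_min(1e-8)
            count += 1
    return loss / count

def rvq_commitment_loss(residuals, quantized) -> torch.Tensor:
    # sum_i || z_i - sg(z_{q_i}) ||_2^2 ; no gradient through the codebook entry
    return sum(
        torch.mean((z - zq.detach()) ** 2) for z, zq in zip(residuals, quantized)
    )
```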

Generally, the generator is trained to optimize the following loss:

$\mathcal{L}_G = \lambda_{distill}\, \mathcal{L}_{distill} + \lambda_t\, \ell_t + \lambda_f\, \ell_f + \lambda_g\, \ell_g + \lambda_{feat}\, \ell_{feat} + \lambda_w\, \ell_w$

where the $\lambda$ coefficients are hyper-parameters used to balance each loss term.

  • We employ HuBERT (Hsu et al., 2021) as our semantic teacher in this study, as HuBERT has been shown to capture substantial content information (Mohamed et al., 2022).

Questions / Follow up