Title: VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
Authors: Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, Shinji Watanabe
Published: 14 September 2023
Link: http://arxiv.org/abs/2309.07937v3

Abstract

We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates the text vocabulary with discrete speech tokens derived from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech synthesis, with speech intelligibility improving from 28.9 to 5.6 and objective quality from 2.68 to 3.90. VoxtLM also improves speech generation and speech recognition performance over the single-task counterparts. Further, VoxtLM is trained with publicly available data, and training recipes and model checkpoints are open-sourced to make the work fully reproducible.


Quick Notes

Recent concurrent studies employ a single model for multiple speech and text conversion tasks [18–20], similar to our approach. SpeechGPT [20] uses a three-stage adaptation to combine audio generation with textLMs. PolyVoice [18] applies a speechLM to speech-to-speech translation (S2ST) with three decoder-only LMs. VioLA [19] extends VALL-E [7] to ASR and S2ST. Among these, VioLA is the most closely related to this work. However, VioLA does not incorporate speech or text continuation tasks and requires additional sequence modeling for speech representations, which makes it more complicated than our approach. Moreover, we utilize the textually pre-trained OPT [21] for better initialization, inspired by [22], and leverage different speech tokens. Also, in comparison to other works, ours is fully reproducible.
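The multitask setup above can be sketched as plain sequence formatting: each task is laid out as one token stream, and a single decoder-only LM is trained with the usual next-token objective. This is a minimal illustration; the special-token names (`<speech>`, `<text>`, `<asr>`, `<tts>`) are illustrative placeholders, not the exact markers used by VoxtLM.

```python
# Sketch: multiplexing four tasks in one decoder-only LM via special tokens.
# Token names are hypothetical stand-ins for VoxtLM's actual special tokens.
SPECIAL = {"speech": "<speech>", "text": "<text>",
           "asr": "<asr>", "tts": "<tts>"}

def format_example(task, speech_tokens=None, text_tokens=None):
    """Lay out one training sequence; the model is trained with a plain
    next-token objective over the concatenated stream."""
    if task == "asr":        # speech -> text
        return [SPECIAL["speech"], *speech_tokens,
                SPECIAL["asr"], SPECIAL["text"], *text_tokens]
    if task == "tts":        # text -> speech
        return [SPECIAL["text"], *text_tokens,
                SPECIAL["tts"], SPECIAL["speech"], *speech_tokens]
    if task == "textlm":     # text continuation
        return [SPECIAL["text"], *text_tokens]
    if task == "speechlm":   # speech continuation
        return [SPECIAL["speech"], *speech_tokens]
    raise ValueError(f"unknown task: {task}")

seq = format_example("asr", speech_tokens=["s12", "s7"], text_tokens=["hel", "lo"])
```

At inference time, the same layout serves as a prompt: emitting `<speech>…<asr><text>` and sampling the continuation performs recognition, while `<text>…<tts><speech>` performs synthesis.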

Previous work [22] shows that initializing a speechLM with a pre-trained textLM achieves better performance and faster convergence. Motivated by this, we use the pretrained textLM OPT [21] to initialize VoxtLM weights and learn the embedding table from scratch.

The same model configuration as the pretrained model is used, except for . OPT is used because it was trained on publicly available data and smaller pretrained models are available.
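A minimal sketch of this initialization scheme, using NumPy arrays as stand-ins for the weight tensors (the parameter names, sizes, and "pretrained" values below are placeholders, not OPT's real ones): transformer-block weights are copied from the text-pretrained checkpoint, while the embedding table over the joint text+speech vocabulary is drawn fresh.

```python
import numpy as np

def init_voxtlm(pretrained, text_vocab, n_speech_tokens, n_special, dim, rng):
    """Copy transformer weights from a text-pretrained checkpoint and
    re-initialize the embedding table over the joint vocabulary."""
    joint_vocab = text_vocab + n_speech_tokens + n_special
    # Keep every pretrained tensor except the text-only embedding table.
    params = {k: v.copy() for k, v in pretrained.items() if k != "embed"}
    # Embedding table is learned from scratch over the joint vocabulary.
    params["embed"] = rng.normal(0.0, 0.02, size=(joint_vocab, dim))
    return params

rng = np.random.default_rng(0)
pretrained = {"embed": rng.normal(size=(100, 8)), "block0": rng.normal(size=(8, 8))}
params = init_voxtlm(pretrained, text_vocab=100,
                     n_speech_tokens=50, n_special=4, dim=8, rng=rng)
```

The design choice mirrors the text: all non-embedding weights carry over from OPT for faster convergence, and only the (now larger) embedding table starts from random initialization.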

Effect of token vocabulary size

We compare k = 50, 200, and 1000, as outlined in Table 5. Comparisons are made on Dbal and Dset.

For ASR and TTS, the performance of k = 50 is poor. For speechLM with Dset, the best scores on sWUGGY and sBLIMP are observed with the k = 200 model. TextLM, as expected, does not show a significant pattern with varying k.

Scalability

Next, we explore whether model size can help with data balancing by comparing the medium and large models with k = 200, as presented in Table 6. All metrics for textLM, speechLM, and ASR improve with the larger model. TTS shows a very small degradation in intelligibility (0.4) and quality (0.03).

To mitigate the smaller ratio of paired data, we incorporate more supervised ASR data in Dset. We compare k = 200 and k = 1000 and observe an improvement in the ASR task.

Speech token decoder

The speech token decoder takes discrete speech tokens and a speaker embedding of dimensionality N as inputs and produces a speech waveform. We use HiFi-GAN [28] as the architecture and an x-vector [29] as the speaker embedding.
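The decoder's interface can be sketched as follows. This is a shape-level stand-in only, not the actual HiFi-GAN generator; the hop size of 320 is an assumption (16 kHz audio with 50 Hz speech tokens), and the body just emits a correctly shaped array.

```python
import numpy as np

HOP = 320  # assumed samples per speech token: 16 kHz audio / 50 Hz tokens

def token_decoder(speech_tokens, spk_embedding, hop=HOP):
    """Stand-in for the HiFi-GAN token decoder: maps T discrete speech
    tokens plus an N-dim speaker embedding (x-vector) to a waveform of
    roughly T * hop samples."""
    T = len(speech_tokens)
    # A real decoder upsamples learned token embeddings conditioned on the
    # speaker embedding; here we only reproduce the input/output shapes.
    cond = float(np.tanh(spk_embedding.mean()))
    return np.full(T * hop, cond, dtype=np.float32)

wav = token_decoder(np.array([3, 17, 42]), spk_embedding=np.zeros(512))
```

The key point the sketch captures is the conditioning: speaker identity enters only through the x-vector, so the same token sequence can be rendered in different voices.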

Single-task Baselines

  • Speech LM: GSLM [4] and AudioLM [5]
    • Result: For speechLM (Table 7), GSLM-k200, which uses the same tokenizer and a similar one-stage model, attains a lower sBLIMP score than VoxtLM. However, AudioLM, which uses two token representations (acoustic and semantic) and a three-stage model, achieves higher sWUGGY and sBLIMP scores, suggesting potential for further improvement with hierarchical tokens and multi-stage training.
  • ASR: E-Branchformer [40]
    • For ASR, we compare two models:
      • one using spectrogram as input (ASR-Fbank)
      • another using discrete speech tokens as input (dst-ASR-Hubert), trained following the procedure of [26] with the same speech tokenizer as VoxtLM-k1000
    • Result: For ASR, compared to dst-ASR-Hubert, which uses the same tokenizer as VoxtLM, we observe a lower WER. Compared to ASR-Fbank (no tokenizer), WER is higher; such a trend is also observed in other discrete-token ASR models [26].
  • TTS: VITS [39] (we use a VITS model pretrained on LibriTTS)
    • In TTS (Table 8), compared to VITS, VoxtLM reports better intelligibility and quality.
    • Although VoxtLM is trained on a larger dataset than VITS, it is interesting to note that in traditional TTS, more diverse training data with more noise and more speakers typically degrades performance, whereas here an improvement is observed.

Finally, our experimental results show that both ASR and TTS can be modeled as language modeling tasks. Moreover, using special tokens, we can combine ASR and TTS within a joint speech-text language modeling framework. Although the four tasks are quite different, combining them leads to improvements.

Configurations / hyperparameter combinations for subword modeling:

To train the sub-word model, we use paired text-speech from ASR and TTS datasets.

  • Experiment with three k values (introduced in Sec. 3.1), 50, 200, and 1000, denoted as VoxtLM-k.
  • Also vary BPE sizes, setting them at 2K, 5K, and 10K for k values 50, 200, and 1000, respectively.
  • Three model configurations (L := Layers, H := Heads and F := model/feature dimension)
    • small: L=12, F=768, H=12
    • medium: L=24, F=1024, H=16
    • large: L=24, F=2048, H=32
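These settings can be gathered into a small lookup for reference. The dictionary layout and the `joint_vocab_size` helper are ours; the layer/head/dimension numbers and BPE sizes come from the notes above, while the special-token count is an assumption.

```python
# Model configurations from the notes (L = layers, H = heads, F = dim).
MODEL_CONFIGS = {
    "small":  {"layers": 12, "dim": 768,  "heads": 12},
    "medium": {"layers": 24, "dim": 1024, "heads": 16},
    "large":  {"layers": 24, "dim": 2048, "heads": 32},
}

# BPE vocabulary size paired with each speech-token vocabulary size k.
BPE_SIZE_FOR_K = {50: 2_000, 200: 5_000, 1000: 10_000}

def joint_vocab_size(k, n_special=4):
    """Illustrative joint vocabulary: text BPE subwords + k discrete
    speech tokens + special tokens (n_special=4 is an assumption)."""
    return BPE_SIZE_FOR_K[k] + k + n_special
```

Note the trend the tables above examine: a larger speech-token vocabulary k comes with a larger text BPE vocabulary, so the embedding table grows on both sides of the modality split.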

Table 4: Experimental results comparing VoxtLM-k50 on Dset with and without initialization from a pretrained (PT) textLM.

Initializing from OPT (Open Pre-trained Transformer Language Models)

Table 7: SpeechLM and ASR results: comparison of VoxtLM with state-of-the-art models. † denotes initialization with OPT.