Title: VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
Authors: Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, Shinji Watanabe
Published: 14 September 2023
Link: http://arxiv.org/abs/2309.07937v3

Abstract

We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates the text vocabulary with discrete speech tokens derived from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech synthesis, with speech intelligibility improving from 28.9 to 5.6 and objective quality from 2.68 to 3.90. VoxtLM also improves speech generation and speech recognition performance over the single-task counterparts. Further, VoxtLM is trained with publicly available data, and training recipes and model checkpoints are open-sourced to make the work fully reproducible.


Quick Notes

Recent concurrent studies employ a single model for multiple speech and text conversion tasks [18–20], similar to our approach. SpeechGPT [20] uses a three-stage adaptation to combine audio generation with textLMs. PolyVoice [18] applies a speechLM to speech-to-speech translation (S2ST) with three decoder-only LMs. VioLA [19] extends VALL-E [7] to ASR and S2ST. Among these, VioLA is the most closely related to this work. However, VioLA does not incorporate speech or text continuation tasks and requires additional sequence modeling for speech representations, which makes it more complicated than our approach. Moreover, we utilize the textually pre-trained OPT [21] for better initialization, inspired by [22], and leverage different speech tokens. Also, in comparison to other works, ours is fully reproducible.
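The multitask setup above can be sketched as plain sequence formatting: each task is laid out as one token stream, and a single decoder-only LM is trained with the usual next-token objective. This is a minimal illustration; the special-token names (`<speech>`, `<text>`, `<asr>`, `<tts>`) are illustrative placeholders, not the exact markers used by VoxtLM.

```python
# Sketch: multiplexing four tasks in one decoder-only LM via special tokens.
# Token names are hypothetical stand-ins for VoxtLM's actual special tokens.
SPECIAL = {"speech": "<speech>", "text": "<text>",
           "asr": "<asr>", "tts": "<tts>"}

def format_example(task, speech_tokens=None, text_tokens=None):
    """Lay out one training sequence; the model is trained with a plain
    next-token objective over the concatenated stream."""
    if task == "asr":        # speech -> text
        return [SPECIAL["speech"], *speech_tokens,
                SPECIAL["asr"], SPECIAL["text"], *text_tokens]
    if task == "tts":        # text -> speech
        return [SPECIAL["text"], *text_tokens,
                SPECIAL["tts"], SPECIAL["speech"], *speech_tokens]
    if task == "textlm":     # text continuation
        return [SPECIAL["text"], *text_tokens]
    if task == "speechlm":   # speech continuation
        return [SPECIAL["speech"], *speech_tokens]
    raise ValueError(f"unknown task: {task}")

seq = format_example("asr", speech_tokens=["s12", "s7"], text_tokens=["hel", "lo"])
```

At inference time, the same layout serves as a prompt: emitting `<speech>…<asr><text>` and sampling the continuation performs recognition, while `<text>…<tts><speech>` performs synthesis.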

Previous work [22] shows that initializing a speechLM with a pre-trained textLM achieves better performance and faster convergence. Motivated by this, we use the pretrained textLM OPT [21] to initialize VoxtLM weights and learn the embedding table from scratch.

The same model configuration as the pretrained model is used, except for . OPT is used because it was trained on publicly available data and smaller pretrained models are available.
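A minimal sketch of this initialization scheme, using NumPy arrays as stand-ins for the weight tensors (the parameter names, sizes, and "pretrained" values below are placeholders, not OPT's real ones): transformer-block weights are copied from the text-pretrained checkpoint, while the embedding table over the joint text+speech vocabulary is drawn fresh.

```python
import numpy as np

def init_voxtlm(pretrained, text_vocab, n_speech_tokens, n_special, dim, rng):
    """Copy transformer weights from a text-pretrained checkpoint and
    re-initialize the embedding table over the joint vocabulary."""
    joint_vocab = text_vocab + n_speech_tokens + n_special
    # Keep every pretrained tensor except the text-only embedding table.
    params = {k: v.copy() for k, v in pretrained.items() if k != "embed"}
    # Embedding table is learned from scratch over the joint vocabulary.
    params["embed"] = rng.normal(0.0, 0.02, size=(joint_vocab, dim))
    return params

rng = np.random.default_rng(0)
pretrained = {"embed": rng.normal(size=(100, 8)), "block0": rng.normal(size=(8, 8))}
params = init_voxtlm(pretrained, text_vocab=100,
                     n_speech_tokens=50, n_special=4, dim=8, rng=rng)
```

The design choice mirrors the text: all non-embedding weights carry over from OPT for faster convergence, and only the (now larger) embedding table starts from random initialization.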

Effect of token vocabulary size

We compare k = 50, 200, and 1000, as outlined in Table 5. Comparisons are made on Dbal and Dset.

For ASR and TTS, the performance of k = 50 is poor. For speechLM with Dset, the best scores on sWUGGY and sBLIMP are observed with the k = 200 model. TextLM, as expected, does not show a significant pattern with varying k.

Scalability

Next, we explore whether model size can help with data balancing by comparing the medium and large models with k = 200, as presented in Table 6. All metrics for textLM, speechLM, and ASR improve with the larger model. TTS shows a very small degradation in intelligibility (0.4) and quality (0.03).

To mitigate the smaller ratio of paired data, we incorporate more supervised ASR data in Dset. We compare k = 200 and k = 1000 and observe an improvement in the ASR task.

Speech token decoder

The speech token decoder takes discrete speech tokens and a speaker embedding of dimensionality N as inputs and produces a speech waveform. We use HiFi-GAN [28] as the architecture and an x-vector [29] as the speaker embedding.
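The decoder's interface can be sketched as follows. This is a shape-level stand-in only, not the actual HiFi-GAN generator; the hop size of 320 is an assumption (16 kHz audio with 50 Hz speech tokens), and the body just emits a correctly shaped array.

```python
import numpy as np

HOP = 320  # assumed samples per speech token: 16 kHz audio / 50 Hz tokens

def token_decoder(speech_tokens, spk_embedding, hop=HOP):
    """Stand-in for the HiFi-GAN token decoder: maps T discrete speech
    tokens plus an N-dim speaker embedding (x-vector) to a waveform of
    roughly T * hop samples."""
    T = len(speech_tokens)
    # A real decoder upsamples learned token embeddings conditioned on the
    # speaker embedding; here we only reproduce the input/output shapes.
    cond = float(np.tanh(spk_embedding.mean()))
    return np.full(T * hop, cond, dtype=np.float32)

wav = token_decoder(np.array([3, 17, 42]), spk_embedding=np.zeros(512))
```

The key point the sketch captures is the conditioning: speaker identity enters only through the x-vector, so the same token sequence can be rendered in different voices.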

Single-task Baselines

  • Speech LM: GSLM [4] and AudioLM [5]
    • Result: For speechLM (Table 7), GSLM-k200, which uses the same tokenizer and a similar one-stage model, attains a lower sBLIMP score than VoxtLM. However, AudioLM, which uses two token representations (acoustic and semantic) and a three-stage model, achieves higher sWUGGY and sBLIMP scores, suggesting potential for further improvement with hierarchical tokens and multi-stage training.
  • ASR: E-Branchformer [40]
    • For ASR, we compare two models:
      • one using spectrogram as input (ASR-Fbank)
      • another using discrete speech tokens as input (dst-ASR-Hubert), trained following the procedure of [26] with the same speech tokenizer as VoxtLM-k1000
    • Result: For ASR, compared to dst-ASR-Hubert, which uses the same tokenizer as VoxtLM, we observe a lower WER. Compared to ASR-Fbank (no tokenizer), WER is higher; such a trend is also observed in other discrete-token ASR models [26].
  • TTS: VITS [39] (we use a VITS model pretrained on LibriTTS)
    • In TTS (Table 8), compared to VITS, VoxtLM reports better intelligibility and quality.
    • Although VoxtLM is trained on a larger dataset than VITS, it is interesting to note that in traditional TTS, more diverse training data with more noise and more speakers typically degrades performance, whereas here an improvement is observed.

Finally, our experimental results show that both ASR and TTS can be modeled as language modeling tasks. Moreover, using special tokens, we can combine ASR and TTS within a joint speech-text language modeling framework. Although the four tasks are quite different, combining them leads to improvements.

Configurations / hyperparameter combinations for subword modeling:

To train the sub-word model, we use paired text-speech from ASR and TTS datasets.

  • Experiment with three k values (introduced in Sec. 3.1), 50, 200, and 1000, denoted as VoxtLM-k.
  • Also vary BPE sizes, setting them at 2K, 5K, and 10K for k values 50, 200, and 1000, respectively.
  • Three model configurations (L := Layers, H := Heads and F := model/feature dimension)
    • small: L=12, F=768, H=12
    • medium: L=24, F=1024, H=16
    • large: L=24, F=2048, H=32
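These settings can be gathered into a small lookup for reference. The dictionary layout and the `joint_vocab_size` helper are ours; the layer/head/dimension numbers and BPE sizes come from the notes above, while the special-token count is an assumption.

```python
# Model configurations from the notes (L = layers, H = heads, F = dim).
MODEL_CONFIGS = {
    "small":  {"layers": 12, "dim": 768,  "heads": 12},
    "medium": {"layers": 24, "dim": 1024, "heads": 16},
    "large":  {"layers": 24, "dim": 2048, "heads": 32},
}

# BPE vocabulary size paired with each speech-token vocabulary size k.
BPE_SIZE_FOR_K = {50: 2_000, 200: 5_000, 1000: 10_000}

def joint_vocab_size(k, n_special=4):
    """Illustrative joint vocabulary: text BPE subwords + k discrete
    speech tokens + special tokens (n_special=4 is an assumption)."""
    return BPE_SIZE_FOR_K[k] + k + n_special
```

Note the trend the tables above examine: a larger speech-token vocabulary k comes with a larger text BPE vocabulary, so the embedding table grows on both sides of the modality split.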

Table 4: Experimental results comparing VoxtLM-k50 on Dset with and without initialization from a pretrained (PT) textLM.

Initializing from OPT (Open Pre-trained Transformer Language Models)

Table 7: SpeechLM and ASR results: comparison of VoxtLM with state-of-the-art models. † denotes initialization with OPT.