Title: OpusLM: A Family of Open Unified Speech Language Models
Authors: Jinchuan Tian, William Chen, Yifan Peng, Jiatong Shi, Siddhant Arora, Shikhar Bharadwaj, Takashi Maekaku, Yusuke Shinohara, Keita Goto, Xiang Yue, Huck Yang, Shinji Watanabe
Published: 21st June 2025 (Saturday) @ 06:30:59
Link: http://arxiv.org/abs/2506.17611v1

Abstract

This paper presents Open Unified Speech Language Models (OpusLMs), a family of open foundational speech language models (SpeechLMs) up to 7B. Initialized from decoder-only text language models, the OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. We demonstrate our OpusLMs achieve comparable (or even superior) performance with existing SpeechLMs in speech recognition, speech synthesis, and text-only capabilities. Technically, this paper articulates our SpeechLM designs on tokenization, multi-stream language models, and multi-stage training strategies. We experimentally demonstrate the importance of model size scaling and the effect of annealing data selection. The OpusLMs are all built from publicly available materials and are fully transparent models. We release our code, data, checkpoints, and training logs to facilitate open SpeechLM research.


Notes

  • Additionally, we demonstrate the necessity of adopting a large model size (e.g., 1.7B) for this multi-modal pre-training.
  • We also show that annealing is beneficial although sensitive to data selection.
  • Architectural choices:
    • Use a combination of semantic and acoustic tokens — “Details about these tokenizers are in [29–31]”
    • Adopt Delay Interleave Architecture from Simple and Controllable Music Generation (Jade Copet et al.)
      • basically the higher-up tokens in the vector representing a single time-step (“frame”) are delayed by their “height” n in the vector (stream n is shifted n positions later)
      • the text-audio sequence is a T × N matrix - i.e. each time-step t is a vector v_t of the N speech tokens (1 semantic + N − 1 acoustic)
      • Why the delayed interleave architecture? (sketched in code at the end of these notes)
        • autoregressive in the time axis t - aligned with existing pre-trained text LMs
        • preserves “intra-frame autoregression” - dependencies between the tokens at different levels n for a given frame (i.e. given t) - see [12, 29]
        • maintains inference complexity at O(T) - independent of the number of token streams N
    • Sum the N pre-transformer embeddings at each position to produce the input for the next causal transformer step
    • predict all N next tokens - i.e. across the vector v_{t+1} - with the transformer hidden state combined with a level embedding z_n
      • z_n is a trainable vector to specify / select which level, n, in v_{t+1} to predict
  • Loss computing: we reweigh the tokens from different sources by the ratio text : semantic : acoustic = 1 : 1/2 : 1/(N − 1) (small sketch at the end of these notes)
    • based on our relative-importance assumptions: one text token is equivalent to one speech frame and each has a summed weight of 1; within one speech frame, the semantic token is equivalently important to all acoustic tokens combined (0.5 : 0.5).
  • Training:
    • pre-training: 500k “updates” (steps) - this is for the 1.7B model - the 7B model was only trained for 250k updates
    • remarks:
      1. “We switch the loss region from the whole sequence to the target region at 250k steps” - only beneficial for the 1.7B model; marginal benefit at “scale” (7B)
      2. batch size: 1M frames - a larger batch size of 4M frames, following 2 OLMo 2 Furious (the OLMo 2 paper), had poorer performance (“was sub-optimal”)
      3. important to keep the text-only corpus in domain for the pre-trained text LM
  • Annealing := “Annealing refers to the practice of decaying the learning rate linearly and quickly to zero in the ending phase of training, which is usually accompanied by a small portion of high-quality data” - they cite Llama 2: Open Foundation and Fine-Tuned Chat Models
  • Experimental setup
    • Data
      • YODAS [41], Emilia [42], and OWSM v3.2 suite [22]
      • We restrict our corpus to English-only
      • obtain a total volume of 213K hours.
      • We apply this data mixture to all ASR, TTS, and audio-only tasks, which gives 128B frames.
      • For text corpus, we follow the composition of the pre-trained LLMs [23, 36].
      • These datasets are sampled to ensure the text-only data always accounts for 50% of the training data mixture (292B text tokens in total).
    • Speech Tokenizers: open-source, 50 Hz - they point to ESPnet-SpeechLM: An Open Speech Language Model Toolkit for details of which tokenizers they used
    • Models:
      • For the OpusLMs of size 135M, 360M, and 1.7B, we adopt the SmolLM2 series [36] for initialization - these are pre-trained on up to 11T text tokens
      • For the OpusLMs of size 7B, we adopt OLMo-2-7B [23] - pre-trained on 4T text tokens
    • Training:
      • bf16
      • context length 8192
      • AdamW - peak LR of 1e-4 for 7B and 2e-4 for others (smaller models inc. the 1.7B)
        • 25k step warmup + linear LR decay to 2e-5 (i.e. decay by one order of magnitude, linearly)
      • hardware: “up to” 64 H100s 🤮
    • greedy search for ASR - nice and simple ✅
    • top-k sampling for TTS, with temperature 0.7 throughout (small sketch at the end of these notes)
    • evaluation ASR: evaluate the word error rate (WER) on the Test-Clean and Test-Other subsets of LibriSpeech [26], and the English subset of FLEURS [45]
      • For in-domain evaluation, we adopt the LibriSpeech Test-Other and Test-Clean subsets for ASR and TTS, respectively.
      • To verify that the benefit of annealing is general rather than domain adaptation to the annealing data, we additionally use the test set of GigaSpeech [51] for out-of-domain evaluation.
      • We also splice the test set of LibriHeavy [52] to up to 1 min for long-form evaluation.
    • For Text-only capability, we evaluate 5-shot MMLU [27] - using EleutherAI/lm-evaluation-harness
    • For TTS, we evaluate the ASR-WER [32], Speaker Similarity [46], and Proxy MOS [47]. TTS evaluation relies on VERSA [48].
    • Annealing starts from a learning rate of 5e-5 (the LR schedule is sketched under “Annealing Helps” below).
      • We have two data compositions for annealing:
        1. Opt-A: LibriSpeech [26] + FLEURS [45] + YODAS [41];
        2. Opt-B: LibriTTS [49] + VCTK [50] + YODAS [41].
      • We use the same YODAS data as in pre-training, but splice the utterances to up to 2 minutes for long-form training.
      • We believe the other four datasets are of high quality due to their rigorous curation.
      • We upsample these four datasets by 10 times, to balance with the massive YODAS data.
  • Headline Results
    • ASR:
      • Both 1.7B and 7B outperform other SpeechLMs - especially on Test-Other
      • the models match Whisper and OWSM models, which are designed specifically for speech-to-text tasks
    • Text-only:
      • 7B outperforms all other SpeechLMs on the MMLU metric by a clear margin + outperforms some early 7B text LLMs like LLaMA-2-7B
      • 1.7B shows close performance to Moshi (7B) on MMLU (46.2 vs. 49.8) - with roughly 4x fewer parameters
      • only 2.7 points of MMLU degradation for OpusLM-7B compared with OLMo-2-7B - text capability is largely preserved during speech-text training
    • TTS: “our massive pre-trained OpusLM-1.7B outperforms all other competitors (No. 3-5, 13-15) in terms of the WER and achieves comparable results on speaker similarity and proxy MOS score”
    • Annealing helps for all scenarios - apart from out-of-domain ASR with annealing data Opt-B
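
To make the delay-interleave / multi-stream bullets above concrete, here is a minimal NumPy sketch. This is my own illustration, not the authors' code: the function names, the 0-indexed streams, and the choice to combine the level embedding with the hidden state by simple addition are all assumptions on my part.

```python
import numpy as np

PAD = 0  # assumed id for the empty slots created by the delay pattern

def delay_interleave(frames: np.ndarray) -> np.ndarray:
    """Apply the delay pattern: stream n of each frame is shifted n steps later.

    frames: (T, N) int array, one row per time-step ("frame"),
            column 0 = semantic token, columns 1..N-1 = acoustic tokens.
    returns: (T + N - 1, N) array where out[t + n, n] = frames[t, n].
    """
    T, N = frames.shape
    out = np.full((T + N - 1, N), PAD, dtype=frames.dtype)
    for n in range(N):
        out[n:n + T, n] = frames[:, n]
    return out

def undelay(delayed: np.ndarray) -> np.ndarray:
    """Invert the delay pattern to recover the (T, N) frame matrix."""
    T = delayed.shape[0] - delayed.shape[1] + 1
    N = delayed.shape[1]
    return np.stack([delayed[n:n + T, n] for n in range(N)], axis=1)

def multistream_step(delayed_frame, token_emb, level_emb, transformer_step, lm_head):
    """One decoding step, illustrating the two multi-stream tricks in the notes.

    delayed_frame:    (N,) token ids at the current (delayed) position
    token_emb:        (vocab, d) shared embedding table
    level_emb:        (N, d) trainable vectors z_n selecting which level to predict
    transformer_step: callable mapping the (d,) summed input to the (d,) hidden state
                      (stands in for the causal transformer)
    lm_head:          (d, vocab) output projection
    """
    # 1) sum the N pre-transformer embeddings into one input vector per position
    x = token_emb[delayed_frame].sum(axis=0)          # (d,)
    h = transformer_step(x)                           # (d,)
    # 2) predict all N tokens of the next position from the same hidden state,
    #    adding level_emb[n] to select which stream is being predicted
    logits = (h[None, :] + level_emb) @ lm_head       # (N, vocab)
    return logits.argmax(axis=-1)                     # greedy pick, (N,)
```

Because generation still emits one position per step along the time axis, inference stays O(T) regardless of N; the delay pattern only adds N − 1 extra positions at the edges of the sequence.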
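
A tiny sketch of the loss-reweighting bullet, transcribing the ratio exactly as the notes record it (again my own illustration; N is the number of token streams per speech frame):

```python
def token_loss_weights(num_streams: int) -> dict:
    """Per-token loss weights following the ratio quoted in the notes:
    text : semantic : acoustic = 1 : 1/2 : 1/(N - 1),
    where each speech frame has 1 semantic and N - 1 acoustic tokens."""
    n = num_streams
    return {
        "text": 1.0,                  # one text token ~ one speech frame
        "semantic": 0.5,              # the single semantic token of a frame
        "acoustic": 1.0 / (n - 1),    # each of the N - 1 acoustic tokens
    }

# e.g. token_loss_weights(8) -> {'text': 1.0, 'semantic': 0.5, 'acoustic': 0.142857...}
```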
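
For the decoding bullets: greedy search for ASR is just an argmax over the logits (see the step function above), while TTS uses top-k sampling at temperature 0.7. A minimal sketch of the latter; the k value below is a placeholder I made up, since the notes only record the temperature:

```python
import numpy as np

def sample_top_k(logits: np.ndarray, k: int = 30, temperature: float = 0.7,
                 rng: np.random.Generator | None = None) -> int:
    """Top-k sampling with temperature over a single (vocab,) logits vector."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    top = np.argsort(scaled)[-k:]                    # indices of the k largest logits
    probs = np.exp(scaled[top] - scaled[top].max())  # softmax over the top-k only
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```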

Table 1 - Models' performance on ASR, TTS, and MMLU

Annealing Helps
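
For reference while reading the annealing results, a minimal sketch of the learning-rate schedule as these notes describe it. This is hypothetical code, not from the paper: the annealing duration is an assumed placeholder, since the notes only record the 5e-5 starting LR.

```python
def pretrain_lr(step: int, peak: float = 2e-4, warmup: int = 25_000,
                total: int = 500_000, floor: float = 2e-5) -> float:
    """Pre-training schedule from the notes: 25k-step linear warmup to `peak`,
    then linear decay to `floor` = 2e-5 by `total` steps.
    Defaults follow the 1.7B run (peak 2e-4, 500k updates); the 7B model
    used peak 1e-4 and 250k updates."""
    if step < warmup:
        return peak * step / warmup
    frac = min((step - warmup) / (total - warmup), 1.0)
    return peak + (floor - peak) * frac

def annealing_lr(step: int, start: float = 5e-5, total: int = 10_000) -> float:
    """Annealing phase: decay the LR linearly and quickly to zero,
    starting from 5e-5. The 10k-step length is an assumed placeholder."""
    return start * max(0.0, 1.0 - step / total)
```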

Questions

  • what do they mean by “splice” to 2 mins, 1 min
  • where is their code and model? it’s just dummy stuff right now (July 25)