Title: On The Landscape of Spoken Language Models: A Comprehensive Survey
Authors: Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, Shinji Watanabe
Published: 11th April 2025 (Friday) @ 13:40:53
Link: http://arxiv.org/abs/2504.08528v1
Abstract
The field of spoken language processing is undergoing a shift from training custom-built, task-specific models toward using and optimizing spoken language models (SLMs) which act as universal speech processing systems. This trend is similar to the progression toward universal language models that has taken place in the field of (text) natural language processing. SLMs include both "pure" language models of speech (models of the distribution of tokenized speech sequences) and models that combine speech encoders with text language models, often including both spoken and written input or output. Work in this area is very diverse, with a range of terminology and evaluation settings. This paper aims to contribute an improved understanding of SLMs via a unifying literature survey of recent work in the context of the evolution of the field. Our survey categorizes the work in this area by model architecture, training, and evaluation choices, and describes some key challenges and directions for future work.
Notes
- "Additionally, byte pair encoding (BPE) (Gage, 1994) is also sometimes applied to the discrete token sequences (Wu et al., 2023a; Shen et al., 2024) to capture recurring patterns."
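- A minimal sketch (my own illustration, not code from the survey) of what BPE over discrete token sequences looks like: repeatedly merge the most frequent adjacent pair of unit IDs into a new symbol, so recurring patterns become single tokens.

```python
from collections import Counter

def bpe_merges(sequences, num_merges):
    """Greedy BPE over sequences of discrete unit IDs.

    Each merge replaces the most frequent adjacent pair with a new
    symbol ID, mirroring how BPE captures recurring unit patterns.
    """
    seqs = [list(s) for s in sequences]
    next_id = max(max(s) for s in seqs) + 1
    merges = {}
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # no recurring pattern left worth merging
            break
        merges[best] = next_id
        for i, s in enumerate(seqs):
            merged, j = [], 0
            while j < len(s):
                if j + 1 < len(s) and (s[j], s[j + 1]) == best:
                    merged.append(next_id)
                    j += 2
                else:
                    merged.append(s[j])
                    j += 1
            seqs[i] = merged
        next_id += 1
    return merges, seqs

# Toy "discrete speech token" sequences (e.g., k-means unit IDs).
units = [[5, 5, 9, 2, 5, 9, 2], [9, 2, 9, 2, 5, 5]]
merges, compressed = bpe_merges(units, num_merges=3)
print(merges)      # e.g., {(9, 2): 10, ...}
print(compressed)  # shorter sequences over an enlarged vocabulary
```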
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
- Acoustic BPE for Speech Generation with Discrete Tokens
- they do exactly what we propose to do with BPE
- except they use (presumably single-codepoint) Chinese Unicode characters to represent tokens (see the sketch after this list); and
- use HuBERT's final layer and a smaller codebook size, IIRC
- …but they don't look at ASR; they focus instead on "syntax capturing"
- basically tested by scrambling the words in a sentence, synthesizing the scrambled text with TTS, and classifying between these nonsense utterances and correct, grammatical ones, if I understood correctly
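- If I'm reading the Chinese-character trick right, it might look something like the sketch below (my own reconstruction; the codebook size, vocab size, and file names are placeholders). Mapping each unit ID to one CJK codepoint makes off-the-shelf BPE tooling treat every unit as a single character:

```python
import random
import sentencepiece as spm

# Hypothetical stand-in for HuBERT k-means unit sequences; in reality
# these would come from quantizing HuBERT features on a real corpus.
random.seed(0)
CODEBOOK_SIZE = 100  # placeholder; the paper's codebook size may differ
utterances = [
    [random.randrange(CODEBOOK_SIZE) for _ in range(random.randrange(20, 60))]
    for _ in range(200)
]

def units_to_chars(units):
    # One CJK Unified Ideograph (block starts at U+4E00) per unit ID,
    # so BPE sees each discrete unit as exactly one character.
    return "".join(chr(0x4E00 + u) for u in units)

with open("units.txt", "w", encoding="utf-8") as f:
    for utt in utterances:
        f.write(units_to_chars(utt) + "\n")

# Train a BPE model directly on the pseudo-text; all numbers illustrative.
spm.SentencePieceTrainer.train(
    input="units.txt",
    model_prefix="acoustic_bpe",
    model_type="bpe",
    vocab_size=200,
    character_coverage=1.0,  # never drop a unit character
)

sp = spm.SentencePieceProcessor(model_file="acoustic_bpe.model")
print(sp.encode(units_to_chars(utterances[0]), out_type=str))
```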
- need to understand
- their rescoring method (some kind of decoding technique, like what the Sardine lab works on a lot; sketch below); and
- the entropy informativeness point they make
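- For reference, generic n-best LM rescoring usually looks something like the sketch below (the general technique, not necessarily what they do): re-rank first-pass hypotheses by a weighted combination of the first-pass score and an external LM's log-probability.

```python
import math

def rescore(nbest, lm_logprob, lm_weight=0.5):
    """Re-rank n-best hypotheses with an external LM.

    nbest: list of (hypothesis, first_pass_score) pairs, where scores
    are log-probabilities from the first-pass decoder.
    lm_logprob: callable returning the LM log-probability of a hypothesis.
    """
    def combined(item):
        hyp, first_pass = item
        # Length normalization keeps the LM from favoring short outputs.
        return first_pass + lm_weight * lm_logprob(hyp) / max(len(hyp.split()), 1)
    return sorted(nbest, key=combined, reverse=True)

# Toy LM: prefers hypotheses containing the bigram "the cat".
def toy_lm_logprob(hyp):
    return 0.0 if "the cat" in hyp else math.log(0.1)

nbest = [("a cat sat", -1.2), ("the cat sat", -1.5)]
print(rescore(nbest, toy_lm_logprob))  # "the cat sat" wins after rescoring
```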