🪴 Anil's Garden


Recent Advances in Speech Language Models: A Survey

18 Jul 2025 · 3 min read

paper
annotated
evaluation
pGSLM
vall-e
google
gslm

Title: Recent Advances in Speech Language Models: A Survey
Authors: Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King
Published: 1st October 2024 (Tuesday) @ 21:48:12
Link: http://arxiv.org/abs/2410.03751v1

Abstract

Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of “Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)”, where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs) — end-to-end models that generate speech without converting from text — have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. Additionally, we systematically survey the various capabilities of SpeechLMs, categorize the evaluation metrics for SpeechLMs, and discuss the challenges and future research directions in this rapidly evolving field.

Transcribing Notebook Notes

Novel classification of SpeechLM evaluation
What is the extent of cross-language transfer learning?
- They cite Generative Spoken Language Modeling from Raw Audio (the seminal SpeechLM paper), which I guess mentioned this as a motivation for working with speech - added to Questions
Explicit paralinguistic modelling:
- SpiRit-LM Interleaved Spoken and Written Language Model - they have Spirit LM Expressive w/ style and pitch tokens
- Text-Free Prosody-Aware Generative Spoken Language Modeling (pGSLM)
Pointers ➡️
- 👉 Generative Spoken Dialogue Language Modeling
- 👉 SpiRit-LM Interleaved Spoken and Written Language Model - joint speech text modelling
- 👉 SpeechGPT Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities - joint speech text modelling
- 👉 SpeechGPT-Gen Scaling Chain-of-Information Speech Generation
- 👉 Google USM Scaling Automatic Speech Recognition Beyond 100 Languages
- 👉 SpeechTokenizer Unified Speech Tokenizer for Speech Large Language Models
Explicit autoregressive neural codec-based models:
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)
- VioLA Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
What is the relationship between AudioLM and AudioPaLM?
- Both are from Google, with many authors in common
- AudioPaLM is an iteration on / improvement over AudioLM
- From the AudioPaLM abstract: “AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks.”
💡 Mixed tokens:
- SpeechGPT-Gen Scaling Chain-of-Information Speech Generation
- Moshi a speech-text foundation model for real-time dialogue
Continuous speech representations (cf. discrete speech units, DSUs)
- Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM
- Mini-Omni Language Models Can Hear, Talk While Thinking in Streaming
Realtime spoken dialogue
- Generative Spoken Dialogue Language Modeling
- Language Model Can Listen While Speaking
- Moshi a speech-text foundation model for real-time dialogue - added this; wasn’t included in my written notes (notebook)
💡 Silent mode:
- VITA Towards Open-Source Interactive Omni Multimodal LLM

To read:

Evaluation
Paralinguistic tasks evaluation 👈 this is what’s important and interesting
Read Lakhotia et al. 2021, Generative Spoken Language Modeling from Raw Audio (GSLM)
