Title: Recent Advances in Speech Language Models: A Survey
Authors: Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King
Published: 1st October 2024 (Tuesday) @ 21:48:12
Link: http://arxiv.org/abs/2410.03751v1
Abstract
Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of "Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs) -- end-to-end models that generate speech without converting from text -- have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. Additionally, we systematically survey the various capabilities of SpeechLMs, categorize the evaluation metrics for SpeechLMs, and discuss the challenges and future research directions in this rapidly evolving field.
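To make the "information loss during modality conversion" point concrete for myself, here is a minimal toy sketch of the cascaded ASR + LLM + TTS pipeline. Everything in it is an assumption for illustration: `Speech`, `toy_asr`, `toy_llm`, `toy_tts` and `cascaded_pipeline` are hypothetical stubs, not real models or any library's API. The only thing it tries to show is that once speech is flattened to text, paralinguistic cues (speaker, emotion) have no path to the output.

```python
from dataclasses import dataclass


@dataclass
class Speech:
    """Toy stand-in for an audio signal: the words plus paralinguistic cues."""
    words: str
    speaker: str
    emotion: str


def toy_asr(speech: Speech) -> str:
    # Hypothetical ASR stub: keeps only the words, so speaker and emotion are dropped here.
    return speech.words


def toy_llm(prompt: str) -> str:
    # Hypothetical LLM stub: text in, text out.
    return f"You said: {prompt}"


def toy_tts(text: str) -> Speech:
    # Hypothetical TTS stub: the text carries no cues, so it falls back to a default voice.
    return Speech(words=text, speaker="default", emotion="neutral")


def cascaded_pipeline(speech: Speech) -> Speech:
    # Three stages chained through a text-only bottleneck.
    return toy_tts(toy_llm(toy_asr(speech)))


if __name__ == "__main__":
    user_turn = Speech(words="hello", speaker="alice", emotion="excited")
    reply = cascaded_pipeline(user_turn)
    print(user_turn)
    print(reply)
```

An end-to-end SpeechLM, as the abstract describes it, avoids the text-only bottleneck between stages; this sketch says nothing about how any particular SpeechLM does that.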
Transcribing Notebook Notes
- Novel classification of SpeechLM evaluation
- What is the extent of cross-language transfer learning?
- They cite Generative Spoken Language Modeling from Raw Audio (the seminal SpeechLM paper), which I guess mentioned this motivation for working with speech - added to Questions
- Explicit paralinguistic modelling:
  - SpiRit-LM Interleaved Spoken and Written Language Model - they have Spirit LM Expressive w/ style and pitch tokens
  - Text-Free Prosody-Aware Generative Spoken Language Modeling (pGSLM)
- ptrs ➡️
  - Generative Spoken Dialogue Language Modeling
  - SpiRit-LM Interleaved Spoken and Written Language Model - joint speech text modelling
  - SpeechGPT Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities - joint speech text modelling
  - SpeechGPT-Gen Scaling Chain-of-Information Speech Generation
  - Google USM Scaling Automatic Speech Recognition Beyond 100 Languages
  - SpeechTokenizer Unified Speech Tokenizer for Speech Large Language Models
- Explicit autoregressive neural codec-based models:
  - What is the relationship between AudioLM and AudioPaLM?
    - Both from Google with lots of common authors
    - AudioPaLM is an iteration / improvement over AudioLM
    - From the AudioPaLM abstract: "AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks."
- 💡 Mixed tokens:
- Continuous speech representations (cf. DSUs)
- Realtime spoken dialogue
  - Generative Spoken Dialogue Language Modeling
  - Language Model Can Listen While Speaking
  - Moshi a speech-text foundation model for real-time dialogue - added this; wasn't included in my written notes (notebook)
- 💡 Silent mode:
To read:
- Evaluation
- Paralinguistic tasks evaluation: this is what's important and interesting
- Read Lakhotia et al. 2021, Generative Spoken Language Modeling from Raw Audio (GSLM)