Title: Generative Spoken Dialogue Language Modeling
Authors: Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux
Published: 30th March 2022 (Wednesday) @ 17:39:45
Link: http://arxiv.org/abs/2203.16502v2
Abstract
We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn-taking compared to a text-based cascaded model.
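The "dual-tower with cross-attention" idea can be sketched in a few lines: each channel's stream of unit embeddings attends over the other channel's stream, so each tower conditions on what the other speaker is doing. This is only a toy single-head sketch with random weights; the array sizes, `d`, and function names are illustrative assumptions, not the paper's actual dimensions or implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_units, kv_units, d=16, seed=0):
    """Single-head cross-attention: one channel's frames (queries)
    attend over the other channel's frames (keys/values).
    Weights are random here, standing in for learned projections."""
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((q_units.shape[-1], d)) for _ in range(3))
    Q, K, V = q_units @ Wq, kv_units @ Wk, kv_units @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # (len_q, len_kv) attention weights
    return attn @ V                       # (len_q, d) context vectors

# Two channels of hypothetical discrete-unit embeddings (toy sizes).
chan_a = np.random.default_rng(1).standard_normal((5, 8))  # 5 frames, dim 8
chan_b = np.random.default_rng(2).standard_normal((7, 8))  # 7 frames, dim 8

# Each tower attends to the other channel, symmetrically.
a_attends_b = cross_attention(chan_a, chan_b)
b_attends_a = cross_attention(chan_b, chan_a)
print(a_attends_b.shape, b_attends_a.shape)  # (5, 16) (7, 16)
```

The symmetry (each query stream gets its own attention over the other channel) is what lets the model coordinate overlapping speech and turn-taking across the two channels.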
Quick Notes
In fact, in human-human conversation, pauses within speaker turns tend to be on average longer than gaps between speaker turns (Brady, 1968; Ten Bosch et al., 2005; Heldner and Edlund, 2010), indicating that silence may not be the main cue for humans to switch turns.
Unsupervised Spoken Language Modeling
(Nice autoencoder-masking distinction drawn)
Recently, great advances have been achieved in the area of representation learning from raw audio. Models trained with either autoencoder objectives (Ondel et al., 2016; van den Oord et al., 2017) or masked objectives (CPC: van den Oord et al., 2018; APC: Chung and Glass, 2020; wav2vec 2.0: Baevski et al., 2020; HuBERT: Hsu et al., 2021a; MockingJay: Liu et al., 2020) on raw speech can learn audio representations usable for a variety of downstream tasks (Yang et al., 2021); see Borgholt et al. (2022) for a review.
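The autoencoder-vs-masked distinction can be made concrete with a toy contrast: an autoencoder reconstructs the whole input through a bottleneck, while a masked objective hides some frames and scores predictions only at the hidden positions. The linear maps and the mean-over-visible-frames "context model" below are deliberately trivial stand-ins for the actual networks; all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 4))  # 10 toy audio frames, 4 features each

# Autoencoder objective: reconstruct the *entire* input from a bottleneck.
W_enc = rng.standard_normal((4, 2))    # compress to 2 dims
W_dec = rng.standard_normal((2, 4))    # decode back to 4 dims
recon = frames @ W_enc @ W_dec
ae_loss = np.mean((recon - frames) ** 2)

# Masked objective: hide some frames, predict them from the visible context,
# and compute the loss only at the masked positions.
mask = np.zeros(10, dtype=bool)
mask[[2, 5, 7]] = True                     # positions hidden from the model
context_pred = frames[~mask].mean(axis=0)  # trivial "predict from context"
masked_loss = np.mean((context_pred - frames[mask]) ** 2)

print(ae_loss >= 0.0, masked_loss >= 0.0)  # both are valid mean-squared errors
```

The key structural difference is where the loss is applied: everywhere (autoencoder) versus only at masked positions (CPC/HuBERT-style), which is what pushes masked models toward predictive, context-dependent representations.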