Title: Generative Spoken Dialogue Language Modeling
Authors: Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux
Published: 30th March 2022 (Wednesday) @ 17:39:45
Link: http://arxiv.org/abs/2203.16502v2
Abstract
We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn-taking compared to a text-based cascaded model.
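The "dual-tower with cross-attention" idea can be sketched in a few lines: each channel's stream of unit embeddings attends over the other channel's stream, so each tower conditions on what the other speaker is doing. This is only a toy single-head sketch with random weights; the array sizes, `d`, and function names are illustrative assumptions, not the paper's actual dimensions or implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_units, kv_units, d=16, seed=0):
    """Single-head cross-attention: one channel's frames (queries)
    attend over the other channel's frames (keys/values).
    Weights are random here, standing in for learned projections."""
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((q_units.shape[-1], d)) for _ in range(3))
    Q, K, V = q_units @ Wq, kv_units @ Wk, kv_units @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # (len_q, len_kv) attention weights
    return attn @ V                       # (len_q, d) context vectors

# Two channels of hypothetical discrete-unit embeddings (toy sizes).
chan_a = np.random.default_rng(1).standard_normal((5, 8))  # 5 frames, dim 8
chan_b = np.random.default_rng(2).standard_normal((7, 8))  # 7 frames, dim 8

# Each tower attends to the other channel, symmetrically.
a_attends_b = cross_attention(chan_a, chan_b)
b_attends_a = cross_attention(chan_b, chan_a)
print(a_attends_b.shape, b_attends_a.shape)  # (5, 16) (7, 16)
```

The symmetry (each query stream gets its own attention over the other channel) is what lets the model coordinate overlapping speech and turn-taking across the two channels.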
Quick Notes
In fact, in human-human conversation, pauses within speaker turns tend to be on average longer than gaps between speaker turns (Brady, 1968; Ten Bosch et al., 2005; Heldner and Edlund, 2010), indicating that silence may not be the main cue for humans to switch turns.
Unsupervised Spoken Language Modeling
(Nice autoencoder-masking distinction drawn)
Recently, great advances have been achieved in the area of representation learning from raw audio. Models trained with either autoencoder objectives (Ondel et al., 2016; van den Oord et al., 2017) or masked objectives (CPC: van den Oord et al., 2018; APC: Chung and Glass, 2020; wav2vec 2.0: Baevski et al., 2020; HuBERT: Hsu et al., 2021a; MockingJay: Liu et al., 2020) on raw speech can learn audio representations usable for a variety of downstream tasks (Yang et al., 2021); see Borgholt et al. (2022) for a review.
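The autoencoder-vs-masked distinction can be made concrete with a toy contrast: an autoencoder reconstructs the whole input through a bottleneck, while a masked objective hides some frames and scores predictions only at the hidden positions. The linear maps and the mean-over-visible-frames "context model" below are deliberately trivial stand-ins for the actual networks; all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 4))  # 10 toy audio frames, 4 features each

# Autoencoder objective: reconstruct the *entire* input from a bottleneck.
W_enc = rng.standard_normal((4, 2))    # compress to 2 dims
W_dec = rng.standard_normal((2, 4))    # decode back to 4 dims
recon = frames @ W_enc @ W_dec
ae_loss = np.mean((recon - frames) ** 2)

# Masked objective: hide some frames, predict them from the visible context,
# and compute the loss only at the masked positions.
mask = np.zeros(10, dtype=bool)
mask[[2, 5, 7]] = True                     # positions hidden from the model
context_pred = frames[~mask].mean(axis=0)  # trivial "predict from context"
masked_loss = np.mean((context_pred - frames[mask]) ** 2)

print(ae_loss >= 0.0, masked_loss >= 0.0)  # both are valid mean-squared errors
```

The key structural difference is where the loss is applied: everywhere (autoencoder) versus only at masked positions (CPC/HuBERT-style), which is what pushes masked models toward predictive, context-dependent representations.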