Title: Moshi: a speech-text foundation model for real-time dialogue
Authors: Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
Published: 18 September 2024
Link: https://kyutai.org/Moshi.pdf
Abstract
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning, such as emotion or non-speech sounds, is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only does this “Inner Monologue” method significantly improve the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at github.com/kyutai-labs/moshi.
Quick Notes
- §3.3 Audio Tokenization:
- “In the literature, and following the terminology defined by Borsos et al. (2022), these tokens are referred to as acoustic tokens, as they model fine audio details and are optimized for high-quality reconstruction. While these acoustic tokens provide appropriate targets for conditioned text-to-audio models (e.g. text-to-speech (Wang et al., 2023) or text-to-music (Copet et al., 2023)), unconditioned speech generation requires combining them with semantic tokens extracted from self-supervised speech models (Baevski et al., 2020; Hsu et al., 2021; Chung et al., 2021).”
- Question: really? It is such a clear statement, yet I don't think I have seen the acoustic/semantic token distinction framed this plainly before.
- taking inspiration from previous work on SpeechTokenizer (Zhang et al., 2024b), Mimi uses distillation to transfer non-causal, high-level semantic information into the tokens produced by a causal model, allowing for streaming encoding and decoding of semantic-acoustic tokens.
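A minimal sketch (mine, not the paper's code) of the kind of semantic distillation described in the note above: a cosine-distance term pulling the causal semantic quantizer's output toward embeddings from a non-causal self-supervised teacher (e.g. WavLM). The projection layer, dimensions and exact loss form are illustrative assumptions, not Mimi's actual implementation.

```python
import torch
import torch.nn.functional as F

def semantic_distillation_loss(semantic_q_out: torch.Tensor,
                               teacher_emb: torch.Tensor,
                               proj: torch.nn.Linear) -> torch.Tensor:
    """Cosine-distance distillation between the (causal) semantic quantizer output
    and a non-causal self-supervised teacher, both at the same frame rate.
    Shapes: (batch, frames, dim)."""
    student = proj(semantic_q_out)                             # map codec dim -> teacher dim
    cos = F.cosine_similarity(student, teacher_emb, dim=-1)    # (batch, frames)
    return (1.0 - cos).mean()                                  # 0 when perfectly aligned

# Hypothetical usage: the dimensions are illustrative, not the paper's exact values.
B, T = 2, 50
student_dim, teacher_dim = 256, 1024
proj = torch.nn.Linear(student_dim, teacher_dim)
loss = semantic_distillation_loss(torch.randn(B, T, student_dim),
                                  torch.randn(B, T, teacher_dim), proj)
```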
- §3.3.1 Mimi Architecture:
- SEANet (“SEANet: A Multi-modal Speech Enhancement Network”) encoder/decoder with an RVQ bottleneck
- based on SoundStream (“SoundStream: An End-to-End Neural Audio Codec”) and EnCodec (“High Fidelity Neural Audio Compression”)
- The encoder projects a single-channel waveform x ∈ ℝ^L to a latent representation enc(x) ∈ ℝ^(S×D)
- …by cascading residual convolutional blocks that interleave dilated (van den Oord et al., 2016; i.e. WaveNet) and strided convolutions along with ELU (Clevert et al., 2016) non-linearities and Weight Normalization (Salimans and Kingma, 2016)
- All convolutions are causal such that this autoencoder can run in a streaming fashion
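To make the encoder description concrete, here is a hedged PyTorch sketch of a causal residual convolutional stack mixing dilated and strided convolutions with ELU and weight normalization, in the spirit of SEANet; channel counts, kernel sizes and strides are illustrative, not Mimi's actual hyperparameters.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class CausalConv1d(nn.Module):
    """1D convolution padded on the left only, so output frame t never sees inputs after t."""
    def __init__(self, c_in, c_out, kernel_size, stride=1, dilation=1):
        super().__init__()
        self.left_pad = dilation * (kernel_size - 1)
        self.conv = weight_norm(nn.Conv1d(c_in, c_out, kernel_size,
                                          stride=stride, dilation=dilation))

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.left_pad, 0))
        return self.conv(x)

class ResidualBlock(nn.Module):
    """Dilated causal conv followed by a pointwise conv, with ELU and a residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.block = nn.Sequential(
            nn.ELU(), CausalConv1d(channels, channels, kernel_size=3, dilation=dilation),
            nn.ELU(), CausalConv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)

class TinyCausalEncoder(nn.Module):
    """Illustrative encoder: residual dilated blocks interleaved with strided downsampling."""
    def __init__(self, channels=32, strides=(2, 4)):
        super().__init__()
        layers = [CausalConv1d(1, channels, kernel_size=7)]
        for s in strides:
            layers += [ResidualBlock(channels, dilation=1),
                       ResidualBlock(channels, dilation=3),
                       nn.ELU(),
                       CausalConv1d(channels, channels, kernel_size=2 * s, stride=s)]
        self.net = nn.Sequential(*layers)

    def forward(self, wav):                      # wav: (batch, 1, samples)
        return self.net(wav)                     # (batch, channels, frames)

latents = TinyCausalEncoder()(torch.randn(1, 1, 24000))  # toy 1-second input at 24 kHz
```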
- Transformer-based bottleneck. Question: see Table 3 for ablations; this is important.
- Optimization: AdamW
- Quantization:
- Q = 8 quantizers
- each with a codebook size of N_A = 2048
- At 12.5Hz, this represents a bitrate of 1.1kbps
- While the latent dimension is 512, we project embeddings to 256 dimensions before applying the RVQ, and project back to 512 before the decoder
- Question: why?
- They follow the observation of Kumar et al. (2023, “High-Fidelity Audio Compression with Improved RVQGAN”) that not applying quantization with a certain probability during training improves audio quality.
- More precisely, we only apply quantization 50% of the time, on a per-sequence basis, during training.
- Unlike Kumar et al. (2023), this means passing unquantized embeddings to the decoder, rather than passing embeddings quantized with all quantizers.
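A small sketch of the per-sequence quantizer dropout described just above, where unquantized latents are passed straight to the decoder for the skipped sequences; the function name and arguments are my own, not the paper's code.

```python
import torch

def maybe_quantize(latents: torch.Tensor, quantizer, training: bool,
                   p_quantize: float = 0.5) -> torch.Tensor:
    """Per-sequence quantizer dropout: during training, quantize each sequence with
    probability p_quantize, otherwise feed the *unquantized* latents to the decoder
    (unlike Kumar et al. (2023), who would still quantize with all levels).
    latents: (batch, frames, dim)."""
    if not training:
        return quantizer(latents)                # always quantize at inference time
    quantized = quantizer(latents)
    keep = torch.rand(latents.shape[0], 1, 1, device=latents.device) < p_quantize
    return torch.where(keep, quantized, latents)

# Toy usage with a stand-in "quantizer" (rounding); shapes are illustrative.
toy = maybe_quantize(torch.randn(4, 10, 256), lambda z: torch.round(z), training=True)
```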
- Adversarial-only training: adversarial and feature-matching losses only, no reconstruction loss; this performs better than previous attempts at purely adversarial training would suggest.
- Learning semantic-acoustic tokens with a split RVQ
- distilling good semantic representations (measured by ABX phonetic discriminability) comes at the detriment of audio quality, so they use a plain vector quantizer for the semantic tokens and a residual vector quantizer for the acoustic tokens, applied in parallel
- “We address this issue by proposing a split RVQ. Rather than a single RVQ with 8 levels, we distill semantic information into a plain VQ and apply an RVQ with 7 levels in parallel. We sum their outputs, such that while both can be used for reconstruction, we remove the constraint that acoustic information should be conserved in the residual of the semantic quantizer”
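A toy sketch of the split quantizer described above: a plain VQ for the semantic level runs in parallel with a 7-level acoustic RVQ, and their outputs are summed. The VQ below is a bare-bones nearest-neighbour quantizer (no EMA updates, commitment loss or input/output projections), so it only illustrates the structure, not Mimi's actual implementation.

```python
import torch
import torch.nn as nn

class ToyVQ(nn.Module):
    """Bare-bones nearest-neighbour vector quantizer."""
    def __init__(self, codebook_size=2048, dim=256):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))

    def forward(self, x):                              # x: (batch, frames, dim)
        flat = x.reshape(-1, x.shape[-1])
        idx = torch.cdist(flat, self.codebook).argmin(dim=-1)  # nearest code per frame
        idx = idx.view(x.shape[:-1])                            # (batch, frames)
        return self.codebook[idx], idx

class ToyRVQ(nn.Module):
    """Residual VQ: each level quantizes the residual left by the previous ones."""
    def __init__(self, n_levels=7, codebook_size=2048, dim=256):
        super().__init__()
        self.levels = nn.ModuleList(ToyVQ(codebook_size, dim) for _ in range(n_levels))

    def forward(self, x):
        residual, out, ids = x, torch.zeros_like(x), []
        for vq in self.levels:
            q, idx = vq(residual)
            out, residual = out + q, residual - q
            ids.append(idx)
        return out, torch.stack(ids, dim=-1)

class SplitRVQ(nn.Module):
    """Split quantizer: a plain semantic VQ in parallel with a 7-level acoustic RVQ,
    outputs summed, so acoustic detail need not live in the semantic quantizer's residual."""
    def __init__(self, dim=256):
        super().__init__()
        self.semantic_vq = ToyVQ(dim=dim)
        self.acoustic_rvq = ToyRVQ(n_levels=7, dim=dim)

    def forward(self, x):
        sem_q, sem_idx = self.semantic_vq(x)
        aco_q, aco_idx = self.acoustic_rvq(x)
        return sem_q + aco_q, sem_idx, aco_idx         # 1 semantic + 7 acoustic tokens per frame

quantized, sem_ids, aco_ids = SplitRVQ()(torch.randn(1, 13, 256))  # ~1 s at 12.5 Hz
```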
- §3.4.1 Hierarchical autoregressive modeling with RQ-Transformer: the Depth & Temporal Transformers
- Temporal Transformer: operates along the temporal dimension, conditioned on all past time steps (i.e. the tokens of every codebook at each previous step)
- Depth Transformer: operates along the depth dimension (across codebooks) at the current time step, conditioned on the Temporal Transformer's output and on the lower-level codebook tokens already generated for that step
- Builds on previous work: RQ-Transformers had already been applied to audio by Yang et al. (2023, UniAudio) and Zhu et al. (2024); see the §3.4.2 notes below
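A hedged sketch of this two-level factorization at generation time: a (large) Temporal Transformer summarizes all past frames from the summed embeddings of their codebook tokens, and a (small) Depth Transformer then emits the K codebook tokens of the next frame one by one. Standard nn.Transformer modules stand in for the actual architectures, and all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

def causal_mask(n, device=None):
    return torch.triu(torch.full((n, n), float("-inf"), device=device), diagonal=1)

class ToyRQTransformer(nn.Module):
    """Two-level factorization: a Temporal Transformer over time steps,
    a Depth Transformer over the K codebooks inside each step."""
    def __init__(self, n_codebooks=8, vocab=2048, dim=64):
        super().__init__()
        self.K = n_codebooks
        self.embed = nn.ModuleList(nn.Embedding(vocab, dim) for _ in range(n_codebooks))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)  # would be large in practice
        self.depth = nn.TransformerEncoder(layer, num_layers=1)     # small, runs K times per step
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_codebooks))

    @torch.no_grad()
    def generate_step(self, past_tokens):
        """past_tokens: (batch, steps, K) token ids of all previous frames.
        Returns the K token ids of the next frame, generated codebook by codebook."""
        B, S, K = past_tokens.shape
        # Temporal context: sum the K codebook embeddings of each past frame.
        frames = sum(self.embed[k](past_tokens[..., k]) for k in range(K))  # (B, S, dim)
        ctx = self.temporal(frames, mask=causal_mask(S))[:, -1]             # (B, dim)
        # Depth pass: generate codebook k conditioned on ctx and codebooks < k.
        new_tokens, depth_in = [], ctx.unsqueeze(1)                         # (B, 1, dim)
        for k in range(K):
            h = self.depth(depth_in, mask=causal_mask(depth_in.shape[1]))[:, -1]
            tok = self.heads[k](h).argmax(dim=-1)                           # greedy for brevity
            new_tokens.append(tok)
            depth_in = torch.cat([depth_in, self.embed[k](tok).unsqueeze(1)], dim=1)
        return torch.stack(new_tokens, dim=-1)                              # (B, K)

# Toy usage: 8 codebooks, 5 past frames.
next_frame = ToyRQTransformer().generate_step(torch.randint(0, 2048, (1, 5, 8)))
```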
- §3.4.2 Acoustic Modeling:
- we find that introducing a slight delay between the semantic and acoustic tokens leads to more stable generations. Copet et al. (2023) show that this reduces the dependencies between the sub-sequences at a given time step, conditioned on the past, thus allowing a weaker model to approximate the joint distribution P[V_{s,k} | V_0, …, V_{s−1}] (in their case, as the product of the conditional marginals). Lemercier et al. (2024) further show a connection between the mutual information between the sub-sequences at a given step and the quality of the generation: naturally, the more complex the interdependence, the more powerful a model is needed to estimate it.
- introducing a delay of 1 or 2 steps between the semantic and acoustic features greatly improves the quality of the generation; this lets the larger Temporal Transformer model the inter-dependence between semantic and acoustic features
- Note that RQ-Transformers were already used to model audio by Yang et al. (2023, “UniAudio: An Audio Foundation Model Toward Universal Audio Generation”) and Zhu et al. (2024, “Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer”).
- We introduce the use of per-codebook parameters in the Depth Transformer, as well as the acoustic delay. Compared with Zhu et al. (2024), which first generates all the semantic tokens, we generate them jointly with the acoustic tokens, which for the first time allows streaming modeling of semantic and acoustic tokens.
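A small helper illustrating the acoustic delay described above: the acoustic codebooks are shifted one or two frames to the right of the semantic stream, so that at step s the model predicts the semantic token for s alongside the acoustic tokens for s - delay. Function name, padding token and shapes are my own assumptions.

```python
import torch

def apply_acoustic_delay(tokens: torch.Tensor, delay: int = 2, pad_id: int = 0) -> torch.Tensor:
    """tokens: (batch, frames, K) with codebook 0 = semantic, 1..K-1 = acoustic.
    Shifts the acoustic codebooks `delay` frames to the right, so that step s carries
    the semantic token for frame s and the acoustic tokens for frame s - delay."""
    out = tokens.clone()
    out[:, :, 1:] = pad_id                                        # acoustic part starts padded
    out[:, delay:, 1:] = tokens[:, :tokens.shape[1] - delay, 1:]
    return out

# Toy usage: 6 frames, 1 semantic + 3 acoustic codebooks, delay of 2 frames.
grid = torch.arange(1, 6 * 4 + 1).view(1, 6, 4)
delayed = apply_acoustic_delay(grid, delay=2)
```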
- §4.2 Audio Data:
- 7 million hours of audio (the unsupervised set), mostly English
- To achieve multi-stream, we need the model to gain the ability to both listen and speak at the same time. For this, we further leverage the Fisher dataset (Cieri et al., 2004). It consists of 2000 hours of phone conversations between randomly paired participants, with a given topic to discuss. A property of Fisher is that each conversation side is recorded on a separate channel, which allows providing ground-truth separated streams to Moshi. The original audio is sampled at 8kHz, and we use AudioSR (Liu et al., 2023a) to upsample it to 24kHz.
- To obtain reliable timestamps, despite long silences in each stream, we use transcriptions obtained with the whisper-timestamped package (Louradour, 2023), along with the medium Whisper model.
- Louradour 2023 is: linto-ai/whisper-timestamped - Multilingual Automatic Speech Recognition with word-level timestamps and confidence
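For reference, a hedged sketch of extracting word-level timestamps with whisper-timestamped and the medium Whisper model, as used to align the Fisher streams. The call names follow the package's Whisper-like interface (load_model / load_audio / transcribe) and should be checked against the installed version; the audio path is hypothetical.

```python
# Hedged sketch: word-level timestamps with whisper-timestamped (Louradour, 2023).
# Verify the API against the linto-ai/whisper-timestamped README for your version.
import whisper_timestamped as whisper

model = whisper.load_model("medium")                 # the notes mention the medium Whisper model
audio = whisper.load_audio("fisher_channel_a.wav")   # hypothetical path: one conversation side
result = whisper.transcribe(model, audio, language="en")

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(word["text"], word["start"], word["end"])  # word, start time (s), end time (s)
```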