Title: Generative Spoken Language Modeling from Raw Audio
Authors: Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux
Published: 1st February 2021 (Monday) @ 21:41:40
Link: http://arxiv.org/abs/2102.01192v2
Abstract
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
Aim
This paper is seminal, which guides its scope:
Compared to standard text generation, a critical and novel component of the audio variant is clearly the discovery of units since it conditions all the other components. This is why we devote our analyses of model architectures to the speech-to-unit component specifically, and leave it for further work to evaluate how the downstream components can also be optimized for spoken language generation.
Contributions
- Two novel evaluation metrics for the generation mode of spoken language modeling at the acoustic and language levels respectively
- 💡 use a pretrained ASR model to establish model-independent assessments of the intelligibility (acoustic level) and meaningfulness (language level) of the produced outputs
- the ASR system converts the generated waveform back to text, which lets standard text-based metrics be adapted for these two levels
- validate the metrics by comparison with human evaluation, showing a high degree of concordance between human and machine judgements of intelligibility and meaningfulness of generated audio
- show these metrics can be predicted by simpler ones geared to evaluate the encoding mode of the spoken LM
- Zero-shot metrics borrowed from previous studies in the Zero Resource Speech Challenges (Versteegh et al., 2016; Nguyen et al., 2020) correlate well with their generative counterparts, offering an easier proxy to rapidly iterate on model selection
- systematically study the effect of the type of encoding units by factorially crossing three recent speech-to-unit encoders (CPC, wav2vec 2.0, HuBERT) with three codebook sizes for the discrete units (50, 100, 200), keeping the rest of the system constant and built from out-of-the-box components (a standard Transformer for the uLM, Tacotron 2 for u2S); both the encoder type and the number of units matter, and they matter differently depending on the evaluation task
- open-source the evaluation tools and models to help reproducibility and comparability with future work
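The PER/CER-from-ASR metrics reduce to an edit distance between the ASR transcript of the generated audio and a reference string. A minimal sketch of the distance-and-normalisation step (plain Levenshtein over characters; no actual ASR involved, and the paper's tooling may normalise differently):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via single-row dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by reference length.
    PER is the same computation over phone sequences instead of characters."""
    return edit_distance(reference, hypothesis) / len(reference)
```

The same function applied to token lists rather than strings gives word or phone error rates.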
The core motive of "textless NLP" is stated in the intro of this paper, which is the first in the textless-nlp initiative from Meta:
Being able to achieve "textless NLP" would be beneficial for the majority of the world's languages which do not have large textual resources or even a widely used standardized orthography (Swiss German, dialectal Arabic, Igbo, etc.), and which, despite being used by millions of users, have little chance of being served by current text-based technology. It would also be useful for "high-resource" languages, where the oral and written forms often mismatch in terms of lexicon and syntax, and where some linguistically relevant signals carried by prosody and intonation are basically absent from text.
Difficulties of Evaluation
- Evaluation for speech generation is difficult because speech is:
- continuous
- variable
- the speech waveform is multi-level: fine-grained acoustic detail is needed for intelligible audio, while higher-level linguistic structure (morphological / "semantic") carries meaning
- Text-based models do not have this problem:
- the input is already expressed in terms of mid-level discrete units (characters or words)
- they are typically evaluated with unsupervised metrics close to the learning objectives - e.g. perplexity, log-likelihood
- this approach is not directly applicable even if we rely on discrete pseudo-text units, because such metrics depend on unit granularity, and different models use different numbers, durations and distributions of discrete speech units (DSUs)
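To make the granularity problem concrete: perplexity is the exponentiated average negative log-probability per token, and its chance level scales with the codebook size, so raw perplexities over 50-unit and 200-unit vocabularies are not comparable. A toy illustration:

```python
import math

def perplexity(log_probs):
    """exp of the average negative log-probability per token."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A uniform model over V pseudo-text units assigns log(1/V) to every token,
# so its perplexity equals V: chance level itself shifts with codebook size.
for V in (50, 100, 200):
    seq = [math.log(1.0 / V)] * 10
    print(V, perplexity(seq))
```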
Tasks Implicit in Spoken Language Modelling - Framework
At two levels:
- High level: Language ("semantics")
- Granular level: Acoustics
Two tasks:
- Encoding
- Generation
| Level | Task (Encoding) | Automatic metric (Encoding) | Task (Generation) | Automatic metric (Generation) | Human (Generation) |
|---|---|---|---|---|---|
| Language | Spoken Language Modelling | Spot-the-word, Syntax-Acc | Speech Generation | AUC-of-VERT/PPX, cont-BLEU, PPX@o-VERT | MMOS |
| Acoustic | Acoustic Unit Discovery | ABX-across, ABX-within | Speech Resynthesis | PER-from-ASR, CER-from-ASR | CER, MOS |
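The ABX metrics in the table test whether representations of the same phone category are closer to each other than to a different category. A minimal sketch over single-vector embeddings with cosine distance (the real metric works on framewise features with DTW alignment, split into within- and across-speaker conditions):

```python
import itertools
import math

def cosine_dist(u, v):
    """1 minus cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

def abx_score(cat_a, cat_b, dist=cosine_dist):
    """Over triplets (A, X from category a; B from category b), score 1 when
    d(A, X) < d(B, X) and 0.5 on ties: 0.5 = chance, 1.0 = perfect."""
    correct = total = 0.0
    for A, X in itertools.permutations(cat_a, 2):
        for B in cat_b:
            da, db = dist(A, X), dist(B, X)
            correct += 1.0 if da < db else (0.5 if da == db else 0.0)
            total += 1
    return correct / total

# Toy embeddings: two tight clusters standing in for two phone categories.
phone_a = [(1.0, 0.1), (0.9, 0.0), (1.0, 0.05)]
phone_b = [(0.0, 1.0), (0.1, 0.9)]
```

On well-separated clusters like these the score is 1.0; representations that mix speakers with phones pull it toward chance.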
Tasks
- Spoken language modelling: Encoding of the patterns in spoken language (language modelling under speech units)
- Acoustic Unit Discovery: Learning (linguistically meaningful) speech representations which are invariant to non-linguistic factors, in particular speaker identity and noise
- Speech Generation: Generating novel speech conditioned on a prompt (here, a spoken prompt encoded into units rather than text)
- Speech Resynthesis: Synthesising speech given acoustic units, i.e. units-to-speech (u2S), analogous to text-to-speech
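These tasks are served by the paper's three-stage baseline pipeline: a speech-to-unit encoder (S2u), a unit language model (uLM), and a unit-to-speech decoder (u2S). An end-to-end toy sketch with stand-in components; the real system uses CPC/wav2vec 2.0/HuBERT features quantised by k-means, a Transformer uLM, and Tacotron 2 plus a vocoder:

```python
def speech_to_units(frames, codebook):
    """Toy S2u: quantise each 1-D 'frame' to its nearest codebook entry,
    then collapse consecutive duplicates (as the paper's pipeline does)."""
    units = [min(range(len(codebook)), key=lambda k: abs(f - codebook[k]))
             for f in frames]
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

def unit_lm_continue(prefix, n_new):
    """Toy uLM: repeat-last-unit 'language model' standing in for a
    Transformer that samples a continuation of the pseudo-text."""
    return prefix + [prefix[-1]] * n_new

def units_to_speech(units, codebook):
    """Toy u2S: map each unit back to its codebook centre
    (stand-in for Tacotron 2 + vocoder)."""
    return [codebook[u] for u in units]

codebook = [0.0, 0.5, 1.0]            # a 50/100/200-unit codebook in miniature
prompt = [0.1, 0.1, 0.45, 0.9, 0.95]  # fake 1-D "waveform" frames
units = speech_to_units(prompt, codebook)
audio = units_to_speech(unit_lm_continue(units, 2), codebook)
```

The point of the sketch is the interface, not the components: each stage only sees the discrete units, which is why unit discovery conditions everything downstream.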