Title: Generative Spoken Language Modeling from Raw Audio
Authors: Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux
Published: 1st February 2021 (Monday) @ 21:41:40
Link: http://arxiv.org/abs/2102.01192v2
Abstract
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
Aim
This paper is seminal, which guides its scope:
Compared to standard text generation, a critical and novel component of the audio variant is clearly the discovery of units since it conditions all the other components. This is why we devote our analyses of model architectures to the speech-to-unit component specifically, and leave it for further work to evaluate how the downstream components can also be optimized for spoken language generation.
Contributions
- Two novel evaluation metrics for the generation mode of spoken language modeling at the acoustic and language levels respectively
- 💡 use a pretrained ASR model to establish model-independent assessments of the intelligibility (acoustic level) and meaningfulness (language level) of the produced outputs
- the ASR system converts the generated waveform back to text, which lets standard text-based metrics be adapted for these two levels
- validate the metrics by comparison with human evaluation, showing a high degree of concordance between human and machine judgements of intelligibility and meaningfulness of generated audio
- show these metrics can be predicted by simpler ones geared to evaluate the encoding mode of the spoken LM
- Zero-shot metrics borrowed from previous studies in the Zero Resource Speech Challenges (Versteegh et al., 2016; Nguyen et al., 2020) correlate well with their generative counterparts, offering an easier proxy to rapidly iterate on model selection
- systematically study the effect of the type of encoding units by factorially crossing three recent speech-to-unit encoders (CPC, wav2vec 2.0, HuBERT) with three codebook sizes for the discrete units (50, 100, 200), keeping the rest of the system constant and built from out-of-the-box components (a standard Transformer for the uLM, Tacotron 2 for u2S); both the encoder type and the number of units matter, and they matter differently depending on the evaluation task
- open-source the evaluation tools and models to help reproducibility and comparability with future work
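The PER/CER-from-ASR metrics reduce to an edit distance between the ASR transcript of the generated audio and a reference string. A minimal sketch of the distance-and-normalisation step (plain Levenshtein over characters; no actual ASR involved, and the paper's tooling may normalise differently):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via single-row dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by reference length.
    PER is the same computation over phone sequences instead of characters."""
    return edit_distance(reference, hypothesis) / len(reference)
```

The same function applied to token lists rather than strings gives word or phone error rates.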
The core motive of "textless NLP" is stated in the intro of this paper, which is the first in the textless-nlp initiative from Meta:
Being able to achieve "textless NLP" would be beneficial for the majority of the world's languages which do not have large textual resources or even a widely used standardized orthography (Swiss German, dialectal Arabic, Igbo, etc.), and which, despite being used by millions of users, have little chance of being served by current text-based technology. It would also be useful for "high-resource" languages, where the oral and written forms often mismatch in terms of lexicon and syntax, and where some linguistically relevant signals carried by prosody and intonation are basically absent from text.
Difficulties of Evaluation
- Evaluation for speech generation is difficult because speech is:
- continuous
- variable
- the speech waveform is multi-level: fine-grained acoustic detail is needed for intelligible audio, while higher-level linguistic structure (morphological / "semantic") carries meaning
- Text-based models do not have this problem:
- the input is already expressed in terms of mid-level discrete units (characters or words)
- they are typically evaluated with unsupervised metrics close to the learning objectives - e.g. perplexity, log-likelihood
- this approach is not directly applicable even if we rely on discrete pseudo-text units, because such metrics depend on unit granularity, and different models use different numbers, durations and distributions of discrete speech units (DSUs)
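To make the granularity problem concrete: perplexity is the exponentiated average negative log-probability per token, and its chance level scales with the codebook size, so raw perplexities over 50-unit and 200-unit vocabularies are not comparable. A toy illustration:

```python
import math

def perplexity(log_probs):
    """exp of the average negative log-probability per token."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A uniform model over V pseudo-text units assigns log(1/V) to every token,
# so its perplexity equals V: chance level itself shifts with codebook size.
for V in (50, 100, 200):
    seq = [math.log(1.0 / V)] * 10
    print(V, perplexity(seq))
```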
Tasks Implicit in Spoken Language Modelling - Framework
At two levels:
- High level: Language ("semantics")
- Granular level: Acoustics
Two tasks:
- Encoding
- Generation
| Level | Task (Encoding) | Automatic metric (Encoding) | Task (Generation) | Automatic metric (Generation) | Human (Generation) |
|---|---|---|---|---|---|
| Language | Spoken Language Modelling | Spot-the-word, Syntax-Acc | Speech Generation | AUC-of-VERT/PPX, cont-BLEU, PPX@o-VERT | MMOS |
| Acoustic | Acoustic Unit Discovery | ABX-across, ABX-within | Speech Resynthesis | PER-from-ASR, CER-from-ASR | CER, MOS |
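The ABX metrics in the table test whether representations of the same phone category are closer to each other than to a different category. A minimal sketch over single-vector embeddings with cosine distance (the real metric works on framewise features with DTW alignment, split into within- and across-speaker conditions):

```python
import itertools
import math

def cosine_dist(u, v):
    """1 minus cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

def abx_score(cat_a, cat_b, dist=cosine_dist):
    """Over triplets (A, X from category a; B from category b), score 1 when
    d(A, X) < d(B, X) and 0.5 on ties: 0.5 = chance, 1.0 = perfect."""
    correct = total = 0.0
    for A, X in itertools.permutations(cat_a, 2):
        for B in cat_b:
            da, db = dist(A, X), dist(B, X)
            correct += 1.0 if da < db else (0.5 if da == db else 0.0)
            total += 1
    return correct / total

# Toy embeddings: two tight clusters standing in for two phone categories.
phone_a = [(1.0, 0.1), (0.9, 0.0), (1.0, 0.05)]
phone_b = [(0.0, 1.0), (0.1, 0.9)]
```

On well-separated clusters like these the score is 1.0; representations that mix speakers with phones pull it toward chance.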
Tasks
- Spoken language modelling: Encoding of the patterns in spoken language (language modelling under speech units)
- Acoustic Unit Discovery: Learning (linguistically meaningful) speech representations which are invariant to non-linguistic factors, in particular speaker identity and noise
- Speech Generation: Generating novel speech conditioned on a prompt (here, a spoken prompt encoded into units rather than text)
- Speech Resynthesis: Synthesising speech given acoustic units, i.e. units-to-speech (u2S), analogous to text-to-speech
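These tasks are served by the paper's three-stage baseline pipeline: a speech-to-unit encoder (S2u), a unit language model (uLM), and a unit-to-speech decoder (u2S). An end-to-end toy sketch with stand-in components; the real system uses CPC/wav2vec 2.0/HuBERT features quantised by k-means, a Transformer uLM, and Tacotron 2 plus a vocoder:

```python
def speech_to_units(frames, codebook):
    """Toy S2u: quantise each 1-D 'frame' to its nearest codebook entry,
    then collapse consecutive duplicates (as the paper's pipeline does)."""
    units = [min(range(len(codebook)), key=lambda k: abs(f - codebook[k]))
             for f in frames]
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

def unit_lm_continue(prefix, n_new):
    """Toy uLM: repeat-last-unit 'language model' standing in for a
    Transformer that samples a continuation of the pseudo-text."""
    return prefix + [prefix[-1]] * n_new

def units_to_speech(units, codebook):
    """Toy u2S: map each unit back to its codebook centre
    (stand-in for Tacotron 2 + vocoder)."""
    return [codebook[u] for u in units]

codebook = [0.0, 0.5, 1.0]            # a 50/100/200-unit codebook in miniature
prompt = [0.1, 0.1, 0.45, 0.9, 0.95]  # fake 1-D "waveform" frames
units = speech_to_units(prompt, codebook)
audio = units_to_speech(unit_lm_continue(units, 2), codebook)
```

The point of the sketch is the interface, not the components: each stage only sees the discrete units, which is why unit discovery conditions everything downstream.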