Title: wav2vec: Unsupervised Pre-training for Speech Recognition
Authors: Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli
Published: 11th April 2019 (Thursday) @ 17:29:30
Link: http://arxiv.org/abs/1904.05862v4

Abstract

We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available. Our approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2, the best reported character-based system in the literature while using two orders of magnitude less labeled training data.


Have a look at this talk on the InfoNCE loss by Oriol Vinyals: Oriol Vinyals · The InfoNCE loss in self-supervised learning (from NeurIPS 2020 - Self-Supervised Learning — Theory and Practice).


wav2vec - Notes

Overall Idea

  • convolutional net - raw audio input → general representation output
    • application: ASR
  • objective: contrastive loss - distinguish future audio sample from negatives
    • Collobert 2011; Mikolov 2013; van den Oord 2018
  • extend van den Oord 2018 - who just did frame-wise phoneme classification - and apply downstream to ASR (unsupervised pre-training)

Benchmarking // Evaluation // Results

  • WSJ benchmark: representations pre-trained on ~1k hours of unlabelled speech substantially improve character-based ASR
    • beats Deep Speech 2: knocks WER down from 3.1% to 2.43%
  • TIMIT: pre-training matches the state of the art
  • “simulate” low-resource setting ⇒ use only 8 hours of labelled (transcribed) audio data: wav2vec reduces WER by up to 36% compared to a baseline model

Pre-training approach

  • reduce temporal frequency of the raw audio into lower-frequency representations
  • similar to van den Oord 2018: instead of modelling the data distribution directly, model a density ratio
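The density-ratio idea from van den Oord 2018 (CPC), as I understand it: train a scoring function whose value is proportional to how much more likely a future sample is under the true conditional than under the marginal, which avoids modelling p(x) itself:

```latex
f_k(\mathbf{x}_{t+k}, \mathbf{c}_t) \propto \frac{p(\mathbf{x}_{t+k} \mid \mathbf{c}_t)}{p(\mathbf{x}_{t+k})}
```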

Model

Two networks - logically distinct but functionally quite similar

Encoder network - network #1

  • five-layer conv network
  • could have used trainable front-end of Zeghidour 2018a
  • output: low-frequency reprs
  • each output encodes ~30ms of 16kHz audio
  • striding results in representations every 10ms (overlap)
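Quick sanity check of the stride/receptive-field arithmetic. The (kernel, stride) pairs below are the ones I recall from the paper; if they're right, the five layers give a 10ms hop and a ~30ms receptive field at 16kHz:

```python
# (kernel, stride) per encoder layer, as reported in the paper (to double-check)
layers = [(10, 5), (8, 4), (4, 2), (4, 2), (4, 2)]

stride, rf = 1, 1
for k, s in layers:
    rf += (k - 1) * stride   # receptive field grows by (k-1) x cumulative stride
    stride *= s              # cumulative stride (hop) multiplies up

sr = 16_000  # 16 kHz audio
print(f"hop: {stride / sr * 1000:.1f} ms")           # -> 10.0 ms
print(f"receptive field: {rf / sr * 1000:.1f} ms")   # -> 29.1 ms, i.e. ~30 ms
```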

Context network:

  • convolutional net
  • mixes multiple latent representations into a contextualised representation
  • receptive field of ~210ms
  • 9 layers, kernel 3, stride 1
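The ~210ms figure checks out from the layer specs: 9 layers of kernel 3, stride 1 each add 2 encoder frames to the receptive field, and frames are 10ms apart with ~30ms of audio each:

```python
# Context network receptive field, counted in encoder frames
frames = 1 + 9 * (3 - 1)       # 9 layers, kernel 3, stride 1 -> 19 frames
# frames are spaced 10 ms apart and each covers ~30 ms of raw audio
ms = (frames - 1) * 10 + 30    # -> 210 ms
print(frames, ms)
```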

Both encoder and context networks use causal convolutions with 512 channels (like van den Oord 2018 IIRC)

Use group norm because “important to find a normalization scheme that is invariant to the scaling and offset of the input” - suppose this means the mean and variance of the audio samples.
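A minimal sketch (pure Python, single group over the whole sequence, not the actual implementation) of why normalizing over both the feature and temporal dimension gives this invariance - a global scale and offset of the input cancels out:

```python
# Single-group normalization: subtract the mean and divide by the std
# computed over the whole sequence (here a flat list stands in for the tensor).
def normalize(xs, eps=1e-12):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / (var + eps) ** 0.5 for x in xs]

x = [0.1, -0.4, 0.25, 0.8, -0.05]
y = [2.0 * v + 3.0 for v in x]   # rescale and offset the "audio"

a, b = normalize(x), normalize(y)
# scaling/offset is removed: both inputs normalize to the same output
assert all(abs(u - v) < 1e-6 for u, v in zip(a, b))
```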

Have a larger “wav2vec large” model → add two additional linear transforms in the encoder network and much larger context network with 12 (vs 9) conv layers - receptive field of “wav2vec large” context network increased to ~810ms (vs ~210ms) - skip connections used to allow convergence for this larger model.

Objective: Maximise the probability of predicting a sample k steps into the future, given a step-specific affine transformation for each step size k, whilst minimising the probability of predicting distractor samples

  • distractor (negative) samples are drawn from the same audio sequence - empirically this works better, matching the observation of van den Oord 2018
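The per-step loss, as I recall it from the paper (z = encoder outputs, c = context outputs, h_k the step-specific affine projection, p_n the distractor distribution, λ weighting the negatives; the total loss sums over steps k):

```latex
\mathcal{L}_k = -\sum_{i} \Big( \log \sigma\big(\mathbf{z}_{i+k}^{\top} h_k(\mathbf{c}_i)\big)
  + \lambda \, \mathbb{E}_{\tilde{\mathbf{z}} \sim p_n} \log \sigma\big(-\tilde{\mathbf{z}}^{\top} h_k(\mathbf{c}_i)\big) \Big),
\qquad h_k(\mathbf{c}_i) = W_k \mathbf{c}_i + b_k
```

i.e. a noise-contrastive binary classification: sigmoid score high for the true future sample, low for distractors.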

Data

  • Phoneme recognition on TIMIT:
    • use standard split
    • train split contains ~3 hours audio
  • WSJ (~81 hours transcribed data in total)
    • train on si284; validation on nov93dev; test on nov92
  • Librispeech - total of 960 hours clean and noisy speech audio

Pre-training: use one of:

  • full 81 hours of WSJ