Title: WaveNet: A Generative Model for Raw Audio
Authors: Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu
Published: 12th September 2016 (Monday) @ 17:29:40
Link: http://arxiv.org/abs/1609.03499v2

Abstract

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.


WaveNet - Notes

WaveNet was designed to generate raw audio waveforms. It is trained directly on audio waveform input using blocks of dilated causal convolutions (without pooling) to output a predictive distribution over the next audio sample.
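Concretely, the joint probability of a waveform x = (x_1, …, x_T) is factorized as a product of conditionals, one per sample (equation 1 in the paper):

  p(x) = ∏_{t=1}^{T} p(x_t | x_1, …, x_{t-1})

so each predicted sample is conditioned on all samples before it.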

WaveNet is one of the contributing components, along with Tacotron, in systems like Google Duplex, an “AI” system from 2018 for accomplishing real-world tasks over the phone. Duplex exhibits convincing TTS, and the non-WaveNet backend is additionally able to make impressive adjustments in real time when the conversation takes an unexpected turn while booking a restaurant (link to the Google Developers keynote from May 2018).

Dilated convolutions take non-adjacent pixels or audio samples as input to the operation, skipping inputs at a fixed interval set by the dilation rate, and are the core component of the architecture. They were popularized for vision by Yu and Koltun in Multi-Scale Context Aggregation by Dilated Convolutions (ICLR 2016).

Dilated Causal Convolutions

WaveNet Architecture Schematic

[Image Source: van den Oord et al. (2016) WaveNet: A Generative Model for Raw Audio]
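The schematic stacks causal convolutions whose dilation doubles at each layer (1, 2, 4, …, 512), so the receptive field grows exponentially with depth. Below is a minimal sketch of this idea, assuming PyTorch; CausalDilatedConv1d is an illustrative name, not from any released WaveNet code. Causality comes from left-padding rather than any special layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """A 1-D convolution that only sees the past: left-padding the input by
    (kernel_size - 1) * dilation keeps the output length equal to the input
    length while guaranteeing output[t] depends only on input[:t + 1]."""

    def __init__(self, channels: int, kernel_size: int = 2, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time); pad only on the left (past) side
        return self.conv(F.pad(x, (self.pad, 0)))

# Dilations 1, 2, 4, ..., 512 give a receptive field of 1,024 samples per
# stack; the paper repeats such stacks to widen it further.
stack = nn.Sequential(*[CausalDilatedConv1d(32, 2, 2 ** i) for i in range(10)])
y = stack(torch.randn(1, 32, 16000))  # one second at 16 kHz; output keeps length 16000
```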

Raw audio is typically sampled at 16 kHz (i.e. 16,000 samples per second) or more, with each sample being a 16-bit integer value (so 2^16 = 65,536 possible values per sample).

  • In WaveNet they apply a µ-law companding transformation and then quantize it to 256 values, which produces a better reconstruction of audio than a linear quantization scheme (see the sketch after this list).
  • WaveNet replaces the ReLU with the “gated activation unit” from gated PixelCNN, z = tanh(W_f ∗ x) ⊙ σ(W_g ∗ x), which combines a tanh filter with a sigmoid gate (sketched below).
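A minimal sketch of the µ-law companding step, again assuming PyTorch (mu_law_encode / mu_law_decode are illustrative names, not from any released WaveNet code). The transform f(x) = sign(x) · ln(1 + µ|x|) / ln(1 + µ) with µ = 255 compresses the 65,536 raw amplitudes non-linearly before quantizing to 256 bins:

```python
import math
import torch

def mu_law_encode(x: torch.Tensor, mu: int = 255) -> torch.Tensor:
    """Compand a waveform in [-1, 1] and quantize it to mu + 1 = 256 bins."""
    x = torch.clamp(x, -1.0, 1.0)
    companded = torch.sign(x) * torch.log1p(mu * x.abs()) / math.log1p(mu)
    return ((companded + 1.0) / 2.0 * mu + 0.5).long()  # integers in {0, ..., 255}

def mu_law_decode(q: torch.Tensor, mu: int = 255) -> torch.Tensor:
    """Invert the quantization and the companding."""
    companded = 2.0 * q.float() / mu - 1.0
    return torch.sign(companded) * ((1.0 + mu) ** companded.abs() - 1.0) / mu
```

And a sketch of one layer built around the gated activation unit, with the residual and skip 1×1 convolutions from the paper's architecture diagram; GatedResidualBlock is likewise an illustrative name:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualBlock(nn.Module):
    """One WaveNet-style layer: a causal dilated convolution feeding the
    gated activation z = tanh(W_f * x) * sigmoid(W_g * x), followed by
    1x1 convolutions producing the residual and skip outputs."""

    def __init__(self, channels: int, dilation: int, kernel_size: int = 2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # causal left-padding
        self.filter = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.residual = nn.Conv1d(channels, channels, 1)
        self.skip = nn.Conv1d(channels, channels, 1)

    def forward(self, x: torch.Tensor):
        padded = F.pad(x, (self.pad, 0))  # pad the past only
        z = torch.tanh(self.filter(padded)) * torch.sigmoid(self.gate(padded))
        return x + self.residual(z), self.skip(z)  # (to next layer, to output head)
```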