Title: Tacotron: Towards End-to-End Speech Synthesis
Authors: Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous
Published: 29th March 2017 (Wednesday) @ 16:55:13
Link: http://arxiv.org/abs/1703.10135v2

Abstract

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it’s substantially faster than sample-level autoregressive methods.


Notes

Tacotron predicts a raw spectrogram from a character-string input in an end-to-end fashion using the sequence-to-sequence [1] (seq2seq) with attention [2] paradigm.

  • The model seeks to be more robust than the “layered” text-to-speech (TTS) approaches that preceded it, which typically chain together a text analysis frontend, an acoustic model and an audio synthesis module to predict vocoder parameters.
  • The <text, audio> training data pairs do not require phoneme-level alignment so any audio with an accompanying transcript can be used.
  • Tacotron also generates speech at the frame level, making it faster than sample-level autoregressive approaches, which must produce output one sample at a time.
  • It achieves a mean opinion score (MOS) of 3.82, a human-judged subjective rating of naturalness on a 5-point scale, outperforming a production parametric system.

TTS is a large-scale inverse problem: highly compressed text is decompressed into a continuous audio output that is both longer and more variable than the input, given the variation in speakers’ intonation and timbre.

Previous Approaches

  • WaveNet works well but is slow because it is autoregressive at the sample level. It also conditions on features from an existing TTS frontend, meaning it replaces only the vocoder and acoustic model components, not the text analysis frontend.
  • DeepVoice replaces each component of the layered pipeline with an independently trained neural network, so it is nontrivial to modify the system to train end-to-end.
  • Wang et al. (2016) does TTS using seq2seq with attention, but relies on a pre-trained hidden Markov model (HMM) to help the seq2seq model learn the text-audio alignment, uses training tricks that hurt prosody, and predicts vocoder parameters, so it still needs a vocoder.
  • Char2Wav3 is end-to-end and trained on characters but predicts vocoder parameters and uses a separately trained SampleRNN neural vocoder.

In contrast, Tacotron is trained end-to-end on <text, audio> pairs with no need for phoneme-level alignment.

Model Architecture

  • Tacotron includes an encoder, an attention-based decoder and a post-processing net
  • The model takes characters as input and produces spectrogram frames, which are converted to waveforms as a post-processing step

CBHG module: Convolution Bank (1-D) + Highway Network + GRU (bidirectional)

A building block consisting of:

  • a bank of K sets of 1-D convolutional filters, where the k-th set contains C_k filters of width k
    • these model local and contextual information, analogous to modelling unigrams, bigrams and so on up to K-grams
    • max pooling along the time dimension with a stride of 1 is used to increase local invariance whilst preserving the time resolution of the input
    • the processed sequence is further fed to fixed-width 1-D convolutions
    • whose output is added to the original input via residual connections, as in ResNet
    • batch normalisation is used for the conv layers
  • highway networks (Srivastava et al. 2015)
    • a multi-layer highway network is used to extract high-level features given the output of the convolution bank sub-block
  • bidirectional gated recurrent unit (GRU)
    • the bidirectional GRU is stacked on top of the highway networks

The non-causal convolutions, batch norm, residual connections and max pooling [4] were found to improve generalisation.
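A CBHG-style block can be sketched in PyTorch roughly as follows. The channel sizes, K, pooling width and layer ordering are illustrative placeholders rather than the paper’s exact hyperparameters (which differ between the encoder and the post-processing net):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Highway(nn.Module):
    """Single highway layer: y = T * H(x) + (1 - T) * x."""
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)
        self.T = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return t * F.relu(self.H(x)) + (1.0 - t) * x


class CBHG(nn.Module):
    """Conv bank + max pool (stride 1) + conv projections + residual
    + highway layers + bidirectional GRU. Dimensions are illustrative."""
    def __init__(self, dim=128, K=16, num_highway=4):
        super().__init__()
        # Bank of K sets of 1-D conv filters; the k-th set has width k.
        self.bank = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2) for k in range(1, K + 1)]
        )
        self.bank_bn = nn.BatchNorm1d(K * dim)
        # Max pooling along time with stride 1 preserves the time resolution.
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
        # Fixed-width 1-D conv projections back down to `dim` channels.
        self.proj1 = nn.Conv1d(K * dim, dim, kernel_size=3, padding=1)
        self.proj2 = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm1d(dim), nn.BatchNorm1d(dim)
        self.highways = nn.ModuleList([Highway(dim) for _ in range(num_highway)])
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, x):            # x: (batch, time, dim)
        T = x.size(1)
        y = x.transpose(1, 2)        # (batch, dim, time) for Conv1d
        y = torch.cat([F.relu(conv(y))[:, :, :T] for conv in self.bank], dim=1)
        y = self.bank_bn(y)
        y = self.pool(y)[:, :, :T]
        y = F.relu(self.bn1(self.proj1(y)))
        y = self.bn2(self.proj2(y))
        y = y.transpose(1, 2) + x    # residual connection with the input
        for hw in self.highways:
            y = hw(y)
        out, _ = self.gru(y)         # (batch, time, 2 * dim)
        return out
```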

Encoder

The characters are embedded into 256-D continuous vectors and fed into a “pre-net” (FC-256-ReLU, Dropout(0.5), FC-128-ReLU, Dropout(0.5)) whose bottleneck layer (the FC-128) improves generalisation. The CBHG module is the final encoder component before feeding the representation into the attention decoder module.
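As a concrete sketch (assuming PyTorch and the layer sizes given above), the pre-net is just a small bottlenecked MLP with dropout:

```python
import torch.nn as nn

# Encoder pre-net: FC-256-ReLU -> Dropout(0.5) -> FC-128-ReLU -> Dropout(0.5).
# The 128-unit layer is the bottleneck that helps generalisation.
prenet = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.5),
)
```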

Decoder

  • content-based tanh attention decoder as per Vinyals et al. 2015 - stateful recurrent layer produces attention query at each time step
  • concatenate context vector + attention query input to decoder RNNs
  • decoder: stack of GRUs with vertical residual connections (Wu et al. 2016); the residual connections speed up convergence
  • target: raw spectrogram could be predicted directly, but it is a highly redundant representation for the purpose of learning the speech-text alignment
  • 80-band mel-spectrogram to waveform conversion done in post-processing step
  • FC output layer to predict 80-band mel-spectrogram
  • 🗝️ Key Trick: Predict r non-overlapping frames per decoder step (see the sketch after this list)
    • reduces the number of decoder steps by a factor of r
    • reduces model size, training time and inference time
    • improves convergence speed - faster and more stable alignment learned by the attention
    • neighbouring frames are informationally redundant (each character typically spans multiple frames)
    • allows the attention to move forward early in training (e.g. Zen et al. 2016)
  • first decoder step conditioned on all-zero <GO> frame
  • Inference-time decoding: the last of the r frames predicted at step t is fed as input to the decoder at step t+1 (all r predictions could be used instead)
  • Teacher forcing used during training: every r-th ground-truth frame is fed to the decoder
  • Scheduled sampling (Bengio et al. 2015) is not used, so pre-net dropout is critical
    • the dropout noise provides robustness against the multiple modalities in the output distribution
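A rough sketch of the r-frames-per-step trick in PyTorch; the shapes, names and output-layer size below are illustrative, not the paper’s exact configuration:

```python
import torch

# Sketch of the "r frames per decoder step" trick (shapes are illustrative).
# A mel target of shape (batch, T, 80) is grouped into T // r decoder steps,
# each predicting r * 80 values at once; at inference only the last of the
# r predicted frames is fed back as the next decoder input.
r, n_mels = 2, 80
mel = torch.randn(32, 100, n_mels)                  # (batch, T, n_mels), T divisible by r
grouped = mel.reshape(32, 100 // r, r * n_mels)     # training target: (batch, T // r, r * n_mels)

output_layer = torch.nn.Linear(256, r * n_mels)     # decoder output projection
decoder_state = torch.randn(32, 256)                # hypothetical decoder RNN output
frames = output_layer(decoder_state).reshape(32, r, n_mels)
next_input = frames[:, -1, :]                       # last frame only -> next step's input
```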

Post-processing and Waveform Synthesis

  • the post-processing step converts the seq2seq (mel) target into a representation from which a waveform can be synthesised
  • here the post-processing net learns to predict spectral magnitude on a linear-frequency scale
  • unlike the strictly left-to-right seq2seq decoder, the post-processing net sees the full decoded sequence, so it can use both forward and backward context to correct prediction errors
  • CBHG module used as post-processing net (flexible; can be swapped out; can predict different targets e.g. vocoder parameters)
  • Griffin-Lim (Griffin and Lim 1984)
    • predicted magnitudes are raised to the power of 1.2 before Griffin-Lim to reduce artefacts (a harmonic-enhancement effect)
    • 30 iterations are enough for convergence
    • Griffin-Lim is differentiable (implemented in TensorFlow), but no loss is imposed on it
    • simple (but can likely be improved on)
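The paper implements Griffin-Lim in TensorFlow; purely as an illustration, the same inversion can be approximated with librosa. The hop and window lengths below assume the 24 kHz / 12.5 ms / 50 ms settings listed under Model Details, and the function name is mine:

```python
import numpy as np
import librosa


def synthesise(linear_mag, hop_length=300, win_length=1200, power=1.2):
    """Invert a predicted linear-scale magnitude spectrogram with Griffin-Lim.

    linear_mag: (1 + n_fft // 2, T) magnitude spectrogram (not log-scale).
    Magnitudes are raised to the power 1.2 before inversion to reduce artefacts.
    """
    boosted = linear_mag ** power
    return librosa.griffinlim(
        boosted, n_iter=30, hop_length=hop_length,
        win_length=win_length, window="hann",
    )
```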

Model Details

  • log magnitude spectrogram
  • Hann windowing
  • 50 ms frame length
  • 12.5 ms frame shift
  • 2048-point Fourier transform
  • pre-emphasis (0.97)
  • 24 kHz sampling rate
  • output layer reduction factor r = 2; larger values (e.g. r = 5) work well too
  • Adam optimiser with learning-rate decay (starting from 0.001)
  • L1 loss for the seq2seq decoder (mel-spectrogram) and the post-processing net (linear-scale spectrogram), with equal weights on the two losses
  • batch size = 32
  • all sequences are padded to the max length; a loss mask over zero-padded frames is possible, but models trained that way don’t know when to stop emitting output, so the model is trained to reconstruct the zero-padded frames instead
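A sketch of the analysis settings above using librosa and scipy; the function name and the log floor are assumptions of mine, not the paper’s:

```python
import numpy as np
import librosa
import scipy.signal

SR = 24000                       # 24 kHz sampling rate
N_FFT = 2048                     # 2048-point Fourier transform
WIN = int(0.050 * SR)            # 50 ms frame length  -> 1200 samples
HOP = int(0.0125 * SR)           # 12.5 ms frame shift -> 300 samples


def log_magnitude_spectrogram(wav):
    """Pre-emphasis (0.97), Hann-windowed STFT, log magnitude.

    wav: float waveform sampled at 24 kHz.
    """
    emphasised = scipy.signal.lfilter([1.0, -0.97], [1.0], wav)
    stft = librosa.stft(emphasised, n_fft=N_FFT, hop_length=HOP,
                        win_length=WIN, window="hann")
    return np.log(np.abs(stft) + 1e-6)   # small floor avoids log(0)
```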

Experiments

  • trained on North American English: 24.6 hours of speech from a single professional female speaker
  • text is normalized, e.g. “16” is written out as “sixteen”
  • the authors note that generative models are hard to evaluate with objective metrics, which often correlate poorly with perception
  • see samples
  • Ablation: Comparison with seq2seq model:
    • encoder and decoder use 2-layer residual 256-D GRUs (also tried LSTMs)
    • No pre-net or post-processing
    • decoder predicts linear-scale log magnitude spectrogram
    • scheduled sampling (sampling rate = 0.5) required for alignment learning and generalization
    • the learned attention alignment is poor (see Fig. 3(a)): it tends to get stuck, which hurts speech intelligibility (cf. the main model, which learns a clean and smooth alignment)
    • naturalness and duration are destroyed
  • Ablation: CBHG encoder replaced by a 2-layer residual GRU encoder (all else equal):
    • alignment from GRU encoder noisier, which causes mispronunciations
    • CBHG encoder reduces overfitting, generalises well to long and complex phrases
  • Ablation: no post-processing net (the decoder RNN directly predicts the linear-scale spectrogram):
    • with access to the full decoded context, the post-processing net’s predictions have better-resolved harmonics (e.g. the higher harmonics in bins 100-400) and high-frequency formant [5] structure, which reduces synthesis artefacts

Mean Opinion Score (MOS) Tests

  • 100 unseen phrases each received 8 ratings on a 5-point Likert scale from native speakers
  • Comparison points: parametric system (based on LSTM from Zen et al. 2016) and concatenative system (Gonzalvo et al. 2016) production systems
  • Tacotron beats the parametric system but loses to the concatenative one; the latter is a strong baseline, and the authors note that artefacts from the Griffin-Lim synthesis may hurt the score
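For reference, MOS is just the mean of the Likert ratings; a toy computation with hypothetical ratings (the confidence-interval formula is a standard normal approximation, not from the paper):

```python
import numpy as np

# Hypothetical ratings matrix: 100 phrases x 8 raters, each a 1-5 Likert score.
ratings = np.random.randint(1, 6, size=(100, 8))
mos = ratings.mean()
ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(ratings.size)   # normal approximation
print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```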

Discussions

  • Tacotron is frame-based so inference is faster than sample-level autoregressive methods
  • no engineered linguistic features, HMM aligner or other hand-built components: the model is end-to-end
  • learned text normalization may render hard-coded text normalization redundant
  • Candidates for improvement: the output layer, the attention module, the loss function and the Griffin-Lim waveform synthesiser

BELOW IS PREVIOUSLY DRAFTED TEXT

The decoder is a 2-layer 256-cell GRU stack with content-based (tanh) attention. Briefly, the context vector is concatenated with the output of the 1-layer attention GRU to form the input to the decoder GRU stack at each decoding step. This type of decoder is analogous to the LSTM-based one used in Vinyals et al. (2015), Grammar as a Foreign Language [6]. I discussed that paper in a [previous post]({{ site.baseurl }}{% link _posts/2021-09-06-LSTM-grammar-foreign-language.md %}), which can be consulted for more detail.
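A minimal sketch of content-based tanh (Bahdanau-style) attention of this kind, assuming PyTorch; layer names and dimensions are illustrative:

```python
import torch
import torch.nn as nn


class TanhAttention(nn.Module):
    """Content-based tanh attention: score_t = v^T tanh(W1 h_t + W2 q)."""
    def __init__(self, enc_dim, query_dim, attn_dim=256):
        super().__init__()
        self.W1 = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W2 = nn.Linear(query_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, memory, query):
        # memory: (batch, T_enc, enc_dim) encoder outputs; query: (batch, query_dim)
        scores = self.v(torch.tanh(self.W1(memory) + self.W2(query).unsqueeze(1)))
        weights = torch.softmax(scores.squeeze(-1), dim=-1)         # (batch, T_enc)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)  # (batch, enc_dim)
        return context, weights
```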

Residual connections are used in the decoder GRU stack to speed up convergence, as per Wu et al. (2016).

Tacotron uses 80-band mel-scale spectrograms as the decoder target since raw spectrogram is highly redundant whilst mel-spectrograms provide sufficient intelligibility and prosody information. A waveform is synthesised from the mel-spectrogram as a post-processing step.
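As an illustration, the 80-band mel target can be derived from the linear-scale magnitude spectrogram with librosa’s mel filterbank; the log floor and function name below are assumptions, not from the paper:

```python
import numpy as np
import librosa

# Project a linear-scale magnitude spectrogram of shape (1 + n_fft // 2, T)
# onto an 80-band mel filterbank to obtain the compressed decoder target.
mel_basis = librosa.filters.mel(sr=24000, n_fft=2048, n_mels=80)


def to_mel(linear_mag):
    return np.log(np.dot(mel_basis, linear_mag) + 1e-6)
```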


Footnotes

  1. Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014. https://arxiv.org/pdf/1409.3215.pdf.

  2. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. https://arxiv.org/pdf/1409.0473.pdf.

  3. Jose Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. Char2Wav: End-to-end speech synthesis. In ICLR2017 workshop submission, 2017. https://mila.quebec/wp-content/uploads/2017/02/end-end-speech.pdf.

  4. Recall that max pooling is not the same as global pooling (for example, global average pooling). It computes the maximum over a sliding window of the input and is typically used to reduce dimensionality and make the overall model robust to small local variations.
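     A toy example of the distinction (assuming PyTorch):

     ```python
     import torch
     import torch.nn as nn

     x = torch.tensor([[[1.0, 3.0, 2.0, 5.0, 4.0]]])        # (batch, channels, time)
     windowed = nn.MaxPool1d(kernel_size=2, stride=1)(x)     # max over each 2-wide window
     global_avg = x.mean(dim=-1)                             # single value per channel
     print(windowed)    # tensor([[[3., 3., 5., 5.]]])
     print(global_avg)  # tensor([[3.]])
     ```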

  5. each of several prominent bands of frequency that determine the phonetic quality of a vowel.

  6. Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pp. 2773–2781, 2015. https://arxiv.org/pdf/1412.7449.pdf.