Title: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Authors: Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu
Published: 16th December 2017 (Saturday) @ 00:51:40
Link: http://arxiv.org/abs/1712.05884v2
Abstract
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
We use the location-sensitive attention from [21], which extends the additive attention mechanism [22] to use cumulative attention weights from previous decoder time steps as an additional feature. This encourages the model to move forward consistently through the input, mitigating potential failure modes where some subsequences are repeated or ignored by the decoder. Attention probabilities are computed after projecting inputs and location features to 128-dimensional hidden representations. Location features are computed using 32 1-D convolution filters of length 31.
- location-sensitive attention (ref. [21]): Attention-Based Models for Speech Recognition
- extends the “additive” (Bahdanau-style) attention mechanism, i.e. the standard content-based attention, with location features
Notes (migrated from previous blog / notes)
Tacotron 2 - Architecture
See Figure 1 of the Tacotron 2 paper (block diagram of the system architecture).
Encoder
- Character embeddings: 512-dimensional learned character embeddings
- Convolutional module: 3 convolutional layers, each with 512 filters of shape 5 × 1, followed by batch normalization and ReLU activations
- The output of the final convolutional layer is passed to a bi-directional LSTM with 512 units (256 in the forward + 256 in the backward direction) to generate the encoded features
Attention Network with Location-Sensitive Attention
- Note: Location-sensitive attention mechanism from Attention-Based Models for Speech Recognition uses “cumulative” attention weights from previous time steps as an additional feature
- Encourages the model to move forward (monotonically) through time → avoids subsequent repetition or omission by the decoder
- Consumes encoder output to summarise the full encoded sequence as a fixed-length context vector for each decoder output step (spectrogram frame)
- Location features computed using 32 1D convolution filters with kernel size 31
- Attention probabilities are computed after projecting the inputs (encoder outputs) and location features to 128-dimensional hidden representations
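The attention module itself is not spelled out in these notes, so here is a minimal PyTorch sketch using the hyper-parameters above (128-dimensional projections, 32 location filters, kernel size 31). The class, argument names and the 1024-dimensional query size are illustrative assumptions, not taken from the paper or the NVIDIA code.

```python
import torch
from torch import nn
from torch.nn import functional as F


class LocationSensitiveAttention(nn.Module):
    """Additive attention extended with location features computed from the
    cumulative attention weights (sketch, not the reference implementation)."""

    def __init__(self, query_dim=1024, memory_dim=512, attn_dim=128,
                 n_location_filters=32, location_kernel_size=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)
        # Location features: 1-D convolution over the cumulative attention weights
        self.location_conv = nn.Conv1d(
            1, n_location_filters, kernel_size=location_kernel_size,
            padding=(location_kernel_size - 1) // 2, bias=False)
        self.location_layer = nn.Linear(n_location_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, cumulative_weights):
        # query: (B, query_dim) decoder state for the current step
        # memory: (B, T, memory_dim) encoder outputs
        # cumulative_weights: (B, T) sum of attention weights from previous steps
        loc = self.location_conv(cumulative_weights.unsqueeze(1))   # (B, 32, T)
        loc = self.location_layer(loc.transpose(1, 2))              # (B, T, 128)
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1)                    # (B, 1, 128)
            + self.memory_layer(memory)                             # (B, T, 128)
            + loc)).squeeze(-1)                                     # (B, T)
        weights = F.softmax(energies, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)  # (B, memory_dim)
        return context, weights
```

The returned weights are accumulated across decoder steps and fed back in as `cumulative_weights`, which is what gives the mechanism its forward-moving bias.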
Decoder
Auto-regressively predicts mel-spectrogram one frame at a time.
- Pre-net: 2 dense (linear) layers containing 256 hidden units with ReLU activation functions
- acts as an information bottleneck that the paper reports is essential for learning attention
- The pre-net output and the attention context vector are concatenated and passed to a stack of 2 uni-directional LSTM layers with 1024 units
- The concatenation of the LSTM output and the attention context vector is projected through a linear transform to predict the (current) spectrogram frame
- Post-net:
- The predicted mel spectrogram is passed through a 5-layer convolutional post-net which predicts a residual that is added to the prediction, improving the reconstruction of the ground-truth (GT) mel spectrogram
- Each post-net convolutional layer comprises 512 filters with shape 5 × 1, followed by batch normalization and tanh activations on all but the final layer
- Stop token predictor: In parallel to the spectrogram frame prediction, the decoder LSTM output and attention context vector are concatenated and projected (dense layer) to a scalar, then passed through a sigmoid activation function to predict the “stop token” probability, i.e. the probability that the output sequence has completed
- Generation stops at the first frame for which this probability exceeds a threshold (0.5 in the NVIDIA reference implementation), rather than always generating for a fixed duration
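Putting the decoder notes together, here is a schematic PyTorch sketch of a single decoding step. Layer sizes follow the paper (80-bin mel frames, 256-unit pre-net, 1024-unit LSTMs, 512-dimensional attention context); the class name, interface and state handling are illustrative assumptions, and the attention context is assumed to be computed elsewhere (e.g. by the attention sketch above).

```python
import torch
from torch import nn
from torch.nn import functional as F


class DecoderStep(nn.Module):
    """One autoregressive decoder step (sketch, not the reference implementation):
    pre-net -> 2 LSTM layers -> frame projection + stop-token projection."""

    def __init__(self, n_mels=80, prenet_dim=256, context_dim=512, lstm_dim=1024):
        super().__init__()
        self.prenet = nn.ModuleList(
            [nn.Linear(n_mels, prenet_dim), nn.Linear(prenet_dim, prenet_dim)])
        self.lstm1 = nn.LSTMCell(prenet_dim + context_dim, lstm_dim)
        self.lstm2 = nn.LSTMCell(lstm_dim, lstm_dim)
        self.frame_proj = nn.Linear(lstm_dim + context_dim, n_mels)
        self.stop_proj = nn.Linear(lstm_dim + context_dim, 1)

    def forward(self, prev_frame, context, states):
        # prev_frame: (B, n_mels) previous mel frame (or a zero "go" frame)
        # context: (B, context_dim) attention context, recomputed every step
        (h1, c1), (h2, c2) = states
        x = prev_frame
        for linear in self.prenet:
            # pre-net dropout stays on even at inference (see regularization notes)
            x = F.dropout(F.relu(linear(x)), 0.5, training=True)
        x = torch.cat([x, context], dim=-1)
        h1, c1 = self.lstm1(x, (h1, c1))
        h2, c2 = self.lstm2(h1, (h2, c2))
        y = torch.cat([h2, context], dim=-1)
        frame = self.frame_proj(y)                    # predicted mel frame
        stop_prob = torch.sigmoid(self.stop_proj(y))  # P(sequence has completed)
        return frame, stop_prob, ((h1, c1), (h2, c2))
```

After decoding finishes, the 5-layer post-net predicts a residual that is added to the full sequence of predicted frames.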
Dropout and Zoneout Regularization
- Convolutional layers are regularized with dropout with probability 0.5
- LSTM layers are regularized with zoneout with probability 0.1
- Inference-time variability: to introduce output variation at inference time, dropout with probability 0.5 is applied to the (dense) layers of the (autoregressive) decoder pre-net
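Zoneout is less familiar than dropout and is not provided by stock `nn.LSTM` modules. A minimal sketch of the idea, wrapping an `LSTMCell` so that each hidden/cell unit keeps its previous value with probability p (illustrative, not the authors' implementation):

```python
import torch
from torch import nn


class ZoneoutLSTMCell(nn.Module):
    """LSTMCell wrapper with zoneout: with probability p, each hidden/cell unit
    keeps its previous value instead of the new one (sketch)."""

    def __init__(self, input_size, hidden_size, zoneout_prob=0.1):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.p = zoneout_prob

    def forward(self, x, state):
        h_prev, c_prev = state
        h_new, c_new = self.cell(x, state)
        if self.training:
            # Randomly "zone out" units: keep the previous value with prob p
            keep_h = torch.bernoulli(torch.full_like(h_new, self.p))
            keep_c = torch.bernoulli(torch.full_like(c_new, self.p))
            h = keep_h * h_prev + (1 - keep_h) * h_new
            c = keep_c * c_prev + (1 - keep_c) * c_new
        else:
            # At eval time, use the expected update
            h = self.p * h_prev + (1 - self.p) * h_new
            c = self.p * c_prev + (1 - self.p) * c_new
        return h, c
```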
Differences against Tacotron from Tacotron: Towards End-to-End Speech Synthesis
- Simpler building blocks:
- vanilla LSTM and convolution blocks in encoder instead of CBHG stacks and GRU recurrent networks as used in Tacotron
- No “reduction factor” → each decoder step corresponds to a single spectrogram frame
Tacotron 2 - Training
Loss
- Minimise the summed mean squared error (MSE) from before and after the post-net to aid convergence
- Also experimented with a log-likelihood loss, modelling the output distribution with a Mixture Density Network to avoid assuming a constant variance over time
- did not produce better sounding samples
- was harder to train
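As a sketch of the spectrogram loss (function and argument names are illustrative; `mel_before` is the decoder output and `mel_after` the post-net-refined output), the objective is just two MSE terms summed:

```python
import torch.nn.functional as F


def tacotron2_spectrogram_loss(mel_before, mel_after, mel_target):
    """Summed MSE from before and after the post-net (sketch).

    The full training objective also needs a stop-token term, e.g. a binary
    cross-entropy "gate" loss as used in the NVIDIA implementation.
    """
    return F.mse_loss(mel_before, mel_target) + F.mse_loss(mel_after, mel_target)
```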
Tacotron 2 > Implementation
Reference Implementation by Rafael Valle and co. at NVIDIA
https://github.com/NVIDIA/tacotron2
Tacotron 2 > Questions
The Tacotron 2 Encoder uses nn.utils.rnn.pack_padded_sequence with a subsequent call to nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True) in the forward method but does not do this for inference, despite the fact that the only additional required input is a vector of input_lengths which can be computed from the inference inputs.
Question: Why is the input sequence to the RNN packed and padded in the forward method but not in the inference method?
```python
from torch import nn
from torch.nn import functional as F

from layers import ConvNorm  # defined in the repo's layers.py


class Encoder(nn.Module):
    """Encoder module:
    - Three 1-d convolution banks
    - Bidirectional LSTM
    """

    def __init__(self, hparams):
        super(Encoder, self).__init__()

        convolutions = []
        for _ in range(hparams.encoder_n_convolutions):
            conv_layer = nn.Sequential(
                ConvNorm(
                    hparams.encoder_embedding_dim,
                    hparams.encoder_embedding_dim,
                    kernel_size=hparams.encoder_kernel_size,
                    stride=1,
                    padding=int((hparams.encoder_kernel_size - 1) / 2),
                    dilation=1,
                    w_init_gain="relu",
                ),
                nn.BatchNorm1d(hparams.encoder_embedding_dim),
            )
            convolutions.append(conv_layer)
        self.convolutions = nn.ModuleList(convolutions)

        self.lstm = nn.LSTM(
            hparams.encoder_embedding_dim,
            int(hparams.encoder_embedding_dim / 2),
            1,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, x, input_lengths):
        # x: (batch, embedding_dim, time) character embeddings
        for conv in self.convolutions:
            x = F.dropout(F.relu(conv(x)), 0.5, self.training)

        x = x.transpose(1, 2)  # -> (batch, time, embedding_dim) for the LSTM

        # pytorch tensors are not reversible, hence the conversion
        input_lengths = input_lengths.cpu().numpy()
        x = nn.utils.rnn.pack_padded_sequence(x, input_lengths, batch_first=True)

        self.lstm.flatten_parameters()
        outputs, _ = self.lstm(x)

        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)

        return outputs

    def inference(self, x):
        for conv in self.convolutions:
            x = F.dropout(F.relu(conv(x)), 0.5, self.training)

        x = x.transpose(1, 2)

        self.lstm.flatten_parameters()
        outputs, _ = self.lstm(x)

        return outputs
```
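One plausible reading, offered as an assumption rather than something stated in the repo: packing only changes the LSTM computation when a padded batch mixes sequences of different lengths, and `inference` is normally called with a single utterance, so there is no padding to skip. If batched inference with padded inputs were needed, the packing pattern from `forward` could be mirrored, e.g. as a free function (name and signature hypothetical):

```python
from torch import nn
from torch.nn import functional as F


def encoder_inference_batched(encoder, x, input_lengths):
    """Hypothetical batched variant of Encoder.inference that masks padding by
    packing, mirroring Encoder.forward (not part of the NVIDIA repository)."""
    for conv in encoder.convolutions:
        x = F.dropout(F.relu(conv(x)), 0.5, encoder.training)
    x = x.transpose(1, 2)
    x = nn.utils.rnn.pack_padded_sequence(
        x, input_lengths.cpu().numpy(), batch_first=True)
    encoder.lstm.flatten_parameters()
    outputs, _ = encoder.lstm(x)
    outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)
    return outputs
```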
Tacotron 2 > Issues
Decoder Prenet Dropout is obligatory at inference time → non-deterministic inference
From Francesco Cariaggi
Ok, it looks like Tacotron and Dropout don’t go along very well. There’s an issue discussing our exact same problem as well as some Tacotron documentation (from CoquiTTS) that reads:
Tacotron also uses a Prenet module with Dropout that projects the model’s previous output before feeding it to the decoder again. The paper and most of the implementations use the Dropout layer even in inference and they report the attention fails or the voice quality degrades otherwise. But the issue with that, you get a slightly different output speech every time you run the model.
Apparently, if you train Tacotron using Dropout, you can’t avoid using it at inference time, too.
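A minimal sketch of the pattern these implementations rely on, assuming a CoquiTTS/NVIDIA-style pre-net: calling functional dropout with `training=True` keeps it active even when the module is in eval mode, which is what makes repeated inference runs non-deterministic (class name and sizes are illustrative):

```python
from torch import nn
from torch.nn import functional as F


class Prenet(nn.Module):
    """Decoder pre-net whose dropout stays on at inference time (sketch)."""

    def __init__(self, in_dim=80, sizes=(256, 256), p_dropout=0.5):
        super().__init__()
        dims = [in_dim] + list(sizes)
        self.layers = nn.ModuleList(
            [nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])])
        self.p_dropout = p_dropout

    def forward(self, x):
        for linear in self.layers:
            # training=True keeps dropout active even in eval mode, so each
            # inference run samples a slightly different output
            x = F.dropout(F.relu(linear(x)), p=self.p_dropout, training=True)
        return x
```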