Title: Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
Authors: Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux
Published: 1st April 2021 (Thursday) @ 09:20:33
Link: http://arxiv.org/abs/2104.00355v3

Abstract

We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings’ intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods. Audio samples can be found under the following link: speechbot.github.io/resynthesis.


Overview

The authors suggest using the learned speech units as input to a vocoder module, with no spectrogram estimation. They additionally augment the learned units with a quantized F0 representation and a global speaker embedding.

Their method allows the evaluation of the learned units with respect to:

  • speech content
  • speaker identity; and
  • F0 information

plus “better control of [TTS] audio synthesis”

Contribution:

  1. Demonstrate the usage of discrete speech units, learned in a self-supervised manner, for high-quality synthesis purposes
    • no Mel-spectrogram estimation
  2. Provide an extensive evaluation of the SSL speech units from a speech synthesis point of view, i.e.,
    1. signal reconstruction
    2. voice conversion
    3. F0 manipulation
  3. Build an ultralightweight speech codec from the obtained speech units

On SSL methods: Autoencoders: various constraints can be imposed on the encoded space, such as discreteness (e.g., via the VQ-VAE bottleneck discussed below)

Speech Resynthesis (§2)

  • Closest approach published in prior work according to the authors: Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction
  • is ref 23 in the references
  • is closest to the Discrete Disentangled Self-Supervised Representations approach because it separates out encoding of the fundamental frequency (F0) from the waveform
  • In contrast, we study SSL-based speech encoders and empirically show these representations are better disentangled, and apply them as an ultra-low bitrate speech codec.
  • Another line of work suggests using intermediate representations obtained from an ASR acoustic model. These representations are being used together with the identity and prosodic information for voice conversion [24, 25, 26].
  • Unlike all of the above, we suggest synthesizing speech directly from the discrete units. Moreover, the resynthesis process sheds light on the encoded information in each of the evaluated representations

Method (§3)

Three pre-trained and fixed encoders:

  1. Content - temporal
  2. F0 - temporal
  3. Speaker ID - global (across whole audio)

One decoder

Encoders (§3.1)

  • Content encoder: tried CPC (k-means quantized), HuBERT (k-means quantized), and VQ-VAE
  • F0 encoder:
    • they use a VQ-VAE: convolutional encoder + bottleneck to a learned codebook; the mapped vectors go to a decoder, which reconstructs the F0
    • the "Yet Another Algorithm for Pitch Tracking" (YAAPT) algorithm [38] is used to extract the F0 from the input signal, x, producing the sequence that the VQ-VAE quantizes
    • for generation, they use the indices of the mapped latent vectors, not the vectors themselves (i.e., scalar IDs)
  • Speaker encoder:
    • a pretrained speaker verification (SV, not SID) model similar to the one from End-to-End Text-Dependent Speaker Verification (ref 40)
    • given an input speech utterance, it extracts a mel-spectrogram and outputs a fixed-dimensional d-vector (speaker representation)
    • learning the speaker embedding via a lookup table worked better but is limited to speakers seen at training time
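As a toy illustration of the content quantization step, here is a minimal pure-Python sketch of mapping continuous SSL frames to discrete unit IDs via nearest-centroid lookup. In the paper the centroids come from k-means trained on LibriSpeech features; the 2-D vectors below are hypothetical.

```python
def quantize(frames, centroids):
    """Map each continuous SSL frame to the ID of its nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: sq_dist(f, centroids[k]))
            for f in frames]

# Two toy 2-D centroids; each frame snaps to the closer one.
units = quantize([[0.1, 0.0], [0.9, 1.0]], [[0.0, 0.0], [1.0, 1.0]])
# units == [0, 1]
```

In practice the frames would be, e.g., 768-dimensional HuBERT activations and the codebook would have K=100 centroids, as described in the implementation details below.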

Decoder (§3.2)

Uses a modified HiFi-GAN with lookup tables (LUT) to embed the time-varying encoded speech representations into latent vectors.

Specifically, it uses separate lookup tables for the content units and the F0 units respectively, upsamples the resulting embedding sequences, and concatenates them to form the time-varying representation. The speaker embedding is concatenated to each frame of the upsampled sequence.

They use the usual multi-period and multi-scale discriminators of HiFi-GAN

See §3.2 for full details.
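To make the input assembly concrete, here is a minimal pure-Python sketch of the LUT-embed / upsample / concatenate step. The table sizes, upsample factor, and one-dimensional speaker embedding are hypothetical; the real decoder uses learned embedding tables inside a modified HiFi-GAN.

```python
def assemble_decoder_input(content_ids, f0_ids, spk_emb,
                           content_lut, f0_lut, f0_upsample):
    # Embed discrete units via lookup tables; repeat each F0 embedding
    # to match the content frame rate; then concatenate content, F0,
    # and the global speaker embedding at every frame.
    f0_frames = [vec for i in f0_ids for vec in [f0_lut[i]] * f0_upsample]
    return [content_lut[c] + f0 + spk_emb          # list concatenation
            for c, f0 in zip(content_ids, f0_frames)]

frames = assemble_decoder_input(
    content_ids=[3, 3, 7, 7], f0_ids=[0, 1], spk_emb=[0.5],
    content_lut={3: [0.1, 0.2], 7: [0.3, 0.4]},
    f0_lut={0: [1.0], 1: [2.0]}, f0_upsample=2)
# frames[0] == [0.1, 0.2, 1.0, 0.5]
```

The key design point this mirrors is that the time-varying streams (content, F0) are aligned by upsampling, while the speaker embedding is a single global vector broadcast to all frames.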

Loss: Three Components

They compose three losses: an adversarial loss, a mel-spectrogram reconstruction loss, and a feature-matching loss. The adversarial and feature-matching terms are summed over the sub-discriminators, indexed by $k$ below.

Discriminator Loss with Penalization

For each multi-period and multi-scale sub-discriminator of HiFi-GAN, indexed by $k$, they use the least-squares GAN objective (as in HiFi-GAN):

$$\mathcal{L}_{adv}(D_k, G) = \mathbb{E}_x\left[\left(D_k(x) - 1\right)^2 + D_k(\hat{x})^2\right]$$

where $\hat{x} = G(z_c, z_{F0}, z_{spk})$ is the resynthesized signal from the encoded representation; the generator in turn minimizes $\left(D_k(\hat{x}) - 1\right)^2$.

Reconstruction Loss

The reconstruction term is computed between the mel-spectrogram of the input signal and that of the generated signal:

$$\mathcal{L}_{recon}(G) = \mathbb{E}_x\left[\,\lVert \phi(x) - \phi(\hat{x}) \rVert_1\,\right]$$

where $\phi$ is a spectral operator computing the mel-spectrogram.

Discriminator Feature-Matching Loss

The second term is a feature-matching loss from [41]. It measures the $\ell_1$ distance between discriminator activations of the real signal and those of the resynthesized signal:

$$\mathcal{L}_{fm}(D_k, G) = \mathbb{E}_x\left[\sum_{i=1}^{T} \frac{1}{N_i} \lVert D_k^{(i)}(x) - D_k^{(i)}(\hat{x}) \rVert_1\right]$$

where $D_k^{(i)}$ denotes the $i$-th layer activations of sub-discriminator $k$, $N_i$ the number of units in that layer, and $T$ the number of layers.

Results

Table 1

Table 2

Figure 2 MUSHRA vs bitrate

Notes

todo There is a lot of juicy detail in the Implementation, Evaluation Metrics and Reconstruction & Conversion subsections of the Results (§4) section. There are a lot of signposts to standard best practices from 2021 for this kind of speech representation interrogation paper.

I’ve included this below for convenience.

Implementation Details  We follow the same setup as in [5]. For CPC, we used the model from [44], which was trained on a “clean” 6k-hour sub-sample of the LibriLight dataset [45]. We extract a downsampled representation from an intermediate layer with a 256-dimensional embedding and a hop size of 160 audio samples. For HuBERT, we used a Base model (12 transformer layers) trained for two iterations [4] on the 960-hour LibriSpeech corpus [46]. This model downsamples the raw audio ×320 into a sequence of 768-dimensional vectors. Similarly to [5], activations were extracted from the sixth layer.

For CPC and HuBERT, the k-means algorithm is trained on the LibriSpeech clean-100h [46] dataset to convert continuous frames to discrete codes. We quantize both learned representations with K=100 centroids, leading to a bitrate of 700 bps for CPC and 350 bps for HuBERT.
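The quoted bitrates are consistent with frame rate × ⌈log₂ K⌉ bits per token, assuming 16 kHz input audio (standard for LibriSpeech); a quick sanity check:

```python
import math

def unit_bitrate(frame_rate_hz, codebook_size):
    # bits per token = ceil(log2(K)); bitrate = tokens/sec * bits/token
    return frame_rate_hz * math.ceil(math.log2(codebook_size))

print(unit_bitrate(16000 / 160, 100))  # CPC: 100 Hz, K=100 -> 700.0 bps
print(unit_bitrate(16000 / 320, 100))  # HuBERT: 50 Hz, K=100 -> 350.0 bps
print(unit_bitrate(16000 / 160, 256))  # VQ-VAE: 100 Hz, K=256 -> 800.0 bps
```

The VQ-VAE figure matches the 800 bps quoted in the next paragraph as well.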

Similarly to CPC models, we trained the VQ-VAE content encoder model on the “clean” 6K hours subset from the LibriLight dataset. We use an encoder operating on the raw signal to extract discrete units, similar to [39]. In addition, “random restarts” were performed when the mean usage of a codebook vector fell below a predetermined threshold. Finally, we used HiFiGAN (architecture and objective) as the decoder instead of a simple convolutional decoder, as it improved the overall audio quality. This model encodes the raw audio into a sequence of discrete tokens from 256 possible tokens [34] with a hop size of 160 raw audio samples. The VQ-VAE discrete code operates at a bitrate of 800bps. We additionally experimented with 100 discrete units for VQ-VAE, however results were the best for 256. This finding is consistent with [34].

The speaker verification network uses the architecture proposed in [40]. It was trained on the VoxCeleb2 [47] dataset, achieving a 7.4% Equal Error Rate (EER) for speaker verification on the test split of the VoxCeleb1 [48] dataset.

Only a single F0 representation is considered across all evaluated models, trained on the VCTK dataset. The F0 is extracted from the raw audio using a window size of 20ms and a 5ms hop. As a result, the F0 sequence is sampled at 200Hz. The quantization described in Sec. 3 is applied using an F0 codebook of K′=20 tokens and an encoder that downsamples the signal by ×16. Hence, the discrete F0 representation is sampled at 12.5Hz, leading to a bitrate of 65bps. The final bitrate of the evaluated codecs is the sum of the pitch code bitrate and the content code bitrate.

Evaluation Metrics  We consider both subjective and objective evaluation metrics. For subjective tests, we report Mean Opinion Scores (MOS), in which human evaluators rate the naturalness of audio samples on a scale of 1–5. Each experiment included 50 randomly selected samples rated by 30 raters. For objective evaluation, we consider: (i) Equal Error Rate (EER), an automatic speaker verification metric obtained using a pre-trained speaker verification network; we report EER between test utterances and enrolled speakers; (ii) Voicing Decision Error (VDE) [49], which measures the portion of frames with a voicing decision error; (iii) F0 Frame Error (FFE) [50], which measures the percentage of frames that deviate by more than 20% in pitch value or have a voicing decision error; (iv) Word Error Rate (WER) and Phoneme Error Rate (PER), proxy metrics for the intelligibility of the generated audio. We used a pre-trained ASR network [3] on both reconstructed and converted samples to calculate both metrics.
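A minimal sketch of the VDE/FFE definitions above, treating frames with F0 = 0 as unvoiced (the input arrays below are made up, not from the paper):

```python
def vde_ffe(f0_ref, f0_est, tol=0.2):
    # Voicing Decision Error: fraction of frames where voicing (F0 > 0)
    # disagrees between reference and estimate.
    # F0 Frame Error: VDE frames plus voiced frames whose pitch deviates
    # by more than `tol` (20%) from the reference.
    vde = ffe = 0
    for r, e in zip(f0_ref, f0_est):
        if (r > 0) != (e > 0):
            vde += 1
            ffe += 1
        elif r > 0 and abs(e - r) > tol * r:
            ffe += 1
    n = len(f0_ref)
    return vde / n, ffe / n

# 4 frames; the estimate has one voicing error and one gross pitch error
print(vde_ffe([100, 100, 0, 200], [100, 0, 0, 150]))  # -> (0.25, 0.5)
```

By construction FFE ≥ VDE, which is worth remembering when reading Table 2 (right).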

Reconstruction & Conversion  We start by reporting the reconstruction performance. Results are summarized in Table 1. When considering the intelligibility of the reconstructed signal, HuBERT reaches the lowest PER and WER scores across all models, and both CPC and HuBERT are superior to VQ-VAE. However, when considering F0 reconstruction, VQ-VAE outperforms both HuBERT and CPC by a significant margin. These results are somewhat intuitive, bearing in mind that the VQ-VAE objective is to fully reconstruct the input signal. In terms of subjective evaluation, all models reach similar MOS scores, with the exception of CPC on LJ.

To better evaluate the disentanglement properties of each method with respect to speaker identity and F0, we conducted an additional set of experiments aimed at speaker conversion and F0 manipulation. For voice conversion, we converted each test utterance into five random target speakers. Next, we employed a speaker verification network, which extracts a d-vector representation, to evaluate the similarity of speaker-converted utterances to real speaker utterances (a low error rate indicates good conversion), providing a measure of how well speaker identity is disentangled in the evaluated coding method. The error rate is reported between converted test utterances and enrolled speakers. For the LJ Speech single-speaker dataset, we converted samples from the VCTK dataset to the single speaker and enrolled all VCTK speakers together with the single speaker. Results are summarized in Table 2 (left). Unlike the resynthesis results, on voice conversion CPC and HuBERT outperform VQ-VAE on both LJ and VCTK, indicating that VQ-VAE retains more speaker information in the encoded units and hence produces more artifacts. Note that this also affects WER, PER, and the overall subjective quality (MOS).
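For intuition, EER can be computed from d-vector similarity scores roughly like this: a threshold-sweep sketch over hypothetical genuine/impostor scores, not the authors' evaluation code.

```python
def eer(genuine, impostor):
    # Sweep candidate thresholds; at each threshold t,
    #   FRR = fraction of genuine scores rejected (below t),
    #   FAR = fraction of impostor scores accepted (at or above t).
    # EER is the rate where the two curves cross (approximated here by
    # the threshold with the smallest |FAR - FRR| gap).
    best_gap, best_rate = float("inf"), None
    for t in sorted(genuine + impostor):
        frr = sum(s < t for s in genuine) / len(genuine)
        far = sum(s >= t for s in impostor) / len(impostor)
        if abs(far - frr) < best_gap:
            best_gap, best_rate = abs(far - frr), (far + frr) / 2
    return best_rate

# perfectly separated scores -> 0% EER
print(eer([0.9, 0.8, 0.7], [0.3, 0.2, 0.1]))  # -> 0.0
```

A low EER on converted utterances means the converted speech is verified as the target speaker, i.e., the content units carried little residual speaker information.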

Next, to evaluate the presence of F0 in the discrete units, we flattened the F0 units before synthesizing the signal and calculated VDE and FFE with respect to the original F0 values. F0 flattening was done by setting all voiced frames to the speaker's mean F0 value. In this experiment, we expected units that contain F0 information to be better at F0 reconstruction than disentangled units. Results are summarized in Table 2 (right). Notice that VQ-VAE can still reconstruct the F0 almost at the same level as when using the original F0 as conditioning (5.2 vs. 7.03, and 5.59 vs. 7.8), in contrast to CPC and HuBERT.