Title: Controllable Speech Representation Learning Via Voice Conversion and AIC Loss Authors: Yunyun Wang, Jiaqi Su, Zeyu Jin, Adam Finkelstein
Published: 23rd May 2022 Link: https://ieeexplore.ieee.org/document/9747590

Abstract

Speech representation learning transforms speech into features that are suitable for downstream tasks, e.g. speech recognition, phoneme classification, or speaker identification. For such recognition tasks, a representation can be lossy (non-invertible), which is typical of BERT-like self-supervised models. However, when used for synthesis tasks, these lossy representations prove insufficient for plausibly reconstructing the input signal. This paper introduces a method for invertible and controllable speech representation learning based on disentanglement. The representation can be decoded into a signal perceptually identical to the original. Moreover, its disentangled components (content, pitch, speaker identity, and energy) can be controlled independently to alter the synthesis result. Our model builds upon a zero-shot voice conversion model, AutoVC-F0, into which we introduce an alteration invariant content loss (AIC loss) and adversarial training (GAN). Through objective measures and subjective tests, we show that our formulation offers significant improvement in voice conversion sound quality as well as more precise control over the disentangled features.


Representation learning transforms complex signals into features more suitable for downstream tasks. Self-supervised models such as BERT [1] and GPT-3 [2] play a key role in state-of-the-art methods in language processing and computer vision. For audio, BERT-like self-supervised models like Wav2Vec [3] and HuBERT [4] have demonstrated success in speech recognition, phoneme classification, and speaker identification [5], [6], [7], [8]. The learned representation provides a more robust solution to these downstream tasks while dramatically reducing the need for labeled data. However, the encoded representations suffer from several drawbacks. First, they are typically lossy (non-invertible), as the training process does not optimize for reconstruction. While not problematic for the aforementioned recognition tasks, we show that the resulting representation is not suitable for synthesis tasks due to information loss. Second, fine-tuning is essential to improve the learned representation for a downstream task, but it also makes the model domain-specific. Fine-tuning for English phoneme classification, for example, improves performance on that task, but makes the model less generalizable to other languages. Finally, models such as Wav2Vec focus on encoding one aspect of the speech signal, especially after fine-tuning. While a representation fine-tuned on phonemes performs well for speech content-related tasks, it is less suitable for other aspects of speech such as the recording environment, quality, prosody, and speaking style. This paper introduces a strategy for speech representation learning that is both invertible (making it suitable for synthesis tasks) and controllable – disentangled components can be manipulated independently, including content, pitch, speaker identity, and energy.

Our approach is inspired by recent work in voice conversion. Many-to-many voice conversion has been a challenging task due to the lack of parallel speech data. Existing approaches are generally based on either speech recognition or autoencoders. Speech recognition-based methods either combine automatic speech recognition (ASR) and text-to-speech (TTS) directly [9], [10] or synthesize from intermediate speech recognition features [11], [12]. Assuming ASR is near perfect, the converted speech often sounds natural and clean. However, these methods are often language-specific, and detailed information about the speaker and prosody is lost during recognition. Autoencoder-based approaches separate and manipulate speaker and content information via bottleneck constraints [13], [14], [15] or cross-domain feature disentanglement [16]. They tend to preserve prosody and accent and can be language-independent, but the synthesis quality tends to degrade due to information loss at the bottleneck and during disentanglement.

Our approach specifically builds on advances in zero-shot voice conversion – especially AutoVC [13] and AutoVC-F0 [14], [15] – that disentangle speaker identity and pitch from speech content, while achieving perceptually plausible reconstructions (meaning they are invertible). They are also naturally language-agnostic. AutoVC uses an encoder to extract a code (content) from the mel-spectrogram of the input speaker, and reconstructs the mel-spectrogram by combining the code, speaker identity embedding, and pitch in the decoder. The AutoVC paper shows that invertibility coupled with a properly tuned bottleneck guarantees perfect disentanglement, meaning the output can be controlled by altering different representations independently. However, the bottleneck for the code space is sensitive to the architecture and must be carefully crafted – architectures that deviate from their original design tend to result in a significant reduction in reconstruction quality.

Our proposed model ameliorates these concerns. We introduce a new alteration invariant content (AIC) loss that maps together the code-spaces of two utterances with the same speech content spoken by different people. We also add adversarial training to further improve the quality of the output mel-spectrogram. The AIC loss prevents speaker identity from leaking through, allowing us to use a larger bottleneck size for better synthesis quality. The model is able to synthesize speech reliably, and provides independent control of content, speaker identity, pitch, and energy. Finally, the model itself is language-independent.

We describe objective experiments showing that our method provides effective control over pitch and energy. In addition, we describe subjective experiments showing that our method is able to reconstruct a plausible voice – while providing independent control of speaker identity – comparing favorably with baseline methods. Our trained models, as well as listening examples, are available here: https://pixl.cs.princeton.edu/pubs/Wang_2022_CSR/

The basis of our approach originates from a source-filter view of the speech signal. The source is either a periodic signal or noise, characterized by VUV (voiced vs. unvoiced) and F0 (a.k.a. pitch, in Hz or semitones). The filter contains content, which relates to the muscle control that produces words and speaking style, together with timbre, referring to the acoustic properties of the speaker's vocal tract. Conceptually these aspects can be manipulated independently and are sufficient to determine the original speech. This conceptual model is supported by AutoVC-F0, whose architecture disentangles F0 and speaker identity from the rest of the information (content), allowing manipulation of each representation without altering the others.

We denote the i-th speaker's voice identity as a vector $s_i$, and $u_j$ as the j-th utterance content code, which contains the content and rhythm of the speech. Let $x_{i,j}$ be the real audio of speaker $s_i$ speaking the content $u_j$ with F0 sequence $f_{i,j}$. Note that $u_j$ is unique to every speaker $s_i$, as every speaker speaks differently, even for the same text. Let $X_{i,j}$ be the mel-spectrogram of the corresponding audio $x_{i,j}$.

An ideal content encoder $E$ should be able to extract $u_j$ from $X_{i,j}$, and an ideal decoder $D$ should be able to reconstruct $X_{i,j}$ given $u_j$, $s_i$, and $f_{i,j}$:

$$u_j = E(X_{i,j}), \qquad \hat{X}_{i,j} = D\big(E(X_{i,j}),\, s_i,\, f_{i,j}\big) \approx X_{i,j}.$$

We aim to design $E$ and $D$ such that (1) the reconstructed speech $\hat{X}_{i,j}$ and the original $X_{i,j}$ are perceptually identical, and (2) altering $s$ and $f$ makes the decoded speech sound like speaker $s'$ speaking the same content with a new F0 $f'$. To achieve this, we propose several objectives (Sec. 2.2): mel-spec loss, self content loss, AIC loss, and adversarial loss.

2.1. Model

Our model (Fig. 1) builds upon the AutoVC-F0 [14] decoder. Unlike AutoVC-F0, we design a convolution-based encoder instead of an LSTM-based encoder, to extract a code space decoupled from long-term context dependency. This yields a more interpretable code space with clearer correspondence to local content (i.e. phonemes).

Fig. 1. The framework of our autoencoder model, shown for voice conversion (above) with training architecture (below). Two types of AIC loss are shown in green.

Encoder. The encoder consists of 6 stacked convolution layers with kernel size 5 and stride 1, each followed by group normalization with group size 32 and ReLU activation. It takes an 80-coefficient mel-spectrogram as input and outputs a sequence of codes at the same temporal resolution as the input. The convolution channel sizes are [80, 512, 512, 512, 512, 512, neckDim]. AutoVC-F0 uses a bottleneck with reduced temporal resolution; in our experiments, we instead choose a neckDim of 8 at the full input resolution to include more information in a larger code space.
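A minimal PyTorch sketch of the convolutional content encoder described above. It reflects only what the text specifies (6 conv layers, kernel 5, stride 1, group norm, ReLU, the listed channel sizes); interpreting "group size 32" as 32 groups and leaving the final bottleneck layer without normalization are assumptions.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, neck_dim: int = 8, num_groups: int = 32):
        super().__init__()
        channels = [80, 512, 512, 512, 512, 512, neck_dim]
        layers = []
        for i in range(len(channels) - 1):
            in_ch, out_ch = channels[i], channels[i + 1]
            # kernel 5, stride 1, "same" padding keeps the input time resolution
            layers.append(nn.Conv1d(in_ch, out_ch, kernel_size=5, stride=1, padding=2))
            if i < len(channels) - 2:  # assumption: no norm/activation on the bottleneck layer
                layers.append(nn.GroupNorm(num_groups, out_ch))
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 80, time) -> code: (batch, neck_dim, time)
        return self.net(mel)
```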

Decoder. The decoder first stacks all input features (content code, speaker embedding, pitch, and possibly energy) along the feature dimension and feeds the resulting sequence to an LSTM with hidden size 512. The output then goes through 3 stacked convolution layers similar to the encoder with channel sizes [512, 512, 512, 512], followed by a 2-layer LSTM with hidden size 1024. Finally, a linear projection layer projects the features from dimension 1024 to 80. To add more detail to the output mel-spectrogram, we add one postnet identical to the implementation in AutoVC-F0; the output of the decoder and the output of the postnet are added together to form the final output. The postnet has 6 convolution layers similar to the encoder with channel sizes [80, 512, 512, 512, 512, 512, 80].
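A compact sketch of the decoder and postnet in the same PyTorch style, assuming batch-first tensors of shape (batch, time, features). The activation placement, the broadcasting of the speaker embedding over time, and representing pitch as a 257-bin one-hot sequence (matching the binning described later) are assumptions not fully specified in the text.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, neck_dim=8, spk_dim=256, pitch_dim=257, energy_dim=0):
        super().__init__()
        in_dim = neck_dim + spk_dim + pitch_dim + energy_dim
        self.pre_lstm = nn.LSTM(in_dim, 512, batch_first=True)
        convs = []
        for _ in range(3):
            convs += [nn.Conv1d(512, 512, kernel_size=5, padding=2),
                      nn.GroupNorm(32, 512), nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        self.post_lstm = nn.LSTM(512, 1024, num_layers=2, batch_first=True)
        self.proj = nn.Linear(1024, 80)
        # postnet: 6 conv layers with channels [80, 512, 512, 512, 512, 512, 80]
        post_ch = [80, 512, 512, 512, 512, 512, 80]
        post = []
        for i in range(6):
            post.append(nn.Conv1d(post_ch[i], post_ch[i + 1], kernel_size=5, padding=2))
            if i < 5:
                post += [nn.GroupNorm(32, post_ch[i + 1]), nn.ReLU()]
        self.postnet = nn.Sequential(*post)

    def forward(self, code, spk, pitch):
        # code: (B, T, neck_dim), i.e. the encoder output transposed to time-major
        # spk: (B, spk_dim), broadcast over time; pitch: (B, T, pitch_dim)
        spk = spk.unsqueeze(1).expand(-1, code.size(1), -1)
        x = torch.cat([code, spk, pitch], dim=-1)
        x, _ = self.pre_lstm(x)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.post_lstm(x)
        mel = self.proj(x)                                        # coarse mel prediction
        residual = self.postnet(mel.transpose(1, 2)).transpose(1, 2)
        return mel + residual                                     # decoder output + postnet residual
```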

Generator and Discriminator. We notice that the over-smoothing effect caused by the mel-spec loss limits audio quality. We therefore use an adversarial network to refine the mel-spectrogram and reduce artifacts. The whole encoder-decoder architecture acts as the generator, and the discriminator uses the same architecture as SpecGAN [17].

2.2. Objectives

During each iteration of training, we randomly select a speaker $s_1$ and a segment of real utterance $X_1$. We extract the content code using the encoder and generate three mel-spectrograms $\hat{X}_{1\to 1}$, $\hat{X}_{1\to 2}$, $\hat{X}_{1\to 3}$ from the same code with three different speaker identities using the decoder. We apply the following objectives to the synthesized mel-spectrograms to force the encoder to disentangle speaker identity information and the decoder to reconstruct realistic speech.

Mel-spec Loss. The mel-spec loss is an L2 loss between the real mel-spectrogram and the self-to-self reconstructed mel-spectrogram. By constraining the synthesized self-to-self mel-spectrogram to stay close to the real one, it ensures the reconstruction quality of the decoder.

$$\mathcal{L}_{\text{mel}} = \big\| \hat{X}_{1\to 1} - X_1 \big\|_2^2$$

Self Content Loss. The self content loss is an L1 loss between the code extracted from the real input and the code extracted from the self-to-self reconstructed output. It is used in AutoVC [13] and shares a similar idea with the cycle consistency loss [18]. It enhances the robustness of the encoder by making it invariant to self-to-self reconstruction.

$$\mathcal{L}_{\text{content}} = \big\| E(\hat{X}_{1\to 1}) - E(X_1) \big\|_1$$

Alteration Invariant Content (AIC) Loss. The AIC loss is our proposed content loss to improve synthesis quality and model robustness. In the AutoVC [13], [14] setting, the bottleneck between the encoder and decoder must be carefully crafted such that the code space fully contains the content information necessary for a perfect reconstruction of the utterance, while no information in the condition leaks through. As such, AutoVC reduces the time resolution of the extracted codes, which leads to phoneme inaccuracy and low synthesis quality. Our AIC loss prevents speaker identity from leaking through by forcing the code spaces of any two different speakers to be close, and therefore permits a large bottleneck at the original time resolution that better preserves details of the input content. Specifically, during training, we convert a content code to two randomly selected speakers and calculate their L1 content code distance. We focus on invariance to alteration in speaker identity for the AIC loss because our experiments show that pitch and energy are already well disentangled from the code space even without it.

$$\mathcal{L}_{\text{AIC}} = \big\| E(\hat{X}_{1\to 2}) - E(\hat{X}_{1\to 3}) \big\|_1$$

Adversarial Loss. We adopt the hinge loss and feature matching loss from MelGAN [19] as our adversarial loss. The generator hinge loss and feature matching loss are combined with weights 1 and 10 to form the adversarial generator loss. We sum the four losses (mel-spec, self content, AIC, and adversarial) with weights 1, 100, 100, and 10, respectively, as our full objective for the generator, and use the hinge loss for the discriminator.
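A hedged sketch of how the generator-side objectives above could be combined in one training step. The `encoder`, `decoder`, and `discriminator` interfaces (including the discriminator returning a score and a list of intermediate features) are illustrative stand-ins, which spectrogram is fed to the discriminator is an assumption, and pairing the weights (1, 100, 100, 10) with the losses in the order they are introduced in the text is also an assumption.

```python
import torch
import torch.nn.functional as F

def generator_losses(encoder, decoder, discriminator, X1, s1, s2, s3, f1):
    """Generator-side losses for one real mel-spectrogram X1 of speaker s1."""
    code = encoder(X1)
    X_11 = decoder(code, s1, f1)        # self-to-self reconstruction
    X_12 = decoder(code, s2, f1)        # conversion to random speaker s2
    X_13 = decoder(code, s3, f1)        # conversion to random speaker s3

    loss_mel = F.mse_loss(X_11, X1)                      # mel-spec loss (L2)
    loss_self = F.l1_loss(encoder(X_11), code)           # self content loss (L1)
    loss_aic = F.l1_loss(encoder(X_12), encoder(X_13))   # AIC loss (L1)

    # adversarial part: hinge loss + feature matching, weighted 1 and 10
    fake_score, fake_feats = discriminator(X_12)
    _, real_feats = discriminator(X1)
    loss_hinge = -fake_score.mean()
    loss_fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_feats, real_feats))
    loss_adv = 1.0 * loss_hinge + 10.0 * loss_fm

    # full generator objective (weight-to-loss pairing is an assumption)
    return 1.0 * loss_mel + 100.0 * loss_self + 100.0 * loss_aic + 10.0 * loss_adv
```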

The model is trained from scratch using the mel-spec loss, self content loss, and AIC loss for 400k iterations with a learning rate of $10^{-4}$. Next, we add the GAN loss and train for an additional 400k iterations with a learning rate of $10^{-5}$ for the generator and $10^{-6}$ for the discriminator. We update the discriminator and generator in turn at each iteration. We use a batch size of 4 and the Adam optimizer [20] on a GeForce RTX 3090 GPU. We use the VCTK dataset [21] for training and evaluation; the last 10 speakers and the first 10 utterances of each speaker are held out for testing. Each clip is preprocessed with a Butterworth highpass filter at 30 Hz to remove low-frequency noise.
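A schematic of the two-stage schedule described above, assuming PyTorch; `data` (an infinite batch iterator), `gen_losses`, and `disc_loss` are stand-ins for the actual data pipeline and the loss functions sketched earlier.

```python
import torch

def train(generator, discriminator, data, gen_losses, disc_loss):
    # Stage 1: 400k iterations of reconstruction-only training (mel + content + AIC)
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    for _ in range(400_000):
        batch = next(data)
        opt_g.zero_grad()
        gen_losses(batch, adversarial=False).backward()
        opt_g.step()

    # Stage 2: add the GAN objective; discriminator and generator updated in turn
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-5)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-6)
    for _ in range(400_000):
        batch = next(data)
        opt_d.zero_grad()
        disc_loss(batch).backward()
        opt_d.step()

        opt_g.zero_grad()
        gen_losses(batch, adversarial=True).backward()
        opt_g.step()
```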

To produce accurate fundamental frequencies (F0), we use CREPE [22] to estimate the pitch range of the utterance and then guide SWIPE [23] to calculate the final log-F0. The log-F0 values are then normalized into 257 bins: 256 effective pitch bins and 1 unvoiced bin. To compare how normalization affects the generated audio quality, we try two different normalization strategies: absolute normalization and relative normalization. For a frequency of $f$ Hz, absolute normalization maps $\log f$ onto the bin range using fixed bounds $f_{\min} = 40$ Hz and $f_{\max} = 400$ Hz. Relative normalization, similar to the speaker normalization in AutoVC-F0 [14], normalizes $\log f$ using $\mu_i$ and $\sigma_i$, the mean and standard deviation of the log frequency of speaker $s_i$.
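A NumPy sketch of the two normalization strategies as described above. Only the 256 pitch bins plus 1 unvoiced bin and the 40-400 Hz range for absolute normalization come from the text; the exact bin boundaries, the clipping of out-of-range values, and the ±3σ mapping used in the relative variant are assumptions.

```python
import numpy as np

F_MIN, F_MAX, N_BINS = 40.0, 400.0, 256  # bin 0 is reserved for unvoiced frames

def absolute_bins(f0_hz: np.ndarray) -> np.ndarray:
    """Quantize F0 (Hz, 0 = unvoiced) into bins 1..256 on a fixed log scale."""
    bins = np.zeros_like(f0_hz, dtype=np.int64)
    voiced = f0_hz > 0
    norm = (np.log(f0_hz[voiced]) - np.log(F_MIN)) / (np.log(F_MAX) - np.log(F_MIN))
    bins[voiced] = 1 + np.clip(np.round(norm * (N_BINS - 1)), 0, N_BINS - 1).astype(np.int64)
    return bins

def relative_bins(f0_hz: np.ndarray, mu_log: float, sigma_log: float) -> np.ndarray:
    """Speaker-relative variant: standardize log-F0 by the speaker's mean/std first."""
    bins = np.zeros_like(f0_hz, dtype=np.int64)
    voiced = f0_hz > 0
    z = (np.log(f0_hz[voiced]) - mu_log) / sigma_log   # roughly within [-3, 3]
    norm = np.clip((z + 3.0) / 6.0, 0.0, 1.0)          # assumed mapping to [0, 1]
    bins[voiced] = 1 + np.round(norm * (N_BINS - 1)).astype(np.int64)
    return bins
```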

For the identity extractor, we use the pretrained Resemblyzer [24], a version of the speaker encoder from GE2E [25]. Resemblyzer produces a speaker embedding of dimension 256; the embeddings of the utterances from the same speaker are averaged to form that speaker's identity embedding. The energy is calculated directly from the raw waveform. Finally, we use the pretrained HiFi-GAN vocoder [26] to synthesize audio at a 22,050 Hz sample rate from the predicted mel-spectrogram.
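A short sketch of building per-speaker identity embeddings with the pretrained Resemblyzer package, as described above. The file paths are placeholders, and re-normalizing the averaged embedding is an assumption not stated in the text.

```python
from pathlib import Path
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # pretrained GE2E-style speaker encoder

def speaker_embedding(utterance_paths: list[Path]) -> np.ndarray:
    """Average the 256-dim utterance embeddings belonging to one speaker."""
    embeds = [encoder.embed_utterance(preprocess_wav(p)) for p in utterance_paths]
    emb = np.mean(embeds, axis=0)
    return emb / np.linalg.norm(emb)  # re-normalization is an assumption
```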

3.1. Ablation Study

We conduct an ablation study on three design choices to justify our final approach: (a) absolute pitch vs. relative pitch, (b) GAN training vs. no GAN training, and (c) AIC loss vs. AIC loss 2 vs. no AIC loss. The following variants of our model are trained and compared:

Our first main model is denoted Ours-AC, where A and C indicate that we use Absolute pitch and the AIC loss. To explore the potential of Ours-AC, we also extend its output audio to 48 kHz (Ours-AC-48k) using a bandwidth extension model [27]. Ours-AC-noGAN is Ours-AC without GAN training. Ours-ACE is Ours-AC with an additional condition on Energy. Ours-RC uses Relative pitch and the AIC loss. Ours-A and Ours-R use Absolute pitch and Relative pitch respectively, but without the AIC loss. Ours-AC2 is our second main model with Absolute pitch and a different formulation of the AIC loss, which we call AIC loss 2: instead of choosing two different speakers $s_2$ and $s_3$ as in the original AIC loss, it uses only one additional speaker $s_2$ (both variants are illustrated in Fig. 1).

We compare different settings of our models through a MOS (Mean Opinion Score) test on quality and similarity using Amazon Mechanical Turk (Fig. 2). Subjects are required to pass a preliminary test before participating, ensuring they have suitable equipment and can distinguish audio of different quality. We randomly insert validation questions during the test and filter out subjects who answer randomly. In the quality test, we collected 401 valid HITs across 176 unique workers, totaling 10426 answers over all method conditions (including input and target). In the similarity test, we collected 554 valid HITs across 292 unique workers, totaling 14404 answers. Note that in the similarity test, we set the high anchor as "the two speakers sound like the same person" and the low anchor as "the two speakers sound like two different persons," making it more difficult to obtain high ratings across the board.

Fig. 2. Ablation. MOS scores (above) and similarity scores (below) for model variants. Among these, we find that AIC loss, GAN training, and absolute pitch all play important roles in the complete model.

Absolute Pitch vs. Relative Pitch. Absolute pitch helps generate more natural speech (Ours-AC vs. Ours-RC, Ours-A vs. Ours-R). Relative pitch normalizes pitch by the speaker's pitch distribution and leaves the decoder to decide the actual pitch range based on the speaker identity. Sometimes the decoder generates an unstable pitch – in other words, a gender flip – in the cross-gender scenario.

GAN Training vs. No GAN Training. Ours-AC has a higher quality MOS score than Ours-AC-noGAN in all comparison scenarios, and is on par with Ours-AC-noGAN for similarity. Adding the GAN resolves the over-smoothing issue caused by the L2 loss and introduces details in the mel-spectrogram, which encourages the vocoder to generate cleaner audio. Without the GAN, noticeable artifacts are easily picked up by the human ear.

AIC Loss vs. AIC Loss 2 vs. No AIC Loss. Comparing (Ours-AC, Ours-A) and (Ours-RC, Ours-R) shows that adding the AIC loss improves generation quality, especially in cross-gender and unseen-to-unseen scenarios. The AIC loss contributes more significantly in the relative pitch setting by reducing gender flips, while absolute pitch suffers less from that issue. AIC loss (Ours-AC) scores slightly higher than AIC loss 2 (Ours-AC2) overall, but AIC loss 2 is more robust to unseen speakers and performs better in the similarity test. We also notice that the AIC loss leads to a more stable training curve than AIC loss 2. In conclusion, Ours-AC2 is the more balanced choice, but Ours-AC remains useful in some scenarios.

3.2. Voice Conversion

Based on the analysis above, we use Ours-AC, Ours-AC2, and Ours-AC-48k for the voice conversion task. We conduct the same tests as described in Sec. 3.1 for our models and the baselines (Fig. 3).

Fig. 3. MOS scores (above) and similarity scores (below) show that our best models compare favorably with baselines across gender and seen/unseen speaker conversion cases.

PitchShift. We use the traditional pitch shifter PSOLA [28] to shift the mean absolute pitch of the input speaker to the mean absolute pitch of the target speaker. It mainly serves to compare timbre differences between the models, since it eliminates the pitch difference.

AutoVC-F0. To match our experimental setting, AutoVC-F0's [14] speaker identity extractor is changed to Resemblyzer. We notice a slight quality improvement over the original paper (20 speakers) when training on our training set (99 speakers).

Wav2Vec. The features of the input utterance are extracted using the large fine-tuned pretrained Wav2Vec 2.0 [3] model wav2vec2-large-960h-lv60, interpolated to match our time resolution, and linearly projected to a code space of dimension 32 (much larger than ours). We then train our decoder on top of this fixed code space using the mel-spec loss and the self content loss. The Wav2Vec-based decoder reconstructs speaker identity reasonably well: because Wav2Vec is fine-tuned on a phoneme recognition task, it naturally disentangles speaker information and thus leads to better speaker similarity in the decoder output. However, it also loses information beyond phonemes that is crucial for high-quality speech. As a result, the Wav2Vec representation leads to lower audio quality, even though the same decoder architecture is used. AutoVC-F0 scores low in MOS as well: it suffers from its low time resolution and sometimes blurs out a sequence of short phonemes, and its generated F0 sometimes drifts off because the range is implicitly modeled on a relative scale. PitchShift is a reference in which PSOLA [28] alters the pitch without changing the timbre; hence it has a high quality score but low similarity scores. Ours-AC and Ours-AC-48k achieve significantly higher MOS scores, and bandwidth extension improves audio quality noticeably. Ours-AC2 wins by a small margin overall in the similarity test, though it is slightly behind Ours-AC in the MOS test. We conclude that Ours-AC and Ours-AC2 have strengths in quality and similarity respectively, and can be suitable for different tasks depending on the application.
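A sketch of the Wav2Vec baseline's feature pipeline described at the start of this paragraph, assuming the HuggingFace transformers interface for the facebook/wav2vec2-large-960h-lv60 checkpoint. The interpolation target length and treating the 32-dim linear projection as learned jointly with the decoder are assumptions about the paper's exact setup.

```python
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL = "facebook/wav2vec2-large-960h-lv60"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL)
w2v = Wav2Vec2Model.from_pretrained(MODEL).eval()           # frozen feature extractor
proj = torch.nn.Linear(w2v.config.hidden_size, 32)          # assumed trained with the decoder

def wav2vec_codes(wav_16k: torch.Tensor, mel_frames: int) -> torch.Tensor:
    """Extract frozen Wav2Vec features, resample to the mel time grid, project to 32 dims."""
    inputs = extractor(wav_16k.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        feats = w2v(inputs.input_values).last_hidden_state   # (1, T_w2v, 1024)
    feats = F.interpolate(feats.transpose(1, 2), size=mel_frames,
                          mode="linear", align_corners=False)  # match mel resolution
    return proj(feats.transpose(1, 2))                         # (1, mel_frames, 32)
```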

3.3. Analysis

Pitch/Energy Control. We test our model's pitch control ability by conditioning the decoder on a linearly increasing F0. Unlike Wav2Vec and our models, AutoVC-F0 cannot shift precisely to an absolute pitch scale; hence, we take the target pitch contour, normalize it with the same strategy AutoVC-F0 uses for pitch, and use it as the condition for AutoVC-F0. We calculate the objective scores (Tab. 1) using the L1 distance in semitones and in Hz over the voiced parts. We also report the VUV (Voiced vs. Unvoiced) error, which denotes the fraction of frames in which a voiced segment is misclassified as unvoiced or vice versa. Ours-AC2 and PitchShift give the best pitch control performance overall. For Ours-ACE, we are able to control the energy by conditioning on a new energy sequence. We calculate the L2 distance between the energy of the generated audio and the target energy. The test audios have a mean energy of 0.0386. When conditioning on a rescaled version of the original energy, we get a distance of 0.0049. When conditioning on a non-zero constant, we get a distance of 0.0160; this larger distance is due to the unvoiced parts. When conditioning on the constant 0, we get a distance of 0.0061, and the generated audio is mostly noise.
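A NumPy sketch of the objective pitch-control metrics described above. The F0 arrays are assumed to be frame-aligned, in Hz, with 0 marking unvoiced frames.

```python
import numpy as np

def pitch_metrics(f0_pred: np.ndarray, f0_target: np.ndarray) -> dict:
    """L1 pitch error (semitones and Hz) over voiced frames, plus VUV error."""
    voiced = (f0_pred > 0) & (f0_target > 0)
    semitone_err = np.abs(12 * np.log2(f0_pred[voiced] / f0_target[voiced])).mean()
    hz_err = np.abs(f0_pred[voiced] - f0_target[voiced]).mean()
    vuv_err = np.mean((f0_pred > 0) != (f0_target > 0))  # voiced/unvoiced mismatch rate
    return {"L1_semitone": semitone_err, "L1_hz": hz_err, "VUV_error": vuv_err}
```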

Table 1. Objective scores for pitch control. PitchShift and our method perform best for this task, outperforming other baselines.

Code Space. Fig. 4 shows the code space of Ours-AC2 (our best model) for two speakers uttering the same sentence (a, d); the same two utterances with pitch shifted linearly using PSOLA [28] (b, e); and the transcriptions (c, f). Each dimension of the code space appears as a horizontal band, sorted vertically by overall energy (only the top 4 are shown), with values normalized to [0, 1] and visualized with the "jet" colormap (blue = 0, red = 1); constant-value bands appear dark blue. The content codes are similar for the different speakers (a, d) and almost identical for the modified pitches (b, e), showing that our code space is invariant to pitch and speaker identity changes. Despite having 8 dimensions in the bottleneck, the model uses only 3 of them for the content code; dimensions 4-8 are constant. This shows that our AIC loss succeeds at limiting additional information from leaking into the code space.

Fig. 4. Code space comparisons: the same sentence uttered by two different speakers (a, d); after pitch shift (b, e); and transcription (c, f).

This paper proposes an invertible speech representation learning model based on voice conversion. We show through experiments and analysis that the learned representation disentangles speech components (content, pitch, speaker identity, energy), which can be controlled separately to synthesize high-quality speech. Although the proposed model outperforms baselines, our listening test shows a gap in voice similarity among all tested voice conversion methods. Thus, one potential avenue for future work is to disentangle prosody for more accurate synthesis. There are a few other downstream tasks for future study, such as speech recognition using content code, disentangling prosodic style, and building multi-speaker text-to-speech synthesis by generating content codes and F0.