Title: HiFi-GAN-2: Studio-Quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features
Authors: Jiaqi Su, Zeyu Jin, Adam Finkelstein
Published: 17th October 2021
Link: https://ieeexplore.ieee.org/document/9632770
Abstract
Modern speech content creation tasks such as podcasts, video voice-overs, and audio books require studio-quality audio with full bandwidth and balanced equalization (EQ). These goals pose a challenge for conventional speech enhancement methods, which typically focus on removing significant acoustic degradation such as noise and reverb so as to improve speech clarity and intelligibility. We present HiFi-GAN-2, a waveform-to-waveform enhancement method that improves the quality of real-world consumer-grade recordings, with moderate noise, reverb and EQ distortion, to sound like studio recordings. HiFi-GAN-2 has three components. First, given a noisy reverberant recording as input, a recurrent network predicts the acoustic features (MFCCs) of a clean signal. Second, given the same noisy input, and conditioned on the MFCCs output by the first network, a feed-forward WaveNet (modeled via multidomain multi-scale adversarial training) generates a clean 16kHz signal. Third, a pre-trained bandwidth extension network generates the final 48kHz studio-quality signal from the 16kHz output of the second network. The complete pipeline is trained via simulation of noise, reverb and EQ added to studio-quality speech. Objective and subjective evaluations show that the proposed method outperforms state-of-the-art baselines on both conventional denoising as well as joint dereverberation and denoising tasks. Listening tests also show that our method achieves close to studio quality on real-world speech content (TED Talks and the VoxCeleb dataset).
Date of Conference: 17-20 October 2021 Date Added to IEEE Xplore: 13 December 2021 INSPEC Accession Number: 21406573 Publisher: IEEE Conference Location: New Paltz, NY, USA
1. Introduction
Speech enhancement methods typically focus on alleviating severe noise and reverberation from recordings and improving intelligibility for downstream tasks such as speech recognition. Modern content creation scenarios (e.g., podcasts, video voice-overs, and audio books) would benefit from improving consumer-grade recordings (which suffer from moderate noise, reverb, and EQ distortion) to professional studio quality. Therefore, this paper addresses the speech enhancement problem in a different context from that of previous work: to improve single-channel consumer-grade recordings to sound like professional studio recordings. To address this goal requires solving the combined problem of denoising, dereverberation and equalization matching, while targeting a studio-quality dataset.
Recent advances in machine learning have enabled significant progress on the long-studied problems of speech enhancement, denoising and dereverberation. Typical methods tackle the problem by learning a spectral mapping [1], [2] or masking [3], [4] on the magnitude spectrogram, but the inverse STFT used to recover the waveform introduces audible artifacts due to missing or mismatched phase. Other methods predict phase alongside the spectrogram [5], [6], or learn a complex ratio mask [7], [8]. Another approach performs enhancement directly on the waveform, for example using WaveNet [9], [10] and Wave-U-Net [11], to avoid information loss or phase inversion. State-of-the-art methods like DEMUCS [12] and PoCoNet [13] have shown significant audio quality improvement, especially for hard denoising cases with low SNRs. Yet those methods learn from datasets like VoxCeleb [14], the Valentini dataset [15] and the DNS Challenge Dataset [16] that do not contain studio-quality target audio, thus limiting the capabilities of the learnt models. Moreover, these datasets do not simulate conditions matching typical consumer-grade recording environments, which limits their use in the context of the problem we address. As a result, consumer-grade audio can be improved by these methods, but the results remain far from studio quality.
Generative adversarial networks (GANs) have been widely shown effective in achieving high fidelity audio in speech processing and generation. Researchers in speech enhancement have explored GANs on spectral features [17], [18] as well as on waveform [19], [20]. HiFi-GAN [21] shows high fidelity results by applying discrimination in both the time domain and the time-frequency domain. Meanwhile, an emerging branch of research performs speech enhancement by re-synthesis [22], [23], given recent success in high-fidelity speech synthesis [24]. The idea is to extract speech features from the input audio and re-synthesize the clean waveform using neural vocoders. This approach aligns with our objective, as the synthesized audio is naturally free of noise and reverberation. The performance is however limited by the quality of existing vocoders, as most do not generalize well across speakers and tend to generate "robotic" voices. They are also susceptible to inaccurately estimated speech features, leading to speech content distortion and unnatural prosody.
This paper proposes HiFi-GAN-2, which builds on our previous HiFi-GAN method [21] and targets studio-quality output. The previous HiFi-GAN uses a feed-forward WaveNet together with deep feature matching in multi-domain and multi-scale discriminators. HiFi-GAN-2 incorporates a separate recurrent neural network to predict the acoustic features of a clean target from those of the noisy input. The WaveNet then conditions on the predicted acoustic features to generate the clean audio. This modification significantly improves output audio quality. We believe that the acoustic features, estimated from the entire input audio sequence, help the WaveNet (which has a limited receptive field) to generate audio that more faithfully matches the original speaker and content. We evaluate the proposed method using objective and subjective tests in three application scenarios: (1) joint denoising and dereverberation for real-world recordings, (2) enhancement for real-world speech content at full bandwidth, and (3) conventional denoising. We also show in subjective evaluation that conventional denoising datasets that are of low quality can hinder model performance, and thus encourage use of studio-quality datasets in future research.
Figure 1:
Architecture. A pre-trained network (right) predicts acoustic features (MFCCs) of clean speech based on a noisy input spectrogram. A WaveNet (left) generates clean speech from the same noisy input, locally conditioned on the predicted MFCCs. Adversarial training with deep feature matching involves a spectrogram discriminator and multiple waveform discriminators for the signal at different resolutions.
2. Method
HiFi-GAN-2 builds on top of our previous work HiFi-GAN [21] for speech denoising and dereverberation, to further push towards studio quality. HiFi-GAN uses an end-to-end feed-forward WaveNet together with deep feature matching in multi-scale multi-domain discriminators. Although HiFi-GAN is shown successful for obtaining clean high-fidelity audio recordings from noisy reverberant conditions, we observe inconsistency in speaker identity when noise and reverb are strong. This is likely caused by the ambiguity in disentangling speech content and speaker identity from environment effects (EQ and reverb). Moreover, the feed-forward WaveNet is not able to enforce consistent speaker identity due to limited receptive field and lack of global context. Thus, the network would benefit from extra information that helps it to infer speaker identity and content, i.e. the clean speech. One possible solution is to use speaker embedding as global conditioning, similar to that of multispeaker speech synthesizer [25], but we did not observe quality improvement, possibly due to the utterance-level fuzziness of the embedding space. Instead, we propose conditioning the WaveNet on acoustic features that contain clean speaker identity and speech content information. Hence, we incorporate a separate recurrent neural network to predict clean acoustic features from the input noisy reverberant audio, which is then used as time-aligned local conditioning for HiFi-GAN. Such design combines benefits from waveform-to-waveform conversion, which avoids information loss and artifacts in STFT/ISTFT processes, and the effectiveness of acoustic features in modeling human perception of speech over a long period of context. The overall architecture is shown in Figure 1.
2.1. Acoustic Feature Prediction Network
We propose a network inspired by Tacotron 2 [25] for acoustic feature prediction. It consists of three pre-processing convolution blocks (1D convolution, batch normalization and ReLU), three layers of bi-directional LSTMs, a linear projection layer, and a post-net of five convolution blocks (1D convolution, batch normalization and Tanh activation except for the last block). We use channel size of 512 across all the layers, kernel size of 5 for the convolutions, momentum of 0.9 for the batch normalization layers, and dropout of 0.2 for the recurrent layers. This network is trained using the acoustic feature of simulated noisy reverberant audio as input and that of clean audio as target. It minimizes the MSE losses of acoustic feature as well as the delta (first order difference) of the feature, for the outputs both before and after post-net.
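The training objective described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the function names (`mse`, `delta`, `feature_loss`) are hypothetical, features are represented as lists of frames, and the four terms are summed with equal weight, which is an assumption because the weighting is not stated here.

```python
def mse(a, b):
    """Mean squared error between two equally shaped [frames][coeffs] lists."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        for xa, xb in zip(frame_a, frame_b):
            total += (xa - xb) ** 2
            count += 1
    return total / count


def delta(feat):
    """First-order difference of the features along the time (frame) axis."""
    return [[cur - prev for cur, prev in zip(feat[t], feat[t - 1])]
            for t in range(1, len(feat))]


def feature_loss(pre_postnet, post_postnet, target):
    """MSE on the features plus MSE on their deltas, for both network outputs.

    Equal weighting of the four terms is an assumption of this sketch.
    """
    loss = 0.0
    for pred in (pre_postnet, post_postnet):
        loss += mse(pred, target)
        loss += mse(delta(pred), delta(target))
    return loss
```

A prediction that differs from the target by a constant offset is penalized only by the plain MSE terms, since its deltas already match; the delta terms instead penalize errors in how the features change from frame to frame.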
To select a proper acoustic feature, we examined the log mel spectrogram (Mel) and Mel-frequency Cepstral Coefficients (MFCCs). While Mel has higher frequency resolution, MFCCs are more robust to noise. Our experiments found that predicting 18-coefficient MFCCs of the target clean audio from the 80-coefficient Mel of the input audio yields the best result. Since each cepstral coefficient has a different range of values, the target MFCCs are also globally normalized by subtracting the mean from each coefficient and dividing by four times the standard deviation, using the clean audio dataset's statistics, following the practice of Qian et al. [26]. We did not observe statistically significant improvement in changing the number of cepstral coefficients to 24; yet the performance drops with 30 or 12 coefficients. Thus, we stick to 18 coefficients for further experiments. Our ablation study is discussed in detail in Section 3.1.
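The global normalization can be written out as a small worked example. The sketch below is a hedged illustration using only the Python standard library: the helper names `global_stats` and `normalize` are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
import statistics


def global_stats(frames):
    """Per-coefficient mean and standard deviation over all clean frames.

    `frames` is a list of frames, each a list of cepstral coefficients.
    Population standard deviation is an assumption of this sketch.
    """
    columns = list(zip(*frames))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def normalize(frames, means, stds, scale=4.0):
    """Subtract the mean and divide by `scale` times the standard deviation."""
    return [[(x - m) / (scale * s) for x, m, s in zip(frame, means, stds)]
            for frame in frames]
```

The statistics are computed once over the clean dataset and then reused for every utterance, so each coefficient ends up on a comparable scale regardless of its original range.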
2.2. Conditional WaveNet
The waveform denoising network is a feed-forward WaveNet [9] with local conditioning [27]. It uses non-causal dilated convolutions with dilation rate as a power of two to enable large receptive field. We use three WaveNet stacks (totaling 30 layers) and a channel size of 128 across the network. Our early experiments show vanishing benefit to further increasing the number of WaveNet stacks, as well as degraded performance with other channel sizes (64 or 256). We use weight normalization on all layers to accelerate convergence.
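To give a feel for how large the receptive field of such a network is, the following sketch computes it for a stack of non-causal dilated convolutions. Several values are assumptions not stated in the text: a kernel size of 3, ten layers per stack (30 layers split evenly over three stacks), and dilations that restart at 1 in each stack and double up to 512. The function name is hypothetical.

```python
def receptive_field(stacks=3, layers_per_stack=10, kernel_size=3):
    """Receptive field (in samples) of stacked dilated 1D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    Kernel size and the per-stack dilation pattern are assumptions.
    """
    dilations = [2 ** i for i in range(layers_per_stack)] * stacks
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    samples = receptive_field()
    print(samples, "samples,", samples / 16_000, "seconds at 16kHz")
```

Under these assumptions the network sees 6139 samples, well under half a second at 16kHz, which is consistent with the observation that the WaveNet alone lacks utterance-level context.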
The prediction from the pre-trained acoustic feature prediction network is up-sampled using linear interpolation along the time axis to match the length of the input waveform and is applied via additive local conditioning as described in the original WaveNet design [27]: in each WaveNet layer, it is passed through a 1×1 convolution before being added to the filter activation; the same process is applied to the gate activation. We hypothesize that the WaveNet can utilize the local conditioning in two ways: (1) if the acoustic features contain sufficient information, the WaveNet may serve like a vocoder where it re-synthesizes speech using the phase of the input waveform; (2) or, the WaveNet utilizes this auxiliary information to gain access to a cleaner representation of speech content as well as larger temporal context. Our experiment shows that (2) is more likely the case, as the WaveNet with randomized acoustic features can still generate intelligible speech but it sounds muffled and less recognizable as the original speaker.
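The two steps of the conditioning path, up-sampling and additive gating, can be sketched as follows. This is a simplified scalar illustration, not the authors' code: the function names are hypothetical, the end-point-aligned interpolation grid is an assumption, and the conditioning terms stand in for the outputs of the 1×1 convolutions.

```python
import math


def upsample_linear(frames, out_len):
    """Linearly interpolate a [frames][coeffs] sequence to `out_len` steps.

    The first and last frames are aligned with the first and last output
    steps; this alignment is an assumption of the sketch.
    """
    n_frames = len(frames)
    if n_frames == 1 or out_len == 1:
        return [list(frames[0]) for _ in range(out_len)]
    out = []
    for n in range(out_len):
        pos = n * (n_frames - 1) / (out_len - 1)
        lo = min(int(pos), n_frames - 2)
        w = pos - lo
        out.append([(1 - w) * a + w * b
                    for a, b in zip(frames[lo], frames[lo + 1])])
    return out


def gated_unit(filter_act, gate_act, cond_filter, cond_gate):
    """Gated activation with additive conditioning on both branches."""
    sigmoid = 1.0 / (1.0 + math.exp(-(gate_act + cond_gate)))
    return math.tanh(filter_act + cond_filter) * sigmoid
```

Because the conditioning is added before the nonlinearities, it shifts both what the layer computes (filter branch) and how much of it passes through (gate branch) at every time step.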
2.3. Adversarial Training and Loss Functions
The adversarial training helps to improve perceptual quality and removes artifacts and noise. We follow the same design as HiFi-GAN, using a spectral discriminator and a set of waveform discriminators. The spectral discriminator takes in the 128-coefficient log mel-spectrogram. It consists of four stacks of 2D convolution layer, batch normalization and Gated Linear Unit (GLU), and lastly a convolution layer followed by global average pooling, similar to the one used in StarGAN-VC [28]. It uses kernel sizes of (7, 9), (5, 8), (4, 8), (4, 6) and stride sizes of (1, 1), (1, 2), (2, 2), (2, 2) for the stacks, and the last convolution layer uses a kernel size of (32, 5). The channel size is 32 across all the layers. Meanwhile, a set of three waveform discriminators operate on the output signal down-sampled by different ratios, each a power of two, following the design in MelGAN [24]. Each waveform discriminator is composed of a set of grouped convolutions and global average pooling at the end, with Leaky ReLU between the layers. Specifically, the kernel sizes are 15, 41, 41, 41, 41, 5, 3; stride sizes 1, 4, 4, 4, 4, 1, 1; channel sizes 16, 64, 256, 1024, 1024, 1024, 1; and group sizes 1, 4, 16, 64, 256, 1, 1. The adversarial losses take the hinge loss formulation.
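For readers unfamiliar with the hinge formulation, a minimal sketch follows. It shows the commonly used hinge GAN objective (the form used in MelGAN, among others); the exact variant used for HiFi-GAN-2 is not spelled out here, so treat this as an assumption. The function names are hypothetical, and the inputs are plain lists of discriminator scores.

```python
def discriminator_hinge_loss(real_scores, fake_scores):
    """Hinge loss for the discriminator.

    Real scores are pushed above +1 and fake scores below -1; scores
    already beyond the margin contribute nothing.
    """
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_hinge_loss(fake_scores):
    """Generator loss: raise the discriminator's score on generated audio."""
    return -sum(fake_scores) / len(fake_scores)
```

With several discriminators, as in the design above, one such pair of losses would be computed per discriminator and combined; how they are weighted is not specified here.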
The supervised loss function of the generator is composed of an L1 waveform loss, and L1 losses of multiple log spectrograms with different FFT window sizes (i.e., 512, 1024, and 2048 for 16kHz audio, each with one-fourth as its hop size). In addition, we apply the adversarial losses, as well as the feature matching losses [24] of the discriminators, which are computed as the L1 difference of the deep features between the generated audio and the ground-truth clean audio. The feature matching loss helps to stabilize GAN training and prevents the generator from mode collapse.
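The spectrogram resolutions and the feature matching term can be made concrete with a short sketch. The names below are hypothetical, deep features are represented as flat lists per discriminator layer, and averaging the per-layer distances (rather than summing or weighting them) is an assumption.

```python
def stft_configs(window_sizes=(512, 1024, 2048)):
    """Window and hop sizes for the multi-resolution spectrogram losses."""
    return [{"win": w, "hop": w // 4} for w in window_sizes]


def l1_distance(a, b):
    """Mean absolute difference between two equally long lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_matching_loss(real_feats, fake_feats):
    """Average L1 distance between per-layer discriminator features.

    `real_feats` and `fake_feats` hold one flat feature list per layer.
    Averaging over layers is an assumption of this sketch.
    """
    distances = [l1_distance(r, f) for r, f in zip(real_feats, fake_feats)]
    return sum(distances) / len(distances)
```

The hop sizes come out as 128, 256 and 512 samples, so the three spectrogram losses trade time resolution against frequency resolution.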
3. Experiments
We evaluate our method, ablations and various baselines on the studio-quality speech enhancement task as well as the conventional denoising task. The term "studio-quality" implies that the clean audio used in training is recorded and professionally edited in an anechoic studio, at a sample rate >=44.1kHz. The "clean" category of the Device and Produced Speech (DAPS) Dataset [29] fits this requirement. Due to the limited bandwidth of baseline methods, we first conduct a comparative study at 16kHz, on the joint denoising and dereverberation task on the DAPS dataset. Then we expand the experiment to real-world recordings used in content creation, evaluated at full 48kHz. Finally, we apply our method to the conventional denoising task to show its broad applicability.
We used the architecture described in Section 2 for experiments. We compute Mel and MFCCs using an FFT length of 512 and hop size of 160 at 16kHz. We first train the acoustic feature prediction network (24M params) for 100k steps using the Adam optimizer with a batch size of 64 and input length of 256 frames. The learning rate starts at 0.001 and is halved every 20k steps. Then we train the WaveNet (10M params) with the weights of the acoustic feature prediction network fixed. The WaveNet first trains for 1000k steps with learning rate 0.001, using the waveform and the spectrogram losses. Next we add randomly initialized discriminators to the output of the WaveNet (generator). We use learning rate 0.00001 for the generator (adversarial loss, feature matching loss and previously used loss), and 0.001 for the discriminators, for 100k steps. A batch size of 6 and a sample length of 22K are used throughout training. On a Tesla V100, each of the three training stages takes seven days, and inference takes 0.5 seconds per second of input audio. Audio samples for our experiments are available at: https://pixl.cs.princeton.edu/pubs/Su_2021_HSS/
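Two pieces of arithmetic in this setup are easy to check by hand: the learning-rate schedule of the prediction network and the duration covered by its 256-frame inputs. The sketch below assumes a staircase schedule (the rate drops at exact multiples of 20k steps); the function names are hypothetical.

```python
def prediction_lr(step, base_lr=1e-3, halve_every=20_000):
    """Learning rate that starts at `base_lr` and halves every 20k steps.

    A staircase schedule is assumed rather than a smooth decay.
    """
    return base_lr * 0.5 ** (step // halve_every)


def frames_to_seconds(frames, hop=160, sample_rate=16_000):
    """Duration covered by `frames` feature frames at the given hop size."""
    return frames * hop / sample_rate
```

With a hop of 160 samples at 16kHz, each frame advances 10ms, so a 256-frame training input spans about 2.56 seconds; over the 100k training steps the rate is halved four times before the final step.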
3.1. Joint Denoising and Dereverberation
The DAPS Dataset provides pairs of recordings of the same set of studio-quality speech re-recorded under twelve different room environments, and thus aligns with our goal of converting real-world recordings to studio-quality recordings. One male voice (m10) and one female voice (f10) are held out for evaluation. We also hold out 2 minutes of audio per training voice for validation. Our training set is constructed from the rest of the DAPS Dataset's clean set following the same data simulation and augmentation procedure as described in HiFi-GAN [21]. We convolve these studio-quality speech recordings with the 270 impulse responses from the MIT Impulse Response Survey Dataset [30], and then add noise from the REVERB Challenge database [31] and the ACE Challenge database [32]. The data augmentation of HiFi-GAN is applied to the speech, impulse response and noise samples.
Table 1: Objective measures on the DAPS dataset.
Our best full approach HiFi-GAN-2 consists of an acoustic feature prediction network that predicts globally normalized 18-coefficient MFCCs of the clean target from the 80-coefficient log mel spectrogram of the noisy input, the WaveNet conditioning on the predicted 18-coefficient MFCCs, and GAN training. We conducted ablation experiments to address the following four design questions, and accordingly eight variants of our approach: Q1: Should we train the WaveNet with ground truth acoustic features or generated ones? Q2: Should we use MFCCs or other acoustic features (e.g. log mel spectrogram) for conditioning? Q3: Should we predict clean MFCCs from the input audio's MFCCs directly or from its log mel spectrogram? Q4: Should we apply global normalization, local normalization or no normalization for the conditioning?
**Model A:** Same as HiFi-GAN-2, but no GAN training ("no GAN").
**Model A-GT:** Same as Model A, but conditioning on ground-truth clean acoustic features for training ("GT").
**Model B:** Same as Model A, but the prediction network takes globally normalized 18-coefficient MFCCs as input ("mfcc2mfcc").
**Model C:** Same as Model A, but the MFCCs are locally normalized using instance statistics ("local norm").
**Model D:** Same as Model A, but the prediction network outputs the 80-coefficient log mel spectrogram ("mel2mel").
**Model D-GT:** Same as Model D, but conditioning on ground-truth clean acoustic features for training ("GT").
**Model A-GT-GAN:** Model A-GT with GAN training.
**Model D-GAN:** Model D with GAN training.
We also compare to four state-of-the-art baselines: our previous HiFi-GAN [21] with two WaveNet stacks and HiFi-GAN (3×10) with three stacks (as in this work), a spectral-domain method using complex ratio masking [8] (FullSubNet), a time-domain method using an encoder-decoder structure [12] (DEMUCS), and a speech enhancement by re-synthesis method [23] (Regen). DEMUCS and FullSubNet originally targeted speech denoising, so we re-train their released models on our training set. Meanwhile, since speech re-synthesis may completely change the appearance of the signal, Regen is compared in the subjective evaluation only, using its released audio samples re-sampled to 16kHz.
Table 1 shows the objective metrics [31] for speech denoising, dereverberation and enhancement. All variants of our proposed methods outperform all the baselines in PESQ and SRMR. Model A-GT, Model A and Model D score the top three. Though HiFi-GAN-2's objective score is lower, it has the highest perceptual quality in the subjective evaluations. Adding one WaveNet stack brings moderate improvement to the perceptual quality but not the objective measures. Adding conditioning to the WaveNet improves the objective scores universally by a large margin. Globally normalized 18-coefficient MFCCs score the best as conditioning (Q2, Q4), and they can be more accurately predicted from Mel than from MFCCs (Q3). Training with ground-truth conditioning can degrade test performance due to the mismatch between training and inference conditions, as is the case in Model D and D-GT. However, training with ground-truth MFCCs (Model A-GT) outperforms training with generated ones (Model A) (Q1). This may be because the predicted MFCCs, as a compact representation, are sufficiently close to the ground truth. Although GAN training lowers objective scores, we observe significant perceptual quality improvement in the listening tests. GAN training helps the output to match the clean audio's data distribution (hence it sounds realistic) rather than directly approximating the ground truth.
Since the objective scores may not correlate well with perceptual quality [33], we also conduct Mean Opinion Score (MOS) tests using Amazon Mechanical Turk (AMT) on the baselines and our top performing methods. Using a studio-quality recording as the high anchor and audio with noise (0dB SNR) as the low anchor, a subject is asked to rate the sound quality of an audio recording on a scale of 1 to 5, with 1=Bad, 5=Excellent. We collected 449 valid HITs with 208 unique workers, totalling 11674 ratings. The MOS scores are shown in Figure 2(a). Our methods outperform all the baselines, and HiFi-GAN-2 achieves the best average rating of 3.90 (±0.03, p < 0.05 over second best in unpaired t-test). Therefore, adding conditioning and adding GAN training each bring steady perceptual quality improvement. While Model A and Model A-GT are rated the same, training on generated MFCCs receives more improvement from adversarial training than training on ground-truth ones, as the former exposes artifacts caused by inaccuracy in MFCC prediction to the discriminators. Model A-GT-GAN can be an efficient alternative to HiFi-GAN-2, as training on GT is easier.
3.2. Real-World Speech at Full Bandwidth
We gather real-world consumer-grade recordings from TED Talks (www.ted.com) and VoxCeleb1 [14] to further evaluate whether our method can meet speech content creation needs. We selected 10 male and 10 female speakers from TED 2004–06 and sampled two random sentences (5–6 seconds) per episode. For VoxCeleb1, we used the Speech Transmission Index (STI) [34] to label each recording, and randomly sampled 50 audio clips to cover an STI range of 0.75-0.99 uniformly. Details are on our experiment result website.
We used the same trained models from Section 3.1, and extended output sample rate from 16kHz to 48kHz using the bandwidth extension model of Su et al. [35] trained also on the DAPS Dataset. We conducted the same MOS test as in Section 3.1, including 348 valid HITs with 128 unique workers and 7656 ratings. The result in Figure 2(b) shows that HiFi-GAN-2 performs the best, and works well together with the bandwidth extension algorithm, achieving close to studio quality for the resulting 48kHz audio (4.27 ±0.03, p < 0.05 over second best, p < 0.0001 over all baselines).
Furthermore, we conducted experiments to show that datasets used as clean audio in conventional speech enhancement can be of low quality and thus hinder the performance of algorithms trained on them. For example, CREMA-D [36], a crowd-sourced emotional dataset, contains a similar amount of reverb and noise to consumer-grade recordings. We conducted the same MOS test as above on the CREMA-D dataset using the first 62 speakers speaking the sentence labelled "IWL" in neutral emotion (84 unique workers, 4496 ratings). The result shows that HiFi-GAN-2 at full bandwidth scores 4.254 while the original dataset only scores 2.349. It is also worth mentioning that DEMUCS trained on the DNS Challenge dataset (which uses CREMA-D as clean data) scored 2.813, far lower than HiFi-GAN-2 trained on the DAPS dataset.
Figure 2:
MOS scores: (a) Joint denoising and dereverberation on the DAPS dataset; (b) enhancement on real-world speech content.
Table 2: Objective measures on the Valentini dataset.
3.3. Conventional Denoising
To show that the proposed method also works in the conventional setting, we experimented with the common benchmark Valentini dataset [15] for speech denoising. We follow the standard split of 28 speakers for training and 2 speakers for test. Table 2 shows that our methods outperform all the other state-of-the-art methods on the objective measures, and that Model A-GT achieves the highest scores so far to our knowledge. This is consistent with our previous observation that training with ground-truth conditioning without GAN is most favored by the objective measures.
4. Conclusion
In this paper, we characterize the difference between conventional speech enhancement and studio-quality audio enhancement, and present HiFi-GAN-2, a waveform-to-waveform enhancement method that improves the quality of real-world amateur recordings to studio quality. HiFi-GAN-2 consists of a recurrent neural network that predicts acoustic features (i.e. MFCCs) of the clean target from the input audio, and a feed-forward WaveNet for waveform enhancement that conditions on the predicted acoustic features, together with multi-domain multi-scale adversarial training. A pre-trained bandwidth extension network can be optionally applied to generate the final 48kHz studio quality signal from the output of HiFi-GAN-2. Extensive evaluations show that the proposed method outperforms all the other state-of-the-art baselines in both objective metrics and subjective metrics on joint dereverberation and denoising tasks as well as conventional denoising task.