Title: Real Time Speech Enhancement in the Waveform Domain
Authors: Alexandre Défossez, Gabriel Synnaeve, Yossi Adi
Published: 23rd June 2020 (Tuesday) @ 09:19:13
Link: http://arxiv.org/abs/2006.12847v3
Abstract
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is optimized on both time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further improve model performance and its generalization abilities. We perform evaluations on several standard benchmarks, both using objective metrics and human judgements. The proposed model matches state-of-the-art performance of both causal and non causal methods while working directly on the raw waveform.
- Propose a realtime version of DEMUCS
- Model estimating the noise distribution to denoise
- Non-stationary noise
- “babbling” with multiple speakers i.e. cocktail party problem
- neural methods perform better, including at single-channel source separation
- enhanced samples improve (downstream) ASR performance
Methods
- Adapt DEMUCS, which was designed for music source separation, to causal speech enhancement
- causal models are suitable for the streaming setting
- 💡 (musical) source separation is a similar task to speech enhancement: drop the non-speech source and you are left with clean speech
- focus on additive noise (question: is this realistic?)
- DEMUCS consists of (see the sketch after this list):
- a multi-layer convolutional encoder and decoder with U-net skip connections
- a sequence modeling LSTM applied on the encoder output
- uni-LSTM for causal modelling
- bi-LSTM for non-causal (non-streaming) modelling
- characterized by:
- number of layers L
- initial number of hidden channels H
- layer kernel size K and stride S
- resampling factor U
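As a rough illustration, the architecture maps to a small PyTorch module. The sketch below is a simplification under stated assumptions: it keeps the L convolutional encoder/decoder layers with GLU activations, U-Net skip connections, and a unidirectional (causal) LSTM, but omits the resampling factor U, input normalization, and the exact length padding of the released facebookresearch/denoiser code, so it is an illustration rather than the authors' implementation.

```python
# Minimal DEMUCS-style sketch (not the authors' released implementation).
# Notation follows the paper: L layers, H initial hidden channels,
# kernel size K, stride S; the resampling factor U is omitted here.
import torch
from torch import nn


class DemucsSketch(nn.Module):
    def __init__(self, L=5, H=48, K=8, S=4):
        super().__init__()
        self.encoder = nn.ModuleList()
        self.decoder = nn.ModuleList()
        chin, hidden = 1, H
        for layer in range(L):
            self.encoder.append(nn.Sequential(
                nn.Conv1d(chin, hidden, K, S), nn.ReLU(),
                nn.Conv1d(hidden, 2 * hidden, 1), nn.GLU(dim=1)))
            # Decoder layers are built in reverse order; the last one applied
            # (layer == 0) outputs the waveform, so it has no ReLU.
            self.decoder.insert(0, nn.Sequential(
                nn.Conv1d(hidden, 2 * hidden, 1), nn.GLU(dim=1),
                nn.ConvTranspose1d(hidden, chin, K, S),
                nn.ReLU() if layer > 0 else nn.Identity()))
            chin, hidden = hidden, 2 * hidden
        # Uni-directional LSTM on the encoder output => causal / streamable.
        self.lstm = nn.LSTM(chin, chin, num_layers=2, batch_first=True)

    def forward(self, noisy):  # noisy: (batch, 1, time)
        skips, x = [], noisy
        for encode in self.encoder:
            x = encode(x)
            skips.append(x)  # keep activations for the U-net skip connections
        x = self.lstm(x.permute(0, 2, 1))[0].permute(0, 2, 1)
        for decode in self.decoder:
            skip = skips.pop()
            x = decode(x + skip[..., :x.shape[-1]])
        return x  # slightly shorter than the input unless the input is padded


# Example: one second of 16 kHz audio (output length differs slightly because
# this sketch does not pad the input to a "valid" length as the real model does).
model = DemucsSketch()
out = model(torch.randn(1, 1, 16000))
```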
- Loss objective:
- take the loss over several STFT parameter settings, i.e. a multi-resolution STFT loss (see the sketch after this list) with:
- number of FFT bins ∈ {512, 1024, 2048}
- hop sizes ∈ {50, 120, 240}
- window lengths ∈ {240, 600, 1200}
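A minimal sketch of how the multi-resolution STFT term could be computed over exactly these three settings, assuming the common spectral-convergence plus log-magnitude formulation and a positional pairing of the FFT sizes, hop sizes and window lengths listed above; in the paper this term is used alongside an L1 loss on the waveform, and the released code is the reference, not this sketch.

```python
# Multi-resolution STFT loss sketch (assumed formulation, not the paper's code).
import torch
import torch.nn.functional as F

RESOLUTIONS = [  # (n_fft, hop size, window length), paired positionally
    (512, 50, 240),
    (1024, 120, 600),
    (2048, 240, 1200),
]


def stft_mag(x, n_fft, hop, win):
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)


def multi_resolution_stft_loss(estimate, clean):
    """estimate, clean: (batch, time) waveforms."""
    loss = 0.0
    for n_fft, hop, win in RESOLUTIONS:
        s_est = stft_mag(estimate, n_fft, hop, win)
        s_ref = stft_mag(clean, n_fft, hop, win)
        # Spectral convergence term.
        sc = torch.norm(s_ref - s_est, p="fro") / torch.norm(s_ref, p="fro")
        # Log-magnitude L1 term.
        mag = F.l1_loss(torch.log(s_est), torch.log(s_ref))
        loss = loss + sc + mag
    return loss / len(RESOLUTIONS)
```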
Data
- Valentini Noisy Speech Database
- “DNS” dataset i.e. data from The INTERSPEECH 2020 Deep Noise Suppression Challenge Datasets, Subjective Testing Framework, and Challenge Results
Augmentations:
- Remix: shuffle noise vs speech within a batch (with a one-second shift) to form new noisy samples; see the sketch after this list
- Band-Mask: a band-stop filter removing 20% of the frequencies, with the band sampled randomly on the mel scale
- Revecho: adds decaying echoes of the speech and noise
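A minimal sketch of the Remix idea, assuming clean/noise pairs are available as separate batched tensors; the function name and shapes are illustrative, not the authors' augmentation code.

```python
# Remix augmentation sketch: shuffle which noise goes with which clean utterance.
import torch


def remix(clean: torch.Tensor, noise: torch.Tensor):
    """clean, noise: (batch, time) waveforms already aligned in length."""
    perm = torch.randperm(noise.shape[0], device=noise.device)
    noisy = clean + noise[perm]      # new noisy mixtures
    return noisy, clean              # (input, target) pair for training
```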
Evaluation Methods
Note, from the intro, Défossez, Synnaeve and Adi write:
Although multiple metrics exist to measure speech enhancement systems, these have been shown to not correlate well with human judgements. Hence, we report results for both objective metrics as well as human evaluation.
Objective metrics:
- PESQ: Perceptual evaluation of speech quality - wide-band version recommended in ITU-T P.862.2 (from -0.5 to 4.5)
- Short-Time Objective Intelligibility (STOI) (from 0 to 100); see the sketch after this list for computing both
- CSIG: Mean opinion score (MOS) prediction of the signal distortion attending only to the speech signal (from 1 to 5)
- CBAK: MOS prediction of the intrusiveness of background noise (from 1 to 5)
- COVL: MOS prediction of the overall effect (from 1 to 5)
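As a concrete (assumed) example of computing the first two metrics, the widely used pesq and pystoi Python packages can be dropped in; the paper does not prescribe a particular toolkit.

```python
# PESQ (wide-band) and STOI via the `pesq` and `pystoi` packages (assumed tooling).
import numpy as np
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

SR = 16_000  # wide-band speech


def objective_scores(clean: np.ndarray, enhanced: np.ndarray) -> dict:
    return {
        "pesq_wb": pesq(SR, clean, enhanced, "wb"),               # roughly -0.5 .. 4.5
        "stoi": 100 * stoi(clean, enhanced, SR, extended=False),  # 0 .. 100
    }
```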
Subjective metrics:
- MOS study as recommended in ITU-T P.835
- crowd source evaluation using the CrowdMOS package
- Randomly sample 100 utterances and each one was scored by 15 different raters on 3 axes:
- level of distortion
- intrusiveness of background noise
- overall quality
- …these are the same three axes as in DNSMOS P.835 (A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors), with this paper using “level of distortion” in place of “speech quality” - it’s a denoising paper, after all
Baseline (previous state of the art) was DeepMMSE, from A Deep Learning Approach to MMSE-Based Noise Power Spectral Density Estimation
Real-time Evaluation
We computed the Real-Time Factor (RTF, i.e. the time to enhance a frame divided by the duration of the stride) under the streaming setting to better match real-world conditions.
We benchmark this implementation on a quad-core Intel i5 CPU (2.0 GHz, up to AVX2 instruction set).
- The RTF is 1.05 for the H=64 version
- The RTF is 0.6 for the H=48 version
- When restricting execution to a single core, the H=48 model still achieves an RTF of 0.8, so it is realistic to use in real conditions
- e.g. alongside video call software (a measurement sketch for the RTF follows this list)
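A minimal sketch of how such a streaming RTF measurement could be set up, assuming the model exposes a call that enhances one fixed-size chunk at a time; the interface and chunk size are placeholders, not the paper's benchmark code.

```python
# RTF measurement sketch: average time to enhance one chunk / chunk duration.
import time

import torch


def measure_rtf(model, stride_samples, sample_rate=16_000, num_frames=1_000):
    chunk = torch.zeros(1, 1, stride_samples)          # dummy audio chunk
    stride_duration = stride_samples / sample_rate     # seconds of audio per call
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(num_frames):
            model(chunk)  # assumed streaming call: enhance one new chunk
        elapsed = time.perf_counter() - start
    return (elapsed / num_frames) / stride_duration    # < 1.0 means faster than real time
```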
Effect on ASR models
- They do this evaluation by synthetically noising LibriSpeech samples - is this standard practice?
- a bit odd, since the mixtures aren’t naturally noisy recordings
To that end, we synthetically generated noisy data using the LIBRISPEECH dataset [31] together with noises from the test set of the DNS [19] benchmark. We created noisy samples in a controlled setting where we mixed the clean and noise files with SNR levels ∈ {0, 10, 20, 30}.
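A minimal sketch of mixing clean speech and noise at a target SNR, which is presumably how the controlled noisy set is built; the exact scaling convention and any clipping handling are assumptions.

```python
# Mix clean speech with noise at a target SNR (in dB); assumed procedure.
import torch


def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """clean, noise: 1-D waveforms of equal length."""
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    # Scale the noise so that 10 * log10(clean_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```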
| Viterbi WER on | dev-clean | dev-clean (enhanced) | dev-other | dev-other (enhanced) |
| --- | --- | --- | --- | --- |
| original (no noise) | 2.1 | 2.2 | 4.6 | 4.7 |
| noisy SNR 0 | 12.0 | 6.9 | 21.1 | 14.7 |
| noisy SNR 10 | 9.8 | 6.3 | 18.4 | 13.1 |
| noisy SNR 20 | 5.2 | 4.0 | 11.7 | 9.4 |
| noisy SNR 30 | 3.3 | 2.9 | 7.6 | 7.2 |