Title: Real Time Speech Enhancement in the Waveform Domain
Authors: Alexandre Défossez, Gabriel Synnaeve, Yossi Adi
Published: 23rd June 2020 (Tuesday) @ 09:19:13
Link: http://arxiv.org/abs/2006.12847v3

Abstract

We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is optimized on both time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further improve model performance and its generalization abilities. We perform evaluations on several standard benchmarks, both using objective metrics and human judgements. The proposed model matches state-of-the-art performance of both causal and non causal methods while working directly on the raw waveform.


  • Propose a realtime version of DEMUCS
  • Model estimating the noise distribution to denoise
    • Non-stationary noise
    • “babbling” with multiple speakers i.e. cocktail party problem
  • neural methods are increasingly better at single-channel source separation
  • enhanced samples improve (downstream) ASR performance

Methods

  • Adapt DEMUCS, originally designed for music source separation, to causal speech enhancement
    • causal models are suitable for the streaming setting
    • 💡 (musical) source separation is a similar task to speech enhancement: drop the non-speech source, leaving clean speech
  • focus on additive noise (question: is this realistic?)
  • DEMUCS consists of:
    • a multi-layer convolutional encoder and decoder with U-net skip connections
    • a sequence modeling LSTM applied on the encoder output
      • uni-LSTM for causal modelling
      • bi-LSTM for non-causal (non-streaming) modelling
    • characterized by:
      • number of layers L
      • initial number of hidden channels H
      • layer kernel size K and stride S
      • resampling factor U
  • Loss objective (see the sketch after this list):
    • take the loss over several different STFT parameterizations, i.e. a loss at multiple resolutions, with
      • number of FFT bins ∈ {512, 1024, 2048}
      • hop sizes ∈ {50, 120, 240}
      • window lengths ∈ {240, 600, 1200}
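A minimal PyTorch sketch of the architecture and multi-resolution STFT loss as I understand them; the layer widths, GLU bottlenecks, and the exact spectral loss terms (spectral convergence plus log-magnitude L1) are my assumptions, not the paper's reference code, and the resampling factor U and streaming buffering are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalDemucsSketch(nn.Module):
    """Toy encoder/decoder with U-Net skips and a unidirectional LSTM,
    parameterized by depth L, hidden size H, kernel K and stride S."""

    def __init__(self, L: int = 5, H: int = 48, K: int = 8, S: int = 4):
        super().__init__()
        self.encoder = nn.ModuleList()
        self.decoder = nn.ModuleList()
        chin, chout = 1, H
        for i in range(L):
            self.encoder.append(nn.Sequential(
                nn.Conv1d(chin, chout, K, S), nn.ReLU(),
                nn.Conv1d(chout, 2 * chout, 1), nn.GLU(dim=1)))
            # Build the matching decoder layer; innermost layers first.
            self.decoder.insert(0, nn.Sequential(
                nn.Conv1d(chout, 2 * chout, 1), nn.GLU(dim=1),
                nn.ConvTranspose1d(chout, chin, K, S),
                nn.ReLU() if i > 0 else nn.Identity()))
            chin, chout = chout, 2 * chout
        # Unidirectional LSTM keeps the model causal (streaming friendly).
        self.lstm = nn.LSTM(chin, chin, num_layers=2, batch_first=True)

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:  # (batch, 1, time)
        skips, x = [], noisy
        for encode in self.encoder:
            x = encode(x)
            skips.append(x)
        x, _ = self.lstm(x.permute(0, 2, 1))
        x = x.permute(0, 2, 1)
        for decode in self.decoder:
            skip = skips.pop()
            length = min(x.shape[-1], skip.shape[-1])
            x = x[..., :length] + skip[..., :length]  # U-Net skip connection
            x = decode(x)
        return x


def multi_res_stft_loss(estimate: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    """Spectral-convergence + log-magnitude L1 loss, averaged over the three
    STFT resolutions listed above; estimate, clean: (batch, time)."""
    loss = torch.zeros((), device=estimate.device)
    for n_fft, hop, win in [(512, 50, 240), (1024, 120, 600), (2048, 240, 1200)]:
        window = torch.hann_window(win, device=estimate.device)
        s_est = torch.stft(estimate, n_fft, hop, win, window, return_complex=True).abs()
        s_ref = torch.stft(clean, n_fft, hop, win, window, return_complex=True).abs()
        sc = torch.linalg.norm(s_ref - s_est) / torch.linalg.norm(s_ref)
        mag = F.l1_loss(torch.log(s_est + 1e-7), torch.log(s_ref + 1e-7))
        loss = loss + sc + mag
    return loss / 3


# Example: a 4-utterance batch of 4-second clips at 16 kHz.
model = CausalDemucsSketch()
noisy = torch.randn(4, 1, 4 * 16_000)
clean = torch.randn(4, 1, 4 * 16_000)
estimate = model(noisy)
# The decoder output can be slightly shorter than the input; trim the target.
loss = multi_res_stft_loss(estimate.squeeze(1),
                           clean[..., :estimate.shape[-1]].squeeze(1))
```

Only the spectral part of the objective is sketched here; the hyper-parameter names (L, H, K, S) follow the notes above.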

Data

Augmentations:

  • Remix (shift one second; re-pair noise vs. speech within a batch to create new noisy samples); see the sketch after this list
  • Band-Mask: remove 20% of frequencies in the mel scale, between two randomly sampled band edges
  • Revecho
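
A rough sketch of how the Remix augmentation could look on raw-waveform batches; the tensor shapes, the shift convention, and the function name are my assumptions rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F


def remix(clean: torch.Tensor, noise: torch.Tensor,
          max_shift: int = 16_000) -> torch.Tensor:
    """clean, noise: (batch, channels, time). Returns new noisy mixtures."""
    perm = torch.randperm(noise.shape[0])           # shuffle noises across the batch
    shift = int(torch.randint(0, max_shift + 1, (1,)))
    shifted = F.pad(noise[perm], (shift, 0))[..., : noise.shape[-1]]
    return clean + shifted


# Example: an 8-utterance batch of 4-second clips at 16 kHz.
clean = torch.randn(8, 1, 4 * 16_000)
noise = torch.randn(8, 1, 4 * 16_000)
noisy = remix(clean, noise)                          # training targets remain `clean`
```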

Evaluation Methods

Note, from the intro, Défossez, Synnaeve and Adi write:

Although, multiple metrics exist to measure speech enhancement systems these have shown to not correlate well with human judgements. Hence, we report results for both objective metrics as well as human evaluation.

Objective metrics:

  1. PESQ: Perceptual evaluation of speech quality - wide-band version recommended in ITU-T P.862.2 (from 0.5 to 4.5)
  2. Short-Time Objective Intelligibility (STOI) (from 0 to 100)
  3. CSIG: Mean opinion score (MOS) prediction of the signal distortion attending only to the speech signal (from 1 to 5)
  4. CBAK: MOS prediction of the intrusiveness of background noise (from 1 to 5)
  5. COVL: MOS prediction of the overall effect (from 1 to 5)
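
For reference, a hedged sketch of computing PESQ and STOI with the third-party `pesq` and `pystoi` packages (not the paper's tooling; file names are placeholders):

```python
import soundfile as sf
from pesq import pesq        # pip install pesq
from pystoi import stoi      # pip install pystoi

clean, sr = sf.read("clean.wav")        # reference signal (assumed 16 kHz)
enhanced, _ = sf.read("enhanced.wav")   # model output at the same sample rate

# Wide-band PESQ as recommended in ITU-T P.862.2.
pesq_score = pesq(sr, clean, enhanced, "wb")

# STOI returns a value in [0, 1]; scaled here to the 0-100 range above.
stoi_score = 100 * stoi(clean, enhanced, sr, extended=False)

print(f"PESQ: {pesq_score:.2f}  STOI: {stoi_score:.1f}")
```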

Subjective metrics:

Baseline (previous state of the art) was DeepMMSE ("A Deep Learning Approach to MMSE-Based Noise Power Spectral Density Estimation")

Real-time Evaluation

We computed the Real-Time Factor (RTF, i.e. the time to enhance a frame divided by the stride) under the streaming setting to better match real-world conditions.

We benchmark this implementation on a quad-core Intel i5 CPU (2.0 GHz, up to AVX2 instruction set).

  • The RTF is 1.05 for the H=64 version
  • RTF is 0.6 for the H=48 version
    • When restricting execution to a single core, the H=48 model still achieves an RTF of 0.8, so it is realistic to use in real conditions
    • e.g. alongside video-call software
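
A quick sketch of how an RTF of this kind could be estimated, reusing the toy `CausalDemucsSketch` from the Methods section above; this is an offline approximation, not the authors' streaming implementation:

```python
import time
import torch

sample_rate = 16_000
model = CausalDemucsSketch(H=48).eval()
audio = torch.randn(1, 1, 10 * sample_rate)     # ten seconds of dummy audio

with torch.no_grad():
    start = time.perf_counter()
    _ = model(audio)
    elapsed = time.perf_counter() - start

# RTF = processing time / duration of audio processed.
rtf = elapsed / (audio.shape[-1] / sample_rate)
print(f"RTF = {rtf:.2f} (below 1.0 means faster than real time)")
```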

Effect on ASR models

  • They do this evaluation by synthetically noising LibriSpeech samples - is this standard practice?
    • a bit weird since it’s not real noise

To that end, we synthetically generated noisy data using the LIBRISPEECH dataset [31] together with noises from the test set of the DNS [19] benchmark. We created noisy samples in a controlled setting where we mixed the clean and noise files with SNR levels ∈ {0, 10, 20, 30}.

Viterbi WER on       dev-clean   enhanced   dev-other   enhanced
original (no noise)        2.1        2.2         4.6        4.7
noisy SNR 0               12.0        6.9        21.1       14.7
noisy SNR 10               9.8        6.3        18.4       13.1
noisy SNR 20               5.2        4.0        11.7        9.4
noisy SNR 30               3.3        2.9         7.6        7.2
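
For completeness, a sketch of mixing a clean utterance with noise at a target SNR, as in the controlled setup quoted above; the scaling convention and function name are my own, not the paper's released code:

```python
import torch


def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """clean, noise: 1-D waveforms of equal length. Returns the noisy mixture."""
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean()
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise


# Example: mixtures at the four SNR levels used in the table above.
clean = torch.randn(4 * 16_000)
noise = torch.randn(4 * 16_000)
mixtures = {snr: mix_at_snr(clean, noise, snr) for snr in (0, 10, 20, 30)}
```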