🪴 Anil's Garden

Multi-task self-supervised learning for Robust Speech Recognition

19 Dec 2025 · 2 min read

Tags: paper, speech, asr, annotated

Title: Multi-task self-supervised learning for Robust Speech Recognition
Authors: Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, Yoshua Bengio
Published: 25th January 2020 (Saturday) @ 00:24:45
Link: http://arxiv.org/abs/2001.09239v2

Abstract

Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), that combines a convolutional encoder followed by multiple neural networks, called workers, tasked to solve self-supervised problems (i.e., ones that do not require manual annotations as ground truth). PASE was shown to capture relevant speech information, including speaker voice-print and phonemes. This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments. To this end, we employ an online speech distortion module, that contaminates the input signals with a variety of random disturbances. We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks. Finally, we refine the set of workers used in self-supervision to encourage better cooperation. Results on TIMIT, DIRHA and CHiME-5 show that PASE+ significantly outperforms both the previous version of PASE as well as common acoustic features. Interestingly, PASE+ learns transferable representations suitable for highly mismatched acoustic conditions.
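
The abstract's "online speech distortion module" can be pictured with a small sketch. This is not the authors' implementation: the two disturbances chosen here (additive white noise and amplitude clipping), the function names `add_noise`, `clip` and `distort`, and all probabilities and ranges are assumptions made for illustration. The only idea taken from the abstract is that random contamination is applied to the input on the fly.

```python
import math
import random


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to reach the requested SNR (in dB)."""
    power = sum(x * x for x in signal) / len(signal)
    noise_power = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def clip(signal, fraction):
    """Clip samples to a fraction of the peak absolute amplitude."""
    limit = fraction * max(abs(x) for x in signal)
    return [max(-limit, min(limit, x)) for x in signal]


def distort(signal, rng, p_noise=0.5, p_clip=0.5):
    """Apply each disturbance independently with its own probability.

    The probabilities and parameter ranges are illustrative assumptions,
    not values from the paper.
    """
    out = list(signal)
    if rng.random() < p_noise:
        out = add_noise(out, snr_db=rng.uniform(0.0, 20.0), rng=rng)
    if rng.random() < p_clip:
        out = clip(out, fraction=rng.uniform(0.3, 0.8))
    return out


# One second of a 440 Hz tone at 16 kHz stands in for a clean utterance.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = distort(clean, rng)
```

In a training loop, `distort` would be called afresh on every batch, so the encoder sees a different corruption of the same utterance each time it is drawn.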

Problem-Agnostic Speech Encoder (PASE)
