Title: kNN For Whisper And Its Effect On Bias And Speaker Adaptation
Authors: Maya K. Nachesa, Vlad Niculae
Published: 24th October 2024 (Thursday) @ 15:32:52
Link: http://arxiv.org/abs/2410.18850v2
Abstract
Speech recognition performance varies by language, domain, and speaker characteristics such as accent, but fine-tuning a model on any of these categories may lead to catastrophic forgetting. Token-level k nearest neighbor search (kNN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that instead adapts using inference-time search in an external datastore, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from kNN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
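
To make the token-level retrieval idea concrete, the following is a minimal sketch of how a datastore lookup can be combined with a decoder's next-token distribution. It follows the commonly used kNN-LM / kNN-MT formulation (softmax over negative distances, then linear interpolation) and is not taken from the paper: the abstract does not specify the distance function, temperature, or interpolation weight, so `knn_distribution`, `interpolate`, `temperature`, and `lam` are all hypothetical names and assumed choices, and the toy keys stand in for decoder hidden states.

```python
import math
from collections import defaultdict


def knn_distribution(query, keys, values, k=2, temperature=1.0):
    """Build a token distribution from the k nearest datastore entries.

    Assumption: keys are decoder hidden states stored as tuples of floats,
    values are the tokens that followed them, and similarity is squared L2.
    """
    # Brute-force squared L2 distance from the query to every stored key.
    dists = [sum((q - x) ** 2 for q, x in zip(query, key)) for key in keys]
    nearest = sorted(range(len(keys)), key=lambda i: dists[i])[:k]

    # Softmax over negative distances; shifting by the smallest distance
    # keeps the exponentials numerically stable.
    d_min = dists[nearest[0]]
    weights = [math.exp(-(dists[i] - d_min) / temperature) for i in nearest]
    total = sum(weights)

    # Neighbors that share a token pool their probability mass.
    probs = defaultdict(float)
    for i, w in zip(nearest, weights):
        probs[values[i]] += w / total
    return dict(probs)


def interpolate(p_model, p_knn, lam=0.5):
    """Mix the model and retrieval distributions: lam * kNN + (1 - lam) * model."""
    tokens = set(p_model) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_model.get(t, 0.0)
        for t in tokens
    }


# Toy datastore: three (hidden state, next token) pairs (illustrative values only).
toy_keys = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
toy_values = ["cat", "cat", "dog"]

p_knn = knn_distribution((0.0, 0.0), toy_keys, toy_values, k=2)
p_final = interpolate({"cat": 0.4, "dog": 0.6}, p_knn, lam=0.5)
```

In this toy case both retrieved neighbors vote for the same token, so retrieval shifts probability toward it while the underlying model's weights stay untouched. A practical system would replace the brute-force search with an approximate nearest neighbor index.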