Title: Multilingual Speech Models for Automatic Speech Recognition Exhibit Gender Performance Gaps
Authors: Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci, Dirk Hovy
Published: 28th February 2024 (Wednesday) @ 00:24:29
Link: http://arxiv.org/abs/2402.17954v1
Abstract
Current voice recognition approaches use multi-task, multilingual models for speech tasks like Automatic Speech Recognition (ASR) to make them applicable to many languages without substantial changes. However, broad language coverage can still mask performance gaps within languages, for example, across genders. We systematically evaluate multilingual ASR systems on gendered performance gaps. Using two popular models on three datasets in 19 languages across seven language families, we find clear gender disparities. However, the advantaged group varies between languages. While there are no significant differences across groups in phonetic variables (pitch, speaking rate, etc.), probing the model's internal states reveals a negative correlation between probe performance and the gendered performance gap. I.e., the easier to distinguish speaker gender in a language, the more the models favor female speakers. Our results show that group disparities remain unsolved despite great progress on multi-tasking and multilinguality. We provide first valuable insights for evaluating gender gaps in multilingual ASR systems. We release all code and artifacts at https://github.com/g8a9/multilingual-asr-gender-gap.
Note: This was read for the Sardine Paper Clinic ahead of the 15th June 2024 ARR deadline.
Sardine Paper Clinic Notes
- In what sense is the female group a "minority" here: in the training data distribution? If not, would "socio-economically disadvantaged group" be the better term?
- What explanations for the gender bias disparity across datasets can be gleaned from the datasets themselves?
- Since the results differ qualitatively across Common Voice, Fleurs and VoxPopuli, what about a human baseline error rate for the male and female subsets of each dataset, reported as cross-tabulations?
- Error bars on Figure 2 to show significance of the results?
- What is the justification for using only the portion of the audio containing speech? Which output layer did you use?
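
One possible way to get the error bars asked about for Figure 2 is a percentile bootstrap over utterances. The sketch below is only an illustration under stated assumptions, not the paper's method: it assumes per-utterance error rates (e.g. WER) are already available for each gender group, and it defines the gap as mean female error minus mean male error, which may differ from the definition used in the paper. The function name `bootstrap_gap_ci` and the toy numbers are hypothetical.

```python
import random
from statistics import mean


def bootstrap_gap_ci(errors_f, errors_m, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a gender error gap.

    Assumption: gap = mean(errors_f) - mean(errors_m), where each list
    holds per-utterance error rates for one gender group.
    Returns (observed_gap, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    gaps = []
    for _ in range(n_resamples):
        # Resample utterances with replacement within each group.
        sample_f = rng.choices(errors_f, k=len(errors_f))
        sample_m = rng.choices(errors_m, k=len(errors_m))
        gaps.append(mean(sample_f) - mean(sample_m))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(errors_f) - mean(errors_m), lower, upper


if __name__ == "__main__":
    # Toy per-utterance error rates, invented purely for illustration.
    female = [0.10, 0.15, 0.08, 0.20, 0.12, 0.09, 0.18, 0.11]
    male = [0.14, 0.22, 0.19, 0.25, 0.16, 0.21, 0.13, 0.24]
    gap, lo, hi = bootstrap_gap_ci(female, male)
    print(f"gap={gap:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

If the interval for a given language and dataset excludes zero, the gap would be distinguishable from resampling noise at the chosen level. Resampling by speaker rather than by utterance may be more appropriate when a few speakers contribute many utterances.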