Title: ML-SUPERB: Multilingual Speech Universal PERformance Benchmark
Authors: Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe
Published: 18th May 2023 (Thursday) @ 00:01:27
Link: http://arxiv.org/abs/2305.10615v3
Abstract
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks. However, SUPERB largely considers English speech in its evaluation. This paper presents multilingual SUPERB (ML-SUPERB), covering 143 languages (ranging from high-resource to endangered), and considering both automatic speech recognition and language identification. Following the concept of SUPERB, ML-SUPERB utilizes frozen SSL features and employs a simple framework for multilingual tasks by learning a shallow downstream model. Similar to the SUPERB benchmark, we find speech SSL models can significantly improve performance compared to FBANK features. Furthermore, we find that multilingual models do not always perform better than their monolingual counterparts. We will release ML-SUPERB as a challenge with organized datasets and reproducible training scripts for future multilingual representation research.
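The abstract's recipe - frozen SSL features feeding a shallow trainable downstream model - is the SUPERB-style probing setup. Below is a minimal PyTorch sketch of that style of probe, assuming the common layer-weighted-sum formulation; all names are illustrative, not the authors' code, and the real ML-SUPERB downstream is a small network trained on top of the frozen features.

```python
import torch
import torch.nn as nn

class ShallowProbe(nn.Module):
    """Illustrative SUPERB-style downstream probe: a learned weighted sum
    over the frozen SSL model's layer outputs, then a small trainable head."""

    def __init__(self, num_layers: int, feat_dim: int, vocab_size: int):
        super().__init__()
        # One scalar weight per upstream layer, softmax-normalized in forward.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        # A single linear projection keeps the sketch minimal; the benchmark's
        # actual downstream is still a shallow network.
        self.head = nn.Linear(feat_dim, vocab_size)

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: (num_layers, batch, time, feat_dim), produced by a
        # frozen upstream (e.g. under torch.no_grad), so only the probe's
        # parameters receive gradients.
        w = torch.softmax(self.layer_weights, dim=0)
        pooled = (w.view(-1, 1, 1, 1) * layer_feats).sum(dim=0)
        return self.head(pooled)  # per-frame logits, e.g. for a CTC loss

# Toy usage: 12 layers of 768-dim features, 2 utterances, 50 frames each.
logits = ShallowProbe(12, 768, vocab_size=100)(torch.randn(12, 2, 50, 768))
```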
Notes
- Data Collection: ML-SUPERB gathers data from a wide range of multilingual speech corpora, including:
- Multilingual Librispeech [16], Commonvoice [17], Voxforge [18], Voxpopuli [19], the Googlei18n open-source project [20–22], the Nordic Language Technology ASR corpora [23], Fleurs [24], NCHLT Speech [25], the Spoken Wikipedia corpus (SWC) [26], Mexican endangered-language corpora [10, 27–31], the M-AILABS multilingual corpora [32], the Living Audio dataset (LAD) [33], and the ALFFA corpus [34].
- All corpora carry Creative Commons, MIT, GNU, or Free-BSD licenses, so they are permissively available for both industrial and academic research.
- The original splits of the source datasets are kept, except for SWC, M-AILABS, LAD, and ALFFA. All datasets other than these four can therefore be used for SSL pre-training without overlapping the benchmark's evaluation data.
- Each (lang, dataset) pair contributes a randomly sampled 10-minute subset for each of {train, dev, test}, plus a 1-hour training subset that contains the 10-minute one (see the sampling sketch after this list). The 10-minute / 1-hour training sizes were chosen because they are:
    - challenging: a small train set makes the benchmark harder → avoids saturation
    - reasonable: recent speech SSL models reach reasonable performance when the downstream model is tuned on train sets of this size
    - efficient: limiting train-set size keeps a full evaluation cycle affordable (presumably important for running the challenge)
- "A full evaluation cycle of ML-SUPERB can take up to 3 days using 4 2080Ti GPUs."
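A hypothetical sketch of the per-pair sampling described above; the function name and the greedy duration-filling strategy are my assumptions, not the released scripts. It draws disjoint ~10-minute train/dev/test subsets and then grows the train subset to ~1 hour so that it contains the 10-minute one.

```python
import random

def sample_splits(utterances, seed=0):
    """utterances: list of (utt_id, duration_seconds) for one
    (lang, dataset) pair. Returns ~10-minute train/dev/test subsets
    plus a ~1-hour train set containing the 10-minute train set."""
    pool = utterances[:]
    random.Random(seed).shuffle(pool)

    def take(minutes):
        # Greedily pop shuffled utterances until the duration target is met.
        picked, total = [], 0.0
        while pool and total < minutes * 60:
            utt = pool.pop()
            picked.append(utt)
            total += utt[1]
        return picked

    train_10min, dev, test = take(10), take(10), take(10)
    train_1h = train_10min + take(50)  # superset of the 10-minute train set
    return train_10min, train_1h, dev, test
```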
- the benchmark additionally includes a few-shot setting covering 20 languages, using only 5 extra utterances per language in training
- Two tracks: monolingual & multilingual
- Monolingual track: 9 languages; 14 monolingual experiments
- languages: {rus, swa, swe, jpn, cmn, xty} plus {eng, fra, deu}
- for each language:
    - select 1 train and dev set - for {rus, swa, swe, jpn, cmn, xty}
    - use all available sets of that language for evaluation (test) - to test accent or domain conditions
    - select 3 (lang, dataset) pairs for eng and 2 each for {fra, deu} - to evaluate the impact of the training domain on the models' performance
- for eng we thus have 3 monolingual experiments: (eng, MLS), (eng, NCHLT), and (eng, VoxPopuli) - see the toy driver below
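A toy driver for the monolingual track as described above: train one probe per (lang, dataset) pair, then score it on every test set available for that language. Only the three English corpora are named in the notes; all other corpus names and both helper functions are placeholders.

```python
import random

def train_probe(lang, dataset):
    return (lang, dataset)            # placeholder "model"

def evaluate(model, test_set):
    return round(random.random(), 3)  # placeholder score

# English pairs are from the notes; the fra/deu corpora and the single
# corpus per remaining language are placeholders, not from the paper.
TRAIN_PAIRS = {
    "eng": ["MLS", "NCHLT", "VoxPopuli"],
    "fra": ["fra_corpus_1", "fra_corpus_2"],
    "deu": ["deu_corpus_1", "deu_corpus_2"],
    **{l: [f"{l}_corpus"] for l in ("rus", "swa", "swe", "jpn", "cmn", "xty")},
}

for lang, corpora in TRAIN_PAIRS.items():
    for train_set in corpora:
        model = train_probe(lang, train_set)
        # Score on all test sets of the language, so a probe trained on
        # one domain is also evaluated out of domain.
        for test_set in corpora:
            print(lang, train_set, "->", test_set, evaluate(model, test_set))
```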
- Multilingual track: