Title: Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
Authors: Beomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, Laurent Besacier
Published: 7th August 2024 (Wednesday) @ 16:55:28
Link: http://arxiv.org/abs/2408.03900v1
Abstract
We present Speech-MASSIVE, a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart for a portion of the MASSIVE textual corpus. Speech-MASSIVE covers 12 languages from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. Our extension is prompted by the scarcity of massively multilingual SLU datasets and the growing need for versatile speech datasets to assess foundation models (LLMs, speech encoders) across languages and tasks. We provide a multimodal, multitask, multilingual dataset and report SLU baselines using both cascaded and end-to-end architectures in various training scenarios (zero-shot, few-shot, and full fine-tune). Furthermore, we demonstrate the suitability of Speech-MASSIVE for benchmarking other tasks such as speech transcription, language identification, and speech translation. The dataset, models, and code are publicly available at: https://github.com/hlt-mt/Speech-MASSIVE
Summary: Speech-MASSIVE is a multilingual Spoken Language Understanding (SLU) dataset based on MASSIVE ("MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages").
Available via Hugging Face 🤗 at https://huggingface.co/datasets/FBK-MT/Speech-MASSIVE
Speech-MASSIVE mainly targets SLU (intent classification and slot filling) but can also be used for ASR, language identification (LangID), and speech translation (ST).
Languages: Arabic, German, Spanish, French, Hungarian, Korean, Dutch, Polish, European Portuguese, Russian, Turkish, and Vietnamese
They don't cover Italian because it's already available in ITALIC ("ITALIC: An Italian Intent Classification Dataset"), thanks to Giuseppe Attanasio and others!
ASR baselines assessment: We compared ASR error rates to those obtained on the FLEURS dataset [5]. FLEURS generally yields lower WERs/CERs compared to Speech-MASSIVE. The same observation was made for Italian in [11], which followed a recording methodology similar to ours.
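For reference, the WER and CER figures compared above are edit-distance metrics: the number of word-level (or character-level) substitutions, deletions, and insertions needed to turn the ASR hypothesis into the reference, divided by the reference length. The following is a minimal sketch of that computation, not the paper's evaluation script. The function names, the whitespace tokenization, and the absence of any text normalization (casing, punctuation) are all assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or chars)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing in hyp)
                curr[j - 1] + 1,     # insertion (extra token in hyp)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate, assuming plain whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over raw characters (spaces included)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # Hypothetical utterance pair, not taken from the dataset.
    ref = "wake me up at nine am"
    hyp = "wake me at nine pm"
    print(f"WER: {wer(ref, hyp):.3f}")  # one deletion + one substitution over 6 words
    print(f"CER: {cer(ref, hyp):.3f}")
```

Because both rates are normalized by reference length, they can exceed 1.0 when the hypothesis contains many insertions. Published numbers usually also depend on the normalization applied before scoring, so values from a sketch like this are only comparable when the same preprocessing is used on both datasets.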