Title: Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
Authors: Beomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, Laurent Besacier
Published: 7th August 2024 (Wednesday) @ 16:55:28
Link: http://arxiv.org/abs/2408.03900v1

Abstract

We present Speech-MASSIVE, a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart for a portion of the MASSIVE textual corpus. Speech-MASSIVE covers 12 languages from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. Our extension is prompted by the scarcity of massively multilingual SLU datasets and the growing need for versatile speech datasets to assess foundation models (LLMs, speech encoders) across languages and tasks. We provide a multimodal, multitask, multilingual dataset and report SLU baselines using both cascaded and end-to-end architectures in various training scenarios (zero-shot, few-shot, and full fine-tune). Furthermore, we demonstrate the suitability of Speech-MASSIVE for benchmarking other tasks such as speech transcription, language identification, and speech translation. The dataset, models, and code are publicly available at: https://github.com/hlt-mt/Speech-MASSIVE


Summary: This is a multilingual Spoken Language Understanding (SLU) dataset based on MASSIVE (A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages).

Available via Hugging Face đŸ€— at https://huggingface.co/datasets/FBK-MT/Speech-MASSIVE


Speech-MASSIVE is mainly intended for intent classification, but it can also be used for ASR, language identification (LangID), and speech translation (ST).

Languages: Arabic, German, Spanish, French, Hungarian, Korean, Dutch, Polish, European Portuguese, Russian, Turkish, and Vietnamese

They don't include Italian because it is already covered by ITALIC (An Italian Intent Classification Dataset), thanks to Giuseppe Attanasio and others!

ASR baselines assessment: We compared ASR error rates to those obtained on the FLEURS dataset [5]. FLEURS generally yields lower WERs/CERs than Speech-MASSIVE. The same observation was made for Italian in [11], which followed a recording methodology similar to ours.
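For reference, the WER and CER metrics used in these comparisons both reduce to an edit (Levenshtein) distance, computed over words or characters respectively. The sketch below is illustrative only, not the paper's evaluation script; real ASR scoring pipelines usually also normalize text (casing, punctuation) before computing the distance.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, single-row dynamic programming."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds the previous row's value at column j-1 (substitution cell)
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution (free if equal)
    return d[len(hyp)]

def wer(ref, hyp):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character Error Rate: same computation at the character level."""
    return edit_distance(ref, hyp) / len(ref)

print(wer("turn on the lights", "turn off the light"))  # 0.5 (2 errors / 4 words)
```

Lower is better for both metrics, which is why FLEURS "yielding lower WERs/CERs" means it is the easier ASR benchmark of the two.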

Speech-MASSIVE Stats: hours, samples, speakers, languages, splits and baselines

Table 1: Speech-MASSIVE's overall statistics. '# hrs' displays the recording duration for all samples (including invalid ones), while '# spk (Male/Female/Unknown)' indicates the number of speakers for all samples (including invalid ones). The last two columns ('WER' and 'CER') measure Whisper ASR performance.

SLU (Intent Classification) Baselines: Cascaded & E2E

Table 2

LID Accuracy and ST BLEU Baselines with Whisper Large v3

Table 3