Title: MLS: A Large-Scale Multilingual Dataset for Speech Research
Authors: Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert
Published: 7th December 2020 (Monday) @ 01:53:45
Link: http://arxiv.org/abs/2012.03411v2

Abstract

This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at http://www.openslr.org.


| Dataset | Total Duration / hrs | Languages | # Speakers | Notes |
|---|---|---|---|---|
| MLS: A Large-Scale Multilingual Dataset for Speech Research | 44,000 (English) + 6,000 (all others) | en, de, nl, fr, es, it, pt, pl | | See Table 2 of the paper for detailed stats; actually more than advertised, e.g. more than 44k hrs for English; 44k is train set only. |

  • The dataset is available via OpenSLR in both FLAC (original) and Opus (compressed) formats: https://www.openslr.org/94/
  • Pre-trained 3- and 4-gram LMs are also available
  • Hugging Face Datasets has a streamable version of the non-English subsets of the Multilingual LibriSpeech (MLS) dataset at facebook/multilingual_librispeech

This is a streamable version of the Multilingual LibriSpeech (MLS) dataset. The data archives were restructured from the original ones from OpenSLR to make it easier to stream.
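
As a rough illustration of what "streamable" means in practice, the sketch below opens one language subset lazily and pulls only the first few examples instead of downloading the full archives. It assumes the `datasets` library is installed and the Hub is reachable. The helper names `stream_mls` and `take` are hypothetical, introduced here only for illustration.

```python
from itertools import islice
from typing import Any, Iterable


def stream_mls(language: str, split: str = "train") -> Iterable[Any]:
    """Open one MLS language subset in streaming mode (needs network access)."""
    # Imported lazily so that `take` below can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "facebook/multilingual_librispeech",
        language,          # e.g. "german"; the config must exist on the Hub
        split=split,
        streaming=True,    # iterate over examples without a full download
    )


def take(examples: Iterable[Any], n: int) -> list:
    """Materialise at most the first n items of any iterable."""
    return list(islice(examples, n))
```

For instance, `take(stream_mls("german"), 2)` would fetch two training examples. The fields of each example may differ between dataset versions, so it is safer to inspect them than to assume particular names.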

For some reason, the English subset is not available there. Keep in mind that English is ~44k hours, compared with 6k hours in total for all the other languages. Confirmed via the following Python snippet, which fails with the error shown in the trailing comments:

#!/usr/bin/env python
from datasets import load_dataset

# Requesting the "english" config raises, because the Hub repo does not define it.
mls_en_train = load_dataset("facebook/multilingual_librispeech", "english", split="train")
# ValueError: BuilderConfig 'english' not found.
# Available: ['german', 'dutch', 'french', 'spanish', 'italian', 'portuguese', 'polish']
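
To avoid hitting that error partway through a script, the requested language can be checked up front. The sketch below is a minimal version, assuming the config list copied from the error message is still current. `KNOWN_MLS_CONFIGS`, `list_mls_configs`, and `check_mls_config` are hypothetical names. The optional live lookup uses `get_dataset_config_names` from the `datasets` library and needs network access.

```python
# Config names copied from the error message above; the Hub repo may change over time.
KNOWN_MLS_CONFIGS = (
    "german", "dutch", "french", "spanish", "italian", "portuguese", "polish",
)


def list_mls_configs(offline: bool = True) -> list:
    """Return MLS config names, from the hard-coded list or from the Hub."""
    if offline:
        return list(KNOWN_MLS_CONFIGS)
    # Lazy import: only the live lookup needs `datasets` and a network connection.
    from datasets import get_dataset_config_names

    return get_dataset_config_names("facebook/multilingual_librispeech")


def check_mls_config(name: str, offline: bool = True) -> str:
    """Return the normalised config name, or raise ValueError if it is not offered."""
    available = list_mls_configs(offline)
    key = name.strip().lower()
    if key not in available:
        raise ValueError(f"MLS config {name!r} not found. Available: {available}")
    return key
```

With the offline list, `check_mls_config("German")` returns `"german"`, while `check_mls_config("english")` raises a `ValueError`, mirroring the behaviour observed above.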