Title: Librispeech: An ASR corpus based on public domain audio books
Authors: Vassil Panayotov, Guoguo Chen, Daniel Povey, Sanjeev Khudanpur
Published: 19th April 2015
Link: https://ieeexplore.ieee.org/document/7178964

Abstract

This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models. We show that acoustic models trained on LibriSpeech give lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself. We are also releasing Kaldi scripts that make it easy to build these systems.


PDF: https://www.danielpovey.com/files/2015_icassp_librispeech.pdf


LibriSpeech Data Subsets

subset            hours   per-spk minutes   female spkrs   male spkrs   total spkrs
dev-clean           5.4                 8             20           20            40
test-clean          5.4                 8             20           20            40
dev-other           5.3                10             16           17            33
test-other          5.1                10             17           16            33
train-clean-100   100.6                25            125          126           251
train-clean-360   363.6                25            439          482           921
train-other-500   496.7                30            564          602          1166

3.1. Data selection

To select the audio recordings for inclusion into the corpus we use LibriVox's API to collect information about the readers, the audio book projects in which they participated, and the chapters of books that they read. The URLs for audio files and reference texts were obtained by matching the information from LibriVox's API with the metadata records from the Internet Archive and Project Gutenberg's RDF/XML files. For a small fraction of audiobooks no exact match for the title was found in Project Gutenberg, so to improve coverage we allowed a fuzzy matching of titles.
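As an illustration, fuzzy title matching can be approximated with a similarity ratio over normalized titles. The sketch below uses Python's standard difflib; it is a hypothetical stand-in for the authors' matching code, and the 0.9 similarity threshold is an assumption.

```python
# Hypothetical sketch of fuzzy title matching between LibriVox and
# Project Gutenberg titles; the actual criteria used are not given in the paper.
from difflib import SequenceMatcher

def normalize(title: str) -> str:
    """Lower-case and strip punctuation so near-identical titles compare equal."""
    return "".join(c for c in title.lower() if c.isalnum() or c.isspace()).strip()

def best_gutenberg_match(librivox_title: str, gutenberg_titles: list[str],
                         threshold: float = 0.9):
    """Return the closest Gutenberg title, or None if nothing is similar enough."""
    best_title, best_score = None, 0.0
    for candidate in gutenberg_titles:
        score = SequenceMatcher(None, normalize(librivox_title),
                                normalize(candidate)).ratio()
        if score > best_score:
            best_title, best_score = candidate, score
    return best_title if best_score >= threshold else None
```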

In order to guarantee that there was no speaker overlap between the training, development and test sets, we wanted to ensure that each recording is unambiguously attributable to a single speaker. To that end we exclude such LibriVox genres as, for example, "Dramatic Reading", which include predominantly multi-reader audio chapters. As an extra precaution, in the final post-processing step of the alignment process the recordings are processed with the LIUM speaker diarization toolkit [20] to automatically detect multi-speaker chapters. A custom GUI application, which makes use of the text-audio alignment information and the speaker diarization information, was written to allow for quick inspection and filtering out of the remaining multi-speaker recordings. This application also made it possible to quickly produce gender information for the speakers and to discard a small number of recordings that had excessive audio quality problems.
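The paper does not describe the exact filtering criteria, but a minimal sketch of the multi-speaker check might look like the following, assuming the diarization output has been reduced to lines of the form "<recording-id> <start> <duration> <speaker-label>"; the 10% secondary-speaker threshold is purely illustrative.

```python
# Flag chapters where diarization finds a substantial second speaker cluster.
# Input format and threshold are assumptions; the real pipeline used the
# LIUM toolkit's native output.
from collections import defaultdict

def flag_multispeaker_chapters(seg_lines, min_secondary_fraction=0.1):
    """Return recording ids where a second speaker cluster accounts for more
    than min_secondary_fraction of the diarized speech time."""
    time_per_speaker = defaultdict(lambda: defaultdict(float))
    for line in seg_lines:
        rec_id, start, dur, spk = line.split()
        time_per_speaker[rec_id][spk] += float(dur)
    flagged = []
    for rec_id, spk_times in time_per_speaker.items():
        total = sum(spk_times.values())
        durations = sorted(spk_times.values(), reverse=True)
        if len(durations) > 1 and durations[1] / total > min_secondary_fraction:
            flagged.append(rec_id)
    return flagged
```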

We ensured a gender balance at the speaker level and in terms of the amount of data available for each gender.

3.2. Corpus partitions

The size of the corpus makes it impractical, or at least inconvenient for some users, to distribute it as a single large archive. Thus the training portion of the corpus is split into three subsets, with approximate size 100, 360 and 500 hours respectively.

A simple automatic procedure was used to select the audio in the first two sets to be, on average, of higher recording quality and with accents closer to US English.

An acoustic model was trained on WSJ’s si-84 data subset and was used to recognize the audio in the corpus, using a bigram LM estimated on the text of the respective books. We computed the Word Error Rate (WER) of this automatic transcript relative to our reference transcripts obtained from the book texts.
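The WER itself is the standard word-level edit distance between the hypothesis and the reference, normalized by the reference length; a self-contained sketch (equivalent in spirit to Kaldi's compute-wer) follows.

```python
# Word Error Rate as Levenshtein distance over words, divided by the
# reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```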

The speakers in the corpus were ranked according to the WER of the WSJ model’s transcripts, and were divided roughly in the middle, with the lower-WER speakers designated as “clean” and the higher-WER speakers designated as “other”.
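A minimal sketch of this split, assuming the cut point is simply the median of the ranked list (the paper says only "roughly in the middle"):

```python
# Rank speakers by the WER of the WSJ model's transcripts and cut at the
# median; the exact cut point used by the authors is an assumption here.
def split_clean_other(speaker_wer: dict[str, float]):
    ranked = sorted(speaker_wer, key=speaker_wer.get)   # increasing WER
    cut = len(ranked) // 2
    return ranked[:cut], ranked[cut:]                   # ("clean", "other")
```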

From the “clean” pool, 20 male and 20 female speakers were drawn at random and assigned to a development set. The same was repeated to form a test set. For each dev or test set speaker, approximately eight minutes of speech are used, for a total of approximately 5 hours and 20 minutes each. Note that, as mentioned in Section 2.4, we use a different segmentation procedure for development and test data than for training data.

The rest of the audio in the “clean” pool was randomly split into two training sets with approximate size 100 and 360 hours respectively. For each speaker in these training sets the amount of speech was limited to 25 minutes, in order to avoid major imbalances in per-speaker audio duration.

The “other” pool was similarly split into test and development sets, and a single training set of approximately 500 hours. For this pool, however, we did not choose the development and test sets at random; instead we deliberately chose more challenging data. The WER we computed using the WSJ models was used to rank the speakers in order of increasing difficulty, and the speakers for the test and development set were randomly chosen from the third quartile of this sorted list. Table 1 provides a summary of all subsets in the corpus.
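As a rough illustration of that selection step, assuming the third quartile means the 50th to 75th percentile of the list ranked by increasing WER, and ignoring the gender balancing described in Section 3.1:

```python
# Sample dev/test speakers for the "other" pool from the third quartile of
# the WER-ranked speaker list; speaker counts and gender balancing from the
# actual corpus construction are omitted in this sketch.
import random

def pick_other_eval_speakers(speaker_wer: dict[str, float], n_speakers: int,
                             seed: int = 0):
    ranked = sorted(speaker_wer, key=speaker_wer.get)        # increasing difficulty
    q3 = ranked[len(ranked) // 2 : 3 * len(ranked) // 4]     # third quartile
    rng = random.Random(seed)
    return rng.sample(q3, min(n_speakers, len(q3)))
```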