🪴 Anil's Garden


Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

19 Dec 2025 · 2 min read

paper
dataset
tts
asr
speech
annotated

Title: Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
Authors: Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Yifan Yang, Liyong Guo, Long Lin, Daniel Povey
Published: 15th September 2023 (Friday) @ 01:59:21
Link: http://arxiv.org/abs/2309.08105v2

Abstract

In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely available corpus of speech with supervisions. Unlike other open-source datasets that only provide normalized transcriptions, Libriheavy contains richer information such as punctuation, casing and text context, which brings more flexibility for system building. Specifically, we propose a general and efficient pipeline to locate, align and segment the audio in the previously published Librilight to its corresponding texts. Like Librilight, Libriheavy has three training subsets (small, medium, large) of 500h, 5000h and 50000h respectively. We also extract the dev and test evaluation sets from the aligned audio and guarantee that they share no speakers or books with the training sets. Baseline systems are built on the popular CTC-Attention and transducer models. Additionally, we open-source our dataset creation pipeline, which can also be used for other audio alignment tasks.

pkufool/libriheavy_long on Hugging Face

Libriheavy is a labeled version of Librilight. We align the audio files in Librilight to their corresponding text in the original book and segment them into smaller pieces with durations ranging from 2 to 30 seconds. We maintain the original dataset splits of Librilight and have three training subsets (small, medium, large). In addition, we further extract evaluation subsets (dev, test-clean, test-other) for validation and testing. Table 1 shows the statistics of these subsets.
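
To make the 2 to 30 second segmentation concrete, here is a minimal Python sketch of how word-level alignments could be grouped into segments within those bounds. It is not the authors' pipeline: the `Word` type, the `segment_words` function, and the rule of cutting at sentence-final punctuation are all assumptions made for illustration. Only the 2 and 30 second limits come from the description above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """Hypothetical aligned word: text plus start/end times in seconds."""
    text: str
    start: float
    end: float


def segment_words(
    words: List[Word], min_dur: float = 2.0, max_dur: float = 30.0
) -> List[Tuple[float, float, str]]:
    """Greedily group aligned words into (start, end, text) segments.

    Assumed rule (not taken from the paper): cut after a word ending in
    '.', '?' or '!' once the segment is at least min_dur long, and force
    a cut before any word that would push the segment past max_dur.
    A trailing remainder may be shorter than min_dur; a real pipeline
    would need to merge or drop it.
    """
    segments: List[List[Word]] = []
    current: List[Word] = []
    for w in words:
        # Close the running segment if this word would exceed max_dur.
        if current and w.end - current[0].start > max_dur:
            segments.append(current)
            current = []
        current.append(w)
        duration = current[-1].end - current[0].start
        # Prefer to cut at sentence-final punctuation once long enough.
        if duration >= min_dur and w.text.endswith((".", "?", "!")):
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return [
        (seg[0].start, seg[-1].end, " ".join(x.text for x in seg))
        for seg in segments
    ]


if __name__ == "__main__":
    demo = [
        Word("Hello,", 0.0, 0.5),
        Word("world.", 0.5, 1.0),
        Word("This", 1.2, 1.5),
        Word("is", 1.5, 1.7),
        Word("fine.", 1.7, 2.4),
    ]
    for start, end, text in segment_words(demo):
        print(f"{start:.1f}-{end:.1f}: {text}")
```

Because punctuation and casing are kept in the transcripts, a cut rule like this can rely on sentence boundaries in the text rather than on pauses in the audio alone.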
