Title: YODAS: Youtube-Oriented Dataset for Audio and Speech
Authors: Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, Shinji Watanabe
Published: 2 June 2024
Link: http://arxiv.org/abs/2406.00899v1
Abstract
In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset currently comprising over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech. The labeled subsets, which include manual or automatic subtitles, facilitate supervised model training, while the unlabeled subsets are suited to self-supervised learning. YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license. We introduce the collection methodology used to build YODAS, which contributes to large-scale speech dataset construction. We then provide a comprehensive analysis of the speech and text contained in the dataset. Finally, we describe speech recognition baselines for the top 15 languages.
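For context, the released dataset is distributed in per-language shards, so streaming is the practical way to inspect it. Below is a minimal sketch using the Hugging Face `datasets` library; the repository name `espnet/yodas`, the shard config `en000`, and the `audio`/`text` field names are assumptions based on the public release, not details from the abstract.

```python
# A minimal sketch: stream one YODAS shard instead of downloading 500k+ hours.
# Assumptions (not stated in the abstract): the dataset is hosted on the
# Hugging Face Hub as "espnet/yodas", shards are named like "en000", and
# each example exposes "audio" and "text" fields.
from datasets import load_dataset

ds = load_dataset("espnet/yodas", "en000", split="train", streaming=True)

# Peek at a few utterances without materializing the shard on disk.
for example in ds.take(3):
    audio = example["audio"]  # dict with "array" and "sampling_rate"
    print(example["text"], len(audio["array"]), audio["sampling_rate"])
```

Streaming mode returns an `IterableDataset`, which keeps memory and disk usage bounded regardless of shard size.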
YODAS [4], a large-scale multilingual dataset with over 500k hours of speech in 100+ languages, derives its annotations from YouTube subtitles, which are often unreliable: even manually created captions carry no guarantee of human verification. Inaccurate language ID labels necessitate robust filtering and cause significant data loss (e.g., only about 20% retention for Bulgarian and Ukrainian). The raw data is also noisy, so extensive preprocessing is required before training.
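To illustrate the kind of language-ID filtering such a pipeline needs, here is a minimal sketch using fastText's publicly available lid.176.bin text LID model. The paper's actual LID tooling and confidence threshold are not given here, so the model choice and the 0.8 cutoff are assumptions for illustration only.

```python
# A minimal sketch of subtitle language-ID filtering. Assumptions: fastText's
# lid.176.bin model (downloadable from fasttext.cc) stands in for whatever
# LID system the pipeline actually uses, and 0.8 is a placeholder threshold.
import fasttext

model = fasttext.load_model("lid.176.bin")

def keep_utterance(text: str, claimed_lang: str, threshold: float = 0.8) -> bool:
    """Keep an utterance only if the predicted text language matches the
    language claimed by the channel/subtitle metadata with high confidence."""
    labels, probs = model.predict(text.replace("\n", " "))
    predicted = labels[0].removeprefix("__label__")
    return predicted == claimed_lang and probs[0] >= threshold

# A Bulgarian-tagged caption that is actually English gets dropped; this is
# the mechanism behind the low retention rates for languages like Bulgarian
# and Ukrainian, where many channel-level language tags are wrong.
print(keep_utterance("Добър ден на всички", "bg"))          # likely True
print(keep_utterance("hello world from youtube", "bg"))      # False
```

Running LID per utterance rather than per channel trades compute for precision: a single mislabeled channel tag no longer discards or contaminates all of its videos.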