Title: On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
Authors: Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, Shinji Watanabe
Published: 13th June 2024 (Thursday) @ 16:22:37
Link: http://arxiv.org/abs/2406.09282v1
Abstract
The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the impacts of this data heterogeneity. Our study begins with a detailed analysis of each dataset, from which we derive two key strategies: data filtering with a proxy task to enhance data quality, and the incorporation of punctuation and true-casing using an open large language model (LLM). With all other configurations kept the same, OWSM v3.2 improves performance over the OWSM v3.1 baseline while using 15% less training data.
Table 1: Statistics on the OWSM data mixture
Volume counts only the training subset and can differ from the officially reported figure for each dataset because of our data preparation policy. In the Language column, 3-character language IDs follow the ISO 639-3 standard; numeric values give the number of languages for ASR data and the number of translation directions for ST data. License information is based on our collection. Punctuation and Case-Sensitivity specify whether the original text labels contain punctuation and are case-sensitive; a dash (-) under Case-Sensitivity means the language is not case-sensitive. Long-Form specifies whether segmentation information is provided to splice short clips into long-form examples.
Corpus | Type | Volume (h) | Language | # Examples | License | Punctuation | Case-Sensitivity | Long-Form |
---|---|---|---|---|---|---|---|---|
aidatatang [24] | ASR | 140 | zho | 164K | CC-BY-NC-ND-4.0 | ✘ | - | ✘ |
AISHELL [25] | ASR | 150 | zho | 120K | Apache 2.0 | ✘ | - | ✘ |
ami [26] | ASR | 141 | eng | 24K | CC-BY-4.0 | ✘ | ✘ | ✔ |
babel [27] | ASR | 2115 | 25 | 318K | - | ✘ | ✘ | ✔ |
CommonVoice (CV) [28] | ASR | 16360 | 104 | 11.8M | CC0-1.0 | ✔ | ✔ | ✘ |
CoVoST2 [29] | ST | 8550 | 22 | 5.9M | CC-BY-NC 4.0 | ✔ | ✔ | ✘ |
Fisher Callhome Spanish [30] | ASR | 241 | spa | 36K | - | ✘ | ✘ | ✔ |
FLEURS [31] | ASR | 950 | 102 | 268K | CC-BY-4.0 | ✔ | ✔ | ✘ |
GigaSpeech [11] | ASR | 12520 | eng | 2.0M | Apache 2.0 | ✔ | ✘ | ✔ |
GigaST [32] | ST | 24453 | 2 | 4.0M | CC-BY-NC 4.0 | ✔ | ✔ | ✔ |
KsponSpeech [33] | ASR | 960 | kor | 619K | MIT License | ✘ | - | ✘ |
LibriSpeech (LS) [14] | ASR | 897 | eng | 145K | CC-BY-4.0 | ✘ | ✘ | ✔ |
MagicData (Magic.) [34] | ASR | 711 | zho | 573K | CC-BY-NC-ND-4.0 | ✘ | - | ✘ |
Multilingual LibriSpeech (MLS) [35] | ASR | 50670 | 8 | 8.6M | CC-BY-4.0 | ✘ | ✘ | ✔ |
MuST-C - ASR part [36] | ASR | 2657 | eng | 400K | CC-BY-NC-ND-4.0 | ✔ | ✔ | ✔ |
MuST-C - ST part [36] | ST | 8163 | 15 | 1.2M | CC-BY-NC-ND-4.0 | ✔ | ✔ | ✔ |
Googlei18n | ASR | 1326 | 21 | 1.0M | CC BY-SA 4.0 | ✘ | ✔ | ✘ |
ReazonSpeech [37] | ASR | 18864 | jpn | 11.1M | Apache 2.0 | ✔ | - | ✘ |
Russian Open STT [38] | ASR | 4791 | rus | 4.7M | CC-BY-NC | ✘ | ✘ | ✘ |
SPGISpeech [39] | ASR | 4999 | eng | 2.0M | - | ✔ | ✔ | ✘ |
Fisher SwitchBoard (SWBD) [40] | ASR | 3214 | eng | 498K | - | ✘ | ✘ | ✔ |
TEDLIUM3 [41] | ASR | 472 | eng | 67K | CC-BY-NC-ND 3.0 | ✘ | ✘ | ✔ |
VCTK [42] | ASR | 25 | eng | 43K | CC-BY-4.0 | ✔ | ✔ | ✘ |
VoxForge [43] | ASR | 235 | 8 | 148K | GPL | ✘ | ✘ | ✘ |
VoxPopuli - ASR part [44] | ASR | 1702 | 16 | 310K | CC0-1.0 | ✔ | ✔ | ✔ |
VoxPopuli - ST part [44] | ST | 111 | 40 | 21K | CC0-1.0 | ✔ | ✔ | ✔ |
WenetSpeech [12] | ASR | 14963 | zho | 2.2M | CC-BY-4.0 | ✘ | - | ✔ |
Total | | 180396 | 150 | 58.5M | | | | |
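
To get a feel for how the mixture splits between ASR and ST, the per-corpus volumes in Table 1 can be aggregated by type. The snippet below is a minimal sketch: the `ROWS` list and the `volume_by_type` helper are illustrative names, not part of any OWSM tooling, and the hours are copied by hand from the table. Because the per-row values are rounded, their sum need not match the Total row exactly.

```python
# Rows copied by hand from Table 1: (corpus, type, training volume in hours).
# ROWS and volume_by_type are hypothetical names used only for this sketch.
ROWS = [
    ("aidatatang", "ASR", 140),
    ("AISHELL", "ASR", 150),
    ("ami", "ASR", 141),
    ("babel", "ASR", 2115),
    ("CommonVoice", "ASR", 16360),
    ("CoVoST2", "ST", 8550),
    ("Fisher Callhome Spanish", "ASR", 241),
    ("FLEURS", "ASR", 950),
    ("GigaSpeech", "ASR", 12520),
    ("GigaST", "ST", 24453),
    ("KsponSpeech", "ASR", 960),
    ("LibriSpeech", "ASR", 897),
    ("MagicData", "ASR", 711),
    ("Multilingual LibriSpeech", "ASR", 50670),
    ("MuST-C - ASR part", "ASR", 2657),
    ("MuST-C - ST part", "ST", 8163),
    ("Googlei18n", "ASR", 1326),
    ("ReazonSpeech", "ASR", 18864),
    ("Russian Open STT", "ASR", 4791),
    ("SPGISpeech", "ASR", 4999),
    ("Fisher SwitchBoard", "ASR", 3214),
    ("TEDLIUM3", "ASR", 472),
    ("VCTK", "ASR", 25),
    ("VoxForge", "ASR", 235),
    ("VoxPopuli - ASR part", "ASR", 1702),
    ("VoxPopuli - ST part", "ST", 111),
    ("WenetSpeech", "ASR", 14963),
]


def volume_by_type(rows):
    """Sum training hours per task type (ASR or ST)."""
    totals = {}
    for _corpus, task, hours in rows:
        totals[task] = totals.get(task, 0) + hours
    return totals


totals = volume_by_type(ROWS)
overall = sum(totals.values())

# Report each task's hours and its share of the summed rows.
for task, hours in sorted(totals.items()):
    print(f"{task}: {hours} h ({hours / overall:.1%})")
print(f"sum of rows: {overall} h")
```

The same pattern extends to other columns, for example grouping hours by whether the original labels carry punctuation, which is the kind of heterogeneity the table is meant to expose.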