Title: On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
Authors: Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, Shinji Watanabe
Published: 13 June 2024
Link: http://arxiv.org/abs/2406.09282v1

Abstract

The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the impacts of this data heterogeneity. Our study begins with a detailed analysis of each dataset, from which we derive two key strategies: data filtering with a proxy task to enhance data quality, and the incorporation of punctuation and true-casing using an open large language model (LLM). With all other configurations held constant, OWSM v3.2 improves performance over the OWSM v3.1 baseline while using 15% less training data.


Table 1: Statistics on the OWSM data mixture

Volume includes only the training subset. The data volume may differ from the officially reported volume for each dataset due to our data preparation policy. In the Language column, three-character language IDs follow the ISO 639-3 standard; numbers denote the number of languages for ASR data and the number of translation directions for ST data. License information is based on our collection. Punctuation and Case-Sensitivity specify whether the original text labels contain punctuation and are case-sensitive; a dash (-) means the language is not case-sensitive. Long-Form specifies whether segmentation information is provided to splice short clips into long-form examples.

| Corpus | Type | Volume (h) | Language | # Examples | License | Punctuation | Case-Sensitivity | Long-Form |
|---|---|---|---|---|---|---|---|---|
| aidatatang [24] | ASR | 140 | zho | 164K | CC-BY-NC-ND-4.0 | ✗ | - | ✗ |
| AISHELL [25] | ASR | 150 | zho | 120K | Apache 2.0 | ✗ | - | ✗ |
| ami [26] | ASR | 141 | eng | 24K | CC-BY-4.0 | ✗ | ✗ | ✓ |
| babel [27] | ASR | 2115 | 25 | 318K | - | ✗ | ✗ | ✓ |
| CommonVoice (CV) [28] | ASR | 16360 | 104 | 11.8M | CC0-1.0 | ✓ | ✓ | ✗ |
| CoVoST2 [29] | ST | 8550 | 22 | 5.9M | CC-BY-NC 4.0 | ✓ | ✓ | ✗ |
| Fisher Callhome Spanish [30] | ASR | 241 | spa | 36K | - | ✗ | ✗ | ✓ |
| FLEURS [31] | ASR | 950 | 102 | 268K | CC-BY-4.0 | ✓ | ✓ | ✗ |
| GigaSpeech [11] | ASR | 12520 | eng | 2.0M | Apache 2.0 | ✓ | ✗ | ✓ |
| GigaST [32] | ST | 24453 | 2 | 4.0M | CC-BY-NC 4.0 | ✓ | ✓ | ✓ |
| KsponSpeech [33] | ASR | 960 | kor | 619K | MIT License | ✗ | - | ✗ |
| LibriSpeech (LS) [14] | ASR | 897 | eng | 145K | CC-BY-4.0 | ✗ | ✗ | ✓ |
| MagicData (Magic.) [34] | ASR | 711 | zho | 573K | CC-BY-NC-ND-4.0 | ✗ | - | ✗ |
| Multilingual LibriSpeech (MLS) [35] | ASR | 50670 | 8 | 8.6M | CC-BY-4.0 | ✗ | ✗ | ✓ |
| MuST-C - ASR part [36] | ASR | 2657 | eng | 400K | CC-BY-NC-ND-4.0 | ✓ | ✓ | ✓ |
| MuST-C - ST part [36] | ST | 8163 | 15 | 1.2M | CC-BY-NC-ND-4.0 | ✓ | ✓ | ✓ |
| Googlei18n | ASR | 1326 | 21 | 1.0M | CC BY-SA 4.0 | ✗ | ✓ | ✗ |
| ReazonSpeech [37] | ASR | 18864 | jpn | 11.1M | Apache 2.0 | ✓ | - | ✗ |
| Russian Open STT [38] | ASR | 4791 | rus | 4.7M | CC-BY-NC | ✗ | ✗ | ✗ |
| SPGISpeech [39] | ASR | 4999 | eng | 2.0M | - | ✓ | ✓ | ✗ |
| Fisher SwitchBoard (SWBD) [40] | ASR | 3214 | eng | 498K | - | ✗ | ✗ | ✓ |
| TEDLIUM3 [41] | ASR | 472 | eng | 67K | CC-BY-NC-ND 3.0 | ✗ | ✗ | ✓ |
| VCTK [42] | ASR | 25 | eng | 43K | CC-BY-4.0 | ✓ | ✓ | ✗ |
| VoxForge [43] | ASR | 235 | 8 | 148K | GPL | ✗ | ✗ | ✗ |
| VoxPopuli - ASR part [44] | ASR | 1702 | 16 | 310K | CC0-1.0 | ✓ | ✓ | ✓ |
| VoxPopuli - ST part [44] | ST | 111 | 40 | 21K | CC0-1.0 | ✓ | ✓ | ✓ |
| WenetSpeech [12] | ASR | 14963 | zho | 2.2M | CC-BY-4.0 | ✗ | - | ✓ |
| Total | | 180396 | 150 | 58.5M | | | | |
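As an illustrative sketch (the record layout and names below are ours, not part of the OWSM codebase), statistics like those in Table 1 can be kept as simple records and aggregated, e.g. to compare the ASR and ST portions of a mixture or to exclude non-commercial (CC-BY-NC*) corpora. Only a hand-transcribed subset of the table is shown.

```python
# Sketch: aggregating a data-mixture table like Table 1.
# The Corpus record and MIXTURE subset are illustrative, not from the paper.
from dataclasses import dataclass


@dataclass
class Corpus:
    name: str
    task: str      # "ASR" or "ST"
    hours: int     # training-subset volume in hours
    license: str


# A hand-transcribed subset of Table 1.
MIXTURE = [
    Corpus("LibriSpeech", "ASR", 897, "CC-BY-4.0"),
    Corpus("GigaSpeech", "ASR", 12520, "Apache 2.0"),
    Corpus("CoVoST2", "ST", 8550, "CC-BY-NC 4.0"),
    Corpus("GigaST", "ST", 24453, "CC-BY-NC 4.0"),
    Corpus("WenetSpeech", "ASR", 14963, "CC-BY-4.0"),
]


def hours_by_task(mixture):
    """Sum training hours per task type (ASR vs. ST)."""
    totals = {}
    for c in mixture:
        totals[c.task] = totals.get(c.task, 0) + c.hours
    return totals


# Corpora whose license carries no non-commercial (NC) clause.
permissive = [c for c in MIXTURE if "NC" not in c.license]

print(hours_by_task(MIXTURE))           # {'ASR': 28380, 'ST': 33003}
print([c.name for c in permissive])     # ['LibriSpeech', 'GigaSpeech', 'WenetSpeech']
```

The same pattern extends to the full table, where per-license or per-language breakdowns can guide which subsets of the mixture are usable under a given redistribution policy.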