Title: Scaling Speech Technology to 1,000+ Languages
Authors: Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli
Published: 22nd May 2023 (Monday) @ 22:09:41
Link: http://arxiv.org/abs/2305.13516v1
Abstract
Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages, which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
Notes
Motivation: the Transformer acoustic model cannot process long recordings in one pass, so the audio is chunked before generating posterior probabilities:
Generating Posterior Probabilities. Forced alignment requires posterior probabilities from an acoustic model which we use for alignment (§3.1.4). This acoustic model is a Transformer, which requires substantial amounts of memory to store activations, making it infeasible to use for long audio files. As a workaround, we chunk the audio files into 15-second segments, generate posterior probabilities for each audio frame using the alignment model, and then concatenate these posterior probabilities into a single matrix again. The acoustic model is trained with Connectionist Temporal Classification (CTC; Graves et al. 2006).
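A minimal sketch of this chunking step (the model interface, function name, and chunk handling are my assumptions for illustration, not the paper's actual code):

```python
import torch

CHUNK_SECONDS = 15  # segment length used in the paper

def chunked_log_posteriors(model, waveform, sample_rate):
    """Split a long waveform into 15-second chunks, run the acoustic model
    on each chunk, and concatenate the frame posteriors into one matrix.

    Assumes `model` maps a (1, num_samples) waveform to (1, num_frames,
    num_classes) logits; the paper's actual model interface may differ.
    """
    chunk = CHUNK_SECONDS * sample_rate
    pieces = []
    with torch.inference_mode():
        for start in range(0, waveform.shape[-1], chunk):
            segment = waveform[:, start:start + chunk]
            logits = model(segment)                  # (1, frames, classes)
            pieces.append(torch.log_softmax(logits, dim=-1))
    # Single (T, C) posterior matrix covering the whole recording.
    return torch.cat(pieces, dim=1).squeeze(0)
```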
They use CTC to obtain the most probable text-audio alignment:
Forced Alignment using CTC. Next, we perform forced alignment which finds the most likely path in the posterior probabilities for a given input audio sequence of length T and a text transcription of length L. These posterior probabilities require O(T × L) memory and a path will be of length T. This path is computed using the Viterbi algorithm. There are open source libraries implementing the algorithm on CPU [Kürzinger et al., 2020, Kahn et al., 2022]; however, the CPU versions are slow to run, particularly on long recordings, as we will show below.
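To make the recursion concrete, here is a minimal NumPy sketch of Viterbi forced alignment over the standard blank-interleaved CTC trellis (names and structure are mine; note that it keeps the full score and backpointer tables in memory, which is exactly the O(T × L) cost the paper's GPU version avoids):

```python
import numpy as np

NEG_INF = float("-inf")

def ctc_forced_align(log_probs, targets, blank=0):
    """Most likely frame-level alignment of `targets` under CTC posteriors.

    log_probs: (T, C) array of frame log posteriors from the acoustic model.
    targets:   list of transcript token ids (length L), none equal to blank.
    Returns a length-T list giving the label (or blank) of each frame.
    """
    assert len(targets) > 0
    T = log_probs.shape[0]
    # Blank-interleaved label sequence [blank, y1, blank, ..., yL, blank].
    ext = [blank]
    for y in targets:
        ext += [y, blank]
    S = len(ext)  # S = 2L + 1

    delta = np.full((T, S), NEG_INF)         # best score of a path ending at (t, s)
    back = np.zeros((T, S), dtype=np.int64)  # backpointers: the O(T x L) storage

    delta[0, 0] = log_probs[0, ext[0]]
    delta[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay on s, advance from s-1, or skip the
            # blank from s-2 (illegal for blanks and repeated labels).
            cands = [delta[t - 1, s]]
            if s >= 1:
                cands.append(delta[t - 1, s - 1])
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(delta[t - 1, s - 2])
            k = int(np.argmax(cands))
            delta[t, s] = cands[k] + log_probs[t, ext[s]]
            back[t, s] = s - k
    # A valid path must end on the final label or the final blank.
    s = S - 1 if delta[T - 1, S - 1] >= delta[T - 1, S - 2] else S - 2
    path = [ext[s]]
    for t in range(T - 1, 0, -1):
        s = back[t, s]
        path.append(ext[s])
    return path[::-1]
```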
They implement a GPU version of the CTC forced alignment algorithm which uses O(L) GPU memory instead of O(T × L), where T is the length of the "path" (the number of frames or time samples in the audio) and L is the length of the text transcript:
Efficient Forced Alignment on GPUs. In order to make forced alignment efficient for our purpose, we implemented a GPU version that computes the Viterbi path in a memory-efficient way. Storing all O(T × L) forward values for the Viterbi algorithm is infeasible on GPUs due to memory constraints. We therefore only store forward values for the current and the previous time-step and regularly transfer the computed backtracking matrices to CPU memory. This reduces the required GPU memory to O(L) compared to O(T × L) and enables forced alignment for very long audio sequences at high speed. Appendix A illustrates the algorithm and an implementation is available as part of TorchAudio [Yang et al., 2021].
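Since the implementation ships with TorchAudio (2.1+), it can be called via `torchaudio.functional.forced_align`; the emission and target tensors below are random placeholders standing in for real acoustic-model outputs and transcript token ids:

```python
import torch
import torchaudio.functional as F

# Placeholder inputs: (1, T, C) log posteriors and (1, L) transcript token ids
# (batch size must be 1; blank is assumed to be index 0 here).
log_probs = torch.randn(1, 400, 32).log_softmax(dim=-1)
targets = torch.randint(1, 32, (1, 12))

# Returns the per-frame label sequence and per-frame scores; moving
# log_probs to a CUDA device selects the GPU implementation described above.
alignments, scores = F.forced_align(log_probs, targets, blank=0)

# Collapse blanks and repeats into per-token spans with frame boundaries.
token_spans = F.merge_tokens(alignments[0], scores[0].exp(), blank=0)
for span in token_spans:
    print(f"token {span.token}: frames [{span.start}, {span.end}) score {span.score:.3f}")
```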
There are alternative alignment implementations available in popular libraries:
Figure 4 shows that the forced alignment implementation scales much better to longer sequences than CPU alternatives such as ctc-segmentation [Kürzinger et al., 2020], a popular segmentation library used in ESPNet [Watanabe et al., 2018], SpeechBrain [Ravanelli et al., 2021] and Flashlight [Kahn et al., 2022].
- ctc-segmentation [Kürzinger et al., 2020] - a popular segmentation library used in ESPNet [Watanabe et al., 2018]
- SpeechBrain [Ravanelli et al., 2021]
- Flashlight [Kahn et al., 2022]