Title: CVSS Corpus and Massively Multilingual Speech-to-Speech Translation
Authors: Ye Jia, Michelle Tadmor Ramanovich, Quan Wang, Heiga Zen
Published: 11 January 2022
Link: http://arxiv.org/abs/2201.03713v3
Abstract
We introduce CVSS, a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus, by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems. Two versions of translation speeches are provided: 1) CVSS-C: All the translation speeches are in a single high-quality canonical voice; 2) CVSS-T: The translation speeches are in voices transferred from the corresponding source speeches. In addition, CVSS provides normalized translation text which matches the pronunciation in the translation speech. On each version of CVSS, we built baseline multilingual direct S2ST models and cascade S2ST models, verifying the effectiveness of the corpus. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state-of-the-art trained on the corpus without extra data by 5.8 BLEU. Nevertheless, the performance of the direct S2ST models approaches the strong cascade baselines when trained from scratch, and with only 0.1 or 0.7 BLEU difference on ASR transcribed translation when initialized from matching ST models.
CVSS creates its parallel multilingual-to-English speech-to-speech translation data synthetically, so the target-side English speech is produced by TTS models:
CVSS is constructed by synthesizing the translation text from CoVoST 2 into speech using two state-of-the-art TTS models. This section describes the two TTS models being used, both of which were trained on the LibriTTS corpus (Zen et al., 2019).
- PnG NAT (Figure 1) is a combination of PnG BERT (Jia et al., 2021) and Non-Attentive Tacotron (NAT) (Shen et al., 2020). It synthesizes speech as natural as that of professional human speakers (Jia et al., 2021); see §4.1 for details.
- PnG NAT with voice cloning To transfer the voices from the source speech to the translation speech, we modified PnG NAT to support zero-shot cross-lingual voice cloning (VC) by incorporating a speaker encoder in the same way as in Jia et al. (2018). The augmented TTS model is illustrated in Figure 2.
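The conditioning scheme of Jia et al. (2018) can be sketched in a few lines: a fixed speaker embedding is concatenated to every frame of the text-encoder output before decoding. The function name and toy dimensions below are hypothetical; a real model would do this with tensors inside the network, not Python lists.

```python
# Illustrative sketch, not the paper's actual implementation: condition the
# decoder by appending the same utterance-level speaker embedding to each
# encoder frame (Jia et al., 2018). Sizes here are toy values.

def condition_on_speaker(encoder_out, spk_emb):
    """Append the speaker embedding to each encoder frame.

    encoder_out: T frames, each a list of D floats.
    spk_emb: a list of E floats from the speaker encoder.
    Returns T frames of D + E floats.
    """
    return [frame + spk_emb for frame in encoder_out]

encoder_out = [[0.1, 0.2], [0.3, 0.4]]  # T=2 frames, D=2
spk_emb = [0.9, -0.9, 0.5]              # E=3
conditioned = condition_on_speaker(encoder_out, spk_emb)
print(conditioned)  # [[0.1, 0.2, 0.9, -0.9, 0.5], [0.3, 0.4, 0.9, -0.9, 0.5]]
```

Because the speaker embedding is computed by a separately trained speaker encoder, the TTS model can clone voices it never saw in training (zero-shot), including voices speaking a different language than the synthesis target.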
CVSS-C is synthesized using the PnG NAT model described in Sec. 4.1. A female speaker "lavocedorata" (ID 3983) from LibriTTS is used as the canonical speaker. Although this speaker has merely 6.7 minutes of recordings in the training set, these recordings are highly fluent, clean and natural.
CVSS-T is synthesized using the augmented PnG NAT model described in Sec. 4.2 for cross-lingual voice cloning. The speaker embedding computed on the source non-English speech is used for synthesizing the English translation speech.
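A minimal sketch of the kind of utterance-level speaker embedding such a speaker encoder produces: a d-vector-style recipe averages frame-level embeddings over the utterance and L2-normalizes the result. The function and toy values below are illustrative assumptions, not the paper's implementation.

```python
import math

def dvector(frame_embs):
    """L2-normalized mean of frame-level embeddings (d-vector-style recipe).

    frame_embs: list of frames, each a list of floats of equal length.
    """
    n, dim = len(frame_embs), len(frame_embs[0])
    mean = [sum(f[i] for f in frame_embs) / n for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]

# Toy example: two 2-dim frame embeddings.
print(dvector([[3.0, 0.0], [1.0, 0.0]]))  # [1.0, 0.0]
```

The normalization means the embedding captures voice characteristics rather than loudness or duration, which is what lets an embedding computed on non-English source speech condition English synthesis.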
Vocoder A neural vocoder based on WaveRNN (Kalchbrenner et al., 2018) is used for converting the mel-spectrograms synthesized by the TTS models into waveforms. This neural vocoder is trained on a proprietary dataset of 420 hours of studio recordings from 98 professional speakers in 6 English accents.
Data Format The synthesized speech is stored as monophonic WAV files at 24 kHz sample rate and in 16-bit linear PCM format.
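The stated format (mono, 24 kHz, 16-bit linear PCM WAV) can be produced with Python's standard-library `wave` module. The helper below is a sketch for writing float samples in [-1, 1] into that format; the file name and tone are just an example.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # CVSS sample rate

def write_wav(path, samples):
    """Write float samples in [-1, 1] as a mono 16-bit linear-PCM WAV at 24 kHz."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)               # monophonic
        w.setsampwidth(2)               # 16-bit samples
        w.setframerate(SAMPLE_RATE)
        # Clip to [-1, 1] and pack as little-endian signed 16-bit integers.
        pcm = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        w.writeframes(pcm)

# Example: 0.1 s of a 440 Hz tone.
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE // 10)]
write_wav("tone.wav", tone)
```

Reading a CVSS file back with `wave.open(path, "rb")` should report one channel, a sample width of 2 bytes, and a frame rate of 24000.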