Title: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Authors: Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei
Published: 5th January 2023 (Thursday) @ 15:37:15
Link: http://arxiv.org/abs/2301.02111v1
Abstract
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.
Vall-E - Notes
Paper: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (http://arxiv.org/abs/2301.02111v1)
Builds on EnCodec: High Fidelity Neural Audio Compression
Codebase for EnCodec: https://github.com/facebookresearch/encodec
Actions:
- On vector quantization of neural-network floating-point outputs: van den Oord et al. 2017 (VQ-VAE) and Zeghidour et al. 2021 (SoundStream)
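The residual vector quantization from Zeghidour et al. 2021 (used by EnCodec, and hence the source of Vall-E's discrete codes) can be sketched in a few lines: each codebook quantizes the residual left over by the previous stages, so later codebooks refine earlier ones. A minimal pure-Python illustration with made-up toy codebooks (the function and codebook values here are illustrative, not EnCodec's actual codebooks or API):

```python
import math

def nearest(codebook, vec):
    """Index of the codeword closest to vec (Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(codebook[i], vec))

def residual_vector_quantize(vec, codebooks):
    """Residual VQ: each codebook quantizes what the previous stages missed.

    Returns the per-stage code indices (the "discrete codes" a codec LM
    would model) and the reconstruction from the chosen codewords.
    """
    residual = list(vec)
    indices = []
    for cb in codebooks:
        k = nearest(cb, residual)
        indices.append(k)
        # subtract the chosen codeword; the next stage sees only the error
        residual = [r - c for r, c in zip(residual, cb[k])]
    recon = [v - r for v, r in zip(vec, residual)]
    return indices, recon

# toy example: a coarse codebook plus a fine one that corrects its error
codebooks = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],          # coarse
    [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, 0.0)],      # fine
]
x = (0.8, 0.1)
idx, recon = residual_vector_quantize(x, codebooks)
```

In EnCodec this runs per audio frame with several such codebooks, and Vall-E predicts the first-codebook indices autoregressively and the remaining ones non-autoregressively.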
Code
Unofficial Vall-E PyTorch implementation by enhuiz (2.2k stars ⭐)
Repo: https://github.com/enhuiz/vall-e (model code under vall_e/)
Fork of the enhuiz repo fixing DeepSpeed: kgasenzer/vall-e, an unofficial PyTorch implementation of the audio LM VALL-E (https://github.com/kgasenzer/vall-e)
See related: