Title: Parakeet: A natural sounding, conversational text-to-speech model
Authors: Jordan Darefsky, Ge Zhu, Zhiyao Duan
Published: 2024-05-13
Link: https://jordandarefsky.com/blog/2024/parakeet/
Abstract
In this blog post, I describe my work with Ge Zhu and Professor Zhiyao Duan to develop an initial version of a text-to-speech (TTS) model we call Parakeet. The research presented in this post was completed early last fall and was supported by Google's TPU Research Cloud (TRC) program. This project would not have been possible without their immense generosity.
Parakeet takes as input a text prompt, optionally containing multiple speakers or non-verbal events like "laughter," and outputs up to 30 seconds of corresponding audio. In designing Parakeet, we had two main goals in mind for our model:
To be architecturally simple, staying as close to end-to-end as possible.
To be able to produce natural, conversational speech. This includes generating multi-speaker samples, as well as generating common "events" found in dialogue, such as laughter and coughing.
A brief overview of our methodology is as follows:
We curate a ~100,000 hour dataset of audio-transcription pairs. ~60,000 hours come from the Spotify Podcast Dataset [1]; we fine-tune a Whisper model to provide speaker- and event-annotated transcriptions and then backtranslate the entirety of the Spotify Podcast audio using this model.
We train an autoregressive transformer to predict audio tokens (specifically, DAC [2] codes), conditioned on raw transcription text. We develop a modified classifier-free guidance (CFG) technique, which we call CFG-filter, to improve quality.
We plan to release our fine-tuned Whisper models and possibly the generative model (and/or future improved versions). The generative model would have to be released under a non-commercial license due to our datasets.
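Since the generative model produces at most 30 seconds of audio, the backtranslated podcasts presumably need to be cut into training windows no longer than that. Below is a minimal sketch of one greedy way to pack Whisper-style (start, end, text) segments into such windows; the `chunk_segments` helper and the exact packing scheme are hypothetical illustrations, not the pipeline described in the post:

```python
def chunk_segments(segments, max_len=30.0):
    """Greedily pack (start, end, text) transcript segments into windows
    of at most max_len seconds. A lone segment longer than max_len is
    kept as its own window. Hypothetical helper, not the authors' code."""
    chunks, cur, cur_start = [], [], None
    for start, end, text in segments:
        if cur_start is None:
            cur_start = start
        # If adding this segment would exceed the window, flush first.
        if end - cur_start > max_len and cur:
            chunks.append((cur_start, cur[-1][1], " ".join(t for _, _, t in cur)))
            cur, cur_start = [], start
        cur.append((start, end, text))
    if cur:
        chunks.append((cur_start, cur[-1][1], " ".join(t for _, _, t in cur)))
    return chunks

windows = chunk_segments([(0.0, 10.0, "a"), (10.0, 20.0, "b"), (20.0, 35.0, "c")])
```

Each resulting window is then a candidate (audio slice, transcript) training pair.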
This project is a work in progress, but below are samples from our model, followed by a more detailed methodology. These samples are cherry-picked (generally the best of 4 to 16 generations). We hope to eventually improve our model so that our outputs are more consistently high-quality.