Title: Parakeet: A natural sounding, conversational text-to-speech model
Authors: Jordan Darefsky, Ge Zhu, Zhiyao Duan
Published: 2024-05-13
Link: https://jordandarefsky.com/blog/2024/parakeet/
Abstract
In this blog post, I describe my work with Ge Zhu and Professor Zhiyao Duan to develop an initial version of a text-to-speech (TTS) model we call Parakeet. The research presented in this post was completed early last fall and was supported by Google's TPU Research Cloud (TRC) program. This project would not have been possible without their immense generosity.
Parakeet takes as input a text prompt, optionally containing multiple speakers or non-verbal events like "laughter," and outputs up to 30 seconds of corresponding audio. In designing Parakeet, we had two main goals in mind for our model:
To be architecturally simple, staying as close to end-to-end as possible.
To be able to produce natural, conversational speech. This includes generating multi-speaker samples, as well as generating common "events" found in dialogue, such as laughter and coughing.
A brief overview of our methodology is as follows:
We curate a ~100,000 hour dataset of audio-transcription pairs. ~60,000 hours come from the Spotify Podcast Dataset [1]; we fine-tune a Whisper model to provide speaker- and event-annotated transcriptions and then backtranslate the entirety of the Spotify Podcast audio using this model.
We train an autoregressive transformer to predict audio tokens (specifically, DAC [2] codes), conditioned on raw transcription text. We develop a modified classifier-free guidance (CFG) technique, which we call CFG-filter, to improve quality.
We plan to release our fine-tuned Whisper models and possibly the generative model (and/or future improved versions). The generative model would have to be released under a non-commercial license due to our datasets.
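Since the generative model produces at most 30 seconds of audio, the backtranslated podcasts presumably need to be cut into training windows no longer than that. Below is a minimal sketch of one greedy way to pack Whisper-style (start, end, text) segments into such windows; the `chunk_segments` helper and the exact packing scheme are hypothetical illustrations, not the pipeline described in the post:

```python
def chunk_segments(segments, max_len=30.0):
    """Greedily pack (start, end, text) transcript segments into windows
    of at most max_len seconds. A lone segment longer than max_len is
    kept as its own window. Hypothetical helper, not the authors' code."""
    chunks, cur, cur_start = [], [], None
    for start, end, text in segments:
        if cur_start is None:
            cur_start = start
        # If adding this segment would exceed the window, flush first.
        if end - cur_start > max_len and cur:
            chunks.append((cur_start, cur[-1][1], " ".join(t for _, _, t in cur)))
            cur, cur_start = [], start
        cur.append((start, end, text))
    if cur:
        chunks.append((cur_start, cur[-1][1], " ".join(t for _, _, t in cur)))
    return chunks

windows = chunk_segments([(0.0, 10.0, "a"), (10.0, 20.0, "b"), (20.0, 35.0, "c")])
```

Each resulting window is then a candidate (audio slice, transcript) training pair.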
This project is a work in progress, but below are samples from our model, followed by a more detailed methodology. These samples are cherry-picked (generally the best of 4 to 16 generations). We hope to eventually improve our model so that our outputs are more consistently high-quality.