This page lists speech-related research conducted by the team led by Xu Tan. Topics include text to speech, singing voice synthesis, music generation, and automatic speech recognition. Some of this research is open-sourced via NeuralSpeech and Muzic.

We are hiring researchers in audio/video generation and LLMs. Please contact tanxu2012@gmail.com if you are interested.

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

February 08, 2024

PromptTTS 2: Describing and Generating Voices with Text Prompt

September 07, 2023

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

April 19, 2023

PromptTTS: Controllable Text-to-Speech with Text Descriptions

November 22, 2022

VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing

August 30, 2022

BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

May 29, 2022

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

May 03, 2022

Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech

April 02, 2022

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

March 06, 2022

Speech-T: Transducer for Text to Speech and Beyond

October 06, 2021

TeleMelody: Lyric-to-Melody Generation with a Template-Based Two-Stage Method

September 21, 2021

DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling

August 16, 2021

PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Driven Adaptive Prior

June 11, 2021

AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style

June 02, 2021

AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data

March 05, 2021

AdaSpeech: Adaptive Text to Speech for Custom Voice

March 01, 2021

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

February 10, 2021

SongMASS: Automatic Song Writing with Pre-training and Alignment Constraint

December 14, 2020

November 03, 2020

DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling

October 14, 2020

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

September 02, 2020

PopMAG: Pop Music Accompaniment Generation

August 01, 2020

UWSpeech: Speech to Speech Translation for Unwritten Languages

June 12, 2020

MultiSpeech: Multi-Speaker Text to Speech with Transformer

May 09, 2020

March 01, 2020

DeepSinger: Singing Voice Synthesis with Data Mined From the Web

February 14, 2020

LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition

February 02, 2020

FastSpeech: Fast, Robust and Controllable Text to Speech

May 10, 2019

Almost Unsupervised Text to Speech and Automatic Speech Recognition

April 10, 2019