I am a final-year Ph.D. student in the Department of Electrical and Computer Engineering at Carnegie Mellon University. I am fortunate to be supervised by Prof. Shinji Watanabe (Sep 2021 - present) and Prof. Ian Lane (Aug 2020 - Aug 2021; now at UC Santa Cruz). I received my bachelor’s degree from the Department of Electronic Engineering at Tsinghua University in 2020.

In Summer 2024, I was an AI Research Intern at NVIDIA NeMo, where I worked on joint speech-text language models. In Summer 2023, I was a research scientist intern at Meta AI FAIR, where I worked on speech language models for voice-preserved textless speech-to-speech translation. In Summer 2022, I was a speech recognition intern at ASAPP, where I worked on speech model compression.

My research area is speech and language processing. My Ph.D. thesis focuses on developing effective and efficient open speech foundation models. I have led the Open Whisper-style Speech Models (OWSM) project at CMU WAVLab, developing the first large-scale, fully open speech foundation model from academia. Recently, I have also become interested in integrating speech capabilities into large language models.

I have published first-author papers at top-tier AI and speech conferences, such as ICML, ACL, ICASSP, and INTERSPEECH. Several projects have received notable recognition, including the Best Paper Award at SLT 2024, the Best Paper Award at EMNLP 2024, Top 3% Paper Recognition at ICASSP 2023 (3 papers), and Best Student Paper Award Finalist at SPIE Medical Imaging 2020. I also contribute to ESPnet, a widely used speech processing toolkit. Specifically, I have been the primary contributor to several major projects: