🪴 Anil's Garden

Yifan Peng

19 Dec 2025, 2 min read

Website: https://pyf98.github.io/

I am a final-year Ph.D. student in the Department of Electrical and Computer Engineering at Carnegie Mellon University. I am fortunate to be supervised by Prof. Shinji Watanabe (Sep 2021 - present) and to have been supervised by Prof. Ian Lane (Aug 2020 - Aug 2021; now at UC Santa Cruz). I received my bachelor’s degree from the Department of Electronic Engineering at Tsinghua University in 2020.

In Summer 2024, I was an AI research intern at NVIDIA NeMo, where I worked on joint speech-text language models. In Summer 2023, I was a research scientist intern at Meta AI FAIR, where I worked on speech language models for voice-preserved textless speech-to-speech translation. In Summer 2022, I was a speech recognition intern at ASAPP, where I worked on speech model compression.

My research area is speech and language processing. My Ph.D. thesis focuses on developing effective and efficient open speech foundation models. I have led the Open Whisper-style Speech Models (OWSM) project at CMU WAVLab, developing the first large-scale, fully open speech foundation model from academia. Recently, I have also become interested in integrating speech capabilities into large language models.

I have published first-author papers at top-tier AI and speech conferences, such as ICML, ACL, ICASSP, and INTERSPEECH. Several projects have received notable recognition, including the Best Paper Award at SLT 2024, the Best Paper Award at EMNLP 2024, Top 3% Paper Recognition at ICASSP 2023 (3 papers), and a Best Student Paper Award finalist selection at SPIE Medical Imaging 2020. I also contribute to ESPnet, a widely used speech processing toolkit. Specifically, I have been the primary contributor to several major projects:

Novel speech encoder architecture: Branchformer (ICML’22), E-Branchformer vs Conformer (INTERSPEECH’23)
Speech model compression: I3D (ICASSP’23 Top 3%), HJ-Pruning (ICASSP’23 Top 3%), DPHuBERT (INTERSPEECH’23)
Open speech foundation models: OWSM (ASRU’23), OWSM v3.1 (INTERSPEECH’24), OWSM-CTC (ACL’24)
Speech language models: SpeechLM analysis, MSLM-S2ST, VoiceTextBlender, and more to follow

Backlinks

DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models
Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech
E-Branchformer: Branchformer with Enhanced merging for speech recognition
ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit
ESPnet-SpeechLM: An Open Speech Language Model Toolkit
Granary: Speech Recognition and Translation Dataset in 25 European Languages
I3D: Transformer architectures with input-dependent dynamic depth for speech recognition
OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning
OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
On The Landscape of Spoken Language Models: A Comprehensive Survey
On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
OpusLM: A Family of Open Unified Speech Language Models
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Towards Robust Speech Representation Learning for Thousands of Languages
VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition/Synthesis and Speech/Text Continuation Tasks
