Title: SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
Authors: Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, Furu Wei
Published: 30 September 2022
Link: http://arxiv.org/abs/2209.15329v3

Abstract

How to boost speech pre-training with textual data remains an open problem, because speech and text are distinct modalities with very different characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers, a phoneme-unit tokenizer and a hidden-unit tokenizer, to bridge the speech and text modalities; both can be trained with a small amount of paired speech-text data. Using the trained tokenizers, we convert unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify speech and text in the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks, including automatic speech recognition, speech translation, and the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM.
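To make the unified-vocabulary idea concrete, below is a minimal PyTorch sketch, not the authors' implementation. All names, vocabulary sizes, and model dimensions here are hypothetical; the actual tokenizers and architecture are defined in the paper and repository. The sketch only illustrates the core setup: speech (via a hidden-unit tokenizer) and text (via a phoneme-unit tokenizer) are both mapped to IDs from one shared discrete vocabulary, so a single embedding table and Transformer can process either modality.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
NUM_UNITS = 500   # shared discrete unit vocabulary
D_MODEL = 256

class UnifiedEncoder(nn.Module):
    """Toy shared Transformer over a unified discrete unit space.

    Both speech (after a unit tokenizer) and text (after a phoneme
    tokenizer) arrive as sequences of unit IDs from the same
    vocabulary, so one embedding table and one encoder serve both.
    """
    def __init__(self, num_units=NUM_UNITS, d_model=D_MODEL):
        super().__init__()
        self.embed = nn.Embedding(num_units, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, unit_ids):
        return self.encoder(self.embed(unit_ids))

# Stand-ins for the trained tokenizers: in SpeechLM these map raw
# audio / text into unit IDs; here we fake their outputs.
speech_units = torch.randint(0, NUM_UNITS, (2, 120))  # hidden units from audio
text_units = torch.randint(0, NUM_UNITS, (2, 40))     # phoneme units from text

model = UnifiedEncoder()
print(model(speech_units).shape)  # torch.Size([2, 120, 256])
print(model(text_units).shape)    # torch.Size([2, 40, 256])
```

Because both input streams index the same embedding table, representations of speech and text are pushed into one semantic space, which is the alignment property the pre-training objective exploits.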