Title: From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
Authors: Kshitij Ambilduke, Ben Peters, Sonal Sannigrahi, Anil Keshwani, Tsz Kin Lam, Bruno Martins, Marcely Zanon Boito, André F. T. Martins
Published: 13th March 2025 (Thursday) @ 17:57:32
Link: http://arxiv.org/abs/2503.10620v1
Abstract
Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them very attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showcasing that discretized speech input integration as an additional language is feasible during LLM adaptation. We make our code and models available to the community.
Summary by Sonal Sannigrahi targeting a wider audience:
We introduce SPIRE, a speech-augmented language model that adapts the capabilities of the text-only Tower to the speech domain. Unlike previous approaches that rely on an automatic speech recognition (ASR) pipeline or expensive projector-based training, SPIRE uses speech discretization via HuBERT tokens to convert continuous speech signals into discrete speech units (DSUs). This method is not only more computationally efficient but also allows us to treat speech as an additional language rather than a separate modality. This is interesting for two reasons: (i) we can reuse well-tested training methods for building multilingual text models to adapt our LM to speech, and (ii) we can apply similar data sampling strategies with success.
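To make the discretization step concrete, below is a minimal sketch of one common HuBERT-plus-k-means pipeline for turning audio into DSU token strings. The encoder size, feature layer, cluster count, and unit-token format here are illustrative assumptions, not SPIRE's exact recipe.

```python
# Sketch: speech -> discrete speech units (DSUs) via HuBERT features + k-means.
# Layer choice, 500 clusters, and the "<dsu_i>" token format are assumptions.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE      # pretrained HuBERT encoder
hubert = bundle.get_model().eval()

def extract_features(wav_path: str, layer: int = 6) -> torch.Tensor:
    """Return frame-level HuBERT features (frames x dim) for one utterance."""
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(dim=0, keepdim=True)        # force mono, shape (1, time)
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.no_grad():
        feats, _ = hubert.extract_features(wav)  # list of per-layer features
    return feats[layer].squeeze(0)

# Fit a k-means codebook on features pooled over many utterances (paths are placeholders).
train_feats = torch.cat([extract_features(p) for p in ["a.wav", "b.wav"]])
codebook = KMeans(n_clusters=500, n_init=10).fit(train_feats.numpy())

def speech_to_dsus(wav_path: str) -> str:
    """Map an utterance to a DSU string the LLM can read like text in another language."""
    ids = codebook.predict(extract_features(wav_path).numpy())
    # Collapse consecutive repeats, a common trick to shorten DSU sequences.
    deduped = [i for n, i in enumerate(ids) if n == 0 or i != ids[n - 1]]
    return " ".join(f"<dsu_{i}>" for i in deduped)
```

The DSU vocabulary can then be added to the LLM's tokenizer, so speech enters the model through the same embedding table as text.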
Similar to Tower, we apply a two-stage training process, Continued Pre-training (CPT) followed by Instruction Tuning (IT): the first stage teaches the model to model speech, and the second expands its task coverage. During CPT we use mixed speech and text data covering the 10 original Tower languages, and during IT we train on task-specific datasets covering ASR, Machine Translation (MT), and Speech Translation (ST). Our model uses a total of 42.5K hours of speech, significantly less than other speech models, and achieves competitive results while preserving Tower's text-only performance on translation and translation-related tasks.
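The "speech as an additional language" framing means ASR, ST, and MT can share one translation-style prompt template during instruction tuning. The sketch below illustrates this idea; the wording and field names are hypothetical, not SPIRE's actual templates.

```python
# Sketch: one translation-style template covering text and DSU "source languages".
# The prompt wording and example DSU tokens are illustrative assumptions.
from typing import Optional

def make_prompt(dsus: Optional[str], text: Optional[str], tgt_lang: str) -> str:
    """Build a translation-style prompt; speech (DSUs) and text share one template."""
    source = f"Speech: {dsus}" if dsus is not None else f"Text: {text}"
    return f"Translate the following into {tgt_lang}.\n{source}\n{tgt_lang}:"

# ASR = "translate" English speech into English text; ST = speech into another language.
asr_prompt = make_prompt(dsus="<dsu_17> <dsu_402> <dsu_9>", text=None, tgt_lang="English")
st_prompt  = make_prompt(dsus="<dsu_17> <dsu_402> <dsu_9>", text=None, tgt_lang="German")
mt_prompt  = make_prompt(dsus=None, text="Hello, world!", tgt_lang="Portuguese")
```

During CPT, DSU and text documents would be mixed under a sampling ratio; during IT, prompts like these are paired with reference outputs.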
Experiments show that SPIRE can effectively handle ASR, MT, and ST tasks. For ASR, the fully trained SPIREFULL model outperforms Whisper-base and matches or surpasses several multimodal baselines trained on far larger datasets, though it lags behind large-scale systems like Whisper-large-v3 and SeamlessM4T. On MT benchmarks, SPIRE maintains Tower's strong translation performance across 10 languages, confirming that integrating speech does not compromise text-based capabilities. In ST, SPIREFULL demonstrates robustness, performing competitively in both direct and cascaded settings, especially when cascading ASR and MT. However, its direct ST performance remains dataset-dependent and is weaker than models trained on larger and more diverse speech corpora.
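For readers unfamiliar with the direct vs. cascaded distinction, the sketch below contrasts the two inference modes using a generic text-generation callable; the `generate` function and prompt wording are placeholders, not SPIRE's actual inference API.

```python
# Sketch: direct vs. cascaded speech translation with one speech-aware LLM.
# `generate` stands in for any prompt-in, text-out decoding function (assumed).
from typing import Callable

def direct_st(dsus: str, tgt_lang: str, generate: Callable[[str], str]) -> str:
    """One pass: DSUs in, target-language text out."""
    prompt = f"Translate the following speech into {tgt_lang}.\nSpeech: {dsus}\n{tgt_lang}:"
    return generate(prompt)

def cascaded_st(dsus: str, tgt_lang: str, generate: Callable[[str], str]) -> str:
    """Two passes with the same model: transcribe first (ASR), then translate the transcript (MT)."""
    transcript = generate(f"Transcribe the following speech into English.\nSpeech: {dsus}\nEnglish:")
    return generate(f"Translate the following text into {tgt_lang}.\nText: {transcript}\n{tgt_lang}:")
```

The cascaded route lets the model lean on its strong text MT ability, which is consistent with the reported result that cascading ASR and MT is especially competitive.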