Title: LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs
Authors: Pooneh Mousavi, Shubham Gupta, Cem Subakan, Mirco Ravanelli
Published: 24 May 2025
Link: http://arxiv.org/abs/2505.18517v1

Abstract

Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models to general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of the prompts selected across different tasks.
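The dynamic prompt selection described above can be sketched as a pool of learnable key-value pairs: each key is scored against a query derived from the input, and the soft prompts attached to the best-matching keys are prepended to the LLM input. The sketch below is a minimal, hypothetical illustration in PyTorch; the class name, dimensions, and the choice of cosine similarity with top-k selection are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPool(nn.Module):
    """Hypothetical pool of learnable (key, prompt) pairs.

    Each of the `num_prompts` entries holds a key vector and a soft
    prompt of `prompt_len` token embeddings. At the forward pass, the
    keys are scored against a query (e.g. pooled audio features) and
    the top-k matching prompts are concatenated into soft tokens.
    """

    def __init__(self, num_prompts=10, prompt_len=4, dim=32, top_k=3):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_prompts, dim))
        self.prompts = nn.Parameter(torch.randn(num_prompts, prompt_len, dim))
        self.top_k = top_k

    def forward(self, query):
        # query: (batch, dim) — cosine similarity against every key
        scores = F.cosine_similarity(
            query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1
        )  # (batch, num_prompts)
        idx = scores.topk(self.top_k, dim=-1).indices  # (batch, top_k)
        selected = self.prompts[idx]  # (batch, top_k, prompt_len, dim)
        b, k, l, d = selected.shape
        # Flatten the selected prompts into one soft-token sequence
        return selected.reshape(b, k * l, d), idx

pool = PromptPool()
query = torch.randn(2, 32)          # e.g. pooled features for 2 inputs
soft_tokens, idx = pool(query)
print(soft_tokens.shape)            # torch.Size([2, 12, 32])
```

Because both the keys and the prompts are `nn.Parameter`s, gradients from the downstream task loss update which prompts get selected as well as their content, which is how such a pool can balance shared and task-specific knowledge in a multitask setting.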