Title: Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
Authors: Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli
Published: 19th February 2024 (Monday) @ 10:34:13
Link: http://arxiv.org/abs/2402.12025v1

Abstract

The field of natural language processing (NLP) has recently witnessed a transformative shift with the emergence of foundation models, particularly Large Language Models (LLMs) that have revolutionized text-based NLP. This paradigm has extended to other modalities, including speech, where researchers are actively exploring the combination of Speech Foundation Models (SFMs) and LLMs into single, unified models capable of addressing multimodal tasks. Among such tasks, this paper focuses on speech-to-text translation (ST). By examining the published papers on the topic, we propose a unified view of the architectural solutions and training strategies presented so far, highlighting similarities and differences among them. Based on this examination, we not only organize the lessons learned but also show how diverse settings and evaluation approaches hinder the identification of the best-performing solution for each architectural building block and training choice. Lastly, we outline recommendations for future works on the topic aimed at better understanding the strengths and weaknesses of the SFM+LLM solutions for ST.


Open Questions

See the end of each section for the open questions specific to that subtopic.

  • Scaling laws not studied - could increase training data or compute for smaller models, or run a grid search for larger models
  • Inconsistency in training task selection prevents researchers from choosing tasks optimally, in an evidence-based manner
    • relatedly: to what extent does knowledge transfer across tasks?
  • Role/contribution of LLM fine-tuning to evaluation performance; the choice of LLM also matters here - e.g. if a translation-specialised LLM (like Tower) were used, would fine-tuning still be necessary?
  • Benchmarking/training on consistent language directions (NB: en→de and de→en are among the most frequently studied pairs for ST)
    • does training on multiple directions enable transfer learning (esp. within language families) or hurt learning via interference?

Data

en→many:

  • IWSLT offline constrained scenario - ~4.5k hours; or
  • MuST-C - ~500 hours - seems to be offline as of August ’24

many→en:

  • CoVoST2, mTEDx and Europarl-ST (complemented by ASR data: VoxPopuli, Common Voice)