Title: MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models
Authors: Yingxu He, Zhuohan Liu, Shuo Sun, Bin Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, Ai Ti Aw
Published: 13 December 2024
Link: http://arxiv.org/abs/2412.09818v3
Abstract
We introduce MERaLiON-AudioLLM (Multimodal Empathetic Reasoning and Learning in One Network), the first speech-text model tailored for Singapore's multilingual and multicultural landscape. Developed under the National Large Language Models Funding Initiative, Singapore, MERaLiON-AudioLLM integrates advanced speech and text processing to address the diverse linguistic nuances of local accents and dialects, enhancing accessibility and usability in complex, multilingual environments. Our results demonstrate improvements in both speech recognition and task-specific understanding, positioning MERaLiON-AudioLLM as a pioneering solution for region-specific AI applications. We envision this release setting a precedent for future models designed to address localised linguistic and cultural contexts in a global framework.
Quick Notes
- Audio encoder: Whisper-large-v2 (fine-tuned locally as MERaLiON-Whisper)
- "We are also exploring the integration of a localised speech encoder, which has been pre-trained from scratch using a self-supervised learning (SSL) framework"
- Adaptor: a multi-layer perceptron (MLP) adaptor module with 2 hidden layers transforms the encoder outputs into 100 speech or audio token embeddings with a dimension of 3584
- MERaLiON-Whisper produces embeddings of sequence length 1500 and hidden dimension 1280, while SEA-LION V3 has an embedding size of 3584.
- The MLP adaptor yields slightly better results than alternatives such as the window-level Q-Former [Tang et al., 2024] and ConvMLP [Li et al., 2023]; a minimal sketch of the adaptor follows below.
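The paper (as summarised here) does not detail how the 1500 encoder frames are reduced to 100 tokens, so the sketch below assumes the straightforward approach of concatenating each group of 15 consecutive frames (1500 / 100 = 15) before the 2-hidden-layer MLP projection into the decoder's 3584-dimensional embedding space. The hidden width and all names are illustrative, not the authors' values.

```python
import torch
import torch.nn as nn

class MLPAdaptor(nn.Module):
    """Maps Whisper encoder output (B, 1500, 1280) to 100 audio token
    embeddings of size 3584 (the SEA-LION V3 / Gemma 2 9B hidden size).

    Assumption: 15 consecutive frames are concatenated per output token
    (1500 / 100 = 15); the notes only state "MLP with 2 hidden layers".
    """
    def __init__(self, enc_dim=1280, dec_dim=3584, group=15, hidden=4096):
        super().__init__()
        self.group = group
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim * group, hidden),  # hidden layer 1
            nn.GELU(),
            nn.Linear(hidden, hidden),           # hidden layer 2
            nn.GELU(),
            nn.Linear(hidden, dec_dim),          # project to decoder dim
        )

    def forward(self, x):                        # x: (B, 1500, 1280)
        b, t, d = x.shape
        # Group consecutive frames: (B, 1500, 1280) -> (B, 100, 15 * 1280)
        x = x.reshape(b, t // self.group, d * self.group)
        return self.mlp(x)                       # (B, 100, 3584)

adaptor = MLPAdaptor()
audio_tokens = adaptor(torch.randn(2, 1500, 1280))
print(audio_tokens.shape)  # torch.Size([2, 100, 3584])
```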
- Text Decoder: SEA-LION V3 [AI Singapore, 2024], a state-of-the-art localised large language model developed by our partner, AI Singapore.
- built on the 9B version of Google's Gemma 2 [Team et al., 2024]
- continual pre-training on an additional 200 billion tokens sourced from diverse datasets
- the pre-training data encompasses the four official languages of Singapore (English, Chinese, Malay, and Tamil) and also includes several other Southeast Asian languages
- We use the instruct version of SEA-LION V3, which was further fine-tuned on approximately 500,000 English instruction-tuning pairs and approximately 1 million instruction-tuning pairs in various ASEAN languages (see below for a sketch of how the adaptor's audio tokens feed this decoder)
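The notes cover each component (Whisper encoder, MLP adaptor, SEA-LION V3 decoder) but not how the 100 audio token embeddings enter the decoder. A common arrangement for such models, sketched below purely as an assumption, is to concatenate them with the embedded text prompt and pass the result to the decoder via `inputs_embeds`; the prompt layout and sequence lengths are illustrative.

```python
import torch

# Illustrative shapes from the notes: the adaptor emits 100 audio-token
# embeddings already matching the decoder's 3584-dim embedding space.
audio_embeds = torch.randn(1, 100, 3584)   # output of the MLP adaptor
text_embeds = torch.randn(1, 32, 3584)     # decoder-embedded text prompt

# Assumed layout (not stated in these notes): audio tokens first, then the
# text prompt, concatenated along the sequence axis.
inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([1, 132, 3584])

# A Hugging Face causal LM (here, the Gemma-2-based SEA-LION V3 decoder)
# would then consume the fused sequence via inputs_embeds instead of
# input_ids, e.g.:
# outputs = decoder.generate(inputs_embeds=inputs_embeds, max_new_tokens=256)
```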
- Training methodology:
- Audio encoder fine-tuning: guide the Whisper-large-v2 model to better capture local speech characteristics by training it end-to-end on a collection of cleaned local ASR datasets derived from IMDA's National Speech Corpus, in-house ASR datasets, and public datasets (a minimal fine-tuning sketch follows after the data list below)
- Data: We curated an extensive collection of speech-text instruction-tuning pairs totalling 260,000 hours of data. A significant portion of this dataset is derived from IMDA's National Speech Corpus (NSC) [Koh et al., 2019], which is licensed under the Singapore Open Data License. The National Speech Corpus contains approximately 10,600 hours of recordings of Singaporean English speakers, structured into six parts (the figures below sum to the 10,600-hour total):
- 3000 hours of prompted readings from phonetically balanced scripts
- 3000 hours of prompted readings featuring sentences on topics such as people, food, locations, and brands
- 900 hours of conversational data, including discussions on daily life and gameplay interactions
- 900 hours of code-switching conversations where speakers alternate between Singlish and their Mother Tongue languages (Chinese, Malay, Tamil).
- 1500 hours of conversations following four themes: debate, finance, positive emotion, and negative emotion.
- 1300 hours of simulated phone calls across three thematic designs: (1) holiday, hotel, restaurant, (2) bank, telephone, insurance, and (3) Housing and Development Board (HDB), Ministry of Education (MOE), Ministry of Social and Family Development (MSF)
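The notes state that the encoder was fine-tuned end-to-end on local ASR data but give no training recipe. Below is a minimal sketch using the Hugging Face transformers Seq2SeqTrainer; the dataset, collator, and every hyperparameter are placeholders rather than the authors' settings.

```python
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration, WhisperProcessor)

# Start from the public checkpoint; per the notes, it is fine-tuned
# end-to-end on cleaned local ASR data (NSC-derived, in-house, public).
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")

# Placeholders: in practice, supply a dataset of {"input_features", "labels"}
# dicts (log-Mel spectrograms and tokenised transcripts from the processor)
# and a collator that pads the label sequences.
train_dataset, data_collator = ..., ...

args = Seq2SeqTrainingArguments(
    output_dir="meralion-whisper",     # illustrative name
    per_device_train_batch_size=8,     # assumed, not reported in the paper
    learning_rate=1e-5,                # assumed, not reported in the paper
    num_train_epochs=1,
    fp16=True,
)
trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=train_dataset,
                         data_collator=data_collator)
trainer.train()
```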
Results
Related: MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond