🪴 Anil's Garden

mHuBERT-147: A Compact Multilingual HuBERT Model

19 Dec 2025, 1 min read

paper
speech
hubert
naver-labs
annotated

Title: mHuBERT-147: A Compact Multilingual HuBERT Model
Authors: Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu
Published: 10th June 2024 (Monday) @ 15:32:42
Link: http://arxiv.org/abs/2406.06371v4

Abstract

We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.
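To make the "label assignment" step in the abstract concrete: in HuBERT-style training, each frame-level feature vector is assigned the index of its nearest k-means centroid, and those indices serve as discrete training targets. The following is a minimal brute-force sketch of that assignment in plain Python. It does not use faiss and is not the authors' code; the function names and the toy 2-D values are made up for illustration. In a real pipeline, a faiss index would presumably take the place of the inner nearest-neighbour search.

```python
# Brute-force nearest-centroid label assignment (illustrative sketch only).
# Assumption: features and centroids are plain tuples of floats; a real
# pipeline would use high-dimensional arrays and an accelerated search.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_labels(frames, centroids):
    """Return, for each frame, the index of its nearest centroid."""
    labels = []
    for frame in frames:
        distances = [squared_distance(frame, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

# Toy data: 2-D "features" and 3 centroids (hypothetical values).
centroids = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
frames = [(0.1, -0.2), (4.8, 5.1), (0.3, 4.7), (5.2, 4.9)]

labels = assign_labels(frames, centroids)
print(labels)  # [0, 1, 2, 1]
```

The brute-force loop costs one distance computation per frame per centroid, which is the cost that an optimised search library is meant to reduce at scale.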

Training code: https://github.com/utter-project/fairseq
Pre-processing code: https://github.com/utter-project/mHuBERT-147-scripts
Model Weights: https://huggingface.co/utter-project/mHuBERT-147

Vaibhav Srivastav’s tweet about mHuBERT: https://x.com/reach_vb/status/1800825506402599315

Trained on 90K hours of clean, open-licence data
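The abstract mentions a multilingual batching up-sampling strategy that uses both language and dataset diversity, but does not give the formula. As a rough illustration, the sketch below assumes the widely used exponent-smoothed sampling, where a source with share s of the data is sampled with probability proportional to s^alpha, applied first across languages and then across datasets within a language. The function name, the alpha value, and the hour counts are all hypothetical and are not taken from the paper.

```python
# Two-level exponent-smoothed up-sampling (illustrative sketch only).
# Assumption: the exponent form and alpha=0.5 are stand-ins, not the
# paper's settings; the hours below are invented.

def smoothed_probs(amounts, alpha):
    """Map {name: amount} to sampling probabilities proportional to share**alpha."""
    total = sum(amounts.values())
    weights = {name: (amount / total) ** alpha for name, amount in amounts.items()}
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}

# Hypothetical hours per dataset, grouped by language.
hours = {
    "lang_a": {"dataset_1": 800.0, "dataset_2": 200.0},
    "lang_b": {"dataset_3": 50.0},
}
alpha = 0.5  # alpha < 1 flattens the distribution toward low-resource sources

# Level 1: probability of drawing each language.
language_hours = {lang: sum(datasets.values()) for lang, datasets in hours.items()}
language_probs = smoothed_probs(language_hours, alpha)

# Level 2: probability of each dataset, given its language.
combined_probs = {}
for lang, datasets in hours.items():
    dataset_probs = smoothed_probs(datasets, alpha)
    for dataset, prob in dataset_probs.items():
        combined_probs[(lang, dataset)] = language_probs[lang] * prob

for key, prob in sorted(combined_probs.items()):
    print(key, round(prob, 3))
```

With alpha = 1 the function reduces to sampling in proportion to raw hours; lowering alpha gives smaller languages and datasets a larger share of each batch than their raw size would.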

