Title: Granary: Speech Recognition and Translation Dataset in 25 European Languages
Authors: Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng, Sara Papi, Marco Gaido, Alessio Brutti, Boris Ginsburg
Published: 19th May 2025 (Monday) @ 17:40:58
Link: http://arxiv.org/abs/2505.13404v2

Abstract

Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amounts of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approx. 50% less data. Dataset will be made available at https://hf.co/datasets/nvidia/Granary


  • “Building on prior research [6], we identified Whisper-large-v3 [11] as a strong candidate for pseudo-labeling due to its robust performance, multilingual capabilities, and open license.”
    • However, it hallucinates frequently, especially the turbo variant.
    • It struggles with noise and nonspeech segments, necessitating a robust voice activity detection (VAD) system. Additionally, language identification errors, fixed 30-second segment requirements, and lack of case control in output text further complicate its use.

The points above are essentially why they had to design the Granary pipeline.

  • Everything converted to FLAC or WAV
  • 16 kHz sample rate
  • Mono audio
  • 40-second maximum length
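
The format constraints listed above can be checked with a few lines of standard-library Python. This is only a minimal sketch, not the Granary pipeline itself: the names `check_wav` and `chunk_bounds` are hypothetical, it handles WAV only (the `wave` module does not read FLAC), and the fixed-window chunking is a naive stand-in for the paper's own segmentation step.

```python
import wave

TARGET_RATE = 16_000   # Hz, from the notes above
TARGET_CHANNELS = 1    # mono
MAX_SECONDS = 40.0     # maximum length from the notes above


def check_wav(path):
    """Return a list of ways a WAV file deviates from the target format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate

    problems = []
    if rate != TARGET_RATE:
        problems.append(f"sample rate {rate} Hz, expected {TARGET_RATE}")
    if channels != TARGET_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    if duration > MAX_SECONDS:
        problems.append(f"duration {duration:.1f} s exceeds {MAX_SECONDS:.0f} s")
    return problems


def chunk_bounds(duration, max_len=MAX_SECONDS):
    """Split [0, duration] into consecutive (start, end) windows of at most max_len seconds.

    Assumption: plain fixed windows, with no attempt to cut at silences.
    """
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + max_len, duration)
        bounds.append((start, end))
        start = end
    return bounds
```

A file that passes returns an empty list from `check_wav`; anything longer than 40 seconds could be cut with the windows from `chunk_bounds`, although cutting at silences found by a VAD would avoid splitting words.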