Title: Lightweight and Efficient Spoken Language Identification of Long-form Audio
Authors: Winstead Zhu, Md Iftekhar Tanveer, Yang Janet Liu, Seye Ojumu, Rosie Jones
Published: 2023-08-01
Link: https://www.isca-archive.org/interspeech_2023/zhu23c_interspeech.html

Abstract

State-of-the-art Spoken Language Identification (SLI) systems usually focus on tackling short audio clips, and thus their performance degrades drastically when applied to long-form audio such as podcasts, which pose peculiar challenges to existing SLI approaches due to their long duration and diverse content that frequently involves multiple speakers as well as various languages, topics, and speech styles. In this paper, we propose the first system to tackle SLI for long-form audio using podcast data by training a lightweight, multi-class feedforward neural classifier using speaker embeddings as input. We demonstrate that our approach can run inference on long audio input efficiently; furthermore, our system can handle long audio files with multiple speakers and can be further extended to utterance-level inference and code-switching detection, which is currently not covered by any existing SLI system.


Written up on the Spotify blog: Audio-based Machine Learning Model for Podcast Language Identification - Spotify Research


  • Spoken Language Identification (SLI): identify the language spoken in an audio clip
  • crucial for downstream ASR (automatic speech recognition) and S2ST (speech-to-speech translation) tasks
    • e.g. the Google speech-to-text API requires selecting at most four candidate languages up front to condition the recognizer's prior, which in turn requires accurate (human) annotation / metadata (see the config sketch below)
  • started off basing their approach on VGGVox, a speaker embedding model, with the aim of co-opting this encoder for (spoken) language identification
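
As a hypothetical illustration of the constraint mentioned in the bullet above (not something from the paper): with the Google Cloud Speech-to-Text client library, the caller supplies one primary language plus a short list of alternatives, so reliable language metadata is needed before transcription can even start.

```python
# Illustrative only: Google Cloud Speech-to-Text multi-language recognition
# requires the caller to pre-specify a primary language and a few alternatives.
from google.cloud import speech_v1p1beta1 as speech

config = speech.RecognitionConfig(
    language_code="en-US",                                    # primary guess
    alternative_language_codes=["es-ES", "fr-FR", "pt-BR"],   # a few alternative candidates
)
```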

Approach

  • Unsupervised speaker diarization of podcast (voices)
  • VGGVox embedding of individual voices
  • [implied from figure] average the individual speaker embeddings (?)
  • Feedforward network with {Dense 200d, ReLU}-{Dense 200d, ReLU}-BatchNorm-Dropout(0.4)-{Dense 200d, ReLU}-{Dense 200d, ReLU}-BatchNorm-Softmax(10) (see the sketch below)

Data:
  • 10 languages (targets): English, Spanish, French, Portuguese, German, Indonesian, Swedish, Italian, Chinese, Welsh
  • 90:10 train-val split
  • 10,572 hours of audio (podcasts)
  • 1000 (1k) episodes sampled per language

Optimization:
  • Batch size 10
  • trained for 500 epochs
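
To make the layer stack concrete, here is a minimal PyTorch sketch of the episode-level classifier described above (the write-up does not name a framework, and EMB_DIM is an assumption that depends on which VGGVox variant produces the speaker embeddings):

```python
import torch
import torch.nn as nn

EMB_DIM = 512     # assumption: depends on the VGGVox variant used for the speaker embeddings
NUM_LANGS = 10    # the 10 target languages listed above


class LanguageClassifier(nn.Module):
    """Feedforward classifier over (averaged) VGGVox speaker embeddings."""

    def __init__(self, emb_dim: int = EMB_DIM, num_langs: int = NUM_LANGS):
        super().__init__()
        # Layer order as in the notes: Dense-ReLU x2, BatchNorm, Dropout(0.4),
        # Dense-ReLU x2, BatchNorm, then a softmax over the target languages.
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 200), nn.ReLU(),
            nn.Linear(200, 200), nn.ReLU(),
            nn.BatchNorm1d(200),
            nn.Dropout(0.4),
            nn.Linear(200, 200), nn.ReLU(),
            nn.Linear(200, 200), nn.ReLU(),
            nn.BatchNorm1d(200),
            nn.Linear(200, num_langs),   # logits; softmax / cross-entropy applied outside
        )

    def forward(self, episode_embedding: torch.Tensor) -> torch.Tensor:
        # episode_embedding: (batch, emb_dim); each row is the mean of the
        # VGGVox embeddings of the speakers diarized in one episode.
        return self.net(episode_embedding)


# Example: average the per-speaker embeddings of one episode, then classify.
speaker_embs = torch.randn(3, EMB_DIM)               # stand-in for 3 diarized speakers
episode_emb = speaker_embs.mean(dim=0, keepdim=True)
model = LanguageClassifier().eval()
probs = torch.softmax(model(episode_emb), dim=-1)    # (1, 10) language probabilities
```

Training would then be standard multi-class classification with batch size 10 for 500 epochs as listed above; cross-entropy is an assumption here, since the notes only specify the softmax output.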

Results

Evaluated on 1000 human-labelled episodes sampled per language, but only for English, Spanish, German, Portuguese, and Swedish:

Language          Precision (%)   Recall (%)   F1 (%)   AUC
English (en)              97.07        88.33    92.50   0.99
Spanish (es)              93.93        87.67    90.69   0.98
German (de)               98.15        88.67    93.17   0.99
Portuguese (pt)           88.46        88.33    86.74   0.95
Swedish (sv)              94.50        91.67    93.06   0.98
----------------------------------------------------------------
Average                   94.42        88.93    91.23   0.98

Table 1. Evaluation results on the long-form audio test set (podcast)

[Reminder on how to interpret AUC / ROC curves: AUC is the area under the ROC curve (true-positive rate vs. false-positive rate across decision thresholds); 0.5 corresponds to random guessing, 1.0 to perfectly ranking positives above negatives.]
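
A tiny, self-contained illustration of what the AUC column measures (toy numbers, not from the paper), treating each language as a one-vs-rest binary problem:

```python
# AUC = area under the ROC curve; 0.5 = random guessing, 1.0 = perfect ranking.
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0]               # toy one-vs-rest labels for one language
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.3]  # toy classifier probabilities
print(roc_auc_score(y_true, y_score))     # 1.0: every positive scores above every negative
```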

  • This is a balanced test set
    • Higher precision on English seems like a good result but the training data were balanced too (assuming roughly the same lengths of podcasts)
  • their system can run inference on 1 hour of audio in about 3 minutes
  • ECAPA-TDNN - a state-of-the-art SLI model trained on the VoxLingua107 dataset with SpeechBrain, used as the baseline (unclear whether the authors trained it themselves); see the loading sketch below
    • couldn’t run inference for episodes longer than 15 minutes
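
For reference, a publicly available SpeechBrain VoxLingua107 ECAPA-TDNN language-ID checkpoint can be loaded roughly as on its model card; whether this exact checkpoint is the baseline compared against in the paper is an assumption.

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load the public VoxLingua107 ECAPA-TDNN language-ID model (per its model card).
language_id = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa",
    savedir="pretrained_models/lang-id-voxlingua107-ecapa",
)
signal, sr = torchaudio.load("episode.wav")       # hypothetical audio file
prediction = language_id.classify_batch(signal)   # scores plus predicted language label
```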

Takeaway: they achieved an average F1 of 91.23% across the five test languages. Is this good??

Extras

They can also handle code-switching podcasts by skipping the averaging step and passing each VGGVox speaker embedding through the FFN individually (see the sketch below)

  • this is presumably without having to retrain the FFN?
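
A hypothetical sketch of what that per-speaker mode could look like, reusing the LanguageClassifier and EMB_DIM from the architecture sketch above; whether the trained FFN is really reused unchanged is exactly the open question in the bullet.

```python
import torch

# Hypothetical: classify each diarized speaker's VGGVox embedding separately
# instead of averaging them, so different speakers can get different language labels.
model = LanguageClassifier().eval()          # assumes trained weights are loaded
speaker_embs = torch.randn(4, EMB_DIM)       # stand-in for 4 diarized speakers
with torch.no_grad():
    per_speaker_probs = torch.softmax(model(speaker_embs), dim=-1)   # (4, 10)
per_speaker_lang = per_speaker_probs.argmax(dim=-1)
# Disagreement between per-speaker labels flags the episode as a code-switching candidate.
is_code_switching = per_speaker_lang.unique().numel() > 1
```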

Figures

Model Architecture

2-D t-SNE plot of VGGVox embeddings for a random sample of 100 embeddings from {English, Spanish, French, Portuguese, Japanese}
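
For context, this kind of plot can be reproduced with a standard t-SNE projection; the sketch below is purely illustrative, with random vectors standing in for real VGGVox embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Illustrative only: project (stand-in) VGGVox embeddings down to 2-D with t-SNE.
embeddings = np.random.randn(500, 512)   # stand-in for VGGVox speaker embeddings
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
# coords[:, 0] and coords[:, 1] can then be scatter-plotted, coloured by language label.
```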