Improving Universal Access to Modern Speech Technology
Martijn Bartelds
Stanford NLP Group
bartelds@stanford.edu

Won best thesis award at Groningen


ML-SUPERB 2.0

Background: ML-SUPERB 2.0

  • Recent multilingual speech models
  • hundreds of languages supported but evaluation setups are inconsistent
    • question: does he mean inconsistent between languages, or in general?
  • ML-SUPERB 2.0
  • XTREME-S
  • IndicSUPERB
  • ML-SUPERB
  • ML-SUPERB is the most comprehensive benchmark
    • some limitations:
      • considered only a fixed benchmark setting - froze the model and trained only the top layer for ASR or LangID
    • not a robust / reasonable approach
    • doesn’t take users’ budgets / limitations into account
    • want a better benchmark to accommodate this
    • ML-SUPERB reported an aggregate score - this does not incentivise robustness
  • ML-SUPERB 2.0 evaluates joint multilingual LID/ASR
  • Updates the ML-SUPERB dataset by correcting some mistakes
  • Some stats:
    • 141 langs from 15 datasets
    • 1 hour from each language-dataset pair → 300 hours total (some langs have more than one dataset)
    • 20 languages from the 141 are reserved for the few-shot experiments
  • uses the ESPnet toolkit
  • S3PRL is used for loading the pre-trained models
  • XLS-R and MMS evaluated at the outset (had best performance)
  • limit of 100M tunable parameters - in line with the original ML-SUPERB (rough parameter-count sketch below)
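A rough sketch (my own, not from the talk) of checking the 100M tunable-parameter budget; the downstream model here is a placeholder, not the benchmark's actual ESPnet/S3PRL setup:

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    """Count parameters that will actually receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Placeholder downstream model; ML-SUPERB 2.0 builds its models via ESPnet/S3PRL.
downstream = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048),
    num_layers=12,
)
assert count_trainable_params(downstream) <= 100_000_000, "over the ML-SUPERB budget"
```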

Design of Benchmark (more flexible approach)

  • Investigate 4 new benchmark configurations
    • larger scale downstream models
      • LID + transcript
    • SSL model fine-tuning
    • efficient model adaptation strategies
    • supervised pre-training models - OWSM (Open Whisper-style Speech Model, an open-source Whisper-style model)

larger scale downstream models

CTC framework considers 3 encoders:

  • E-Branchformer
  • Conformer
  • Transformer

CTC-Attention framework considers encoder-decoder downstream models (sketch of the combined loss after the list below)

  • E-Branchformer + 8-layer transformer decoder
  • Conformer + 8-layer transformer decoder
  • Transformer + 8-layer transformer decoder
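Rough sketch (my own illustration of the standard ESPnet-style hybrid objective) of how the CTC and attention branches are combined; lambda_ctc is a placeholder weight, not the benchmark's actual setting:

```python
import torch.nn.functional as F

def hybrid_ctc_attention_loss(
    encoder_log_probs,   # (T, B, vocab): log-softmax outputs of the encoder's CTC head
    input_lengths,       # (B,): number of encoder frames per utterance
    ctc_targets,         # (B, S): label sequences for CTC, padded
    target_lengths,      # (B,): true label lengths
    decoder_logits,      # (B, L, vocab): decoder outputs under teacher forcing
    decoder_targets,     # (B, L): next-token targets, padded with -100
    lambda_ctc=0.3,      # placeholder interpolation weight
):
    # CTC branch on the encoder outputs.
    ctc = F.ctc_loss(encoder_log_probs, ctc_targets, input_lengths, target_lengths, blank=0)
    # Attention branch: decoder cross-entropy against the shifted targets.
    att = F.cross_entropy(decoder_logits.transpose(1, 2), decoder_targets, ignore_index=-100)
    return lambda_ctc * ctc + (1.0 - lambda_ctc) * att
```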

SSL model fine-tuning

Efficient Model Adaptation

PEFT approaches (Adapters, LoRA) combined with the larger scale downstream models approach (minimal LoRA sketch below)
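Minimal LoRA sketch (my own illustration of the idea; the rank/alpha values and where the module is inserted are placeholders, not the benchmark's configuration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pre-trained linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: wrapped layer initially equals the base layer
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen path plus scaled low-rank trainable update: W x + scaling * B A x
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```

Wrapping, e.g., an attention projection would look like `layer.q_proj = LoRALinear(layer.q_proj)` (the attribute name is hypothetical); only the small A/B matrices are trainable.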

Supervised Pre-training Models

Experimental Design

  • Hyperparameters follow prior work
  • Tune the learning rate (I think just this) and train the model

See paper for details

Experimental Design: Evaluation

Place greater focus on measuring robustness:

  • Macro-average over languages/datasets instead of micro-average CER
  • Compute per-language CER as the macro-average of CERs across all datasets per language
  • Compute the macro-average of the per-language CERs

→ Allows us to better understand variation between languages and datasets

→ Languages with more samples do not disproportionately affect the CER

  • Standard deviation of language-specific CERs
  • Measure CER of the worst-performing language
  • Measure CER range between datasets in the same language
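Rough sketch of this aggregation (my own; `cer` maps (language, dataset) → CER and the data handling is illustrative):

```python
import statistics
from collections import defaultdict

def aggregate_cer(cer):
    """cer: dict mapping (language, dataset) -> character error rate."""
    per_language = defaultdict(list)
    for (lang, _dataset), value in cer.items():
        per_language[lang].append(value)

    # Per-language CER = macro-average over that language's datasets.
    lang_cer = {lang: sum(v) / len(v) for lang, v in per_language.items()}

    return {
        "macro_cer": sum(lang_cer.values()) / len(lang_cer),   # macro-average over languages
        "cer_std": statistics.stdev(lang_cer.values()),        # spread across languages
        "worst_language_cer": max(lang_cer.values()),          # worst-performing language
        "max_within_language_range": max(                      # dataset gap within a language
            max(v) - min(v) for v in per_language.values()
        ),
    }
```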

Effect of Introducing Four Configurations

Configuration | Details | Accuracy | CER (Normal)
Original ML-SUPERB | MMS + Transformer CTC | 90.3 | 24.7 +/- 12.3
Larger Downstream | MMS + E-Branchformer ATT-CTC | 95.2 | 16.6 +/- 11.8
SSL Model Fine-tuning | MMS + 9-14 layers partial fine-tuning CTC | 95.6 | 15.5 +/- 10.3
Efficient Model Adaptation | MMS + LoRA + Transformer ATT-CTC | 94.2 | 18.7 +/- 11.5
Supervised Pre-trained Model | Whisper Encoder + Transformer CTC | 91.7 | 21.0 +/- 12.5

Supervised ASR vs. SSL Pre-trained Models

  • Original ML-SUPERB only focuses on SSL pre-trained models
  • ML-SUPERB 2.0 also allows the use of supervised ASR models
  • As long as the test sets from the ML-SUPERB 2.0 dataset are not used in training *
  • In our paper, we introduce some preliminary analysis on the comparison between supervised ASR and SSL pre-trained models

In most configurations, CER exceeds 60% for Lao or Min Nan Chinese

Large CER differences between datasets in the same language

→ a big contribution comes from differences in domain or acoustic settings

Take-home Messages

Contributions of ML-SUPERB 2.0

  • We present an updated benchmark for multilingual speech pre-trained models, which builds upon ML-SUPERB
  • We investigate four configurations that ML-SUPERB does not consider
  • We introduce a broader set of evaluation metrics to measure variation across languages and datasets

Findings of ML-SUPERB 2.0

  • All four configurations show improvements over the configuration used in the original ML-SUPERB, which was likely underestimating model performance
  • Model ranking depends on the configuration of the benchmark
  • There is no single way to evaluate an SSL model. It must always be measured in the context of a specific downstream model and task
  • We encourage research on methods that improve language/dataset robustness

Questions on ML-SUPERB 2.0

  • The CPT/fine-tuning strategy of ML-SUPERB 2.0 caps at 100M params, so they target the bottom, middle, or top of the network for the full fine-tuning evaluation approach.
  • For PEFT, they insert adapters/LoRA at all transformer layers - I suppose for a really huge model (lots of transformer layers) they would run out of parameters even with LoRA

Standard Approach: ERM (empirical risk minimization)

  • Look at the average vs. worst-group gap

Approach inspired by “Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization”

Group DRO algorithm:

  • sample a batch, compute the average training loss for each group in the batch, then perform an exponential multiplicative update of each group's weight using its average training loss
  • normalize the group weights to get a distribution
  • the training loss is weighted by the group weights (see the sketch below)
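Rough sketch of the group DRO update as I understood it (Sagawa et al.-style online algorithm; `eta` and the bookkeeping are illustrative):

```python
import torch

class GroupDRO:
    def __init__(self, n_groups: int, eta: float = 0.01):
        self.q = torch.ones(n_groups) / n_groups   # group weights, kept as a distribution
        self.eta = eta

    def loss(self, per_sample_loss: torch.Tensor, group_ids: torch.Tensor) -> torch.Tensor:
        # Average training loss for each group present in the batch.
        group_loss = torch.zeros_like(self.q)
        for g in group_ids.unique():
            group_loss[g] = per_sample_loss[group_ids == g].mean()

        # Exponential multiplicative update of the group weights, then renormalize.
        self.q = self.q * torch.exp(self.eta * group_loss.detach())
        self.q = self.q / self.q.sum()

        # Training loss is the sum of group losses weighted by the group weights.
        return (self.q * group_loss).sum()
```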

See the paper for details

Challenges

  • Best-performing models on ML-SUPERB 2.0 are fine-tuned with CTC
    • encoder-only models need a downstream (DS) task objective
  • the CTC loss scales with the length of the audio samples and corresponding transcriptions

Groups are very varied in utterance length.

  • DRO optimisation
    • its behaviour leads to undertraining - it worsens the downstream performance of the other groups that would have performed better

They propose CTC-DRO

CTC-DRO

Work in progress

Follow up

Had to leave 40-45 minutes into the talk. Watch the end on Conversational AI Reading Group channel.