Improving Universal Access to Modern Speech Technology
Martijn Bartelds
Stanford NLP Group
bartelds@stanford.edu

Won the best thesis award at Groningen


ML-SUPERB 2.0

Background: ML-SUPERB 2.0

  • Recent multilingual speech models support hundreds of languages, but their evaluation setups are inconsistent
    • (question: does he mean inconsistent between languages, or in general?)
  • ML-SUPERB 2.0
  • XTREME-S
  • IndicSUPERB
  • ML-SUPERB
  • ML-SUPERB is the most comprehensive benchmark
    • some limitations:
      • considered only a fixed benchmark setting - froze the model and trained only the top layer for ASR or LangID
      • not a robust / reasonable approach
      • doesn't take users' budgets / limitations into account
      • want a better benchmark to accommodate this
      • reported only an aggregate score, which does not incentivise robustness
  • ML-SUPERB 2.0 evaluates joint multilingual LID/ASR
  • Updates the ML-SUPERB dataset by correcting some mistakes
  • Some stats:
    • 141 languages from 15 datasets
    • 1 hour from each language-dataset pair, ~300 hours total (some languages have more than one dataset)
    • 20 of the 141 languages are reserved for the few-shot experiments
  • uses the ESPnet toolkit
  • S3PRL is used for loading the pre-trained models
  • XLS-R and MMS were evaluated at the outset (had the best performance)
  • limit of 100M tunable parameters, in line with the original ML-SUPERB

Design of Benchmark (more flexible approach)

  • Investigate 4 new benchmark configurations:
    • larger-scale downstream models
      • LID + transcript
    • SSL model fine-tuning
    • efficient model adaptation strategies
    • supervised pre-training models - OWSM (an open Whisper-style model)

Larger-scale downstream models

CTC framework considers 3 encoders:

  • E-Branchformer
  • Conformer
  • Transformer

The CTC-attention framework considers encoder-decoder downstream models

  • E-Branchformer + 8-layer transformer decoder
  • Conformer + 8-layer transformer decoder
  • Transformer + 8-layer transformer decoder
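
A minimal sketch of the CTC variant of such a downstream setup (my own illustration under stated assumptions, not the actual ESPnet recipe; class name, dimensions, and layer counts are made up): frozen SSL features are combined via a learnable layer-weighted sum and fed into a small Transformer encoder trained with CTC.

```python
import torch
import torch.nn as nn

class CTCDownstream(nn.Module):
    """Illustrative downstream head on top of a frozen SSL speech model."""
    def __init__(self, num_ssl_layers, feat_dim, vocab_size, d_model=256):
        super().__init__()
        # Learnable weights over the frozen SSL model's hidden layers
        self.layer_weights = nn.Parameter(torch.zeros(num_ssl_layers))
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)  # vocab includes the CTC blank

    def forward(self, ssl_hidden_states):
        # ssl_hidden_states: (num_layers, batch, time, feat_dim), detached from the frozen SSL model
        w = torch.softmax(self.layer_weights, dim=0)
        x = (w[:, None, None, None] * ssl_hidden_states).sum(dim=0)
        x = self.encoder(self.proj(x))
        # Log-probabilities; transpose to (time, batch, vocab) before nn.CTCLoss
        return self.out(x).log_softmax(dim=-1)
```

The ATT-CTC variants listed above additionally attach an 8-layer Transformer decoder on top of such an encoder.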

SSL model fine-tuning

Efficient Model Adaptation

PEFT approaches (Adapters, LoRA) combined with the larger-scale downstream models approach
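
As a rough illustration of the LoRA idea (my own sketch, not the benchmark's implementation): the pre-trained projection is frozen and only a low-rank update is trained, which keeps the tunable parameter count small.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (illustrative)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pre-trained projection stays frozen
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen path plus the scaled low-rank correction
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scaling
```

Only lora_a and lora_b (plus the downstream model) count towards the tunable parameters, which is how PEFT stays under the 100M cap.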

Supervised Pre-training Models

Experimental Design

  • Hyperparameters follow prior work
  • Tune learning rate (I think just this) and train the model

See paper for details

Experimental Design: Evaluation

Place greater focus on measuring robustness:

  • Macro-average over languages/datasets instead of micro-average CER
  • Compute per-language CER as the macro-average of CERs across all datasets per language
  • Compute the macro-average of the per-language CERs

This allows us to better understand variation between languages and datasets

Languages with more samples do not disproportionately affect the CER. Additional robustness measures (sketched in code after this list):

  • Standard deviation of language-specific CERs
  • Measure CER of the worst-performing language
  • Measure CER range between datasets in the same language
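
A small sketch of how these robustness metrics might be computed (my reading of the description above; the function and data layout are made up for illustration):

```python
from statistics import mean, pstdev

def robustness_metrics(cer_per_lang_dataset):
    """cer_per_lang_dataset: dict mapping language -> {dataset_name: CER}."""
    # Per-language CER: macro-average over that language's datasets
    per_lang = {lang: mean(ds.values()) for lang, ds in cer_per_lang_dataset.items()}
    worst_lang, worst_cer = max(per_lang.items(), key=lambda kv: kv[1])
    return {
        # Headline score: macro-average of the per-language CERs
        "macro_cer": mean(per_lang.values()),
        # Spread of language-specific CERs
        "cer_std": pstdev(per_lang.values()),
        # CER of the worst-performing language
        "worst_language": (worst_lang, worst_cer),
        # Largest CER gap between datasets of the same language
        "max_within_language_range": max(
            (max(ds.values()) - min(ds.values())
             for ds in cer_per_lang_dataset.values() if len(ds) > 1),
            default=0.0,
        ),
    }
```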

Effect of Introducing Four Configurations

Configuration                | Details                                    | Accuracy | CER (Normal)
Original ML-SUPERB           | MMS + Transformer CTC                      | 90.3     | 24.7 +/- 12.3
Larger Downstream            | MMS + E-Branchformer ATT-CTC               | 95.2     | 16.6 +/- 11.8
SSL Model Fine-tuning        | MMS + layers 9-14 partial fine-tuning, CTC | 95.6     | 15.5 +/- 10.3
Efficient Model Adaptation   | MMS + LoRA + Transformer ATT-CTC           | 94.2     | 18.7 +/- 11.5
Supervised Pre-trained Model | Whisper Encoder + Transformer CTC          | 91.7     | 21.0 +/- 12.5

Supervised ASR vs. SSL Pre-trained Models

  • Original ML-SUPERB only focuses on SSL pre-trained models
  • ML-SUPERB 2.0 also allows the use of supervised ASR models
  • As long as the test sets from the ML-SUPERB 2.0 dataset are not used in training *
  • In our paper, we introduce some preliminary analysis on the comparison between supervised ASR and SSL pre-trained models

In most configurations, CER exceeds 60% for Lao or Min Nan Chinese

Large CER differences between datasets in the same language

A big contribution to these differences likely comes from domain or acoustic settings

Take-home Messages

Contributions of ML-SUPERB 2.0

  • We present an updated benchmark for multilingual speech pre-trained models, which builds upon ML-SUPERB
  • We investigate four configurations that ML-SUPERB does not consider
  • We introduce a broader set of evaluation metrics to measure variation across languages and datasets

Findings of ML-SUPERB 2.0

  • All four configurations show improvements over the configuration used in the original ML-SUPERB, which was likely underestimating model performance
  • Model ranking depends on the configuration of the benchmark
  • There is no single way to evaluate an SSL model. It must always be measured in the context of a specific downstream model and task
  • We encourage research on methods that improve language/dataset robustness

Questions on ML-SUPERB 2.0

  • The CPT/fine-tuning strategy of ML-SUPERB 2.0 caps trainable parameters at 100M, so they target the bottom, middle, or top of the network for the full fine-tuning evaluation approach (a rough sketch follows below)
  • For PEFT, they insert modules at all transformer layers - I suppose for a really huge model (lots of transformer layers) they would run out of parameters even with LoRA
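
A rough sketch of how such a capped partial fine-tuning setup might look (an assumption based on the notes above; the 100M cap comes from the talk, the function and its arguments are made up):

```python
import torch.nn as nn

def unfreeze_layer_slice(model: nn.Module, layers, start: int, end: int,
                         budget: int = 100_000_000):
    """Freeze the whole model, then unfreeze layers[start:end] if it fits the budget."""
    for p in model.parameters():
        p.requires_grad = False
    selected = layers[start:end]  # e.g. a "middle" slice such as layers 9-14
    n_trainable = sum(p.numel() for layer in selected for p in layer.parameters())
    assert n_trainable <= budget, f"{n_trainable} trainable params exceed the {budget} cap"
    for layer in selected:
        for p in layer.parameters():
            p.requires_grad = True
    return n_trainable
```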

Standard Approach: ERM

  • Look at Average vs Worst Group gap

Approach inspired by "Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization" (Sagawa et al., 2020)

Group DRO algorithm:

  • sample a batch and compute the average training loss for each group in the batch
  • perform an exponential multiplicative update of each group's weight based on its average training loss
  • normalize the group weights to get a distribution
  • the loss is weighted by the group weights

See the paper for understanding
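
A minimal sketch of one group DRO step as I understood it (the step size eta, tensor shapes, and function name are illustrative assumptions; see the Sagawa et al. paper for the actual algorithm):

```python
import torch

def group_dro_step(losses, group_ids, group_weights, eta=0.01):
    """losses: per-example losses (batch,); group_ids: (batch,) group indices;
    group_weights: (num_groups,) running weights, updated in place."""
    num_groups = group_weights.numel()
    # Average training loss per group in this batch (zero if group is absent)
    group_losses = []
    for g in range(num_groups):
        mask = group_ids == g
        group_losses.append(losses[mask].mean() if mask.any() else losses.new_zeros(()))
    group_losses = torch.stack(group_losses)
    # Exponential multiplicative update of the group weights, then normalize
    group_weights *= torch.exp(eta * group_losses.detach())
    group_weights /= group_weights.sum()
    # The training loss is the group losses weighted by the updated group weights
    return (group_weights * group_losses).sum()
```

The returned weighted loss is backpropagated as usual, while group_weights persists across batches so that consistently hard groups accumulate weight.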

Challenges

  • Best-performing models on ML-SUPERB 2.0 are fine-tuned with CTC
    • encoder-only models need a downstream task/loss
  • The CTC loss scales with the length of the audio samples and the corresponding transcriptions

Groups vary widely in utterance length.

  • DRO optimisation
    • groups with longer utterances get larger CTC losses and are therefore over-weighted; this behaviour leads to undertraining and worsens the downstream performance of the other groups that would otherwise have performed better

They propose CTC-DRO

CTC-DRO

Work in progress

Follow up

Had to leave 40-45 minutes into the talk. Watch the end on Conversational AI Reading Group channel.