Improving Universal Access to Modern Speech Technology
Martijn Bartelds
Stanford NLP Group
bartelds@stanford.edu

Won best thesis award at Groningen


ML-SUPERB 2.0

Background: ML-SUPERB 2.0

  • Recent multilingual speech models
  • hundreds of languages supported but evaluation setups are inconsistent
    • question: does he mean inconsistent between languages, or in general?
  • ML-SUPERB 2.0
  • XTREME-S
  • IndicSUPERB
  • ML-SUPERB
  • ML-SUPERB is the most comprehensive benchmark
    • some limitations:
      • considered only a fixed benchmark setting - froze the model and trained only the top layer for ASR or LangID
    • not a robust / reasonable approach
    • doesn’t take users’ budgets / limitations into account
    • want a better benchmark to accommodate this
    • ML-SUPERB reported an aggregate score - this does not incentivise robustness
  • ML-SUPERB 2.0 evaluates joint multilingual LID/ASR
  • Updates the ML-SUPERB dataset by correcting some mistakes
  • Some stats:
    • 141 langs from 15 datasets
    • 1 hour from each language-dataset pair → 300 hours total (some langs have more than one dataset)
    • 20 languages from the 141 are reserved for the few-shot experiments
  • uses the ESPnet toolkit
  • S3PRL is used for loading the pre-trained models
  • XLS-R and MMS evaluated at the outset (had best performance)
  • limit of 100M tunable parameters - in line with the original ML-SUPERB (rough parameter-count sketch below)
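A rough sketch (my own, not from the talk) of checking the 100M tunable-parameter budget; the downstream model here is a placeholder, not the benchmark's actual ESPnet/S3PRL setup:

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    """Count parameters that will actually receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Placeholder downstream model; ML-SUPERB 2.0 builds its models via ESPnet/S3PRL.
downstream = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048),
    num_layers=12,
)
assert count_trainable_params(downstream) <= 100_000_000, "over the ML-SUPERB budget"
```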

Design of Benchmark (more flexible approach)

  • Investigate 4 new benchmark configurations
    • larger scale downstream models
      • LID + transcript
    • SSL model fine-tuning
    • efficient model adaptation strategies
    • supervised pre-training models - OWSM (Open Whisper-style Speech Model, an open-source Whisper-style model)

larger scale downstream models

CTC framework considers 3 encoders:

  • E-Branchformer
  • Conformer
  • Transformer

CTC-Attention framework considers encoder-decoder downstream models (sketch of the combined loss after the list below)

  • E-Branchformer + 8-layer transformer decoder
  • Conformer + 8-layer transformer decoder
  • Transformer + 8-layer transformer decoder
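Rough sketch (my own illustration of the standard ESPnet-style hybrid objective) of how the CTC and attention branches are combined; lambda_ctc is a placeholder weight, not the benchmark's actual setting:

```python
import torch.nn.functional as F

def hybrid_ctc_attention_loss(
    encoder_log_probs,   # (T, B, vocab): log-softmax outputs of the encoder's CTC head
    input_lengths,       # (B,): number of encoder frames per utterance
    ctc_targets,         # (B, S): label sequences for CTC, padded
    target_lengths,      # (B,): true label lengths
    decoder_logits,      # (B, L, vocab): decoder outputs under teacher forcing
    decoder_targets,     # (B, L): next-token targets, padded with -100
    lambda_ctc=0.3,      # placeholder interpolation weight
):
    # CTC branch on the encoder outputs.
    ctc = F.ctc_loss(encoder_log_probs, ctc_targets, input_lengths, target_lengths, blank=0)
    # Attention branch: decoder cross-entropy against the shifted targets.
    att = F.cross_entropy(decoder_logits.transpose(1, 2), decoder_targets, ignore_index=-100)
    return lambda_ctc * ctc + (1.0 - lambda_ctc) * att
```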

SSL model fine-tuning

Efficient Model Adaptation

PEFT approaches (Adapters, LoRA) combined with the larger scale downstream models approach (minimal LoRA sketch below)
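Minimal LoRA sketch (my own illustration of the idea; the rank/alpha values and where the module is inserted are placeholders, not the benchmark's configuration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pre-trained linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: wrapped layer initially equals the base layer
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen path plus scaled low-rank trainable update: W x + scaling * B A x
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```

Wrapping, e.g., an attention projection would look like `layer.q_proj = LoRALinear(layer.q_proj)` (the attribute name is hypothetical); only the small A/B matrices are trainable.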

Supervised Pre-training Models

Experimental Design

  • Hyperparameters follow prior work
  • Tune the learning rate (I think just this) and train the model

See paper for details

Experimental Design: Evaluation

Place greater focus on measuring robustness:

  • Macro-average over languages/datasets instead of micro-average CER
  • Compute per-language CER as the macro-average of CERs across all datasets per language
  • Compute the macro-average of the per-language CERs

→ Allows us to better understand variation between languages and datasets

→ Languages with more samples do not disproportionately affect the CER

  • Standard deviation of language-specific CERs
  • Measure CER of the worst-performing language
  • Measure CER range between datasets in the same language
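Rough sketch of this aggregation (my own; `cer` maps (language, dataset) → CER and the data handling is illustrative):

```python
import statistics
from collections import defaultdict

def aggregate_cer(cer):
    """cer: dict mapping (language, dataset) -> character error rate."""
    per_language = defaultdict(list)
    for (lang, _dataset), value in cer.items():
        per_language[lang].append(value)

    # Per-language CER = macro-average over that language's datasets.
    lang_cer = {lang: sum(v) / len(v) for lang, v in per_language.items()}

    return {
        "macro_cer": sum(lang_cer.values()) / len(lang_cer),   # macro-average over languages
        "cer_std": statistics.stdev(lang_cer.values()),        # spread across languages
        "worst_language_cer": max(lang_cer.values()),          # worst-performing language
        "max_within_language_range": max(                      # dataset gap within a language
            max(v) - min(v) for v in per_language.values()
        ),
    }
```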

Effect of Introducing Four Configurations

Configuration | Details | Accuracy | CER (Normal)
Original ML-SUPERB | MMS + Transformer CTC | 90.3 | 24.7 +/- 12.3
Larger Downstream | MMS + E-Branchformer ATT-CTC | 95.2 | 16.6 +/- 11.8
SSL Model Fine-tuning | MMS + 9-14 layers partial fine-tuning CTC | 95.6 | 15.5 +/- 10.3
Efficient Model Adaptation | MMS + LoRA + Transformer ATT-CTC | 94.2 | 18.7 +/- 11.5
Supervised Pre-trained Model | Whisper Encoder + Transformer CTC | 91.7 | 21.0 +/- 12.5

Supervised ASR vs. SSL Pre-trained Models

  • Original ML-SUPERB only focuses on SSL pre-trained models
  • ML-SUPERB 2.0 also allows the use of supervised ASR models
  • As long as the test sets from the ML-SUPERB 2.0 dataset are not used in training *
  • In our paper, we introduce some preliminary analysis on the comparison between supervised ASR and SSL pre-trained models

In most configurations, CER exceeds 60% for Lao or Min Nan Chinese

Large CER differences between datasets in the same language

→ a big contribution comes from differences in domain or acoustic settings

Take-home Messages

Contributions of ML-SUPERB 2.0

  • We present an updated benchmark for multilingual speech pre-trained models, which builds upon ML-SUPERB
  • We investigate four configurations that ML-SUPERB does not consider
  • We introduce a broader set of evaluation metrics to measure variation across languages and datasets

Findings of ML-SUPERB 2.0

  • All four configurations show improvements over the configuration used in the original ML-SUPERB, which was likely underestimating model performance
  • Model ranking depends on the configuration of the benchmark
  • There is no single way to evaluate an SSL model. It must always be measured in the context of a specific downstream model and task
  • We encourage research on methods that improve language/dataset robustness

Questions on ML-SUPERB 2.0

  • The CPT/fine-tuning strategy of ML-SUPERB 2.0 caps at 100M params, so they target the bottom, middle, or top of the network for the full fine-tuning evaluation approach.
  • For PEFT, they insert adapters/LoRA at all transformer layers - I suppose for a really huge model (lots of transformer layers) they would run out of parameters even with LoRA

Standard Approach: ERM (empirical risk minimization)

  • Look at the average vs. worst-group gap

Approach inspired by “Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization”

Group DRO algorithm:

  • sample a batch, compute the average training loss for each group in the batch, then perform an exponential multiplicative update of each group's weight using its average training loss
  • normalize the group weights to get a distribution
  • the training loss is weighted by the group weights (see the sketch below)
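Rough sketch of the group DRO update as I understood it (Sagawa et al.-style online algorithm; `eta` and the bookkeeping are illustrative):

```python
import torch

class GroupDRO:
    def __init__(self, n_groups: int, eta: float = 0.01):
        self.q = torch.ones(n_groups) / n_groups   # group weights, kept as a distribution
        self.eta = eta

    def loss(self, per_sample_loss: torch.Tensor, group_ids: torch.Tensor) -> torch.Tensor:
        # Average training loss for each group present in the batch.
        group_loss = torch.zeros_like(self.q)
        for g in group_ids.unique():
            group_loss[g] = per_sample_loss[group_ids == g].mean()

        # Exponential multiplicative update of the group weights, then renormalize.
        self.q = self.q * torch.exp(self.eta * group_loss.detach())
        self.q = self.q / self.q.sum()

        # Training loss is the sum of group losses weighted by the group weights.
        return (self.q * group_loss).sum()
```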

See the paper for details

Challenges

  • Best-performing models on ML-SUPERB 2.0 are fine-tuned with CTC
    • encoder-only models need a downstream (DS) task objective
  • the CTC loss scales with the length of the audio samples and corresponding transcriptions

Groups are very varied in utterance length.

  • DRO optimisation
    • its behaviour leads to undertraining - it worsens the downstream performance of the other groups that would have performed better

They propose CTC-DRO

CTC-DRO

Work in progress

Follow up

Had to leave 40-45 minutes into the talk. Watch the end on Conversational AI Reading Group channel.