Improving Universal Access to Modern Speech Technology
Martijn Bartelds
Stanford NLP Group
bartelds@stanford.edu
Won best thesis award at Groningen
Part of the Conversational AI Reading Group series
ML Superb 2.0
- Different languages have high variability in ASR performance
- gives two examples of Whisper and MMS
- What do we need?
- new algorithms for bridging the performance gap between languages
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets
Background: ML-SUPERB 2.0
- Recent multilingual speech models
- hundreds of languages supported but evaluation setups are inconsistent
- (question: does he mean inconsistent between languages, or inconsistent in general?)
- ML Superb 2.0
- XTREME-S
- IndicSUPERB
- ML-SUPERB
- ML-SUPERB is the most comprehensive benchmark
- some limitations:
- considered only a fixed benchmark setting - froze the model and trained only the top layer for ASR or LangID
- not a robust / reasonable approach
- doesn't take users' budgets / limitations into account
- want a better benchmark to accommodate this
- ML-SUPERB reported an aggregate score - does not incentivise robustness
- (question: surely it does?)
- ML-SUPERB 2.0 evaluates joint multilingual LID/ASR
- Updates the ML-SUPERB dataset by correcting some mistakes
- Some stats:
- 141 langs from 15 datasets
- 300 hours total
- 1 hour from each language-dataset pair → 300 hours total (some langs have more than one dataset)
- 20 languages from the 141 are reserved for the few-shot experiments
- use the ESPnet toolkit
- S3PRL used for loading
- XLS-R and MMS evaluated at the outset (had the best performance)
- limit of 100M tunable parameters - in line with the original ML-SUPERB
Design of Benchmark (more flexible approach)
- Investigate 4 new benchmark configurations
- larger scale downstream models
- LID + transcript
- SSL model fine-tuning
- efficient model adaptation strategies
- supervised pre-training models - OWSM (Open Whisper-style Speech Model)
larger scale downstream models
CTC framework considers 3 encoders:
- E-Branchformer
- Conformer
- Transformer
CTC-Attention framework considers encoder and decoder downstream models
- E-Branchformer + 8-layer transformer decoder
- Conformer + 8-layer transformer decoder
- Transformer + 8-layer transformer decoder
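A minimal PyTorch sketch of the CTC variant of the larger downstream model idea - illustrative only, not the actual ESPnet recipe. It assumes frame-level features from a frozen SSL encoder (e.g. XLS-R/MMS) are already available; the class name `CTCDownstream` and all dimensions, layer counts, and the vocabulary size are placeholder assumptions.

```python
# Sketch: trainable Transformer downstream encoder + linear head trained with CTC,
# applied on top of frozen SSL features. Hyperparameters are illustrative.
import torch
import torch.nn as nn


class CTCDownstream(nn.Module):
    def __init__(self, feat_dim=1024, d_model=256, n_layers=6, vocab_size=500):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)      # map SSL features to model dim
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)    # vocab includes the CTC blank at index 0

    def forward(self, ssl_feats):                     # (batch, frames, feat_dim)
        x = self.encoder(self.proj(ssl_feats))
        return self.head(x).log_softmax(dim=-1)       # (batch, frames, vocab)


# Toy usage: random tensors stand in for frozen XLS-R/MMS features.
model = CTCDownstream()
feats = torch.randn(2, 200, 1024)
log_probs = model(feats).transpose(0, 1)              # CTC loss expects (frames, batch, vocab)
targets = torch.randint(1, 500, (2, 30))
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((2,), 200),
                           target_lengths=torch.full((2,), 30))
```

Per the list above, the CTC-Attention variant would add an 8-layer Transformer decoder on top of the same encoder output.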
SSL model fine-tuning
Efficient Model Adaptation
PEFT approaches (Adapters, LoRA) + larger scale downstream models approach
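A minimal sketch of the LoRA flavour of PEFT - a toy illustration, not the ML-SUPERB 2.0 code. The class `LoRALinear`, the rank, and the 1024-dim projection are assumed for the example; the point is that only r * (d_in + d_out) parameters per wrapped layer are trainable while the pre-trained weights stay frozen.

```python
# Sketch: wrap a frozen pre-trained linear layer with a trainable low-rank update.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():              # keep pre-trained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale


# Wrapping one (hypothetical) attention projection of a frozen encoder:
frozen_proj = nn.Linear(1024, 1024)
peft_proj = LoRALinear(frozen_proj, rank=8)
trainable = sum(p.numel() for p in peft_proj.parameters() if p.requires_grad)
print(trainable)                                      # 8 * (1024 + 1024) = 16384
```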
Supervised Pre-training Models
Experimental Design
- Hyperparameters follow prior work
- Tune learning rate (I think just this) and train the model
See paper for details
Experimental Design: Evaluation
Place greater focus on measuring robustness:
- Macro-average over languages/datasets instead of micro-average CER
- Compute per-language CER as the macro-average of CERs across all datasets per language
- Compute the macro-average of the per-language CERs
→ Allows us to better understand variation between languages and datasets
→ Languages with more samples do not disproportionally affect the CER
- Standard deviation of language-specific CERs
- Measure CER of the worst-performing language
- Measure CER range between datasets in the same language
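A small sketch of how these robustness metrics could be computed from per-(language, dataset) CERs. The language/dataset keys and CER values below are hypothetical, and `pstdev` is just one possible choice of standard deviation.

```python
# Sketch: macro-averaged CER, its spread, the worst language, and
# within-language ranges, from hypothetical per-(language, dataset) CERs.
from collections import defaultdict
from statistics import mean, pstdev

cer = {("lao", "fleurs"): 0.62, ("lao", "commonvoice"): 0.48,
       ("eng", "fleurs"): 0.09, ("eng", "commonvoice"): 0.12}

by_lang = defaultdict(list)
for (lang, _dataset), value in cer.items():
    by_lang[lang].append(value)

per_lang_cer = {lang: mean(vals) for lang, vals in by_lang.items()}   # macro over datasets
macro_cer = mean(per_lang_cer.values())                               # macro over languages
cer_std = pstdev(per_lang_cer.values())                               # spread across languages
worst_lang, worst_cer = max(per_lang_cer.items(), key=lambda kv: kv[1])
dataset_range = {lang: max(vals) - min(vals) for lang, vals in by_lang.items()}
```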
Effect of Introducing Four Configurations
| Configuration | Details | LID Accuracy | CER (Normal) |
|---|---|---|---|
| Original ML-SUPERB | MMS + Transformer CTC | 90.3 | 24.7 ± 12.3 |
| Larger Downstream | MMS + E-Branchformer ATT-CTC | 95.2 | 16.6 ± 11.8 |
| SSL Model Fine-tuning | MMS + layers 9-14 partial fine-tuning CTC | 95.6 | 15.5 ± 10.3 |
| Efficient Model Adaptation | MMS + LoRA + Transformer ATT-CTC | 94.2 | 18.7 ± 11.5 |
| Supervised Pre-trained Model | Whisper Encoder + Transformer CTC | 91.7 | 21.0 ± 12.5 |
Supervised ASR vs. SSL Pre-trained Models
- Original ML-SUPERB only focuses on SSL pre-trained models
- ML-SUPERB 2.0 also allows the use of supervised ASR models
- As long as the test sets from the ML-SUPERB 2.0 dataset are not used in training
- In our paper, we introduce some preliminary analysis on the comparison between supervised ASR and SSL pre-trained models
In most configurations, CER exceeds 60% for Lao or Min Nan Chinese
Large CER differences between datasets in the same language
→ a big contribution comes from differences in domain or acoustic settings
Take-home Messages
Contributions of ML Superb 2.0
- We present an updated benchmark for multilingual speech pre-trained models, which builds upon ML-SUPERB
- We investigate four configurations that ML-SUPERB does not consider
- We introduce a broader set of evaluation metrics to measure variation across languages and datasets
Findings of ML Superb 2.0
- All four configurations show improvements over the configuration used in the original ML-SUPERB, which was likely underestimating model performance
- Model ranking depends on the configuration of the benchmark
- There is no single way to evaluate an SSL model. It must always be measured in the context of a specific downstream model and task
- We encourage research on methods that improve language/dataset robustness
Questions on ML Superb 2.0
- The CPT/fine-tuning strategy of ML-SUPERB 2.0 caps tunable parameters at 100M, so they target the bottom, middle, or top of the network for the full fine-tuning evaluation approach.
- For PEFT, they insert at all transformer layers - I suppose for a really huge model (lots of transformer layers) they would run out of parameters even with LoRA
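A rough back-of-the-envelope sketch for that question; all numbers (48 layers, hidden size 1280, rank 16, four attention projections per layer) are my own assumptions, not figures from the paper.

```python
# Sketch: estimated LoRA parameter count for a hypothetical large encoder,
# with rank-16 adapters at the q/k/v/o projections of every layer.
layers, projections, rank, hidden = 48, 4, 16, 1280
lora_params = layers * projections * rank * (hidden + hidden)   # r * (d_in + d_out) per projection
print(f"{lora_params / 1e6:.1f}M tunable parameters")           # ~7.9M under these assumptions
```

Under these assumptions LoRA stays well below a 100M cap even for a deep model, so the cap seems to bite mainly for full or partial fine-tuning.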
Standard Approach: ERM
- Look at Average vs Worst Group gap
Approach inspired by "Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization"
Group DRO algorithm:
- sample a batch and compute the average training loss for each group in the batch
- perform an exponential multiplicative update of each group's weight using its average training loss
- normalize the group weights to get a distribution
- the training loss is weighted by the group weights
See the paper for understanding
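A toy re-implementation of the group DRO update sketched above - my own minimal version with assumed step size and initialization, not the authors' code.

```python
# Sketch: group DRO loss. `losses` holds the average batch loss per group,
# `q` the current group weights, `eta` the group step size.
import torch

def group_dro_loss(losses: torch.Tensor, q: torch.Tensor, eta: float = 0.01):
    q = q * torch.exp(eta * losses.detach())   # exponential multiplicative update
    q = q / q.sum()                            # normalize group weights to a distribution
    return (q * losses).sum(), q               # training loss weighted by group weights

# Toy usage with three groups (e.g. three languages):
q = torch.ones(3) / 3
group_losses = torch.tensor([2.0, 0.5, 1.0], requires_grad=True)
loss, q = group_dro_loss(group_losses, q)
loss.backward()
```

Groups whose recent losses are high get exponentially larger weights, so training is pushed toward the worst-performing group.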
Challenges
- Best-performing models on ML-SUPERB 2.0 are fine-tuned with CTC
- encoder-only models need a downstream (DS) task
- CTC scales with the length of the audio samples and corresponding transcriptions
Groups are very varied in utterance length.
- DRO optimisation up-weights these high-loss (long-utterance) groups
- this behaviour leads to undertraining - it worsens DS performance of the other groups that would have performed better
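A quick toy illustration of the interaction - an assumed setup with random predictions, only meant to show that the summed CTC loss grows with utterance/transcript length, so long-utterance groups look disproportionately "hard" to group DRO.

```python
# Sketch: summed CTC loss for a short vs. a long (random) utterance.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, reduction="sum")
vocab, batch = 50, 1

for frames, target_len in [(100, 10), (1000, 100)]:
    log_probs = torch.randn(frames, batch, vocab).log_softmax(-1)
    targets = torch.randint(1, vocab, (batch, target_len))
    loss = ctc(log_probs, targets,
               torch.full((batch,), frames), torch.full((batch,), target_len))
    print(frames, target_len, round(loss.item(), 1))   # the longer pair yields a much larger loss
```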
They propose CTC-DRO
CTC-DRO
Work in progress
Follow up
- OWSM - Open Whisper-style Speech Model
- CTC scales with the length of the audio samples and corresponding transcriptions
- [Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization]
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets
Had to leave 40-45 minutes into the talk. Watch the end on Conversational AI Reading Group channel.