Hurdles at WMT: Keeping up with the MT progress - Tom Kocmi
2024-06-04 @ 10:00 WEST - Sardine Lab, Técnico, Lisbon
- WMT since 2022: Moved from News shared task to multi-domain
- prevents over-specialised, targeted systems
- kept news, added social, added speech; changing year on year
- System breakers (red team) via test suites
- Challenging language pairs
- Allowing pretrained models - LLMs like Llama, Mistral
- Fewer translation directions but more language pairs
New for this year:
- paragraph-level MT - “should have gone to paragraph level a long time ago”
- tried before but broke a lot of things; didn’t know how to do paragraph-level MT evaluation
- Sentence splitting the test set is a hard task - English is easy, but e.g. Chinese…? Asked humans to do it; even they’re not always consistent
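To illustrate why sentence splitting is hard, here is a minimal sketch (my own toy example, not WMT's pipeline) of a naive punctuation-based splitter: it already fails on English abbreviations and does nothing at all for Chinese punctuation.

```python
import re

def naive_split(text: str) -> list[str]:
    # Split on sentence-final punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# English abbreviations already break the naive rule:
print(naive_split("Dr. Smith arrived. He was late."))
# → ['Dr.', 'Smith arrived.', 'He was late.']

# Chinese uses different punctuation (。) entirely, so nothing is split:
print(naive_split("今天天气很好。我们去公园了。"))
# → the whole string comes back as one "sentence"
```

Real splitters need language-specific rules, which is why even human annotators disagree.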
- Three tracks:
- constrained - use only listed data and models
- open - use anything publicly available to the research community
- closed systems - always included: Microsoft, Google, DeepL, online systems, GPT-4, etc.
- LLM benchmark - “to this day, I still haven’t seen a nicely evaluated LLM system for MT, so we tried”
- Multimodal - speech-to-text
- we will also provide an automatic transcript of the audio - text-to-text systems can also compete, but you're allowed to use the audio (and that's the intention)
- Literary domain
Collecting and translating datasets:
- want data to be fresh - published in recent months - to avoid data contamination in models' training sets
- hard to keep it fresh if you go outside news domain; tried Reddit, but they stopped WMT from using this data; now using Mastodon - hopefully they don't stop WMT either
- where to get the monolingual data
Data prep:
- preparing data: collecting candidate documents
- manually selecting docs + fixing/removing problematic parts
- translating data by humans - Chinese translators told us they would not translate this (claimed it was propaganda); similar issues with sensitive content
AM: Were you directly in touch with translators?
- TK: WMT does not have a budget; work with companies MTT (Japan)
- at Microsoft, we work with vendors
- with Czech, I know the translators
Open problems with test sets
- where to get domain-specific data esp. for non-English languages
- Mastodon doesn’t work outside of English; maybe some German and French, no Czech, no Korean, no Icelandic etc.
- how to automatically check quality of human references?
- get references from partners
- how to check that references are not post-edits?
- translators use some system, e.g. GPT-4, to generate a translation and then post-edit it, so BLEU scores on those segments are inflated for those systems
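The post-editing problem above can be shown with a toy overlap metric (a stand-in for BLEU; the sentences and metric are mine, purely illustrative): a "reference" produced by post-editing one system's output shares far more tokens with that system than an independent translation does.

```python
def unigram_precision(hyp: str, ref: str) -> float:
    # Toy stand-in for BLEU: fraction of hypothesis tokens matched in the reference.
    hyp_toks, ref_toks = hyp.split(), ref.split()
    ref_counts: dict[str, int] = {}
    for t in ref_toks:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    hits = 0
    for t in hyp_toks:
        if ref_counts.get(t, 0) > 0:
            ref_counts[t] -= 1
            hits += 1
    return hits / len(hyp_toks)

system_output = "the agreement was signed by both parties yesterday"
# A post-edited "reference" keeps most of the system's wording:
postedit_ref = "the agreement was signed by both parties on Monday"
# An independent human translation differs more in word choice:
independent_ref = "both sides signed the deal on Monday"

print(unigram_precision(system_output, postedit_ref))    # → 0.875 (7/8)
print(unigram_precision(system_output, independent_ref)) # → 0.375 (3/8)
```

The same translation quality yields very different scores depending on how the reference was made, which is why post-edited references bias the evaluation.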
- 2006 - 2016 ish: Used relative ranking
- works pretty well, pairwise comparisons
- issues:
- 16 systems per language pair becomes cumbersome for human eval
- was used when using sentence-level evaluation
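A minimal sketch of how relative ranking works (hypothetical judgments and system names; WMT's actual aggregation was more involved): collect pairwise win/loss judgments and rank systems by the fraction of comparisons won.

```python
from collections import Counter

# Hypothetical pairwise judgments from annotators: (winner, loser).
judgments = [
    ("sysA", "sysB"), ("sysA", "sysC"), ("sysB", "sysC"),
    ("sysA", "sysB"), ("sysB", "sysC"), ("sysA", "sysC"),
]

wins = Counter(winner for winner, _ in judgments)
appearances: Counter = Counter()
for winner, loser in judgments:
    appearances[winner] += 1
    appearances[loser] += 1

# Rank by fraction of pairwise comparisons won. With n systems there are
# n*(n-1)/2 pairs - 16 systems means 120 pairs, hence "cumbersome".
ranking = sorted(appearances, key=lambda s: wins[s] / appearances[s], reverse=True)
print(ranking)  # → ['sysA', 'sysB', 'sysC']
```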
- 2016 / mainly 2017: switched to reference-based direct assessment (DA)
- don't need bilingual experts, since annotators compare against the reference
- Since ~2012: MQM in the Metrics task
- Now: Building new protocol - ESA
Why don’t we use MQM?
- Need experts who know the error ontology - expensive and problematic for low-resource languages
- MQM is ~10x more expensive than DA + SQM per annotated sentence-level segment
Why can’t we continue with DA + SQM?
- On paragraph level, cost increases
- need to evaluate the same number of segments, but paragraphs take longer than sentences
- Concerns that the fluency of LLMs strongly biases annotators
Comparison of MQM vs DA + SQM:
- no discordant evaluation results when DA + SQM annotates 4x more segments
- when using the same number of segments for DA + SQM, we see many discordant clusterings of models
Ricardo Rei: How do you know which one is incorrect, MQM or DA + SQM?
- TK: annotator bias from the fluency of LLM outputs; we assume that DA + SQM is the one that's wrong
- RR: comparing e.g. Croatian with and without length normalisation changes MQM results a lot
New protocol ESA: Error Span Annotations
- Goals of ESA:
- faster by removing span categories
- … [see slides]
- … [see slides]
- ~31.7% speedup for ESA vs MQM (33.8s vs 49.4s)
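A sketch of what an ESA annotation record could look like, based on the stated idea of marking error spans without categories and then giving a final score (the field names and the example sentence are my assumptions, not the official ESA schema):

```python
from dataclasses import dataclass, field

@dataclass
class ErrorSpan:
    # Character offsets of the marked error in the translation. No error
    # category, only a severity - dropping categories is what makes ESA
    # faster than MQM.
    start: int
    end: int
    severity: str  # e.g. "minor" or "major"

@dataclass
class ESAAnnotation:
    segment: str
    spans: list[ErrorSpan] = field(default_factory=list)
    score: float = 0.0  # final direct-assessment-style score after marking spans

# Hypothetical annotation: "in" should be "on".
ann = ESAAnnotation(
    segment="The cat sat in the mat.",
    spans=[ErrorSpan(start=12, end=14, severity="minor")],
    score=85.0,
)
print(ann.segment[ann.spans[0].start:ann.spans[0].end])  # → "in"
```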
- want annotators to evaluate the same number of segments across all systems - so if they’re harsh or lenient, it balances out
- 2022:
- evaluate all systems on same subset of documents
- use raw scores - not z-scores
- don’t use participants to evaluate - previously, submitting a system to WMT meant doing 8 hours of evaluation
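A toy illustration of the raw-vs-z-score choice (hypothetical scores; the function roughly mirrors the per-annotator z-scoring WMT used before switching to raw scores): once every annotator rates the same number of segments per system, z-scoring is no longer needed to correct for harshness, and it can even distort score differences.

```python
from statistics import mean, stdev

# Hypothetical DA scores (0-100) from a harsh and a lenient annotator.
scores = {
    "harsh_annotator":   {"sysA": 60, "sysB": 50},   # 10-point gap
    "lenient_annotator": {"sysA": 95, "sysB": 90},   # 5-point gap
}

def z_normalise(by_annotator: dict) -> dict:
    # Per-annotator z-scoring: subtract the annotator's mean score and
    # divide by their standard deviation.
    out = {}
    for annotator, sys_scores in by_annotator.items():
        mu = mean(sys_scores.values())
        sd = stdev(sys_scores.values())
        out[annotator] = {s: (v - mu) / sd for s, v in sys_scores.items()}
    return out

# After z-scoring, both annotators' scores come out identical (±0.707),
# hiding that one saw a 10-point gap and the other only a 5-point gap.
print(z_normalise(scores))
```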
- 2023:
- discontinued reference-based DA
- (semi-)professional annotators - no longer MTurk
- prototyped paragraph-level on English-German
Open problems:
- How to distribute systems to annotators
- showing different translations of the same doc reduces annotator attention, leading to bad-quality annotations
- how many segments do we need to evaluate?
- … [see slides]
- … [see slides]