Hurdles at WMT: Keeping up with the MT progress - Tom Kocmi
2024-06-04 @ 10:00 WEST - Sardine Lab, Técnico, Lisbon
- WMT since 2022: Moved from News shared task to multi-domain
- prevents over-specialised, targeted systems
- kept news, added social, added speech; changing year on year
- System breakers (red team) via test suites
- Challenging language pairs
- Allowing pretrained models - LLMs like Llama, Mistral
- Fewer translation directions but more language pairs
New for this year:
- paragraph-level MT - “should have gone to paragraph level a long time ago”
- tried before but broke a lot of things; didn’t know how to do paragraph-level MT evaluation
- Sentence splitting the test set is a hard task - English is easy, but e.g. Chinese…? Asked humans to do it; even they’re not always consistent
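To illustrate why sentence splitting is hard, here is a minimal sketch (my own toy example, not WMT's pipeline) of a naive punctuation-based splitter: it already fails on English abbreviations and does nothing at all for Chinese punctuation.

```python
import re

def naive_split(text: str) -> list[str]:
    # Split on sentence-final punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# English abbreviations already break the naive rule:
print(naive_split("Dr. Smith arrived. He was late."))
# → ['Dr.', 'Smith arrived.', 'He was late.']

# Chinese uses different punctuation (。) entirely, so nothing is split:
print(naive_split("今天天气很好。我们去公园了。"))
# → the whole string comes back as one "sentence"
```

Real splitters need language-specific rules, which is why even human annotators disagree.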
- Three tracks:
- constrained - use only listed data and models
- open - use anything publicly available to the research community
- closed systems - always included: Microsoft, Google, DeepL, online systems, GPT-4, etc.
- LLM benchmark - “to this day, I still haven’t seen a nicely evaluated LLM system for MT, so we tried”
- Multimodal - speech-to-text
- we will also provide an automatic transcript of the audio - text-to-text systems can also compete, but you're allowed to use the audio (and that's the intention)
- Literary domain
Collecting and translating datasets:
- want data to be fresh - published in recent months - to avoid data contamination in models' training sets
- hard to keep it fresh if you go outside news domain; tried Reddit, but they stopped WMT from using this data; now using Mastodon - hopefully they don't stop WMT either
- where to get the monolingual data
Data prep:
- preparing data: collecting candidate documents
- manually selecting docs + fixing/removing problematic parts
- translating data by humans - Chinese translators told us they would not translate this (claimed it was propaganda); similar issues with sensitive content
AM: Were you directly in touch with translators?
- TK: WMT does not have a budget; work with companies MTT (Japan)
- at Microsoft, we work with vendors
- with Czech, I know the translators
Open problems with test sets
- where to get domain-specific data esp. for non-English languages
- Mastodon doesn’t work outside of English; maybe some German and French, no Czech, no Korean, no Icelandic etc.
- how to automatically check quality of human references?
- get references from partners
- how to check that references are not post-edits?
- translators use some system, e.g. GPT-4, to generate a translation and then post-edit it, so BLEU scores on those segments are inflated for those systems
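The post-editing problem above can be shown with a toy overlap metric (a stand-in for BLEU; the sentences and metric are mine, purely illustrative): a "reference" produced by post-editing one system's output shares far more tokens with that system than an independent translation does.

```python
def unigram_precision(hyp: str, ref: str) -> float:
    # Toy stand-in for BLEU: fraction of hypothesis tokens matched in the reference.
    hyp_toks, ref_toks = hyp.split(), ref.split()
    ref_counts: dict[str, int] = {}
    for t in ref_toks:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    hits = 0
    for t in hyp_toks:
        if ref_counts.get(t, 0) > 0:
            ref_counts[t] -= 1
            hits += 1
    return hits / len(hyp_toks)

system_output = "the agreement was signed by both parties yesterday"
# A post-edited "reference" keeps most of the system's wording:
postedit_ref = "the agreement was signed by both parties on Monday"
# An independent human translation differs more in word choice:
independent_ref = "both sides signed the deal on Monday"

print(unigram_precision(system_output, postedit_ref))    # → 0.875 (7/8)
print(unigram_precision(system_output, independent_ref)) # → 0.375 (3/8)
```

The same translation quality yields very different scores depending on how the reference was made, which is why post-edited references bias the evaluation.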
- 2006 - 2016 ish: Used relative ranking
- works pretty well, pairwise comparisons
- issues:
- 16 systems per language pair becomes cumbersome for human eval
- was used when using sentence-level evaluation
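A minimal sketch of how relative ranking works (hypothetical judgments and system names; WMT's actual aggregation was more involved): collect pairwise win/loss judgments and rank systems by the fraction of comparisons won.

```python
from collections import Counter

# Hypothetical pairwise judgments from annotators: (winner, loser).
judgments = [
    ("sysA", "sysB"), ("sysA", "sysC"), ("sysB", "sysC"),
    ("sysA", "sysB"), ("sysB", "sysC"), ("sysA", "sysC"),
]

wins = Counter(winner for winner, _ in judgments)
appearances: Counter = Counter()
for winner, loser in judgments:
    appearances[winner] += 1
    appearances[loser] += 1

# Rank by fraction of pairwise comparisons won. With n systems there are
# n*(n-1)/2 pairs - 16 systems means 120 pairs, hence "cumbersome".
ranking = sorted(appearances, key=lambda s: wins[s] / appearances[s], reverse=True)
print(ranking)  # → ['sysA', 'sysB', 'sysC']
```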
- 2016 / mainly 2017: switched to reference-based direct assessment (DA)
- don't need bilingual experts, since annotators compare against the reference
- Since ~2012: MQM in the Metrics task
- Now: Building new protocol - ESA
Why don’t we use MQM?
- Need experts who know the error ontology - expensive and problematic for low-resource languages
- MQM is ~10x more expensive than DA + SQM per annotated sentence-level segment
Why can’t we continue with DA + SQM?
- On paragraph level, cost increases
- need to evaluate the same number of segments, but paragraphs take longer than sentences
- Concerns that the fluency of LLMs strongly biases annotators
Comparison of MQM vs DA + SQM:
- no discordant evaluation results when DA + SQM annotates 4x more segments
- when using the same number of segments for DA + SQM, we see many discordant clusterings of models
Ricardo Rei: How do you know which one is incorrect, MQM or DA + SQM?
- TK: annotator bias from the fluency of LLM outputs; we assume that DA + SQM is the one that's wrong
- RR: comparing e.g. Croatian with and without length normalisation changes MQM results a lot
New protocol ESA: Error Span Annotations
- Goals of ESA:
- faster by removing span categories
- … [see slides]
- … [see slides]
- ~31.7% speedup for ESA vs MQM (33.8s vs 49.4s)
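A sketch of what an ESA annotation record could look like, based on the stated idea of marking error spans without categories and then giving a final score (the field names and the example sentence are my assumptions, not the official ESA schema):

```python
from dataclasses import dataclass, field

@dataclass
class ErrorSpan:
    # Character offsets of the marked error in the translation. No error
    # category, only a severity - dropping categories is what makes ESA
    # faster than MQM.
    start: int
    end: int
    severity: str  # e.g. "minor" or "major"

@dataclass
class ESAAnnotation:
    segment: str
    spans: list[ErrorSpan] = field(default_factory=list)
    score: float = 0.0  # final direct-assessment-style score after marking spans

# Hypothetical annotation: "in" should be "on".
ann = ESAAnnotation(
    segment="The cat sat in the mat.",
    spans=[ErrorSpan(start=12, end=14, severity="minor")],
    score=85.0,
)
print(ann.segment[ann.spans[0].start:ann.spans[0].end])  # → "in"
```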
- want annotators to evaluate the same number of segments across all systems - so if they’re harsh or lenient, it balances out
- 2022:
- evaluate all systems on same subset of documents
- use raw scores - not z-scores
- don’t use participants to evaluate - previously, submitting a system to WMT meant doing 8 hours of evaluation
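A toy illustration of the raw-vs-z-score choice (hypothetical scores; the function roughly mirrors the per-annotator z-scoring WMT used before switching to raw scores): once every annotator rates the same number of segments per system, z-scoring is no longer needed to correct for harshness, and it can even distort score differences.

```python
from statistics import mean, stdev

# Hypothetical DA scores (0-100) from a harsh and a lenient annotator.
scores = {
    "harsh_annotator":   {"sysA": 60, "sysB": 50},   # 10-point gap
    "lenient_annotator": {"sysA": 95, "sysB": 90},   # 5-point gap
}

def z_normalise(by_annotator: dict) -> dict:
    # Per-annotator z-scoring: subtract the annotator's mean score and
    # divide by their standard deviation.
    out = {}
    for annotator, sys_scores in by_annotator.items():
        mu = mean(sys_scores.values())
        sd = stdev(sys_scores.values())
        out[annotator] = {s: (v - mu) / sd for s, v in sys_scores.items()}
    return out

# After z-scoring, both annotators' scores come out identical (±0.707),
# hiding that one saw a 10-point gap and the other only a 5-point gap.
print(z_normalise(scores))
```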
- 2023:
- discontinued reference-based DA
- (semi-)professional annotators - no longer MTurk
- prototyped paragraph-level on English-German
Open problems:
- How to distribute systems to annotators
- showing different translations of the same doc reduces annotator attention, leading to bad-quality annotations
- how many segments do we need to evaluate?
- … [see slides]
- … [see slides]