Hurdles at WMT: Keeping up with the MT progress - Tom Kocmi

2024-06-04 @ 10:00 WEST - Sardine Lab, Técnico, Lisbon

  • WMT since 2022: Moved from News shared task to multi-domain
    • prevents over-specialised, targeted systems
    • kept news, added social, added speech; changing year on year
  • System breakers (red team) via test suites
  • Challenging language pairs
  • Allowing pretrained models - LLMs like Llama, Mistral
  • Fewer translation directions but more language pairs

New for this year:

  • paragraph-level MT - “should have gone to paragraph level a long time ago”
    • tried before but broke a lot of things; didn’t know how to do paragraph-level MT evaluation
  • Sentence-splitting the test set is hard - English is easy, but e.g. Chinese…? Asked humans to do it; even they’re not always consistent
  • Three tracks:
    • constrained - use only listed data and models
    • open - use anything publicly available to the research community
    • closed systems - always included Microsoft, Google, DeepL; the online systems, GPT-4, etc.
  • LLM benchmark - “to this day, I still haven’t seen a nicely evaluated LLM system for MT, so we tried”
  • Multimodal - speech-to-text
    • an automatic transcript of the audio will also be provided - text-to-text systems can also compete, but you’re allowed to use the audio (and this is the intention)
  • Literary domain

Collecting and translating datasets:

  • want data to be fresh - published in recent months - to avoid data contamination in models’ training sets

    • hard to keep it fresh outside the news domain; tried Reddit, but they stopped WMT from using this data; now using Mastodon - hopefully they don’t stop WMT either
    • where to get the monolingual data
  • Data prep:

    • collecting candidate data
    • manually selecting docs + fixing/removing problematic parts
    • translating data by humans - Chinese translators told us they will not translate this (claimed it was propaganda); similar issues with sensitive content
  • AM: Were you directly in touch with translators?

    • TK: WMT does not have a budget; works with companies, e.g. MTT (Japan)
      • at Microsoft, we work with vendors
      • with Czech, I know the translators
  • Open problems with test sets

    • where to get domain-specific data esp. for non-English languages
      • Mastodon doesn’t work outside of English; maybe some German and French, no Czech, no Korean, no Icelandic etc.
    • how to automatically check quality of human references?
      • get references from partners
    • how to check that references are not post-edits?
      • translators use a system, e.g. GPT-4, to generate a translation and then post-edit it, so BLEU scores on those segments are inflated for that system
    • 2006 - 2016 ish: Used relative ranking
      • works pretty well, pairwise comparisons
      • issues:
        • 16 systems per language pair becomes cumbersome for human eval
        • was used when using sentence-level evaluation
    • 2016 / mainly 2017: have been using reference-based direct assessment
      • don’t need language experts, since outputs are compared against the reference
    • Since ~2012: MQM at the Metrics task
    • Now: Building new protocol - ESA
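The post-edit problem above (a reference that is really a post-edit of some system’s output inflates that system’s BLEU) can be screened for with a crude overlap check. A minimal sketch - the function names and the 0.8 threshold are my own illustrative choices, not a WMT procedure:

```python
from collections import Counter

def ngram_overlap(hyp: str, ref: str, n: int = 3) -> float:
    """Fraction of the reference's word n-grams that also occur in the hypothesis."""
    def ngrams(text: str) -> Counter:
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    h, r = ngrams(hyp), ngrams(ref)
    total = sum(r.values())
    if total == 0:
        return 0.0
    return sum(min(count, h[g]) for g, count in r.items()) / total

def flag_suspect_references(reference: str, system_outputs: dict, threshold: float = 0.8) -> dict:
    """Flag systems whose output overlaps suspiciously much with the reference,
    i.e. candidates for the reference being a post-edit of that system."""
    return {name: score for name, out in system_outputs.items()
            if (score := ngram_overlap(out, reference)) >= threshold}
```

Any system sharing almost all of the reference’s n-grams is a candidate post-edit source and could be sent for manual inspection.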
  • Why don’t we use MQM?

    • Need experts who know the error ontology - expensive and problematic for low-resource languages
    • MQM ~10x more expensive than DA + SQM per annotated sentence-level segment
  • Why can’t we continue with DA + SQM?

    • On paragraph level, cost increases
      • need to evaluate same number of segments, but paragraphs take longer than sentences
    • Concerns that the fluency of LLMs strongly biases annotators
  • Comparison of MQM vs DA + SQM:

    • no discordant evaluation results when DA + SQM annotates 4x more segments
    • with the same number of segments, DA + SQM shows many discordant clusterings of models
  • Ricardo Rei: How do you know which one is incorrect, MQM or DA + SQM?

    • TK: Annotator bias from fluency of LLM outputs; we assume that DA + SQM is wrong
    • RR: comparing e.g. Croatian with and without length normalisation changes MQM results a lot
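The “discordant clusterings” above can be made concrete: rank systems under each protocol, group them into quality clusters, and count pairs ordered differently by the two protocols. A toy sketch - the gap-based clustering is a stand-in for the proper significance tests WMT actually uses:

```python
def cluster_systems(scores: dict, gap: float) -> list:
    """Sort systems by mean score and start a new cluster whenever the gap to the
    previous system exceeds `gap` (a crude stand-in for a significance test)."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    clusters, current = [], [ranked[0][0]]
    for (_, prev_score), (name, score) in zip(ranked, ranked[1:]):
        if prev_score - score > gap:
            clusters.append(current)
            current = []
        current.append(name)
    clusters.append(current)
    return clusters

def discordant_pairs(clusters_a: list, clusters_b: list) -> list:
    """System pairs whose relative ordering (better / worse / tied cluster)
    differs between the two clusterings."""
    def rank(clusters):
        return {sys: i for i, cluster in enumerate(clusters) for sys in cluster}
    def sgn(d):
        return (d > 0) - (d < 0)
    ra, rb = rank(clusters_a), rank(clusters_b)
    systems = sorted(ra)
    return [(x, y) for i, x in enumerate(systems) for y in systems[i + 1:]
            if sgn(ra[x] - ra[y]) != sgn(rb[x] - rb[y])]
```

E.g. if MQM puts two systems in the same top cluster but DA + SQM separates them, that pair counts as discordant.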

New protocol ESA: Error Span Annotations

  • Goals of ESA:
    • faster by removing span categories
    • … [see slides]
    • … [see slides]
  • speedup with ESA vs MQM of ~31.7% - 33.8s vs 49.4s
  • want annotators to evaluate same number of segments across all systems - so if they’re harsh or lenient, it balances out
  • 2022:
    • evaluate all systems on same subset of documents
    • use raw scores - not z-scores
    • don’t use participants to evaluate - if you submitted your system to WMT, you used to have to do 8 hours of evaluation
  • 2023:
    • discontinued reference-based DA
    • (semi-)professional annotators - no longer MTurk
    • prototyped paragraph-level on English-German
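As a rough sketch of what an ESA annotation could look like as data: the annotator marks error spans with only a minor/major severity (no MQM error category) and then gives a 0-100 score. The field names below are my own illustrative assumptions, not the official format, and the MQM-style weights (minor = 1, major = 5) are included only for comparison:

```python
from dataclasses import dataclass, field

@dataclass
class ErrorSpan:
    start: int      # character offset in the translation (assumed representation)
    end: int
    severity: str   # "minor" or "major" - no MQM category needed in ESA

@dataclass
class ESAAnnotation:
    segment_id: str
    spans: list = field(default_factory=list)   # list of ErrorSpan
    final_score: int = 100                      # 0-100 score given after marking spans

# MQM-style severity weights, for comparing protocols - not part of ESA itself
SEVERITY_WEIGHTS = {"minor": 1, "major": 5}

def span_penalty(ann: ESAAnnotation) -> int:
    """MQM-like penalty derived from the marked spans, ignoring the 0-100 score."""
    return sum(SEVERITY_WEIGHTS[s.severity] for s in ann.spans)
```

Dropping the category taxonomy is what makes the annotation faster (the ~32% speedup above) while the spans keep the evaluation grounded in concrete errors.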

Open problems:

  • How to distribute systems to annotators
    • showing different translations of the same doc reduces annotator attention - bad quality annotations
  • how many segments do we need to evaluate?
  • … [see slides]
  • … [see slides]
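For the “how many segments do we need” question, a textbook two-sample power calculation gives a ballpark. A hedged sketch - the formula is standard, but the example numbers are illustrative, not WMT’s methodology:

```python
import math

def segments_needed(score_std: float, min_detectable_diff: float,
                    z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Segments per system needed to detect a mean score difference of
    `min_detectable_diff` at ~95% confidence with ~80% power, given a
    per-segment score standard deviation of `score_std`.
    Standard formula: n = 2 * ((z_alpha + z_beta) * sigma / delta)^2."""
    n = 2 * ((z_alpha + z_beta) * score_std / min_detectable_diff) ** 2
    return math.ceil(n)
```

With noisy 0-100 scores (std around 25) and a 5-point difference to detect, this lands in the high hundreds of segments per system - which is why halving annotation time per segment (as ESA aims to) matters so much.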