Title: MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task
Authors: Juraj Juraska, Daniel Deutsch, Mara Finkelstein, Markus Freitag
Published: 4th October 2024 (Friday) @ 23:52:28
Link: http://arxiv.org/abs/2410.03983v1

Abstract

In this paper, we present the MetricX-24 submissions to the WMT24 Metrics Shared Task and provide details on the improvements we made over the previous version of MetricX. Our primary submission is a hybrid reference-based/-free metric, which can score a translation irrespective of whether it is given the source segment, the reference, or both. The metric is trained on previous WMT data in a two-stage fashion, first on the DA ratings only, then on a mixture of MQM and DA ratings. The training set in both stages is augmented with synthetic examples that we created to make the metric more robust to several common failure modes, such as fluent but unrelated translation or undertranslation. We demonstrate the benefits of the individual modifications via an ablation study, and show a significant performance increase over MetricX-23 on the WMT23 MQM ratings, as well as our new synthetic challenge set.


This year, we made four submissions to the WMT24 Metrics Shared Task, all based on the mT5 language model (Xue et al., 2021), which is further fine-tuned on direct assessment (DA) ratings, MQM ratings (Lommel et al., 2014; Freitag et al., 2021), and newly constructed synthetic data. The primary submission, denoted MetricX-24-Hybrid, is a hybrid reference-based/-free metric, which can score a translation irrespective of whether it is given the source segment, the reference, or both. The same model thus serves as the primary submission for both the reference-based evaluation and the quality estimation (QE) task, predicting scores once with and once without the reference provided in the input. Our contrastive submissions, MetricX-24(-QE), are standalone reference-based/QE models, trained only for their specific task.

Key takeaways from our experiments, detailed in this report, include:

  1. Learned metrics cannot reliably detect undertranslation, duplication, missing punctuation, and fluent but unrelated translation;
  2. Adding a relatively small amount of synthetic data to the training set can boost the metric’s performance, especially on lower-quality translations with the above issues;
  3. It is possible to effectively train a metric on a mixture of MQM and DA ratings, thus maintaining high performance on a larger set of language pairs;
  4. Training a metric in the hybrid input mode, i.e., with and without the reference included in the input, allows it to learn to rely less on the reference when it is of poor quality.
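The failure modes in takeaway 1 can be simulated by perturbing existing translations to create low-quality synthetic training examples. The perturbations below are illustrative assumptions about how such examples might be generated, not the paper's exact recipe.

```python
# Hedged sketch of synthetic failure-mode generation: perturb a good
# translation to mimic common errors, then pair the result with a low
# quality score during training. Specific functions are illustrative.

def make_undertranslation(translation: str, drop_fraction: float = 0.5) -> str:
    """Simulate undertranslation by dropping a suffix of the segment."""
    words = translation.split()
    keep = max(1, int(len(words) * (1 - drop_fraction)))
    return " ".join(words[:keep])

def make_duplication(translation: str) -> str:
    """Simulate a duplication error by repeating the whole segment."""
    return translation + " " + translation

def make_missing_punctuation(translation: str) -> str:
    """Simulate missing terminal punctuation."""
    return translation.rstrip(".!?")
```

Mixing a relatively small number of such perturbed examples into the training set is what takeaway 2 credits with improving robustness on lower-quality translations.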

In developing MetricX-24, we relied solely on publicly available data from the WMT Metrics shared tasks between 2015 and 2023.