Title: On Instruction-Finetuning Neural Machine Translation Models
Authors: Vikas Raunak, Roman Grundkiewicz, Marcin Junczys-Dowmunt
Published: 7 October 2024
Link: http://arxiv.org/abs/2410.05553v1

Abstract

In this work, we introduce instruction finetuning for Neural Machine Translation (NMT) models, which distills instruction following capabilities from Large Language Models (LLMs) into orders-of-magnitude smaller NMT models. Our instruction-finetuning recipe for NMT models enables customization of translations for a limited but disparate set of translation-specific tasks. We show that NMT models are capable of following multiple instructions simultaneously and demonstrate capabilities of zero-shot composition of instructions. We also show that through instruction finetuning, traditionally disparate tasks such as formality-controlled machine translation, multi-domain adaptation as well as multi-modal translations can be tackled jointly by a single instruction finetuned NMT model, at a performance level comparable to LLMs such as GPT-3.5-Turbo. To the best of our knowledge, our work is among the first to demonstrate the instruction-following capabilities of traditional NMT models, which allows for faster, cheaper and more efficient serving of customized translations.


Contributions:

  1. Recipe for instruction finetuning NMT models (trained with supervision only on parallel datasets), which allows for joint modeling of disparate translation customization tasks in a single NMT model. The criticality of each recipe component is analyzed through ablation experiments.
  2. Demonstrate that NMT models are capable of following multiple (30+) instructions simultaneously. We also find that, as an effect of finetuning, NMT models exhibit zero-shot composition of instructions.
  3. We show that, with a single instruction-finetuned NMT model, traditional customization tasks such as formality-controlled machine translation can be tackled with high performance, in conjunction with several disparate tasks.

Notably, the finetuned NMT model outperforms GPT-3.5-Turbo on average on the IWSLT-22 Formality Control Shared Task (Antonios et al., 2022).

Instruction Finetuning:

  • Take a pre-trained NMT model
  • Finetune it with instruction-annotated source-translation pairs
  • The instruction is prepended to the source text inside tags that demarcate it, e.g., <instruction> informal
  • Expand the vocabulary of the NMT model with the instruction tokens in order to delineate the instructions cleanly from the actual source text.
    • Because free-form text instructions appear only within these instruction tokens on the source side, the NMT model never sees the instruction tokens on the output side, so the risk of translating the instructions themselves is greatly diminished.
    • Initialize the embeddings of the newly added tokens to random embeddings centered around the mean of the embedding matrix (in particular, the mean plus a unitary projection of randomly sampled embedding principal components); a sketch is given after this list.
  • Curate both task-specific and parallel datasets used for finetuning
    • For curating the parallel (non-instruction) dataset, we apply standard heuristics to the model’s parallel data to sample a higher-quality subset (compared to the model’s full training corpus). The details of the heuristics are presented in Appendix D.
    • Task-specific data curation: we either manually curate translations from the parallel dataset or generate the translations synthetically with LLMs (GPT-4 and GPT-3.5-Turbo).
  • Finally, the NMT model is finetuned on a mix (2:1) of parallel and task data (see the data-mixing sketch after this list)
    • The mixing ratio is tuned so that no degradation in general translation performance is observed on the WMT’20 validation set
  • At the end of the finetuning, the finetuned and the base models are optionally interpolated to achieve a better trade-off between general and task performance.
    • Details of the interpolation step in Appendix A
    • The interpolation was found to be optional; none of the experiments in the main paper use it (a weight-interpolation sketch also follows this list)
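
Below is a minimal PyTorch sketch of the embedding-initialization step for the newly added instruction tokens. It assumes the embedding table has already been resized so that the new tokens occupy the last rows; the function name, the choice of torch.pca_lowrank, and num_components are illustrative assumptions rather than the paper's actual implementation.

```python
import torch

def init_instruction_token_embeddings(embedding: torch.nn.Embedding,
                                       num_new_tokens: int,
                                       num_components: int = 16) -> None:
    """Initialize the last `num_new_tokens` rows of an already-resized embedding
    matrix around the mean of the pre-existing embeddings, offset by a unit-norm
    random combination of the embedding principal components (one plausible
    reading of the recipe above)."""
    with torch.no_grad():
        old = embedding.weight[:-num_new_tokens]        # pre-existing token embeddings
        mean = old.mean(dim=0)                          # centroid of the embedding matrix
        # Principal components of the (centered) embedding matrix.
        _, _, components = torch.pca_lowrank(old, q=num_components)
        for row in range(num_new_tokens):
            coeffs = torch.randn(num_components)        # random mix of the top components
            offset = components @ coeffs                # vector in the principal subspace
            offset = offset / offset.norm()             # unit-norm ("unitary") projection
            embedding.weight[-num_new_tokens + row] = mean + offset
```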
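
The data-mixing step (and the tag-prefixed input format) can be sketched as follows. The exact tag strings, function names, and the use of plain random sampling are assumptions for illustration, with the 2:1 ratio read as parallel : task.

```python
import random

def format_source(instruction: str, source: str) -> str:
    # Hypothetical tag format: the instruction is prepended to the source text
    # inside demarcating instruction tags; parallel examples carry no instruction.
    return f"<instruction> {instruction} </instruction> {source}" if instruction else source

def build_finetuning_mix(parallel_pairs, task_triples, ratio=2, seed=0):
    """Mix the filtered parallel data (no instruction) with instruction-annotated
    task data at roughly `ratio`:1 (parallel : task), then shuffle for finetuning."""
    rng = random.Random(seed)
    task = [(format_source(instr, src), tgt) for instr, src, tgt in task_triples]
    n_parallel = min(len(parallel_pairs), ratio * len(task))
    parallel = rng.sample(parallel_pairs, n_parallel)
    mix = list(parallel) + task
    rng.shuffle(mix)
    return mix
```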
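
The interpolation step itself is only referenced to Appendix A in these notes; a generic linear weight interpolation between the base and finetuned checkpoints would look like the sketch below, with alpha as a hypothetical mixing coefficient (alpha = 0 recovers the base model, alpha = 1 the finetuned one).

```python
import torch

def interpolate_weights(base_state: dict, finetuned_state: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate base and finetuned NMT checkpoints, parameter by parameter."""
    merged = {}
    for name, ft_param in finetuned_state.items():
        base_param = base_state[name]
        if torch.is_floating_point(ft_param):
            merged[name] = (1.0 - alpha) * base_param + alpha * ft_param
        else:
            merged[name] = ft_param  # non-float buffers are taken from the finetuned model
    return merged

# Usage (assuming two loaded models with identical architectures):
# merged = interpolate_weights(base_model.state_dict(), finetuned_model.state_dict(), alpha=0.5)
# finetuned_model.load_state_dict(merged)
```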