Title: Tower: An Open Multilingual Large Language Model for Translation-Related Tasks
Authors: Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G. C. de Souza, André F. T. Martins
Published: 27th February 2024 (Tuesday) @ 18:09:36
Link: http://arxiv.org/abs/2402.17733v1

Abstract

While general-purpose large language models (LLMs) demonstrate proficiency on multiple tasks within the domain of translation, approaches based on open LLMs are competitive only when specializing on a single task. In this paper, we propose a recipe for tailoring LLMs to multiple tasks present in translation workflows. We perform continued pretraining on a multilingual mixture of monolingual and parallel data, creating TowerBase, followed by finetuning on instructions relevant for translation processes, creating TowerInstruct. Our final model surpasses open alternatives on several tasks relevant to translation workflows and is competitive with general-purpose closed LLMs. To facilitate future research, we release the Tower models, our specialization dataset, an evaluation framework for LLMs focusing on the translation ecosystem, and a collection of model generations, including ours, on our benchmark.


  • continued pre-training on 20B tokens, creating TowerBase
  • Dataset: TowerBlocks
  • supervised fine-tuning on instructions (IFT), specialising (adapting) the model for translation: TowerInstruct
  • Releases:
    • TowerBase & TowerInstruct @ 7B and 13B
    • TowerBlocks: Specialisation dataset
    • TowerEval: Evaluation framework for LLMs for translation-related tasks
  • Justification for using parallel corpora during continued pre-training comes from evidence that PaLM’s machine translation capabilities stem from incidentally seen bitexts, described in Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability - they show that data-driven prompts based on the incidental parallel texts found in the training data improve MT performance (by up to 14 chrF points across languages). See also Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models
    • training data: 1/3 parallel data, 2/3 monolingual data - §4 shows this split benefits translation quality
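The 1/3 parallel, 2/3 monolingual split could be realised by probabilistically interleaving two data streams. A minimal sketch (my assumption - the paper only states the ratio, not how the streams are interleaved):

```python
import random

def mix_batches(monolingual, parallel, parallel_ratio=1/3, seed=0):
    """Interleave two corpora so that roughly `parallel_ratio` of the
    sampled examples come from the parallel stream. Illustrative only;
    the actual sampling scheme is not described in these notes."""
    rng = random.Random(seed)
    mono_it, par_it = iter(monolingual), iter(parallel)
    mixed = []
    for _ in range(len(monolingual) + len(parallel)):
        take_parallel = rng.random() < parallel_ratio
        try:
            mixed.append(next(par_it) if take_parallel else next(mono_it))
        except StopIteration:
            break  # stop when either stream is exhausted
    return mixed
```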
  • Data:
    • Monolingual: mC4 - improved with (1) deduplication (2) lang ID (3) perplexity filtering with KenLM
    • Parallel: xx→en and en→xx, after removing sentence pairs below quality thresholds using Bicleaner and CometKiwi-22 (details in Appendix C)
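Both cleaning steps boil down to the same threshold-filtering pattern: score each sentence pair with a quality model and drop anything below a cutoff. A sketch, where `score_fn` stands in for a real QE model such as CometKiwi-22 and the 0.5 threshold is illustrative, not the paper's actual value (see their Appendix C):

```python
def filter_parallel(pairs, score_fn, threshold=0.5):
    """Keep only (src, tgt) sentence pairs whose quality score clears
    `threshold`. `score_fn` is a placeholder for a trained quality
    estimator (e.g. Bicleaner or CometKiwi-22)."""
    return [(src, tgt) for src, tgt in pairs if score_fn(src, tgt) >= threshold]
```

Usage with a toy heuristic scorer, purely for illustration:

```python
pairs = [("hello world", "hola mundo"), ("x", "this is way too long to match")]
score = lambda s, t: 1.0 if abs(len(s) - len(t)) < 10 else 0.0
kept = filter_parallel(pairs, score)  # keeps only the first pair
```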
  • Continued pre-training took 10 days (7B) or 20 days (13B) on 8x A100-80GB GPUs
  • IFT - training for TowerInstruct:
      • training:
        • cross-entropy (so standard loss)
        • bfloat16 mixed precision and sequence packing (concatenating multiple short examples into a single sequence up to the max length, avoiding padding)
        • trained for 4 epochs with low learning rate (7e-6) and large batch size (global batch size: 256)
          • weight decay 0.01, Adam, maximum sequence length 2048, 500 warmup steps, cosine annealing
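The warmup + cosine annealing schedule with the stated hyperparameters (peak LR 7e-6, 500 warmup steps) can be sketched as below; the decay-to-zero floor is my assumption, since the notes don't state the schedule's endpoint:

```python
import math

def lr_at_step(step, total_steps, peak_lr=7e-6, warmup_steps=500):
    """Linear warmup to peak_lr over `warmup_steps`, then cosine
    annealing toward zero over the remaining steps. Sketch only;
    matches the hyperparameters noted above."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```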
      • dialogue template:
        • follows the ChatML template defined by OpenAI for use with its APIs
        • single tokenizable string
        • additional <|im_start|> and <|im_end|> tokens added to TowerInstruct tokenizer
        • dialogue template as shown in Appendix E.2
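Serializing a dialogue into the single ChatML string might look like this; the role names and exact whitespace are my assumptions (the paper's template is in Appendix E.2):

```python
def to_chatml(messages):
    """Render a list of {role, content} turns as one tokenizable ChatML
    string, delimited by the <|im_start|>/<|im_end|> tokens added to the
    TowerInstruct tokenizer. Formatting details are illustrative."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
```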
  • Evaluation:
    • Tasks:
      • Automatic Post-Editing (APE) - final translation quality after post-editing NLLB-3.3B translations for WMT23
      • Named Entity Recognition (NER) - test split from MultiCoNER
      • Grammatical error correction (GEC) - using held-out data (not in training data) from CoNLL-2014 (English), COWSL2H (Spanish) and mlconvgec2018 (German)
    • Baselines:
      • Llama 2 70B, Mixtral 8x7B-Instruct, GPT-3.5-turbo, GPT-4
      • Dedicated MT systems: NLLB-54B, ALMA-R
      • Open alternatives (“best match-up”): Gemma 7B, Mistral-7B-Instruct-v0.2, Qwen1.5 72B - Tower outperforms all these “matched up” alternatives
    • Metrics
      • Comet-22 (MT, APE)
      • xComet, CometKiwi-22, Bleurt, chrF (MT)
      • ER, ERRANT (GEC)
      • Sequence F1 score (NER)
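One common way to compute a sequence F1 for NER is micro F1 over exact-match entity spans; a sketch (the paper's exact scorer may differ):

```python
def span_f1(gold_spans, pred_spans):
    """Micro F1 over sets of (start, end, type) entity spans: a span
    counts as correct only if boundaries and type both match exactly."""
    gold, pred = set(gold_spans), set(pred_spans)
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```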
  • Results:
    • TowerInstruct-13B wins or

Questions:

  • what about the tokenizer? Do they keep the Llama 2 tokenizer, which was trained mostly on English data?
    • what is the Llama tokenizer? BPE? Trained on which corpus?
  • what is UltraChat? Glaive-Code-Assistant?
  • what are bfloat16 mixed precision and packing (?)