Title: Tower: An Open Multilingual Large Language Model for Translation-Related Tasks
Authors: Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G. C. de Souza, André F. T. Martins
Published: 27th February 2024 (Tuesday) @ 18:09:36
Link: http://arxiv.org/abs/2402.17733v1

Abstract

While general-purpose large language models (LLMs) demonstrate proficiency on multiple tasks within the domain of translation, approaches based on open LLMs are competitive only when specializing on a single task. In this paper, we propose a recipe for tailoring LLMs to multiple tasks present in translation workflows. We perform continued pretraining on a multilingual mixture of monolingual and parallel data, creating TowerBase, followed by finetuning on instructions relevant for translation processes, creating TowerInstruct. Our final model surpasses open alternatives on several tasks relevant to translation workflows and is competitive with general-purpose closed LLMs. To facilitate future research, we release the Tower models, our specialization dataset, an evaluation framework for LLMs focusing on the translation ecosystem, and a collection of model generations, including ours, on our benchmark.


  • continued pre-training on 20B tokens, creating TowerBase
  • Dataset: TowerBlocks
  • supervised fine-tuning on instructions (IFT), specialising (adapting) the model for translation: TowerInstruct
  • Releases:
    • TowerBase & TowerInstruct @ 7B and 13B
    • TowerBlocks: Specialisation dataset
    • TowerEval: Evaluation framework for LLMs for translation-related tasks
  • Justification for using parallel corpora during continued pre-training comes from evidence that PaLM’s machine translation capabilities stem from incidentally seen bitexts, described in Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability - they show that data-driven prompts based on the incidental parallel texts found in the training data improve MT performance (by up to 14 chrF points across languages). See also Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models
    • training data: 1/3 parallel data, 2/3 monolingual data - §4 shows this split benefits translation quality
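The 1/3 parallel, 2/3 monolingual split could be realised by probabilistically interleaving two data streams. A minimal sketch (my assumption - the paper only states the ratio, not how the streams are interleaved):

```python
import random

def mix_batches(monolingual, parallel, parallel_ratio=1/3, seed=0):
    """Interleave two corpora so that roughly `parallel_ratio` of the
    sampled examples come from the parallel stream. Illustrative only;
    the actual sampling scheme is not described in these notes."""
    rng = random.Random(seed)
    mono_it, par_it = iter(monolingual), iter(parallel)
    mixed = []
    for _ in range(len(monolingual) + len(parallel)):
        take_parallel = rng.random() < parallel_ratio
        try:
            mixed.append(next(par_it) if take_parallel else next(mono_it))
        except StopIteration:
            break  # stop when either stream is exhausted
    return mixed
```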
  • Data:
    • Monolingual: mC4 - improved with (1) deduplication (2) lang ID (3) perplexity filtering with KenLM
    • Parallel: xx→en and en→xx, after removing sentence pairs below quality thresholds using Bicleaner and CometKiwi-22 (details in Appendix C)
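Both cleaning steps boil down to the same threshold-filtering pattern: score each sentence pair with a quality model and drop anything below a cutoff. A sketch, where `score_fn` stands in for a real QE model such as CometKiwi-22 and the 0.5 threshold is illustrative, not the paper's actual value (see their Appendix C):

```python
def filter_parallel(pairs, score_fn, threshold=0.5):
    """Keep only (src, tgt) sentence pairs whose quality score clears
    `threshold`. `score_fn` is a placeholder for a trained quality
    estimator (e.g. Bicleaner or CometKiwi-22)."""
    return [(src, tgt) for src, tgt in pairs if score_fn(src, tgt) >= threshold]
```

Usage with a toy heuristic scorer, purely for illustration:

```python
pairs = [("hello world", "hola mundo"), ("x", "this is way too long to match")]
score = lambda s, t: 1.0 if abs(len(s) - len(t)) < 10 else 0.0
kept = filter_parallel(pairs, score)  # keeps only the first pair
```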
  • Continued pre-training took 10 days (7B) or 20 days (13B) on 8x A100-80GB GPUs
  • IFT - training for TowerInstruct:
      • training:
        • cross-entropy (so standard loss)
        • bfloat16 mixed precision and sequence packing (concatenating multiple short examples into a single sequence up to the max length, avoiding padding)
        • trained for 4 epochs with low learning rate (7e-6) and large batch size (global batch size: 256)
          • weight decay 0.01, Adam, maximum sequence length 2048, 500 warmup steps, cosine annealing
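The warmup + cosine annealing schedule with the stated hyperparameters (peak LR 7e-6, 500 warmup steps) can be sketched as below; the decay-to-zero floor is my assumption, since the notes don't state the schedule's endpoint:

```python
import math

def lr_at_step(step, total_steps, peak_lr=7e-6, warmup_steps=500):
    """Linear warmup to peak_lr over `warmup_steps`, then cosine
    annealing toward zero over the remaining steps. Sketch only;
    matches the hyperparameters noted above."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```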
      • dialogue template:
        • follows the ChatML template defined by OpenAI for use with its APIs
        • single tokenizable string
        • additional <|im_start|> and <|im_end|> tokens added to TowerInstruct tokenizer
        • dialogue template as shown in Appendix E.2
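Serializing a dialogue into the single ChatML string might look like this; the role names and exact whitespace are my assumptions (the paper's template is in Appendix E.2):

```python
def to_chatml(messages):
    """Render a list of {role, content} turns as one tokenizable ChatML
    string, delimited by the <|im_start|>/<|im_end|> tokens added to the
    TowerInstruct tokenizer. Formatting details are illustrative."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
```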
  • Evaluation:
    • Tasks:
      • Automatic Post-Editing (APE) - final translation quality after post-editing NLLB-3.3B translations for WMT23
      • Named Entity Recognition (NER) - test split from MultiCoNER
      • Grammatical error correction (GEC) - using held-out data (not in training data) from CoNLL-2014 (English), COWSL2H (Spanish) and mlconvgec2018 (German)
    • Baselines:
      • Llama 2 70B, Mixtral 8x7B-Instruct, GPT-3.5-turbo, GPT-4
      • Dedicated MT systems: NLLB-54B, ALMA-R
      • Open alternatives (“best match-up”): Gemma 7B, Mistral-7B-Instruct-v0.2, Qwen1.5 72B - Tower outperforms all these “matched up” alternatives
    • Metrics
      • Comet-22 (MT, APE)
      • xComet, CometKiwi-22, Bleurt, chrF (MT)
      • ER, ERRANT (GEC)
      • Sequence F1 score (NER)
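One common way to compute a sequence F1 for NER is micro F1 over exact-match entity spans; a sketch (the paper's exact scorer may differ):

```python
def span_f1(gold_spans, pred_spans):
    """Micro F1 over sets of (start, end, type) entity spans: a span
    counts as correct only if boundaries and type both match exactly."""
    gold, pred = set(gold_spans), set(pred_spans)
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```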
  • Results:
    • TowerInstruct-13B wins or

Questions:

  • what about the tokenizer? Do they keep the Llama 2 tokenizer, which was trained mostly on English data?
    • what is the Llama tokenizer? BPE? Trained on which corpus?
  • what is UltraChat? Glaive-Code-Assistant?
  • what are bfloat16 mixed precision and packing (?)