Title: EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models
Authors: Shaoxiong Ji, Zihao Li, Indraneil Paul, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O'Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, Barry Haddow
Published: 26th September 2024 (Thursday) @ 14:40:45
Link: http://arxiv.org/abs/2409.17892v1
Abstract
In this work, we introduce EMMA-500, a large-scale multilingual language model continue-trained on texts across 546 languages designed for enhanced multilingual performance, focusing on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks and PolyWrite, an open-ended generation benchmark developed in this study. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability.
- The MaLA dataset is an important contribution
- The data mix is important for avoiding regression on existing tasks; see Table 2 for the mix they used (instruction, monolingual, and more)
- Where did they get the data to assemble the MaLA corpus? See Table 19 in the appendix
- MT evals outperform all other models:
    - "Table 7: 3-shot results on FLORES-200 (X-Eng, BLEU/chrF++). EMMA-500 Llama 2 7B has better average performance than all baselines."
- Table 8: 3-shot results on FLORES-200 (Eng-X, BLEU/chrF++). EMMA-500 Llama 2 7B has better average performance than all baselines.
- Trained on Leonardo; global batch size 4096, total of 200B tokens (Llama 2 tokenizer):
    - "Our EMMA-500 model is trained on the Leonardo supercomputer, occupying 256 Nvidia A100 GPUs, using the GPT-NeoX framework (Andonian et al., 2023). During training, we set a global batch size of 4096 and worked with sequences of 4096 tokens. The training process ran for 12,000 steps, resulting in a total of 200 billion Llama 2 tokens."
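As a sanity check on the quoted numbers, the ~200B token total follows directly from global batch size × sequence length × training steps. This is a back-of-envelope calculation based only on the figures above, not on the paper's training code:

```python
# Back-of-envelope check of the reported training token count:
# tokens = global batch size (sequences/step) x sequence length (tokens/sequence) x steps.
global_batch_size = 4096   # sequences per optimizer step
sequence_length = 4096     # tokens per sequence
steps = 12_000             # total training steps

total_tokens = global_batch_size * sequence_length * steps
print(f"{total_tokens:,} tokens (~{total_tokens / 1e9:.0f}B)")
# -> 201,326,592,000 tokens (~201B), consistent with the stated "200 billion"
```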