Title: EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models
Authors: Shaoxiong Ji, Zihao Li, Indraneil Paul, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O'Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, Barry Haddow
Published: 26th September 2024 (Thursday) @ 14:40:45
Link: http://arxiv.org/abs/2409.17892v1
Abstract
In this work, we introduce EMMA-500, a large-scale multilingual language model continue-trained on texts across 546 languages designed for enhanced multilingual performance, focusing on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks and PolyWrite, an open-ended generation benchmark developed in this study. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability.
- The MaLA dataset is an important contribution
- The data mix is important for avoiding regression on existing tasks; see Table 2 for the mix they used (instruction, monolingual, and more)
- Where did they get the data to assemble the MaLA corpus? See Table 19 in the appendix
- MT evals outperform all other models:
    - "Table 7: 3-shot results on FLORES-200 (X-Eng, BLEU/chrF++). EMMA-500 Llama 2 7B has better average performance than all baselines."
- Table 8: 3-shot results on FLORES-200 (Eng-X, BLEU/chrF++). EMMA-500 Llama 2 7B has better average performance than all baselines.
- Trained on Leonardo; global batch size 4096, total of 200B tokens (Llama 2 tokenizer):
    - "Our EMMA-500 model is trained on the Leonardo supercomputer, occupying 256 Nvidia A100 GPUs, using the GPT-NeoX framework (Andonian et al., 2023). During training, we set a global batch size of 4096 and worked with sequences of 4096 tokens. The training process ran for 12,000 steps, resulting in a total of 200 billion Llama 2 tokens."
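As a sanity check on the quoted numbers, the ~200B token total follows directly from global batch size × sequence length × training steps. This is a back-of-envelope calculation based only on the figures above, not on the paper's training code:

```python
# Back-of-envelope check of the reported training token count:
# tokens = global batch size (sequences/step) x sequence length (tokens/sequence) x steps.
global_batch_size = 4096   # sequences per optimizer step
sequence_length = 4096     # tokens per sequence
steps = 12_000             # total training steps

total_tokens = global_batch_size * sequence_length * steps
print(f"{total_tokens:,} tokens (~{total_tokens / 1e9:.0f}B)")
# -> 201,326,592,000 tokens (~201B), consistent with the stated "200 billion"
```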