Title: Unsupervised Cross-lingual Representation Learning at Scale
Authors: Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov
Published: 5th November 2019 (Tuesday) @ 22:42:00
Link: http://arxiv.org/abs/1911.02116v2

Abstract

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.


Release: "XLM-R: State-of-the-art cross-lingual understanding through self-supervision"

Code added to existing repo at: https://github.com/facebookresearch/XLM


While earlier work in this area has demonstrated the effectiveness of multilingual masked language models on cross-lingual understanding, models such as XLM and multilingual BERT were limited in their ability to learn useful representations for low-resource languages. XLM-R improves on previous approaches in several ways:

  • Building on the cross-lingual approach that we used with XLM and RoBERTa, we increased the number of languages and training examples for our new model, training self-supervised cross-lingual representations from more than two terabytes of publicly available CommonCrawl data that had been cleaned and filtered. This included generating new unlabeled corpora for low-resource languages, scaling the amount of training data available for those languages by two orders of magnitude.

  • During fine-tuning, we leveraged the ability of multilingual models to use labeled data in multiple languages in order to improve downstream task performance. This enabled our model to achieve state-of-the-art results on cross-lingual benchmarks while exceeding the per-language performance of monolingual BERT models.

  • We tuned our model’s parameters to offset the fact that using cross-lingual transfer to scale models to more languages also limits the model’s capacity to understand each of those languages. Our parameter changes included upsampling low-resource languages during training and vocabulary construction, generating a larger shared vocabulary, and increasing the overall model capacity up to 550 million parameters.
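
The upsampling of low-resource languages mentioned in the last bullet can be illustrated with a small sketch. The text above does not give the sampling rule, so this assumes one common choice: exponentially smoothed sampling, where a language with data share p_i is sampled with probability proportional to p_i ** alpha. The function name, the value alpha = 0.3, and the corpus sizes are illustrative assumptions, not figures from these notes.

```python
def smoothed_sampling_probs(sizes, alpha=0.3):
    """Map per-language corpus sizes to sampling probabilities.

    Assumed rule: q_i = p_i**alpha / sum_j p_j**alpha, where p_i is the
    language's share of the total data. alpha < 1 flattens the distribution,
    which upsamples low-resource languages; alpha = 1 keeps the raw shares.
    """
    total = sum(sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical corpus sizes in GB, for illustration only.
sizes = {"en": 300.0, "ur": 5.5, "sw": 1.5}
probs = smoothed_sampling_probs(sizes)

total = sum(sizes.values())
for lang, size in sizes.items():
    print(f"{lang}: raw share {size / total:.3f} -> sampling prob {probs[lang]:.3f}")
```

With alpha below 1, the smallest languages receive a larger sampling probability than their raw data share and the largest receive a smaller one. This is the trade-off the bullet describes: more exposure for low-resource languages at some cost to high-resource ones.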

We found that XLM-R performed particularly well for low-resource languages, improving XNLI performance on Swahili and Urdu by 2.3 percent and 5 percent compared with the previous state of the art, which was trained on 15 languages.

Builds on: