Meta AI Research Topic - No Language Left Behind

Excerpt

Our work aims to break down language barriers across the world, so that everyone can understand and communicate with anyone, no matter what language they speak.


Experience the Tech

Stories Told Through Translation:

books from around the world translated into hundreds of languages

Experience the power of AI translation with Stories Told Through Translation, our demo that uses the latest AI advancements from the No Language Left Behind project. The demo translates books from their languages of origin, such as Indonesian, Somali, and Burmese, into more languages for readers, with hundreds available in the coming months. Through this initiative, NLLB-200 will be the first AI model able to translate literature at this scale.

What Could I Become?

By Nabila Adani

A girl is inspired by a school assignment to think about what she wants to be when she grows up. What will her dreams inspire her to become?

[Read Story](https://nllb.metademolab.com/story/1)

LASER (Language-agnostic sentence representations)

2018

The first successful exploration of massively multilingual sentence representations shared publicly with the NLP community. The encoder creates embeddings to automatically pair up sentences sharing the same meaning in 50 languages.
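The pairing step works by embedding sentences from different languages into a shared vector space and matching each sentence to its nearest neighbor by cosine similarity. A minimal sketch of that matching logic, using toy hand-made vectors (the sentences and embeddings below are illustrative, not real LASER output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" standing in for LASER encoder output (illustrative only).
english = {
    "The cat sleeps.": [0.9, 0.1, 0.0],
    "It is raining.": [0.1, 0.8, 0.2],
}
french = {
    "Il pleut.": [0.2, 0.9, 0.1],
    "Le chat dort.": [0.8, 0.2, 0.1],
}

# Pair each English sentence with the most similar French sentence.
pairs = {
    en: max(french, key=lambda fr: cosine(vec, french[fr]))
    for en, vec in english.items()
}
print(pairs)
# → pairs "The cat sleeps." with "Le chat dort." and "It is raining." with "Il pleut."
```

A real pipeline would encode millions of sentences and use approximate nearest-neighbor search rather than this exhaustive comparison.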

Data Encoders

WMT-19

2019

Facebook AI models outperformed all other models at WMT 2019, using large-scale sampled back-translation, noisy-channel modeling, and data-cleaning techniques to build a strong system.
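Sampled back-translation augments parallel training data by translating monolingual target-side text back into the source language with a reverse-direction model, sampling translations rather than always taking the single best one. A minimal sketch of the data-augmentation loop; `reverse_translate` is a stand-in stub, not a real model:

```python
import random

random.seed(0)

def reverse_translate(sentence):
    """Stub for a target-to-source MT model (illustrative only).
    Sampling among candidate translations diversifies the synthetic data,
    which is the point of *sampled* back-translation."""
    lookup = {
        "Le chat dort.": ["The cat sleeps.", "The cat is sleeping."],
        "Il pleut.": ["It rains.", "It is raining."],
    }
    return random.choice(lookup[sentence])

# Plentiful monolingual text in the target language.
monolingual_french = ["Le chat dort.", "Il pleut."]

# Build synthetic (source, target) pairs to mix into the training corpus.
synthetic_pairs = [(reverse_translate(fr), fr) for fr in monolingual_french]
for src, tgt in synthetic_pairs:
    print(src, "->", tgt)
```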

Model

Flores V1

2019

A benchmarking dataset for MT between English and low-resource languages, introducing a fair and rigorous evaluation process and starting with two languages, Nepali and Sinhala.

Evaluation Dataset

WikiMatrix

2019

The largest extraction of parallel sentences across multiple languages: 135 million parallel Wikipedia sentences mined in 1,620 language pairs for building better translation models.
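Large-scale bitext mining of this kind typically scores candidate pairs with a margin criterion: the cosine similarity of a pair divided by the average similarity of each sentence to its nearest neighbors. This suppresses "hub" sentences that look moderately similar to everything. A toy sketch of ratio-margin scoring (all vectors are illustrative, not real embeddings):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def margin_score(x, y, x_neighbors, y_neighbors):
    """Ratio margin: cos(x, y) divided by the mean cosine of each
    sentence to its nearest neighbors in the other language."""
    denom = (sum(cosine(x, n) for n in x_neighbors) / len(x_neighbors)
             + sum(cosine(y, n) for n in y_neighbors) / len(y_neighbors)) / 2
    return cosine(x, y) / denom

# Toy embeddings (illustrative only).
x = [1.0, 0.1]         # source sentence
y_good = [0.9, 0.2]    # its true translation
y_hub = [0.6, 0.6]     # "hub" vector, moderately close to everything
other = [0.2, 1.0]     # unrelated sentence

good = margin_score(x, y_good, [y_good, y_hub], [x, other])
hub = margin_score(x, y_hub, [y_good, y_hub], [x, other])
print(good, hub)  # the true pair scores above the hub pair
```

Pairs are kept only when their margin score clears a threshold, which filters out noisy matches before training.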

Data Construction

M2M-100

2020

The first single multilingual machine translation model to translate directly between any pair of 100 languages without relying on English data. Trained on 2,200 language directions, 10x more than previous multilingual models.
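For context on those figures: a fully many-to-many system over 100 languages has 100 × 99 = 9,900 possible ordered translation directions, while English-centric data covers only translation to and from English. The arithmetic:

```python
# Ordered translation directions among n languages: n * (n - 1).
def directions(n):
    return n * (n - 1)

print(directions(100))  # 9900 possible directions among 100 languages

# English-centric corpora cover only to/from English: 2 * (n - 1).
print(2 * (100 - 1))    # 198 directions
```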

Model

CCMatrix

2020

The largest dataset of high-quality, web-based bitexts for building better translation models that work with more languages, especially low-resource languages: 4.5 billion parallel sentences in 576 language pairs.

Data Construction

LASER 2

2020

Creates embeddings to automatically pair up sentences sharing the same meaning in 100 languages.

Data Encoders

WMT-21

2021

For the first time, a single multilingual model outperformed the best specially trained bilingual models in 10 of 14 language pairs to win WMT 2021, providing the best translations for both low- and high-resource languages.

Model

FLORES-101

2021

FLORES-101 is a first-of-its-kind many-to-many evaluation dataset covering 101 languages, enabling researchers to rapidly test and improve multilingual translation models like M2M-100.

Evaluation Dataset

NLLB-200

2022

NLLB-200 is a single massively multilingual model that translates directly between any pair of 200 languages.

Model

FLORES 200

2022

Expansion of the FLORES evaluation dataset, now covering 200 languages.

Evaluation Dataset

NLLB-Data-200

2022

Constructed and released training data for 200 languages.

Data Construction

LASER 3

2022

Creates embeddings to automatically pair up sentences sharing the same meaning in 200 languages.

Data Encoders