Title: Linguini: A benchmark for language-agnostic linguistic reasoning
Authors: Eduardo Sánchez, Belen Alastruey, Christophe Ropers, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà
Published: 18th September 2024 (Wednesday) @ 16:51:02
Link: http://arxiv.org/abs/2409.12126v1
Abstract
We propose a new benchmark to measure a language model’s linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models don’t need previous knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model at 24.05% and the best-performing open model at 8.84%.
Dataset available: https://github.com/facebookresearch/linguini
- Used data from the International Linguistics Olympiad (IOL)
- like the math or CS olympiads for high schoolers
- Built a taxonomy exploring the IOL from 2003 to 2023.
- Excluded all instances whose category appears only once, those where the question includes an image, and those where the response is only an explanation.
- The remaining problems require solving different linguistic reasoning tasks, such as morphosyntactic segmentation (e.g., verb conjugation), morphosemantic alignment (e.g., noun negation), derivation (e.g., finding cognates in related languages), morphophonological segmentation (e.g., pluralization) or graphophonemic transcription (e.g., transcription from one script to another).
- In total, Linguini comprises 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages.
What is the cross-lingual reasoning benchmark Linguini?
We classify the problems included in Linguini into three categories according to their content: sequence transduction, fill-in-blanks and number transliteration. Figure 1 shows one example of each.
Sequence transduction
This category includes sequence production (identified in the benchmark as ‘translation’) and sequence matching (identified as ‘match_letter’). The problems require the model to transform a sequence into a different space (e.g., language, phonetic representation, script) based on a few examples.
In some cases, basic phonetic/phonological knowledge is needed. For example, the model should be able to reason over principles of voicing and their implementation in situations of coarticulation. Some problems require knowing that consonants come in voiced-voiceless pairs, and that one element of the pair may, under certain circumstances, substitute for the other.
Fill-in-blanks
Fill-in-blanks problems are mainly morphophonological derivation tasks, and they are identified in the benchmark as ‘fill_blanks’.
- Models need to understand the morphophonological rules that make it possible to go from the first form of a word to its second form.
- This can usually be applied to verbal (e.g., verb tense conjugation), nominal or adjectival (e.g., case declension) derivation.
- This involves understanding affixation and morpheme-swapping rules, which often interact with phonological rules when different affixes trigger different coarticulation phenomena, or with phonotactic phenomena such as consonant mutations.
Digit/text number transliteration
These problems are identified by the labels ‘text_to_num’ and ‘num_to_text’. In them, models have to produce the digit or text equivalent of a number, respectively.
They require a model’s understanding of morphological analysis and morpheme order.
Example Problem (Figure 2)
A subset of the context of a problem in the Terêna language and the reasoning steps needed to solve it. To correctly answer the question, the model must notice that (a) voiced d mutates to its voiceless pair t (fortition), (b) n is dropped because there is no voiceless alveolar nasal, and (c) an epenthetic vowel has to be added between the mutated consonant and the rest of the word (a root), where the added vowel matches the aperture of the vowel in the root: if the aperture is closed, the epenthetic vowel is the closed front vowel i; if the aperture is mid, it is the mid front vowel e.
Current SOTA Models’ Performance
Main benchmark runs (naturally) include in-context examples to let the models understand what the output should look like.
- significant gap between the best-performing open model (llama-3-70b) and the best-performing proprietary model (claude-3-opus)
- with several tiers of proprietary models above the best open model
- Mixed impact of in-context examples (ICEs) on the performance of the models
- some models benefit from it (such as llama-3-70b-it)
- other models’ performance degrades as the number of examples increases (such as claude-3-opus)
- disparity might be due to the two factors introduced by the ICEs:
- (positive) they set an answer format that could be useful for models that can’t infer it directly from a single natural language instruction and
- (negative) they introduce tokens of languages potentially unrelated to the evaluated problem.
- hypothesis for differential effect of ICEs: for models more capable of instruction following, only the second factor plays a role in the model’s performance
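The k-shot setup can be sketched as simple prompt assembly (the template and field names below are hypothetical; the paper's exact prompt format isn't reproduced in these notes):

```python
def build_prompt(instruction: str, ices: list[dict], problem: dict) -> str:
    """Assemble a k-shot prompt: task instruction, then k solved in-context
    examples (ICEs), then the target problem. Field names are assumptions."""
    blocks = [instruction]
    for ex in ices:
        blocks.append(
            f"Context:\n{ex['context']}\n"
            f"Question: {ex['question']}\nAnswer: {ex['answer']}"
        )
    blocks.append(
        f"Context:\n{problem['context']}\n"
        f"Question: {problem['question']}\nAnswer:"
    )
    return "\n\n".join(blocks)
```

With k = 0 this reduces to the zero-shot prompt; each added ICE both demonstrates the answer format (the positive factor above) and injects tokens from a potentially unrelated language (the negative one).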
Table 1: Exact match results with Linguini for 0-5 ICEs.
Model | 0 | 1 | 2 | 3 | 4 | 5 | Best(↑) |
---|---|---|---|---|---|---|---|
claude-3-opus | 24.05 | 20.58 | 21.36 | 19.91 | 17.00 | 15.10 | 24.05 |
gpt-4o | 14.65 | 12.98 | 13.87 | 12.98 | 13.98 | 13.76 | 14.65 |
gpt-4 | 6.38 | 9.96 | 11.52 | 12.98 | 11.74 | 13.31 | 13.31 |
claude-3-sonnet | 12.30 | 8.95 | 10.29 | 10.40 | 9.28 | 8.72 | 12.30 |
gpt-4-turbo | 8.72 | 9.40 | 9.96 | 7.49 | 8.61 | 9.96 | 9.96 |
llama-3-70b | 8.17 | 5.93 | 7.72 | 8.84 | 8.72 | 6.60 | 8.84 |
llama-3-70b-it | 4.81 | 5.93 | 7.16 | 7.38 | 6.82 | 8.39 | 8.39 |
claude-3-haiku | 6.04 | 7.61 | 4.36 | 6.04 | 6.94 | 7.05 | 7.61 |
llama-2-70b | 4.70 | 2.24 | 2.57 | 3.24 | 3.36 | 3.58 | 3.58 |
mistral-0.1-8x7b | 2.46 | 3.47 | 3.91 | 3.02 | 3.24 | 3.47 | 3.91 |
llama-2-70b-it | 0.89 | 1.45 | 2.80 | 3.02 | 3.13 | 2.80 | 3.13 |
gemma-2b | 0.34 | 2.01 | 1.90 | 1.34 | 1.45 | 1.90 | 2.01 |
qwen-1.5-110b-it | 1.45 | 1.23 | 1.34 | 1.45 | 1.45 | 1.68 | 1.68 |
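The scores in Table 1 are exact match. A minimal sketch of such a scorer, assuming lowercase and whitespace normalization (the paper's exact normalization rules are an assumption here):

```python
def normalize(s: str) -> str:
    # collapse whitespace and lowercase (assumed normalization)
    return " ".join(s.strip().lower().split())

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions that exactly match their reference."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100 * hits / len(predictions)
```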
Ablations
No-Context Prompting: Checking Data Contamination
- Removing the context lets us study the degree to which models rely on the given context to provide correct answers
- We analyze the impact of ignoring the context provided in the benchmark as a proxy for possible data contamination.
- Models that have not been trained on any data in the task language should have near-zero performance when not given the context necessary to solve the task.
Steep drops from zero-shot to no-context (large negative Δ) show the tested LMs are unlikely to have training-data contamination from the IOL.
gemma-2b, llama-2-70b-it and mistral-0.1-8x7b (8x7B sparse MoE) show the smallest deltas.
Model | Zero-shot | No context | Δ |
---|---|---|---|
llama-3-70b-it | 4.81 | 1.12 | -3.69 |
gpt-4-turbo | 8.72 | 1.45 | -7.27 |
gpt-4 | 6.38 | 1.34 | -5.04 |
claude-3-sonnet | 12.30 | 2.01 | -10.29 |
mistral-0.1-8x7b | 2.46 | 1.98 | -0.48 |
claude-3-haiku | 6.04 | 1.12 | -4.92 |
qwen-1.5-110b-it | 1.45 | 0.43 | -1.02 |
gemma-2b | 0.34 | 0.09 | -0.25 |
llama-2-70b | 4.70 | 1.07 | -3.63 |
llama-2-70b-it | 0.89 | 0.56 | -0.33 |
llama-3-70b | 8.17 | 1.67 | -6.50 |
claude-3-opus | 24.05 | 1.23 | -22.82 |
gpt-4o | 14.65 | 1.45 | -13.20 |
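Δ in the table is simply no-context minus zero-shot accuracy (e.g., for claude-3-opus, 1.23 − 24.05 = −22.82); the more negative, the more the model depends on the provided context rather than memorized data:

```python
# Scores copied from the table above (subset of models).
zero_shot = {"claude-3-opus": 24.05, "mistral-0.1-8x7b": 2.46, "gemma-2b": 0.34}
no_context = {"claude-3-opus": 1.23, "mistral-0.1-8x7b": 1.98, "gemma-2b": 0.09}

# Δ = no-context − zero-shot; a large negative value means the model relies
# heavily on the given context (little sign of contamination).
delta = {m: round(no_context[m] - zero_shot[m], 2) for m in zero_shot}
# delta["claude-3-opus"] → -22.82
```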
Transliteration: Check effect of script choice on performance
Used the transliterate library (https://github.com/barseghyanartur/transliterate/) to transliterate questions from Latin into other scripts and check whether model performance is hurt by alternative scripts.
Ideally this wouldn’t be the case as the model should reason abstracting away from the script
(As a human, I know I would find the script distracting - more difficult in e.g. Cyrillic; working memory issues)
We selected the best-performing model (claude-3-opus) and transcribed the best-performing problems (those with accuracy ≥ 75%) into 4 non-Latin alphabetical scripts (Cyrillic, Greek, Georgian and Armenian).
Table 3 shows that the model retains the capacity to perform linguistic reasoning even after changing scripts, which backs the hypothesis of the model relying mainly on the presented context and not on spurious previous knowledge.
Question/critique: But does it though? These numbers are zero in the case of Armenian and lower than the ≥75% Latin baseline you started out with.
- The fact that for 13 out of 16 of the given problems there’s at least one non-Latin script in which the model can solve the problem with greater or equal performance than with the Latin script further supports this claim.
Performance disparity among scripts could be related to either
- the difference in tokenization of different scripts or
- inherent limitations of our transliteration strategy
- e.g., the Armenian script might lack a specific consonant cluster needed to produce the right answer
- (character/bi-character-wise substitution doesn’t take this nuance into account)
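The substitution strategy and its limitation can be sketched as follows (this tiny Latin→Cyrillic table is illustrative only, not the transliterate library's actual mapping):

```python
# Character/bi-character-wise substitution table (illustrative subset).
LAT2CYR = {"sh": "ш", "a": "а", "d": "д", "e": "е", "k": "к",
           "n": "н", "o": "о", "t": "т", "u": "у"}

def translit_latin_to_cyrillic(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2] in LAT2CYR:        # bi-character units first, e.g. "sh"
            out.append(LAT2CYR[text[i:i + 2]]); i += 2
        elif text[i] in LAT2CYR:
            out.append(LAT2CYR[text[i]]); i += 1
        else:
            out.append(text[i]); i += 1     # unmapped characters pass through
    return "".join(out)
```

Because each Latin unit maps to a fixed target character, a cluster with no equivalent in the target script is passed through or mangled rather than adapted, which is exactly the nuance the notes point out.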
Problem code & language | Latn | Cyrl | Grek | Geor | Armn |
---|---|---|---|---|---|
012023010100 (qda-gua) | 75.00 | 100.00 | 75.00 | 100.00 | 0.00 |
012021020500 (zun) | 100.00 | 0.00 | 100.00 | 0.00 | 0.00 |
012012030100 (eus) | 78.57 | 7.14 | 92.86 | 0.00 | 0.00 |
012018020100 (nst-hkn) | 83.33 | 83.33 | 66.67 | 83.33 | 100.00 |
012007050100 (tur) | 75.00 | 75.00 | 50.00 | 37.50 | 50.00 |
012006020100 (cat) | 75.00 | 50.00 | 50.00 | 58.33 | 33.33 |
012003030200 (eus) | 100.00 | 100.00 | 75.00 | 100.00 | 100.00 |
012004010100 (txu) | 100.00 | 100.00 | 66.67 | 66.67 | 33.33 |
012007030100 (kat) | 80.00 | 13.33 | 6.67 | 100.00 | 0.00 |
012009050100 (nci) | 83.33 | 83.33 | 83.33 | 83.33 | 50.00 |
012015020100 (kbd-bes) | 100.00 | 66.67 | 100.00 | 66.67 | 83.33 |
012012050100 (rtm) | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
012011040200 (nci) | 100.00 | 50.00 | 75.00 | 75.00 | 0.00 |
012013010200 (yii) | 100.00 | 100.00 | 100.00 | 75.00 | 100.00 |
012012030200 (eus) | 100.00 | 50.00 | 0.00 | 0.00 | 0.00 |
012012030300 (eus) | 100.00 | 50.00 | 100.00 | 0.00 | 0.00 |
Average | 85.71 | 56.12 | 65.31 | 63.27 | 38.78 |
Figure 3: Example of transliteration of a problem into Cyrillic, Greek, Georgian and Armenian scripts.
Are IOL tasks harder to solve for lower-resource languages?
- used number of speakers and Google searches by language as a proxy of resource quantity
- their conclusion: a language’s accuracy isn’t largely correlated with its likelihood of being included in the training set
Notable exceptions to this trend are a number of very high-resource languages (e.g., cat, eus, kat, tur), which are very likely to be included in the model’s training set, given their institutional status.
“One-book” prompting: Testing whether providing textbooks via OCR boosts performance
- OCR’d textbooks in akz, apu, mnk
- Added these to the context
- helps apu go from 0.0 to 16.67 for example
- OCR introduces errors so maybe not great
Language code | No-context | Context | Textbook | Context + Textbook |
---|---|---|---|---|
akz | 0.00 | 5.13 | 0.00 | 3.85 |
apu | 0.00 | 0.00 | 0.00 | 16.67 |
mnk | 0.00 | 0.00 | 0.00 | 0.00 |
Average | 0.00 | 1.71 | 0.00 | 6.84 |
Table 7: Overview of Grammar Books [tba] (§A D)
Language | Book Title | Citation |
---|---|---|
akz | The Language of the Alabama Indians | Lupardus (1982) |
apu | A Grammar and a Vocabulary of the Ipuriná Language | Polak (1894) |
mnk | The Structure of Faranah-Maninka | Spears (1965) |
Languages of Linguini (§A B)
I sorted by Language Fam. and ISO lang code.
- Some languages have zero reported speakers: Ubykh, Proto-Chamic, Guazacapán Xinka
- others have very few, e.g. in the tens
Lang. Code | Language | No. Speakers | No. Search Results | Language Family | Script |
---|---|---|---|---|---|
gya | Northwest Gbaya | 267000 | 8 | - | Latin |
ady | Adyghe | 425,000 | 2,370 | Abkhaz-Adyghe | Latin |
kbd-bes | Besleney Kabardian | 516000 | 0 | Abkhaz-Adyghe | Latin |
uby | Ubykh | 0 | 1180 | Abkhaz-Adyghe | Latin |
crk | Plains Cree | 34000 | 5290 | Algic | Latin |
mez | Menominee | 2000 | 2240 | Algic | Latin |
mic | Micmac | 11000 | 774 | Algic | Latin |
yur | Yurok | 35 | 2830 | Algic | Latin |
dbl | Dyirbal | 21 | 2900 | Australian | Latin |
wmb | Wambaya | 43 | 112 | Australian | Latin |
yii | Yidiny | 52 | 280 | Australian | Latin |
cam | Cemuhî | 3300 | 6 | Austronesian | Latin |
cjm | Phan Rang Cham | 491448 | 2 | Austronesian | Latin |
cmc-pro | Proto-Chamic | 0 | 267 | Austronesian | Latin |
dhv | Drehu | 13,000 | 216 | Austronesian | Latin |
huq | Tsat | 4500 | 128 | Austronesian | Latin |
kij | Kilivila | 25000 | 271 | Austronesian | Latin |
mmx | Madak | 2600 | 57 | Austronesian | Latin |
mnb | Muna | 270000 | 1020 | Austronesian | Latin |
rtm | Rotuman | 7500 | 4560 | Austronesian | Latin |
tio | Teop | 8000 | 81 | Austronesian | Latin |
txn | West Tarangan | 14,000 | 4 | Austronesian | Latin |
jqr | Jaqaru | 725 | 101 | Aymaran | Latin |
iku | Inuktitut | 39,000 | 12500 | Eskimo-Aleut | Latin |
cat | Catalan | 9200000 | 87100 | Indo-European | Latin |
eng | English Braille | 6000000 | 728 | Indo-European | Latin |
fao | Faroese | 69000 | 23800 | Indo-European | Latin |
roh-eng | Engadine | 60000 | 7 | Indo-European | Latin |
roh-sur | Sursilvan | 60000 | 3 | Indo-European | Latin |
eus | Basque | 936,812 | 71100 | Isolate | Latin |
mzp | Movima | 1000 | 72 | Isolate | Latin |
rkb | Rikbaktsa | 40 | 54 | Isolate | Latin |
sua | Sulka | 3500 | 107 | Isolate | Latin |
zun | Zuni | 9500 | 1610 | Isolate | Latin |
txu | Kayapo | 8600 | 116 | Jean | Latin |
kat | Georgian | 4000000 | 73700 | Kartvelian | Latin |
apu | Apurinã | 2800 | 264 | Maipurean | Latin |
ter | Terêna | 15,000 | 115 | Maipurean | Latin |
tzo | Tzotzil | 550000 | 1160 | Mayan | Latin |
zoc | Copainalá Zoque | 10000 | 10 | Mixe-Zoquean | Latin |
akz | Alabama | 370 | 1,350 | Muskogean | Latin |
bdk | Budukh | 200 | 126 | Nakh-Daghestanian | Latin |
bam | Bambara | 14000000 | 7150 | Niger-Congo | N’Ko |
bom | Birom | 1000000 | 115 | Niger-Congo | Latin |
enn | Engenni | 20000 | 185 | Niger-Congo | Latin |
ikw-agb | Agbirigba | 30 | 1 | Niger-Congo | Latin |
kmb | Kimbundu | 1600000 | 1130 | Niger-Congo | Latin |
mnk | Maninka | 4600000 | 478 | Niger-Congo | N’Ko |
nhu | Nooni | 64000 | 82 | Niger-Congo | Latin |
spp | Supyire | 460000 | 45 | Niger-Congo | Latin |
vai | Vai | 120000 | 1380 | Niger-Congo | Latin |
yor | Yoruba | 47000000 | 1360000 | Niger-Congo | Latin |
laj | Lango | 2100000 | 1490 | Nilo-Saharan | Latin |
xnz | Kunuz Nubian | 35000 | 2 | Nilo-Saharan | Latin |
ian | Iatmül | 46000 | 9 | Papua New Guinea | Latin |
nst-hkn | Hakhun | 10000 | 5 | Sino-Tibetan | Latin |
lkt | Lakhota | 2000 | 25300 | Siouan-Catawban | Latin |
stk | Arammba | 1000 | 36 | South-Central Papuan | Latin |
ape | Mountain Arapesh | 16,000 | 98 | Torricelli | Latin |
abz | Abui | 16,000 | 263 | Trans-New Guinea | Latin |
bef | Bena Bena | 45000 | 107 | Trans-New Guinea | Latin |
ekg | Ekari | 100000 | 141 | Trans-New Guinea | Latin |
mrz | Coastal Marind | 9000 | 100 | Trans-New Guinea | Latin |
nqm | Ndom | 1200 | 154 | Trans-New Guinea | Latin |
ubu | Umbu-Ungu | 32,000 | 90 | Trans-New Guinea | Latin |
yon | Yonggom | 6,000 | 48 | Trans-New Guinea | Latin |
ude | Udihe | 50 | 108 | Tungusic | Latin |
chv | Chuvash | 700000 | 6260 | Turkic | Latin |
tat | Tatar | 7000000 | 79700 | Turkic | Latin |
tur | Turkish | 100000000 | 4130000 | Turkic | Latin |
ngh | Nǀuuki | 1 | 0 | Tuu | Latin |
mns | Mansi | 2229 | 1490 | Uralic | Latin |
nci | Classical Nahuatl | 1500000 | 1690 | Uto-Aztecan | Latin |
qda-gua | Guazacapán Xinka | 0 | 1 | Xincan | Latin |
ykg | Tundra Yukaghir | 320 | 206 | Yukaghir | Latin |