Title: Linguini: A benchmark for language-agnostic linguistic reasoning
Authors: Eduardo Sánchez, Belen Alastruey, Christophe Ropers, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà
Published: 18th September 2024 (Wednesday) @ 16:51:02
Link: http://arxiv.org/abs/2409.12126v1
Abstract
We propose a new benchmark to measure a language model’s linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models don’t need previous knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model at 24.05% and the best-performing open model at 8.84%.
Dataset available: https://github.com/facebookresearch/linguini
- Used data from the International Linguistics Olympiad (IOL)
- like the math or CS olympiads for high schoolers
- Built a taxonomy exploring the IOL from 2003 to 2023.
- Excluded all instances whose category appears only once, those where the question includes an image, and those where the response is only an explanation.
- The remaining problems require solving different linguistic reasoning tasks, such as morphosyntactic segmentation (e.g., verb conjugation), morphosemantic alignment (e.g., noun negation), derivation (e.g., finding cognates in related languages), morphophonological segmentation (e.g., pluralization) or graphophonemic transcription (e.g., transcription from one script to another).
- In total, Linguini comprises 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages.
What is the cross-lingual reasoning benchmark Linguini?
We classify the problems included in Linguini into three categories according to their content: sequence transduction, fill-in-blanks and number transliteration. Figure 1 shows one example of each.
Sequence transduction
This category includes sequence production (identified in the benchmark as ‘translation’) and sequence matching (identified as ‘match_letter’). The problems require the model to transform a sequence into a different space (e.g., language, phonetic representation, script) based on a few examples.
In some cases, basic phonetic/phonological knowledge is needed. For example, the model should be able to reason over principles of voicing and their implementation in situations of coarticulation. Some problems require knowing that consonants come in voiced-voiceless pairs, and that one element of the pair may, under certain circumstances, substitute for the other.
Fill-in-blanks
Fill-in-blanks problems are mainly morphophonological derivation tasks, and they are identified in the benchmark as ‘fill_blanks’.
- Models need to understand the morphophonological rules that make it possible to go from the first form of a word to its second form.
- This can usually be applied to verbal (e.g., verb tense conjugation), nominal or adjectival (e.g., case declension) derivation.
- This involves understanding affixation and morpheme-swapping rules, which often interact with phonological rules when different affixes trigger different coarticulation phenomena, or with phonotactic phenomena such as consonant mutations.
Digit/text number transliteration
These problems are identified by the labels ‘text_to_num’ and ‘num_to_text’. In them, models have to produce the digit or text equivalent of a number, respectively.
They require a model’s understanding of morphological analysis and morpheme order.
Example Problem (Figure 2)
A subset of the context of a problem in the Terêna language and the reasoning steps needed to solve it. To correctly answer the question, the model must notice that (a) voiced d mutates to its voiceless pair t (fortition), (b) n is dropped because there is no voiceless alveolar nasal, and (c) an epenthetic vowel has to be added between the mutated consonant and the rest of the word (a root), where the added vowel matches the aperture of the vowel in the root: if the aperture is closed, the epenthetic vowel is the closed front vowel i; if the aperture is mid, it is the mid front vowel e.
Current SOTA Models’ Performance
Main benchmark runs (naturally) include in-context examples to let the models understand what the output should look like.
- significant gap between the best-performing open model (llama-3-70b) and the best-performing proprietary model (claude-3-opus)
- with several tiers of proprietary models above the best open model
- Mixed impact of in-context examples (ICEs) on the performance of the models
- some models benefit from it (such as llama-3-70b-it)
- other models’ performance degrades as the number of examples increases (such as claude-3-opus)
- disparity might be due to the two factors introduced by the ICEs:
- (positive) they set an answer format that could be useful for models that can’t infer it directly from a single natural language instruction and
- (negative) they introduce tokens of languages potentially unrelated to the evaluated problem.
- hypothesis for differential effect of ICEs: for models more capable of instruction following, only the second factor plays a role in the model’s performance
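The k-shot setup can be sketched as simple prompt assembly (the template and field names below are hypothetical; the paper's exact prompt format isn't reproduced in these notes):

```python
def build_prompt(instruction: str, ices: list[dict], problem: dict) -> str:
    """Assemble a k-shot prompt: task instruction, then k solved in-context
    examples (ICEs), then the target problem. Field names are assumptions."""
    blocks = [instruction]
    for ex in ices:
        blocks.append(
            f"Context:\n{ex['context']}\n"
            f"Question: {ex['question']}\nAnswer: {ex['answer']}"
        )
    blocks.append(
        f"Context:\n{problem['context']}\n"
        f"Question: {problem['question']}\nAnswer:"
    )
    return "\n\n".join(blocks)
```

With k = 0 this reduces to the zero-shot prompt; each added ICE both demonstrates the answer format (the positive factor above) and injects tokens from a potentially unrelated language (the negative one).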
Table 1: Exact match results with Linguini for 0-5 ICEs.
Model | 0 | 1 | 2 | 3 | 4 | 5 | Best(↑) |
---|---|---|---|---|---|---|---|
claude-3-opus | 24.05 | 20.58 | 21.36 | 19.91 | 17.00 | 15.10 | 24.05 |
gpt-4o | 14.65 | 12.98 | 13.87 | 12.98 | 13.98 | 13.76 | 14.65 |
gpt-4 | 6.38 | 9.96 | 11.52 | 12.98 | 11.74 | 13.31 | 13.31 |
claude-3-sonnet | 12.30 | 8.95 | 10.29 | 10.40 | 9.28 | 8.72 | 12.30 |
gpt-4-turbo | 8.72 | 9.40 | 9.96 | 7.49 | 8.61 | 9.96 | 9.96 |
llama-3-70b | 8.17 | 5.93 | 7.72 | 8.84 | 8.72 | 6.60 | 8.84 |
llama-3-70b-it | 4.81 | 5.93 | 7.16 | 7.38 | 6.82 | 8.39 | 8.39 |
claude-3-haiku | 6.04 | 7.61 | 4.36 | 6.04 | 6.94 | 7.05 | 7.61 |
llama-2-70b | 4.70 | 2.24 | 2.57 | 3.24 | 3.36 | 3.58 | 3.58 |
mistral-0.1-8x7b | 2.46 | 3.47 | 3.91 | 3.02 | 3.24 | 3.47 | 3.91 |
llama-2-70b-it | 0.89 | 1.45 | 2.80 | 3.02 | 3.13 | 2.80 | 3.13 |
gemma-2b | 0.34 | 2.01 | 1.90 | 1.34 | 1.45 | 1.90 | 2.01 |
qwen-1.5-110b-it | 1.45 | 1.23 | 1.34 | 1.45 | 1.45 | 1.68 | 1.68 |
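The scores in Table 1 are exact match. A minimal sketch of such a scorer, assuming lowercase and whitespace normalization (the paper's exact normalization rules are an assumption here):

```python
def normalize(s: str) -> str:
    # collapse whitespace and lowercase (assumed normalization)
    return " ".join(s.strip().lower().split())

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions that exactly match their reference."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100 * hits / len(predictions)
```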
Ablations
No-Context Prompting: Checking Data Contamination
- Removing the context lets us study the degree to which models rely on the given context to provide correct answers
- We analyze the impact of ignoring the context provided in the benchmark as a proxy for possible data contamination.
- Models that have not been trained on any data in the task language should have near-zero performance when not given the context necessary to solve the task.
Steep drops from zero-shot to no-context (large negative Δ) show the tested LMs are unlikely to have training-data contamination from the IOL.
gemma-2b, llama-2-70b-it and mistral-0.1-8x7b (8x7B sparse MoE) show the smallest deltas.
Model | Zero-shot | No context | Δ |
---|---|---|---|
llama-3-70b-it | 4.81 | 1.12 | -3.69 |
gpt-4-turbo | 8.72 | 1.45 | -7.27 |
gpt-4 | 6.38 | 1.34 | -5.04 |
claude-3-sonnet | 12.30 | 2.01 | -10.29 |
mistral-0.1-8x7b | 2.46 | 1.98 | -0.48 |
claude-3-haiku | 6.04 | 1.12 | -4.92 |
qwen-1.5-110b-it | 1.45 | 0.43 | -1.02 |
gemma-2b | 0.34 | 0.09 | -0.25 |
llama-2-70b | 4.70 | 1.07 | -3.63 |
llama-2-70b-it | 0.89 | 0.56 | -0.33 |
llama-3-70b | 8.17 | 1.67 | -6.50 |
claude-3-opus | 24.05 | 1.23 | -22.82 |
gpt-4o | 14.65 | 1.45 | -13.20 |
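Δ in the table is simply no-context minus zero-shot accuracy (e.g., for claude-3-opus, 1.23 − 24.05 = −22.82); the more negative, the more the model depends on the provided context rather than memorized data:

```python
# Scores copied from the table above (subset of models).
zero_shot = {"claude-3-opus": 24.05, "mistral-0.1-8x7b": 2.46, "gemma-2b": 0.34}
no_context = {"claude-3-opus": 1.23, "mistral-0.1-8x7b": 1.98, "gemma-2b": 0.09}

# Δ = no-context − zero-shot; a large negative value means the model relies
# heavily on the given context (little sign of contamination).
delta = {m: round(no_context[m] - zero_shot[m], 2) for m in zero_shot}
# delta["claude-3-opus"] → -22.82
```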
Transliteration: Check effect of script choice on performance
Used the transliterate library (https://github.com/barseghyanartur/transliterate/) to transliterate questions from Latin into other scripts and check whether model performance is hurt by alternative scripts.
Ideally this wouldn’t be the case as the model should reason abstracting away from the script
(As a human, I know I would find the script distracting - more difficult in e.g. Cyrillic; working memory issues)
We selected the best-performing model (claude-3-opus) and transcribed the best-performing problems (those with accuracy ≥ 75%) into 4 non-Latin alphabetical scripts (Cyrillic, Greek, Georgian and Armenian).
Table 3 shows that the model retains the capacity to perform linguistic reasoning even after changing scripts, which backs the hypothesis of the model relying mainly on the presented context and not on spurious previous knowledge.
Question/critique: But does it though? These numbers are zero in the case of Armenian and lower than the ≥75% Latin baseline you started out with.
- The fact that for 13 out of 16 of the given problems there’s at least one non-Latin script in which the model can solve the problem with greater or equal performance than with the Latin script further supports this claim.
Performance disparity among scripts could be related to either
- the difference in tokenization of different scripts or
- inherent limitations of our transliteration strategy
- e.g., the Armenian script might lack a specific consonant cluster needed to produce the right answer
- (character/bi-character-wise substitution doesn’t take this nuance into account)
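The substitution strategy and its limitation can be sketched as follows (this tiny Latin→Cyrillic table is illustrative only, not the transliterate library's actual mapping):

```python
# Character/bi-character-wise substitution table (illustrative subset).
LAT2CYR = {"sh": "ш", "a": "а", "d": "д", "e": "е", "k": "к",
           "n": "н", "o": "о", "t": "т", "u": "у"}

def translit_latin_to_cyrillic(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2] in LAT2CYR:        # bi-character units first, e.g. "sh"
            out.append(LAT2CYR[text[i:i + 2]]); i += 2
        elif text[i] in LAT2CYR:
            out.append(LAT2CYR[text[i]]); i += 1
        else:
            out.append(text[i]); i += 1     # unmapped characters pass through
    return "".join(out)
```

Because each Latin unit maps to a fixed target character, a cluster with no equivalent in the target script is passed through or mangled rather than adapted, which is exactly the nuance the notes point out.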
Problem code & language | Latn | Cyrl | Grek | Geor | Armn |
---|---|---|---|---|---|
012023010100 (qda-gua) | 75.00 | 100.00 | 75.00 | 100.00 | 0.00 |
012021020500 (zun) | 100.00 | 0.00 | 100.00 | 0.00 | 0.00 |
012012030100 (eus) | 78.57 | 7.14 | 92.86 | 0.00 | 0.00 |
012018020100 (nst-hkn) | 83.33 | 83.33 | 66.67 | 83.33 | 100.00 |
012007050100 (tur) | 75.00 | 75.00 | 50.00 | 37.50 | 50.00 |
012006020100 (cat) | 75.00 | 50.00 | 50.00 | 58.33 | 33.33 |
012003030200 (eus) | 100.00 | 100.00 | 75.00 | 100.00 | 100.00 |
012004010100 (txu) | 100.00 | 100.00 | 66.67 | 66.67 | 33.33 |
012007030100 (kat) | 80.00 | 13.33 | 6.67 | 100.00 | 0.00 |
012009050100 (nci) | 83.33 | 83.33 | 83.33 | 83.33 | 50.00 |
012015020100 (kbd-bes) | 100.00 | 66.67 | 100.00 | 66.67 | 83.33 |
012012050100 (rtm) | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
012011040200 (nci) | 100.00 | 50.00 | 75.00 | 75.00 | 0.00 |
012013010200 (yii) | 100.00 | 100.00 | 100.00 | 75.00 | 100.00 |
012012030200 (eus) | 100.00 | 50.00 | 0.00 | 0.00 | 0.00 |
012012030300 (eus) | 100.00 | 50.00 | 100.00 | 0.00 | 0.00 |
Average | 85.71 | 56.12 | 65.31 | 63.27 | 38.78 |
Figure 3: Example of transliteration of a problem into Cyrillic, Greek, Georgian and Armenian scripts.
Are IOL tasks harder to solve for lower-resource languages?
- used number of speakers and Google searches by language as a proxy of resource quantity
- their conclusion: a language’s accuracy isn’t largely correlated with its likelihood of being included in the training set
Notable exceptions to this trend are a number of very high-resource languages (e.g., cat, eus, kat, tur), which are very likely to be included in the model’s training set, given their institutional status.
“One-book” prompting: Testing whether providing textbooks via OCR boosts performance
- OCR’d textbooks in akz, apu, mnk
- Added these to the context
- helps apu go from 0.0 to 16.67 for example
- OCR introduces errors so maybe not great
Language code | No-context | Context | Textbook | Context + Textbook |
---|---|---|---|---|
akz | 0.00 | 5.13 | 0.00 | 3.85 |
apu | 0.00 | 0.00 | 0.00 | 16.67 |
mnk | 0.00 | 0.00 | 0.00 | 0.00 |
Average | 0.00 | 1.71 | 0.00 | 6.84 |
Table 7: Overview of Grammar Books [tba] (§A D)
Language | Book Title | Citation |
---|---|---|
akz | The Language of the Alabama Indians | Lupardus (1982) |
apu | A Grammar and a Vocabulary of the Ipuriná Language | Polak (1894) |
mnk | The Structure of Faranah-Maninka | Spears (1965) |
Languages of Linguini (§A B)
I sorted by Language Fam. and ISO lang code.
- Some languages have zero reported speakers: Ubykh, Proto-Chamic, Guazacapán Xinka
- others have very few, e.g. in the tens
Lang. Code | Language | No. Speakers | No. Search Results | Language Family | Script |
---|---|---|---|---|---|
gya | Northwest Gbaya | 267000 | 8 | - | Latin |
ady | Adyghe | 425,000 | 2,370 | Abkhaz-Adyghe | Latin |
kbd-bes | Besleney Kabardian | 516000 | 0 | Abkhaz-Adyghe | Latin |
uby | Ubykh | 0 | 1180 | Abkhaz-Adyghe | Latin |
crk | Plains Cree | 34000 | 5290 | Algic | Latin |
mez | Menominee | 2000 | 2240 | Algic | Latin |
mic | Micmac | 11000 | 774 | Algic | Latin |
yur | Yurok | 35 | 2830 | Algic | Latin |
dbl | Dyirbal | 21 | 2900 | Australian | Latin |
wmb | Wambaya | 43 | 112 | Australian | Latin |
yii | Yidiny | 52 | 280 | Australian | Latin |
cam | Cemuhî | 3300 | 6 | Austronesian | Latin |
cjm | Phan Rang Cham | 491448 | 2 | Austronesian | Latin |
cmc-pro | Proto-Chamic | 0 | 267 | Austronesian | Latin |
dhv | Drehu | 13,000 | 216 | Austronesian | Latin |
huq | Tsat | 4500 | 128 | Austronesian | Latin |
kij | Kilivila | 25000 | 271 | Austronesian | Latin |
mmx | Madak | 2600 | 57 | Austronesian | Latin |
mnb | Muna | 270000 | 1020 | Austronesian | Latin |
rtm | Rotuman | 7500 | 4560 | Austronesian | Latin |
tio | Teop | 8000 | 81 | Austronesian | Latin |
txn | West Tarangan | 14,000 | 4 | Austronesian | Latin |
jqr | Jaqaru | 725 | 101 | Aymaran | Latin |
iku | Inuktitut | 39,000 | 12500 | Eskimo-Aleut | Latin |
cat | Catalan | 9200000 | 87100 | Indo-European | Latin |
eng | English Braille | 6000000 | 728 | Indo-European | Latin |
fao | Faroese | 69000 | 23800 | Indo-European | Latin |
roh-eng | Engadine | 60000 | 7 | Indo-European | Latin |
roh-sur | Sursilvan | 60000 | 3 | Indo-European | Latin |
eus | Basque | 936,812 | 71100 | Isolate | Latin |
mzp | Movima | 1000 | 72 | Isolate | Latin |
rkb | Rikbaktsa | 40 | 54 | Isolate | Latin |
sua | Sulka | 3500 | 107 | Isolate | Latin |
zun | Zuni | 9500 | 1610 | Isolate | Latin |
txu | Kayapo | 8600 | 116 | Jean | Latin |
kat | Georgian | 4000000 | 73700 | Kartvelian | Latin |
apu | Apurinã | 2800 | 264 | Maipurean | Latin |
ter | Terêna | 15,000 | 115 | Maipurean | Latin |
tzo | Tzotzil | 550000 | 1160 | Mayan | Latin |
zoc | Copainalá Zoque | 10000 | 10 | Mixe-Zoquean | Latin |
akz | Alabama | 370 | 1,350 | Muskogean | Latin |
bdk | Budukh | 200 | 126 | Nakh-Daghestanian | Latin |
bam | Bambara | 14000000 | 7150 | Niger-Congo | N’Ko |
bom | Birom | 1000000 | 115 | Niger-Congo | Latin |
enn | Engenni | 20000 | 185 | Niger-Congo | Latin |
ikw-agb | Agbirigba | 30 | 1 | Niger-Congo | Latin |
kmb | Kimbundu | 1600000 | 1130 | Niger-Congo | Latin |
mnk | Maninka | 4600000 | 478 | Niger-Congo | N’Ko |
nhu | Nooni | 64000 | 82 | Niger-Congo | Latin |
spp | Supyire | 460000 | 45 | Niger-Congo | Latin |
vai | Vai | 120000 | 1380 | Niger-Congo | Latin |
yor | Yoruba | 47000000 | 1360000 | Niger-Congo | Latin |
laj | Lango | 2100000 | 1490 | Nilo-Saharan | Latin |
xnz | Kunuz Nubian | 35000 | 2 | Nilo-Saharan | Latin |
ian | Iatmül | 46000 | 9 | Papua New Guinea | Latin |
nst-hkn | Hakhun | 10000 | 5 | Sino-Tibetan | Latin |
lkt | Lakhota | 2000 | 25300 | Siouan-Catawban | Latin |
stk | Arammba | 1000 | 36 | South-Central Papuan | Latin |
ape | Mountain Arapesh | 16,000 | 98 | Torricelli | Latin |
abz | Abui | 16,000 | 263 | Trans-New Guinea | Latin |
bef | Bena Bena | 45000 | 107 | Trans-New Guinea | Latin |
ekg | Ekari | 100000 | 141 | Trans-New Guinea | Latin |
mrz | Coastal Marind | 9000 | 100 | Trans-New Guinea | Latin |
nqm | Ndom | 1200 | 154 | Trans-New Guinea | Latin |
ubu | Umbu-Ungu | 32,000 | 90 | Trans-New Guinea | Latin |
yon | Yonggom | 6,000 | 48 | Trans-New Guinea | Latin |
ude | Udihe | 50 | 108 | Tungusic | Latin |
chv | Chuvash | 700000 | 6260 | Turkic | Latin |
tat | Tatar | 7000000 | 79700 | Turkic | Latin |
tur | Turkish | 100000000 | 4130000 | Turkic | Latin |
ngh | Nǀuuki | 1 | 0 | Tuu | Latin |
mns | Mansi | 2229 | 1490 | Uralic | Latin |
nci | Classical Nahuatl | 1500000 | 1690 | Uto-Aztecan | Latin |
qda-gua | Guazacapán Xinka | 0 | 1 | Xincan | Latin |
ykg | Tundra Yukaghir | 320 | 206 | Yukaghir | Latin |