Title: Linguini: A benchmark for language-agnostic linguistic reasoning
Authors: Eduardo Sánchez, Belen Alastruey, Christophe Ropers, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà
Published: 18th September 2024 (Wednesday) @ 16:51:02
Link: http://arxiv.org/abs/2409.12126v1

Abstract

We propose a new benchmark to measure a language model’s linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models don’t need previous knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model at 24.05% and the best-performing open model at 8.84%.


Dataset available: https://github.com/facebookresearch/linguini


  • Used data from the International Linguistics Olympiad (IOL)
    • like the math or CS olympiads for high schoolers
  • Built a taxonomy exploring the IOL from 2003 to 2023.
  • Excluded all instances whose category appears only once, those where the question includes an image, and those where the response is only an explanation.
  • The remaining problems require solving different linguistic reasoning tasks, such as morphosyntactic segmentation (e.g., verb conjugation), morphosemantic alignment (e.g., noun negation), derivation (e.g., finding cognates in related languages), morphophonological segmentation (e.g., pluralization) or graphophonemic transcription (e.g., transcription from one script to another).
  • In total, Linguini is composed of 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages.

What is the cross-lingual reasoning benchmark Linguini?

We classify the problems included in Linguini into three categories according to their content: sequence transduction, fill-in-blanks and number transliteration. Figure 1 shows one example of each.

Sequence transduction

This category includes sequence production (identified in the benchmark as ‘translation’) and sequence matching (identified as ‘match_letter’). The problems require the model to transform a sequence into a different space (e.g., language, phonetic representation, script) based on a few examples.

In some cases, basic phonetic/phonological knowledge is needed. For example, the model should be able to reason over principles of voicing and their implementation in situations of coarticulation. Some problems require knowing that consonants come in voiced-voiceless pairs, and that under certain circumstances one element of the pair may substitute for the other.
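
As a toy illustration of this kind of knowledge, a minimal sketch in Python (the pair inventory and the word-final devoicing rule are generic assumptions for illustration, not data from any Linguini problem):

```python
# Generic voiced -> voiceless consonant pairs (an illustrative assumption;
# real problems define their own inventories in the given context).
VOICED_TO_VOICELESS = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s"}

def devoice(consonant: str) -> str:
    """Voiceless counterpart of a voiced consonant; identity otherwise."""
    return VOICED_TO_VOICELESS.get(consonant, consonant)

def final_devoice(word: str) -> str:
    """Substitute the word-final consonant with its voiceless pair,
    one circumstance where a pair member stands in for the other."""
    return word[:-1] + devoice(word[-1]) if word else word
```

For instance, final_devoice("hund") yields "hunt", the sort of substitution a model has to infer from the in-context examples.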

Fill-in-blanks

Fill-in-blanks problems are mainly morphophonological derivation tasks, identified in the benchmark as ‘fill_blanks’.

  • Models need to understand the morphophonological rules that make it possible to go from the first form of a word to its second form.
  • This can usually be applied to verbal (e.g., verb tense conjugation), nominal or adjectival (e.g., case declension) derivation.
  • Involves understanding affixation and morpheme-swapping rules, which often come with phonological rules when different affixes trigger different coarticulation phenomena, or phonotactic phenomena such as consonantal mutations.

Digit/text number transliteration

These problems are identified by the labels ‘text_to_num’ and ‘num_to_text’. In them, models have to produce a digit or text equivalent, respectively.

They require a model’s understanding of morphological analysis and morpheme order.
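
To make the morpheme-order point concrete, a sketch with an invented decimal numeral grammar (the digit words and the ‘tens’ morpheme are made up for illustration, not taken from the benchmark):

```python
# Toy mini-language: a numeral is "<digit-word> tens <digit-word>",
# so morpheme order encodes place value (e.g. "two tens three" -> 23).
DIGITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}

def text_to_num(phrase: str) -> int:
    """Compositionally parse a numeral phrase into a digit value."""
    total = 0
    for word in phrase.split():
        if word == "tens":
            total *= 10  # the preceding digit fills the tens place
        else:
            total += DIGITS[word]
    return total
```

Solving a ‘text_to_num’ problem amounts to inducing a grammar like this from the examples in context.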

Example Problem (Figure 2)

A subset of the context of a problem in Terenâ language and the reasoning steps needed to solve it. To correctly answer the question, the model must notice that (a) voiced d mutates to voiceless paired sound t (fortition), (b) n is dropped because there are no voiceless nasal alveolar sounds and (c) an epenthetic vowel has to be added between the mutation consonant and the rest of the word (a root), and that the vowel that gets added matches the aperture of the vowel in the root. If the aperture is closed, the epenthetic vowel is the closed front vowel i; if the aperture is mid, the epenthetic vowel is the mid front vowel e.
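
The three reasoning steps can be sketched as code; the stems below are invented placeholders that follow the described pattern, not actual Terenâ forms:

```python
# Sketch of the reasoning chain: fortition (d -> t), n-drop, and an
# epenthetic front vowel copying the aperture of the root's first vowel.
CLOSED, MID = set("iu"), set("eo")

def mutate(stem: str) -> str:
    """Apply the three steps to an invented stem of shape 'nd' + root."""
    assert stem.startswith("nd")
    root = stem[2:]
    first_vowel = next(c for c in root if c in CLOSED | MID)
    epenthetic = "i" if first_vowel in CLOSED else "e"
    return "t" + epenthetic + root  # n dropped, d devoiced to t
```

So an invented stem "nduko" (closed root vowel) becomes "tiuko", while "ndoko" (mid root vowel) becomes "teoko".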

Current SOTA Models’ Performance

Main benchmark runs are (obviously) with in-context examples, to allow the models to understand what the output should look like.

  • gap between the best-performing open model and the best-performing proprietary model (claude-3-opus, 24.05 best accuracy)
    • with several tiers of proprietary models above the best open model
  • best open model: llama-3-70b (8.84 best accuracy)
  • Mixed impact of in-context examples (ICEs) in the performance of the models
    • some models benefit from it (such as llama-3-70b-it)
    • other models’ performance degrades as the number of examples increases (such as claude-3-opus)
  • disparity might be due to the two factors introduced by the ICEs:
    1. (positive) they set an answer format that could be useful for models that can’t infer it directly from a single natural language instruction and
    2. (negative) they introduce tokens of languages potentially unrelated to the evaluated problem.
  • hypothesis for differential effect of ICEs: for models more capable of instruction following, only the second factor plays a role in the model’s performance
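
A minimal sketch of how such k-shot prompts could be assembled (the Q/A layout is an assumption for illustration, not Linguini’s actual prompt template):

```python
def build_prompt(instruction, examples, question, k):
    """Prepend up to k in-context examples (ICEs) to the target question.
    With k = 0 the model sees only the natural-language instruction,
    which is where factor (1) can hurt models that can't infer the format."""
    shots = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    return "\n\n".join([instruction, *shots, f"Q: {question}\nA:"])
```

Factor (2) enters because the example questions and answers may be in languages unrelated to the evaluated problem.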

Table 1: Exact match results with Linguini for 0-5 ICEs.

| Model | 0 | 1 | 2 | 3 | 4 | 5 | Best (↑) |
|---|---|---|---|---|---|---|---|
| claude-3-opus | 24.05 | 20.58 | 21.36 | 19.91 | 17.00 | 15.12 | 24.05 |
| gpt-4o | 14.65 | 12.98 | 13.87 | 12.98 | 13.98 | 13.76 | 14.65 |
| gpt-4 | 6.38 | 9.96 | 11.52 | 12.98 | 11.74 | 13.31 | 12.98 |
| claude-3-sonnet | 12.30 | 8.95 | 10.29 | 10.40 | 9.28 | 8.72 | 12.30 |
| gpt-4-turbo | 8.72 | 9.40 | 9.96 | 7.49 | 8.61 | 9.96 | 9.96 |
| llama-3-70b | 8.17 | 5.93 | 7.72 | 8.84 | 8.72 | 6.60 | 8.84 |
| llama-3-70b-it | 4.81 | 5.93 | 7.16 | 7.38 | 6.82 | 8.39 | 8.39 |
| claude-3-haiku | 6.04 | 7.61 | 4.36 | 6.04 | 6.94 | 7.05 | 7.61 |
| llama-2-70b | 4.70 | 2.24 | 2.57 | 3.24 | 3.36 | 3.58 | 3.58 |
| mistral-0.1-8x7b | 2.46 | 3.47 | 3.91 | 3.02 | 3.24 | 3.47 | 3.91 |
| llama-2-70b-it | 0.89 | 1.45 | 2.80 | 3.02 | 3.13 | 2.80 | 3.13 |
| gemma-2b | 0.34 | 2.01 | 1.90 | 1.34 | 1.45 | 1.90 | 2.01 |
| qwen-1.5-110b-it | 1.45 | 1.23 | 1.34 | 1.45 | 1.45 | 1.68 | 1.68 |
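
Table 1 reports exact match; a minimal sketch of how such a score could be computed (the whitespace/case normalization here is my assumption, not necessarily the official scorer’s):

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions equal to the reference after
    stripping surrounding whitespace and lowercasing."""
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```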

Ablations

No-Context Prompting: Checking Data Contamination

  • Prompting without the context lets us study the degree to which models rely on the given context to provide correct answers
    • We analyze the impact of removing the context provided in the benchmark as a proxy for possible data contamination.
  • Models that have not been trained on any data in the task language should have near-zero performance when not given the context necessary to solve the task.

Steep drops from zero-shot to no-context performance suggest the tested LMs are unlikely to have training-data contamination from the IOL.

Gemma-2b (-0.25), llama-2-70b-it (-0.33) and mistral-0.1-8x7b (-0.48) show the smallest deltas.

| Model | Zero-shot | No context | Δ |
|---|---|---|---|
| llama-3-70b-it | 4.81 | 1.12 | -3.69 |
| gpt-4-turbo | 8.72 | 1.45 | -7.27 |
| gpt-4 | 6.38 | 1.34 | -5.04 |
| claude-3-sonnet | 12.30 | 2.01 | -10.29 |
| mistral-0.1-8x7b | 2.46 | 1.98 | -0.48 |
| claude-3-haiku | 6.04 | 1.12 | -4.92 |
| qwen-1.5-110b-it | 1.45 | 0.43 | -1.02 |
| gemma-2b | 0.34 | 0.09 | -0.25 |
| llama-2-70b | 4.70 | 1.07 | -3.63 |
| llama-2-70b-it | 0.89 | 0.56 | -0.33 |
| llama-3-70b | 8.17 | 1.67 | -6.50 |
| claude-3-opus | 24.05 | 1.23 | -22.82 |
| gpt-4o | 14.65 | 1.45 | -13.20 |
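
The Δ column is just the no-context score minus the zero-shot score; a one-line sketch, checked against rows of the table above:

```python
def contamination_delta(zero_shot: float, no_context: float) -> float:
    """Drop in accuracy when the problem context is withheld; a large
    negative delta suggests the model is not reciting memorized answers."""
    return round(no_context - zero_shot, 2)
```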

Transliteration: Check effect of script choice on performance

Used the transliterate library (https://github.com/barseghyanartur/transliterate/) to transliterate questions from Latin into other scripts, to check that model performance is not hurt by alternative scripts.

Ideally this wouldn’t matter, as the model should reason while abstracting away from the script.

(As a human, I know I would find the script distracting - more difficult in e.g. Cyrillic; working memory issues)

We selected the best-performing model (claude-3-opus) and transliterated the problems it performs best on (accuracy ≥ 75%) into 4 non-Latin alphabetical scripts (Cyrillic, Greek, Georgian and Armenian).

Table 3 shows that the model retains the capacity to perform linguistic reasoning even after changing scripts, which backs the hypothesis of the model relying mainly on the presented context and not on spurious previous knowledge.

(Critique: but does it, though? These numbers are zero in several cases for Armenian, and lower than the ≥ 75% Latin baseline you started out with.)

  • The fact that for 13 out of 16 of the given problems there’s at least one non-Latin script in which the model can solve the problem with greater or equal performance than with Latin script further supports this claim.

Performance disparity among scripts could be related to either

  • the difference in tokenization of different scripts or
  • inherent limitations of our transliteration strategy
    • the Armenian script might lack a specific consonant cluster that needs to be produced to give the right answer
    • (character/bi-character-wise substitution doesn’t take this nuance into account)

| Problem code & language | Latn | Cyrl | Grek | Geor | Armn |
|---|---|---|---|---|---|
| 012023010100 (qda-gua) | 75.00 | 100.00 | 75.00 | 100.00 | 0.00 |
| 012021020500 (zun) | 100.00 | 0.00 | 100.00 | 0.00 | 0.00 |
| 012012030100 (eus) | 78.57 | 7.14 | 92.86 | 0.00 | 0.00 |
| 012018020100 (nst-hkn) | 83.33 | 83.33 | 66.67 | 83.33 | 100.00 |
| 012007050100 (tur) | 75.00 | 75.00 | 50.00 | 37.50 | 50.00 |
| 012006020100 (cat) | 75.00 | 50.00 | 50.00 | 58.33 | 33.33 |
| 012003030200 (eus) | 100.00 | 100.00 | 75.00 | 100.00 | 100.00 |
| 012004010100 (txu) | 100.00 | 100.00 | 66.67 | 66.67 | 33.33 |
| 012007030100 (kat) | 80.00 | 13.33 | 6.67 | 100.00 | 0.00 |
| 012009050100 (nci) | 83.33 | 83.33 | 83.33 | 83.33 | 50.00 |
| 012015020100 (kbd-bes) | 100.00 | 66.67 | 100.00 | 66.67 | 83.33 |
| 012012050100 (rtm) | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| 012011040200 (nci) | 100.00 | 50.00 | 75.00 | 75.00 | 0.00 |
| 012013010200 (yii) | 100.00 | 100.00 | 100.00 | 75.00 | 100.00 |
| 012012030200 (eus) | 100.00 | 50.00 | 0.00 | 0.00 | 0.00 |
| 012012030300 (eus) | 100.00 | 50.00 | 100.00 | 0.00 | 0.00 |
| Average | 85.71 | 56.12 | 65.31 | 63.27 | 38.78 |
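
A sketch of the character-wise substitution strategy mentioned above (a toy mapping, not the actual transliterate library): because it maps letters one at a time, a cluster with no single target letter is exactly the nuance it cannot capture.

```python
# Toy Latin -> Cyrillic single-character map (illustrative assumption);
# characters without an entry pass through unchanged.
LATIN_TO_CYRILLIC = {"a": "а", "b": "б", "v": "в", "g": "г", "d": "д",
                     "e": "е", "k": "к", "l": "л", "m": "м", "n": "н",
                     "o": "о", "p": "п", "r": "р", "s": "с", "t": "т",
                     "u": "у"}

def to_cyrillic(text: str) -> str:
    """Transliterate character by character, ignoring cluster context."""
    return "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in text)
```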

Figure 3: Example of transliteration of a problem into Cyrillic, Greek, Georgian and Armenian scripts.

Are the IOL tasks harder to solve for lower-resource languages?

  • used number of speakers and number of Google search results per language as proxies for resource quantity
  • their conclusion: a language’s accuracy isn’t largely correlated with its likelihood of being included in the training set

Notable exceptions to this trend are a number of very high-resource languages (e.g., cat, eus, kat, tur), which are very likely to be included in the model’s training set, given their institutional status.
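
One way to quantify the (lack of) correlation is Spearman rank correlation between a resource proxy and per-language accuracy; a stdlib-only sketch using the no-ties formula (the paper’s exact correlation method isn’t recorded in these notes):

```python
def _ranks(values):
    """Rank positions of each element (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for pos, i in enumerate(order):
        ranks[i] = pos
    return ranks

def spearman(xs, ys):
    """Spearman rho: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(_ranks(xs), _ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A rho near zero over (speaker count, accuracy) pairs would match the paper’s conclusion.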

“One-book” prompting: Testing if providing textbooks via OCR boosts performance

  • OCR’d textbooks in akz, apu, mnk
  • Added these to the context
  • helps apu go from 0.0 to 16.67 for example
  • OCR introduces errors so maybe not great

| Language code | No-context | Context | Textbook | Context + Textbook |
|---|---|---|---|---|
| akz | 0.00 | 5.13 | 0.00 | 3.85 |
| apu | 0.00 | 0.00 | 0.00 | 16.67 |
| mnk | 0.00 | 0.00 | 0.00 | 0.00 |
| Average | 0.00 | 1.71 | 0.00 | 6.84 |

Table 7: Overview of Grammar Books [tba] (§A D)

| Language | Book Title | Citation |
|---|---|---|
| akz | The Language of the Alabama Indians | Lupardus (1982) |
| apu | A Grammar and a Vocabulary of the Ipuriná Language | Polak (1894) |
| mnk | The Structure of Faranah-Maninka | Spears (1965) |

Languages of Linguini (§A B)

I sorted by Language Fam. and ISO lang code.

  • Some speaker counts are zero: Ubykh, Proto-Chamic, Guazacapán Xinka
    • others are really small, e.g. in the tens

| Lang. Code | Language | No. Speakers | No. Search Results | Language Family | Script |
|---|---|---|---|---|---|
| gya | Northwest Gbaya | 267,000 | 8 | - | Latin |
| ady | Adyghe | 425,000 | 2,370 | Abkhaz-Adyghe | Latin |
| kbd-bes | Besleney Kabardian | 516,000 | 0 | Abkhaz-Adyghe | Latin |
| uby | Ubykh | 0 | 1,180 | Abkhaz-Adyghe | Latin |
| crk | Plains Cree | 34,000 | 5,290 | Algic | Latin |
| mez | Menominee | 2,000 | 2,240 | Algic | Latin |
| mic | Micmac | 11,000 | 774 | Algic | Latin |
| yur | Yurok | 35 | 2,830 | Algic | Latin |
| dbl | Dyirbal | 21 | 2,900 | Australian | Latin |
| wmb | Wambaya | 43 | 112 | Australian | Latin |
| yii | Yidiny | 52 | 280 | Australian | Latin |
| cam | Cemuhî | 3,300 | 6 | Austronesian | Latin |
| cjm | Phan Rang Cham | 491,448 | 2 | Austronesian | Latin |
| cmc-pro | Proto-Chamic | 0 | 267 | Austronesian | Latin |
| dhv | Drehu | 13,000 | 216 | Austronesian | Latin |
| huq | Tsat | 4,500 | 128 | Austronesian | Latin |
| kij | Kilivila | 25,000 | 271 | Austronesian | Latin |
| mmx | Madak | 2,600 | 57 | Austronesian | Latin |
| mnb | Muna | 270,000 | 1,020 | Austronesian | Latin |
| rtm | Rotuman | 7,500 | 4,560 | Austronesian | Latin |
| tio | Teop | 8,000 | 81 | Austronesian | Latin |
| txn | West Tarangan | 14,000 | 4 | Austronesian | Latin |
| jqr | Jaqaru | 725 | 101 | Aymaran | Latin |
| iku | Inuktitut | 39,000 | 12,500 | Eskimo-Aleut | Latin |
| cat | Catalan | 9,200,000 | 87,100 | Indo-European | Latin |
| eng | English Braille | 6,000,000 | 728 | Indo-European | Latin |
| fao | Faroese | 69,000 | 23,800 | Indo-European | Latin |
| roh-eng | Engadine | 60,000 | 7 | Indo-European | Latin |
| roh-sur | Sursilvan | 60,000 | 3 | Indo-European | Latin |
| eus | Basque | 936,812 | 71,100 | Isolate | Latin |
| mzp | Movima | 1,000 | 72 | Isolate | Latin |
| rkb | Rikbaktsa | 40 | 54 | Isolate | Latin |
| sua | Sulka | 3,500 | 107 | Isolate | Latin |
| zun | Zuni | 9,500 | 1,610 | Isolate | Latin |
| txu | Kayapo | 8,600 | 116 | Jean | Latin |
| kat | Georgian | 4,000,000 | 73,700 | Kartvelian | Latin |
| apu | Apurinã | 2,800 | 264 | Maipurean | Latin |
| ter | Terêna | 15,000 | 115 | Maipurean | Latin |
| tzo | Tzotzil | 550,000 | 1,160 | Mayan | Latin |
| zoc | Copainalá Zoque | 10,000 | 10 | Mixe-Zoquean | Latin |
| akz | Alabama | 370 | 1,350 | Muskogean | Latin |
| bdk | Budukh | 200 | 126 | Nakh-Daghestanian | Latin |
| bam | Bambara | 14,000,000 | 7,150 | Niger-Congo | N’Ko |
| bom | Birom | 1,000,000 | 115 | Niger-Congo | Latin |
| enn | Engenni | 20,000 | 185 | Niger-Congo | Latin |
| ikw-agb | Agbirigba | 30 | 1 | Niger-Congo | Latin |
| kmb | Kimbundu | 1,600,000 | 1,130 | Niger-Congo | Latin |
| mnk | Maninka | 4,600,000 | 478 | Niger-Congo | N’Ko |
| nhu | Nooni | 64,000 | 82 | Niger-Congo | Latin |
| spp | Supyire | 460,000 | 45 | Niger-Congo | Latin |
| vai | Vai | 120,000 | 1,380 | Niger-Congo | Latin |
| yor | Yoruba | 47,000,000 | 1,360,000 | Niger-Congo | Latin |
| laj | Lango | 2,100,000 | 1,490 | Nilo-Saharan | Latin |
| xnz | Kunuz Nubian | 35,000 | 2 | Nilo-Saharan | Latin |
| ian | Iatmül | 46,000 | 9 | Papua New Guinea | Latin |
| nst-hkn | Hakhun | 10,000 | 5 | Sino-Tibetan | Latin |
| lkt | Lakhota | 2,000 | 25,300 | Siouan-Catawban | Latin |
| stk | Arammba | 1,000 | 36 | South-Central Papuan | Latin |
| abz | Mountain Arapesh | 16,000 | 98 | Torricelli | Latin |
| abz | Abui | 16,000 | 263 | Trans-New Guinea | Latin |
| bef | Bena Bena | 45,000 | 107 | Trans-New Guinea | Latin |
| ekg | Ekari | 100,000 | 141 | Trans-New Guinea | Latin |
| mrz | Coastal Marind | 9,000 | 100 | Trans-New Guinea | Latin |
| nqm | Ndom | 1,200 | 154 | Trans-New Guinea | Latin |
| ubu | Umbu-Ungu | 32,000 | 90 | Trans-New Guinea | Latin |
| yon | Yonggom | 6,000 | 48 | Trans-New Guinea | Latin |
| ude | Udihe | 50 | 108 | Tungusic | Latin |
| chv | Chuvash | 700,000 | 6,260 | Turkic | Latin |
| tat | Tatar | 7,000,000 | 79,700 | Turkic | Latin |
| tur | Turkish | 100,000,000 | 4,130,000 | Turkic | Latin |
| ngh | N\|uuki | 1 | 0 | Tuu | Latin |
| mns | Mansi | 2,229 | 1,490 | Uralic | Latin |
| nci | Classical Nahuatl | 1,500,000 | 1,690 | Uto-Aztecan | Latin |
| qda-gua | Guazacapán Xinka | 0 | 1 | Xincan | Latin |
| ykg | Tundra Yukaghir | 320 | 206 | Yukaghir | Latin |