- AGIEval
- AlpacaEval || Code: tatsu-lab/alpaca_eval
- v1 is just the tatsu-lab/alpaca_eval repository, i.e. that is where the citation in the length-controlled AlpacaEval v2 paper points
- v2: Length-Controlled AlpacaEval A Simple Way to Debias Automatic Evaluators - based on the abstract, they update AlpacaEval v1 by controlling for the length of the LLM's generated response via a regression (GLM): “We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths.” See the sketch below.
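A minimal sketch of that length-control idea, not the tatsu-lab/alpaca_eval implementation: the per-example preferences and the single length-difference feature below are made-up assumptions for illustration. Fit a GLM (logistic regression) predicting the judge's preference from the length difference, then read off the win rate with the length difference conditioned to zero.

```python
# Hypothetical data: 1 if the judge preferred the model over the baseline,
# plus the standardized length difference (model output - baseline output).
import numpy as np
from sklearn.linear_model import LogisticRegression

prefs = np.array([1, 1, 0, 1, 0, 1, 1, 0])
len_diff = np.array([[0.9], [1.2], [-0.3], [0.8], [-1.1], [1.5], [0.2], [-0.7]])

glm = LogisticRegression().fit(len_diff, prefs)

raw_win_rate = prefs.mean()
# "Length-controlled" win rate: predicted preference at zero length difference.
lc_win_rate = glm.predict_proba([[0.0]])[0, 1]
print(f"raw: {raw_win_rate:.2f}, length-controlled: {lc_win_rate:.2f}")
```

The real method also conditions on model identity and instruction difficulty; this sketch only shows the length term to make the debiasing step concrete.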
- ARC/C
- BBH
- DROP
- GSM8k
- HellaSwag
- IFEval
- MATH
- MMLU
- MMLU-Pro
- NQ
- Safety
- Self-BLEU: higher Self-BLEU scores indicate lower diversity of the generated text - see the sketch after this entry
- introduced in Texygen A Benchmarking Platform for Text Generation Models §2.2 Metrics - Texygen is a benchmarking platform to support research on open-domain text generation models
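A minimal Self-BLEU sketch (an assumed NLTK re-implementation, not the Texygen code): each generation is scored with BLEU against all the other generations as references, and the scores are averaged.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generations, n=4):
    """Average BLEU of each generation against all other generations."""
    weights = tuple(1.0 / n for _ in range(n))
    smooth = SmoothingFunction().method1
    scores = []
    for i, hyp in enumerate(generations):
        refs = [g.split() for j, g in enumerate(generations) if j != i]
        scores.append(sentence_bleu(refs, hyp.split(), weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)

# Near-duplicate outputs push the score up; diverse outputs pull it down.
print(self_bleu(["the cat sat on the mat",
                 "the cat sat on a mat",
                 "quantum computers factor large integers"]))
```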
- PopQA
- TriviaQA
- TruthfulQA
- WinoGrande and Winograd
- See full explanation in note: Winograd and WinoGrande
todo: the evaluation benchmarks listed in alphabetical order at the top are the ones quoted in the figure from "Gemma 3, OLMo 2 32B, and the growing potential of open-source AI", the post from Nathan Lambert - fill them in as and when
- The Leaderboard Illusion
- AI-Slop to AI-Polish Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
- NoLiMa: NoLiMa Long-Context Evaluation Beyond Literal Matching
- EnigmaEval: EnigmaEval A Benchmark of Long Multimodal Reasoning Challenges - EnigmaEval Leaderboard
- Humanity’s Last Exam: Humanity’s Last Exam
- GAIA: GAIA a benchmark for General AI Assistants
- Open-LLM-Leaderboard From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
- From Crowdsourced Data to High-Quality Benchmarks Arena-Hard and BenchBuilder Pipeline
- Measuring Massive Multitask Language Understanding
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots: https://lmarena.ai/?leaderboard
- Hugging Face mirror of the live leaderboard created and maintained at https://lmarena.ai/leaderboard. - “Please link to the original URL for citation purposes: https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard”
- MT-Bench Browser: https://huggingface.co/spaces/lmsys/mt-bench
- MT-Bench-101 A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
- INCLUDE Evaluating Multilingual Language Understanding with Regional Knowledge
- ROCStories and the Story Cloze Test - a commonsense-reasoning framework for evaluating story understanding: a system must choose the correct ending to a four-sentence story (see the scoring sketch below)
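One common way to run this kind of ending-selection task, sketched here as an assumption rather than the benchmark's official protocol: score each candidate ending by its log-likelihood under a causal LM, conditioned on the four-sentence context, and pick the higher-scoring one. The model choice (gpt2) and the example story are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def ending_score(context: str, ending: str) -> float:
    """Sum of log-probabilities of the ending tokens given the story context."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(log_probs[i, full_ids[0, i + 1]].item()
               for i in range(ctx_len - 1, full_ids.shape[1] - 1))

story = ("Ann planted tomatoes in spring. She watered them every day. "
         "By July the vines were heavy with fruit. She invited her neighbours over.")
endings = ["They shared a salad made from her garden.",
           "She threw the ripe tomatoes away unopened."]
print(max(endings, key=lambda e: ending_score(story, e)))
```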
- Chatbot Arena LLM Leaderboard
- Language Model Comparison - disaggregates into quality, speed, and cost
- A Survey on Evaluation of Large Language Models
- Connecting the Dots Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
- See also WebDev Arena: web.lmarena.ai - AI Battle to build the best website!
- Open LLM Leaderboard Hugging Face - Comparing Large Language Models in an open and reproducible way
Language Model Benchmarks
See also NLP-progress by Sebastian Ruder - Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
GPQA
HellaSwag Can a Machine Really Finish Your Sentence
- Commonsense natural language inference (choosing the most plausible sentence completion)
MATH-500
- Reasoning
LiveCodeBench
- Coding - covers code generation plus self-repair, code execution, and test output prediction scenarios
WikiSQL (Zhong et al., 2017)
- NL to SQL queries
GLUE
SuperGLUE
SAMSum
- Conversation summarisation