- The Leaderboard Illusion
- AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
- NoLiMa: Long-Context Evaluation Beyond Literal Matching
- EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges - EnigmaEval Leaderboard
- Humanity's Last Exam
- GAIA: a benchmark for General AI Assistants
- Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
- From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
- Measuring Massive Multitask Language Understanding
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots: https://lmarena.ai/?leaderboard - rankings are computed from pairwise human votes (see the rating sketch after this list)
- Hugging Face mirror of the live leaderboard created and maintained at https://lmarena.ai/leaderboard. - "Please link to the original URL for citation purposes: https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard"
- MT-Bench Browser: https://huggingface.co/spaces/lmsys/mt-bench
- MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
- INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
- ROCStories and the Story Cloze Test - a commonsense reasoning framework for evaluating story understanding; requires a system to choose the correct ending to a four-sentence story.
- Chatbot Arena LLM Leaderboard
- Language Model Comparison - disaggregates model evaluation into quality, speed, and cost
- A Survey on Evaluation of Large Language Models
- Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
- See also WebDev Arena: web.lmarena.ai - AI Battle to build the best website!
- Open LLM Leaderboard Hugging Face - Comparing Large Language Models in an open and reproducible way
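
Chatbot Arena-style leaderboards turn pairwise human votes into a single ranking. Below is a minimal rating sketch of the online Elo update described in the original Arena paper (the live leaderboard has since moved to a Bradley-Terry fit); the battle data, K-factor, and initial rating are illustrative assumptions, not the production pipeline:

```python
from collections import defaultdict

def elo_ratings(battles, k=4, initial=1000):
    """Online Elo over pairwise battles.

    battles: iterable of (model_a, model_b, winner), winner in {"A", "B", "tie"}.
    Online Elo is order-dependent, one reason Arena later adopted Bradley-Terry.
    """
    ratings = defaultdict(lambda: initial)
    for a, b, winner in battles:
        ra, rb = ratings[a], ratings[b]
        expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))  # logistic win probability for A
        score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
        ratings[a] = ra + k * (score_a - expected_a)
        ratings[b] = rb + k * ((1 - score_a) - (1 - expected_a))
    return dict(ratings)

# Toy battles with made-up outcomes.
battles = [("model-x", "model-y", "A"), ("model-x", "model-z", "tie"),
           ("model-z", "model-y", "A")]
print(elo_ratings(battles))
```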
Language Model Benchmarks
See also NLP-progress by Sebastian Ruder - a repository tracking progress in Natural Language Processing (NLP), including the datasets and the current state of the art for the most common NLP tasks.
- MMLU (scored as multiple-choice exact match; see the harness sketch after this list)
- MMLU-Pro
- GPQA
- HellaSwag: Can a Machine Really Finish Your Sentence?
  - Commonsense natural language inference: pick the most plausible sentence completion
- MATH-500
  - Mathematical reasoning (a 500-problem subset of the MATH benchmark)
- LiveCodeBench
  - Coding: code generation, self-repair, code execution, and test-output prediction
- WikiSQL (Zhong et al., 2017)
  - Natural language to SQL queries; see the record sketch after this list
- GLUE
- SuperGLUE
- SAMSum
  - Conversation summarisation
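
Several of the benchmarks above (MMLU, HellaSwag, and the Story Cloze Test in the leaderboard list) share one evaluation protocol: pick one option from a fixed set and report exact-match accuracy against the gold index. A minimal harness sketch; the `MCItem` fields and the pick-first baseline are illustrative assumptions, not any benchmark's official format:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MCItem:
    question: str
    choices: List[str]   # e.g. four options, indexed 0..3
    answer: int          # index of the gold choice

def accuracy(items: List[MCItem], predict: Callable[[MCItem], int]) -> float:
    """Exact-match accuracy, the metric MMLU/HellaSwag-style suites report."""
    correct = sum(predict(item) == item.answer for item in items)
    return correct / len(items)

# Toy items and a trivial baseline predictor (always picks option 0),
# standing in for a real model call.
items = [
    MCItem("2 + 2 = ?", ["4", "3", "5", "22"], 0),
    MCItem("Capital of France?", ["Lyon", "Paris", "Nice", "Lille"], 1),
]
print(accuracy(items, lambda item: 0))  # 0.5
```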
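
WikiSQL pairs a natural-language question over a Wikipedia table with a constrained SQL logical form rather than free-form SQL text. A record sketch of roughly what one example looks like; the field layout follows my reading of the released JSON (a question plus a sql object with sel/agg/conds indices), and the concrete values are invented:

```python
# Hypothetical WikiSQL-style record (values invented for illustration).
record = {
    "table_id": "1-1000181-1",      # which Wikipedia table the question is about
    "question": "Which city hosted the 2012 games?",
    "sql": {
        "sel": 1,                   # index of the column to SELECT
        "agg": 0,                   # aggregation operator index (0 = none)
        "conds": [[0, 0, "2012"]],  # [column index, operator index (0 = '='), value]
    },
}

# The logical form above corresponds to SQL along the lines of:
#   SELECT city FROM table WHERE year = '2012'
print(record["sql"])
```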