- AGIEval
- AlpacaEval || Code: tatsu-lab/alpaca_eval
- v1 is just the tatsu-lab/alpaca_eval repository, i.e. that is where the citation in the length-controlled AlpacaEval v2 paper points
- v2: Length-Controlled AlpacaEval A Simple Way to Debias Automatic Evaluators - based on the abstract, they update AlpacaEval v1 by controlling for the length of the LLM's generated response via a regression (GLM): “We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths.” See the sketch below.
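A minimal sketch of that length-control idea, not the tatsu-lab/alpaca_eval implementation: the per-example preferences and the single length-difference feature below are made-up assumptions for illustration. Fit a GLM (logistic regression) predicting the judge's preference from the length difference, then read off the win rate with the length difference conditioned to zero.

```python
# Hypothetical data: 1 if the judge preferred the model over the baseline,
# plus the standardized length difference (model output - baseline output).
import numpy as np
from sklearn.linear_model import LogisticRegression

prefs = np.array([1, 1, 0, 1, 0, 1, 1, 0])
len_diff = np.array([[0.9], [1.2], [-0.3], [0.8], [-1.1], [1.5], [0.2], [-0.7]])

glm = LogisticRegression().fit(len_diff, prefs)

raw_win_rate = prefs.mean()
# "Length-controlled" win rate: predicted preference at zero length difference.
lc_win_rate = glm.predict_proba([[0.0]])[0, 1]
print(f"raw: {raw_win_rate:.2f}, length-controlled: {lc_win_rate:.2f}")
```

The real method also conditions on model identity and instruction difficulty; this sketch only shows the length term to make the debiasing step concrete.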
- ARC/C
- BBH
- DROP
- GSM8k
- HellaSwag
- IFEval
- MATH
- MMLU
- MMLU-Pro
- NQ
- Safety
- Self-BLEU: higher Self-BLEU scores indicate lower diversity of the generated text - see the sketch after this entry
- introduced in Texygen A Benchmarking Platform for Text Generation Models §2.2 Metrics - Texygen is a benchmarking platform to support research on open-domain text generation models
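A minimal Self-BLEU sketch (an assumed NLTK re-implementation, not the Texygen code): each generation is scored with BLEU against all the other generations as references, and the scores are averaged.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generations, n=4):
    """Average BLEU of each generation against all other generations."""
    weights = tuple(1.0 / n for _ in range(n))
    smooth = SmoothingFunction().method1
    scores = []
    for i, hyp in enumerate(generations):
        refs = [g.split() for j, g in enumerate(generations) if j != i]
        scores.append(sentence_bleu(refs, hyp.split(), weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)

# Near-duplicate outputs push the score up; diverse outputs pull it down.
print(self_bleu(["the cat sat on the mat",
                 "the cat sat on a mat",
                 "quantum computers factor large integers"]))
```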
- PopQA
- TriviaQA
- TruthfulQA
- WinoGrande and Winograd
- See full explanation in note: Winograd and WinoGrande
todo: the evaluation benchmarks listed in alphabetical order at the top are the ones quoted in the figure from "Gemma 3, OLMo 2 32B, and the growing potential of open-source AI", the post from Nathan Lambert - fill them in as and when
- The Leaderboard Illusion
- AI-Slop to AI-Polish Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
- NoLiMa: NoLiMa Long-Context Evaluation Beyond Literal Matching
- EnigmaEval: EnigmaEval A Benchmark of Long Multimodal Reasoning Challenges - EnigmaEval Leaderboard
- Humanity’s Last Exam: Humanity’s Last Exam
- GAIA: GAIA a benchmark for General AI Assistants
- Open-LLM-Leaderboard From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
- From Crowdsourced Data to High-Quality Benchmarks Arena-Hard and BenchBuilder Pipeline
- Measuring Massive Multitask Language Understanding
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots: https://lmarena.ai/?leaderboard
- Hugging Face mirror of the live leaderboard created and maintained at https://lmarena.ai/leaderboard. - “Please link to the original URL for citation purposes: https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard”
- MT-Bench Browser: https://huggingface.co/spaces/lmsys/mt-bench
- MT-Bench-101 A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
- INCLUDE Evaluating Multilingual Language Understanding with Regional Knowledge
- ROCStories and the Story Cloze Test - a commonsense-reasoning framework for evaluating story understanding: a system must choose the correct ending to a four-sentence story (see the scoring sketch below)
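One common way to run this kind of ending-selection task, sketched here as an assumption rather than the benchmark's official protocol: score each candidate ending by its log-likelihood under a causal LM, conditioned on the four-sentence context, and pick the higher-scoring one. The model choice (gpt2) and the example story are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def ending_score(context: str, ending: str) -> float:
    """Sum of log-probabilities of the ending tokens given the story context."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(log_probs[i, full_ids[0, i + 1]].item()
               for i in range(ctx_len - 1, full_ids.shape[1] - 1))

story = ("Ann planted tomatoes in spring. She watered them every day. "
         "By July the vines were heavy with fruit. She invited her neighbours over.")
endings = ["They shared a salad made from her garden.",
           "She threw the ripe tomatoes away unopened."]
print(max(endings, key=lambda e: ending_score(story, e)))
```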
- Chatbot Arena LLM Leaderboard
- Language Model Comparison - disaggregates into quality, speed, and cost
- A Survey on Evaluation of Large Language Models
- Connecting the Dots Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
- See also WebDev Arena: web.lmarena.ai - AI Battle to build the best website!
- Open LLM Leaderboard Hugging Face - Comparing Large Language Models in an open and reproducible way
Language Model Benchmarks
See also NLP-progress by Sebastian Ruder - Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
GPQA
HellaSwag Can a Machine Really Finish Your Sentence
- Commonsense natural language inference (choosing the most plausible sentence completion)
MATH-500
- Reasoning
LiveCodeBench
- Coding - covers code generation plus self-repair, code execution, and test output prediction scenarios
WikiSQL (Zhong et al., 2017)
- NL to SQL queries
GLUE
SuperGLUE
SAMSum
- Conversation summarisation