- The Leaderboard Illusion
- AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
- NoLiMa: Long-Context Evaluation Beyond Literal Matching
- EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges - EnigmaEval Leaderboard
- Humanity's Last Exam
- GAIA: a benchmark for General AI Assistants
- Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
- From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
- Measuring Massive Multitask Language Understanding
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots: https://lmarena.ai/?leaderboard - rankings are computed from pairwise human votes (see the rating sketch after this list)
- Hugging Face mirror of the live leaderboard created and maintained at https://lmarena.ai/leaderboard. - "Please link to the original URL for citation purposes: https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard"
- MT-Bench Browser: https://huggingface.co/spaces/lmsys/mt-bench
- MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
- INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
- ROCStories and the Story Cloze Test - a commonsense reasoning framework for evaluating story understanding; requires a system to choose the correct ending to a four-sentence story.
- Chatbot Arena LLM Leaderboard
- Language Model Comparison - disaggregates model evaluation into quality, speed, and cost
- A Survey on Evaluation of Large Language Models
- Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
- See also WebDev Arena: web.lmarena.ai - AI Battle to build the best website!
- Open LLM Leaderboard Hugging Face - Comparing Large Language Models in an open and reproducible way
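
Chatbot Arena-style leaderboards turn pairwise human votes into a single ranking. Below is a minimal rating sketch of the online Elo update described in the original Arena paper (the live leaderboard has since moved to a Bradley-Terry fit); the battle data, K-factor, and initial rating are illustrative assumptions, not the production pipeline:

```python
from collections import defaultdict

def elo_ratings(battles, k=4, initial=1000):
    """Online Elo over pairwise battles.

    battles: iterable of (model_a, model_b, winner), winner in {"A", "B", "tie"}.
    Online Elo is order-dependent, one reason Arena later adopted Bradley-Terry.
    """
    ratings = defaultdict(lambda: initial)
    for a, b, winner in battles:
        ra, rb = ratings[a], ratings[b]
        expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))  # logistic win probability for A
        score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
        ratings[a] = ra + k * (score_a - expected_a)
        ratings[b] = rb + k * ((1 - score_a) - (1 - expected_a))
    return dict(ratings)

# Toy battles with made-up outcomes.
battles = [("model-x", "model-y", "A"), ("model-x", "model-z", "tie"),
           ("model-z", "model-y", "A")]
print(elo_ratings(battles))
```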
Language Model Benchmarks
See also NLP-progress by Sebastian Ruder - a repository tracking progress in Natural Language Processing (NLP), including the datasets and the current state of the art for the most common NLP tasks.
- MMLU (scored as multiple-choice exact match; see the harness sketch after this list)
- MMLU-Pro
- GPQA
- HellaSwag: Can a Machine Really Finish Your Sentence?
  - Commonsense natural language inference: pick the most plausible sentence completion
- MATH-500
  - Mathematical reasoning (a 500-problem subset of the MATH benchmark)
- LiveCodeBench
  - Coding: code generation, self-repair, code execution, and test-output prediction
- WikiSQL (Zhong et al., 2017)
  - Natural language to SQL queries; see the record sketch after this list
- GLUE
- SuperGLUE
- SAMSum
  - Conversation summarisation
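
Several of the benchmarks above (MMLU, HellaSwag, and the Story Cloze Test in the leaderboard list) share one evaluation protocol: pick one option from a fixed set and report exact-match accuracy against the gold index. A minimal harness sketch; the `MCItem` fields and the pick-first baseline are illustrative assumptions, not any benchmark's official format:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MCItem:
    question: str
    choices: List[str]   # e.g. four options, indexed 0..3
    answer: int          # index of the gold choice

def accuracy(items: List[MCItem], predict: Callable[[MCItem], int]) -> float:
    """Exact-match accuracy, the metric MMLU/HellaSwag-style suites report."""
    correct = sum(predict(item) == item.answer for item in items)
    return correct / len(items)

# Toy items and a trivial baseline predictor (always picks option 0),
# standing in for a real model call.
items = [
    MCItem("2 + 2 = ?", ["4", "3", "5", "22"], 0),
    MCItem("Capital of France?", ["Lyon", "Paris", "Nice", "Lille"], 1),
]
print(accuracy(items, lambda item: 0))  # 0.5
```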
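
WikiSQL pairs a natural-language question over a Wikipedia table with a constrained SQL logical form rather than free-form SQL text. A record sketch of roughly what one example looks like; the field layout follows my reading of the released JSON (a question plus a sql object with sel/agg/conds indices), and the concrete values are invented:

```python
# Hypothetical WikiSQL-style record (values invented for illustration).
record = {
    "table_id": "1-1000181-1",      # which Wikipedia table the question is about
    "question": "Which city hosted the 2012 games?",
    "sql": {
        "sel": 1,                   # index of the column to SELECT
        "agg": 0,                   # aggregation operator index (0 = none)
        "conds": [[0, 0, "2012"]],  # [column index, operator index (0 = '='), value]
    },
}

# The logical form above corresponds to SQL along the lines of:
#   SELECT city FROM table WHERE year = '2012'
print(record["sql"])
```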