TODO: The evaluation benchmarks listed in alphabetical order at the top are the ones quoted in the figure from Nathan Lambert's post "Gemma 3, OLMo 2 32B, and the growing potential of open-source AI". Fill them in as and when.


Language Model Benchmarks

See also NLP-progress by Sebastian Ruder, a repository that tracks progress in Natural Language Processing (NLP), including the datasets and the current state of the art for the most common NLP tasks.

GPQA

HellaSwag: Can a Machine Really Finish Your Sentence? (Zellers et al., 2019)

  • Commonsense natural language inference (choose the most plausible ending for a sentence)

MATH-500

  • Mathematical reasoning (a 500-problem subset of the MATH dataset)

LiveCodeBench

  • Coding - what type? code gen? Code PE?

WikiSQL (Zhong et al., 2017)

  • Translating natural language questions into SQL queries

GLUE

SuperGLUE

SAMSum

  • Conversation summarisation