Title: Measuring Massive Multitask Language Understanding
Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt
Published: 7th September 2020 (Monday) @ 17:59:25
Link: http://arxiv.org/abs/2009.03300v3

Abstract

We propose a new test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model’s academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.


  • “We comprehensively evaluate the breadth and depth of a model’s text understanding by covering numerous topics that humans are incentivized to learn. Since our test consists of 57 tasks, it can be used to analyze aggregate properties of models across tasks and to track important shortcomings. The test and code is available at github.com/hendrycks/test.”
  • “Some researchers have suggested that the future of NLP evaluation should focus on Natural Language Generation (NLG) (Zellers et al., 2020), an idea that reaches back to the Turing Test (Turing, 1950). However, NLG is notoriously difficult to evaluate and lacks a standard metric (Sai et al., 2020). Consequently, we instead create a simple-to-evaluate test that measures classification accuracy on multiple choice questions.”
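  The multiple-choice setup is what makes the test “simple to evaluate”: the model picks one of four options and accuracy is just the fraction of matches. A minimal sketch of that scoring (the function name and data layout are mine, not the authors’):

```python
# Sketch of classification accuracy on multiple-choice questions.
# The answer-letter representation is illustrative, not the paper's exact format.

def mcq_accuracy(predictions, gold):
    """Fraction of questions where the predicted choice matches the gold answer."""
    assert len(predictions) == len(gold) and gold, "need paired, non-empty lists"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Four-way multiple choice, so random guessing lands near 25% accuracy.
preds = ["A", "C", "B", "D"]
gold = ["A", "B", "B", "A"]
print(mcq_accuracy(preds, gold))  # 0.5
```

  Because every task shares this format, per-task accuracies can be averaged directly, which is how the paper reports aggregate numbers.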
  • “There are 57 tasks in total, which is also the number of Atari games (Bellemare et al., 2013),” 😅
  • GPT-3’s confidence is a poor estimator of its accuracy
    • its confidence stays nearly flat at roughly 30–35% regardless of subject, while its accuracy ranges from about 20% to 60% depending on the domain
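  The mismatch above is a calibration failure: a well-calibrated model’s average confidence on a task should track its accuracy on that task. A toy sketch of the comparison (the numbers are invented to mimic the flat-confidence pattern, not taken from the paper):

```python
# Sketch: per-task mean confidence vs. realized accuracy.
# A calibrated model has the two values close; GPT-3's confidence barely moves
# across tasks even as its accuracy swings widely.

def confidence_vs_accuracy(results):
    """results: list of (confidence, was_correct) pairs for one task.
    Returns (mean confidence, accuracy)."""
    mean_conf = sum(conf for conf, _ in results) / len(results)
    accuracy = sum(ok for _, ok in results) / len(results)
    return mean_conf, accuracy

# Invented task: confidence pinned near 0.33, but the model is right 60% of the time.
toy = [(0.33, True), (0.31, True), (0.35, False), (0.32, True), (0.34, False)]
conf, acc = confidence_vs_accuracy(toy)
print(round(conf, 2), round(acc, 2))  # 0.33 0.6
```

  The gap between the two numbers per task is what makes confidence a poor estimator of accuracy here.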
  • “For GPT-3, 9 out of the 10 lowest-accuracy tasks are STEM subjects that emphasize mathematics or calculations.”
    • “We speculate that this is in part because GPT-3 acquires declarative knowledge more readily than procedural knowledge. For example, many questions in Elementary Mathematics require applying the order of operations for arithmetic”
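  To illustrate the failure mode the authors describe: a model with weak procedural knowledge might read an expression strictly left to right instead of applying operator precedence. A tiny invented example (the question is mine, not from the benchmark):

```python
# Illustration of the order-of-operations failure mode: evaluating an
# expression left to right vs. with standard operator precedence.

def eval_left_to_right(tokens):
    """Evaluate [num, op, num, op, num, ...] strictly left to right,
    ignoring precedence -- the shortcut a careless solver might take."""
    result = tokens[0]
    for op, num in zip(tokens[1::2], tokens[2::2]):
        result = result + num if op == "+" else result * num
    return result

# "What is 3 + 4 * 2?"
correct = 3 + 4 * 2                               # precedence: 3 + (4 * 2) = 11
naive = eval_left_to_right([3, "+", 4, "*", 2])   # (3 + 4) * 2 = 14
print(correct, naive)  # 11 14
```

  Declarative recall of facts cannot substitute for executing this kind of multi-step procedure, which is the paper’s speculation for why the math-heavy tasks score lowest.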
  • “We find that some verbal tasks such as Moral Scenarios from Hendrycks et al. (2020) and Professional Law also have especially low accuracy”
  • Model limitations:
    • “notably poor at modeling human (dis)approval, as evident by the low performance on the Professional Law and Moral Scenarios tasks.”
      • note: this result predates instruction tuning (IT) and RLHF
    • “Models also have difficulty performing calculations, so much so that they exhibit poor performance on Elementary Mathematics and many other STEM subjects with “plug and chug” problems.”
    • “Additionally, they do not match expert-level performance (90%) on any subject,” so performance is below expert human level across the board.
    • Fixing this may not be easy: the authors “attempted to create a better Professional Law model by pretraining on specialized data”
      • collected ~2,000 additional Professional Law training examples
      • fine-tuned RoBERTa-base model (Liu et al., 2019)
      • model attained 32.8% test accuracy
      • “To test the impact of additional specialized training data, we also had RoBERTa continue pretraining on approximately 1.6 million legal case summaries using Harvard’s Law Library case law corpus case.law, but after fine-tuning it only attained 36.1% accuracy. This suggests that while additional pretraining on relevant high quality text can help, it may not be enough to substantially increase the performance of current models.”