Title: Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation
Authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, Sara Hooker
Published: 22nd October 2023 (Sunday) @ 21:48:51
Link: http://arxiv.org/abs/2310.14424v1
Abstract
Human evaluation is increasingly critical for assessing large language models, capturing linguistic nuances, and reflecting user preferences more accurately than traditional automated metrics. However, the resource-intensive nature of this type of annotation process poses significant challenges. The key question driving our work: "is it feasible to minimize human-in-the-loop feedback by prioritizing data instances which most effectively distinguish between models?" We evaluate several metric-based methods and find that these metrics enhance the efficiency of human evaluations by minimizing the number of required annotations, thus saving time and cost, while ensuring a robust performance evaluation. We show that our method is effective across widely used model families, reducing instances of indecisive (or "tie") outcomes by up to 54% compared to a random sample when focusing on the top-20 percentile of prioritized instances. This potential reduction in required human effort positions our approach as a valuable strategy in future large language model evaluations.
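
The selection step the abstract describes, keeping only the top-20 percentile of prioritized instances, can be illustrated with a minimal sketch. The abstract does not specify which metrics the paper uses, so the per-prompt scores below are hypothetical placeholders for any score where higher means the two models' outputs are easier to tell apart; the function name `top_percentile` and its `fraction` parameter are likewise assumptions made for illustration, not the authors' implementation.

```python
import math


def top_percentile(scores, fraction=0.2):
    """Return indices of the highest-scoring items, keeping `fraction` of them.

    `scores` is assumed to hold one prioritization score per prompt,
    where a higher score marks a prompt as more likely to separate
    the two models being compared (hypothetical scoring, see lead-in).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one item so a small pool never yields an empty batch.
    k = max(1, math.ceil(len(scores) * fraction))
    # Sort prompt indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical scores for ten prompts (illustrative values only).
scores = [0.10, 0.85, 0.30, 0.95, 0.05, 0.60, 0.20, 0.75, 0.40, 0.15]
selected = top_percentile(scores)
print(selected)  # indices of the prompts to send to human annotators first
```

Only the prompts at the returned indices would be sent for human annotation first; the remaining prompts could be annotated later if budget allows.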