Title: MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
Authors: Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, Wanli Ouyang
Published: 22nd February 2024 (Thursday) @ 18:21:59
Link: http://arxiv.org/abs/2402.14762v3
Abstract
The advent of Large Language Models (LLMs) has drastically enhanced dialogue systems. However, comprehensively evaluating the dialogue abilities of LLMs remains a challenge. Previous benchmarks have primarily focused on single-turn dialogues or provided coarse-grained and incomplete assessments of multi-turn dialogues, overlooking the complexity and fine-grained nuances of real-life dialogues. To address this issue, we introduce MT-Bench-101, specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. By conducting a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives and observing differing trends in LLMs performance across dialogue turns within various tasks. Further analysis indicates that neither utilizing common alignment techniques nor chat-specific designs has led to obvious enhancements in the multi-turn abilities of LLMs. Extensive case studies suggest that our designed tasks accurately assess the corresponding multi-turn abilities. The data and code are available at \url{https://github.com/mtbench101/mt-bench-101}.
Code + Data: https://github.com/mtbench101/mt-bench-101
Quick Notes
Several benchmarks have been introduced to assess the capabilities of Large Language Models (LLMs) in single-turn dialogues, e.g., MMLU (Hendrycks et al., 2020), BBH (Srivastava et al., 2022), and AlpacaEval (Li et al., 2023b). However, daily dialogues between users and chatbots usually involve multi-turn conversations.
we undertake a systematic analysis combining real-world multi-turn dialogue data (Gudibande et al., 2023; Zheng et al., 2023) with the teaching taxonomy from educational psychology (Alexander, 2018; Marchel, 2007).
Overall framework of ability taxonomy
- Perceptivity is the most fundamental ability, reflecting the model's accuracy in understanding context.
- Context Memory - retrieve and use past dialogue (context)
- Context Understanding: (1) anaphora resolution and (2) separate input (distinguishing instructions from the content they operate on, which is important for a well-functioning chat LM)
- Context Interference: (1) topic shift (the user changes topic) and (2) content confusion (similar textual inputs from the user that require markedly different outputs)
- Adaptability is built upon this foundation, indicating the model's ability to respond effectively to user feedback.
- Rephrasing: (1) Content rephrasing and (2) format rephrasing
- Reflection: (1) Self-correction and (2) self-affirmation (the latter applies when the user's input is wrong, e.g. factually or logically flawed)
- Reasoning: (1) Mathematical reasoning and (2) General reasoning - induction, puzzles, deduction
- Interactivity captures the capacity of models for proactive engagement with humans, which is crucial for excelling in multi-turn interactions. In chatbot-triggered dialogue, the chatbot proactively asks questions to guide the conversation or to gather the information needed for better responses.
- Questioning: (1) Instruction clarification and (2) Proactive Interaction
Tiers of their framework:
- The first tier outlines three progressive overarching abilities (Perceptivity, Adaptability, Interactivity), which are depicted in Figure 1.
- The second tier specifies seven detailed abilities.
- The third tier further decomposes the seven Tier-2 abilities into 13 distinct tasks. For each third-tier task, we meticulously design specific prompts and utilize GPT-4 for data generation.
This taxonomy provides evaluation results across three levels from general to detailed, allowing for the identification of deficiencies in models at varying levels of granularity.
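The taxonomy is easy to encode as a nested mapping. A minimal sketch follows; the Tier-1 and Tier-2 names match the notes above, while the Tier-3 task names are paraphrases of this summary and are assumptions (the benchmark itself refers to the 13 tasks by acronyms, as in Table 3 below).

```python
# Three-tier ability taxonomy as a nested mapping:
# Tier-1 ability -> Tier-2 ability -> list of Tier-3 tasks.
TAXONOMY = {
    "Perceptivity": {
        "Context Memory": ["Context Memory"],
        "Context Understanding": ["Anaphora Resolution", "Separate Input"],
        "Context Interference": ["Topic Shift", "Content Confusion"],
    },
    "Adaptability": {
        "Rephrasing": ["Content Rephrasing", "Format Rephrasing"],
        "Reflection": ["Self-correction", "Self-affirmation"],
        "Reasoning": ["Mathematical Reasoning", "General Reasoning"],
    },
    "Interactivity": {
        "Questioning": ["Instruction Clarification", "Proactive Interaction"],
    },
}

# Sanity checks against the counts reported in the paper.
assert len(TAXONOMY) == 3                                                 # Tier 1
assert sum(len(v) for v in TAXONOMY.values()) == 7                       # Tier 2
assert sum(len(t) for v in TAXONOMY.values() for t in v.values()) == 13  # Tier 3
```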
In total, MT-Bench-101 encompasses 4208 turns within 1388 multi-turn dialogues.
We then perform extensive experiments on MT-Bench-101 to assess the multi-turn chat ability of existing LLMs, including 2 closed-source LLMs and 19 open-source LLMs. Our findings include:
- We identify adaptability and interactivity as the key deficiencies of existing LLMs, and GPT-4 is the most powerful model for multi-turn dialogues.
- The average performance of models within various tasks exhibits differing trends with the progression of turns, reflecting the distinct characteristics of the abilities.
- Model performance improves as the model size increases. However, neither utilizing common alignment techniques (such as RLHF) nor chat-specific designs has resulted in significant enhancements in the multi-turn abilities of LLMs.
- Using our designed evaluation approach, the agreement between GPT-4 and human expert evaluations reached 87%.
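For intuition, a sketch of how such an agreement rate could be computed, assuming exact-match agreement on per-turn judge scores; the function name and the matching criterion are illustrative assumptions, not the paper's exact protocol (see the official repo for that).

```python
from typing import Sequence

def agreement_rate(judge_scores: Sequence[int], human_scores: Sequence[int]) -> float:
    """Fraction of samples on which the GPT-4 judge and a human expert agree.

    Exact-match agreement is assumed here for illustration; the paper reports
    87% agreement, but its precise matching criterion and sampling protocol
    are documented in the official repository, not reproduced in this sketch.
    """
    assert len(judge_scores) == len(human_scores) and judge_scores
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)

# Hypothetical usage with 1-10 turn-level scores:
print(agreement_rate([9, 7, 8, 6], [9, 7, 8, 5]))  # 0.75
```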
Performance (Benchmarking) of LLMs with MT-Bench-101
Table 3: The performance of different LLMs on the 13 multi-turn dialogue tasks in our MT-Bench-101.
Due to space constraints, the 13 tasks are represented by their corresponding acronyms.
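Because the taxonomy reports results at three granularities, per-task scores such as those in Table 3 can be rolled up into ability-level scores. A minimal sketch, reusing the TAXONOMY mapping from the earlier code block and assuming unweighted averaging over tasks (the paper's actual aggregation may differ):

```python
from statistics import mean

def ability_scores(task_scores: dict[str, float]) -> dict[str, float]:
    """Roll per-task (Tier-3) scores up into per-ability (Tier-1) averages.

    `task_scores` maps task names, as used in the TAXONOMY sketch above, to a
    model's mean score on that task (the benchmark scores turns on a 1-10
    scale). Unweighted averaging over tasks is an assumption made here.
    """
    return {
        ability: mean(task_scores[task]
                      for detailed in sub.values()
                      for task in detailed)
        for ability, sub in TAXONOMY.items()
    }
```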