Chat with Open Large Language Models
LMSYS Chatbot Arena is a crowdsourced open platform for LLM evals. We’ve collected over 1,000,000 human pairwise comparisons to rank LLMs with the Bradley-Terry model and display the model ratings on an Elo scale. You can find more details in our paper. Chatbot Arena depends on community participation, so please contribute by casting your vote!
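The ranking pipeline above (pairwise human votes → Bradley-Terry strengths → Elo-scale ratings) can be sketched in a few lines. This is a minimal illustration, not the leaderboard's actual code: the minorization-maximization fit and the scale/base/init constants in `to_elo_scale` are assumptions chosen for readability.

```python
import math

def fit_bradley_terry(battles, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via
    minorization-maximization. Assumes every model wins at least once
    and the comparison graph is connected."""
    models = sorted({m for pair in battles for m in pair})
    p = {m: 1.0 for m in models}
    wins = {m: 0 for m in models}
    # n[a][b]: number of battles between a and b (either direction)
    n = {a: {b: 0 for b in models} for a in models}
    for w, l in battles:
        wins[w] += 1
        n[w][l] += 1
        n[l][w] += 1
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(n[i][j] / (p[i] + p[j])
                        for j in models if j != i and n[i][j])
            new_p[i] = wins[i] / denom if denom else p[i]
        # Pin the scale by normalizing to geometric mean 1
        g = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
        p = {m: v / g for m, v in new_p.items()}
    return p

def to_elo_scale(strengths, scale=400, base=10, init=1000):
    """Map BT strengths onto an Elo-like scale (constants illustrative)."""
    return {m: init + scale * math.log(s, base) for m, s in strengths.items()}
```

For example, fitting a handful of battles where A mostly beats B and B mostly beats C yields Elo-scale ratings ordered A > B > C.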
Total models: 145. Total votes: 1,898,013. Last updated: 2024-09-17.
Code to recreate the leaderboard tables and plots is in this notebook. You can contribute your vote at lmarena.ai!
Overall Questions
Models: 145 (100%). Votes: 1,898,013 (100%).
| Rank* (UB) | Model | Arena Score | 95% CI | Votes | Organization | License | Knowledge Cutoff |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 104 | | 1355 | +12/-11 | 165,503 | Cognitive Computations | Falcon-180B TII License | 2023/10 |
*Rank (UB): the model’s ranking (upper bound), defined as one plus the number of models that are statistically better than the target model. Model A is statistically better than model B when A’s lower-bound score is greater than B’s upper-bound score (at a 95% confidence interval). See Figure 1 below for a visualization of the confidence intervals of model scores.
Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.
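The Rank (UB) rule above is simple to compute once each model has a confidence interval on its score. A minimal sketch (the function name and input shape are assumptions, not the leaderboard's code):

```python
def rank_upper_bound(intervals):
    """intervals: {model: (lower, upper)} 95% CI bounds on arena score.
    Rank (UB) = 1 + number of models statistically better than the
    target, i.e. models whose lower bound exceeds the target's upper bound."""
    return {
        m: 1 + sum(1 for other, (lo, _) in intervals.items()
                   if other != m and lo > intervals[m][1])
        for m in intervals
    }
```

Note that overlapping intervals produce tied ranks: two models whose CIs overlap can both hold rank 1.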
Figure 1: Confidence Intervals on Model Strength (via Bootstrapping)
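The intervals in Figure 1 come from bootstrapping: resample the battles with replacement, refit the scores, and take percentile quantiles. A generic percentile-bootstrap sketch, using a toy win-rate statistic rather than a full Bradley-Terry refit (both function names are illustrative, not the leaderboard's pipeline):

```python
import random

def bootstrap_ci(battles, statistic, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap: resample battles with replacement,
    recompute the statistic, return the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    stats = sorted(
        statistic([rng.choice(battles) for _ in battles])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def win_rate_of(model):
    """Toy statistic: model's win fraction over (winner, loser) battles.
    Assumes the model appears in at least one battle of each resample."""
    def stat(battles):
        played = [b for b in battles if model in b]
        return sum(1 for w, _ in played if w == model) / len(played)
    return stat
```

In the real leaderboard the statistic being bootstrapped is the fitted Bradley-Terry score itself, which is why models with few votes get wide intervals.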
Figure 2: Average Win Rate Against All Other Models (Assuming Uniform Sampling and No Ties)
Figure 3: Fraction of Model A Wins for All Non-tied A vs. B Battles
Figure 4: Battle Count for Each Combination of Models (without Ties)
Citation
Please cite the following paper if you find our leaderboard or dataset helpful.
@misc{chiang2024chatbot,
title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
year={2024},
eprint={2403.04132},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
Terms of Service
Users are required to agree to the following terms before using the service:
The service is a research preview. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. Please do not upload any private information. The service collects user dialogue data, including both text and images, and reserves the right to distribute it under a Creative Commons Attribution (CC-BY) or a similar license.
Please report any bugs or issues in the arena-feedback channel on our Discord.
Acknowledgment
We thank UC Berkeley SkyLab, Kaggle, MBZUAI, a16z, Together AI, Hyperbolic, RunPod, Anyscale, and HuggingFace for their generous sponsorship.