🚨 New feature: Style Control is now added to Arena Hard Auto! Check this section to start using style control!
Arena-Hard-Auto-v0.1 (See Paper) is an automatic evaluation tool for instruction-tuned LLMs. It contains 500 challenging user queries sourced from Chatbot Arena. We prompt GPT-4-Turbo as judge to compare the models' responses against a baseline model (default: GPT-4-0314). Notably, Arena-Hard-Auto has the highest correlation and separability to Chatbot Arena among popular open-ended LLM benchmarks (See Paper). If you are curious to see how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto.
Although both Arena-Hard-Auto and Chatbot Arena Category Hard (See Blog) employ similar pipeline to select hard prompts, Arena-Hard-Auto employs automatic judge as a cheaper and faster approximator to human preference. Checkout BenchBuilder folder for code and resources on how we curate Arena-Hard-Auto.
Following the newly introduced Style Control on Chatbot Arena, we release Style Control on Arena Hard Auto! We employ the same Style Control methods as proposed in the blogpost. Please refer to the blogpost for methodology and technical background.
(Updated: 10/14)
claude-3-5-sonnet-20241022 | score: 86.5 | 95% CI: (-1.4, 1.6) | average #tokens: 691
claude-3-5-sonnet-20240620 | score: 82.0 | 95% CI: (-1.6, 2.2) | average #tokens: 567
o1-preview-2024-09-12 | score: 81.6 | 95% CI: (-2.4, 2.2) | average #tokens: 1193
o1-mini-2024-09-12 | score: 79.2 | 95% CI: (-2.6, 2.4) | average #tokens: 1399
gpt-4-turbo-2024-04-09 | score: 74.4 | 95% CI: (-2.5, 2.1) | average #tokens: 662
gpt-4-0125-preview | score: 73.5 | 95% CI: (-2.4, 1.8) | average #tokens: 619
gpt-4o-2024-08-06 | score: 71.0 | 95% CI: (-2.5, 2.8) | average #tokens: 594
llama-3.1-nemotron-70b-instruct| score: 70.9 | 95% CI: (-3.3, 3.3) | average #tokens: 869
gpt-4o-2024-05-13 | score: 69.9 | 95% CI: (-2.5, 2.3) | average #tokens: 696
athene-70b | score: 67.7 | 95% CI: (-3.2, 2.2) | average #tokens: 685
yi-lightning | score: 67.1 | 95% CI: (-2.3, 2.8) | average #tokens: 875
llama-3.1-405b-instruct | score: 66.8 | 95% CI: (-2.6, 1.9) | average #tokens: 658
claude-3-opus-20240229 | score: 65.5 | 95% CI: (-2.3, 2.5) | average #tokens: 541
yi-large-preview | score: 65.0 | 95% CI: (-2.4, 2.0) | average #tokens: 720
gpt-4o-mini-2024-07-18 | score: 64.2 | 95% CI: (-2.7, 2.9) | average #tokens: 668
qwen2.5-72b-instruct | score: 63.4 | 95% CI: (-2.5, 2.7) | average #tokens: 821
mistral-large-2407 | score: 63.1 | 95% CI: (-2.6, 3.1) | average #tokens: 623
gemini-1.5-pro-api-0514 | score: 62.4 | 95% CI: (-2.7, 2.1) | average #tokens: 676
glm-4-0520 | score: 61.3 | 95% CI: (-3.3, 3.0) | average #tokens: 636
yi-large | score: 59.3 | 95% CI: (-3.1, 2.2) | average #tokens: 626
deepseek-coder-v2 | score: 58.2 | 95% CI: (-2.6, 2.8) | average #tokens: 578
glm-4-0116 | score: 54.1 | 95% CI: (-2.5, 2.5) | average #tokens: 622
llama-3.1-70b-instruct | score: 51.6 | 95% CI: (-2.5, 2.7) | average #tokens: 628
glm-4-air | score: 50.4 | 95% CI: (-1.8, 2.5) | average #tokens: 619
gpt-4-0314 | score: 50.0 | 95% CI: (0.0, 0.0) | average #tokens: 423
claude-3-sonnet-20240229 | score: 49.7 | 95% CI: (-2.0, 2.6) | average #tokens: 552
gpt-4-0613 | score: 49.6 | 95% CI: (-2.5, 2.7) | average #tokens: 354
qwen2-72b-instruct | score: 49.5 | 95% CI: (-2.4, 2.4) | average #tokens: 515
gemma-2-27b-it | score: 47.4 | 95% CI: (-2.8, 2.8) | average #tokens: 577
gemini-1.5-pro-api-0409-preview| score: 46.8 | 95% CI: (-2.8, 2.7) | average #tokens: 478
mistral-large-2402 | score: 45.5 | 95% CI: (-2.5, 2.1) | average #tokens: 400
claude-3-haiku-20240307 | score: 45.3 | 95% CI: (-2.3, 3.1) | average #tokens: 505
llama-3-70b-instruct | score: 44.3 | 95% CI: (-2.2, 3.5) | average #tokens: 591
mixtral-8x22b-instruct-v0.1 | score: 44.0 | 95% CI: (-2.9, 2.9) | average #tokens: 430
qwen1.5-72b-chat | score: 39.7 | 95% CI: (-2.1, 2.2) | average #tokens: 474
gemini-1.5-flash-api-0514 | score: 39.7 | 95% CI: (-2.5, 2.4) | average #tokens: 642
mistral-next | score: 39.6 | 95% CI: (-2.2, 2.5) | average #tokens: 297
mistral-medium | score: 39.0 | 95% CI: (-2.4, 3.3) | average #tokens: 485
phi-3-medium-4k-instruct | score: 38.7 | 95% CI: (-2.1, 2.6) | average #tokens: 517
command-r-plus | score: 37.3 | 95% CI: (-2.3, 1.6) | average #tokens: 541
claude-2.0 | score: 36.7 | 95% CI: (-2.2, 2.6) | average #tokens: 295
claude-2.1 | score: 35.1 | 95% CI: (-2.9, 2.5) | average #tokens: 290
gpt-3.5-turbo-0613 | score: 34.9 | 95% CI: (-2.2, 3.0) | average #tokens: 401
gpt-3.5-turbo-0125 | score: 34.7 | 95% CI: (-2.3, 2.7) | average #tokens: 329
phi-3-small-8k-instruct | score: 33.6 | 95% CI: (-2.6, 2.3) | average #tokens: 568
gemma-2-9b-it | score: 33.3 | 95% CI: (-2.7, 2.8) | average #tokens: 541
gpt-3.5-turbo-1106 | score: 33.0 | 95% CI: (-2.4, 2.9) | average #tokens: 285
dbrx-instruct-preview | score: 32.0 | 95% CI: (-2.5, 2.4) | average #tokens: 415
internlm2-20b-5-chat | score: 30.2 | 95% CI: (-2.2, 2.5) | average #tokens: 576
mixtral-8x7b-instruct-v0.1 | score: 29.8 | 95% CI: (-2.0, 2.1) | average #tokens: 457
gpt-3.5-turbo-0314 | score: 29.4 | 95% CI: (-2.8, 2.1) | average #tokens: 334
starling-lm-7b-beta | score: 26.0 | 95% CI: (-2.4, 2.2) | average #tokens: 530
snowflake-arctic-instruct | score: 25.9 | 95% CI: (-2.6, 1.8) | average #tokens: 365
gemini-1.0-pro | score: 24.9 | 95% CI: (-2.1, 2.4) | average #tokens: 322
command-r | score: 23.4 | 95% CI: (-1.9, 1.8) | average #tokens: 432
snorkel-mistral-pairrm-dpo | score: 21.8 | 95% CI: (-2.2, 1.9) | average #tokens: 564
yi-34b-chat | score: 21.8 | 95% CI: (-2.2, 2.0) | average #tokens: 611
internlm2-20b-chat | score: 21.1 | 95% CI: (-1.9, 1.3) | average #tokens: 667
llama-3-8b-instruct | score: 19.7 | 95% CI: (-1.6, 1.8) | average #tokens: 585
llama-3.1-8b-instruct | score: 18.2 | 95% CI: (-1.8, 2.0) | average #tokens: 861
tulu-2-dpo-70b | score: 18.0 | 95% CI: (-1.7, 1.8) | average #tokens: 550
starling-lm-7b-alpha | score: 16.4 | 95% CI: (-1.5, 1.5) | average #tokens: 483
phi-3-mini-128k-instruct | score: 16.1 | 95% CI: (-1.5, 1.9) | average #tokens: 609
mistral-7b-instruct | score: 15.2 | 95% CI: (-2.0, 1.5) | average #tokens: 541
llama-2-70b-chat | score: 13.4 | 95% CI: (-1.5, 1.7) | average #tokens: 595
vicuna-33b | score: 11.7 | 95% CI: (-1.9, 1.7) | average #tokens: 451
gemma-1.1-7b-it | score: 11.6 | 95% CI: (-1.4, 1.2) | average #tokens: 341
gemma-7b-it | score: 7.0 | 95% CI: (-1.1, 1.0) | average #tokens: 378
gemma-1.1-2b-it | score: 3.5 | 95% CI: (-0.6, 0.7) | average #tokens: 316
gemma-2b-it | score: 2.9 | 95% CI: (-0.5, 0.6) | average #tokens: 369
The following leaderboard has no style control.
(Updated: 10/14)
o1-mini-2024-09-12 | score: 92.0 | 95% CI: (-1.2, 1.0) | average #tokens: 1399
o1-preview-2024-09-12 | score: 90.4 | 95% CI: (-1.1, 1.3) | average #tokens: 1193
claude-3-5-sonnet-20241022 | score: 85.2 | 95% CI: (-1.4, 1.6) | average #tokens: 691
llama-3.1-nemotron-70b-instruct| score: 84.9 | 95% CI: (-1.7, 1.8) | average #tokens: 869
gpt-4-turbo-2024-04-09 | score: 82.6 | 95% CI: (-1.8, 1.5) | average #tokens: 662
yi-lightning | score: 81.5 | 95% CI: (-1.6, 1.6) | average #tokens: 875
claude-3-5-sonnet-20240620 | score: 79.3 | 95% CI: (-2.1, 2.0) | average #tokens: 567
gpt-4o-2024-05-13 | score: 79.2 | 95% CI: (-1.9, 1.7) | average #tokens: 696
gpt-4-0125-preview | score: 78.0 | 95% CI: (-2.1, 2.4) | average #tokens: 619
qwen2.5-72b-instruct | score: 78.0 | 95% CI: (-1.8, 1.8) | average #tokens: 821
gpt-4o-2024-08-06 | score: 77.9 | 95% CI: (-2.0, 2.1) | average #tokens: 594
athene-70b | score: 77.6 | 95% CI: (-2.7, 2.2) | average #tokens: 684
gpt-4o-mini | score: 74.9 | 95% CI: (-2.5, 1.9) | average #tokens: 668
gemini-1.5-pro-api-preview | score: 72.0 | 95% CI: (-2.1, 2.5) | average #tokens: 676
mistral-large-2407 | score: 70.4 | 95% CI: (-1.6, 2.1) | average #tokens: 623
llama-3.1-405b-instruct-fp8 | score: 69.3 | 95% CI: (-2.4, 2.2) | average #tokens: 658
glm-4-0520 | score: 63.8 | 95% CI: (-2.9, 2.8) | average #tokens: 636
yi-large | score: 63.7 | 95% CI: (-2.6, 2.4) | average #tokens: 626
deepseek-coder-v2 | score: 62.3 | 95% CI: (-2.1, 1.8) | average #tokens: 578
claude-3-opus-20240229 | score: 60.4 | 95% CI: (-2.5, 2.5) | average #tokens: 541
gemma-2-27b-it | score: 57.5 | 95% CI: (-2.1, 2.4) | average #tokens: 577
llama-3.1-70b-instruct | score: 55.7 | 95% CI: (-2.9, 2.7) | average #tokens: 628
glm-4-0116 | score: 55.7 | 95% CI: (-2.4, 2.3) | average #tokens: 622
glm-4-air | score: 50.9 | 95% CI: (-2.9, 2.7) | average #tokens: 619
gpt-4-0314 | score: 50.0 | 95% CI: (0.0, 0.0) | average #tokens: 423
gemini-1.5-flash-api-preview | score: 49.6 | 95% CI: (-2.2, 2.8) | average #tokens: 642
qwen2-72b-instruct | score: 46.9 | 95% CI: (-2.5, 2.7) | average #tokens: 515
claude-3-sonnet-20240229 | score: 46.8 | 95% CI: (-2.3, 2.7) | average #tokens: 552
llama-3-70b-instruct | score: 46.6 | 95% CI: (-2.3, 2.6) | average #tokens: 591
claude-3-haiku-20240307 | score: 41.5 | 95% CI: (-2.5, 2.5) | average #tokens: 505
gpt-4-0613 | score: 37.9 | 95% CI: (-2.8, 2.4) | average #tokens: 354
mistral-large-2402 | score: 37.7 | 95% CI: (-2.1, 2.6) | average #tokens: 400
mixtral-8x22b-instruct-v0.1 | score: 36.4 | 95% CI: (-2.4, 2.6) | average #tokens: 430
Qwen1.5-72B-Chat | score: 36.1 | 95% CI: (-2.0, 2.7) | average #tokens: 474
phi-3-medium-4k-instruct | score: 33.4 | 95% CI: (-2.6, 2.1) | average #tokens: 517
command-r-plus | score: 33.1 | 95% CI: (-2.8, 2.4) | average #tokens: 541
mistral-medium | score: 31.9 | 95% CI: (-1.9, 2.2) | average #tokens: 485
internlm2.5-20b-chat | score: 31.2 | 95% CI: (-2.4, 2.8) | average #tokens: 576
phi-3-small-8k-instruct | score: 29.8 | 95% CI: (-1.8, 1.9) | average #tokens: 568
mistral-next | score: 27.4 | 95% CI: (-2.4, 2.4) | average #tokens: 297
gpt-3.5-turbo-0613 | score: 24.8 | 95% CI: (-1.9, 2.3) | average #tokens: 401
dbrx-instruct-preview | score: 24.6 | 95% CI: (-2.0, 2.6) | average #tokens: 415
internlm2-20b-chat | score: 24.4 | 95% CI: (-2.0, 2.2) | average #tokens: 667
claude-2.0 | score: 24.0 | 95% CI: (-1.8, 1.8) | average #tokens: 295
Mixtral-8x7B-Instruct-v0.1 | score: 23.4 | 95% CI: (-2.0, 1.9) | average #tokens: 457
gpt-3.5-turbo-0125 | score: 23.3 | 95% CI: (-2.2, 1.9) | average #tokens: 329
Yi-34B-Chat | score: 23.1 | 95% CI: (-1.6, 1.8) | average #tokens: 611
Starling-LM-7B-beta | score: 23.0 | 95% CI: (-1.8, 1.8) | average #tokens: 530
claude-2.1 | score: 22.8 | 95% CI: (-2.3, 1.8) | average #tokens: 290
llama-3.1-8b-instruct | score: 21.3 | 95% CI: (-1.9, 2.2) | average #tokens: 861
Snorkel-Mistral-PairRM-DPO | score: 20.7 | 95% CI: (-1.8, 2.2) | average #tokens: 564
llama-3-8b-instruct | score: 20.6 | 95% CI: (-2.0, 1.9) | average #tokens: 585
gpt-3.5-turbo-1106 | score: 18.9 | 95% CI: (-1.8, 1.6) | average #tokens: 285
gpt-3.5-turbo-0301 | score: 18.1 | 95% CI: (-1.9, 2.1) | average #tokens: 334
gemini-1.0-pro | score: 17.8 | 95% CI: (-1.2, 2.2) | average #tokens: 322
snowflake-arctic-instruct | score: 17.6 | 95% CI: (-1.8, 1.5) | average #tokens: 365
command-r | score: 17.0 | 95% CI: (-1.7, 1.8) | average #tokens: 432
phi-3-mini-128k-instruct | score: 15.4 | 95% CI: (-1.4, 1.4) | average #tokens: 609
tulu-2-dpo-70b | score: 15.0 | 95% CI: (-1.6, 1.3) | average #tokens: 550
Starling-LM-7B-alpha | score: 12.8 | 95% CI: (-1.6, 1.4) | average #tokens: 483
mistral-7b-instruct | score: 12.6 | 95% CI: (-1.7, 1.4) | average #tokens: 541
gemma-1.1-7b-it | score: 12.1 | 95% CI: (-1.3, 1.3) | average #tokens: 341
Llama-2-70b-chat-hf | score: 11.6 | 95% CI: (-1.5, 1.2) | average #tokens: 595
vicuna-33b-v1.3 | score: 8.6 | 95% CI: (-1.1, 1.1) | average #tokens: 451
gemma-7b-it | score: 7.5 | 95% CI: (-1.2, 1.3) | average #tokens: 378
Llama-2-7b-chat-hf | score: 4.6 | 95% CI: (-0.8, 0.8) | average #tokens: 561
gemma-1.1-2b-it | score: 3.4 | 95% CI: (-0.6, 0.8) | average #tokens: 316
gemma-2b-it | score: 3.0 | 95% CI: (-0.6, 0.6) | average #tokens: 369
git clone https://github.com/lm-sys/arena-hard.git
cd arena-hard
pip install -r requirements.txt
pip install -r requirements-optional.txt # Optional dependencies (e.g., anthropic sdk)
We have pre-generated many popular models answers and judgments. You can browse them with an online demo or download them (with git-lfs
installed) by
> git clone https://huggingface.co/spaces/lmsys/arena-hard-browser
// copy answers/judgments to the data directory
> cp -r arena-hard-browser/data .
Then run
> python show_result.py
gpt-4-0125-preview | score: 78.0 | 95% CI: (-1.8, 2.2) | average #tokens: 619
claude-3-opus-20240229 | score: 60.4 | 95% CI: (-2.6, 2.1) | average #tokens: 541
gpt-4-0314 | score: 50.0 | 95% CI: (0.0, 0.0) | average #tokens: 423
claude-3-sonnet-20240229 | score: 46.8 | 95% CI: (-2.7, 2.3) | average #tokens: 552
claude-3-haiku-20240307 | score: 41.5 | 95% CI: (-2.4, 2.5) | average #tokens: 505
gpt-4-0613 | score: 37.9 | 95% CI: (-2.1, 2.2) | average #tokens: 354
mistral-large-2402 | score: 37.7 | 95% CI: (-2.9, 2.8) | average #tokens: 400
Qwen1.5-72B-Chat | score: 36.1 | 95% CI: (-2.1, 2.4) | average #tokens: 474
command-r-plus | score: 33.1 | 95% CI: (-2.0, 1.9) | average #tokens: 541
Running show_result.py
will save generated battles into data/arena_hard_battles.jsonl
and bootstrapping statistics into data/bootstrapping_results.jsonl
. If you don't want to regenerate battles or bootstrapping statistics, simply toggle argument --load-battles
or --load-bootstrap
, respectively.
Fill in your API endpoint in config/api_config.yaml
. We support OpenAI compatible API server. You can specify parallel
to indicate the number of concurrent API requests (default: 1).
# example
gpt-3.5-turbo-0125:
model_name: gpt-3.5-turbo-0125
endpoints: null
api_type: openai
parallel: 8
[YOUR-MODEL-NAME]:
model_name: [YOUR-MODEL-NAME]
endpoints:
- api_base: [YOUR-ENDPOINT-URL]
api_key: [YOUR-API-KEY]
api_type: openai
parallel: 8
You may use inference engine such as Latest TGI version or vLLM or SGLang to host your model with an OpenAI compatible API server.
TGI Quick start
hf_pat=
model=
volume=/path/to/cache
port=1996
huggingface-cli download $model
sudo docker run --gpus 8 -e HUGGING_FACE_HUB_TOKEN=$hf_pat --shm-size 2000g -p $port:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model --max-input-length 8192 --max-batch-total-tokens 8193 --max-batch-prefill-tokens 8193 --max-total-tokens 8193
In config/gen_answer_config.yaml
, add your model name in model_list
.
bench_name: arena-hard-v0.1
temperature: 0.0
max_tokens: 4096
num_choices: 1
model_list:
- [YOUR-MODEL-NAME]
Run the command to generate answers:
python gen_answer.py
Caching feature is implemented. The code will skip generating an answer when there is already an existing answer/judgment to the same prompt.
In config/judge_config.yaml
, add your model name in model_list
.
...
# Add your model below for evaluation
model_list:
- gpt-3.5-turbo-0125
- [YOUR-MODEL-NAME]
Run the command to generate judgments:
python gen_judgment.py
Judgment caching is also implemented. It will skip generating judgments that has already been generated or lacks one of the model answers.
Output model win rates. Optionally, use --full-stats
for detailed results. To save a csv file of the model rankings, use --output
> python show_result.py
You can review individual judgment results using our UI code.
> python qa_browser.py --share
Following the newly introduced Style Control on Chatbot Arena, we release Style Control on Arena Hard Auto! We employ the same Style Control methods as proposed in the blogpost. Please refer to the blogpost for methodology and technical background.
Before applying style control, make sure your model answers has proper style attribute generated. Either pull the latest data from huggingface repo, or run the following script!
To add style attribute to your model answers, use add_markdown_info.py
. The following command takes model answers from --dir
, append style attributes (token length, number of headers, etc), and save the new answers in --output-dir
.
> python add_markdown_info.py --dir data/arena-hard-v0.1/model_answer --output-dir data/arena-hard-v0.1/model_answer
To control for style (token length and markdown elements), use --style-control
when running show_result.py
.
> python show_result.py --style-control
To control for length and markdown separately, use --length-control-only
and --markdown-control-only
.
Coming soon...
The code in this repository is mostly developed for or derived from the papers below. Please cite it if you find the repository helpful.
@article{li2024crowdsourced,
title={From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline},
author={Li, Tianle and Chiang, Wei-Lin and Frick, Evan and Dunlap, Lisa and Wu, Tianhao and Zhu, Banghua and Gonzalez, Joseph E and Stoica, Ion},
journal={arXiv preprint arXiv:2406.11939},
year={2024}
}
@misc{chiang2024chatbot,
title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
year={2024},
eprint={2403.04132},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
@misc{arenahard2024,
title = {From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline},
url = {https://lmsys.org/blog/2024-04-19-arena-hard/},
author = {Tianle Li*, Wei-Lin Chiang*, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica},
month = {April},
year = {2024}
}