An automatic pipeline for creating high-quality benchmarks. BenchBuilder was used on Chatbot Arena data to curate Arena-Hard-v0.1.
Check out our paper for more details.
BenchBuilder employs a two-stage pipeline.
First, install the BenchBuilder dependencies:
```
cd BenchBuilder
pip install -r requirements.txt
```
Step 1: annotate prompts using GPT-3.5-Turbo and filter out prompts that either have a score < 5 or belong to a topic cluster with a mean score < 3. This serves as a cheap first pass to remove low-quality prompts and clusters before further curation.
Step 2: use GPT-4-Turbo to annotate the remaining prompts, then keep prompts with a quality score >= 6 that belong to a topic cluster with a mean quality score >= 6, ensuring only high-quality prompts are selected with minimal false positives.
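Both stages reduce to the same threshold logic: keep a prompt only if its own quality score and its cluster's mean score clear the thresholds. A minimal sketch in Python (the dict keys `score` and `cluster` are illustrative, not the exact schema used by the scripts):

```python
from collections import defaultdict

def filter_prompts(prompts, prompt_threshold=6, cluster_threshold=6):
    """Keep prompts whose own score and whose cluster's mean score pass the thresholds.

    Each prompt is assumed to be a dict like {"prompt": str, "score": int, "cluster": int}.
    """
    # Compute the mean quality score per topic cluster.
    cluster_scores = defaultdict(list)
    for p in prompts:
        cluster_scores[p["cluster"]].append(p["score"])
    cluster_means = {c: sum(s) / len(s) for c, s in cluster_scores.items()}

    # Keep only prompts that clear both the per-prompt and per-cluster thresholds.
    return [
        p for p in prompts
        if p["score"] >= prompt_threshold
        and cluster_means[p["cluster"]] >= cluster_threshold
    ]
```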
After running BenchBuilder, we stratified-sampled multiple prompts per cluster to create a benchmark. However, you may employ whatever sampling scheme you prefer on the prompts produced by BenchBuilder.
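For example, a simple per-cluster stratified sample might look like the following (the `per_cluster` count and dict keys are illustrative choices):

```python
import random
from collections import defaultdict

def stratified_sample(prompts, per_cluster=5, seed=42):
    """Sample up to `per_cluster` prompts from each topic cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for p in prompts:
        by_cluster[p["cluster"]].append(p)
    benchmark = []
    for cluster_prompts in by_cluster.values():
        k = min(per_cluster, len(cluster_prompts))
        benchmark.extend(rng.sample(cluster_prompts, k))
    return benchmark
```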
For Chatbot Arena Category Hard Prompts, which also employs BenchBuilder, we used Llama-3-70B-Instruct as the LLM annotator. Check out our Category Hard Prompts blogpost for more details.
To topic cluster your dataset:
```
python topic_clustering.py --conv-file [your json file] --min-topic-size 8
```
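To get a rough idea of what this step does, here is a minimal sketch using BERTopic (this assumes a BERTopic-style clustering pipeline; refer to topic_clustering.py for the actual embedding model, preprocessing, and parameters):

```python
# Minimal topic-clustering sketch; topic_clustering.py may use a different
# embedding model and additional preprocessing.
from bertopic import BERTopic

prompts = [
    "How do I reverse a linked list in C?",
    "Plan a 3-day trip to Kyoto on a budget",
    "Explain the CAP theorem with examples",
]  # load your own prompts here

topic_model = BERTopic(min_topic_size=8)  # mirrors --min-topic-size 8
topics, probs = topic_model.fit_transform(prompts)
print(topic_model.get_topic_info().head())
```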
To annotate your dataset with key criteria:
```
python label.py --config config.yaml
```
Make sure to properly configure your config.yaml before you begin labeling.
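Conceptually, this step asks an LLM judge to score each prompt against a set of key criteria. A generic sketch of that idea is below; the judge model, criteria list, and prompt wording here are assumptions for illustration, not the exact behavior of label.py, which is driven by config.yaml:

```python
# Generic rubric-scoring sketch; label.py configures the judge model and
# criteria through config.yaml rather than hard-coding them like this.
from openai import OpenAI

client = OpenAI()

CRITERIA = [  # assumed criteria list for illustration
    "specificity", "domain knowledge", "complexity",
    "problem-solving", "creativity", "technical accuracy",
    "real-world application",
]

def score_prompt(prompt: str, model: str = "gpt-4-turbo") -> str:
    """Ask the judge model to rate a prompt against each criterion."""
    rubric = "\n".join(f"- {c}" for c in CRITERIA)
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Rate the following user prompt on each criterion (yes/no):\n"
                f"{rubric}\n\nPrompt:\n{prompt}"
            ),
        }],
    )
    return response.choices[0].message.content
```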
To filter prompts based on scores and cluster thresholds:
```
python filter.py --conversations_file [your jsonl file] --clusters_file [your json file] --prompt_threshold 6 --cluster_threshold 3
```
We also apply BenchBuilder to allenai/WildChat-1M to produce 250 high-quality prompts, Wild-Hard-250. We evaluate 10 of the 20 models outlined in the paper on Wild-Hard-250 and on a random sample of 250 prompts from the WildChat dataset, using GPT-4-Turbo as judge.
| | Wild-Hard-250 | Wild-Chat-Random |
|---|---|---|
| Spearman Correlation | 93.6 | 38.2 |
| Kendall Tau | 85.5 | 27.3 |
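These rank correlations measure how well the model ranking induced by each prompt set agrees with a reference ranking (see the paper for the exact setup). A sketch of how such correlations can be computed with SciPy; the score arrays below are placeholders, not real results:

```python
# Rank-correlation sketch; replace the placeholder scores with per-model scores
# from the benchmark and from the reference ranking.
from scipy import stats

benchmark_scores = [82.6, 74.9, 46.8, 44.0, 35.1]  # placeholder per-model benchmark scores
reference_scores = [1251, 1189, 1158, 1144, 1106]  # placeholder reference ratings

spearman = stats.spearmanr(benchmark_scores, reference_scores).correlation
kendall = stats.kendalltau(benchmark_scores, reference_scores).correlation
print(f"Spearman: {spearman:.3f}, Kendall tau: {kendall:.3f}")
```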