Conversation

lewtun (Member) commented on May 1, 2025

For GPQA Diamond and MATH-500, it is customary to report pass@1 instead of raw accuracy. In this PR, I've:

  • Replaced the vanilla accuracy metric with pass@1, where n was chosen to give ~2k total completions in order to mitigate variance (see the sketch below the plot). For a 7B model this takes about 15 minutes on 1 node with DP=8, which seems like a reasonable trade-off between variance and compute.
  • Updated AIME24/25 to use pass@1 with n=64. I realise I originally down-scaled this to n=32, but after reading the Phi-4-reasoning tech report I think it's better to increase it to mitigate variance as much as possible, since AIME24 is often used for model selection during training. (See the plot below, where they show that pass@1 variance can be rather large across repeated runs when the sampling temperature is medium/high, as is common for LRMs.)

[Plot from the Phi-4-reasoning tech report showing pass@1 variance across repeated runs]
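
For reference, a minimal sketch of how pass@1 is estimated from n sampled completions per problem (illustrative only, not the lighteval implementation; the choice of n=8 for GPQA Diamond below is hypothetical):

# For k=1 the unbiased pass@k estimator reduces to the fraction of correct
# completions out of n, averaged over problems. Sampling n > 1 completions per
# problem lowers the variance of the estimate without changing its expected value.
def pass_at_1(correct_per_problem: list[int], n: int) -> float:
    """correct_per_problem[i] = number of correct completions (out of n) for problem i."""
    return sum(c / n for c in correct_per_problem) / len(correct_per_problem)

# GPQA Diamond has 198 questions, so n=8 already gives ~1.6k total completions,
# in the ballpark of the ~2k budget mentioned above (hypothetical n, for illustration).
print(pass_at_1([3, 0, 8], n=8))  # ≈ 0.458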

For GPQA, I've checked that my pass@1 (n=1) metric is equivalent to the old accuracy-based one (as expected) for Qwen2.5-7B-Instruct:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| all | | gpqa_pass@1:1_samples | 0.3384 | ± 0.0337 |
| | | extractive_match | 0.3384 | ± 0.0337 |
| lighteval:gpqa:diamond:0 | 0 | gpqa_pass@1:1_samples | 0.3384 | ± 0.0337 |
| | | extractive_match | 0.3384 | ± 0.0337 |

Please let me know if I should update any other benchmarks to align with these changes!

max_model_length: PositiveInt | None = None  # maximum length of the model, usually inferred automatically; reduce this if you encounter OOM issues, 4096 is usually enough
swap_space: PositiveInt = 4  # CPU swap space size (GiB) per GPU.
- seed: PositiveInt = 1234
+ seed: NonNegativeInt = 1234
lewtun (Member Author) commented:

Needed to allow seed=0 in the model args
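
A quick illustration of the difference (a minimal sketch, assuming the model args are validated with pydantic as the PositiveInt/NonNegativeInt annotations suggest; the class names are hypothetical):

from pydantic import BaseModel, NonNegativeInt, PositiveInt

class OldModelArgs(BaseModel):
    seed: PositiveInt = 1234  # requires seed > 0, so seed=0 is rejected

class NewModelArgs(BaseModel):
    seed: NonNegativeInt = 1234  # requires seed >= 0, so seed=0 is accepted

NewModelArgs(seed=0)  # validates fine
OldModelArgs(seed=0)  # raises a ValidationError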

few_shots_select=None,
generation_size=32768,
metric=[
-     Metrics.expr_gold_metric,
lewtun (Member Author) commented:

This was causing redundant computation since we can get the same result from pass@1 (n=1)
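
Concretely (a toy illustration, not lighteval code): with a single completion per problem, the pass@1 estimate is just mean exact-match accuracy, so keeping both metrics doubled the scoring work for identical numbers.

is_correct = [1, 0, 1, 1]  # hypothetical per-problem correctness with n=1 completion each
accuracy = sum(is_correct) / len(is_correct)                   # 0.75
pass_at_1 = sum(c / 1 for c in is_correct) / len(is_correct)   # 0.75, identical by construction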

@HuggingFaceDocBuilderDev (Collaborator) commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lewtun lewtun changed the title Add pass@1 for GPQA-D and clean up AIME Add pass@1 for GPQA-D and MATH-500 May 1, 2025
@lewtun lewtun requested review from NathanHB and clefourrier May 1, 2025 14:45

- def compute(self, golds: list[str], predictions: list[str], **kwargs) -> dict[str, float]:
+ def compute(
+     self, golds: list[str], predictions: list[str], formatted_doc: Doc = None, **kwargs
+ ) -> dict[str, float]:
lewtun (Member Author) commented:

I had to pass formatted_doc in order to enable pass@1 with GPQA
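
Roughly, the reason is that for a multiple-choice task like GPQA the gold answer letter has to be read off the formatted document rather than the raw golds/predictions. A minimal sketch of the shape of such a metric (not the actual lighteval code; the gold_index field and the answer parser below are assumptions):

import re

def extract_choice(completion: str) -> str:
    """Hypothetical parser: take the last standalone A-D letter in the completion."""
    letters = re.findall(r"\b([ABCD])\b", completion)
    return letters[-1] if letters else ""

def gpqa_pass_at_1(golds: list[str], predictions: list[str], formatted_doc=None, **kwargs) -> dict[str, float]:
    # The gold letter comes from the document (assumed gold_index attribute), not from golds.
    gold = golds[0] if formatted_doc is None else "ABCD"[formatted_doc.gold_index]
    correct = sum(extract_choice(p) == gold for p in predictions)
    return {"gpqa_pass@1": correct / len(predictions)}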

@lewtun lewtun merged commit d50bc30 into main May 5, 2025
5 checks passed
hynky1999 pushed a commit that referenced this pull request May 22, 2025
* Add pass@1 for GPQA-D and clean up AIME

* Add pass@1 for math_500

* Add pass@1 for MATH-500

* Update test

* Fix
NathanHB pushed a commit that referenced this pull request Sep 19, 2025
* Add pass@1 for GPQA-D and clean up AIME

* Add pass@1 for math_500

* Add pass@1 for MATH-500

* Update test

* Fix