Conversation

lewtun (Member) commented on May 1, 2025

For GPQA Diamond and MATH-500, it is customary to report pass@1 instead of raw accuracy. In this PR, I've:

  • Replaced the vanilla accuracy metric with pass@1, where n was chosen to give ~2k total completions in order to mitigate variance (see the sketch below the plot). For a 7B model this takes about 15 minutes on 1 node with DP=8, which seems like a reasonable trade-off between variance and compute.
  • Updated AIME24/25 to use pass@1 with n=64. I realise I originally down-scaled this to n=32, but after reading the Phi-4-reasoning tech report I think it's better to increase it to mitigate variance as much as possible, since AIME24 is often used for model selection during training. (See the plot below, where they show that pass@1 variance can be rather large across repeated runs when the sampling temperature is medium/high, as is common for LRMs.)

[Plot from the Phi-4-reasoning tech report showing pass@1 variance across repeated runs]
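
For reference, a minimal sketch of how pass@1 is estimated from n sampled completions per problem (illustrative only, not the lighteval implementation; the choice of n=8 for GPQA Diamond below is hypothetical):

# For k=1 the unbiased pass@k estimator reduces to the fraction of correct
# completions out of n, averaged over problems. Sampling n > 1 completions per
# problem lowers the variance of the estimate without changing its expected value.
def pass_at_1(correct_per_problem: list[int], n: int) -> float:
    """correct_per_problem[i] = number of correct completions (out of n) for problem i."""
    return sum(c / n for c in correct_per_problem) / len(correct_per_problem)

# GPQA Diamond has 198 questions, so n=8 already gives ~1.6k total completions,
# in the ballpark of the ~2k budget mentioned above (hypothetical n, for illustration).
print(pass_at_1([3, 0, 8], n=8))  # ≈ 0.458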

For GPQA, I've checked that my pass@1 (n=1) metric is equivalent to the old accuracy-based one (as expected) for Qwen2.5-7B-Instruct:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| all | | gpqa_pass@1:1_samples | 0.3384 | ± 0.0337 |
| | | extractive_match | 0.3384 | ± 0.0337 |
| lighteval:gpqa:diamond:0 | 0 | gpqa_pass@1:1_samples | 0.3384 | ± 0.0337 |
| | | extractive_match | 0.3384 | ± 0.0337 |

Please let me know if I should update any other benchmarks to align with these changes!

max_model_length: PositiveInt | None = None  # maximum length of the model, usually inferred automatically; reduce this if you encounter OOM issues, 4096 is usually enough
swap_space: PositiveInt = 4  # CPU swap space size (GiB) per GPU.
- seed: PositiveInt = 1234
+ seed: NonNegativeInt = 1234
lewtun (Member Author) commented:

Needed to allow seed=0 in the model args
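
A quick illustration of the difference (a minimal sketch, assuming the model args are validated with pydantic as the PositiveInt/NonNegativeInt annotations suggest; the class names are hypothetical):

from pydantic import BaseModel, NonNegativeInt, PositiveInt

class OldModelArgs(BaseModel):
    seed: PositiveInt = 1234  # requires seed > 0, so seed=0 is rejected

class NewModelArgs(BaseModel):
    seed: NonNegativeInt = 1234  # requires seed >= 0, so seed=0 is accepted

NewModelArgs(seed=0)  # validates fine
OldModelArgs(seed=0)  # raises a ValidationError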

few_shots_select=None,
generation_size=32768,
metric=[
-     Metrics.expr_gold_metric,
lewtun (Member Author) commented:

This was causing redundant computation since we can get the same result from pass@1 (n=1)
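
Concretely (a toy illustration, not lighteval code): with a single completion per problem, the pass@1 estimate is just mean exact-match accuracy, so keeping both metrics doubled the scoring work for identical numbers.

is_correct = [1, 0, 1, 1]  # hypothetical per-problem correctness with n=1 completion each
accuracy = sum(is_correct) / len(is_correct)                   # 0.75
pass_at_1 = sum(c / 1 for c in is_correct) / len(is_correct)   # 0.75, identical by construction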

@HuggingFaceDocBuilderDev (Collaborator) commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lewtun lewtun changed the title Add pass@1 for GPQA-D and clean up AIME Add pass@1 for GPQA-D and MATH-500 May 1, 2025
@lewtun lewtun requested review from NathanHB and clefourrier May 1, 2025 14:45

- def compute(self, golds: list[str], predictions: list[str], **kwargs) -> dict[str, float]:
+ def compute(
+     self, golds: list[str], predictions: list[str], formatted_doc: Doc = None, **kwargs
+ ) -> dict[str, float]:
lewtun (Member Author) commented:

I had to pass formatted_doc in order to enable pass@1 with GPQA
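
Roughly, the reason is that for a multiple-choice task like GPQA the gold answer letter has to be read off the formatted document rather than the raw golds/predictions. A minimal sketch of the shape of such a metric (not the actual lighteval code; the gold_index field and the answer parser below are assumptions):

import re

def extract_choice(completion: str) -> str:
    """Hypothetical parser: take the last standalone A-D letter in the completion."""
    letters = re.findall(r"\b([ABCD])\b", completion)
    return letters[-1] if letters else ""

def gpqa_pass_at_1(golds: list[str], predictions: list[str], formatted_doc=None, **kwargs) -> dict[str, float]:
    # The gold letter comes from the document (assumed gold_index attribute), not from golds.
    gold = golds[0] if formatted_doc is None else "ABCD"[formatted_doc.gold_index]
    correct = sum(extract_choice(p) == gold for p in predictions)
    return {"gpqa_pass@1": correct / len(predictions)}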

@lewtun lewtun merged commit d50bc30 into main May 5, 2025
5 checks passed
hynky1999 pushed a commit that referenced this pull request May 22, 2025
* Add pass@1 for GPQA-D and clean up AIME

* Add pass@1 for math_500

* Add pass@1 for MATH-500

* Update test

* Fix
NathanHB pushed a commit that referenced this pull request Sep 19, 2025
* Add pass@1 for GPQA-D and clean up AIME

* Add pass@1 for math_500

* Add pass@1 for MATH-500

* Update test

* Fix