Conversation

@lewtun (Member) commented Apr 7, 2025

Currently we estimate pass@1 from several values of n (4, 8, 16, 32, 64). Since large values of n are expensive to run, I compared the variance across various reasoning models on AIME24 using 10 different seeds. As shown in the figures below, n=16 is a good compromise between compute and variance, with roughly 1 percentage point of variance across runs.
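For reference, a minimal sketch of the estimator in question: pass@1 from n sampled completions is the per-problem fraction of correct completions, averaged over problems, and its run-to-run spread shrinks as n grows. The per-problem solve rates below are made up purely to illustrate the variance behaviour, not real AIME24 numbers.

import numpy as np

def pass_at_1(is_correct: np.ndarray) -> float:
    # is_correct has shape (num_problems, n); entry [i, j] is True if
    # completion j for problem i was judged correct.
    # For k=1 the unbiased pass@k estimator reduces to the per-problem
    # fraction of correct completions, averaged over problems.
    return float(is_correct.mean(axis=1).mean())

# Illustration with hypothetical per-problem solve rates (30 problems, as in AIME24)
rng = np.random.default_rng(0)
true_p = rng.uniform(0.2, 0.8, size=30)
for n in (1, 4, 16, 64):
    runs = [pass_at_1(rng.random((30, n)) < true_p[:, None]) for _ in range(10)]
    print(f"n={n:2d}  std across 10 simulated runs: {np.std(runs):.3f}")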

Questions:

  • I removed math_pass_at_1_4n and math_pass_at_1_8n because I wasn't sure whether they are computed independently of math_pass_at_1_16n or obtained via subsampling. If the former, I propose removing them to avoid redundant computation.

Figures: pass@1 vs. number of samples n on AIME24 (10 seeds) for DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-1.5B.

@lewtun requested review from NathanHB and clefourrier on Apr 7, 2025 09:10
@HuggingFaceDocBuilderDev (Collaborator)

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@clefourrier (Member) left a comment

Thanks for the PR :)

@clefourrier (Member)

Btw, pass@1 at n-X should be computed using subsamples of pass@1 at n

@lewtun (Member, Author) commented Apr 7, 2025

Btw, pass@1 at n-X should be computed using subsamples of pass@1 at n

Ah cool, so I can safely use the n=4 and n=8 metrics without requiring new generations? I can add them back if that's the case
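If that's the case, something along these lines would recover pass@1 for smaller sample counts from the generations we already have, with no extra inference. This is only a rough sketch of the subsampling idea, not lighteval's actual implementation; is_correct is a hypothetical per-problem correctness matrix of shape (num_problems, n).

import numpy as np

def pass_at_1_from_subsamples(is_correct: np.ndarray, k: int,
                              num_draws: int = 100, seed: int = 0) -> float:
    # Estimate pass@1 with k samples by repeatedly drawing k of the n
    # completions that were already generated for each problem.
    rng = np.random.default_rng(seed)
    _, n = is_correct.shape
    estimates = []
    for _ in range(num_draws):
        idx = rng.choice(n, size=k, replace=False)   # reuse existing generations
        estimates.append(is_correct[:, idx].mean())  # pass@1 on the subsample
    return float(np.mean(estimates))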

@lewtun (Member, Author) commented Apr 7, 2025

It also seems the tests are failing for reasons unrelated to this PR

@clefourrier (Member)

Re running the tests, looks like a network error

@lewtun merged commit f3639c6 into main on Apr 7, 2025 (4 checks passed)
@CurryxIaoHu

When I run the following command to evaluate the performance of R1-Distill-Qwen-7B:

MODEL=/dev/shm/yifan2/R1-Distill-Qwen-7B

MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.95,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
TASK="lighteval|aime24|0|0" 
OUTPUT_DIR=...

lighteval vllm $MODEL_ARGS $TASK \
    --use-chat-template \
    --save-details \
    --output-dir $OUTPUT_DIR

I get:

|       Task       |Version|        Metric        |Value |   |Stderr|
|------------------|------:|----------------------|-----:|---|-----:|
|lighteval:aime24:0|       |math_pass@1:1_samples |0.5333|±  |0.0926|
|                  |       |math_pass@1:4_samples |0.5333|±  |0.0717|
|                  |       |math_pass@1:8_samples |0.5500|±  |0.0723|
|                  |       |math_pass@1:16_samples|0.2750|±  |0.0362|
|                  |       |math_pass@1:32_samples|0.1375|±  |0.0181|
|                  |       |math_pass@1:64_samples|0.0687|±  |0.0090|

It is weird that pass@1 suddenly drops sharply once n reaches 16. Is something wrong with my settings?

@lewtun (Member, Author) commented May 8, 2025

Hmm, that does look a bit odd @CurryxIaoHu, since I get the following, which is quite stable for $n \geq 16$:

|       Task       |Version|        Metric        |Value |   |Stderr|
|------------------|------:|----------------------|-----:|---|-----:|
|all               |       |math_pass@1:1_samples |0.4000|±  |0.0910|
|                  |       |math_pass@1:4_samples |0.4417|±  |0.0706|
|                  |       |math_pass@1:8_samples |0.4625|±  |0.0697|
|                  |       |math_pass@1:16_samples|0.4813|±  |0.0683|
|                  |       |math_pass@1:32_samples|0.5010|±  |0.0684|
|                  |       |math_pass@1:64_samples|0.5083|±  |0.0676|
|lighteval:aime24:0|      2|math_pass@1:1_samples |0.4000|±  |0.0910|
|                  |       |math_pass@1:4_samples |0.4417|±  |0.0706|
|                  |       |math_pass@1:8_samples |0.4625|±  |0.0697|
|                  |       |math_pass@1:16_samples|0.4813|±  |0.0683|
|                  |       |math_pass@1:32_samples|0.5010|±  |0.0684|
|                  |       |math_pass@1:64_samples|0.5083|±  |0.0676|

Command to repro:

lighteval vllm 'model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B,revision=main,trust_remote_code=False,dtype=bfloat16,data_parallel_size=8,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}' 'lighteval|aime24|0|0' --use-chat-template --output-dir eval_results/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/main/aime24 --save-details

Here's my env for reference:

- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.11.11
- TRL version: 0.18.0.dev0
- PyTorch version: 2.6.0
- CUDA device(s): NVIDIA H100 80GB HBM3
- Transformers version: 4.52.0.dev0
- Accelerate version: 1.4.0
- Accelerate config: not found
- Datasets version: 3.5.1
- HF Hub version: 0.30.2
- bitsandbytes version: 0.45.5
- DeepSpeed version: 0.16.7
- Diffusers version: not installed
- Liger-Kernel version: 0.5.8
- LLM-Blender version: not installed
- OpenAI version: 1.76.2
- PEFT version: not installed
- vLLM version: 0.8.4
- lighteval version: lighteval @ git+https://github.com/huggingface/lighteval.git@d50bc3072b8814656633400a1850c500c6aa2e39

@Cppowboy (Contributor)

When I run the following command to evaluate the performance of R1-Distill-Qwen-7B:

MODEL=/dev/shm/yifan2/R1-Distill-Qwen-7B

MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.95,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
TASK="lighteval|aime24|0|0" 
OUTPUT_DIR=...

lighteval vllm $MODEL_ARGS $TASK \
    --use-chat-template \
    --save-details \
    --output-dir $OUTPUT_DIR

I get:

|       Task       |Version|        Metric        |Value |   |Stderr|
|------------------|------:|----------------------|-----:|---|-----:|
|lighteval:aime24:0|       |math_pass@1:1_samples |0.5333|±  |0.0926|
|                  |       |math_pass@1:4_samples |0.5333|±  |0.0717|
|                  |       |math_pass@1:8_samples |0.5500|±  |0.0723|
|                  |       |math_pass@1:16_samples|0.2750|±  |0.0362|
|                  |       |math_pass@1:32_samples|0.1375|±  |0.0181|
|                  |       |math_pass@1:64_samples|0.0687|±  |0.0090|

It is weird that pass@1 suddenly drops sharply once n reaches 16. Is something wrong with my settings?

Same problem.

|       Task       |Version|        Metric        |Value |   |Stderr|
|------------------|------:|----------------------|-----:|---|-----:|
|all               |       |math_pass@1:1_samples |0.4000|±  |0.0910|
|                  |       |math_pass@1:4_samples |0.2833|±  |0.0643|
|                  |       |math_pass@1:8_samples |0.3000|±  |0.0653|
|                  |       |math_pass@1:16_samples|0.3187|±  |0.0604|
|                  |       |math_pass@1:32_samples|0.3198|±  |0.0595|
|                  |       |math_pass@1:64_samples|0.3182|±  |0.0596|
|lighteval:aime24:0|      2|math_pass@1:1_samples |0.4000|±  |0.0910|
|                  |       |math_pass@1:4_samples |0.2833|±  |0.0643|
|                  |       |math_pass@1:8_samples |0.3000|±  |0.0653|
|                  |       |math_pass@1:16_samples|0.3187|±  |0.0604|
|                  |       |math_pass@1:32_samples|0.3198|±  |0.0595|
|                  |       |math_pass@1:64_samples|0.3182|±  |0.0596|

Any solutions?

I guess that you cannot guarantee pass@1:1_samples <= pass@1:2_samples, but you can guarantee that pass@1:64_samples <= pass@2:64_samples.
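That matches the standard unbiased pass@k estimator (Chen et al., 2021), which is monotone in k for a fixed set of n generations, whereas pass@1 estimated from different numbers of samples has the same expectation for every n and only differs by sampling noise. A small sketch of that formula (whether lighteval computes it exactly this way is an assumption on my part):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k completions drawn without
    # replacement from the n generated ones is correct, given c correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=64 generations and c=20 correct on a problem, pass@k grows with k:
print(pass_at_k(64, 20, 1))  # 0.3125 (= c/n), this problem's pass@1:64_samples contribution
print(pass_at_k(64, 20, 2))  # ~0.531, so pass@1:64_samples <= pass@2:64_samples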

@lewtun (Member, Author) commented May 16, 2025

Hi @Cppowboy, your results actually look pretty stable to me! With pass@1:1_samples the variance is very large due to sampling (easily ±10 points), and as n increases the results converge to ~0.3 accuracy.

When you say "same problem" are you referring to the variance or to the fact that your model appears to have much lower accuracy than the ~50% I obtained for deepseek-ai/DeepSeek-R1-Distill-Qwen-7B?

If the latter, my suggestion would be to push the details to the Hub as a dataset so that we can inspect the model outputs in case something is off on the vLLM side.

hynky1999 pushed a commit that referenced this pull request May 22, 2025
* Use n=16 samples to estimate pass@1 for AIME benchmarks

* Remove other metrics
NathanHB pushed a commit that referenced this pull request Sep 19, 2025
* Use n=16 samples to estimate pass@1 for AIME benchmarks

* Remove other metrics