Conversation

@lewtun (Member) commented Apr 7, 2025

Currently we estimate pass@1 from several values of n (4, 8, 16, 32, 64). Since large values of n are expensive to run, I compared the variance across various reasoning models on AIME24 using 10 different seeds. As shown in the figures below, n=16 is a good compromise between compute and variance, with roughly 1 percentage point of variance across runs.
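For reference, a minimal sketch of the estimator in question: pass@1 from n sampled completions is the per-problem fraction of correct completions, averaged over problems, and its run-to-run spread shrinks as n grows. The per-problem solve rates below are made up purely to illustrate the variance behaviour, not real AIME24 numbers.

import numpy as np

def pass_at_1(is_correct: np.ndarray) -> float:
    # is_correct has shape (num_problems, n); entry [i, j] is True if
    # completion j for problem i was judged correct.
    # For k=1 the unbiased pass@k estimator reduces to the per-problem
    # fraction of correct completions, averaged over problems.
    return float(is_correct.mean(axis=1).mean())

# Illustration with hypothetical per-problem solve rates (30 problems, as in AIME24)
rng = np.random.default_rng(0)
true_p = rng.uniform(0.2, 0.8, size=30)
for n in (1, 4, 16, 64):
    runs = [pass_at_1(rng.random((30, n)) < true_p[:, None]) for _ in range(10)]
    print(f"n={n:2d}  std across 10 simulated runs: {np.std(runs):.3f}")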

Questions:

  • I removed math_pass_at_1_4n and math_pass_at_1_8n because I wasn't sure whether they are computed independently of math_pass_at_1_16n or obtained via subsampling. If the former, I propose removing them to avoid redundant computation.

Figures: pass@1 vs. number of samples n on AIME24 (10 seeds) for DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-1.5B.

@lewtun requested review from NathanHB and clefourrier on Apr 7, 2025 09:10
@HuggingFaceDocBuilderDev (Collaborator)

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@clefourrier (Member) left a comment

Thanks for the PR :)

@clefourrier (Member)

Btw, pass@1 at n-X should be computed using subsamples of pass@1 at n

@lewtun (Member, Author) commented Apr 7, 2025

Btw, pass@1 at n-X should be computed using subsamples of pass@1 at n

Ah cool, so I can safely use the n=4 and n=8 metrics without requiring new generations? I can add them back if that's the case
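If that's the case, something along these lines would recover pass@1 for smaller sample counts from the generations we already have, with no extra inference. This is only a rough sketch of the subsampling idea, not lighteval's actual implementation; is_correct is a hypothetical per-problem correctness matrix of shape (num_problems, n).

import numpy as np

def pass_at_1_from_subsamples(is_correct: np.ndarray, k: int,
                              num_draws: int = 100, seed: int = 0) -> float:
    # Estimate pass@1 with k samples by repeatedly drawing k of the n
    # completions that were already generated for each problem.
    rng = np.random.default_rng(seed)
    _, n = is_correct.shape
    estimates = []
    for _ in range(num_draws):
        idx = rng.choice(n, size=k, replace=False)   # reuse existing generations
        estimates.append(is_correct[:, idx].mean())  # pass@1 on the subsample
    return float(np.mean(estimates))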

@lewtun (Member, Author) commented Apr 7, 2025

It also seems the tests are failing for reasons unrelated to this PR

@clefourrier (Member)

Re running the tests, looks like a network error

@lewtun merged commit f3639c6 into main on Apr 7, 2025 (4 checks passed)
@CurryxIaoHu

When I run the following command to evaluate the performance of R1-Distill-Qwen-7B:

MODEL=/dev/shm/yifan2/R1-Distill-Qwen-7B

MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.95,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
TASK="lighteval|aime24|0|0" 
OUTPUT_DIR=...

lighteval vllm $MODEL_ARGS $TASK \
    --use-chat-template \
    --save-details \
    --output-dir $OUTPUT_DIR

I get:

|       Task       |Version|        Metric        |Value |   |Stderr|
|------------------|------:|----------------------|-----:|---|-----:|
|lighteval:aime24:0|       |math_pass@1:1_samples |0.5333|±  |0.0926|
|                  |       |math_pass@1:4_samples |0.5333|±  |0.0717|
|                  |       |math_pass@1:8_samples |0.5500|±  |0.0723|
|                  |       |math_pass@1:16_samples|0.2750|±  |0.0362|
|                  |       |math_pass@1:32_samples|0.1375|±  |0.0181|
|                  |       |math_pass@1:64_samples|0.0687|±  |0.0090|

It is weird that pass@1 suddenly drops sharply once n reaches 16. Is something wrong with my settings?

@lewtun (Member, Author) commented May 8, 2025

Hmm, that does look a bit odd @CurryxIaoHu, since I get the following, which is quite stable for $n \geq 16$:

|       Task       |Version|        Metric        |Value |   |Stderr|
|------------------|------:|----------------------|-----:|---|-----:|
|all               |       |math_pass@1:1_samples |0.4000|±  |0.0910|
|                  |       |math_pass@1:4_samples |0.4417|±  |0.0706|
|                  |       |math_pass@1:8_samples |0.4625|±  |0.0697|
|                  |       |math_pass@1:16_samples|0.4813|±  |0.0683|
|                  |       |math_pass@1:32_samples|0.5010|±  |0.0684|
|                  |       |math_pass@1:64_samples|0.5083|±  |0.0676|
|lighteval:aime24:0|      2|math_pass@1:1_samples |0.4000|±  |0.0910|
|                  |       |math_pass@1:4_samples |0.4417|±  |0.0706|
|                  |       |math_pass@1:8_samples |0.4625|±  |0.0697|
|                  |       |math_pass@1:16_samples|0.4813|±  |0.0683|
|                  |       |math_pass@1:32_samples|0.5010|±  |0.0684|
|                  |       |math_pass@1:64_samples|0.5083|±  |0.0676|

Command to repro:

lighteval vllm 'model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B,revision=main,trust_remote_code=False,dtype=bfloat16,data_parallel_size=8,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}' 'lighteval|aime24|0|0' --use-chat-template --output-dir eval_results/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/main/aime24 --save-details

Here's my env for reference:

- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.11.11
- TRL version: 0.18.0.dev0
- PyTorch version: 2.6.0
- CUDA device(s): NVIDIA H100 80GB HBM3
- Transformers version: 4.52.0.dev0
- Accelerate version: 1.4.0
- Accelerate config: not found
- Datasets version: 3.5.1
- HF Hub version: 0.30.2
- bitsandbytes version: 0.45.5
- DeepSpeed version: 0.16.7
- Diffusers version: not installed
- Liger-Kernel version: 0.5.8
- LLM-Blender version: not installed
- OpenAI version: 1.76.2
- PEFT version: not installed
- vLLM version: 0.8.4
- lighteval version: lighteval @ git+https://github.com/huggingface/lighteval.git@d50bc3072b8814656633400a1850c500c6aa2e39

@Cppowboy (Contributor)

When I run the following command to evaluate the performance of R1-Distill-Qwen-7B:

MODEL=/dev/shm/yifan2/R1-Distill-Qwen-7B

MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.95,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
TASK="lighteval|aime24|0|0" 
OUTPUT_DIR=...

lighteval vllm $MODEL_ARGS $TASK \
    --use-chat-template \
    --save-details \
    --output-dir $OUTPUT_DIR

I get:

|       Task       |Version|        Metric        |Value |   |Stderr|
|------------------|------:|----------------------|-----:|---|-----:|
|lighteval:aime24:0|       |math_pass@1:1_samples |0.5333|±  |0.0926|
|                  |       |math_pass@1:4_samples |0.5333|±  |0.0717|
|                  |       |math_pass@1:8_samples |0.5500|±  |0.0723|
|                  |       |math_pass@1:16_samples|0.2750|±  |0.0362|
|                  |       |math_pass@1:32_samples|0.1375|±  |0.0181|
|                  |       |math_pass@1:64_samples|0.0687|±  |0.0090|

It is weird that pass@1 suddenly drops sharply once n reaches 16. Is something wrong with my settings?

Same problem.

|       Task       |Version|        Metric        |Value |   |Stderr|
|------------------|------:|----------------------|-----:|---|-----:|
|all               |       |math_pass@1:1_samples |0.4000|±  |0.0910|
|                  |       |math_pass@1:4_samples |0.2833|±  |0.0643|
|                  |       |math_pass@1:8_samples |0.3000|±  |0.0653|
|                  |       |math_pass@1:16_samples|0.3187|±  |0.0604|
|                  |       |math_pass@1:32_samples|0.3198|±  |0.0595|
|                  |       |math_pass@1:64_samples|0.3182|±  |0.0596|
|lighteval:aime24:0|      2|math_pass@1:1_samples |0.4000|±  |0.0910|
|                  |       |math_pass@1:4_samples |0.2833|±  |0.0643|
|                  |       |math_pass@1:8_samples |0.3000|±  |0.0653|
|                  |       |math_pass@1:16_samples|0.3187|±  |0.0604|
|                  |       |math_pass@1:32_samples|0.3198|±  |0.0595|
|                  |       |math_pass@1:64_samples|0.3182|±  |0.0596|

Any solutions?

I guess that you cannot guarantee pass@1:1_samples <= pass@1:2_samples, but you can guarantee that pass@1:64_samples <= pass@2:64_samples.
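That matches the standard unbiased pass@k estimator (Chen et al., 2021), which is monotone in k for a fixed set of n generations, whereas pass@1 estimated from different numbers of samples has the same expectation for every n and only differs by sampling noise. A small sketch of that formula (whether lighteval computes it exactly this way is an assumption on my part):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k completions drawn without
    # replacement from the n generated ones is correct, given c correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=64 generations and c=20 correct on a problem, pass@k grows with k:
print(pass_at_k(64, 20, 1))  # 0.3125 (= c/n), this problem's pass@1:64_samples contribution
print(pass_at_k(64, 20, 2))  # ~0.531, so pass@1:64_samples <= pass@2:64_samples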

@lewtun (Member, Author) commented May 16, 2025

Hi @Cppowboy, your results actually look pretty stable to me! With pass@1:1_samples the variance is very large due to sampling (easily ±10 points), and as n increases the results converge to ~0.3 accuracy.

When you say "same problem" are you referring to the variance or to the fact that your model appears to have much lower accuracy than the ~50% I obtained for deepseek-ai/DeepSeek-R1-Distill-Qwen-7B?

If the latter, my suggestion would be to push the details to the Hub as a dataset so that we can inspect the model outputs in case something is off on the vLLM side.

hynky1999 pushed a commit that referenced this pull request May 22, 2025
* Use n=16 samples to estimate pass@1 for AIME benchmarks

* Remove other metrics
NathanHB pushed a commit that referenced this pull request Sep 19, 2025
* Use n=16 samples to estimate pass@1 for AIME benchmarks

* Remove other metrics