Use n=16 samples to estimate pass@1 for AIME benchmarks #661
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for the PR :)
Btw, pass@1 at n-X should be computed using subsamples of pass@1 at n.
Ah cool, so I can safely use the subsampled metrics then.
It also seems the tests are failing for reasons unrelated to this PR.
Re-running the tests; looks like a network error.
When I run the following command to evaluate the performance of R1-Distill-Qwen-7B:
I get:
It is strange that pass@1 suddenly drops sharply when n reaches 16. Is there something wrong with my settings?
Hmm, that does look a bit odd @CurryxIaoHu, since I get the following, which is quite stable:
Command to repro:
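For reference, an invocation along these lines (modelled on the open-r1 README; the model name, sampling arguments, and task path are assumptions, not necessarily the exact command used here) reproduces this kind of eval:

```shell
# Hypothetical repro sketch based on the open-r1 evaluation docs.
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8"

lighteval vllm $MODEL_ARGS "custom|aime24|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir data/evals/$MODEL
```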
Here's my env for reference:
I guess that you cannot guarantee `pass@1:1_samples <= pass@1:2_samples`, but you can guarantee that `pass@1:64_samples <= pass@2:64_samples`.
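That monotonicity claim follows from the standard unbiased pass@k estimator of Chen et al. (2021); here is a minimal sketch of that formula (not necessarily lighteval's exact code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c correct (Chen et al., 2021):
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For a fixed pool of n=64 samples, pass@k is monotone in k:
assert pass_at_k(64, 20, 1) <= pass_at_k(64, 20, 2)

# By contrast, pass@1 from n=1 vs n=2 samples are two independent random
# estimates of the same mean, so neither is guaranteed to be <= the other
# on any particular run.
```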
Hi @Cppowboy, your results actually look pretty stable to me! When you say "same problem", are you referring to the variance or to the fact that your model appears to have much lower accuracy than the ~50% I obtained for deepseek-ai/DeepSeek-R1-Distill-Qwen-7B? If the latter, my suggestion would be to push the details to the Hub as a dataset so that we can inspect the model outputs in case there's something off there.
* Use n=16 samples to estimate pass@1 for AIME benchmarks
* Remove other metrics
Currently we estimate `pass@1` from various values of `n` in the range `[4, 8, 16, 32, 64]`. Since large values of `n` are expensive to run, I compared the variance across various reasoning models on AIME24 using 10 different seeds. As shown in the figures below, we can take `n=16` as a good compromise between compute and variance, with about a 1 percentage point variance across runs.

Questions:

* I kept `math_pass_at_1_4n` and `math_pass_at_1_8n` because I wasn't sure whether these are computed independently of `math_pass_at_1_16n` or obtained via subsampling. If the former, I propose removing them to avoid redundant computation.
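For context on the subsampling option, here is a sketch of how the smaller-`n` estimates could be derived from the existing `n=16` generations rather than from fresh completions (the helper below is hypothetical, not lighteval's actual implementation):

```python
import random

def pass_at_1_from_subsamples(correct: list[bool], m: int,
                              trials: int = 1_000, seed: int = 0) -> float:
    """Estimate pass@1 using size-m subsets of existing generations,
    instead of generating m fresh completions per problem."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        subset = rng.sample(correct, m)  # subsample without replacement
        total += sum(subset) / m         # pass@1 on the subset
    return total / trials

# Correctness of 16 generations for one problem (illustrative data).
correct_16 = [True] * 7 + [False] * 9

# Both subsampled estimates agree in expectation with the full-sample
# mean (7/16 = 0.4375).
print(pass_at_1_from_subsamples(correct_16, 4))
print(pass_at_1_from_subsamples(correct_16, 8))
```

Since the subsampled pass@1 has the same expectation as the full-sample mean, computing the 4n/8n variants from independent generations would indeed be redundant.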