Add pass@1 for GPQA-D and MATH-500 #698
Conversation
max_model_length: PositiveInt | None = None  # maximum length of the model, usually inferred automatically. Reduce this if you encounter OOM issues; 4096 is usually enough.
swap_space: PositiveInt = 4  # CPU swap space size (GiB) per GPU.
- seed: PositiveInt = 1234
+ seed: NonNegativeInt = 1234
Needed to allow seed=0 in the model args.
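For context, a minimal sketch of the difference, assuming the model args are a pydantic model as the type annotations suggest (the class names here are stand-ins, not the actual lighteval classes):

```python
from pydantic import BaseModel, NonNegativeInt, PositiveInt, ValidationError


class ModelArgs(BaseModel):  # hypothetical stand-in for the real model-args class
    swap_space: PositiveInt = 4  # CPU swap space size (GiB) per GPU
    seed: NonNegativeInt = 1234  # NonNegativeInt accepts 0, PositiveInt does not


print(ModelArgs(seed=0))  # seed=0 now passes validation

try:
    class OldModelArgs(BaseModel):
        seed: PositiveInt = 1234

    OldModelArgs(seed=0)  # PositiveInt requires seed > 0
except ValidationError as err:
    print("seed=0 rejected by PositiveInt:", err)
```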
few_shots_select=None,
generation_size=32768,
metric=[
    Metrics.expr_gold_metric,
This was causing redundant computation, since we can get the same result from pass@1 (n=1).
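To make the redundancy concrete, here is a sketch of the standard unbiased pass@k estimator (from the Codex paper), not necessarily lighteval's exact code path; with a single sample per problem it collapses to plain 0/1 correctness, i.e. the same average as the old exact-match accuracy:

```python
# Sketch of the unbiased pass@k estimator (Chen et al., 2021); not lighteval's
# exact implementation. With n = k = 1 it reduces to plain accuracy, which is
# why keeping a separate exact-match metric would be redundant.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples per problem, c = correct samples, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One sample per problem: pass@1 is 1.0 if that sample is correct, else 0.0,
# so averaging over problems gives exactly the old accuracy metric.
assert pass_at_k(n=1, c=1, k=1) == 1.0
assert pass_at_k(n=1, c=0, k=1) == 0.0
```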
- def compute(self, golds: list[str], predictions: list[str], **kwargs) -> dict[str, float]:
+ def compute(
+     self, golds: list[str], predictions: list[str], formatted_doc: Doc = None, **kwargs
I had to pass formatted_doc in order to enable pass@1 with GPQA.
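Roughly what that looks like, as a hedged sketch rather than the actual lighteval implementation; the Doc fields used here (choices, gold_index) and the score_fn hook are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Doc:  # simplified stand-in for lighteval's Doc
    query: str
    choices: list[str] = field(default_factory=list)
    gold_index: int = 0


class SamplingMetric:  # hypothetical pass@1-style metric
    def __init__(self, score_fn: Callable[[str, str], float]):
        self.score_fn = score_fn  # e.g. an answer-extraction + match function

    def compute(
        self, golds: list[str], predictions: list[str], formatted_doc: Doc = None, **kwargs
    ) -> dict[str, float]:
        if formatted_doc is not None and formatted_doc.choices:
            # For multiple-choice tasks like GPQA-D, the gold answer lives on the doc.
            golds = [formatted_doc.choices[formatted_doc.gold_index]]
        # Score each sampled completion against the gold(s) and average -> pass@1.
        scores = [max(self.score_fn(g, p) for g in golds) for p in predictions]
        return {"pass@1": sum(scores) / len(predictions)}
```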
* Add pass@1 for GPQA-D and clean up AIME
* Add pass@1 for math_500
* Add pass@1 for MATH-500
* Update test
* Fix
For GPQA Diamond and MATH-500, it is customary to report pass@1 instead of raw accuracy. In this PR, I've:

* Added pass@1 for GPQA Diamond and MATH-500, where n was chosen to have ~2k total completions to mitigate variance. For a 7B model this takes about 15 minutes on 1 node with DP=8, which seems like a reasonable tradeoff in variance vs compute.
* Set n=64 for AIME. I realise I down-scaled this originally to n=32, but after reading the Phi-4-reasoning tech report I think it's better to increase n to mitigate variance as much as possible, since AIME24 is often used for model selection during training. (Their plots show that the pass@1 variance can be rather large across repeated runs if T is medium/high, as is common for LRMs.)

For GPQA, I've checked that my pass@1 (n=1) metric is equivalent to the old accuracy-based one (as expected) for Qwen2.5-7B-Instruct.

Please let me know if I should update any other benchmarks to align with these changes!
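For the completion budget mentioned above, a back-of-the-envelope sketch; the dataset sizes are public (AIME24: 30 problems, MATH-500: 500, GPQA Diamond: 198), but the n values other than AIME's n=64 are illustrative assumptions, not necessarily the ones used in this PR:

```python
def benchmark_pass_at_1(per_problem_correct: list[list[bool]]) -> float:
    """Average the per-problem pass@1, i.e. the fraction of the n samples that are correct."""
    per_problem = [sum(flags) / len(flags) for flags in per_problem_correct]
    return sum(per_problem) / len(per_problem)


# Rough arithmetic for landing near ~2k total completions per benchmark.
for name, num_problems, n in [("AIME24", 30, 64), ("MATH-500", 500, 4), ("GPQA-Diamond", 198, 8)]:
    print(f"{name}: {num_problems} problems x n={n} = {num_problems * n} completions")
```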