Add optional r1-style thinking reward #551
Conversation
Approved, but see two small requests.
```python
result = strict_format_reward_func([valid, invalid])
assert result == [1.0, 0.0], f"Multiple responses failed, got {result}"
print("✓ Multiple responses passed")


# debug code
if __name__ == "__main__":
```
@vwxyzjn let's add this to the tests check we have?
Also, let's make the scale of the reward set by a hyperparam / config? Otherwise we could get tricky reward-shaping issues.
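Roughly what folding the debug check into the test suite could look like; the import path, the r1-style tag format, and the example strings below are assumptions, not what's in the PR:

```python
# Sketch only: import path and the <think>/<answer> tag format are assumptions.
from open_instruct.ground_truth_utils import strict_format_reward_func


def test_strict_format_reward_func_mixed_responses():
    # Hypothetical examples; the exact strings the PR's regex accepts may differ.
    valid = "<think>\nsome reasoning\n</think>\n<answer>\n42\n</answer>\n"
    invalid = "an answer with no thinking tags"
    result = strict_format_reward_func([valid, invalid])
    assert result == [1.0, 0.0], f"Multiple responses failed, got {result}"
```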
open_instruct/ground_truth_utils.py
```python
@@ -139,7 +139,7 @@ def verify_flan_sample(model_output, ground_truth_answer):

def soft_format_reward_func(responses: list[str]) -> list[float]:
```
Would agree with Nathan that it feels like the reward weight should be a param (and maybe even the pattern you are looking for?), to help with tuning rewards in the future.
https://beaker.allen.ai/orgs/ai2/workspaces/tulu-3-dev/work/01JKA2SEKQ1E2QJN9SVSKGRKFQ?taskId=01JKA2SEKXC7BP2XB5KJC1C9TA&jobId=01JKA2SERKZ0JWH0HGFRETSYEX
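Something along these lines is what I have in mind; the parameter names and the default pattern are illustrative suggestions, not the current implementation:

```python
import re

# Illustrative sketch: reward_scale / pattern as kwargs are a suggestion,
# not what this PR currently implements.
def soft_format_reward_func(
    responses: list[str],
    reward_scale: float = 1.0,
    pattern: str = r"<think>.*?</think>\s*<answer>.*?</answer>",
) -> list[float]:
    """Give reward_scale to each response matching pattern, 0.0 otherwise."""
    compiled = re.compile(pattern, re.DOTALL)
    return [reward_scale if compiled.search(response) else 0.0 for response in responses]
```

That would let runs sweep the weight (and even the expected format) from the config instead of editing the reward function.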