Add optional r1-style thinking reward #551

vwxyzjn · 2025-02-05T03:37:02Z

https://beaker.allen.ai/orgs/ai2/workspaces/tulu-3-dev/work/01JKA2SEKQ1E2QJN9SVSKGRKFQ?taskId=01JKA2SEKXC7BP2XB5KJC1C9TA&jobId=01JKA2SERKZ0JWH0HGFRETSYEX

natolambert

Approved but see two small requests.

natolambert · 2025-02-05T18:53:32Z

open_instruct/ground_truth_utils.py

+    result = strict_format_reward_func([valid, invalid])
+    assert result == [1.0, 0.0], f"Multiple responses failed, got {result}"
+    print("✓ Multiple responses passed")
+

 # debug code
 if __name__ == "__main__":


@vwxyzjn let's add this to the test checks we have?
Also, let's make the reward scale configurable via a hyperparam / config? Otherwise we could run into tricky reward-shaping issues.

hamishivi · 2025-02-06T07:41:41Z

open_instruct/ground_truth_utils.py

@@ -139,7 +139,7 @@ def verify_flan_sample(model_output, ground_truth_answer):

 def soft_format_reward_func(responses: list[str]) -> list[float]:


Agree with Nathan that the reward weight should be a param (and maybe even the pattern you are looking for?), to help with tuning rewards in the future.
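A minimal sketch of what both review requests could look like, assuming the format reward is a regex check over each response. The name `format_reward_func`, the `reward_scale` and `pattern` parameters, and the default pattern are hypothetical and are not taken from this PR; the actual implementation in `open_instruct/ground_truth_utils.py` may differ.

```python
import re

# Hypothetical default: an R1-style "<think>...</think><answer>...</answer>" layout.
# The pattern actually used in the PR is not shown in this thread.
DEFAULT_PATTERN = r"^<think>.*?</think>\s*<answer>.*?</answer>$"


def format_reward_func(
    responses: list[str],
    reward_scale: float = 1.0,
    pattern: str = DEFAULT_PATTERN,
) -> list[float]:
    """Return reward_scale for each response that matches pattern, else 0.0."""
    # DOTALL lets "." span newlines, since reasoning text is usually multi-line.
    compiled = re.compile(pattern, re.DOTALL)
    return [reward_scale if compiled.match(r.strip()) else 0.0 for r in responses]


# The debug asserts from the diff, written so a test runner could collect them.
def test_multiple_responses():
    valid = "<think>2 + 2 = 4</think><answer>4</answer>"
    invalid = "The answer is 4."
    assert format_reward_func([valid, invalid]) == [1.0, 0.0]
    # Changing the scale only changes the reward for well-formatted responses.
    assert format_reward_func([valid, invalid], reward_scale=0.5) == [0.5, 0.0]


if __name__ == "__main__":
    test_multiple_responses()
```

Keeping the scale and pattern as arguments means they can be wired to config fields later without touching the matching logic.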

Add optional r1-style thinking reward

16da676

vwxyzjn requested review from natolambert and hamishivi February 5, 2025 03:37

vwxyzjn added 3 commits February 4, 2025 20:04

quick change

ee26489

test

818ed86

Merge branch 'main' into add-optional-format-reward

8e1e1d9

natolambert approved these changes Feb 5, 2025

quick change

90355d1

hamishivi approved these changes Feb 6, 2025

vwxyzjn added 5 commits February 6, 2025 07:27

push latest change

4be929e

fix

79192aa

tested it to work

f45c51d

Push changes

f828fca

push

b02dcb2

vwxyzjn merged commit 1ff4692 into main Feb 6, 2025
3 checks passed
