Refactoring reward functions. Adding step by step reasoning reward. Adding test coverage for reward functions #144
Conversation
| """ | ||
|
|
||
| reward_funcs: list[str] = field( | ||
| default_factory=lambda: ["accuracy", "format"], |
I've left the default the same.
| "accuracy": accuracy_reward, | ||
| "format": format_reward, | ||
| "reasoning_steps": reasoning_steps_reward, | ||
| } |
Not exactly a fan of this "registry", but the list of reward functions is small, so I guess it works for now ¯\_(ツ)_/¯
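For reference, a minimal sketch of the registry shape being discussed. The function names come from the diff above; the module path and the `get_reward_funcs` helper are assumptions for illustration, not necessarily what the PR implements:

```python
# Minimal sketch of the "registry" pattern discussed above. The function
# names come from the PR's diff; the import path and lookup helper are
# hypothetical.
from collections.abc import Callable

from rewards import accuracy_reward, format_reward, reasoning_steps_reward  # assumed module path

REWARD_FUNCS_REGISTRY: dict[str, Callable] = {
    "accuracy": accuracy_reward,
    "format": format_reward,
    "reasoning_steps": reasoning_steps_reward,
}

def get_reward_funcs(names: list[str]) -> list[Callable]:
    """Resolve config strings such as ["accuracy", "format"] to callables."""
    return [REWARD_FUNCS_REGISTRY[name] for name in names]
```

A plain dict lookup keeps the config purely string-based, which is arguably good enough while the list of reward functions stays this small.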
```python
    reward = float(verify(answer_parsed, gold_parsed))
else:
    # If the gold solution is not parseable, we reward 1 to skip this example
    reward = 1.0
```
I don't follow: why do we give the same reward in the "non-parseable" case and the correct case?
The function was taken as is: link
But I speculate that:
- It's rare that the ground truth dataset is not parseable.
- The downstream GRPO trainer currently doesn't support ignoring the samples.
shouldn't we give it 0 weight instead then?
It's completely equivalent; the reward is normalized per group.
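To make the equivalence concrete, here is a small sketch of GRPO-style per-group normalization (not the trainer's actual code): any constant reward within a group, whether 1.0 or 0.0, normalizes to an all-zero advantage.

```python
# Sketch of per-group reward normalization (GRPO-style advantages).
# A constant reward within a group yields zero advantage either way.
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: (num_groups, group_size); each row holds one prompt's completions
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

ones = torch.ones(1, 4)    # unparseable gold: every completion rewarded 1.0
zeros = torch.zeros(1, 4)  # the alternative: every completion rewarded 0.0
assert torch.equal(group_advantages(ones), group_advantages(zeros))  # both all zeros
```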
Yes, I was just thinking of that.
I guess it's guaranteed that the group contains only the completions for that particular unparseable solution, right? Because if the group contained others, the normalization wouldn't work properly. Does that make sense?
Nevertheless, I can come back to it in the next PR.
I'd just made an issue about this (#159) and was reading this thread to see if it was related. A rare exception from Math-Verify stopped my training run; it needs to be caught and skipped as well (#158).
I thought it was odd because the rewards were like 0.19, 0.20, etc., so rewarding 1 for a pair where we couldn't find the gold solution, or where we got an error from Math-Verify, felt off. But if we're sure it works out, I'll close those.
> shouldn't we give it 0 weight instead then?
We wouldn't want to give it 0 weight either, since that could be a penalty or even a net reward depending on the other rewards in the group (0 might be the highest if they all have negative reward). If we don't know, we don't know; we can't put information there that isn't there, we're just diluting the signal.
We would actually want to omit them. It will take upstream changes to trl's GRPOTrainer, but this is necessary IMO. I've looked at it a few times and I'm confident it's not right. We should be able to return None from a reward_func and then just reshape the tensors to remove those examples.
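A rough sketch of what that could look like on the trainer side, assuming the hypothetical convention that a reward function returns None for unscorable samples (this is not trl's current API):

```python
# Hypothetical handling of None rewards before advantage computation.
# trl's GRPOTrainer does not support this today; the convention is assumed.
import torch

def drop_unscorable(rewards: list[float | None],
                    per_token_logps: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    keep = torch.tensor([r is not None for r in rewards])
    kept_rewards = torch.tensor([r for r in rewards if r is not None])
    # Drop the corresponding rows from any per-sample tensors as well.
    return kept_rewards, per_token_logps[keep]
```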
> We wouldn't want to give it 0 weight either, since that could be a penalty or even a net reward depending on the other rewards in the group
That's not right. Gold is shared within the group: if it's not parseable for one sample, then it's not parseable for the whole group.
Oh OK, I see now.
```python
class TestRewards(unittest.TestCase):
    def test_accuracy_reward_correct_answer(self):
        """Test accuracy_reward with a correct answer."""
        completion = [[{"content": r"\boxed{\frac{63}{400}}"}]]
```
Shouldn't the test be without the \boxed? Nowhere in the system prompt do we tell the model to use \boxed, so even if it gets the answer right with, say, <answer>\frac{63}{400}</answer>, shouldn't that be counted as correct?
Valid point; it's passed in as a system prompt argument.
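For completeness, a hypothetical test method along those lines, to sit inside the `TestRewards` class quoted above. It assumes `accuracy_reward(completions, solution)` returns a list of floats, as the existing tests suggest; whether the tag-only answer actually parses depends on how Math-Verify extraction is configured.

```python
# Hypothetical test: answer wrapped in <answer> tags instead of \boxed{}.
# Assumes accuracy_reward(completions, solution) -> list[float]; whether
# this passes depends on the Math-Verify extraction config.
def test_accuracy_reward_answer_tags(self):
    """accuracy_reward should also credit <answer>...</answer> answers."""
    completion = [[{"content": r"<answer>\frac{63}{400}</answer>"}]]
    solution = [r"\frac{63}{400}"]
    rewards = accuracy_reward(completion, solution)
    self.assertEqual(rewards[0], 1.0)
```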
@qgallouedec kindly tagging for review
Force-pushed from 57db2dc to f81cfab
@kashif @lewtun @edbeeching tagging other maintainers
LGTM, thanks for this refactor.
Sorry, minor ruff error, fixed it. @edbeeching
Refactoring reward functions. Adding step by step reasoning reward. Adding test coverage for reward functions
Force-pushed from 5225fe4 to c991fe0
Rebased because of the merge conflict
Refactoring reward functions. Adding step by step reasoning reward. Adding test coverage for reward functions (huggingface#144)
* Refactoring reward functions. Adding step by step reasoning reward. Adding test coverage for reward functions
* [Refactoring reward functions] - Ruff error fix
* [Refactoring reward functions] - Linting error fix
I've placed the reward functions into their own file to accommodate future expansion (code-specific reward functions, etc.). Added tests to make sure the reward functions work as intended.
Also, I've written an optional step-by-step reasoning encouragement reward function.
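For readers skimming the diff, a rough sketch of what such a step-counting reward can look like; the regex and the target of three steps are illustrative assumptions, not necessarily the PR's exact implementation.

```python
# Illustrative step-by-step reasoning reward: count step-like markers and
# give partial credit up to a target count. Pattern and target are assumptions.
import re

def reasoning_steps_reward(completions: list[list[dict]], **kwargs) -> list[float]:
    pattern = r"(?:Step \d+:|\n-|\nFirst,|\nSecond,|\nFinally,)"
    contents = [completion[0]["content"] for completion in completions]
    counts = [len(re.findall(pattern, content)) for content in contents]
    return [min(1.0, count / 3) for count in counts]  # full reward at 3+ markers
```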