Conversation

@zeenolife (Contributor):

I've placed the reward functions into their own file to accommodate future expansion (code-specific reward functions, etc.). I've added tests to make sure the reward functions are working as intended.
I've also written an optional step-by-step thinking encouragement function.
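
For context, a minimal sketch of what such a step-by-step reasoning reward could look like (the marker pattern and the threshold of three steps are illustrative assumptions, not necessarily this PR's exact implementation):

import re


def reasoning_steps_reward(completions, **kwargs):
    """Illustrative reward that encourages step-by-step reasoning.

    Counts step-like markers ("Step 1:", numbered items, transition words)
    in each completion and saturates at 1.0 once a few are present.
    """
    pattern = r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)"
    contents = [completion[0]["content"] for completion in completions]
    # Assumed threshold: full reward once three step markers are found.
    return [min(1.0, len(re.findall(pattern, content)) / 3) for content in contents]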

"""

reward_funcs: list[str] = field(
default_factory=lambda: ["accuracy", "format"],

Contributor Author:

I've left the default the same.

"accuracy": accuracy_reward,
"format": format_reward,
"reasoning_steps": reasoning_steps_reward,
}

Contributor Author:

Not exactly a fan of this "registry", but the list of reward functions is small, so I guess it works for now ¯\_(ツ)_/¯
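
For readers following along, a rough sketch of how a name-to-function registry like this is typically consumed (the module path and variable names are assumptions for illustration, not necessarily the PR's exact code):

# Assumes the refactored reward functions live in a module importable as
# `open_r1.rewards`; the registry maps config strings to callables.
from open_r1.rewards import accuracy_reward, format_reward, reasoning_steps_reward

REWARD_FUNCS_REGISTRY = {
    "accuracy": accuracy_reward,
    "format": format_reward,
    "reasoning_steps": reasoning_steps_reward,
}

# The `reward_funcs` list[str] from the script arguments above, e.g.
# ["accuracy", "format"], is resolved to callables before reaching the trainer.
selected_names = ["accuracy", "format"]
reward_funcs = [REWARD_FUNCS_REGISTRY[name] for name in selected_names]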

        reward = float(verify(answer_parsed, gold_parsed))
    else:
        # If the gold solution is not parseable, we reward 1 to skip this example
        reward = 1.0

Contributor:

I don't follow: why do we give the same reward in the "non-parseable" case and the correct case?

@zeenolife (Contributor Author), Feb 1, 2025:

The function was taken as is: link

But, I speculate that:

  1. It's rare that the ground truth dataset is not parseable.
  2. The downstream GRPO trainer currently doesn't support ignoring the samples.

Contributor:

shouldn't we give it 0 weight instead then?

Member:

It's completely equivalent, the reward is normalized per group.

Contributor Author:

Yes, I was just thinking about that.

I guess it is guaranteed that a group would contain only the completions for that particular unparseable solution, right? Because if the group contained others, the normalization wouldn't work properly. Does that make sense?

Contributor Author:

Nevertheless, I can come back to it in the next PR

@ctjlewis (Contributor), Feb 2, 2025:

I'd just made an issue about this (#159) and was reading this thread to see if it was related. A rare exception from Math-Verify stopped my training run; it needs to be caught and skipped as well (#158).

I thought it was odd because the rewards were around 0.19, 0.20, etc., so rewarding 1 for a pair where we couldn't find the gold solution, or where we got an error from Math-Verify, felt off. But if we're sure it works out, I'll close those.
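
For reference, a sketch of how the parse/verify call could be guarded so that a rare Math-Verify exception does not abort training. It mirrors the existing "neutral reward" behavior rather than omitting the sample; the function name and structure are illustrative, not this PR's code:

from math_verify import parse, verify


def safe_accuracy_reward(completions, solution, **kwargs):
    """Sketch: accuracy reward that catches Math-Verify errors instead of raising."""
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        try:
            gold_parsed = parse(sol)
            if len(gold_parsed) != 0:
                answer_parsed = parse(content)
                reward = float(verify(answer_parsed, gold_parsed))
            else:
                # Gold not parseable: same constant for the whole group, so the
                # group-normalized advantage for this example is zero.
                reward = 1.0
        except Exception:
            # Rare parser/verifier failure: neutralize instead of crashing the run.
            reward = 1.0
        rewards.append(reward)
    return rewards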

@ctjlewis (Contributor), Feb 4, 2025:

> shouldn't we give it 0 weight instead then?

We wouldn't want to give it 0 weight either, since that could be a penalty or even a net reward depending on the other rewards in the group (0 might be the highest if they all have negative reward). If we don't know, we don't know; we can't put information there that isn't there, we are just diluting the signal.

We would actually want to omit them. It will take upstream changes to trl's GRPOTrainer, but this is necessary IMO. I've looked at it a few times and I'm confident it's not right. We should be able to return None from a reward_func and then just reshape the tensors to remove those examples.
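
A rough illustration of that idea (a pure sketch, not trl's actual GRPOTrainer internals; the group layout and normalization here are assumptions):

import torch


def group_normalized_advantages(rewards, group_size):
    """Sketch: per-group normalized advantages that skip groups without signal.

    `rewards` is a flat list, one entry per completion, in blocks of
    `group_size` completions sharing a prompt. None means "no signal";
    the whole group is dropped rather than given a made-up reward.
    """
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        if any(r is None for r in group):
            advantages.extend([None] * group_size)  # omit from the loss later
            continue
        t = torch.tensor(group, dtype=torch.float32)
        advantages.extend(((t - t.mean()) / (t.std() + 1e-4)).tolist())
    return advantages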

@qgallouedec (Member), Feb 5, 2025:

> We wouldn't want to give it 0 weight either since that could be a penalty or even a net reward depending on the other rewards in the group

That's not right. Gold is shared in the group. If it's not parsable for one sample, then it's not parsable for the whole group.
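
Put differently: because the gold answer is shared by every completion in a group, an unparseable gold gives the whole group the same reward, and the group-normalized advantage is zero whatever constant is used. A tiny illustration (assuming mean/std normalization within the group):

rewards = [1.0, 1.0, 1.0, 1.0]  # whole group gets the same constant
mean = sum(rewards) / len(rewards)
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
advantages = [(r - mean) / (std + 1e-4) for r in rewards]
print(advantages)  # [0.0, 0.0, 0.0, 0.0], identical for any constant reward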

Contributor:

Oh OK, I see now.

class TestRewards(unittest.TestCase):
    def test_accuracy_reward_correct_answer(self):
        """Test accuracy_reward with a correct answer."""
        completion = [[{"content": r"\boxed{\frac{63}{400}}"}]]

Shouldn't the test be without the \boxed? Nowhere in the system prompt do we tell the model to use \boxed, so even if it gets the answer right with, say, <answer>\frac{63}{400}</answer>, shouldn't that be counted as correct?

Contributor Author:

Valid point; it's passed in as a system prompt argument.
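
(A hypothetical sketch of what "passed in as a system prompt argument" could look like; the class name, field name, and prompt text are illustrative, not the repository's actual values.)

from dataclasses import dataclass, field


@dataclass
class ScriptArgumentsSketch:
    # Illustrative only: a configurable system prompt that tells the model to
    # wrap its final answer in \boxed{...}, which accuracy_reward then parses.
    system_prompt: str = field(
        default="Reason step by step and put your final answer in \\boxed{}."
    )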

@zeenolife (Contributor Author):

@qgallouedec kindly tagging for the review

@zeenolife force-pushed the almaz/reward-func-refactoring branch from 57db2dc to f81cfab on February 5, 2025 at 12:54

@zeenolife (Contributor Author):

@kashif @lewtun @edbeeching tagging other maintainers

@edbeeching (Collaborator) left a comment:

LGTM, thanks for this refactor.

@zeenolife (Contributor Author):

Sorry, minor ruff error, fixed it. @edbeeching

@zeenolife force-pushed the almaz/reward-func-refactoring branch from 5225fe4 to c991fe0 on February 6, 2025 at 17:33

@zeenolife (Contributor Author):

Rebased because of the merge conflict

@kashif merged commit e8c2673 into huggingface:main on Feb 6, 2025
GitMonkey0 pushed a commit to GitMonkey0/open-r1 that referenced this pull request Feb 24, 2025
…dding test coverage for reward functions (huggingface#144)

* Refactoring reward functions. Adding step by step reasoning reward. Adding test coverage for reward functions

* [Refactoring reward functions] - Ruff error fix

* [Refactoring reward functions] - Linting error fix
