[Chatllama] Use upvotes in Stanford dataset as a measure for reward #224
Comments
Hey, I would like to contribute to this task. Do you already have any ideas on how to convert upvotes into rewards, or are you completely open to suggestions? I looked at the dataset a bit, and so far I have only come up with some simple ideas, but nothing too satisfying.
Hi @MattiaSangermano, thank you very much for reaching out! Feel free to propose any idea on the issue. I haven't thought about it yet, but I'm happy to contribute to the brainstorming 😄
Thank you @diegofiori, the simplest idea that comes to mind is to scale the upvote values between 0 and 5. I would perform the scaling by normalizing the upvotes of a response taking into account the "activity" of the individual post to which the response belongs. In this case, I would take the response with the most upvotes as an indicator of the post activity:

$$reward_p^i = \frac{score_p^i}{\min\left(\max_j(score_p^j),\ IQR_p\right)} \cdot 5$$

where $score_p^i$ is the number of upvotes of answer $i$ of post $p$, $\max_j(score_p^j)$ is the highest upvote count among the answers of post $p$, and $IQR_p$ is an outlier threshold computed from the upvote distribution of the post. Please let me know if I wasn't clear in my explanation.
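A minimal sketch of the normalization I have in mind, assuming `upvotes` holds the raw upvote counts of all answers belonging to one post (function and variable names are placeholders):

```python
import numpy as np

def post_rewards(upvotes, max_reward=5.0):
    """Scale the upvotes of the answers of a single post to [0, max_reward].

    The most upvoted answer is taken as the indicator of the post's activity,
    capped by an outlier threshold computed from the post's upvote
    distribution (the "IQR" term in the formula above).
    """
    scores = np.asarray(upvotes, dtype=float)
    q1, q3 = np.percentile(scores, [25, 75])
    outlier_threshold = q3 + 1.5 * (q3 - q1)
    activity = min(scores.max(), outlier_threshold)
    if activity <= 0:
        return np.zeros_like(scores)
    return max_reward * scores / activity
```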
I actually like the idea of using upvote quantiles to compute the reward. I just have a couple of questions about your reward function.
Not necessary, but I think using IQR to refer to the higher whisker was misleading. The higher whisker is $Q_3 + 1.5 \cdot IQR$, not the IQR itself.
Yes, you are right, we should also threshold the score: $score_p^i \leftarrow \min\left(score_p^i,\ Q_3 + 1.5 \cdot IQR_p\right)$, so that no reward exceeds the maximum value.
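In code, something along these lines (same assumptions as the sketch above; the clipping value is the upper whisker $Q_3 + 1.5 \cdot IQR$):

```python
import numpy as np

def post_rewards_clipped(upvotes, max_reward=5.0):
    """Like the earlier sketch, but the scores themselves are also clipped
    at the upper whisker, so no reward can exceed max_reward."""
    scores = np.asarray(upvotes, dtype=float)
    q1, q3 = np.percentile(scores, [25, 75])
    upper_whisker = q3 + 1.5 * (q3 - q1)
    clipped = np.minimum(scores, upper_whisker)   # threshold the score too
    activity = min(scores.max(), upper_whisker)
    if activity <= 0:
        return np.zeros_like(clipped)
    return max_reward * clipped / activity
```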
I am afraid that in this way we will create an inconsistent dataset. That is, we might have different answer pairs where the reward of one answer is lower than another just because, in the original dataset, it was paired with an answer with many upvotes. The Stanford dataset was constructed in a way that pairs the same response multiple times, leading to multiple rewards for that response. How can we combine these rewards?
I don't know if I understood correctly: you would like to create 5 or more rules, where the reward of an answer basically becomes the number of rules it passes, right? If so, it sounds very interesting.
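If that reading is correct, a toy sketch of what such a rule-based reward could look like (the specific rules below are made up purely for illustration):

```python
def rule_based_reward(answer: str, upvotes: int, post_max_upvotes: int) -> int:
    """Reward = number of simple quality rules the answer passes (0-5 here)."""
    rules = [
        upvotes > 0,                                    # got at least one upvote
        post_max_upvotes > 0
        and upvotes >= 0.5 * post_max_upvotes,          # among the better answers of its post
        len(answer.split()) >= 20,                      # not a one-liner
        len(answer) <= 4000,                            # not excessively long
        not answer.strip().endswith("?"),               # does not just ask a question back
    ]
    return sum(rules)
```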
@diegofiori any update?
Hi @MattiaSangermano, I see your point. I'm actually pretty curious to take a look at the implementation of the metric you proposed. Theoretically speaking, it makes sense to me. I'm curious to see some examples from the dataset with the related computed score; I think this is the only way to effectively validate the metric.
Perfect, I will work on it over the next few days and open a PR as soon as possible.
Description
Currently we support the following datasets:
However, we are not using all the information contained in these datasets:
The number of upvotes, for instance, could be used as a label for the reward model (after some normalisation) to judge the quality of an answer without asking a model or a human for feedback.
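For example, a rough sketch of how a normalized upvote count could be attached as the label of a reward-model training example (the field names below are placeholders, not necessarily the actual dataset schema):

```python
def to_reward_example(prompt: str, answer: str, upvotes: int,
                      post_max_upvotes: int, max_reward: float = 5.0) -> dict:
    """Build one reward-model training example labelled with normalised upvotes."""
    score = max_reward * upvotes / post_max_upvotes if post_max_upvotes > 0 else 0.0
    return {"user_input": prompt, "completion": answer, "score": score}
```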
Moreover, these datasets for training the reward models must be artificially augmented with high-quality negative examples, so that the reward model learns not only what is good but also what is not.
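How the negatives should be produced is left open here; one possible approach, sketched below as an assumption, is to pair a prompt with the completion of a different, unrelated example and label it with a low score:

```python
import random

def add_synthetic_negatives(examples: list, n_negatives: int = 1,
                            negative_score: float = 0.0, seed: int = 0) -> list:
    """Augment reward-model data with low-quality negatives built by pairing a
    prompt with the completion of a different, unrelated example."""
    rng = random.Random(seed)
    augmented = list(examples)
    for example in examples:
        for _ in range(n_negatives):
            other = rng.choice(examples)
            if other is example:
                continue  # skip accidental self-pairing
            augmented.append({
                "user_input": example["user_input"],
                "completion": other["completion"],  # mismatched answer
                "score": negative_score,            # labelled as a bad completion
            })
    return augmented
```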
Finally, a more thorough search for useful datasets to integrate should be carried out, to ensure that all open-source datasets relevant to the project are supported.
TODO