
[Chatllama] Use upvotes in Stanford dataset as a measure for reward #224

Open
5 tasks
diegofiori opened this issue Mar 8, 2023 · 8 comments
Labels
chatllama (Issue related to the ChatLLaMA module) · good first issue (Good for newcomers)

Comments

@diegofiori
Collaborator

diegofiori commented Mar 8, 2023

Description

Currently we support the following datasets:

However, we are not using all the information they contain:

  • rejected answers from Anthropic;
  • upvotes from the Stanford dataset.

The number of upvotes, for instance, could be used (after some normalisation) as a label for the reward model to judge the quality of an answer without asking a model or a human to provide feedback.
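As a minimal sketch (not a final design; the `upvotes` field name is just a placeholder), the conversion could log-scale the raw counts so that a few extremely popular answers don't dominate:

```python
import math

def upvotes_to_reward(upvotes: int, max_upvotes: int, max_reward: float = 5.0) -> float:
    """Map a raw upvote count to a bounded reward in [0, max_reward].

    Log-scaling keeps a handful of viral answers from compressing
    everything else towards zero.
    """
    if max_upvotes <= 0:
        return 0.0
    normalized = math.log1p(max(upvotes, 0)) / math.log1p(max_upvotes)
    return max_reward * normalized

# e.g. for a post whose most upvoted answer got 120 upvotes
print(upvotes_to_reward(12, 120))   # ~2.7
print(upvotes_to_reward(120, 120))  # 5.0
```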

Moreover, the datasets used for training the reward model must be artificially augmented with high-quality negative examples, so that the reward model learns not only what is good but also what is not.

Finally, a more thorough survey of potentially useful datasets should be carried out, to make sure we support all the open-source datasets that are relevant to the project.

TODO

  • Implement a conversion between upvotes and reward for the Stanford dataset.
  • Understand how rejected answers can be used to improve the quality of the reward model dataset.
  • Test the validity of the conversion with a simple use case.
  • Introduce negative examples in the reward dataset that are meaningful for the model to assign the proper score.
  • Investigate whether other datasets can be used.
@diegofiori diegofiori added the chatllama Issue related to the ChatLLaMA module label Mar 8, 2023
@nebuly-ai nebuly-ai moved this to Requested Features in ChatLLaMA Roadmap Mar 8, 2023
@diegofiori diegofiori added the good first issue Good for newcomers label Mar 9, 2023
@MattiaSangermano

Hey, I would like to contribute to this task. Do you already have any ideas on how to convert upvotes into rewards, or are you completely open to suggestions? I looked at the dataset a bit, and so far I have only come up with some simple ideas, but nothing too satisfying.

@diegofiori
Collaborator Author

Hi @MattiaSangermano, thank you very much for reaching out! Feel free to propose any idea on the issue. I haven't thought about it yet, but I'm happy to contribute to the brainstorming 😄

@MattiaSangermano

MattiaSangermano commented Mar 29, 2023

Thank you @diegofiori, the simplest idea that comes to mind is to scale the upvote values between 0 and 5. I would perform the scaling by normalizing the upvotes of a response while taking into account the "activity" of the post to which the response belongs. In this case, I would take the response with the most upvotes as an indicator of the post's activity (max_upvote). Moreover, to ensure that other responses do not receive an unfairly low reward because of an excess of upvotes on the winning response, I would clip max_upvote. One way to perform the clipping would be the IQR technique; in this case too, I would compute the quantiles with respect to the post_id. A sketch of the reward function would be:

$$reward_p^i = \frac{score_p^i}{\min(max\_upvote_p,\, IQR_p)} \cdot 5$$

where $p$ is the index of the post, $i$ is the index of the response within that post, and $IQR_p$ is the upper whisker computed from the upvote scores of the answers to post $p$.
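A rough sketch of how I would compute it (the column names `post_id` and `upvotes` are just placeholders, not the actual dataset schema):

```python
import pandas as pd

def add_rewards(df: pd.DataFrame, max_reward: float = 5.0) -> pd.DataFrame:
    """Attach a per-answer reward, computed post by post."""
    def per_post(group: pd.DataFrame) -> pd.DataFrame:
        q1, q3 = group["upvotes"].quantile([0.25, 0.75])
        upper_whisker = q3 + 1.5 * (q3 - q1)              # the "IQR_p" term above
        denom = max(min(group["upvotes"].max(), upper_whisker), 1e-8)
        return group.assign(reward=group["upvotes"] / denom * max_reward)

    return df.groupby("post_id", group_keys=False).apply(per_post)

df = pd.DataFrame({
    "post_id": [1, 1, 1, 2, 2],
    "upvotes": [3, 10, 250, 0, 4],
})
print(add_rewards(df))
```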

Please let me know if I wasn't clear in my explanation.

@diegofiori
Collaborator Author

I actually like the idea of using upvote quantiles to compute the reward. I just have a couple of questions about your reward function.

  1. Upvote scores are all positive, so IQR_p < max_upvote_p should always be true, shouldn't it? But in this way we would also have many rewards > 5 (more than 25% of the rewards).
  2. I'd also take into account the relative difference between score A and score B when computing the reward.
  3. I'd probably propose as the reward function a sum of boolean values, e.g. score^i_p > Q3_p or score B > score A (if we are computing the reward for B), capped at 5 (a sketch follows below). WDYT?
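Something along these lines (the rule set below is just illustrative, not a proposal for the final rules):

```python
def rule_based_reward(score: int, other_score: int,
                      q2: float, q3: float, max_reward: int = 5) -> int:
    """Each satisfied rule adds one point; the total is capped at max_reward."""
    rules = [
        score > q2,               # above the post's median upvote count
        score > q3,               # above the post's third quartile
        score > other_score,      # beats the answer it is paired with
        score > 2 * other_score,  # beats it by a wide margin
    ]
    return min(sum(rules), max_reward)

# e.g. answer B with 40 upvotes, paired with A at 10 upvotes,
# in a post with median 12 and Q3 30
print(rule_based_reward(score=40, other_score=10, q2=12, q3=30))  # 4
```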

@MattiaSangermano

Upvote scores are all positive, so IQR_p < max_upvote_p should always be true, shouldn't it?

Not necessarily, but I think using IQR to refer to the upper whisker was misleading. The upper whisker is Q3_p + 1.5 * (Q3_p - Q1_p), so if max_upvote_p is far away from Q3_p the inequality is false. It is similar to what happens when you draw a boxplot: some points can fall outside the whiskers.

But in this way we would also have many rewards > 5 (>25% of the rewards)

Yes, you are right; we should also threshold the score: min(score^i_p, IQR_p).
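With that fix, my reading of the revised formula (just to double-check we are on the same page) would be:

$$reward_p^i = \frac{\min(score_p^i,\, IQR_p)}{\min(max\_upvote_p,\, IQR_p)} \cdot 5$$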

I'd also take into account the relative difference between score A and score B when computing the reward

I am afraid that in this way we would create an inconsistent dataset. That is, we might have different answer pairs where the reward of one answer is lower than another's just because, in the original dataset, it was paired with an answer with many upvotes. The Stanford dataset was constructed in a way that pairs the same response multiple times, leading to multiple rewards for that response. How would we combine these rewards?

I'd probably propose as reward function a sum of boolean values, e.g. score^i_p > Q3_p or score B > score A (if we are computing the reward for B), capped to 5, WDYT?

I don't know if I understood correctly: you would like to create 5 or more rules, and the reward of an answer basically becomes the number of rules it passes, right? If so, it sounds very interesting.

@PierpaoloSorbellini PierpaoloSorbellini changed the title Use upvotes in Stanford dataset as a measure for reward [Chatllama] Use upvotes in Stanford dataset as a measure for reward Mar 31, 2023
@MattiaSangermano

@diegofiori any update?

@diegofiori
Collaborator Author

Hi @MattiaSangermano, I see your point. I'm actually pretty curious to take a look at an implementation of the metric you proposed. Theoretically it makes sense to me, but I'd like to see some examples from the dataset with the computed scores; I think this is the only way to effectively validate the metric.

@MattiaSangermano

Perfect, I will work on it over the next few days and open a PR as soon as possible.
