[Chatllama] Use upvotes in Stanford dataset as a measure for reward #224

diegofiori · 2023-03-08T13:40:56Z

Description

Currently we are supporting the following datasets:

Stanford Human Preferences Dataset (SHP)
[Anthropic RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf

But we are not using all the information contained in the dataset:

rejected answers from Anthropic
up-votes for Stanford.

The number of upvotes, for instance, could be used as a label for the reward model (after some normalisation) to judge the quality of the answer without asking to a model or a human to return a feedback.

Moreover these datasets for training the reward models must be artificially augment with high quality negative examples to make the reward model learn not only what is good but also what is not.

Eventually a more robust research of possible useful datasets to be integrated must be carried out to ensure to support all the open-source datasets that can be relevant for the projects.

TODO

Implement a conversion between upvotes and reward for Stanford dataset.
Understand how the rejected answer can be used to augment the dataset quality of the reward model.
Test the validity of the conversion with a simple use case.
Introduce negative examples in the reward dataset that are meaningful to the model to assign the proper score.
Other datasets can be used?

MattiaSangermano · 2023-03-28T17:02:55Z

Hey, I would like to contribute to this task. Do you already have any ideas on how to convert upvotes into rewards, or are you completely open to suggestions? I looked at the dataset a bit, and so far I have only come up with some simple ideas, but nothing too satisfying.

diegofiori · 2023-03-28T17:21:11Z

Hi @MattiaSangermano, thank you very much for reaching out! Feel free to propose any idea on the issue. I haven't thought on it yet, but I'm happy to contribute to the brainstorming 😄

MattiaSangermano · 2023-03-29T16:19:54Z

Thank you @diegofiori, the simplest idea that comes to mind is to scale the upvote values between 0 and 5. I would perform the scaling by normalizing the upvotes of a response taking into account the "activity" of the individual post to which the response belongs. In this case, I would take the response with the most upvotes as an indicator of the post activity (max_upvote). Moreover, to ensure that other responses do not receive an unfairly low reward due to an excess of upvotes received by the winning response I would clip the max_upvote. One way to perform the clipping would be the IQR technique, even in this case I would compute the quantiles with respect to the post_id. A scatch of the reward function would be:

$$reward_p^i = \frac{score_p^i} {min(max\_upvote_{p}, IQR_p)} * 5$$

where p is the index of the post, i is the response index of a post and $IQR_{p}$ is the upper whisker computed using the answer upscores of the post p.

Please let me know if I wasn't clear in my explanation.

diegofiori · 2023-03-29T18:37:16Z

I actually like the idea of using upvotes quantiles to compute the reward. I just have a couple of questions about your reward function.

Upvote scores are all positive, then IQR_p < max_upvote_p should always be true, shoudn't it? But in this way we would also have many rewards > 5 (>25% of the rewards)
I'd also take into account the relative difference between score A and score B when computing the reward
I'd probably propose as reward function a sum of boolean values, e.g. score^i_p > Q3_p or score B > score A (if we are computing the reward for B), capped to 5, WDYT?

MattiaSangermano · 2023-03-30T08:14:17Z

Upvote scores are all positive, then IQR_p < max_upvote_p should always be true, shoudn't it?

Not necessary, but I think using IQR to refer to the higher whisker was misleading. The higher whisker is Q3_p + 1.5 * (Q3_p - Q1_p), therefore if the value of max_upvote_p is far away from Q3_p then the inequality is false. It is similar to what happens when you draw a boxplot as you can have points that fall outside the arms of the boxplot.

But in this way we would also have many rewards > 5 (>25% of the rewards)

Yes you are right, we should threshold also the score: min(score^i_p,IQR_p)

I'd also take into account the relative difference between score A and score B when computing the reward

I am afraid that in this way we will create an inconsistent dataset . That is, we might have different answer pairs, where the reward of one answer is lower than another just because in the original dataset it was paired with an answer with many upvotes. The Standford dataset was constructed in a way that pairs the same response multiple times, leading to multiple rewards for that response. How can we combine these rewards?

I'd probably propose as reward function a sum of boolean values, e.g. score^i_p > Q3_p or score B > score A (if we are computing the reward for B), capped to 5, WDYT?

I don't know if I understood correctly, you would like to create 5 or more rules where the reward of an answer basically becomes the number of rules it can pass, right? If so, it sounds very interesting.

MattiaSangermano · 2023-04-13T17:33:50Z

@diegofiori any update?

diegofiori · 2023-04-19T05:57:44Z

Hi @MattiaSangermano, I see your point. I'm actually pretty curious to give a look to the implementation of the metric you proposed. Theoretically speaking it makes sense to me. I'm curious to see some examples from the dataset with the related computed score. I think this is the only way to effectively validate the metric.

MattiaSangermano · 2023-04-27T07:22:33Z

Perfect, I will work on it over the next few days, as soon as possible I will do a PR

diegofiori added the chatllama Issue related to the ChatLLaMA module label Mar 8, 2023

nebuly-ai added this to ChatLLaMA Roadmap Mar 8, 2023

nebuly-ai moved this to Requested Features in ChatLLaMA Roadmap Mar 8, 2023

diegofiori added the good first issue Good for newcomers label Mar 9, 2023

diegofiori assigned MattiaSangermano Mar 28, 2023

PierpaoloSorbellini changed the title ~~Use upvotes in Stanford dataset as a measure for reward~~ [Chatllama] Use upvotes in Stanford dataset as a measure for reward Mar 31, 2023

MattiaSangermano mentioned this issue May 1, 2023

Standford SHP reward dataset #340

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Chatllama] Use upvotes in Stanford dataset as a measure for reward #224

[Chatllama] Use upvotes in Stanford dataset as a measure for reward #224

diegofiori commented Mar 8, 2023 •

edited by PierpaoloSorbellini

Loading

MattiaSangermano commented Mar 28, 2023

diegofiori commented Mar 28, 2023

MattiaSangermano commented Mar 29, 2023 •

edited

Loading

diegofiori commented Mar 29, 2023

MattiaSangermano commented Mar 30, 2023

MattiaSangermano commented Apr 13, 2023

diegofiori commented Apr 19, 2023

MattiaSangermano commented Apr 27, 2023

[Chatllama] Use upvotes in Stanford dataset as a measure for reward #224

[Chatllama] Use upvotes in Stanford dataset as a measure for reward #224

Comments

diegofiori commented Mar 8, 2023 • edited by PierpaoloSorbellini Loading

Description

TODO

MattiaSangermano commented Mar 28, 2023

diegofiori commented Mar 28, 2023

MattiaSangermano commented Mar 29, 2023 • edited Loading

diegofiori commented Mar 29, 2023

MattiaSangermano commented Mar 30, 2023

MattiaSangermano commented Apr 13, 2023

diegofiori commented Apr 19, 2023

MattiaSangermano commented Apr 27, 2023

diegofiori commented Mar 8, 2023 •

edited by PierpaoloSorbellini

Loading

MattiaSangermano commented Mar 29, 2023 •

edited

Loading