Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
…dation preference datasets Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
odelalleau left a comment:
A few minor remarks on the latest changes.
36836c1 to 8129c23
@jveronvialard and I synced up offline and came to an agreement on the changes. After those changes and a convergence metric comparison, we should be good to go.
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
Signed-off-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
What does this PR do?
This PR adds a more generic preference dataset class that can be used for both RM and DPO training. It also aligns the RM and DPO training implementations more closely and adds support for multiple validation preference datasets.
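To illustrate the idea, here is a minimal sketch of how one preference record type can serve several named validation sets. It is not the class or API added in this PR: `PreferencePair`, `pairwise_accuracy`, `evaluate_all`, and the toy `length_score` scorer are hypothetical names chosen for illustration. The `score` callable stands in for whatever each trainer uses to rank a response, such as an RM reward or a DPO implicit reward.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record: one prompt with a preferred and a dispreferred reply."""

    prompt: str
    chosen: str
    rejected: str


def pairwise_accuracy(
    pairs: List[PreferencePair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the chosen reply outscores the rejected one."""
    if not pairs:
        return 0.0
    wins = sum(
        score(p.prompt, p.chosen) > score(p.prompt, p.rejected) for p in pairs
    )
    return wins / len(pairs)


def evaluate_all(
    val_sets: Dict[str, List[PreferencePair]],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Report one metric per named validation set."""
    return {name: pairwise_accuracy(pairs, score) for name, pairs in val_sets.items()}


def length_score(prompt: str, response: str) -> float:
    """Toy scorer (assumption for the demo): longer responses score higher."""
    return float(len(response))


# Two small, made-up validation sets keyed by name.
val_sets = {
    "helpsteer2_like": [PreferencePair("q1", "a detailed answer", "short")],
    "helpsteer3_like": [
        PreferencePair("q2", "ok", "a much longer reply"),
        PreferencePair("q3", "thorough reply", "no"),
    ],
}
metrics = evaluate_all(val_sets, length_score)
print(metrics)
```

Keying the validation sets by name keeps the metrics for each dataset separate, which is the property that matters when comparing convergence across several validation sets.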
Usage
You can specify multiple validation preference datasets in your RM or DPO training configuration:
For example, when using local preference datasets based on HelpSteer2 and HelpSteer3, where ties have been filtered



Comparing RM convergence plots before and after this PR, using the default data.dataset_name: HelpSteer3

Comparing DPO convergence plots before and after this PR, using the default data.dataset_name: HelpSteer3

Before your PR is "Ready for review"
Pre checks: