feat: preference datasets by jveronvialard · Pull Request #673 · NVIDIA-NeMo/RL

jveronvialard · 2025-07-15T23:26:53Z

What does this PR do?

This PR adds a more generic preference dataset class that can be used for both RM and DPO training. It also aligns the RM and DPO training implementations more closely and adds support for multiple validation preference datasets.

Usage

You can specify multiple validation preference datasets in your RM or DPO training configuration:

data:
  dataset_name: PreferenceDataset
  train_data_path: <LocalPathToTrainingDataset>
  val_data_paths:
    <NameOfValidationDataset1>: <LocalPathToValidationDataset1>
    <NameOfValidationDataset2>: <LocalPathToValidationDataset2>
    ...
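
The name-to-path mapping under `val_data_paths` can be read as "one validation pass per named dataset". A minimal sketch of how such a mapping might be resolved is given below. It is an illustration only: the helper `resolve_val_datasets`, the dataset names, and the file paths are hypothetical and are not taken from the NeMo RL code base.

```python
from pathlib import Path


def resolve_val_datasets(data_cfg: dict) -> dict[str, Path]:
    """Return {dataset_name: path} for every validation dataset in the config.

    Hypothetical helper for illustration; not part of NeMo RL.
    """
    val_paths = data_cfg.get("val_data_paths") or {}
    if not isinstance(val_paths, dict):
        raise TypeError("val_data_paths must map dataset names to paths")
    return {str(name): Path(path) for name, path in val_paths.items()}


# Placeholder config mirroring the YAML layout above (names and paths are made up).
data_cfg = {
    "dataset_name": "PreferenceDataset",
    "train_data_path": "/data/train.jsonl",
    "val_data_paths": {
        "helpsteer2": "/data/hs2_val.jsonl",
        "helpsteer3": "/data/hs3_val.jsonl",
    },
}

val_datasets = resolve_val_datasets(data_cfg)
for name, path in val_datasets.items():
    # Each named dataset would get its own validation pass and its own metrics.
    print(f"validation set {name!r} -> {path}")
```

Keeping the dataset name as the dictionary key makes it straightforward to report validation metrics separately for each dataset.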

For example, when using local preference datasets based on HelpSteer2 and HelpSteer3, where ties have been filtered

Comparing RM convergence plots before and after this PR, using the default data.dataset_name: HelpSteer3

Comparing DPO convergence plots before and after this PR, using the default data.dataset_name: HelpSteer3

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

odelalleau

Thanks! At high level lgtm, just a bunch of mostly minor questions & polish suggestions, except for one point that requires potentially more work (extending this new dataset format to DPO as well).

docs/guides/rm.md

nemo_rl/algorithms/sft.py

nemo_rl/data/hf_datasets/preference_dataset.py

tests/unit/data/hf_datasets/test_preference_dataset.py

nemo_rl/data/hf_datasets/preference_dataset.py

odelalleau

A few minor remarks on latest changes

docs/guides/rm.md

examples/run_rm.py

github-actions · 2025-08-28T03:44:58Z

⚠️ File Consistency Check

Check based on commit: 52a20bc (PR #673 from jveronvialard/preference-datasets)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker_v2.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker_v2.py
Update nemo_rl/models/policy/dtensor_policy_worker_v2.py if necessary to maintain consistency
If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

Modified: nemo_rl/models/policy/dtensor_policy_worker.py
Not modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

terrykong · 2025-08-28T22:10:09Z

@jveronvialard and I synced up offline and came to an agreement on the changes. After the changes and a convergence metric comparison, we should be good to go.

nemo_rl/algorithms/dpo.py

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
Signed-off-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>

jveronvialard added 9 commits July 3, 2025 15:49

adding support for Bradley-Terry reward model training

a38c104

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/b…

ede515b

…t-rm-training

update docs

5b9e976

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

add separate run_rm.py and unit tests

68e96ea

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

fix small typos and nit changes

21d67a0

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

adding generic preference dataset class and support for multiple vali…

0aff450

…dation preference datasets Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

rewards tensor shape

8a28af7

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

adding unit tests

7de3b93

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

updating docs

63dd1f3

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

jveronvialard requested a review from odelalleau July 15, 2025 23:26

jveronvialard added enhancement New feature or request training Training related algorithm labels Jul 15, 2025

github-actions bot added the documentation (Improvements or additions to documentation) label Jul 15, 2025

jveronvialard marked this pull request as draft July 15, 2025 23:30

jveronvialard added 7 commits July 16, 2025 10:06

Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/b…

e914087

…t-rm-training

update config and skip is_tied_lm_head for RM

8fb280b

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

use tokenizer.pad_token_id if model.config.pad_token_id is not defined

3e3b03a

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

nit

ed24aea

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

update functional test and cicd

af17314

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

nit docs

1034634

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

keep individual metrics then aggregate on the entire dataset

02687ce

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

odelalleau requested changes Jul 18, 2025

nit code and doc changes

24c5fd0

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

odelalleau reviewed Jul 18, 2025

docs/guides/rm.md (outdated, resolved)

docs/guides/rm.md (outdated, resolved)

examples/run_rm.py (outdated, resolved)

jveronvialard added 5 commits July 21, 2025 10:37

Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/b…

8788ec2

…t-rm-training

split sft.py and rm.py

24807c3

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

nit code and doc changes

5c76465

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

pull from target branch

00363a2

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/b…

d3b6272

…t-rm-training

jveronvialard dismissed odelalleau’s stale review via 8129c23 August 27, 2025 23:36

jveronvialard force-pushed the jveronvialard/preference-datasets branch from 36836c1 to 8129c23 Compare August 27, 2025 23:36

jveronvialard added 3 commits August 27, 2025 20:05

adding overall val time

c4e3bda

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

aggregate metrics at the batch level first

578441f

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

lint

52a20bc

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

jveronvialard added 3 commits August 27, 2025 20:51

nit

f97ee6d

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

fix tulu3

c4b4e6a

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

adding tulu3 unit test

bd16f9b

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

jveronvialard added 5 commits August 28, 2025 15:16

nit

5f6cc52

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

validation metrics

0dc7a6f

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

nit code and docs

1137407

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/p…

ba4b539

…reference-datasets

Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/p…

62fc8f1

…reference-datasets

terrykong reviewed Aug 29, 2025

nemo_rl/algorithms/dpo.py (outdated, resolved)

adding DPOValMetrics

30571c2

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

terrykong previously approved these changes Aug 29, 2025

terrykong added this pull request to the merge queue Aug 29, 2025

github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Aug 30, 2025

revert jsonc to json since sphinx didn't like

e58d9ee

Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong dismissed their stale review via e58d9ee August 30, 2025 17:19

terrykong enabled auto-merge August 30, 2025 17:19

terrykong approved these changes Aug 30, 2025

terrykong added this pull request to the merge queue Aug 30, 2025

Merged via the queue into main with commit cbd4b93 Aug 30, 2025
21 checks passed

terrykong deleted the jveronvialard/preference-datasets branch August 30, 2025 21:35

PocketDocLabs referenced this pull request in arcee-ai/NeMo-RL Aug 31, 2025

feat: preference datasets #673

d179be4

terrykong merged 59 commits into main from jveronvialard/preference-datasets

4 participants
