Merged
59 commits
a38c104
adding support for Bradley-Terry reward model training
jveronvialard Jul 3, 2025
ede515b
Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/b…
jveronvialard Jul 15, 2025
5b9e976
update docs
jveronvialard Jul 15, 2025
68e96ea
add separate run_rm.py and unit tests
jveronvialard Jul 15, 2025
21d67a0
fix small typos and nit changes
jveronvialard Jul 15, 2025
0aff450
adding generic preference dataset class and support for multiple vali…
jveronvialard Jul 15, 2025
8a28af7
rewards tensor shape
jveronvialard Jul 15, 2025
7de3b93
adding unit tests
jveronvialard Jul 15, 2025
63dd1f3
updating docs
jveronvialard Jul 15, 2025
e914087
Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/b…
jveronvialard Jul 16, 2025
8fb280b
update config and skip is_tied_lm_head for RM
jveronvialard Jul 16, 2025
3e3b03a
use tokenizer.pad_token_id if model.config.pad_token_id is not defined
jveronvialard Jul 16, 2025
ed24aea
nit
jveronvialard Jul 16, 2025
af17314
update functional test and cicd
jveronvialard Jul 16, 2025
1034634
nit docs
jveronvialard Jul 17, 2025
02687ce
keep individual metrics then aggregate on the entire dataset
jveronvialard Jul 17, 2025
24c5fd0
nit code and doc changes
jveronvialard Jul 18, 2025
8788ec2
Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/b…
jveronvialard Jul 21, 2025
24807c3
split sft.py and rm.py
jveronvialard Jul 21, 2025
5c76465
nit code and doc changes
jveronvialard Jul 21, 2025
00363a2
pull from target branch
jveronvialard Jul 21, 2025
d3b6272
Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/b…
jveronvialard Jul 22, 2025
0aaf296
pull from main
jveronvialard Jul 22, 2025
5b3f1ad
Update docs/guides/rm.md
odelalleau Jul 23, 2025
6534c7c
Remove the `-RAY_DEDUP_LOGS=0` examples in the README
odelalleau Jul 23, 2025
b79d0ee
Refactor RM config to include a dedicated `reward_model_cfg` section
odelalleau Jul 23, 2025
51cc9f8
Provide user-friendly error message regarding unsupported RMs in mcore
odelalleau Jul 23, 2025
597d5eb
Simplify code and guard against enabling sequence packing in RMs
odelalleau Jul 23, 2025
ba2e4b6
Fix likely crash with Reward Models introduced in previous commit
odelalleau Jul 23, 2025
4733717
Fix linting issues
odelalleau Jul 23, 2025
3297cd1
Fix a typing issue
odelalleau Jul 23, 2025
179767e
Quick fix to typing issue (with TODO item for better fix)
odelalleau Jul 25, 2025
a86b6c7
Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/b…
jveronvialard Jul 28, 2025
2e6ef71
Merge branch 'jveronvialard/bt-rm-training' of github.com:NVIDIA-NeMo…
jveronvialard Jul 28, 2025
76f77d8
unify data logic between DPO and RM training
jveronvialard Jul 28, 2025
97d1c46
pull from main
jveronvialard Jul 29, 2025
6ca4287
nit code and docs
jveronvialard Aug 4, 2025
1894caf
put data processing in collate_fn
jveronvialard Aug 5, 2025
74eb553
updates to val metrics and save state
jveronvialard Aug 5, 2025
b3e848e
pull from main
jveronvialard Aug 5, 2025
5aba6d6
pull from main
jveronvialard Aug 27, 2025
2449bba
squash unsigned commits resolving previous feedback
jveronvialard Aug 27, 2025
f602042
pull from main
jveronvialard Aug 27, 2025
8ad7565
nit docs + lint
jveronvialard Aug 27, 2025
9efd72a
nit code and docs
jveronvialard Aug 27, 2025
8129c23
better jsonc
jveronvialard Aug 27, 2025
c4e3bda
adding overall val time
jveronvialard Aug 28, 2025
578441f
aggregate metrics at the batch level first
jveronvialard Aug 28, 2025
52a20bc
lint
jveronvialard Aug 28, 2025
f97ee6d
nit
jveronvialard Aug 28, 2025
c4b4e6a
fix tulu3
jveronvialard Aug 28, 2025
bd16f9b
adding tulu3 unit test
jveronvialard Aug 28, 2025
5f6cc52
nit
jveronvialard Aug 28, 2025
0dc7a6f
validation metrics
jveronvialard Aug 29, 2025
1137407
nit code and docs
jveronvialard Aug 29, 2025
ba4b539
Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/p…
jveronvialard Aug 29, 2025
62fc8f1
Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/p…
jveronvialard Aug 29, 2025
30571c2
adding DPOValMetrics
jveronvialard Aug 29, 2025
e58d9ee
revert jsonc to json since sphinx didn't like
terrykong Aug 30, 2025
190 changes: 75 additions & 115 deletions docs/guides/dpo.md
@@ -32,129 +32,89 @@ uv run examples/run_dpo.py \

## Datasets

Each class representing a NeMo RL DPO dataset is expected to have the following attributes:
1. `formatted_ds`: The dictionary of formatted datasets. This dictionary should contain `train` and `validation` splits, and each split should conform to the format described below.
2. `task_spec`: The `TaskDataSpec` for this dataset. This should specify the name you choose for this dataset.

DPO datasets are expected to follow a specific format with three key fields:
- `prompt`: The input prompt/context
- `chosen_response`: The preferred/winning response
- `rejected_response`: The non-preferred/losing response

[data/hf_datasets/helpsteer3.py](../../nemo_rl/data/hf_datasets/helpsteer3.py) provides an example of how to format data for DPO:

```python
def format_helpsteer3(data):
    response_1 = data["response1"]
    response_2 = data["response2"]
    overall_preference = data["overall_preference"]

    if overall_preference < 0:
        chosen = response_1
        rejected = response_2
    elif overall_preference == 0:
        # Tie: this example keeps response_1 as both chosen and rejected
        chosen = response_1
        rejected = response_1
    else:
        chosen = response_2
        rejected = response_1

    return {
        "prompt": data["context"],
        "chosen_response": chosen,
        "rejected_response": rejected,
    }
```
Each DPO dataset class is expected to have the following attributes:
1. `formatted_ds`: The dictionary of formatted datasets, where each dataset should be formatted as follows:

   ```json
   {
     "context": [],       // list of dicts - the prompt messages (including previous turns, if any)
     "completions": [     // list of dicts - the list of completions
       {
         "rank": 0,       // int - the rank of the completion (lower rank is preferred)
         "completion": [] // list of dicts - the completion message(s)
       },
       {
         "rank": 1,
         "completion": []
       }
     ]
   }
   ```
2. `task_spec`: The `TaskDataSpec` for this dataset. This should specify the name you choose for this dataset.
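
For illustration, an example in the `context`/`completions` schema above can be built with a small helper. The helper below is hypothetical (it is not part of NeMo RL) and uses only plain Python:

```python
def make_preference_example(context, responses):
    """Build one preference example from a chat context and a best-first
    list of candidate responses; rank 0 is the preferred completion."""
    return {
        "context": context,  # list of chat-message dicts
        "completions": [
            {"rank": rank, "completion": [{"role": "assistant", "content": text}]}
            for rank, text in enumerate(responses)
        ],
    }

example = make_preference_example(
    context=[{"role": "user", "content": "What's the capital of Germany?"}],
    responses=[
        "The capital of Germany is Berlin.",  # preferred -> rank 0
        "The capital of Germany is Munich.",  # rejected -> rank 1
    ],
)
print(example["completions"][1]["rank"])  # 1
```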

We also provide a [DPODataset](../../nemo_rl/data/hf_datasets/dpo.py) class that is compatible with JSONL-formatted preference datasets. This class assumes the train and validation datasets have been split and processed into the expected format offline. The JSONL files should consist of examples with `prompt`, `chosen_response`, and `rejected_response` keys.

## Adding Custom DPO Datasets

Adding a new DPO dataset is straightforward. Your custom dataset class should:
1. Implement the required format conversion in the constructor
2. Set up the appropriate `task_spec`

Here's a minimal example that simply re-keys an existing JSONL dataset:

```{testcode}
from datasets import load_dataset
from nemo_rl.data.interfaces import TaskDataSpec
from docs.helpers import make_dpo_dataset

class CustomDPODataset:
    def preprocess_dataset(
        self,
        data,
        prompt_key: str = "context",
        chosen_key: str = "chosen",
        rejected_key: str = "rejected",
    ):
        return {
            "prompt": data[prompt_key],
            "chosen_response": data[chosen_key],
            "rejected_response": data[rejected_key],
        }

    def __init__(
        self,
        train_data_path: str,
        val_data_path: str,
        prompt_key: str,
        chosen_key: str,
        rejected_key: str,
    ):
        # Load and format your dataset
        fn_kwargs = {
            "prompt_key": prompt_key,
            "chosen_key": chosen_key,
            "rejected_key": rejected_key,
        }
        formatted_ds = {
            "train": load_dataset("json", data_files=train_data_path, split="train").map(
                self.preprocess_dataset,
                fn_kwargs=fn_kwargs,
            ),
            "validation": load_dataset("json", data_files=val_data_path, split="train").map(
                self.preprocess_dataset,
                fn_kwargs=fn_kwargs,
            ),
        }

        # Initialize task spec with dataset name
        self.task_spec = TaskDataSpec(
            task_name="custom_dpo",
        )
        self.formatted_ds = formatted_ds

# Create temporary files using helper function
train_file, val_file = make_dpo_dataset()

# Initialize dataset
dataset = CustomDPODataset(
    train_data_path=train_file.name,
    val_data_path=val_file.name,
    prompt_key="context",
    chosen_key="chosen",
    rejected_key="rejected",
)

# Test dataset properties
print(f"Task name: {dataset.task_spec.task_name}")
print(f"Train examples: {len(dataset.formatted_ds['train'])}")
print(f"Validation examples: {len(dataset.formatted_ds['validation'])}")
print(f"First train example prompt: {dataset.formatted_ds['train'][0]['prompt']}")
print(f"First train example chosen response: {dataset.formatted_ds['train'][0]['chosen_response']}")
print(f"First train example rejected response: {dataset.formatted_ds['train'][0]['rejected_response']}")
```

```{testoutput}
Task name: custom_dpo
Train examples: 2
Validation examples: 2
First train example prompt: What is 2+2?
First train example chosen response: 4
First train example rejected response: 5
```

Currently, DPO training supports only two completions (the lowest rank is preferred and the highest rank is rejected), with each completion being a single response. For example:
```json
{
  "context": [
    {
      "role": "user",
      "content": "What's the capital of France?"
    },
    {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    {
      "role": "user",
      "content": "Thanks! And what's the capital of Germany?"
    }
  ],
  "completions": [
    {
      "rank": 0,
      "completion": [
        {
          "role": "assistant",
          "content": "The capital of Germany is Berlin."
        }
      ]
    },
    {
      "rank": 1,
      "completion": [
        {
          "role": "assistant",
          "content": "The capital of Germany is Munich."
        }
      ]
    }
  ]
}
```

NeMo RL provides a DPO-compatible implementation of the [HelpSteer3](https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/data/hf_datasets/helpsteer3.py) dataset as an example. This dataset is downloaded from Hugging Face and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.
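
The on-the-fly preprocessing boils down to mapping HelpSteer3's `overall_preference` score onto an ordered response pair, as in the `format_helpsteer3` example earlier in this diff. A sketch of that mapping, with a hypothetical function name (negative scores favor `response1`, positive favor `response2`):

```python
def order_by_preference(response1, response2, overall_preference):
    """Return (chosen, rejected) for a HelpSteer3-style preference score.
    A tie (score of 0) keeps response1 for both, mirroring the example above."""
    if overall_preference < 0:
        return response1, response2
    if overall_preference == 0:
        return response1, response1
    return response2, response1

chosen, rejected = order_by_preference("Berlin.", "Munich.", -2)
print(chosen)  # Berlin.
```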

We also provide a [PreferenceDataset](../../nemo_rl/data/hf_datasets/preference_dataset.py) class that is compatible with JSONL-formatted preference datasets. You can modify your config as follows to use such a custom preference dataset:
```yaml
data:
dataset_name: PreferenceDataset
train_data_path: <LocalPathToTrainingDataset>
val_data_paths:
<NameOfValidationDataset>: <LocalPathToValidationDataset>
```
Multiple validation sets are supported as well:
```yaml
data:
dataset_name: PreferenceDataset
train_data_path: <LocalPathToTrainingDataset>
val_data_paths:
<NameOfValidationDataset1>: <LocalPathToValidationDataset1>
<NameOfValidationDataset2>: <LocalPathToValidationDataset2>
```
Please note:
- If you are using a logger, the prefix used for each validation set will be `validation-<NameOfValidationDataset>`. The total validation time, summed across all validation sets, is reported under `timing/validation/total_validation_time`.
- If you are doing checkpointing, the `metric_name` value in your `checkpointing` config should reflect the metric and validation set to be tracked. For example, `validation-<NameOfValidationDataset1>_loss`.
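
A JSONL file for this config holds one JSON object per line. Assuming `PreferenceDataset` expects the `context`/`completions` schema shown earlier (the schema and paths here are illustrative), such a file can be written with the standard library:

```python
import json
import tempfile

examples = [
    {
        "context": [{"role": "user", "content": "What's the capital of Germany?"}],
        "completions": [
            {"rank": 0, "completion": [{"role": "assistant", "content": "Berlin."}]},
            {"rank": 1, "completion": [{"role": "assistant", "content": "Munich."}]},
        ],
    },
]

# Write one JSON object per line; pass the resulting path as data.train_data_path
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
    train_path = f.name

# Round-trip check
with open(train_path) as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # 1
```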

The older [DPODataset](../../nemo_rl/data/hf_datasets/dpo.py) class is deprecated. This class is also compatible with JSONL-formatted preference datasets. It assumes the train and validation datasets have been split and processed into the expected format offline. The JSONL files should consist of examples with `prompt`, `chosen_response`, and `rejected_response` keys.
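
Migrating a legacy `prompt`/`chosen_response`/`rejected_response` record to the newer `context`/`completions` schema is mechanical. A hedged sketch with a hypothetical converter (the exact message roles NeMo RL expects may differ):

```python
def convert_legacy_record(record):
    """Map a legacy DPODataset-style record onto the context/completions
    schema used by PreferenceDataset; rank 0 is the chosen response."""
    return {
        "context": [{"role": "user", "content": record["prompt"]}],
        "completions": [
            {"rank": 0, "completion": [{"role": "assistant", "content": record["chosen_response"]}]},
            {"rank": 1, "completion": [{"role": "assistant", "content": record["rejected_response"]}]},
        ],
    }

new_record = convert_legacy_record(
    {"prompt": "What is 2+2?", "chosen_response": "4", "rejected_response": "5"}
)
print(new_record["completions"][0]["completion"][0]["content"])  # 4
```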

## DPO-Specific Parameters

82 changes: 81 additions & 1 deletion docs/guides/rm.md
@@ -21,4 +21,84 @@ The default YAML config shares the same base template as the SFT config but incl

## Datasets

By default, NeMo RL supports the `HelpSteer3` dataset. This dataset is downloaded from Hugging Face and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.
Each RM dataset class is expected to have the following attributes:
1. `formatted_ds`: The dictionary of formatted datasets, where each dataset should be formatted as follows:

   ```json
   {
     "context": [],       // list of dicts - the prompt messages (including previous turns, if any)
     "completions": [     // list of dicts - the list of completions
       {
         "rank": 0,       // int - the rank of the completion (lower rank is preferred)
         "completion": [] // list of dicts - the completion message(s)
       },
       {
         "rank": 1,
         "completion": []
       }
     ]
   }
   ```
2. `task_spec`: The `TaskDataSpec` for this dataset. This should specify the name you choose for this dataset.

Currently, RM training supports only two completions (where the lowest rank is preferred and the highest one is rejected), with each completion being a single response. For example:
```json
{
  "context": [
    {
      "role": "user",
      "content": "What's the capital of France?"
    },
    {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    {
      "role": "user",
      "content": "Thanks! And what's the capital of Germany?"
    }
  ],
  "completions": [
    {
      "rank": 0,
      "completion": [
        {
          "role": "assistant",
          "content": "The capital of Germany is Berlin."
        }
      ]
    },
    {
      "rank": 1,
      "completion": [
        {
          "role": "assistant",
          "content": "The capital of Germany is Munich."
        }
      ]
    }
  ]
}
```
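
A Bradley-Terry reward model is trained so that the reward of the rank-0 completion exceeds that of the rank-1 completion; the standard pairwise objective is `-log(sigmoid(r_chosen - r_rejected))`. NeMo RL's actual loss lives in the training code, so treat this as an illustrative sketch:

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as softplus(-margin)
    via log1p for numerical stability at large positive margins."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

print(round(bradley_terry_loss(2.0, 0.0), 4))  # 0.1269
print(round(bradley_terry_loss(0.0, 0.0), 4))  # 0.6931
```

The loss shrinks as the chosen reward pulls ahead of the rejected one, and equals `log(2)` when the model cannot tell the pair apart.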

NeMo RL provides a RM-compatible implementation of the [HelpSteer3](https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/data/hf_datasets/helpsteer3.py) dataset as an example. This dataset is downloaded from Hugging Face and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.

We also provide a [PreferenceDataset](../../nemo_rl/data/hf_datasets/preference_dataset.py) class that is compatible with JSONL-formatted preference datasets. You can modify your config as follows to use such a custom preference dataset:
```yaml
data:
dataset_name: PreferenceDataset
train_data_path: <LocalPathToTrainingDataset>
val_data_paths:
<NameOfValidationDataset>: <LocalPathToValidationDataset>
```
Multiple validation sets are supported as well:
```yaml
data:
dataset_name: PreferenceDataset
train_data_path: <LocalPathToTrainingDataset>
val_data_paths:
<NameOfValidationDataset1>: <LocalPathToValidationDataset1>
<NameOfValidationDataset2>: <LocalPathToValidationDataset2>
```
Please note:
- If you are using a logger, the prefix used for each validation set will be `validation-<NameOfValidationDataset>`. The total validation time, summed across all validation sets, is reported under `timing/validation/total_validation_time`.
- If you are doing checkpointing, the `metric_name` value in your `checkpointing` config should reflect the metric and validation set to be tracked. For example, `validation-<NameOfValidationDataset1>_loss`.
15 changes: 14 additions & 1 deletion examples/configs/dpo.yaml
@@ -151,9 +151,22 @@ policy:
  data_parallel_sharding_strategy: "optim_grads_params"

data:
  max_input_seq_length: ${policy.max_total_sequence_length}
  shuffle: true

  dataset_name: HelpSteer3
  # You can use custom preference datasets for training and validation. For example:
  # data:
  #   dataset_name: PreferenceDataset
  #   train_data_path: <LocalPathToTrainingDataset>
  #   val_data_paths:
  #     <NameOfValidationDataset1>: <LocalPathToValidationDataset1>
  #   ...
  # If you are doing checkpointing, `metric_name` should reflect the metric and validation set to be tracked. For example:
  # checkpointing:
  #   metric_name: "validation-<NameOfValidationDataset1>_loss"
  #   ...

logger:
  log_dir: "logs"  # Base directory for all logs
  wandb_enabled: false  # Make sure you do a `wandb login [Your API key]` before running
14 changes: 13 additions & 1 deletion examples/configs/rm.yaml
@@ -123,9 +123,21 @@ policy:

data:
  max_input_seq_length: ${policy.max_total_sequence_length}
  shuffle: true

  dataset_name: HelpSteer3
  # You can use custom preference datasets for training and validation. For example:
  # data:
  #   dataset_name: PreferenceDataset
  #   train_data_path: <LocalPathToTrainingDataset>
  #   val_data_paths:
  #     <NameOfValidationDataset1>: <LocalPathToValidationDataset1>
  #   ...
  # If you are doing checkpointing, `metric_name` should reflect the metric and validation set to be tracked. For example:
  # checkpointing:
  #   metric_name: "validation-<NameOfValidationDataset1>_loss"
  #   ...

logger:
  log_dir: "logs"  # Base directory for all logs
  wandb_enabled: true  # Make sure you do a `wandb login [Your API key]` before running