Merged

62 commits
- `2e7bc59` initial dpo implementation (ashors1, Apr 4, 2025)
- `f44a028` bug fixes (ashors1, Apr 4, 2025)
- `33e8535` small perf gains (ashors1, Apr 7, 2025)
- `1cb0784` Merge branch 'main' of github.com:NVIDIA/reinforcer into ashors/dpo (ashors1, Apr 8, 2025)
- `b430180` make dpo work with jsonl (ashors1, Apr 9, 2025)
- `4cc5a7f` add validation and checkpointing (ashors1, Apr 9, 2025)
- `c136aa7` revert handling of too-long lines (ashors1, Apr 11, 2025)
- `e2307ae` fix running validation with different batch size than training (ashors1, Apr 11, 2025)
- `dbe01d0` changes for convergence testing (ashors1, Apr 12, 2025)
- `3ff1a7e` no_grad and model.eval when in eval_mode (yfw, Apr 13, 2025)
- `b2b3c66` clean up, add support for average_log_probs (ashors1, Apr 14, 2025)
- `8145e4c` drop_last during training (ashors1, Apr 14, 2025)
- `78ae2eb` small fixes for checkpointing and multi-epoch (ashors1, Apr 14, 2025)
- `c039b7d` add copyright (ashors1, Apr 14, 2025)
- `0adec9b` clean up loss (ashors1, Apr 14, 2025)
- `63bec20` fixes for helpsteer dataset (ashors1, Apr 14, 2025)
- `444d5db` add loss function unit test (ashors1, Apr 14, 2025)
- `e9dae56` cleanup (ashors1, Apr 14, 2025)
- `8045ab9` add test for augment_dataloader (ashors1, Apr 14, 2025)
- `db03bcc` add dpo collate test (ashors1, Apr 14, 2025)
- `56044ab` add dataset unit tests (ashors1, Apr 14, 2025)
- `0645b48` add DPO documentation (ashors1, Apr 14, 2025)
- `bbcba50` fix example (ashors1, Apr 14, 2025)
- `5d58614` cleanup and add docstrings (ashors1, Apr 14, 2025)
- `19c0461` Merge branch 'main' of github.com:NVIDIA/reinforcer into ashors/dpo (ashors1, Apr 14, 2025)
- `d8b767a` add another test, clean up (ashors1, Apr 15, 2025)
- `7d03127` update config (ashors1, Apr 15, 2025)
- `18faa44` rename test (ashors1, Apr 15, 2025)
- `7d97f30` fix test (ashors1, Apr 15, 2025)
- `7a1e190` address comments and update config (ashors1, Apr 15, 2025)
- `0f70438` add one more unit test (ashors1, Apr 15, 2025)
- `21fd438` minor readme update (ashors1, Apr 15, 2025)
- `398b17b` add note on gbs to config (ashors1, Apr 15, 2025)
- `3f5276a` fix functional test (ashors1, Apr 15, 2025)
- `6e870a1` small fixes (ashors1, Apr 15, 2025)
- `38e5264` decrease num steps (ashors1, Apr 16, 2025)
- `e7deda8` fix DPO validation and correctly handle samples that are longer than … (ashors1, Apr 16, 2025)
- `7fbf847` fix comment (ashors1, Apr 16, 2025)
- `fb76ff4` fix reduction over valid samples (ashors1, Apr 16, 2025)
- `7472f6b` small bug fixes (ashors1, Apr 17, 2025)
- `ace13fa` log sum of valid samples rather than average (ashors1, Apr 17, 2025)
- `b5c66ba` address some comments (ashors1, Apr 17, 2025)
- `fea1e22` Merge branch 'main' of github.com:NVIDIA/reinforcer into ashors/dpo (ashors1, Apr 17, 2025)
- `6124416` small gbs and mbs fix, add copyright (ashors1, Apr 17, 2025)
- `6962df1` small fixes following rebase (ashors1, Apr 17, 2025)
- `d38891a` address comments, fix tests after rebase (ashors1, Apr 17, 2025)
- `a7b2dec` add hydra-style overrides (ashors1, Apr 17, 2025)
- `4ac99e3` sum valid samples across batch (ashors1, Apr 17, 2025)
- `82e8e98` decrease max steps and fix test (ashors1, Apr 17, 2025)
- `233a9ab` address remaining comments (ashors1, Apr 17, 2025)
- `30e7140` Merge branch 'main' of github.com:NVIDIA/reinforcer into ashors/dpo (ashors1, Apr 17, 2025)
- `c22bc60` support dtensor with dpo (ashors1, Apr 17, 2025)
- `96c079c` minor bug fixes (ashors1, Apr 17, 2025)
- `5a81c63` fix test loss fn (ashors1, Apr 18, 2025)
- `a8b3efc` update dpo docs (ashors1, Apr 18, 2025)
- `4c90c44` fix indentation (ashors1, Apr 18, 2025)
- `680dfbc` address remaining comments (ashors1, Apr 18, 2025)
- `b54b6c3` fix hyperlinks (ashors1, Apr 18, 2025)
- `4612a7a` small readme fixes (ashors1, Apr 18, 2025)
- `458f9c6` Merge branch 'main' of github.com:NVIDIA/reinforcer into ashors/dpo (ashors1, Apr 22, 2025)
- `fbc1f6d` fix issues with merge (ashors1, Apr 22, 2025)
- `01c4f90` fix issue with rebase, add functional dpo test to ci (ashors1, Apr 22, 2025)
1 change: 1 addition & 0 deletions .github/workflows/cicd-main.yml
@@ -170,6 +170,7 @@ jobs:
if [[ "${{ needs.pre-flight.outputs.test_level }}" =~ ^(L1|L2)$ ]]; then
uv run --no-sync bash ./tests/functional/sft.sh
uv run --no-sync bash ./tests/functional/grpo.sh
uv run --no-sync bash ./tests/functional/dpo.sh
else
echo Skipping functional tests for level ${{ needs.pre-flight.outputs.test_level }}
fi
106 changes: 84 additions & 22 deletions README.md
@@ -5,12 +5,15 @@
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Quick start](#quick-start)
- [GRPO](#grpo)
  - [Single Node](#single-node)
  - [Multi-node](#multi-node)
- [SFT](#sft)
  - [Single Node](#single-node-1)
  - [Multi-node](#multi-node-1)
- [DPO](#dpo)
  - [Single Node](#single-node-2)
  - [Multi-node](#multi-node-2)
- [Cluster Start](#cluster-start)

**Nemo-Reinforcer** is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny to over 100 billion parameters.
@@ -33,10 +36,10 @@ What you can expect:
- ✅ **Environment Support** - Support for multi-environment training.
- ✅ **Learning Algorithms** - GRPO (Group Relative Policy Optimization) and SFT (Supervised Fine-Tuning)
- ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state)
- ✅ **DPO Algorithm** - Direct Preference Optimization for alignment
- 🔜 **Larger Model Support** - Native PyTorch support for models up to 70B parameters
- 🔜 **Advanced Parallelism** - FSDP2, TP, SP, and sequence packing for efficient training
- 🔜 **Environment Isolation** - Dependency isolation between components

## Prerequisites

@@ -59,6 +62,61 @@ pip install uv

**Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.

### GRPO

We have a reference GRPO experiment config set up trained for math benchmarks using the [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset.

#### Single Node

To run GRPO on a single GPU for `Llama-3.2-1B-Instruct`:

```sh
# Run the GRPO math example using a 1B parameter model
uv run python examples/run_grpo_math.py
```

By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:

```sh
# Run the GRPO math example using a 1B parameter model using 8 GPUs
uv run python examples/run_grpo_math.py \
cluster.gpus_per_node=8
```

You can override any of the parameters listed in the yaml configuration file. For example,

```sh
uv run python examples/run_grpo_math.py \
policy.model_name="Qwen/Qwen2-1.5B" \
checkpointing.checkpoint_dir="results/qwen1_5b_math" \
logger.wandb_enabled=True \
logger.wandb.name="grpo-qwen1_5b_math" \
    logger.num_val_samples_to_print=10
```

#### Multi-node

```sh
# Run from the root of NeMo-Reinforcer repo
NUM_ACTOR_NODES=2
# Add a timestamp to make each job name unique
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# grpo_math_8b uses Llama-3.1-8B-Instruct model
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml cluster.num_nodes=2 checkpointing.checkpoint_dir='results/llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='grpo-llama8b_math'" \
UV_CACHE_DIR=YOUR_UV_CACHE_DIR \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
--nodes=${NUM_ACTOR_NODES} \
--account=YOUR_ACCOUNT \
--job-name=YOUR_JOBNAME \
--partition=YOUR_PARTITION \
--time=4:0:0 \
--gres=gpu:8 \
ray.sub
```

### SFT

We provide a sample SFT experiment that uses the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).
@@ -87,15 +145,12 @@ Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden.

#### Multi-node

For distributed training across multiple nodes:

```sh
# Run from the root of NeMo-Reinforcer repo
NUM_ACTOR_NODES=2
# Add a timestamp to make each job name unique
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# SFT experiment uses Llama-3.1-8B model
COMMAND="uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
@@ -109,48 +164,55 @@ sbatch \
ray.sub
```

### DPO

We provide a sample DPO experiment that uses the [HelpSteer3 dataset](https://huggingface.co/datasets/nvidia/HelpSteer3) for preference-based training.

#### Single Node

The default DPO experiment is configured to run on a single GPU. To launch the experiment:

```sh
uv run python examples/run_dpo.py
```

This trains `Llama-3.2-1B-Instruct` on one GPU.

If you have access to more GPUs, you can update the experiment accordingly. To run on 8 GPUs, we update the cluster configuration and switch to an 8B Llama3.1 Instruct model:

```sh
uv run python examples/run_dpo.py \
policy.model_name="meta-llama/Llama-3.1-8B-Instruct" \
policy.train_global_batch_size=256 \
cluster.gpus_per_node=8
```

Any of the DPO parameters can be customized from the command line. For example:

```sh
uv run python examples/run_dpo.py \
dpo.sft_loss_weight=0.1 \
dpo.preference_average_log_probs=True \
checkpointing.checkpoint_dir="results/llama_dpo_sft" \
logger.wandb_enabled=True \
logger.wandb.name="llama-dpo-sft"
```

Refer to [dpo.yaml](examples/configs/dpo.yaml) for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the [DPO documentation](docs/guides/dpo.md).

#### Multi-node

For distributed DPO training across multiple nodes, modify the following script for your use case:

```sh
# Run from the root of NeMo-Reinforcer repo
# Number of nodes to use for your job
NUM_ACTOR_NODES=2
# Add a timestamp to make each job name unique
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

COMMAND="uv run ./examples/run_dpo.py --config examples/configs/dpo.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 dpo.val_global_batch_size=32 checkpointing.checkpoint_dir='results/dpo_llama1b_2nodes' logger.wandb_enabled=True logger.wandb.name='dpo-llama1b'" \
RAY_DEDUP_LOGS=0 \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
169 changes: 169 additions & 0 deletions docs/guides/dpo.md
@@ -0,0 +1,169 @@
# Direct Preference Optimization in Reinforcer

[Direct Preference Optimization (DPO)](https://arxiv.org/pdf/2305.18290) is an RL-free alignment algorithm that operates on preference data. Given a prompt and a pair of chosen and rejected responses, DPO aims
to increase the probability of the chosen response and decrease the probability of the rejected response relative to a frozen reference model. The actor is initialized using the reference model. For more details, refer to the
[DPO paper](https://arxiv.org/pdf/2305.18290).
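
Concretely, the objective is a logistic loss on the margin between the policy's and the reference model's log-probability ratios for the two responses. The following is a minimal per-example sketch (illustrative only, not Reinforcer's implementation; the `beta` coefficient here plays the role of the `dpo.reference_policy_kl_penalty` config value described later in this guide):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin), where the margin
    measures how much more the policy prefers the chosen response over the
    rejected one, relative to the frozen reference model."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == softplus(-x), computed in a numerically stable way
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy still matches the reference, the margin is zero and the loss is `log(2)`; it shrinks as the policy assigns relatively more probability to the chosen response.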

## Launch a DPO Run

The script [examples/run_dpo.py](../../examples/run_dpo.py) can be used to launch a DPO experiment. This script can either be launched locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md).

Be sure to launch the job using `uv`. The command to launch a DPO job is as follows:
```bash
uv run examples/run_dpo.py --config <PATH TO YAML CONFIG> <OVERRIDES>
```
If not specified, `config` will default to [examples/configs/dpo.yaml](../../examples/configs/dpo.yaml).

## Configuration

Reinforcer allows users to configure DPO experiments using `yaml` config files. An example DPO configuration file can be found [here](../../examples/configs/dpo.yaml).

To override a value in the config, either update the value in the `yaml` file directly, or pass the override via the command line. For example:

```bash
uv run examples/run_dpo.py \
cluster.gpus_per_node=8 \
dpo.sft_loss_weight=0.1 \
dpo.preference_average_log_probs=True \
logger.wandb.name="dpo-dev-8-gpu"
```

**Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.

## Datasets

Each class representing a Reinforcer DPO dataset is expected to have the following attributes:
1. `formatted_ds`: The dictionary of formatted datasets. This dictionary should contain `train` and `validation` splits, and each split should conform to the format described below.
2. `task_spec`: The `TaskDataSpec` for this dataset. This should specify the name you choose for this dataset.

DPO datasets are expected to follow a specific format with three key fields:
- `prompt`: The input prompt/context
- `chosen_response`: The preferred/winning response
- `rejected_response`: The non-preferred/losing response

[data/hf_datasets/helpsteer3.py](../../nemo_reinforcer/data/hf_datasets/helpsteer3.py) provides an example of how to format data for DPO:

```python
def format_helpsteer3(data):
    response_1 = data["response1"]
    response_2 = data["response2"]
    overall_preference = data["overall_preference"]

    if overall_preference < 0:
        chosen = response_1
        rejected = response_2
    elif overall_preference == 0:
        # a tie yields identical chosen and rejected responses
        chosen = response_1
        rejected = response_1
    else:
        chosen = response_2
        rejected = response_1

    return {
        "prompt": data["context"],
        "chosen_response": chosen,
        "rejected_response": rejected,
    }
```

We also provide a [DPODataset](../../nemo_reinforcer/data/hf_datasets/dpo.py) class that is compatible with jsonl-formatted preference datasets. This class assumes train and validation datasets have been split and processed into the expected format offline. The jsonl files should consist of examples with `prompt`, `chosen_response`, and `rejected_response` keys.
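
For example, a single record in one of these jsonl files could be constructed as follows (the field values are illustrative):

```python
import json

# A hypothetical record in the schema DPODataset expects
record = {
    "prompt": "What is the capital of France?",
    "chosen_response": "The capital of France is Paris.",
    "rejected_response": "The capital of France is Lyon.",
}

# Each line of the .jsonl file is one such JSON object
print(json.dumps(record))
```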

## Adding Custom DPO Datasets

Adding a new DPO dataset is straightforward. Your custom dataset class should:
1. Implement the required format conversion in the constructor
2. Set up the appropriate `task_spec`

Here's a minimal example which simply re-keys an existing jsonl dataset:

```{testcode}
from datasets import load_dataset
from nemo_reinforcer.data.interfaces import TaskDataSpec
from docs.helpers import make_dpo_dataset

class CustomDPODataset:
    def preprocess_dataset(
        self,
        data,
        prompt_key: str = "context",
        chosen_key: str = "chosen",
        rejected_key: str = "rejected",
    ):
        return {
            "prompt": data[prompt_key],
            "chosen_response": data[chosen_key],
            "rejected_response": data[rejected_key],
        }

    def __init__(
        self,
        train_data_path: str,
        val_data_path: str,
        prompt_key: str,
        chosen_key: str,
        rejected_key: str,
    ):
        # Load and format your dataset
        fn_kwargs = {
            "prompt_key": prompt_key,
            "chosen_key": chosen_key,
            "rejected_key": rejected_key,
        }
        formatted_ds = {
            "train": load_dataset("json", data_files=train_data_path, split="train").map(
                self.preprocess_dataset,
                fn_kwargs=fn_kwargs,
            ),
            "validation": load_dataset("json", data_files=val_data_path, split="train").map(
                self.preprocess_dataset,
                fn_kwargs=fn_kwargs,
            ),
        }

        # Initialize task spec with dataset name
        self.task_spec = TaskDataSpec(
            task_name="custom_dpo",
        )
        self.formatted_ds = formatted_ds

# Create temporary files using helper function
train_file, val_file = make_dpo_dataset()

# Initialize dataset
dataset = CustomDPODataset(
    train_data_path=train_file.name,
    val_data_path=val_file.name,
    prompt_key="context",
    chosen_key="chosen",
    rejected_key="rejected",
)

# Test dataset properties
print(f"Task name: {dataset.task_spec.task_name}")
print(f"Train examples: {len(dataset.formatted_ds['train'])}")
print(f"Validation examples: {len(dataset.formatted_ds['validation'])}")
print(f"First train example prompt: {dataset.formatted_ds['train'][0]['prompt']}")
print(f"First train example chosen response: {dataset.formatted_ds['train'][0]['chosen_response']}")
print(f"First train example rejected response: {dataset.formatted_ds['train'][0]['rejected_response']}")
```

```{testoutput}
Task name: custom_dpo
Train examples: 2
Validation examples: 2
First train example prompt: What is 2+2?
First train example chosen response: 4
First train example rejected response: 5
```

## DPO-Specific Parameters

The DPO implementation in Reinforcer supports several key parameters that can be adjusted:

- `dpo.reference_policy_kl_penalty`: Controls the strength of the KL penalty term
- `dpo.preference_loss_weight`: Weight for the preference loss
- `dpo.sft_loss_weight`: Weight for the auxiliary SFT loss
- `dpo.preference_average_log_probs`: Whether to average log probabilities over tokens in the preference loss term
- `dpo.sft_average_log_probs`: Whether to average log probabilities over tokens in the SFT loss term

These parameters can be adjusted in the config file or via command-line overrides to optimize training for your specific use case.
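
As a rough sketch of how these knobs fit together (illustrative pseudocode only, not Reinforcer's actual implementation; the function and argument names are hypothetical):

```python
def sequence_logprob(token_logps, average=False):
    """Aggregate per-token log-probs into a sequence score.

    average=True mirrors the *_average_log_probs options: log-probs are
    averaged over tokens instead of summed, removing the bias toward
    shorter sequences that a plain sum introduces."""
    total = sum(token_logps)
    return total / len(token_logps) if average else total

def total_loss(preference_loss, sft_loss,
               preference_loss_weight=1.0, sft_loss_weight=0.0):
    # Weighted combination mirroring dpo.preference_loss_weight
    # and dpo.sft_loss_weight
    return preference_loss_weight * preference_loss + sft_loss_weight * sft_loss

print(sequence_logprob([-1.0, -2.0, -3.0]))                # -6.0
print(sequence_logprob([-1.0, -2.0, -3.0], average=True))  # -2.0
```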
41 changes: 41 additions & 0 deletions docs/helpers.py
@@ -0,0 +1,41 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import tempfile
import json


def make_dpo_dataset():
    train_file = tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False)
    val_file = tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False)

    # Write train data
    train_data = [
        {"context": "What is 2+2?", "chosen": "4", "rejected": "5"},
        {"context": "What is 3*3?", "chosen": "9", "rejected": "6"},
    ]
    for item in train_data:
        train_file.write(json.dumps(item) + "\n")
    train_file.flush()

    # Write validation data
    val_data = [
        {"context": "What is 4+4?", "chosen": "8", "rejected": "7"},
        {"context": "What is 5*5?", "chosen": "25", "rejected": "20"},
    ]
    for item in val_data:
        val_file.write(json.dumps(item) + "\n")
    val_file.flush()

    return train_file, val_file