Merged
2 changes: 1 addition & 1 deletion .github/workflows/cicd-main.yml
@@ -128,7 +128,7 @@ jobs:
run: |
pip install uv
cd docs/
-uv run --group docs sphinx-build . _build/html
+uv run --group docs sphinx-build --fail-on-warning --builder html . _build/html

build-container:
if: ${{ needs.pre-flight.outputs.test_level != 'none' }}
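
The stricter docs build above can be reproduced locally before pushing. A sketch, assuming `uv` is available and the commands run from the repository root, mirroring the workflow step:

```sh
# Mirror the CI docs job: --fail-on-warning promotes every Sphinx
# warning (e.g. a broken cross-reference) to a hard build failure.
pip install uv
cd docs/
uv run --group docs sphinx-build --fail-on-warning --builder html . _build/html
```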
30 changes: 15 additions & 15 deletions README.md
@@ -3,18 +3,18 @@
<!-- markdown all in one -->
- [Nemo-Reinforcer: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to \>100B Parameters, scaling from 1 GPU to 100s](#nemo-reinforcer-a-scalable-and-efficient-post-training-library-for-models-ranging-from-tiny-to-100b-parameters-scaling-from-1-gpu-to-100s)
- [Features](#features)
-- [Prerequisuites](#prerequisuites)
+- [Prerequisites](#prerequisites)
- [Quick start](#quick-start)
- [GRPO](#grpo)
-- [Single Node](#single-node)
-- [Multi-node](#multi-node)
+- [Single Node](#grpo-single-node)
+- [Multi-node](#grpo-multi-node)
- [GRPO Qwen2.5-32B](#grpo-qwen25-32b)
- [SFT](#sft)
-- [Single Node](#single-node-1)
-- [Multi-node](#multi-node-1)
+- [Single Node](#sft-single-node)
+- [Multi-node](#sft-multi-node)
- [DPO](#dpo)
-- [Single Node](#single-node-2)
-- [Multi-node](#multi-node-2)
+- [Single Node](#dpo-single-node)
+- [Multi-node](#dpo-multi-node)
- [Cluster Start](#cluster-start)
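
The renamed headings resolve the anchor collisions because GitHub and MyST derive anchors from heading text, de-duplicating repeats with `-1`, `-2`, ... suffixes. A rough sketch of that slugification (an approximation, not the exact algorithm either tool uses; `slugify` is a hypothetical helper, not part of the repo):

```sh
# Approximate heading-to-anchor slugification: lowercase, strip
# punctuation, turn spaces into hyphens. Duplicate headings get
# numeric suffixes, which is why three "Single Node" headings
# previously produced #single-node, #single-node-1, #single-node-2.
slugify() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | sed -e 's/[^a-z0-9 -]//g' -e 's/ /-/g'
}

slugify "GRPO Single Node"   # -> grpo-single-node
```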

**Nemo-Reinforcer** is a scalable and efficient post-training library designed for models ranging from tiny to over 100 billion parameters, scaling from 1 GPU to thousands.
@@ -48,7 +48,7 @@ What you can expect:
- 🔜 **Megatron Inference** - Support Megatron Inference for day-0 support for new megatron models
- 🔜 **MoE Models** - Support DeepseekV3 and Llama4

-## Prerequisuites
+## Prerequisites

```sh
# For faster setup and environment isolation, we use `uv`
@@ -73,7 +73,7 @@ pip install uv

We have a reference GRPO experiment config set up trained for math benchmarks using the [OpenInstructMath2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset.

-#### Single Node
+#### GRPO Single Node

To run GRPO on a single GPU for `Qwen/Qwen2.5-1.5B`:

@@ -101,7 +101,7 @@ uv run python examples/run_grpo_math.py \
logger.num_val_samples_to_print=10 \
```

-#### Multi-node
+#### GRPO Multi-node

```sh
# Run from the root of NeMo-Reinforcer repo
@@ -149,7 +149,7 @@ sbatch \

We provide a sample SFT experiment that uses the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).

-#### Single Node
+#### SFT Single Node

The default SFT experiment is configured to run on a single GPU. To launch the experiment,

@@ -171,7 +171,7 @@

Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden.

-#### Multi-node
+#### SFT Multi-node

```sh
# Run from the root of NeMo-Reinforcer repo
@@ -194,7 +194,7 @@ sbatch \

We provide a sample DPO experiment that uses the [HelpSteer3 dataset](https://huggingface.co/datasets/nvidia/HelpSteer3) for preference-based training.

-#### Single Node
+#### DPO Single Node

The default DPO experiment is configured to run on a single GPU. To launch the experiment:

@@ -224,9 +224,9 @@ uv run python examples/run_dpo.py \
logger.wandb.name="llama-dpo-sft"
```

-Refer to [dpo.yaml](examples/configs/dpo.yaml) for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the [DPO documentation](docs/guides/dpo.md).
+Refer to [dpo.yaml](../examples/configs/dpo.yaml) for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the [DPO documentation](docs/guides/dpo.md).

-#### Multi-node
+#### DPO Multi-node

For distributed DPO training across multiple nodes, modify the following script for your use case:

2 changes: 1 addition & 1 deletion docs/conf.py
@@ -53,7 +53,7 @@
"fieldlist", # Enables field lists for metadata like :author: Name
"tasklist", # Adds support for GitHub-style task lists with [ ] and [x]
]
-myst_heading_anchors = 4 # Generates anchor links for headings up to level 4
+myst_heading_anchors = 5 # Generates anchor links for headings up to level 5

# -- Options for Autodoc2 ---------------------------------------------------
sys.path.insert(0, os.path.abspath(".."))
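
MyST only generates anchors for headings at or above the configured depth, so deeper headings need a higher `myst_heading_anchors` value to be linkable. A minimal sketch of the relevant `docs/conf.py` settings (the surrounding `extensions` entry is an assumption; only `myst_heading_anchors` appears in this diff):

```python
# docs/conf.py (sketch): enable MyST and auto-generate heading anchors.
extensions = ["myst_parser"]

# Generate slug anchors (e.g. #grpo-single-node) for headings up to
# level 5, so deeply nested "#####" headings remain cross-linkable.
myst_heading_anchors = 5
```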
Empty file removed docs/design-docs/gpu-logger.md
12 changes: 0 additions & 12 deletions docs/design-docs/index.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/guides/grpo.md
@@ -151,7 +151,7 @@ To enable the on-policy KL approximation, set the config `use_on_policy_kl_appro


#### Importance Sampling Correction
-The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. As noted in [Adding New Models](../adding_new_models.md#understanding-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function.
+The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. As noted in [Adding New Models](../adding-new-models.md#understanding-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function.

Let $f_\theta(x) = \min \Big(\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}A_t, \text{clip} \big( \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}, 1 - \varepsilon, 1 + \varepsilon \big) A_t \Big)$ represent the first term of loss function. Then,

9 changes: 0 additions & 9 deletions docs/guides/index.md

This file was deleted.