Merged
2 changes: 1 addition & 1 deletion .github/workflows/cicd-main.yml
@@ -128,7 +128,7 @@ jobs:
run: |
pip install uv
cd docs/
-uv run --group docs sphinx-build . _build/html
+uv run --group docs sphinx-build --fail-on-warning --builder html . _build/html

build-container:
if: ${{ needs.pre-flight.outputs.test_level != 'none' }}
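
The stricter docs build above can be reproduced locally before pushing. A sketch, assuming `uv` is available and the commands run from the repository root, mirroring the workflow step:

```sh
# Mirror the CI docs job: --fail-on-warning promotes every Sphinx
# warning (e.g. a broken cross-reference) to a hard build failure.
pip install uv
cd docs/
uv run --group docs sphinx-build --fail-on-warning --builder html . _build/html
```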
30 changes: 15 additions & 15 deletions README.md
@@ -3,18 +3,18 @@
<!-- markdown all in one -->
- [Nemo-Reinforcer: A Scalable and Efficient Post-Training Library for Models Ranging from tiny to \>100B Parameters, scaling from 1 GPU to 100s](#nemo-reinforcer-a-scalable-and-efficient-post-training-library-for-models-ranging-from-tiny-to-100b-parameters-scaling-from-1-gpu-to-100s)
- [Features](#features)
-- [Prerequisuites](#prerequisuites)
+- [Prerequisites](#prerequisites)
- [Quick start](#quick-start)
- [GRPO](#grpo)
-- [Single Node](#single-node)
-- [Multi-node](#multi-node)
+- [Single Node](#grpo-single-node)
+- [Multi-node](#grpo-multi-node)
- [GRPO Qwen2.5-32B](#grpo-qwen25-32b)
- [SFT](#sft)
-- [Single Node](#single-node-1)
-- [Multi-node](#multi-node-1)
+- [Single Node](#sft-single-node)
+- [Multi-node](#sft-multi-node)
- [DPO](#dpo)
-- [Single Node](#single-node-2)
-- [Multi-node](#multi-node-2)
+- [Single Node](#dpo-single-node)
+- [Multi-node](#dpo-multi-node)
- [Cluster Start](#cluster-start)
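
The renamed headings resolve the anchor collisions because GitHub and MyST derive anchors from heading text, de-duplicating repeats with `-1`, `-2`, ... suffixes. A rough sketch of that slugification (an approximation, not the exact algorithm either tool uses; `slugify` is a hypothetical helper, not part of the repo):

```sh
# Approximate heading-to-anchor slugification: lowercase, strip
# punctuation, turn spaces into hyphens. Duplicate headings get
# numeric suffixes, which is why three "Single Node" headings
# previously produced #single-node, #single-node-1, #single-node-2.
slugify() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | sed -e 's/[^a-z0-9 -]//g' -e 's/ /-/g'
}

slugify "GRPO Single Node"   # -> grpo-single-node
```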

**Nemo-Reinforcer** is a scalable and efficient post-training library designed for models ranging from tiny to over 100 billion parameters, scaling from 1 GPU to thousands.
@@ -48,7 +48,7 @@ What you can expect:
- 🔜 **Megatron Inference** - Support Megatron Inference for day-0 support for new megatron models
- 🔜 **MoE Models** - Support DeepseekV3 and Llama4

-## Prerequisuites
+## Prerequisites

```sh
# For faster setup and environment isolation, we use `uv`
@@ -73,7 +73,7 @@ pip install uv

We have a reference GRPO experiment config set up trained for math benchmarks using the [OpenInstructMath2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset.

-#### Single Node
+#### GRPO Single Node

To run GRPO on a single GPU for `Qwen/Qwen2.5-1.5B`:

@@ -101,7 +101,7 @@ uv run python examples/run_grpo_math.py \
logger.num_val_samples_to_print=10 \
```

-#### Multi-node
+#### GRPO Multi-node

```sh
# Run from the root of NeMo-Reinforcer repo
@@ -149,7 +149,7 @@ sbatch \

We provide a sample SFT experiment that uses the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).

-#### Single Node
+#### SFT Single Node

The default SFT experiment is configured to run on a single GPU. To launch the experiment,

@@ -171,7 +171,7 @@

Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden.

-#### Multi-node
+#### SFT Multi-node

```sh
# Run from the root of NeMo-Reinforcer repo
@@ -194,7 +194,7 @@ sbatch \

We provide a sample DPO experiment that uses the [HelpSteer3 dataset](https://huggingface.co/datasets/nvidia/HelpSteer3) for preference-based training.

-#### Single Node
+#### DPO Single Node

The default DPO experiment is configured to run on a single GPU. To launch the experiment:

@@ -224,9 +224,9 @@ uv run python examples/run_dpo.py \
logger.wandb.name="llama-dpo-sft"
```

-Refer to [dpo.yaml](examples/configs/dpo.yaml) for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the [DPO documentation](docs/guides/dpo.md).
+Refer to [dpo.yaml](../examples/configs/dpo.yaml) for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the [DPO documentation](docs/guides/dpo.md).

-#### Multi-node
+#### DPO Multi-node

For distributed DPO training across multiple nodes, modify the following script for your use case:

2 changes: 1 addition & 1 deletion docs/conf.py
@@ -53,7 +53,7 @@
"fieldlist", # Enables field lists for metadata like :author: Name
"tasklist", # Adds support for GitHub-style task lists with [ ] and [x]
]
-myst_heading_anchors = 4 # Generates anchor links for headings up to level 4
+myst_heading_anchors = 5 # Generates anchor links for headings up to level 5

# -- Options for Autodoc2 ---------------------------------------------------
sys.path.insert(0, os.path.abspath(".."))
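
MyST only generates anchors for headings at or above the configured depth, so deeper headings need a higher `myst_heading_anchors` value to be linkable. A minimal sketch of the relevant `docs/conf.py` settings (the surrounding `extensions` entry is an assumption; only `myst_heading_anchors` appears in this diff):

```python
# docs/conf.py (sketch): enable MyST and auto-generate heading anchors.
extensions = ["myst_parser"]

# Generate slug anchors (e.g. #grpo-single-node) for headings up to
# level 5, so deeply nested "#####" headings remain cross-linkable.
myst_heading_anchors = 5
```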
Empty file removed docs/design-docs/gpu-logger.md
12 changes: 0 additions & 12 deletions docs/design-docs/index.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/guides/grpo.md
@@ -151,7 +151,7 @@ To enable the on-policy KL approximation, set the config `use_on_policy_kl_appro


#### Importance Sampling Correction
-The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. As noted in [Adding New Models](../adding_new_models.md#understanding-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function.
+The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. As noted in [Adding New Models](../adding-new-models.md#understanding-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function.

Let $f_\theta(x) = \min \Big(\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}A_t, \text{clip} \big( \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}, 1 - \varepsilon, 1 + \varepsilon \big) A_t \Big)$ represent the first term of loss function. Then,

9 changes: 0 additions & 9 deletions docs/guides/index.md

This file was deleted.