Merged
60 commits
16640e8
nemo gym integration
cmunley1 Dec 17, 2025
6261758
couple updates
cmunley1 Dec 18, 2025
4105340
baseline without on policy correction
cmunley1 Dec 20, 2025
be5c156
readme
cmunley1 Dec 20, 2025
64b9ed4
wip
cmunley1 Dec 22, 2025
948869f
fixes
cmunley1 Jan 7, 2026
52a3140
readme
cmunley1 Jan 7, 2026
0e71cbb
cfg
cmunley1 Jan 7, 2026
3548099
small fix
cmunley1 Jan 7, 2026
8373899
docs
cmunley1 Jan 9, 2026
fe4bce6
fixes
cmunley1 Jan 15, 2026
facfb5a
remove flag
cmunley1 Jan 15, 2026
ac94e1b
multi env
cmunley1 Jan 16, 2026
32c5a6b
small fix
cmunley1 Jan 16, 2026
5619096
dataset index
cmunley1 Jan 16, 2026
04821b5
multinode example
cmunley1 Jan 17, 2026
52b2f5c
client and tests
cmunley1 Jan 17, 2026
0793c05
remove native tool parsing, use fastapi state
cmunley1 Jan 17, 2026
5f8ccc9
remove old code
cmunley1 Jan 17, 2026
743d5ea
enable IS
cmunley1 Jan 17, 2026
d98dd8a
remove logp diff tracking without is
cmunley1 Jan 17, 2026
a5f9166
restore
cmunley1 Jan 17, 2026
17b72c8
readme
cmunley1 Jan 17, 2026
18ffaa8
restore pyproject
cmunley1 Jan 17, 2026
cc503cb
readme
cmunley1 Jan 17, 2026
843938f
move submit
cmunley1 Jan 17, 2026
209b12e
config
cmunley1 Jan 17, 2026
2ec1a0f
Merge branch 'main' into cmunley1/nemo_gym_on_policy
sergiopaniego Jan 20, 2026
a8f7b36
draft docs
cmunley1 Jan 21, 2026
6893625
Merge branch 'cmunley1/nemo_gym_on_policy' of github.com:cmunley1/trl…
cmunley1 Jan 21, 2026
e883dcd
draft docs
cmunley1 Jan 21, 2026
2c7de07
docs update
cmunley1 Jan 21, 2026
aad21ee
ds cfg, submit update
cmunley1 Jan 22, 2026
06ab2a2
readme
cmunley1 Jan 22, 2026
cf9f177
rename train, update docs
cmunley1 Jan 22, 2026
7669c00
comment
cmunley1 Jan 22, 2026
df2f350
Merge branch 'main' into cmunley1/nemo_gym_on_policy
kashif Jan 23, 2026
3a455a9
Update trl/trainer/grpo_trainer.py
sergiopaniego Jan 23, 2026
f69a70a
Update trl/scripts/vllm_serve.py
cmunley1 Jan 26, 2026
56535f2
rename docs file
cmunley1 Jan 26, 2026
f1c6614
Merge branch 'main' into cmunley1/nemo_gym_on_policy
cmunley1 Jan 26, 2026
537f82e
Merge branch 'main' into cmunley1/nemo_gym_on_policy
sergiopaniego Jan 27, 2026
7b1fe8a
nemo gym trl edits
lbliii Jan 28, 2026
92227f0
Merge pull request #1 from lbliii/llane/nemo-gym-trl-edits
cmunley1 Jan 28, 2026
9f7f45f
Merge remote-tracking branch 'upstream/main' into cmunley1/nemo_gym_o…
cmunley1 Jan 29, 2026
4d6012e
lint
cmunley1 Jan 30, 2026
d5443eb
docs
cmunley1 Jan 30, 2026
2837bda
improve docs, rename train script
cmunley1 Jan 30, 2026
93d97a7
fixes based on review
cmunley1 Jan 30, 2026
b15ab63
subclass
cmunley1 Jan 30, 2026
6d7e8d0
config update
cmunley1 Jan 31, 2026
13c378c
docs
cmunley1 Jan 31, 2026
c5dcb5d
typo in submit
cmunley1 Jan 31, 2026
03ffa0b
Merge pull request #2 from cmunley1/cmunley1/ng-fix
cmunley1 Jan 31, 2026
b4678fb
improve nemo gym docs
cmunley1 Jan 31, 2026
a476ac5
update docs
cmunley1 Jan 31, 2026
5e70a33
rename project to server
cmunley1 Feb 2, 2026
a3f241e
vllm finish reason
cmunley1 Feb 2, 2026
82cd8d5
Merge branch 'main' into cmunley1/nemo_gym_on_policy
sergiopaniego Feb 4, 2026
e123a88
Update docs/source/nemo_gym.md
sergiopaniego Feb 4, 2026
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
Original file line number Diff line number Diff line change
@@ -117,6 +117,8 @@
title: MiniLLM
- local: nash_md_trainer
title: Nash-MD
- local: nemo_gym
title: NeMo Gym
- local: online_dpo_trainer
title: Online DPO
- local: orpo_trainer
1 change: 1 addition & 0 deletions docs/source/example_overview.md
@@ -61,6 +61,7 @@ Scripts are maintained in the [`trl/scripts`](https://github.com/huggingface/trl
| [`examples/scripts/kto.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/kto.py) | This script shows how to use the [`experimental.kto.KTOTrainer`] to fine-tune a model. |
| [`examples/scripts/mpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/mpo_vlm.py) | This script shows how to use MPO via the [`DPOTrainer`] to align a model based on preferences using the [HuggingFaceH4/rlaif-v_formatted](https://huggingface.co/datasets/HuggingFaceH4/rlaif-v_formatted) dataset and a set of loss weights with weights. |
| [`examples/scripts/nash_md.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/nash_md.py) | This script shows how to use the [`experimental.nash_md.NashMDTrainer`] to fine-tune a model. |
| [`examples/scripts/nemo_gym/train_multi_environment.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/nemo_gym/train_multi_environment.py) | This script shows how to use the [`GRPOTrainer`] to train language models in NVIDIA NeMo Gym environments. It supports multi-turn and tool-calling environments as well as multi-environment training. See the [NeMo Gym Integration](nemo_gym) guide for setup and usage. |
| [`examples/scripts/online_dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/online_dpo.py) | This script shows how to use the [`experimental.online_dpo.OnlineDPOTrainer`] to fine-tune a model. |
| [`examples/scripts/online_dpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/online_dpo_vlm.py) | This script shows how to use the [`experimental.online_dpo.OnlineDPOTrainer`] to fine-tune a a Vision Language Model. |
| [`examples/scripts/openenv/browsergym.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/openenv/browsergym.py) | Simple script to run GRPO training via the [`GRPOTrainer`] with OpenEnv's BrowserGym environment and vLLM for VLMs |
293 changes: 293 additions & 0 deletions docs/source/nemo_gym.md
@@ -0,0 +1,293 @@
# NeMo Gym Integration

NVIDIA NeMo Gym is a library for building RL environments for large language models. This integration enables training models in NeMo Gym environments using TRL's GRPOTrainer with vLLM server mode.

The integration supports multi-step and multi-turn rollouts, multi-environment training, and any NeMo Gym environment. The following environments have been thoroughly tested: workplace assistant, reasoning gym, MCQA, and math with judge.

## Why NeMo Gym

- **Production-Ready Scale**: Tested for frontier model training with diverse environments running in parallel across math, coding, tool use, reasoning, and more.
- **Multi-Verifier Training**: Supports algorithmic verification, LLM-as-a-judge, and custom verification logic in a single training run.
- **Decoupled Architecture**: Build agents and environments independently from the training loop, with no RL framework expertise required.
- **OpenAI-Compatible API**: All environments use the standardized OpenAI Responses API for seamless integration with vLLM, OpenAI models, and other endpoints.

## Available Environments

NeMo Gym provides training-ready environments across multiple domains, including but not limited to:

| Environment | Domain | Description |
|-------------|--------|-------------|
| Workplace Assistant | Agent | Multi-step tool calling in common office scenarios (calendar, email, and more) |
| Math with Judge | Math | Math problems with algorithmic or judge-based verification |
| Code Gen | Coding | Competitive programming problems with code execution |
| MCQA | Knowledge | Multiple-choice question answering |
| Instruction Following | Instruction Following | IFEval/IFBench style tasks |
| Reasoning Gym | Multiple | Single-step procedurally generated verifiable tasks across domains |

For a complete list of available training environments, refer to the [NeMo Gym repository](https://github.com/NVIDIA-NeMo/Gym#-available-resource-servers).

## Before You Start

Complete these one-time setup steps before running training.

### Install TRL and NeMo Gym

1. **Install TRL with vLLM extras**

```bash
cd trl/
uv venv
source .venv/bin/activate
uv sync --extra vllm
```

1. **Install NeMo Gym**

```bash
# deactivate trl venv
deactivate
git clone https://github.com/NVIDIA-NeMo/Gym.git
cd Gym
uv venv --python 3.12
source .venv/bin/activate
uv sync
```

### Prepare a Dataset

Many NeMo Gym datasets used to train Nemotron models are available on Hugging Face. Use `ng_prepare_data` to download and prepare datasets. This command:

- Downloads the dataset from Hugging Face
- Validates the data format
- Adds an `agent_ref` field to each example that tells NeMo Gym which agent server should handle that example

> **Note**: `train_multi_environment.py` adds the `agent_ref` field when loading datasets, so this step is optional if datasets are created another way.
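
As a rough illustration of what adding `agent_ref` involves, the sketch below fills in the field for any record that lacks it. The helper name `add_agent_ref` and its in-memory interface are hypothetical and are not taken from `train_multi_environment.py`; the agent type and name are the ones shown in the dataset example in this guide.

```python
import json


def add_agent_ref(lines, agent_name):
    """Return JSONL lines where every record has an `agent_ref` field.

    Hypothetical helper for illustration only. Records that already carry
    an `agent_ref` are left unchanged, and blank lines are dropped.
    """
    out = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        record.setdefault(
            "agent_ref",
            {"type": "responses_api_agents", "name": agent_name},
        )
        out.append(json.dumps(record))
    return out


raw = ['{"responses_create_params": {"input": []}}']
tagged = add_agent_ref(raw, "workplace_assistant_simple_agent")
print(tagged[0])
```

Using `setdefault` keeps any reference written by `ng_prepare_data` intact, so the same pass can be applied to prepared and unprepared files alike.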

1. **Set Hugging Face Token**

Create `env.yaml` in `Gym/` with your HF token:

```yaml
hf_token: <your_hf_token>
```

1. **Prepare Dataset**

```bash
# Enter Gym and activate the venv
cd Gym
source .venv/bin/activate

# Set config paths
config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
resources_servers/workplace_assistant/configs/workplace_assistant.yaml"

# Download data and prep for training
ng_prepare_data "+config_paths=[${config_paths}]" \
+output_dirpath=data/workplace_assistant \
+mode=train_preparation \
+should_download=true \
+data_source=huggingface
```

This creates `train.jsonl` and `validation.jsonl` files in `data/workplace_assistant/`.

To create a new environment, refer to the [environment creation guide](https://docs.nvidia.com/nemo/gym/latest/contribute/environments/new-environment.html). We suggest running an existing one first!

#### Dataset Format

NeMo Gym datasets are stored as JSONL. Each line contains a task with input messages, tool definitions, metadata such as ground truth for verification, and an agent server reference. The following example shows the workplace dataset structure. Metadata fields can differ between datasets, as long as the corresponding resources server uses the fields appropriately.

```json
{
"responses_create_params": {
"input": [
{"role": "system", "content": "..."},
{"role": "user", "content": "Move any of jinsoo's tasks that are in review to completed"}
],
"tools": [...],
"parallel_tool_calls": false,
"temperature": 1
},
"ground_truth": [
{"name": "project_management_update_task", "arguments": "{...}"},
...
],
"category": "workbench_project_management",
"environment_name": "workbench",
"agent_ref": {
"type": "responses_api_agents",
"name": "workplace_assistant_simple_agent"
}
}
```
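
Before training on a hand-built dataset, a quick structural check can catch malformed lines early. The sketch below is a minimal example under the assumption that `responses_create_params.input` and `agent_ref.name` are the fields worth checking; the function `check_record` and the choice of required keys are illustrative and not part of NeMo Gym or TRL.

```python
import json

# Keys assumed to be required for this illustration.
REQUIRED_TOP_LEVEL = ("responses_create_params", "agent_ref")


def check_record(record):
    """Return a list of problems found in one dataset record (empty if none)."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in record:
            problems.append(f"missing '{key}'")
    params = record.get("responses_create_params", {})
    if not params.get("input"):
        problems.append("'responses_create_params.input' is empty or missing")
    if "agent_ref" in record and not record["agent_ref"].get("name"):
        problems.append("'agent_ref.name' is empty or missing")
    return problems


line = (
    '{"responses_create_params": {"input": [{"role": "user", "content": "hi"}]}, '
    '"agent_ref": {"type": "responses_api_agents", '
    '"name": "workplace_assistant_simple_agent"}}'
)
print(check_record(json.loads(line)))  # prints []
```

Dataset-specific metadata such as `ground_truth` is deliberately not checked here, since those fields differ between resources servers.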

## Interactive Training

Use this workflow for development and testing on a single node.

### Set Up

1. **Update Environment Config**

Update `env.yaml` in `Gym/` to include model information:

```yaml
policy_base_url: http://127.0.0.1:8000/v1
policy_api_key: EMPTY
policy_model_name: Qwen/Qwen2.5-1.5B-Instruct
hf_token: ...
```

2. **Update Training Config**

Update `examples/scripts/nemo_gym/config.yaml` to point to the dataset generated above, and make any other optional modifications.
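
When changing the batch-related settings, it can help to check how they combine. The sketch below assumes TRL's GRPO defaults, where the generation batch size is `per_device_train_batch_size * num_processes * steps_per_generation` and `steps_per_generation` falls back to `gradient_accumulation_steps`. Treat this as an assumption to verify against the TRL version in use. The function name is hypothetical, and the example values are those in the provided `config.yaml` and `deepspeed_zero3.yaml`.

```python
def prompts_per_generation_batch(
    per_device_train_batch_size,
    gradient_accumulation_steps,
    num_generations,
    num_processes=1,
):
    """Number of unique prompts sampled per generation batch.

    Assumes steps_per_generation == gradient_accumulation_steps.
    """
    generation_batch_size = (
        per_device_train_batch_size * num_processes * gradient_accumulation_steps
    )
    if generation_batch_size % num_generations != 0:
        raise ValueError(
            f"generation batch size {generation_batch_size} is not divisible "
            f"by num_generations={num_generations}"
        )
    return generation_batch_size // num_generations


# Single training GPU with the values from config.yaml.
print(prompts_per_generation_batch(1, 8, 8))  # 1
# 32 training processes, as in deepspeed_zero3.yaml.
print(prompts_per_generation_batch(1, 8, 8, num_processes=32))  # 32
```

Under these assumptions, the single-GPU setup samples eight completions for one prompt per generation batch, so lowering `gradient_accumulation_steps` without also lowering `num_generations` would trip the divisibility check.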

### Run Training

The following steps use three terminals. They can also be run as background processes or in tmux.

1. **Start NeMo Gym Servers** (Terminal 1)

```bash
cd Gym/
source .venv/bin/activate

config_paths="resources_servers/workplace_assistant/configs/workplace_assistant.yaml,\
responses_api_models/vllm_model/configs/vllm_model_for_training.yaml"

ng_run "+config_paths=[${config_paths}]"
```

This starts:
- **Agent server**: Orchestrates rollouts using resources servers and model servers
- **Resources server**: Implements environment logic such as state management, tool implementations, and task verification
- **Model server**: Adapts vLLM server requests to support NeMo Gym agents and on-policy RL training while ensuring OpenAI API compatibility
- **Head server**: Manages the servers used in training and enables their discovery

1. **Start TRL vLLM Server on GPU 0** (Terminal 2)

```bash
cd trl/
source .venv/bin/activate
CUDA_VISIBLE_DEVICES=0 trl vllm-serve \
--model Qwen/Qwen2.5-1.5B-Instruct \
--max-model-len 16384 \
--host 0.0.0.0 \
--port 8000
```
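
When scripting these steps instead of using separate terminals, it may be useful to wait until the vLLM server port accepts connections before launching training. The following sketch uses only the Python standard library. The helper `wait_for_port` is hypothetical, and an open port is only a coarse readiness signal, not proof that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=600.0, interval_s=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)  # server not listening yet; retry
    return False


if __name__ == "__main__":
    # Host and port match the `trl vllm-serve` command above.
    ready = wait_for_port("127.0.0.1", 8000, timeout_s=3.0, interval_s=0.5)
    print("vLLM port open" if ready else "vLLM port not reachable yet")
```

In a real launch script the timeout would typically be much longer than the three seconds used in this demonstration.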

1. **Run Training on GPU 1** (Terminal 3)

```bash
source trl/.venv/bin/activate
cd trl/examples/scripts/nemo_gym
export WANDB_API_KEY=...
uv add omegaconf

CUDA_VISIBLE_DEVICES=1 python train_multi_environment.py --config config.yaml
```

## Multi-Node Training with Slurm

An example five-node training script is provided in `submit.sh`. Nodes one through four run the training algorithm, while node five runs vLLM inference for NeMo Gym agent rollouts.

1. **Configure the Script**

Update `submit.sh` with your Slurm account, partition, paths to your project directory, and updated training configs.

1. **Submit the Job**

```bash
sbatch submit.sh
```

1. **Monitor Training**

```bash
tail -f logs/<job_id>/*
```

> **Tip**: Set up wandb logging for detailed training metrics. For more details on TRL's vLLM integration, refer to the vLLM integration page.

## Multi-Environment Training

Train on multiple NeMo Gym environments simultaneously. This allows learning diverse capabilities (such as tool calling and math reasoning) in a single training run.

1. **Prepare Individual Datasets**

Prepare datasets for each environment. The workplace assistant dataset was prepared above. Now let's create a dataset for the mini sudoku environment implemented by the reasoning gym resources server in NeMo Gym:

```bash
cd Gym
source .venv/bin/activate
uv add reasoning-gym
cd resources_servers/reasoning_gym
python scripts/create_dataset.py \
--task mini_sudoku \
--size 2000 \
--seed 42 \
--output data/reasoning_gym/train_mini_sudoku.jsonl

python scripts/create_dataset.py \
--task mini_sudoku \
--size 50 \
--seed 24 \
--output data/reasoning_gym/val_mini_sudoku.jsonl
```

1. **Create Combined Dataset**

Combine datasets into a single file with tasks from both environments:

```bash
cat data/workplace_assistant/train_workplace.jsonl data/reasoning_gym/train_mini_sudoku.jsonl | shuf > train_multi_env.jsonl
```

> **Tip**: Ensure datasets are the same size before shuffling for an even blend of tasks. Repeat for the validation dataset.
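
One possible way to get an even blend is to downsample every dataset to the size of the smallest one before shuffling. The sketch below shows that approach with a fixed seed for reproducibility. The function names are hypothetical, and downsampling is an assumption here; upsampling the smaller dataset would be an equally valid choice.

```python
import random


def read_jsonl(path):
    """Read non-empty lines from a JSONL file without parsing them."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


def blend_equal(datasets, seed=42):
    """Downsample each dataset to the smallest size, then shuffle the union."""
    rng = random.Random(seed)
    n = min(len(d) for d in datasets)
    mixed = []
    for d in datasets:
        mixed.extend(rng.sample(d, n))  # sample without replacement
    rng.shuffle(mixed)
    return mixed


# In-memory stand-ins for the two JSONL files.
workplace = [f'{{"env": "workplace", "id": {i}}}' for i in range(5)]
sudoku = [f'{{"env": "sudoku", "id": {i}}}' for i in range(3)]
mixed = blend_equal([workplace, sudoku])
print(len(mixed))  # 6
```

With real files, the lists would come from `read_jsonl(...)` and the result would be written back out one line per record.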

1. **Update Training Config**

Create `config_multi_env.yaml` pointing to the combined dataset:

```yaml
model_name: "Qwen/Qwen3-4B-Instruct-2507"

dataset_path: "/path/to/data/train_multi_env.jsonl"
eval_dataset_path: "/path/to/data/val_multi_env.jsonl"

task: "workplace-sudoku" # used in wandb run name
output_dir: "outputs/nemo_gym_multi_env"

# ... rest of config same
```

1. **Update ng_run**

Whether training interactively or via Slurm, update the `ng_run` command to include config files from each resources server:

```bash
cd Gym
source .venv/bin/activate

config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
resources_servers/workplace_assistant/configs/workplace_assistant.yaml,\
resources_servers/reasoning_gym/configs/reasoning_gym.yaml"

ng_run "+config_paths=[${config_paths}]" +head_server.host=0.0.0.0
```

This starts servers for both environments. The training script automatically routes each example to the correct agent server based on its `agent_ref` field.

1. **Run Training**

Update the Slurm submission script to use the new training config and both `ng_run` resources server configs, then submit the job as before.

The training script reads `agent_ref` from each example's metadata, routes requests to the correct NeMo Gym agent server, and handles different agents and environments in the same batch.
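
The routing idea can be sketched as grouping a mixed batch by agent name while remembering each example's original position. This is an illustration only, not the code in `train_multi_environment.py`; the function `group_by_agent` and the agent name `reasoning_gym_simple_agent` are hypothetical.

```python
from collections import defaultdict


def group_by_agent(batch):
    """Map agent name -> list of (index, example) pairs.

    Keeping the index lets results be placed back in batch order later.
    """
    groups = defaultdict(list)
    for idx, example in enumerate(batch):
        groups[example["agent_ref"]["name"]].append((idx, example))
    return dict(groups)


batch = [
    {"agent_ref": {"type": "responses_api_agents", "name": "workplace_assistant_simple_agent"}},
    {"agent_ref": {"type": "responses_api_agents", "name": "reasoning_gym_simple_agent"}},
    {"agent_ref": {"type": "responses_api_agents", "name": "workplace_assistant_simple_agent"}},
]
for name, items in group_by_agent(batch).items():
    print(name, [idx for idx, _ in items])
```

Each group could then be sent to its agent server, with the stored indices used to reassemble rewards and completions in the order the trainer expects.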

## Resources

- [NeMo Gym GitHub](https://github.com/NVIDIA-NeMo/Gym)
- [NeMo Gym Documentation](https://docs.nvidia.com/nemo/gym/latest/)
- [Training Script](https://github.com/huggingface/trl/blob/main/examples/scripts/nemo_gym/train_multi_environment.py)
- [TRL GRPO Trainer](grpo_trainer)
5 changes: 5 additions & 0 deletions examples/scripts/nemo_gym/README.md
@@ -0,0 +1,5 @@
# Post-training with NeMo Gym and TRL

This integration supports training language models in NeMo Gym environments using TRL GRPO. Both single-step and multi-step tasks are supported, including multi-environment training. NeMo Gym orchestrates rollouts and returns token IDs and logprobs to TRL through the rollout function for training. Currently, this integration is only supported through TRL's vLLM server mode.

Check out the docs page `docs/source/nemo_gym.md` for a guide.
37 changes: 37 additions & 0 deletions examples/scripts/nemo_gym/config.yaml
@@ -0,0 +1,37 @@
# Model
model_name: "Qwen/Qwen2.5-1.5B-Instruct"

# Data
dataset_path: "/home/ubuntu/Gym/resources_servers/workplace_assistant/data/train.jsonl"
eval_dataset_path: "/home/ubuntu/Gym/resources_servers/workplace_assistant/data/validation.jsonl"

# Logging
output_dir: "outputs/nemo_gym"
task: "workplace" # just used in wandb run name
report_to: "wandb"
project_name: "trl-nemo-gym"
log_completions: true
num_completions_to_print: 2

# Training hyperparameters
learning_rate: 1.0e-5
max_steps: 1000
num_generations: 8
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
max_completion_length: 16384
warmup_steps: 5
lr_scheduler_type: "linear"
optim: "adamw_torch_fused"
weight_decay: 0.0
vllm_importance_sampling_correction: true

# Inference sampling parameters
temperature: 1.0
top_p: 0.999

# Checkpointing and Eval
save_steps: 10
eval_strategy: "steps"
eval_steps: 10

22 changes: 22 additions & 0 deletions examples/scripts/nemo_gym/deepspeed_zero3.yaml
@@ -0,0 +1,22 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
deepspeed_multinode_launcher: standard
offload_optimizer_device: none
offload_param_device: none
zero3_init_flag: true
zero3_save_16bit_model: true
zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 4
num_processes: 32
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false