Merged
Commits
57 commits
27050d5
fix: Set validation accuracy to mean of rewards to handle non-[0,1] r…
alexandery-nvidia Dec 11, 2025
4bdb142
add datasets dep
bxyu-nvidia Dec 12, 2025
91f5e19
Sort rollout outputs to match inputs order + gym bump
yfw Dec 12, 2025
0259167
copy over config from https://github.com/NVIDIA-NeMo/RL/pull/1625/fil…
bxyu-nvidia Dec 12, 2025
5e834bf
add tool parser plugin placeholder
bxyu-nvidia Dec 12, 2025
ef3449d
tool parser plugin
bxyu-nvidia Dec 13, 2025
bd44408
optional too parser plugin
bxyu-nvidia Dec 13, 2025
72c70b4
make fp32
bxyu-nvidia Dec 13, 2025
fd1d176
add assert
bxyu-nvidia Dec 13, 2025
2ed550f
print repr
bxyu-nvidia Dec 13, 2025
8924556
no thinking
bxyu-nvidia Dec 13, 2025
981871a
try no reasoning parser
bxyu-nvidia Dec 14, 2025
4898199
add print to another assert
bxyu-nvidia Dec 14, 2025
7e5c52f
revert fork
bxyu-nvidia Dec 14, 2025
a31742b
revert
bxyu-nvidia Dec 14, 2025
bca351b
dont print strict ignored
bxyu-nvidia Dec 14, 2025
675a56b
tweak config
bxyu-nvidia Dec 14, 2025
2c9e30a
tweak shapes
bxyu-nvidia Dec 14, 2025
daa44ab
train uses tp2
bxyu-nvidia Dec 14, 2025
d957175
tp4
bxyu-nvidia Dec 14, 2025
e9e9268
print repr of message
bxyu-nvidia Dec 14, 2025
068e2a3
reduce seq len
bxyu-nvidia Dec 14, 2025
9af5063
tp2
bxyu-nvidia Dec 14, 2025
b390520
add submit script
bxyu-nvidia Dec 14, 2025
703a70f
improve
bxyu-nvidia Dec 14, 2025
c61e8d8
add slurm commands
bxyu-nvidia Dec 14, 2025
87a4022
fix partition
bxyu-nvidia Dec 14, 2025
06579e8
fix overrides
bxyu-nvidia Dec 14, 2025
5b24517
try cd first
bxyu-nvidia Dec 14, 2025
dd7339a
try mount
bxyu-nvidia Dec 14, 2025
d0d0ba0
try prefetch venvs
bxyu-nvidia Dec 14, 2025
89349e9
revert
bxyu-nvidia Dec 14, 2025
375a321
try prefetch and no rebuild
bxyu-nvidia Dec 14, 2025
8603815
dont rebuild anything
bxyu-nvidia Dec 14, 2025
40e044b
feat: LoRA SFT support for DTensorV2 path (#1556)
samodi-nv Dec 13, 2025
7681a71
fix: swanlab logger error caused by `define_metric` (#1615)
Zeyi-Lin Dec 13, 2025
8d79a57
refactor: refactor env and data processor & add nemotron super 49b re…
yuki-97 Dec 13, 2025
13cf814
chore: update megatron dev (11/21/2025) / mbridge (11/28/2025) (#1568)
yaoyu-33 Dec 14, 2025
04d0647
bump gym
bxyu-nvidia Dec 15, 2025
0561072
try batch decode
bxyu-nvidia Dec 15, 2025
906eeea
remove hf hub offline
bxyu-nvidia Dec 15, 2025
76f9621
try fix decode fn missing on mock
bxyu-nvidia Dec 15, 2025
bb4d9b5
try with prefetch
bxyu-nvidia Dec 15, 2025
041b033
remove model config
bxyu-nvidia Dec 15, 2025
860b48d
rebuild venvs again
bxyu-nvidia Dec 15, 2025
3e7be40
clean
bxyu-nvidia Dec 15, 2025
df43379
ruff
bxyu-nvidia Dec 15, 2025
8e8505c
add license
bxyu-nvidia Dec 15, 2025
0cf6698
Bxyu/gym grpo tutorial dev (#1641)
bxyu-nvidia Dec 16, 2025
4f5532e
lint
bxyu-nvidia Dec 16, 2025
28b3ae9
bump gym
bxyu-nvidia Dec 16, 2025
957b1df
Merge branch 'main' of github.com:NVIDIA-NeMo/RL into bxyu/gym-grpo-t…
bxyu-nvidia Dec 16, 2025
1bb9572
remove merge artifacts
bxyu-nvidia Dec 16, 2025
4f28102
safe-squash
terrykong Dec 16, 2025
f5e0274
fix uv.lock
terrykong Dec 16, 2025
4615327
Merge branch 'bxyu/gym-grpo-tutorial' of github.com:NVIDIA-NeMo/RL in…
bxyu-nvidia Dec 16, 2025
16cae06
fork and pop reasoning
bxyu-nvidia Dec 16, 2025
2 changes: 1 addition & 1 deletion 3rdparty/Gym-workspace/Gym
Submodule Gym updated 72 files
+24 −377 CONTRIBUTING.md
+41 −38 README.md
+22 −122 docs/Makefile
+8 −5 docs/about/concepts/key-terminology.md
+36 −2 docs/conf.py
+2 −2 docs/contribute/development-setup.md
+41 −0 docs/contribute/environments/index.md
+224 −0 docs/contribute/environments/new-environment.md
+2 −0 docs/contribute/index.md
+2 −2 docs/contribute/rl-framework-integration/index.md
+1 −1 docs/get-started/detailed-setup.md
+34 −24 docs/index.md
+4 −1 docs/project.json
+98 −191 docs/reference/faq.md
+24 −24 docs/tutorials/creating-resource-server.md
+17 −8 docs/tutorials/index.md
+233 −0 docs/tutorials/nemo-rl-grpo/about-workplace-assistant.md
+90 −0 docs/tutorials/nemo-rl-grpo/gym-configuration.md
+155 −0 docs/tutorials/nemo-rl-grpo/index.md
+144 −0 docs/tutorials/nemo-rl-grpo/multi-node-training.md
+82 −0 docs/tutorials/nemo-rl-grpo/nemo-rl-configuration.md
+192 −0 docs/tutorials/nemo-rl-grpo/setup.md
+119 −0 docs/tutorials/nemo-rl-grpo/single-node-training.md
+3 −1 docs/tutorials/offline-training-w-rollouts.md
+0 −174 docs/tutorials/rl-training-with-nemo-rl.md
+11 −1 docs/versions1.json
+60 −9 nemo_gym/config_types.py
+8 −2 nemo_gym/dataset_orchestrator.py
+90 −35 nemo_gym/hf_utils.py
+159 −80 nemo_gym/train_data_utils.py
+6 −0 pyproject.toml
+6 −8 resources_servers/calendar/configs/calendar.yaml
+2 −3 resources_servers/code_gen/configs/code_gen.yaml
+2 −1 resources_servers/equivalence_llm_judge/configs/equivalence_llm_judge.yaml
+2 −0 resources_servers/equivalence_llm_judge/data/example_metrics.json
+50 −0 resources_servers/equivalence_llm_judge/data/example_openqa_metrics.json
+2 −2 resources_servers/example_multi_step/data/example_metrics.json
+3 −1 resources_servers/google_search/configs/google_search.yaml
+2 −0 resources_servers/google_search/data/example_metrics.json
+27 −6 resources_servers/google_search/data/train_metrics.json
+3 −1 resources_servers/instruction_following/configs/instruction_following.yaml
+2 −0 resources_servers/instruction_following/data/example_metrics.json
+2 −1 resources_servers/math_advanced_calculations/configs/math_advanced_calculations.yaml
+2 −0 resources_servers/math_advanced_calculations/data/example_metrics.json
+5 −7 resources_servers/math_with_judge/configs/math_stack_overflow.yaml
+5 −8 resources_servers/math_with_judge/configs/math_with_judge.yaml
+1 −0 resources_servers/math_with_judge/data/example_metrics.json
+2 −1 resources_servers/mcqa/configs/mcqa.yaml
+2 −0 resources_servers/mcqa/data/example_metrics.json
+1 −0 resources_servers/mcqa/data/example_with_template_metadata_metrics.json
+4 −1 resources_servers/mini_swe_agent/configs/mini_swe_agent.yaml
+1 −0 resources_servers/mini_swe_agent/data/example_metrics.json
+96 −0 resources_servers/reasoning_gym/README.md
+112 −0 resources_servers/reasoning_gym/app.py
+25 −0 resources_servers/reasoning_gym/configs/reasoning_gym.yaml
+7 −0 resources_servers/reasoning_gym/configs/resources_only.yaml
+5 −0 resources_servers/reasoning_gym/data/example.jsonl
+49 −0 resources_servers/reasoning_gym/data/example_metrics.json
+5 −0 resources_servers/reasoning_gym/data/example_rollouts.jsonl
+2 −0 resources_servers/reasoning_gym/requirements.txt
+314 −0 resources_servers/reasoning_gym/scripts/create_dataset.py
+138 −0 resources_servers/reasoning_gym/tests/test_app.py
+6 −1 resources_servers/structured_outputs/configs/structured_outputs_json.yaml
+1 −0 resources_servers/structured_outputs/data/example_metrics.json
+4 −7 resources_servers/workplace_assistant/configs/workplace_assistant.yaml
+2 −0 resources_servers/workplace_assistant/data/example_metrics.json
+4 −3 resources_servers/workplace_assistant/data/train_metrics.json
+4 −3 resources_servers/workplace_assistant/data/validation_metrics.json
+7 −1 responses_api_models/vllm_model/app.py
+18 −13 scripts/update_resource_servers.py
+1 −0 tests/unit_tests/test_train_data_utils.py
+156 −1 uv.lock
1 change: 1 addition & 0 deletions 3rdparty/Gym-workspace/setup.py
@@ -42,6 +42,7 @@
    "yappi",
    "ray[default]",
    "psutil",
    "datasets",
]

if src_dir.exists():
277 changes: 277 additions & 0 deletions examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
@@ -0,0 +1,277 @@
grpo:
  max_num_epochs: 1
  num_prompts_per_step: 64
  num_generations_per_prompt: 16
  max_rollout_turns: 1 # for multi-turn rollouts. Workplace assistant has 1 turn but can have up to 6 tool-calling steps
  max_num_steps: 1000000
  normalize_rewards: true
  use_leave_one_out_baseline: true
  val_period: 10
  val_at_start: true
  overlong_filtering: false
  max_val_samples: null # inferred from size of val dataset. for multi evals, repeat val ds via `num_repeats` in `ng_prepare_data`.
  val_batch_size: null
  seed: 42
  use_dynamic_sampling: false
  dynamic_sampling_max_gen_batches: 10
  batch_multiplier: 1
  reward_shaping:
    enabled: false
    overlong_buffer_length: 128
    overlong_buffer_penalty: 1
    max_response_length: ${policy.max_total_sequence_length}
  reward_scaling:
    enabled: false
    source_min: 0.0
    source_max: 1.0
    target_min: 0.0
    target_max: 1.0
  skip_reference_policy_logprobs_calculation: true

loss_fn:
  reference_policy_kl_penalty: 0
  reference_policy_kl_type: "k3"
  kl_input_clamp_value: 20.0
  kl_output_clamp_value: 10.0
  ratio_clip_min: 0.2
  ratio_clip_max: 0.2
  ratio_clip_c: null
  # (default off) loss formulation improvements (docs/guides/grpo.md#loss)
  use_on_policy_kl_approximation: false
  truncated_importance_sampling_ratio: null
  use_importance_sampling_correction: false
  token_level_loss: true

checkpointing:
  enabled: true
  metric_name: "val:accuracy"
  higher_is_better: true
  keep_top_k: 3
  save_period: 6 # Save a checkpoint every 6 steps
  checkpoint_must_save_by: null

policy:
  model_name: "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
  tokenizer:
    name: ${policy.model_name} # specify if you'd like to use a tokenizer different from the model's default
    chat_template_kwargs: null # can be used to pass kwargs to the chat template, e.g., enable_thinking=true
  hf_config_overrides: {}
  train_global_batch_size: ${mul:${grpo.num_prompts_per_step}, ${grpo.num_generations_per_prompt}} # Match the total rollouts per step
  train_micro_batch_size: 1
  logprob_batch_size: 1
  generation_batch_size: 32 # Only used when generating with the HF backend
  max_total_sequence_length: 8192
  precision: "bfloat16"
  logprob_chunk_size: null # Disabled to allow defer_fp32_logits: false

  dtensor_cfg:
    _v2: false
    enabled: false
    cpu_offload: False
    sequence_parallel: false
    activation_checkpointing: true
    tensor_parallel_size: 2
    context_parallel_size: 1
    custom_parallel_plan: null
    clear_cache_every_n_steps: null

  megatron_cfg:
    enabled: true
    bias_activation_fusion: false
    tensor_model_parallel_size: 2
    # We might want to consider setting this value higher (e.g. to 1) and raising the vllm generation max mem utilization
    empty_unused_memory_level: 0
    activation_checkpointing: true
    # train_iters needs to be large enough to cover all training steps
    train_iters: 100000
    expert_tensor_parallel_size: 1
    expert_model_parallel_size: 1
    pipeline_model_parallel_size: 1
    num_layers_in_first_pipeline_stage: null
    num_layers_in_last_pipeline_stage: null
    context_parallel_size: 1
    pipeline_dtype: ${policy.precision}
    sequence_parallel: false
    freeze_moe_router: true
    moe_router_dtype: "fp64"
    moe_router_load_balancing_type: "none" # "seq_aux_loss" causes logprob error divergence for grpo
    moe_router_bias_update_rate: 0.0 # by default, disable bias updates for grpo
    # gives ~20% training perf speedup with sequence packing
    apply_rope_fusion: True
    defer_fp32_logits: false
    moe_permute_fusion: false

    optimizer:
      optimizer: "adam"
      lr: 5.0e-6
      min_lr: 5.0e-7
      weight_decay: 0.01
      bf16: true
      fp16: false
      params_dtype: "float32"

      # adam
      adam_beta1: 0.9
      adam_beta2: 0.999
      adam_eps: 1e-8

      # sgd
      sgd_momentum: 0.9

      # distributed optimizer
      use_distributed_optimizer: true
      use_precision_aware_optimizer: true

      # optimizer cpu offload
      optimizer_cpu_offload: false
      optimizer_offload_fraction: 0.0

      clip_grad: ${policy.max_grad_norm}

    scheduler:
      start_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
      end_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
      weight_decay_incr_style: "constant"
      lr_decay_style: "constant"
      lr_decay_iters: 100000 # Must be greater than lr_warmup_iters
      lr_warmup_iters: 13
      lr_warmup_init: 5.0e-7
      override_opt_param_scheduler: true

    distributed_data_parallel_config:
      grad_reduce_in_fp32: false
      overlap_grad_reduce: true
      overlap_param_gather: true
      use_custom_fsdp: false
      data_parallel_sharding_strategy: "optim_grads_params"

    env_vars: null

  # See docs/design-docs/sequence-packing-and-dynamic-batching.md
  # for more details on dynamic batching and sequence packing.
  dynamic_batching:
    enabled: False
    train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
    logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
    sequence_length_round: 64

  sequence_packing:
    enabled: false
    train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
    logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
    algorithm: "modified_first_fit_decreasing"
    sequence_length_round: 64

  # makes the training sequence length divisible by the tensor parallel size
  # this is useful for sequence parallel training
  make_sequence_length_divisible_by: 1
  max_grad_norm: 1.0
  offload_optimizer_for_logprob: false # Only useful for non-colocated generation, since colocated generation always offloads the optimizer to CUDA before refit

  optimizer: null

  scheduler:
    - name: "torch.optim.lr_scheduler.ConstantLR"
      kwargs:
        factor: 1.0
        total_iters: 10000000000
    - milestones: []

  generation:
    backend: "vllm"
    max_new_tokens: ${policy.max_total_sequence_length}
    temperature: 1.0
    top_p: 1.0
    top_k: null
    stop_token_ids: null
    stop_strings: null
    vllm_cfg:
      async_engine: true
      precision: ${policy.precision}
      tensor_parallel_size: 1
      pipeline_parallel_size: 1
      enable_expert_parallel: false
      expert_parallel_size: 1
      gpu_memory_utilization: 0.8
      max_model_len: ${policy.max_total_sequence_length}
      enforce_eager: false
      use_deep_gemm: False
      num_last_layers_in_bf16: 0
      num_first_layers_in_bf16: 0
      kv_cache_dtype: "auto"
      expose_http_server: true
      skip_tokenizer_init: false
      tool_parser_plugin: ???
      http_server_serving_chat_kwargs:
        # Workplace assistant uses 26 tools, so we enable auto_tools.
        # For Nemotron Nano v2, we use the dedicated `nemotron_json` tool parser
        enable_auto_tools: true
        tool_parser: nemotron_json
    vllm_kwargs:
      compilation_config:
        # When enforce_eager is false, set ++policy.generation.vllm_kwargs.compilation_config.use_inductor=False for better accuracy.
        # With this flag, vLLM uses its custom CUDA kernels instead of the Triton kernels generated by torch.compile.
        # For more details, see the convergence issue https://github.com/NVIDIA-NeMo/RL/issues/998
        use_inductor: False
      # We need the Mamba cache to be set to fp32 for Nemotron Nano v2
      mamba_ssm_cache_dtype: "float32"
    colocated:
      # true: generation shares training GPUs
      # false: uses dedicated generation resources
      enabled: true
      # only relevant when enabled is false
      resources:
        gpus_per_node: null # Number of GPUs dedicated to generation when the cluster has a single node, i.e. cluster.num_nodes == 1
        num_nodes: null # Number of nodes dedicated to generation

data:
  # Using the prepared train and validation datasets (downloaded from HuggingFace and split 90/10)
  # Train: 1129 samples, Validation: 126 samples
  train_jsonl_fpath: 3rdparty/Gym-workspace/Gym/resources_servers/workplace_assistant/data/train.jsonl
  validation_jsonl_fpath: 3rdparty/Gym-workspace/Gym/resources_servers/workplace_assistant/data/validation.jsonl
  agent_name: workplace_assistant_simple_agent
  shuffle: true
  num_workers: 0

env:
  should_use_nemo_gym: true
  should_log_nemo_gym_responses: true # If you have low logging storage, set this to false
  nemo_gym: # This is passed into NeMo-Gym as the initial_global_config_dict
    config_paths:
      - responses_api_models/vllm_model/configs/vllm_model_for_training.yaml # Required! And it must be *for_training
      - resources_servers/workplace_assistant/configs/workplace_assistant.yaml
    workplace_assistant_simple_agent:
      responses_api_agents:
        simple_agent:
          max_steps: 6 # Workplace assistant allows up to 6 tool-calling steps per task
    policy_model:
      responses_api_models:
        vllm_model:
          # Disable reasoning!
          uses_reasoning_parser: false
          extra_body:
            chat_template_kwargs:
              enable_thinking: false

logger:
  log_dir: "logs/grpo-workplace-assistant-nemotron-nano-v2-9b" # Base directory for all logs
  num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
  wandb_enabled: true
  tensorboard_enabled: false
  mlflow_enabled: false # Disable MLflow logging
  swanlab_enabled: false
  monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
  wandb:
    project: "grpo-workplace-assistant"
    name: "nemotron-nano-v2-9b-workplace-assistant"
  tensorboard: {}
  mlflow:
    experiment_name: "grpo-workplace-assistant"
    run_name: "nemotron-nano-v2-9b-workplace-assistant"
  gpu_monitoring:
    collection_interval: 10 # How often to collect GPU usage metrics (in seconds)
    flush_interval: 10 # How often to flush GPU usage metrics to the loggers (in seconds)

cluster:
  gpus_per_node: 8
  num_nodes: 1 # Single node by default; set to 2+ for multi-node training
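The `${mul:...}` interpolations in this config resolve at load time via a multiplication resolver. As a quick sanity check of the batch arithmetic (a sketch with the values copied by hand from the YAML above, not parsed from it):

```python
# Sanity-check the GRPO batch arithmetic implied by the config above.
# Values are copied by hand from the YAML; nothing is parsed here.
num_prompts_per_step = 64
num_generations_per_prompt = 16
max_total_sequence_length = 8192
train_micro_batch_size = 1

# policy.train_global_batch_size = ${mul:${grpo.num_prompts_per_step}, ${grpo.num_generations_per_prompt}}
train_global_batch_size = num_prompts_per_step * num_generations_per_prompt
print(train_global_batch_size)  # 1024 rollouts per training step

# dynamic_batching.train_mb_tokens = ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
train_mb_tokens = max_total_sequence_length * train_micro_batch_size
print(train_mb_tokens)  # 8192 tokens per training micro-batch
```

So each step trains on 1024 rollouts (64 prompts x 16 generations), which is what the `# Match the total rollouts per step` comment refers to.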
49 changes: 49 additions & 0 deletions examples/nemo_gym/launch_nemo_gym_multinode_training.sh
@@ -0,0 +1,49 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# ----- PARAMETERS -----
# WANDB_API_KEY, EXP_NAME, NUM_ACTOR_NODES, REPO_LOCATION, CONTAINER_IMAGE_PATH, SLURM_ACCOUNT, SLURM_PARTITION

# ray.sub needs to be launched from the NeMo-RL root directory
cd "$REPO_LOCATION"

# Construct the command
read -r -d '' COMMAND <<EOF
cd ${REPO_LOCATION}

HF_HOME=$PWD/.cache/ \
WANDB_API_KEY=$WANDB_API_KEY \
uv run python examples/nemo_gym/run_grpo_nemo_gym.py \
    ++cluster.num_nodes=$NUM_ACTOR_NODES \
    ++logger.wandb.name=$EXP_NAME \
    ++logger.log_dir=results/$EXP_NAME \
    ++checkpointing.checkpoint_dir=results/$EXP_NAME \
    $@
EOF

echo -e "Running command:\n$COMMAND"

mount=$(findmnt -n -o TARGET --target .)

COMMAND=$COMMAND \
CONTAINER=$CONTAINER_IMAGE_PATH \
MOUNTS=$mount:$mount \
sbatch \
    --nodes=$NUM_ACTOR_NODES \
    --account=$SLURM_ACCOUNT \
    --partition=$SLURM_PARTITION \
    --time=4:0:0 \
    --job-name=$EXP_NAME \
    --gres=gpu:8 \
    ray.sub
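The `++key=value` arguments the script forwards via `$@` are dotted config overrides that get merged into the nested YAML config. A minimal sketch of the merge mechanics (`apply_override` is a hypothetical helper for illustration, not NeMo RL's actual override parser):

```python
# Minimal sketch of merging dotted overrides like ++cluster.num_nodes=2
# into a nested config dict. Hypothetical illustration only; NeMo RL's
# real parser (Hydra/OmegaConf-style) handles types and validation.
def apply_override(config: dict, override: str) -> None:
    key, _, value = override.lstrip("+").partition("=")
    parts = key.split(".")
    node = config
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    # keep it simple: only int-or-string values in this sketch
    node[parts[-1]] = int(value) if value.isdigit() else value

config = {"cluster": {"num_nodes": 1}, "logger": {"wandb": {"name": "default"}}}
for ov in ["++cluster.num_nodes=2", "++logger.wandb.name=my-exp"]:
    apply_override(config, ov)
print(config["cluster"]["num_nodes"])     # 2
print(config["logger"]["wandb"]["name"])  # my-exp
```

This is why any extra `++...` flags passed to the launch script flow through unchanged to `run_grpo_nemo_gym.py` and override the YAML defaults.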
3 changes: 3 additions & 0 deletions nemo_rl/models/generation/vllm/config.py
@@ -36,6 +36,9 @@ class VllmSpecificArgs(TypedDict):
    expose_http_server: NotRequired[bool]
    # These kwargs are passed to the vllm.LLM HTTP server Chat Completions endpoint config. Typically this will include things like tool parser, chat template, etc.
    http_server_serving_chat_kwargs: NotRequired[dict[str, Any]]
    # Miscellaneous top-level vLLM HTTP server arguments.
    # A filepath that can be imported to register a vLLM tool parser.
    tool_parser_plugin: NotRequired[str]


class VllmConfig(GenerationConfig):
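The diff above adds `tool_parser_plugin` as an optional key alongside the other HTTP-server fields. A rough shape of how a caller might populate these fields (a trimmed, hypothetical mirror of the TypedDict, using `total=False` in place of `NotRequired` so the sketch runs on older Python versions; the plugin path is made up):

```python
from typing import Any, TypedDict

# Trimmed, hypothetical mirror of VllmSpecificArgs; the real class
# lives in nemo_rl/models/generation/vllm/config.py.
class VllmSpecificArgsSketch(TypedDict, total=False):
    expose_http_server: bool
    http_server_serving_chat_kwargs: dict[str, Any]
    tool_parser_plugin: str  # filepath importable to register a vLLM tool parser

args: VllmSpecificArgsSketch = {
    "expose_http_server": True,
    "tool_parser_plugin": "plugins/nemotron_json_tool_parser.py",  # hypothetical path
    "http_server_serving_chat_kwargs": {
        "enable_auto_tools": True,
        "tool_parser": "nemotron_json",
    },
}
print(args["http_server_serving_chat_kwargs"]["tool_parser"])  # nemotron_json
```

This mirrors the YAML in the example config, where `vllm_cfg.tool_parser_plugin` points at the plugin file and `http_server_serving_chat_kwargs` selects the `nemotron_json` parser with auto tool choice enabled.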