4 changes: 4 additions & 0 deletions .gitignore
@@ -46,3 +46,7 @@ code_snapshots*/
# Runtime env
*runtime_env.yaml
!default_runtime_env.yaml

# Local environment variables
.env
.envrc
2 changes: 1 addition & 1 deletion .gitmodules
@@ -1,6 +1,6 @@
[submodule "3rdparty/Megatron-LM"]
path = 3rdparty/Megatron-LM-workspace/Megatron-LM
url = https://github.com/terrykong/Megatron-LM.git
url = https://github.com/ahmadki/Megatron-LM.git
branch = yuya/nemo-rl-use-dev
shallow = true
[submodule "3rdparty/Megatron-Bridge"]
2 changes: 1 addition & 1 deletion 3rdparty/Automodel-workspace/Automodel
Submodule Automodel updated 293 files
2 changes: 1 addition & 1 deletion 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge
Submodule Megatron-Bridge updated 136 files
3 changes: 2 additions & 1 deletion 3rdparty/Megatron-Bridge-workspace/setup.py
@@ -37,11 +37,12 @@
"pyyaml>=6.0.2",
"tqdm>=4.67.1",
"hydra-core>1.3,<=1.3.2",
"megatron-core[dev,mlm]>=0.15.0a0,<0.16.0",
"megatron-core[dev,mlm]>=0.15.0a0,<0.17.0",
"qwen-vl-utils",
"transformer-engine[pytorch]>=2.9.0a0,<2.10.0",
"mamba-ssm",
"nvidia-resiliency-ext",
"transformers>=4.57.1",
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify transformers version 4.57.1 availability and qwen3-next requirements.

The dependency on transformers>=4.57.1 has been added. Please confirm:

  • This version exists and is available on PyPI
  • This specific version is required for qwen3-next model support
  • There are no known compatibility issues with other dependencies

🌐 Web query:

What is the latest version of transformers library on PyPI?

💡 Result:

The latest version on PyPI is transformers 4.57.1 — uploaded Oct 14, 2025. [1]

Sources
[1] PyPI — transformers release files (transformers-4.57.1).



---

🏁 Script executed:

```shell
#!/bin/bash
# Search for qwen3-next references and version rationale in codebase
rg -i 'qwen3-next' -A 3 -B 3
```

Length of output: 5094


🏁 Script executed:

```shell
#!/bin/bash
# Check full setup.py to see all dependencies and constraints
cat 3rdparty/Megatron-Bridge-workspace/setup.py
```

Length of output: 3891



---

🏁 Script executed:

```shell
#!/bin/bash
# Check if Megatron-Bridge submodule has pyproject.toml and its transformers spec
cat 3rdparty/Megatron-Bridge/pyproject.toml 2>/dev/null | grep -A 50 dependencies
```

Length of output: 0


🏁 Script executed:

```shell
#!/bin/bash
# Search for any comments or documentation about transformers version in codebase
rg 'transformers' -i --type yaml --type py -B 2 -A 2 | grep -i 'version\|4\.57\|qwen' | head -30
```

Length of output: 2294



---

🏁 Script executed:

```shell
#!/bin/bash
# View full test_transformer_memory_regression.py to understand memory regression issue
cat tests/unit/test_transformer_memory_regression.py
```

Length of output: 3092


🏁 Script executed:

```shell
#!/bin/bash
# Search for skip decorators or conditionals around this test
rg 'test_transformers_version_memory_regression' -B 10 -A 15
```

Length of output: 2446


🏁 Script executed:

```shell
#!/bin/bash
# Check for any recent changes or comments about transformers version bump
git log --oneline --all -S 'transformers>=4.57.1' 2>/dev/null | head -5 || echo "Git log not available"
```

Length of output: 58


Remove test_transformer_memory_regression.py and reinstate disabled nightly test per the documented cleanup instructions.

The transformers>=4.57.1 constraint in setup.py (added for qwen3-next support) will immediately fail tests/unit/test_transformer_memory_regression.py, which enforces version [4.54, 4.56). This is actually good news: the memory regression is fixed in 4.57.1. However, the cleanup steps documented in the test must still be completed:

  1. Delete tests/unit/test_transformer_memory_regression.py
  2. Reinstate the disabled nightly test: tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh
  3. Update GitHub issue: Dtensor memory regression with TP and long sequence length present in transformers>=4.54,<4.56 #1343
🤖 Prompt for AI Agents
In 3rdparty/Megatron-Bridge-workspace/setup.py around line 45, the bump to
"transformers>=4.57.1" makes the existing unit test
test_transformer_memory_regression.py invalid; follow the documented cleanup:
delete tests/unit/test_transformer_memory_regression.py, re-enable the
previously disabled nightly test
tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh
(restore it to the test matrix), and update the referenced GitHub issue
https://github.com/NVIDIA-NeMo/RL/issues/1343 to reflect that the regression is
fixed and cleanup completed.
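The version window the review refers to can be sketched as a small guard. This is a hypothetical reconstruction; the actual tests/unit/test_transformer_memory_regression.py is not shown in this diff:

```python
# Hypothetical sketch of the [4.54, 4.56) version-window check the review
# describes; the real test file is not part of this diff.

def in_regression_window(version: str) -> bool:
    """True if a transformers version string falls in [4.54, 4.56),
    the window where the DTensor memory regression was observed."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (4, 54) <= (major, minor) < (4, 56)

# The setup.py bump to >=4.57.1 lands outside the window, so the old
# failure condition no longer applies and the test can be deleted.
print(in_regression_window("4.57.1"))  # False
print(in_regression_window("4.55.0"))  # True
```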

"causal-conv1d",
]

1 change: 1 addition & 0 deletions examples/configs/grpo_math_1B.yaml
@@ -157,6 +157,7 @@ policy:
overlap_param_gather: true
use_custom_fsdp: false
data_parallel_sharding_strategy: "optim_grads_params"
average_in_collective: true

fp8_cfg: null

4 changes: 2 additions & 2 deletions examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml
@@ -1,6 +1,6 @@
defaults:
- ../../grpo_math_1B.yaml
- grpo-deepscaler-1.5b-8K.yaml
- ../../grpo_math_1B.yaml
- grpo-deepscaler-1.5b-8K.yaml
loss_fn:
reference_policy_kl_penalty: 0.001
ratio_clip_max: 0.28
4 changes: 2 additions & 2 deletions examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml
@@ -1,6 +1,6 @@
defaults:
- ../../grpo_math_1B.yaml
- grpo-deepscaler-1.5b-8K.yaml
- ../../grpo_math_1B.yaml
- grpo-deepscaler-1.5b-8K.yaml
loss_fn:
reference_policy_kl_penalty: 0.0001
ratio_clip_max: 0.28
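The reorder in these two recipes matters because later entries in the defaults list override earlier ones, so placing grpo-deepscaler-1.5b-8K.yaml after the base grpo_math_1B.yaml lets the recipe's values win. A minimal later-wins deep merge with plain dicts illustrates the idea (made-up values; this is not the actual config loader):

```python
# Illustrative deep merge where the later config overrides the earlier one,
# mirroring why the defaults order was swapped. Values are made up.

def merge(base: dict, override: dict) -> dict:
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)  # recurse into nested sections
        else:
            out[key] = value                   # later config wins
    return out

base_cfg = {"loss_fn": {"ratio_clip_max": 0.2, "ratio_clip_min": 0.2}}
recipe_cfg = {"loss_fn": {"ratio_clip_max": 0.28}}

# Base first, recipe second: the recipe's override survives.
cfg = merge(base_cfg, recipe_cfg)
print(cfg["loss_fn"])  # {'ratio_clip_max': 0.28, 'ratio_clip_min': 0.2}
```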
@@ -0,0 +1,47 @@
defaults: ../../grpo_math_1B.yaml
grpo:
num_prompts_per_step: 64
num_generations_per_prompt: 32
checkpointing:
checkpoint_dir: results/grpo-qwen3-next-80ba3b-8n8g-megatron
policy:
model_name: Qwen/Qwen3-Next-80B-A3B-Instruct
train_micro_batch_size: 1
max_total_sequence_length: 4096
dtensor_cfg:
enabled: false
optimizer: null
scheduler: null
sequence_packing:
enabled: false
algorithm: modified_ffd
make_sequence_length_divisible_by: ${policy.megatron_cfg.tensor_model_parallel_size}
megatron_cfg:
enabled: true
converter_type: Qwen3NextForCausalLM
tensor_model_parallel_size: 2
pipeline_model_parallel_size: 4
expert_model_parallel_size: 4
sequence_parallel: true
optimizer:
lr: 3.0e-07
min_lr: 3.0e-08
scheduler:
lr_warmup_iters: 50
lr_warmup_init: 3.0e-08
env_vars:
PYTORCH_CUDA_ALLOC_CONF: expandable_segments:False
generation:
vllm_cfg:
tensor_parallel_size: 4
gpu_memory_utilization: 0.7
logger:
log_dir: logs/grpo-qwen3-next-80ba3b-8n8g-megatron
wandb_enabled: true
tensorboard_enabled: true
wandb:
project: nemo-rl
name: grpo-qwen3-next-80ba3b-8n8g-megatron
cluster:
gpus_per_node: 8
num_nodes: 8
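As a rough sanity check on this recipe's parallelism settings, assuming the usual Megatron-Core constraints that TP × PP divides the world size and that expert parallelism divides the resulting data-parallel size:

```python
# Rough arithmetic check (illustrative, not part of this PR) that the
# recipe's parallel sizes fit the 8-node x 8-GPU cluster.
world_size = 8 * 8            # cluster.num_nodes * cluster.gpus_per_node
tp, pp, ep = 2, 4, 4          # from megatron_cfg above

assert world_size % (tp * pp) == 0, "TP*PP must divide world size"
dp = world_size // (tp * pp)  # data-parallel size
assert dp % ep == 0, "expert parallel must divide data parallel"
print(dp)  # 8
```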
@@ -0,0 +1,47 @@
defaults: ../../sft.yaml
sft:
max_num_steps: 1000000
val_period: 50
checkpointing:
checkpoint_dir: results/sft-qwen3-next-80ba3b-instruct-8n8g-megatron
save_period: 50
policy:
model_name: Qwen/Qwen3-Next-80B-A3B-Instruct
tokenizer:
name: Qwen/Qwen3-Next-80B-A3B-Instruct
chat_template: default
train_global_batch_size: 512
max_total_sequence_length: 4096
dtensor_cfg:
enabled: false
make_sequence_length_divisible_by: ${policy.megatron_cfg.tensor_model_parallel_size}
optimizer: null
megatron_cfg:
enabled: true
converter_type: Qwen3NextForCausalLM
pipeline_model_parallel_size: 2
expert_model_parallel_size: 8
optimizer:
lr: 2.0e-05
min_lr: 1.99999e-05
weight_decay: 0.01
bf16: true
scheduler:
lr_warmup_init: 1.9999e-65
Comment on lines +19 to +30

⚠️ Potential issue | 🟡 Minor

Double-check lr_warmup_init magnitude

lr: 2.0e-05, min_lr: 1.99999e-05, but scheduler.lr_warmup_init: 1.9999e-65 is extremely close to zero and looks like it might be a -05 vs -65 typo. That won’t break anything but will effectively start from ~0 lr during warmup.

If this wasn’t intentional (e.g., you meant 1.9999e-05 or similar), it’s worth correcting now to avoid confusing future readers.

🤖 Prompt for AI Agents
In examples/configs/recipes/llm/sft-qwen3-next-80ba3b-8n8g-megatron.yaml around
lines 19 to 30, the scheduler.lr_warmup_init value is set to 1.9999e-65 which is
extremely close to zero and likely a typo (should probably be 1.9999e-05);
update lr_warmup_init to the intended magnitude (e.g., 1.9999e-05 or another
value consistent with lr/min_lr) so warmup starts at the correct learning rate
and avoid confusing future readers.
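A small pre-flight check like the following (a hypothetical helper, not part of this PR) would catch -05 vs -65 exponent typos before a long run starts:

```python
# Hypothetical sanity check for warmup-init magnitude; not part of this PR.

def check_warmup_init(lr: float, lr_warmup_init: float, max_ratio: float = 1e3) -> None:
    """Raise if the warmup starting LR is implausibly far below the target LR."""
    if lr_warmup_init <= 0:
        raise ValueError("lr_warmup_init must be positive")
    if lr / lr_warmup_init > max_ratio:
        raise ValueError(
            f"lr_warmup_init={lr_warmup_init:g} is over {max_ratio:g}x smaller "
            f"than lr={lr:g}; possible exponent typo"
        )

check_warmup_init(2.0e-05, 1.9999e-05)       # same magnitude: passes silently
try:
    check_warmup_init(2.0e-05, 1.9999e-65)   # the suspect value in this diff
except ValueError as err:
    print(err)
```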

data:
dataset_name: openmathinstruct2
prompt_file: examples/prompts/math.txt
split: train_1M
add_generation_prompt: true
output_key: generated_solution
seed: 42
logger:
log_dir: logs/sft-qwen3-next-80ba3b-instruct-8n8g-megatron
wandb:
project: nemo-rl
name: sft-qwen3-next-80ba3b-instruct-8n8g-megatron
tensorboard:
log_dir: tb_logs-sft-dev-openmathinstruct2
cluster:
num_nodes: 8
gpus_per_node: 8
7 changes: 3 additions & 4 deletions examples/configs/sft.yaml
@@ -15,7 +15,7 @@ sft:
checkpointing:
enabled: true
checkpoint_dir: "results/sft"
metric_name: "val:val_loss" # one of "val:" or "train:" followed by the metric name
metric_name: "val:val_loss" ## set to null to save most recent k checkpoints
higher_is_better: false
keep_top_k: 3
save_period: 10
@@ -37,7 +37,6 @@ policy:

dtensor_cfg:
enabled: true
env_vars: {}
cpu_offload: False
sequence_parallel: false
activation_checkpointing: false
@@ -76,7 +75,6 @@ policy:
## ignored since enabled=false, but needed for testing purposes
megatron_cfg:
enabled: false
env_vars: {}
empty_unused_memory_level: 1
activation_checkpointing: false
tensor_model_parallel_size: 1
@@ -97,7 +95,7 @@ policy:
apply_rope_fusion: True
# gives ~25% training perf speedup with sequence packing and apply_rope_fusion
bias_activation_fusion: True
defer_fp32_logits: False
defer_fp32_logits: null

optimizer:
optimizer: "adam"
@@ -139,6 +137,7 @@ policy:
grad_reduce_in_fp32: false
overlap_grad_reduce: true
overlap_param_gather: true
average_in_collective: true
data_parallel_sharding_strategy: "optim_grads_params"
use_custom_fsdp: false

10 changes: 3 additions & 7 deletions examples/configs/vlm_grpo_3B.yaml
@@ -33,11 +33,6 @@ grpo:

loss_fn:
reference_policy_kl_penalty: 0.01
# Can be set to k1, k2, k3
# For more details, see http://joschu.net/blog/kl-approx.html
reference_policy_kl_type: "k3"
kl_input_clamp_value: 20.0
kl_output_clamp_value: 10.0
ratio_clip_min: 0.2
ratio_clip_max: 0.2
ratio_clip_c: null
@@ -50,7 +45,7 @@ loss_fn:
checkpointing:
enabled: true
checkpoint_dir: "results/clevr_grpo_${policy.model_name}"
metric_name: "val:accuracy" # one of "val:" or "train:" followed by the metric name
metric_name: "val_reward"
higher_is_better: true
keep_top_k: 3
save_period: 10
@@ -101,7 +96,7 @@ policy:
apply_rope_fusion: True
# gives ~25% training perf speedup with sequence packing and apply_rope_fusion
bias_activation_fusion: True
defer_fp32_logits: False
defer_fp32_logits: null

optimizer:
optimizer: "adam"
@@ -143,6 +138,7 @@ policy:
grad_reduce_in_fp32: false
overlap_grad_reduce: true
overlap_param_gather: true
average_in_collective: true
use_custom_fsdp: false
data_parallel_sharding_strategy: "optim_grads_params"

11 changes: 4 additions & 7 deletions examples/configs/vlm_grpo_3B_megatron.yaml
@@ -30,11 +30,6 @@ grpo:
max_trajectory_age_steps: 1
loss_fn:
reference_policy_kl_penalty: 0.01
# Can be set to k1, k2, k3
# For more details, see http://joschu.net/blog/kl-approx.html
reference_policy_kl_type: "k3"
kl_input_clamp_value: 20.0
kl_output_clamp_value: 10.0
ratio_clip_min: 0.2
ratio_clip_max: 0.2
ratio_clip_c: null
@@ -45,7 +40,7 @@ loss_fn:
checkpointing:
enabled: true
checkpoint_dir: results/clevr_grpo_${policy.model_name}
metric_name: val:accuracy # one of "val:" or "train:" followed by the metric name
metric_name: val_reward
higher_is_better: true
keep_top_k: 3
save_period: 10
@@ -83,6 +78,7 @@ policy:
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
algorithm: modified_first_fit_decreasing
sequence_length_round: 64
optimizer: null
scheduler:
- name: torch.optim.lr_scheduler.LinearLR
kwargs:
@@ -142,7 +138,7 @@ policy:
apply_rope_fusion: true
# gives ~25% training perf speedup with sequence packing and apply_rope_fusion
bias_activation_fusion: True
defer_fp32_logits: False
defer_fp32_logits: null
optimizer:
optimizer: adam
lr: 2.0e-07
@@ -173,6 +169,7 @@ policy:
grad_reduce_in_fp32: false
overlap_grad_reduce: false
overlap_param_gather: true
average_in_collective: true
use_custom_fsdp: false
data_parallel_sharding_strategy: optim_grads_params
data: