chore: update all to transformers v5 (+torch 2.10, ray 2.54, vllm/sglang tot)#1962
Conversation
📝 Walkthrough

This PR refactors the distributed context management by replacing …

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Setup as setup_distributed()
    participant Context as DistributedContext
    participant DeviceMesh as create_device_mesh()
    participant ModelSetup as setup_model_and_optimizer()
    participant FromPretrained as model_class.from_pretrained()
    participant Optimizer as OptimizerSetup
    Setup->>DeviceMesh: Create device/moe meshes
    DeviceMesh-->>Setup: Return meshes
    Setup->>Context: Construct DistributedContext<br/>(device_mesh, moe_mesh, fsdp2_config, moe_config, sizes)
    Setup-->>ModelSetup: Return DistributedContext
    ModelSetup->>ModelSetup: Validate CP/TP/EP interactions
    ModelSetup->>FromPretrained: Call with device_mesh,<br/>moe_mesh, distributed_config
    FromPretrained-->>ModelSetup: Return initialized model
    ModelSetup->>ModelSetup: Apply activation checkpointing<br/>and config overrides
    ModelSetup->>Optimizer: Initialize optimizer
    Optimizer-->>ModelSetup: Return optimizer state
    ModelSetup-->>ModelSetup: Return ModelAndOptimizerState
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/models/policy/test_automodel_types.py (1)
21-21: ⚠️ Potential issue | 🟠 Major

Update import path to match the new `_target_` reference.

Line 21 imports `BackendConfig` from the old path `nemo_automodel.components.moe.utils`, but line 50's `_target_` string references the new path `nemo_automodel.components.models.common.utils.BackendConfig`. Update the import to match:

Proposed fix

```diff
- from nemo_automodel.components.moe.utils import BackendConfig  # noqa: F401
+ from nemo_automodel.components.models.common.utils import BackendConfig  # noqa: F401
```
🤖 Fix all issues with AI agents
In `@nemo_rl/distributed/virtual_cluster.py`:
- Line 56: AUTOMODEL currently includes the --no-cache flag which forces uv to
bypass its cache on every automodel worker launch; either remove --no-cache from
the AUTOMODEL string to restore normal cached startup behavior or, if it was
intentionally added to workaround dependency/stale-cache issues (e.g.,
transformers v5 transition), add an inline comment next to the AUTOMODEL
definition explaining the rationale, when it can be removed, and any
reproduction steps that justify keeping it; update the AUTOMODEL constant
accordingly to reflect the chosen approach.
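Either resolution can be sketched in a few lines. The string below is purely illustrative; the actual `AUTOMODEL` value in `virtual_cluster.py` is not shown here, so its contents are an assumption:

```python
# Illustrative only: the real AUTOMODEL constant in virtual_cluster.py differs.
# Option A: drop --no-cache so uv reuses its cache on worker launch.
AUTOMODEL_OLD = "uv run --no-cache --extra automodel"
AUTOMODEL = AUTOMODEL_OLD.replace("--no-cache ", "")

# Option B: keep the flag, but document the rationale inline, e.g.:
# AUTOMODEL = "uv run --no-cache ..."  # workaround for stale caches during
# the transformers v5 transition; remove once resolution is stable.
```

Option A restores normal cached startup; Option B keeps the workaround but makes its lifetime explicit.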
In `@nemo_rl/models/automodel/setup.py`:
- Line 465: The unconditional print(model) should be run only on the main
process to avoid repeated logs in distributed runs; wrap the existing
print(model) call with a check using the existing rank variable (e.g., if rank
== 0) so only rank 0 prints the model; locate the print(model) call and guard it
with the rank check (using the same rank identifier already declared earlier) so
other ranks skip printing.
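A minimal sketch of such a guard, using the `RANK` environment variable as an illustrative stand-in for the `rank` variable already available in setup.py:

```python
import os

def print_on_rank0(obj) -> bool:
    """Print obj only on the main process; returns True if it printed."""
    # RANK is the conventional torchrun env var; in setup.py the existing
    # `rank` variable would be used instead of reading the environment.
    rank = int(os.environ.get("RANK", "0"))
    if rank == 0:
        print(obj)
        return True
    return False
```

In setup.py the bare `print(model)` would simply become `if rank == 0: print(model)`.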
- Around line 449-463: The call to model_class.from_pretrained passes
torch_dtype as str(model_config.torch_dtype), which yields values like
"torch.float32" but the loader expects the actual torch.dtype or a bare string
like "float32"; change the argument to pass the dtype object directly
(torch_dtype=model_config.torch_dtype) in the from_pretrained call inside
setup.py (where model_class.from_pretrained is invoked) so it aligns with the
STRING_TO_DTYPE mapping and test mocks that expect a torch.dtype rather than a
stringified value.
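The mismatch can be seen without loading any model. The mapping below is a stand-in for the `STRING_TO_DTYPE` table mentioned above; its real contents (bare names mapped to `torch.dtype` objects) are an assumption, with placeholder values used here:

```python
# Stand-in for the STRING_TO_DTYPE mapping referenced in the review;
# the real table maps bare names to torch dtypes (values are placeholders).
STRING_TO_DTYPE = {"float32": "<torch.float32>", "bfloat16": "<torch.bfloat16>"}

# str(torch.float32) produces "torch.float32", not "float32",
# so a stringified dtype never matches a bare-name key.
stringified = "torch.float32"
assert stringified not in STRING_TO_DTYPE
assert "float32" in STRING_TO_DTYPE
```

Passing the dtype object through unchanged avoids the lossy round-trip through `str()` entirely.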
In `@pyproject.toml`:
- Line 165: The path-based editable dependency "nemo-automodel" points to an
empty directory (3rdparty/Automodel-workspace/Automodel) and lacks a
pyproject.toml, so fix by either placing the Automodel source into that
directory or updating the dependency to the correct path; then add a valid
pyproject.toml in that directory (with project metadata and build-backend) so
the editable install for nemo-automodel succeeds and verify the package layout
(package/module files) matches the pyproject configuration.
- Around line 238-239: The comment about the transformer-engine override is
stale and the global override to "transformer-engine[pytorch]==2.10.0" may
unintentionally force TE 2.10.0 into extras like mcore which pins
"transformer-engine[pytorch]==2.8.0"; update the comment to reflect the current
2.10.0 override and the rationale, or verify and ensure mcore compatibility with
TE 2.10.0 (and adjust mcore's pin or the global override accordingly) so
automodel, mcore, and Megatron-Bridge/pyproject.toml are all consistent; search
for the symbols transformer-engine[pytorch], mcore, automodel and the
Megatron-Bridge/pyproject.toml reference to locate the relevant pins and change
either the comment or the pinning strategy to resolve the version conflict.
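A hypothetical pyproject fragment illustrating the kind of conflict described; the table names and pins below are assumptions for illustration, not the repo's actual layout:

```toml
# Hypothetical fragment; not the actual pyproject.toml contents.
[tool.uv]
override-dependencies = [
  "transformer-engine[pytorch]==2.10.0",  # global override wins at resolution time
]

[project.optional-dependencies]
mcore = [
  "transformer-engine[pytorch]==2.8.0",   # extra-level pin silently superseded
]
```

Because uv applies `override-dependencies` globally, the extra-level pin never takes effect, which is why the comment and the pinning strategy need to agree.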
🧹 Nitpick comments (6)
tests/unit/models/automodel/test_automodel_checkpoint.py (1)
375-388: Redundant local re-imports of `AutomodelCheckpointManager`.

`AutomodelCheckpointManager` is already imported at the module level (Line 35). The local re-imports inside each test method (Lines 375, 395, 417, 444, 465, 495, 526, 557) are unnecessary.

nemo_rl/models/policy/__init__.py (1)
88-109: New `DTensorConfig` keys could use brief inline documentation.

Several newly added keys (`expert_parallel_size`, `custom_parallel_plan`, `defer_fsdp_grad_sync`, `moe_parallelizer`, `clear_cache_every_n_steps`) lack purpose/default documentation. The coding guidelines ask that new TypedDict keys document their purpose, valid values, and recommended default. The grouping comments (lines 92–93, 97, 103, 105, 108) are a good start — consider adding brief per-field comments similar to the style used in `AutomodelBackendConfig` above.

As per coding guidelines: "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, recommended default, and reflect the default in exemplar YAMLs under `examples/configs/*.yaml`".

pyproject.toml (1)
250-250: `deep_ep` override duplicates the spec already in the `vllm` and `mcore` extras.

`deep_ep` is pinned to the same git+commit in `vllm` (Line 74), `mcore` (Line 114), and now the global `override-dependencies` (Line 250). This is fine for ensuring consistent resolution, but consider adding a brief comment explaining why the override is needed (e.g., ensuring automodel also uses this version).

nemo_rl/models/automodel/setup.py (2)
265-265: Hidden non-None default for `defer_fsdp_grad_sync`.

`.get("defer_fsdp_grad_sync", True)` introduces a default of `True` in code. Per coding guidelines, YAML should be the single source of truth for configuration defaults — non-None defaults should not be set in code.

Consider either:
- Making `defer_fsdp_grad_sync` a required field in `DTensorConfig`, or
- Setting the default in the YAML config files and accessing it directly here.
As per coding guidelines: "YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values".
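Under the second option, a sketch of what the exemplar YAML might carry; the file location and nesting under `policy.dtensor_cfg` are assumptions about the config layout:

```yaml
# e.g. in an exemplar config under examples/configs/ (exact path illustrative)
policy:
  dtensor_cfg:
    defer_fsdp_grad_sync: true   # default lives in YAML, not in code
```

The code then reads the key directly, so a missing value fails loudly instead of silently falling back to a hidden in-code default.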
436-463: Potential key collision between `from_pretrained_kwargs` and `automodel_kwargs`.

Both `**from_pretrained_kwargs` (from `hf_config_overrides`) and `**automodel_kwargs` are unpacked into `from_pretrained()`. If any key exists in both dicts, `automodel_kwargs` silently wins. This may be intentional, but if not, it could cause subtle config loss.

Consider adding a guard:

```python
overlap = set(from_pretrained_kwargs) & set(automodel_kwargs)
if overlap:
    print(f"[WARNING] Overlapping keys between hf_config_overrides and automodel_kwargs: {overlap}")
```

tests/unit/models/automodel/test_automodel_setup.py (1)
610-627: Lambda `self` parameter shadows outer fixture `self`.

Ruff flags `self` as unused in the lambda on line 622. The parameter is actually the mock instance receiving `__getitem__`, but it shadows the fixture's `self`. Consider renaming to `_self` or `_` for clarity.

♻️ Minor rename to silence Ruff ARG005

```diff
- mock_mesh.__getitem__ = lambda self, key: {
+ mock_mesh.__getitem__ = lambda _self, key: {
```
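As a side note on the key-collision nitpick above: Python's overlap semantics depend on where the unpacking happens. A pre-merged dict silently takes the later value, while duplicate keys in a call's double-star unpacking actually raise `TypeError`. A self-contained demonstration (key names illustrative, not taken from the actual configs):

```python
d1 = {"attn_implementation": "eager", "trust_remote_code": True}
d2 = {"attn_implementation": "sdpa"}

# Dict-literal merging silently lets the later dict win...
merged = {**d1, **d2}
assert merged["attn_implementation"] == "sdpa"

# ...whereas duplicate keys across ** unpackings in a call raise TypeError.
def f(**kwargs):
    return kwargs

try:
    f(**d1, **d2)
    raised = False
except TypeError:
    raised = True
assert raised
```

So whether the overlap is silent or loud depends on whether the kwargs are merged before the `from_pretrained()` call; the proposed guard makes the behavior explicit either way.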
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
- Set router_aux_loss_coef=0 via hf_config_overrides to fix 74x gradient inflation caused by MoEAuxLossAutoScaler.main_loss_backward_scale not being set in nemo-rl (defaults to 1.0, aux_loss grads unscaled vs token-averaged cross-entropy loss)
- Add experts=torch_mm and rope_fusion=false backend settings
- Remove TORCH_COMPILE_DISABLE=1 from test script (fix is in automodel experts.py @torch.compile removal on _apply_bias)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hemildesai <hemild@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: hemildesai <hemild@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
- increase time to avoid flaky fail: grpo-qwen3-8B-base-1n8g-fsdp2-lora, grpo-qwen3-8b-base-1n8g-megatron-lora
- fix make_sequence_length_divisible_by: distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack
- set use_distributed_optimizer=True to avoid using layer_wise_optimizer: sft-qwen2.5-math7b-1n8g-megatron_chunked_linear_ce_loss

Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
- missing sampling_params after rebase
- numerical differences on gb200: test_megatron_worker.py::test_megatron_context_parallel_topk_agreement
- missing model.rotary_emb_local: test_parallelize.py::test_parallelize_plan_keys
- wildcard_match fix in automodel: test_lora.py::test_apply_lora_respects_wildcard

remaining unit test fails:
- test_smolvlm_embeddings_bug.py::test_smolvlm_embeddings_differ_from_reference
- test_dtensor_worker_v2.py::test_dtensor_worker_v1_v2_model_config_equivalence
- test_dtensor_worker.py::test_dtensor_tp_and_tied_model_with_custom_parallel_plan

Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
…p two automodel unit tests Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
…se in gb200 vlm Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
✅ Submodule Fast-Forward Check Results

Check based on commit: 1add665 (PR #1962 from …)

✅ Submodules that are properly updated:
Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨
/ok to test 1add665
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
✅ Submodule Fast-Forward Check Results

Check based on commit: e165545 (PR #1962 from …)

✅ Submodules that are properly updated:
Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨
/ok to test e165545
yuki-97
left a comment
h100 nightly tests all passed, can be merged!
Closes #1995
Closes #2041
Closes #2042
Closes NMFW-4
Summary by CodeRabbit
Release Notes
New Features
Configuration Updates
Dependencies
Improvements