
chore: update all to transformers v5 (+torch 2.10, ray 2.54, vllm/sglang tot) #1962

Merged
terrykong merged 105 commits into main from hemil/automodel-transformers-v5
Mar 25, 2026
Conversation

@hemildesai
Contributor

@hemildesai hemildesai commented Feb 15, 2026

  • Update transformers to v5 just for automodel extra
  • Update Automodel to latest main

Closes #1995
Closes #2041
Closes #2042

Closes NMFW-4

Summary by CodeRabbit

Release Notes

  • New Features

    • Added MoE parallelizer configuration options for distributed training
    • Introduced distributed context management for enhanced distributed setup
  • Configuration Updates

    • Updated backend configuration paths for automodel components
    • Added checkpoint save period configuration option
    • Enhanced DTensor configuration with cache clearing and async checkpointing flags
  • Dependencies

    • Relaxed transformers version constraint
    • Updated transformer-engine to latest compatible version
    • Added CUDA and DeepSeek dependencies for improved model support
  • Improvements

    • Simplified distributed training initialization flow
    • Enhanced automodel execution with cache optimization

@github-actions

⚠️ File Consistency Check

Check based on commit: 9dce543 (PR #1962 from hemil/automodel-transformers-v5)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/workers/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 9dce543 (PR #1962 from hemil/automodel-transformers-v5)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@hemildesai hemildesai marked this pull request as ready for review February 15, 2026 19:52
@hemildesai hemildesai requested review from a team as code owners February 15, 2026 19:52
@hemildesai hemildesai added the CI:L1 Run doctests, unit tests, and functional tests label Feb 15, 2026
@coderabbitai
Contributor

coderabbitai bot commented Feb 15, 2026

📝 Walkthrough


This PR refactors the distributed context management by replacing FSDP2Manager with a new DistributedContext object, removes model_state_dict_keys parameters from checkpoint management, updates backend configuration paths from moe.utils to models.common.utils, and modifies dependency management and workspace configuration in pyproject.toml.

Changes

Cohort / File(s) Summary
Automodel Submodule & Dependencies
3rdparty/Automodel-workspace/Automodel, pyproject.toml
Updated Automodel submodule pointer; changed workspace setup to path-based editable mode; relaxed transformers version constraint; added nemo-automodel[moe] extra; updated transformer-engine to 2.10.0; added nvidia-cudnn-cu12 and deep_ep dependencies; introduced automodel conflicts with fsdp, mcore, vllm in lint configuration.
Backend Config Path Migration
examples/configs/recipes/llm/*, nemo_rl/models/policy/__init__.py, tests/unit/models/policy/test_automodel_types.py
Updated BackendConfig import paths from nemo_automodel.components.moe.utils.BackendConfig to nemo_automodel.components.models.common.utils.BackendConfig across YAML configs and type definitions; added checkpointing.save_period: 30 in sft config.
Distributed Context Refactoring
nemo_rl/models/automodel/config.py, nemo_rl/models/automodel/setup.py
Introduced new DistributedContext NamedTuple to encapsulate device meshes and distributed configs; replaced setup_distributed to return DistributedContext instead of FSDP2Manager; refactored setup_model_and_optimizer to accept distributed_context parameter and use from_pretrained initialization with device meshes instead of manager-based approach; removed model_state_dict_keys field from ModelAndOptimizerState.
Policy Configuration Types
nemo_rl/models/policy/__init__.py
Added MoEParallelizerOptions TypedDict with fields for MoE parallelizer settings; extended DTensorConfig with clear_cache_every_n_steps, moe_parallelizer, defer_fsdp_grad_sync, expert_parallel_size, and custom_parallel_plan; reordered existing fields for clarity.
Checkpoint Manager Cleanup
nemo_rl/models/automodel/checkpoint.py
Removed model_state_dict_keys constructor parameter, attribute, and related docstrings; removed load_base_model and set_model_state_dict_keys public methods; removed TRANSFORMERS_CACHE import; updated init_checkpointer and update_checkpointer_config calls.
Integration Updates
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py, nemo_rl/distributed/virtual_cluster.py
Updated DTensorPolicyWorkerV2 to use distributed_context instead of distributed_manager; removed model_state_dict_keys from AutomodelCheckpointManager initialization; added is_async flag to checkpoint config; added --no-cache flag to automodel uv command execution.
Test Updates
tests/unit/models/automodel/test_automodel_setup.py, tests/unit/models/automodel/test_automodel_checkpoint.py, tests/unit/models/policy/test_dtensor_worker_v2.py, tests/unit/models/policy/test_automodel_types.py
Updated tests to use new DistributedContext API; removed model_state_dict_keys from checkpoint manager tests; refactored test_automodel_setup.py to verify DistributedContext return and device mesh population; removed use_hf_tp_plan parameter from worker config tests.
Configuration
pyrefly.toml
Added nemo_rl/models/automodel/checkpoint.py to project-includes; removed nemo_rl/utils/automodel_checkpoint.py from project-includes.
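The DistributedContext refactor above can be sketched minimally. This is an illustrative stand-in, not the actual nemo_rl API: the field names follow the walkthrough, while the defaults and the stub body are assumptions:

```python
from typing import Any, NamedTuple, Optional

class DistributedContext(NamedTuple):
    """Illustrative container for device meshes and distributed configs."""
    device_mesh: Any                  # dense/data-parallel device mesh
    moe_mesh: Optional[Any] = None    # expert-parallel mesh when MoE is enabled
    fsdp2_config: Optional[dict] = None
    moe_config: Optional[dict] = None

def setup_distributed(config: dict) -> DistributedContext:
    # The real function builds torch.distributed device meshes; this stub
    # only shows the shape returned to setup_model_and_optimizer().
    return DistributedContext(device_mesh={"dp": config.get("dp_size", 1)})

ctx = setup_distributed({"dp_size": 8})
assert ctx.moe_mesh is None
assert ctx.device_mesh["dp"] == 8
```

A NamedTuple keeps the context immutable and cheap to pass between setup_distributed() and setup_model_and_optimizer(), which matches the flow shown in the sequence diagram.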

Sequence Diagram

sequenceDiagram
    participant Setup as setup_distributed()
    participant Context as DistributedContext
    participant DeviceMesh as create_device_mesh()
    participant ModelSetup as setup_model_and_optimizer()
    participant FromPretrained as model_class.from_pretrained()
    participant Optimizer as OptimizerSetup
    
    Setup->>DeviceMesh: Create device/moe meshes
    DeviceMesh-->>Setup: Return meshes
    Setup->>Context: Construct DistributedContext<br/>(device_mesh, moe_mesh, fsdp2_config, moe_config, sizes)
    Setup-->>ModelSetup: Return DistributedContext
    
    ModelSetup->>ModelSetup: Validate CP/TP/EP interactions
    ModelSetup->>FromPretrained: Call with device_mesh,<br/>moe_mesh, distributed_config
    FromPretrained-->>ModelSetup: Return initialized model
    
    ModelSetup->>ModelSetup: Apply activation checkpointing<br/>and config overrides
    ModelSetup->>Optimizer: Initialize optimizer
    Optimizer-->>ModelSetup: Return optimizer state
    
    ModelSetup-->>ModelSetup: Return ModelAndOptimizerState

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Run CICD

Suggested reviewers

  • terrykong
  • yuki-97
  • adil-a
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Merge Conflict Detection (⚠️ Warning): Merge conflicts detected (27 files):

⚔️ 3rdparty/Automodel-workspace/Automodel (content)
⚔️ docker/Dockerfile (content)
⚔️ docs/guides/use-custom-vllm.md (content)
⚔️ examples/configs/grpo_math_1B.yaml (content)
⚔️ examples/configs/recipes/llm/grpo-moonlight-16b-automodel-1n8g-ep8.yaml (content)
⚔️ examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml (content)
⚔️ examples/configs/vlm_grpo_3B.yaml (content)
⚔️ examples/configs/vlm_grpo_3B_megatron.yaml (content)
⚔️ examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml (content)
⚔️ nemo_rl/algorithms/grpo.py (content)
⚔️ nemo_rl/distributed/virtual_cluster.py (content)
⚔️ nemo_rl/environments/nemo_gym.py (content)
⚔️ nemo_rl/models/automodel/config.py (content)
⚔️ nemo_rl/models/automodel/setup.py (content)
⚔️ nemo_rl/models/generation/vllm/vllm_worker.py (content)
⚔️ nemo_rl/models/policy/__init__.py (content)
⚔️ nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py (content)
⚔️ pyproject.toml (content)
⚔️ pyrefly.toml (content)
⚔️ tests/functional/grpo_non_colocated.sh (content)
⚔️ tests/unit/algorithms/test_grpo.py (content)
⚔️ tests/unit/environments/test_nemo_gym.py (content)
⚔️ tests/unit/models/automodel/test_automodel_setup.py (content)
⚔️ tests/unit/models/policy/test_automodel_types.py (content)
⚔️ tests/unit/models/policy/test_dtensor_worker_v2.py (content)
⚔️ tools/build-custom-vllm.sh (content)
⚔️ uv.lock (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
Test Results For Major Changes (⚠️ Warning): The PR contains major breaking changes and dependency upgrades but lacks test results, regression verification, and convergence validation in the description. Resolution: add a comprehensive testing summary documenting test results, regression-testing confirmation, editable-install fix verification, and resolution status for the identified review issues.
✅ Passed checks (3 passed)

  • Docstring Coverage (✅ Passed): Docstring coverage is 96.77%, which meets the required threshold of 80.00%.
  • Title check (✅ Passed): The title mentions updating transformers and other dependencies, which aligns with the core changes in pyproject.toml (transformers version bump, transformer-engine update, and dependency additions), but it incompletely represents the significant API refactoring across automodel modules.
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/models/policy/test_automodel_types.py (1)

21-21: ⚠️ Potential issue | 🟠 Major

Update import path to match new _target_ reference.

Line 21 imports BackendConfig from the old path nemo_automodel.components.moe.utils, but line 50's _target_ string references the new path nemo_automodel.components.models.common.utils.BackendConfig. Update the import to match:

Proposed fix
-    from nemo_automodel.components.moe.utils import BackendConfig  # noqa: F401
+    from nemo_automodel.components.models.common.utils import BackendConfig  # noqa: F401
🤖 Fix all issues with AI agents
In `@nemo_rl/distributed/virtual_cluster.py`:
- Line 56: AUTOMODEL currently includes the --no-cache flag, which forces uv to bypass its cache on every automodel worker launch. Either remove --no-cache from the AUTOMODEL string to restore normal cached startup behavior, or, if it was intentionally added to work around dependency/stale-cache issues (e.g., the transformers v5 transition), add an inline comment next to the AUTOMODEL definition explaining the rationale, when it can be removed, and any reproduction steps that justify keeping it. Update the AUTOMODEL constant to reflect the chosen approach.

In `@nemo_rl/models/automodel/setup.py`:
- Line 465: The unconditional print(model) runs on every rank, producing repeated logs in distributed runs. Guard it with the rank variable already declared earlier (e.g., if rank == 0: print(model)) so only the main process prints the model summary.
- Around lines 449-463: model_class.from_pretrained is called with torch_dtype=str(model_config.torch_dtype), which yields values like "torch.float32", but the loader expects an actual torch.dtype or a bare string like "float32". Pass the dtype object directly (torch_dtype=model_config.torch_dtype) so the call aligns with the STRING_TO_DTYPE mapping and with test mocks that expect a torch.dtype rather than a stringified value.

In `@pyproject.toml`:
- Line 165: The path-based editable dependency "nemo-automodel" points to an empty directory (3rdparty/Automodel-workspace/Automodel) that lacks a pyproject.toml. Either place the Automodel source in that directory or point the dependency at the correct path, then add a valid pyproject.toml there (project metadata plus a build-backend) so the editable install succeeds, and verify the package layout matches the pyproject configuration.
- Around lines 238-239: The comment describing the transformer-engine override is stale, and the global override to "transformer-engine[pytorch]==2.10.0" may unintentionally force TE 2.10.0 into extras such as mcore, which pins "transformer-engine[pytorch]==2.8.0". Either update the comment to reflect the 2.10.0 override and its rationale, or verify mcore compatibility with TE 2.10.0 and adjust mcore's pin or the global override accordingly, so that automodel, mcore, and the Megatron-Bridge/pyproject.toml reference stay consistent.
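The torch_dtype issue above can be reproduced with a small stand-in; FakeDType, STRING_TO_DTYPE, and resolve_dtype here are hypothetical illustrations of the described loader behavior, not the real API:

```python
class FakeDType:
    """Mimics torch.dtype in one respect: str() includes the module prefix."""
    def __init__(self, name):
        self.name = name
    def __str__(self):
        return f"torch.{self.name}"

float32 = FakeDType("float32")

# Loaders of this style map bare string names to dtype objects.
STRING_TO_DTYPE = {"float32": float32, "bfloat16": FakeDType("bfloat16")}

def resolve_dtype(value):
    # Accept either a dtype object or a bare string key.
    if isinstance(value, FakeDType):
        return value
    return STRING_TO_DTYPE[value]  # raises KeyError for "torch.float32"

assert resolve_dtype(float32) is float32      # passing the object works
assert resolve_dtype("float32") is float32    # bare string works
try:
    resolve_dtype(str(float32))               # "torch.float32": lookup fails
    stringified_ok = True
except KeyError:
    stringified_ok = False
assert not stringified_ok
```

This is why passing the dtype object directly is the safer fix: str() on a dtype produces a prefixed name that never appears as a mapping key.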
🧹 Nitpick comments (6)
tests/unit/models/automodel/test_automodel_checkpoint.py (1)

375-388: Redundant local re-imports of AutomodelCheckpointManager.

AutomodelCheckpointManager is already imported at the module level (Line 35). The local re-imports inside each test method (Lines 375, 395, 417, 444, 465, 495, 526, 557) are unnecessary.

nemo_rl/models/policy/__init__.py (1)

88-109: New DTensorConfig keys could use brief inline documentation.

Several newly added keys (expert_parallel_size, custom_parallel_plan, defer_fsdp_grad_sync, moe_parallelizer, clear_cache_every_n_steps) lack purpose/default documentation. The coding guidelines ask that new TypedDict keys document their purpose, valid values, and recommended default.

The grouping comments (lines 92–93, 97, 103, 105, 108) are a good start — consider adding brief per-field comments similar to the style used in AutomodelBackendConfig above.

As per coding guidelines: "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml".
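A sketch of what such per-field documentation could look like; the DTensorConfig field names are taken from the walkthrough above, while the comments, defaults, and the MoEParallelizerOptions field are illustrative assumptions rather than the real definitions:

```python
from typing import Optional, TypedDict

class MoEParallelizerOptions(TypedDict, total=False):
    """Hypothetical MoE parallelizer options (field below is illustrative)."""
    # Example field only; the real option names live in nemo_rl/models/policy/__init__.py.
    overlap_grad_reduce: bool

class DTensorConfig(TypedDict, total=False):
    """Illustrative subset of the extended DTensorConfig with per-field docs."""
    # Number of expert-parallel ranks for MoE layers (1 disables expert parallelism).
    expert_parallel_size: int
    # Dotted path to a user-supplied parallel plan; None selects the built-in plan.
    custom_parallel_plan: Optional[str]
    # Defer FSDP gradient synchronization until the final micro-batch.
    defer_fsdp_grad_sync: bool
    # Clear CUDA caches every N steps; None disables periodic clearing.
    clear_cache_every_n_steps: Optional[int]
    # Nested MoE parallelizer settings.
    moe_parallelizer: MoEParallelizerOptions

cfg: DTensorConfig = {
    "expert_parallel_size": 8,
    "defer_fsdp_grad_sync": True,
    "moe_parallelizer": {"overlap_grad_reduce": True},
}
assert cfg["expert_parallel_size"] == 8
```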

pyproject.toml (1)

250-250: deep_ep override duplicates the spec already in vllm and mcore extras.

deep_ep is pinned to the same git+commit in vllm (Line 74), mcore (Line 114), and now the global override-dependencies (Line 250). This is fine for ensuring consistent resolution, but consider adding a brief comment explaining why the override is needed (e.g., ensuring automodel also uses this version).

nemo_rl/models/automodel/setup.py (2)

265-265: Hidden non-None default for defer_fsdp_grad_sync.

.get("defer_fsdp_grad_sync", True) introduces a default of True in code. Per coding guidelines, YAML should be the single source of truth for configuration defaults — non-None defaults should not be set in code.

Consider either:

  1. Making defer_fsdp_grad_sync a required field in DTensorConfig, or
  2. Setting the default in the YAML config files and accessing it directly here.

As per coding guidelines: "YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values".
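Option 2 could look like the following sketch, where the key name comes from the diff and the access pattern is an assumption:

```python
# Hypothetical config access illustrating the guideline: the default lives in
# YAML, so code indexes the key directly and fails loudly if YAML omits it.
yaml_defaults = {"defer_fsdp_grad_sync": True}  # would be loaded from the YAML file

def read_defer_flag(dtensor_cfg: dict) -> bool:
    # Direct indexing instead of .get(..., True): a missing key raises
    # KeyError, surfacing a misconfigured YAML instead of hiding a
    # code-level default.
    return dtensor_cfg["defer_fsdp_grad_sync"]

assert read_defer_flag(yaml_defaults) is True
```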


436-463: Potential key collision between from_pretrained_kwargs and automodel_kwargs.

Both **from_pretrained_kwargs (from hf_config_overrides) and **automodel_kwargs are unpacked into from_pretrained(). If any key exists in both dicts, automodel_kwargs silently wins. This may be intentional, but if not, it could cause subtle config loss.

Consider adding a guard:

overlap = set(from_pretrained_kwargs) & set(automodel_kwargs)
if overlap:
    print(f"[WARNING] Overlapping keys between hf_config_overrides and automodel_kwargs: {overlap}")
tests/unit/models/automodel/test_automodel_setup.py (1)

610-627: Lambda self parameter shadows outer fixture self.

Ruff flags self as unused in the lambda on line 622. The parameter is actually the mock instance receiving __getitem__, but it shadows the fixture's self. Consider renaming to _self or _ for clarity.

♻️ Minor rename to silence Ruff ARG005
-        mock_mesh.__getitem__ = lambda self, key: {
+        mock_mesh.__getitem__ = lambda _self, key: {

@hemildesai hemildesai removed the CI:L1 Run doctests, unit tests, and functional tests label Feb 16, 2026
@github-actions

⚠️ File Consistency Check (commit fdb0374, PR #1962 from hemil/automodel-transformers-v5): same DTensor Policy Worker Synchronization Warning as above; nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified without a matching update to nemo_rl/models/policy/workers/dtensor_policy_worker.py.

@hemildesai hemildesai added the CI:L1 Run doctests, unit tests, and functional tests label Feb 16, 2026
@github-actions

✅ Submodule Fast-Forward Check Results (commit fdb0374): Automodel PR branch is ahead of main (fast-forward); all submodule changes look good.

@hemildesai hemildesai removed the CI:L1 Run doctests, unit tests, and functional tests label Feb 17, 2026
@github-actions

⚠️ File Consistency Check (commit 945117f, PR #1962 from hemil/automodel-transformers-v5): same DTensor Policy Worker Synchronization Warning as above; nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified without a matching update to nemo_rl/models/policy/workers/dtensor_policy_worker.py.

@hemildesai hemildesai added the CI:L1 Run doctests, unit tests, and functional tests label Feb 17, 2026
@github-actions

✅ Submodule Fast-Forward Check Results (commit 945117f): Automodel PR branch is ahead of main (fast-forward); all submodule changes look good.

@github-actions

⚠️ File Consistency Check (commit d9143bc, PR #1962 from hemil/automodel-transformers-v5): same DTensor Policy Worker Synchronization Warning as above; nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified without a matching update to nemo_rl/models/policy/workers/dtensor_policy_worker.py.

@github-actions

✅ Submodule Fast-Forward Check Results (commit d9143bc): Automodel PR branch is ahead of main (fast-forward); all submodule changes look good.

@hemildesai hemildesai removed the CI:L1 Run doctests, unit tests, and functional tests label Feb 17, 2026
yuki-97 and others added 19 commits March 23, 2026 20:32
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
- Set router_aux_loss_coef=0 via hf_config_overrides to fix 74x gradient
  inflation caused by MoEAuxLossAutoScaler.main_loss_backward_scale not
  being set in nemo-rl (defaults to 1.0, aux_loss grads unscaled vs
  token-averaged cross-entropy loss)
- Add experts=torch_mm and rope_fusion=false backend settings
- Remove TORCH_COMPILE_DISABLE=1 from test script (fix is in automodel
  experts.py @torch.compile removal on _apply_bias)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
- increase time to avoid flaky fail: grpo-qwen3-8B-base-1n8g-fsdp2-lora, grpo-qwen3-8b-base-1n8g-megatron-lora
- fix make_sequence_length_divisible_by: distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack
- set use_distributed_optimizer=True to avoid using layer_wise_optimizer: sft-qwen2.5-math7b-1n8g-megatron_chunked_linear_ce_loss

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
- missing sampling_params after rebase
- numerical differences on gb200: test_megatron_worker.py::test_megatron_context_parallel_topk_agreement
- missing model.rotary_emb_local: test_parallelize.py::test_parallelize_plan_keys
- wildcard_match fix in automodel: test_lora.py::test_apply_lora_respects_wildcard

remaining unit test fails:
- test_smolvlm_embeddings_bug.py::test_smolvlm_embeddings_differ_from_reference
- test_dtensor_worker_v2.py::test_dtensor_worker_v1_v2_model_config_equivalence
- test_dtensor_worker.py::test_dtensor_tp_and_tied_model_with_custom_parallel_plan

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
…p two automodel unit tests

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
…se in gb200 vlm

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
@github-actions

⚠️ File Consistency Check

Check based on commit: 1add665 (PR #1962 from hemil/automodel-transformers-v5)

⚠️ Parallel Plans Synchronization Warning

The file nemo_rl/models/dtensor/parallelize.py was modified in this PR, but neither 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py nor 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py was updated.

Why this matters:
These files contain similar parallel plan implementations that should be kept synchronized to ensure consistency across the codebase.

Action required:

  • Please review if the changes in nemo_rl/models/dtensor/parallelize.py should also be applied to 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py or 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py
  • Update the appropriate related file(s) if necessary to maintain functional consistency
  • Request access to the NVIDIA-NeMo/Automodel repository, create a PR against the nemo-rl-submodule branch, and update the Automodel submodule in the nemo-rl index
  • Add @ffrujeri as a reviewer of this PR if you have any questions about the consistency requirements
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/dtensor/parallelize.py
  • Not modified: 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py
  • Not modified: 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 1add665 (PR #1962 from hemil/automodel-transformers-v5)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)
Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)
Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@yuki-97
Contributor

yuki-97 commented Mar 24, 2026

/ok to test 1add665

yuki-97 added 2 commits March 24, 2026 07:42
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@github-actions

⚠️ File Consistency Check (commit e165545, PR #1962 from hemil/automodel-transformers-v5): same Parallel Plans Synchronization Warning and DTensor Policy Worker Synchronization Check as reported for commit 1add665 above.

@github-actions

✅ Submodule Fast-Forward Check Results (commit e165545): Automodel, Megatron-Bridge, and Megatron-LM PR branches are ahead of main (fast-forward); all submodule changes look good.

@yuki-97
Contributor

yuki-97 commented Mar 24, 2026

/ok to test e165545

@yuki-97 yuki-97 (Contributor) left a comment

h100 nightly tests all passed, can be merged!


Labels

  • CI:L1: Run doctests, unit tests, and functional tests
  • CI: Relating to CI
  • documentation: Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • transformers v5
  • Nightly tests failed at main
  • RL Nightly Test Failing after likely Mcore bump

7 participants