chore: update all to transformers v5 (+torch 2.10, ray 2.54, vllm/sglang tot)#1962
Conversation
📝 Walkthrough

This PR refactors the distributed context management by replacing …

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Setup as setup_distributed()
    participant Context as DistributedContext
    participant DeviceMesh as create_device_mesh()
    participant ModelSetup as setup_model_and_optimizer()
    participant FromPretrained as model_class.from_pretrained()
    participant Optimizer as OptimizerSetup
    Setup->>DeviceMesh: Create device/moe meshes
    DeviceMesh-->>Setup: Return meshes
    Setup->>Context: Construct DistributedContext<br/>(device_mesh, moe_mesh, fsdp2_config, moe_config, sizes)
    Setup-->>ModelSetup: Return DistributedContext
    ModelSetup->>ModelSetup: Validate CP/TP/EP interactions
    ModelSetup->>FromPretrained: Call with device_mesh,<br/>moe_mesh, distributed_config
    FromPretrained-->>ModelSetup: Return initialized model
    ModelSetup->>ModelSetup: Apply activation checkpointing<br/>and config overrides
    ModelSetup->>Optimizer: Initialize optimizer
    Optimizer-->>ModelSetup: Return optimizer state
    ModelSetup-->>ModelSetup: Return ModelAndOptimizerState
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/models/policy/test_automodel_types.py (1)
21-21: ⚠️ Potential issue | 🟠 Major

Update import path to match the new `_target_` reference.

Line 21 imports `BackendConfig` from the old path `nemo_automodel.components.moe.utils`, but line 50's `_target_` string references the new path `nemo_automodel.components.models.common.utils.BackendConfig`. Update the import to match:

Proposed fix

```diff
- from nemo_automodel.components.moe.utils import BackendConfig  # noqa: F401
+ from nemo_automodel.components.models.common.utils import BackendConfig  # noqa: F401
```
🤖 Fix all issues with AI agents
In `@nemo_rl/distributed/virtual_cluster.py`:
- Line 56: AUTOMODEL currently includes the --no-cache flag which forces uv to
bypass its cache on every automodel worker launch; either remove --no-cache from
the AUTOMODEL string to restore normal cached startup behavior or, if it was
intentionally added to workaround dependency/stale-cache issues (e.g.,
transformers v5 transition), add an inline comment next to the AUTOMODEL
definition explaining the rationale, when it can be removed, and any
reproduction steps that justify keeping it; update the AUTOMODEL constant
accordingly to reflect the chosen approach.
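Either resolution can be sketched in a few lines. The string below is purely illustrative; the actual `AUTOMODEL` value in `virtual_cluster.py` is not shown here, so its contents are an assumption:

```python
# Illustrative only: the real AUTOMODEL constant in virtual_cluster.py differs.
# Option A: drop --no-cache so uv reuses its cache on worker launch.
AUTOMODEL_OLD = "uv run --no-cache --extra automodel"
AUTOMODEL = AUTOMODEL_OLD.replace("--no-cache ", "")

# Option B: keep the flag, but document the rationale inline, e.g.:
# AUTOMODEL = "uv run --no-cache ..."  # workaround for stale caches during
# the transformers v5 transition; remove once resolution is stable.
```

Option A restores normal cached startup; Option B keeps the workaround but makes its lifetime explicit.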
In `@nemo_rl/models/automodel/setup.py`:
- Line 465: The unconditional print(model) should be run only on the main
process to avoid repeated logs in distributed runs; wrap the existing
print(model) call with a check using the existing rank variable (e.g., if rank
== 0) so only rank 0 prints the model; locate the print(model) call and guard it
with the rank check (using the same rank identifier already declared earlier) so
other ranks skip printing.
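A minimal sketch of such a guard, using the `RANK` environment variable as an illustrative stand-in for the `rank` variable already available in setup.py:

```python
import os

def print_on_rank0(obj) -> bool:
    """Print obj only on the main process; returns True if it printed."""
    # RANK is the conventional torchrun env var; in setup.py the existing
    # `rank` variable would be used instead of reading the environment.
    rank = int(os.environ.get("RANK", "0"))
    if rank == 0:
        print(obj)
        return True
    return False
```

In setup.py the bare `print(model)` would simply become `if rank == 0: print(model)`.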
- Around line 449-463: The call to model_class.from_pretrained passes
torch_dtype as str(model_config.torch_dtype), which yields values like
"torch.float32" but the loader expects the actual torch.dtype or a bare string
like "float32"; change the argument to pass the dtype object directly
(torch_dtype=model_config.torch_dtype) in the from_pretrained call inside
setup.py (where model_class.from_pretrained is invoked) so it aligns with the
STRING_TO_DTYPE mapping and test mocks that expect a torch.dtype rather than a
stringified value.
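The mismatch can be seen without loading any model. The mapping below is a stand-in for the `STRING_TO_DTYPE` table mentioned above; its real contents (bare names mapped to `torch.dtype` objects) are an assumption, with placeholder values used here:

```python
# Stand-in for the STRING_TO_DTYPE mapping referenced in the review;
# the real table maps bare names to torch dtypes (values are placeholders).
STRING_TO_DTYPE = {"float32": "<torch.float32>", "bfloat16": "<torch.bfloat16>"}

# str(torch.float32) produces "torch.float32", not "float32",
# so a stringified dtype never matches a bare-name key.
stringified = "torch.float32"
assert stringified not in STRING_TO_DTYPE
assert "float32" in STRING_TO_DTYPE
```

Passing the dtype object through unchanged avoids the lossy round-trip through `str()` entirely.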
In `@pyproject.toml`:
- Line 165: The path-based editable dependency "nemo-automodel" points to an
empty directory (3rdparty/Automodel-workspace/Automodel) and lacks a
pyproject.toml, so fix by either placing the Automodel source into that
directory or updating the dependency to the correct path; then add a valid
pyproject.toml in that directory (with project metadata and build-backend) so
the editable install for nemo-automodel succeeds and verify the package layout
(package/module files) matches the pyproject configuration.
- Around line 238-239: The comment about the transformer-engine override is
stale and the global override to "transformer-engine[pytorch]==2.10.0" may
unintentionally force TE 2.10.0 into extras like mcore which pins
"transformer-engine[pytorch]==2.8.0"; update the comment to reflect the current
2.10.0 override and the rationale, or verify and ensure mcore compatibility with
TE 2.10.0 (and adjust mcore's pin or the global override accordingly) so
automodel, mcore, and Megatron-Bridge/pyproject.toml are all consistent; search
for the symbols transformer-engine[pytorch], mcore, automodel and the
Megatron-Bridge/pyproject.toml reference to locate the relevant pins and change
either the comment or the pinning strategy to resolve the version conflict.
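A hypothetical pyproject fragment illustrating the kind of conflict described; the table names and pins below are assumptions for illustration, not the repo's actual layout:

```toml
# Hypothetical fragment; not the actual pyproject.toml contents.
[tool.uv]
override-dependencies = [
  "transformer-engine[pytorch]==2.10.0",  # global override wins at resolution time
]

[project.optional-dependencies]
mcore = [
  "transformer-engine[pytorch]==2.8.0",   # extra-level pin silently superseded
]
```

Because uv applies `override-dependencies` globally, the extra-level pin never takes effect, which is why the comment and the pinning strategy need to agree.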
🧹 Nitpick comments (6)
tests/unit/models/automodel/test_automodel_checkpoint.py (1)
375-388: Redundant local re-imports of `AutomodelCheckpointManager`.

`AutomodelCheckpointManager` is already imported at the module level (Line 35). The local re-imports inside each test method (Lines 375, 395, 417, 444, 465, 495, 526, 557) are unnecessary.

nemo_rl/models/policy/__init__.py (1)
88-109: New `DTensorConfig` keys could use brief inline documentation.

Several newly added keys (`expert_parallel_size`, `custom_parallel_plan`, `defer_fsdp_grad_sync`, `moe_parallelizer`, `clear_cache_every_n_steps`) lack purpose/default documentation. The coding guidelines ask that new TypedDict keys document their purpose, valid values, and recommended default. The grouping comments (lines 92–93, 97, 103, 105, 108) are a good start — consider adding brief per-field comments similar to the style used in `AutomodelBackendConfig` above.

As per coding guidelines: "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, recommended default, and reflect the default in exemplar YAMLs under `examples/configs/*.yaml`".

pyproject.toml (1)
250-250: `deep_ep` override duplicates the spec already in the `vllm` and `mcore` extras.

`deep_ep` is pinned to the same git+commit in `vllm` (Line 74), `mcore` (Line 114), and now the global `override-dependencies` (Line 250). This is fine for ensuring consistent resolution, but consider adding a brief comment explaining why the override is needed (e.g., ensuring automodel also uses this version).

nemo_rl/models/automodel/setup.py (2)
265-265: Hidden non-None default for `defer_fsdp_grad_sync`.

`.get("defer_fsdp_grad_sync", True)` introduces a default of `True` in code. Per coding guidelines, YAML should be the single source of truth for configuration defaults — non-None defaults should not be set in code.

Consider either:
- Making `defer_fsdp_grad_sync` a required field in `DTensorConfig`, or
- Setting the default in the YAML config files and accessing it directly here.
As per coding guidelines: "YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values".
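Under the second option, a sketch of what the exemplar YAML might carry; the file location and nesting under `policy.dtensor_cfg` are assumptions about the config layout:

```yaml
# e.g. in an exemplar config under examples/configs/ (exact path illustrative)
policy:
  dtensor_cfg:
    defer_fsdp_grad_sync: true   # default lives in YAML, not in code
```

The code then reads the key directly, so a missing value fails loudly instead of silently falling back to a hidden in-code default.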
436-463: Potential key collision between `from_pretrained_kwargs` and `automodel_kwargs`.

Both `**from_pretrained_kwargs` (from `hf_config_overrides`) and `**automodel_kwargs` are unpacked into `from_pretrained()`. If any key exists in both dicts, `automodel_kwargs` silently wins. This may be intentional, but if not, it could cause subtle config loss.

Consider adding a guard:

```python
overlap = set(from_pretrained_kwargs) & set(automodel_kwargs)
if overlap:
    print(f"[WARNING] Overlapping keys between hf_config_overrides and automodel_kwargs: {overlap}")
```

tests/unit/models/automodel/test_automodel_setup.py (1)
610-627: Lambda `self` parameter shadows outer fixture `self`.

Ruff flags `self` as unused in the lambda on line 622. The parameter is actually the mock instance receiving `__getitem__`, but it shadows the fixture's `self`. Consider renaming to `_self` or `_` for clarity.

♻️ Minor rename to silence Ruff ARG005

```diff
- mock_mesh.__getitem__ = lambda self, key: {
+ mock_mesh.__getitem__ = lambda _self, key: {
```
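As a side note on the key-collision nitpick above: Python's overlap semantics depend on where the unpacking happens. A pre-merged dict silently takes the later value, while duplicate keys in a call's double-star unpacking actually raise `TypeError`. A self-contained demonstration (key names illustrative, not taken from the actual configs):

```python
d1 = {"attn_implementation": "eager", "trust_remote_code": True}
d2 = {"attn_implementation": "sdpa"}

# Dict-literal merging silently lets the later dict win...
merged = {**d1, **d2}
assert merged["attn_implementation"] == "sdpa"

# ...whereas duplicate keys across ** unpackings in a call raise TypeError.
def f(**kwargs):
    return kwargs

try:
    f(**d1, **d2)
    raised = False
except TypeError:
    raised = True
assert raised
```

So whether the overlap is silent or loud depends on whether the kwargs are merged before the `from_pretrained()` call; the proposed guard makes the behavior explicit either way.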
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
- Set router_aux_loss_coef=0 via hf_config_overrides to fix 74x gradient inflation caused by MoEAuxLossAutoScaler.main_loss_backward_scale not being set in nemo-rl (defaults to 1.0, aux_loss grads unscaled vs token-averaged cross-entropy loss)
- Add experts=torch_mm and rope_fusion=false backend settings
- Remove TORCH_COMPILE_DISABLE=1 from test script (fix is in automodel experts.py @torch.compile removal on _apply_bias)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hemildesai <hemild@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: hemildesai <hemild@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
- increase time to avoid flaky fail: grpo-qwen3-8B-base-1n8g-fsdp2-lora, grpo-qwen3-8b-base-1n8g-megatron-lora
- fix make_sequence_length_divisible_by: distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack
- set use_distributed_optimizer=True to avoid using layer_wise_optimizer: sft-qwen2.5-math7b-1n8g-megatron_chunked_linear_ce_loss

Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
- missing sampling_params after rebase
- numerical differences on gb200: test_megatron_worker.py::test_megatron_context_parallel_topk_agreement
- missing model.rotary_emb_local: test_parallelize.py::test_parallelize_plan_keys
- wildcard_match fix in automodel: test_lora.py::test_apply_lora_respects_wildcard

remaining unit test fails:
- test_smolvlm_embeddings_bug.py::test_smolvlm_embeddings_differ_from_reference
- test_dtensor_worker_v2.py::test_dtensor_worker_v1_v2_model_config_equivalence
- test_dtensor_worker.py::test_dtensor_tp_and_tied_model_with_custom_parallel_plan

Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
…p two automodel unit tests Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
…se in gb200 vlm Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com>
✅ Submodule Fast-Forward Check Results

Check based on commit: 1add665 (PR #1962 from …)

✅ Submodules that are properly updated:
Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨
/ok to test 1add665
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
✅ Submodule Fast-Forward Check Results

Check based on commit: e165545 (PR #1962 from …)

✅ Submodules that are properly updated:
Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨
/ok to test e165545
yuki-97
left a comment
h100 nightly tests all passed, can be merged!
Closes #1995
Closes #2041
Closes #2042
Closes NMFW-4
Summary by CodeRabbit
Release Notes
New Features
Configuration Updates
Dependencies
Improvements