
cp: feat: DTensorPolicyV2 GPT-OSS SFT support (1470) into r0.5.0 #1690

Merged
yuki-97 merged 1 commit into r0.5.0 from cherry-pick-1470-r0.5.0
Dec 23, 2025
Conversation

Contributor

@chtruong814 chtruong814 commented Dec 23, 2025

beep boop [🤖]: Hi @adil-a 👋,

we've cherry-picked #1470 into r0.5.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

Release Notes

  • New Features

    • Added checkpoint management system for distributed model training with support for LORA/PEFT configurations
    • Enhanced DTensor v2 with dynamic backend selection and CPU offload support
    • Added SFT training configuration for GPT-OSS 20B with expert parallelism
    • Added Transformer Engine runtime patching utilities
  • Bug Fixes

    • Fixed LoRA initialization to use standardized methods
    • Corrected NeMo automodel import paths
  • Chores

    • Updated dependencies: transformer-engine, deep_ep, and GPU acceleration libraries
    • Reorganized distributed test suite with improved class-based structure
    • Expanded test coverage for checkpoint management and configuration validation


Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@chtruong814 chtruong814 requested review from a team as code owners December 23, 2025 05:12
@chtruong814 chtruong814 requested review from adil-a and removed request for a team December 23, 2025 05:12
@github-actions

⚠️ File Consistency Check

Check based on commit: 0f41577 (PR #1690 from cherry-pick-1470-r0.5.0)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/workers/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 0f41577 (PR #1690 from cherry-pick-1470-r0.5.0)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (r0.5.0 branch): https://github.com/NVIDIA-NeMo/Automodel/commits/910f4e0402ec3af0c3b8642639f0347732067630/
CURRENT (PR #1690 from cherry-pick-1470-r0.5.0): https://github.com/NVIDIA-NeMo/Automodel/commits/1d42deb98169fd94b54c714c0fe4bf308fe7115a/

Please ensure all submodule commits are fast-forwards of the r0.5.0 branch before merging.

@coderabbitai
Contributor

coderabbitai bot commented Dec 23, 2025

📝 Walkthrough

This PR introduces Automodel and DeepEP integration for DTensor-based LLM training, including a new checkpoint management system, refactored Transformer-Engine patching, and dynamic attention implementation selection. Configuration structures for Automodel backends and new test coverage for checkpoint management and policy worker flows are added.

Changes

Cohort / File(s) Summary
Automodel Configuration & Types
nemo_rl/models/policy/__init__.py, examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml
Introduces TypedDict structures for AutomodelBackendConfig and AutomodelKwargs; extends DTensorConfig with optional automodel_kwargs field. New YAML recipe specifies Automodel training policy with FSDP8-EP8 configuration for GPT-OSS 20B.
Policy Worker Initialization & Warnings
nemo_rl/models/policy/lm_policy.py
Adds runtime warning when TORCH_CUDA_ARCH_LIST environment variable is absent, noting requirement for DeepEP in DTensorPolicyWorker V2.
Automodel Import Path Updates
nemo_rl/models/policy/utils.py
Updates import path for NeMo Automodel classes from nemo_automodel.components._transformers.auto_model to nemo_automodel._transformers.auto_model.
Transformer-Engine Runtime Patching
nemo_rl/models/policy/workers/patches.py, nemo_rl/models/policy/workers/megatron_policy_worker.py
Introduces new patches module with _get_transformer_engine_file and apply_transformer_engine_patch utilities. Refactors MegatronPolicyWorker to use externalized patching instead of internal implementation.
DTensor Policy Worker V2 Major Refactor
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
Substantial changes: (1) Early transformer-engine patching via apply_transformer_engine_patch; (2) Dynamic attention implementation selection based on sequence packing and context-parallel size; (3) Automodel kwargs augmentation with backend configuration and use_liger_kernel flag; (4) Integration with new AutomodelCheckpointManager for checkpoint operations; (5) FSDP2Manager-based device mesh and parallelization flow; (6) Gradient scaling and clipping with scale_grads_and_clip_grad_norm wrapper; (7) Precision handling via STRING_TO_DTYPE mapping.
Checkpoint Management System
nemo_rl/utils/automodel_checkpoint.py
New AutomodelCheckpointManager class wrapping nemo_automodel's Checkpointer. Provides object-oriented checkpoint interface with rank-aware initialization, PEFT/LoRA configuration support, model state dict key tracking, and checkpoint addon management (ConsolidatedHFAddon, PeftAddon). Replaces functional checkpoint API.
Dependencies & Build Configuration
pyproject.toml, pyrefly.toml, nemo_rl/utils/venvs.py
Adds transformer-engine[pytorch]==2.8.0, nv-grouped-gemm, and deep_ep dependencies to automodel group. Updates deep_ep git revision in vllm block. Moves shutil import to top-level in venvs.py. Updates pyrefly includes for patches.py and automodel_checkpoint.py modules.
Test Framework Updates
tests/unit/models/policy/test_dtensor_worker.py, tests/unit/models/policy/test_dtensor_worker_v2.py
Reorganizes test_dtensor_worker.py with class-based test organization (TestSingleGPUCluster, TestTwoGPUCluster) and centralized _base_setup_impl helper. Expands test_dtensor_worker_v2.py with enhanced create_test_config signature (precision, expert_parallel_size, automodel_kwargs, checkpointing) and new create_test_batch helper.
New Test Suites
tests/unit/models/policy/test_automodel_types.py, tests/unit/models/policy/test_patches.py, tests/unit/utils/test_automodel_checkpoint.py, tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh, tests/test_suites/nightly.txt
Introduces unit tests for AutomodelBackendConfig TypedDict validation, Transformer-Engine patching logic (path resolution, patching application, module reload), and comprehensive AutomodelCheckpointManager functionality (distributed checkpoint save/load, format detection, PEFT handling). Adds integration test script for GPT-OSS 20B DeepEP training with metric checks.
LoRA Test Migration
tests/unit/models/dtensor/test_lora.py
Removes internal _patched_init_lora_weights import and replaces with LinearLoRA.init_lora_weights method calls. Deletes test_lora_init_differs_from_upstream_buggy_version test.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA-NeMo/RL#1023 — Introduces nemo-automodel checkpointing support and automodel_checkpoint module with wiring into DTensor/Policy worker workflows.
  • NVIDIA-NeMo/RL#1470 — Modifies DTensorPolicyWorkerV2, automodel checkpoint utilities, automodel import paths, and transformer-engine patching integration.
  • NVIDIA-NeMo/RL#1665 — Implements SDPA/attention backend selection (attn_impl/sdpa_method) computation in dtensor_policy_worker_v2.py for context-parallel handling.

Suggested labels

r0.5.0, CI:L1

Suggested reviewers

  • adil-a
  • terrykong
  • joyang-nv

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Test Results For Major Changes: ⚠️ Warning. The PR description lacks test results, metrics, and convergence validation despite substantial changes to gradient scaling, loss scaling, attention selection, and checkpoint management logic. Resolution: update the PR description with test results, convergence validation, performance metrics, and the configuration used for testing.
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The PR title clearly indicates this is a cherry-pick of DTensorPolicyV2 GPT-OSS SFT support (PR #1470) into the r0.5.0 branch, which accurately summarizes the main purpose of the changeset.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 81.65%, above the required threshold of 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/models/policy/test_dtensor_worker.py (1)

192-203: Same device inconsistency as in v2 tests.

sample_mask is created on CUDA (line 195) before the entire batch is moved to CPU (line 202). Consider creating it without specifying device.

🔎 Proposed fix
             **(
                 {
                     "labels": torch.randint(0, vocab_size, (batch_size, seq_len)),
-                    "sample_mask": torch.ones(batch_size).cuda(),
+                    "sample_mask": torch.ones(batch_size),
                 }
                 if mode == "train"
                 else {}
             ),
🧹 Nitpick comments (8)
nemo_rl/models/policy/lm_policy.py (1)

115-119: Add stacklevel parameter to warning for better debugging.

The warning correctly checks for TORCH_CUDA_ARCH_LIST and provides helpful guidance. However, adding stacklevel=2 will ensure the warning points to the caller's location rather than this line, making it easier to debug.

🔎 Proposed fix
 if "TORCH_CUDA_ARCH_LIST" not in os.environ:
     warnings.warn(
         "TORCH_CUDA_ARCH_LIST is not set. This is needed if using DeepEP in DTensorPolicyWorker V2. This variable is set in our container, but "
         "if you are running a custom container or baremetal, you may need to set this variable manually. Example: export TORCH_CUDA_ARCH_LIST='9.0 10.0'",
+        stacklevel=2,
     )

Based on static analysis hints.
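To see why the suggestion matters, here is a small self-contained sketch (not the actual lm_policy.py code) showing that stacklevel=2 attributes the warning to the caller's line rather than to the warnings.warn() call itself:

```python
import warnings


def check_cuda_arch(env: dict) -> None:
    # stacklevel=2 makes the reported source line the caller of
    # check_cuda_arch, which is more useful when debugging config issues.
    if "TORCH_CUDA_ARCH_LIST" not in env:
        warnings.warn("TORCH_CUDA_ARCH_LIST is not set.", stacklevel=2)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_cuda_arch({})  # the warning is reported against this line

print(str(caught[0].message))
```

Without stacklevel, the traceback location would always point inside check_cuda_arch, hiding which call site triggered it.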

tests/unit/models/policy/test_automodel_types.py (1)

20-25: Remove unnecessary noqa directive.

The static analysis indicates F401 (unused import) rule is not enabled, making the noqa: F401 directive unnecessary. The BackendConfig import is actually used at line 65, so even if the rule were enabled, this wouldn't be flagged.

🔎 Proposed fix
 try:
-    from nemo_automodel.components.moe.utils import BackendConfig  # noqa: F401
+    from nemo_automodel.components.moe.utils import BackendConfig

     NEMO_AUTOMODEL_AVAILABLE = True
tests/unit/utils/test_automodel_checkpoint.py (1)

92-115: Intentional exception swallowing for cleanup resilience.

The broad exception handling in _cleanup_dcp_planner_cache is appropriate here since this is a test cleanup helper that should not cause test failures. Consider adding a brief comment explaining this is intentional for test isolation.

🔎 Optional: Add explanatory comment
     except Exception:
-        pass
+        pass  # Cleanup should not fail tests; errors are non-critical
tests/unit/models/policy/test_patches.py (1)

185-216: Consider using _ for intentionally unused parameter.

The path parameter in mock_open_func is unused since the mock only needs to differentiate by mode. Using _ or _path would clarify intent.

🔎 Proposed fix
-        def mock_open_func(path, mode="r"):
+        def mock_open_func(_path, mode="r"):
             call_count[0] += 1
             if mode == "r":
                 mock_file_handle.read.return_value = self.UNPATCHED_CONTENT
             return mock_file_handle
nemo_rl/models/policy/workers/patches.py (1)

96-103: Consider moving imports to module level.

The importlib and sys imports inside the function could be moved to the top of the file for consistency with other imports.

🔎 Proposed fix
 import os
+import sys
+import importlib
 from importlib.util import find_spec

Then remove the local imports at lines 98-99.

nemo_rl/utils/automodel_checkpoint.py (1)

171-190: Accessing private _addons attribute is fragile.

This method directly manipulates self.checkpointer._addons, which is a private implementation detail of the Checkpointer class. If the underlying library changes this internal structure, this code will break silently.

Consider adding a comment explaining why this is necessary and/or wrapping in a try-except to handle potential API changes gracefully.

🔎 Proposed documentation
     def _rebuild_checkpointer_addons(self) -> None:
         """Rebuild the checkpointer's _addons list based on current config.

         The Checkpointer's _addons list is populated during __init__ based on config.
         When config changes (e.g., model_save_format or is_peft), we need to rebuild
         the addons list to match the new config.
+
+        Note: This accesses the private _addons attribute of Checkpointer.
+        This coupling is necessary because the Checkpointer doesn't expose
+        a public API to update addons after initialization.
         """
tests/unit/models/policy/test_dtensor_worker.py (2)

841-854: Use next(iter(...)) for cleaner single-element access.

Static analysis correctly suggests using next(iter(...)) instead of list(...)[0] for accessing the first element.

🔎 Proposed fix
-            param_sample = list(info["parameter_sample"].values())[0]
+            param_sample = next(iter(info["parameter_sample"].values()))
...
-        param_names = [list(info["parameter_sample"].keys())[0] for info in gpu_infos]
+        param_names = [next(iter(info["parameter_sample"].keys())) for info in gpu_infos]
...
-            param_device = list(info["parameter_sample"].values())[0]["device"]
+            param_device = next(iter(info["parameter_sample"].values()))["device"]
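The RUF015 rewrite behaves identically for non-empty dicts; a quick sketch with a made-up parameter_sample payload:

```python
info = {"parameter_sample": {"embed_tokens.weight": {"device": "cuda:0"}}}

# list(...)[0] materializes every value before indexing;
# next(iter(...)) stops after producing the first one.
first_via_list = list(info["parameter_sample"].values())[0]
first_via_iter = next(iter(info["parameter_sample"].values()))
first_key = next(iter(info["parameter_sample"].keys()))
```

For an empty dict the two differ only in the exception raised (IndexError vs. StopIteration), which is irrelevant here since the tests always populate parameter_sample.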

1093-1103: Rename unused loop variables to indicate intent.

Per static analysis, rename unused loop variables to underscore-prefixed names.

🔎 Proposed fix
-            for warmup_step in range(2):
+            for _warmup_step in range(2):
                 results = policy.train(data, loss_fn)

             # Measure FLOPS on 3 iterations
             print("Measuring FLOPS on 3 iterations...")
             time_begin = time.time()
-            for train_step in range(3):
+            for _train_step in range(3):
                 results = policy.train(data, loss_fn)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b1a1e73 and 0f41577.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (21)
  • 3rdparty/Automodel-workspace/Automodel
  • examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml
  • nemo_rl/models/policy/__init__.py
  • nemo_rl/models/policy/lm_policy.py
  • nemo_rl/models/policy/utils.py
  • nemo_rl/models/policy/workers/__init__.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
  • nemo_rl/models/policy/workers/patches.py
  • nemo_rl/utils/automodel_checkpoint.py
  • nemo_rl/utils/venvs.py
  • pyproject.toml
  • pyrefly.toml
  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
  • tests/test_suites/nightly.txt
  • tests/unit/models/dtensor/test_lora.py
  • tests/unit/models/policy/test_automodel_types.py
  • tests/unit/models/policy/test_dtensor_worker.py
  • tests/unit/models/policy/test_dtensor_worker_v2.py
  • tests/unit/models/policy/test_patches.py
  • tests/unit/utils/test_automodel_checkpoint.py
🧰 Additional context used
📓 Path-based instructions (9)
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml
examples/configs/recipes/llm/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Recipe YAML files should follow the naming pattern: --ng-[-modifiers][-long][.vN].yaml for LLM recipes

Files:

  • examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml
  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
  • nemo_rl/models/policy/workers/patches.py
  • nemo_rl/models/policy/lm_policy.py
  • pyrefly.toml
  • tests/test_suites/nightly.txt
  • tests/unit/models/policy/test_patches.py
  • tests/unit/models/policy/test_automodel_types.py
  • tests/unit/models/policy/test_dtensor_worker_v2.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • tests/unit/models/policy/test_dtensor_worker.py
  • tests/unit/models/dtensor/test_lora.py
  • nemo_rl/models/policy/utils.py
  • 3rdparty/Automodel-workspace/Automodel
  • nemo_rl/utils/venvs.py
  • nemo_rl/models/policy/__init__.py
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
  • pyproject.toml
  • nemo_rl/utils/automodel_checkpoint.py
  • tests/unit/utils/test_automodel_checkpoint.py
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
  • nemo_rl/models/policy/workers/patches.py
  • nemo_rl/models/policy/lm_policy.py
  • tests/unit/models/policy/test_patches.py
  • tests/unit/models/policy/test_automodel_types.py
  • tests/unit/models/policy/test_dtensor_worker_v2.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • tests/unit/models/policy/test_dtensor_worker.py
  • tests/unit/models/dtensor/test_lora.py
  • nemo_rl/models/policy/utils.py
  • nemo_rl/utils/venvs.py
  • nemo_rl/models/policy/__init__.py
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
  • nemo_rl/utils/automodel_checkpoint.py
  • tests/unit/utils/test_automodel_checkpoint.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • nemo_rl/models/policy/workers/patches.py
  • nemo_rl/models/policy/lm_policy.py
  • tests/unit/models/policy/test_patches.py
  • tests/unit/models/policy/test_automodel_types.py
  • tests/unit/models/policy/test_dtensor_worker_v2.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • tests/unit/models/policy/test_dtensor_worker.py
  • tests/unit/models/dtensor/test_lora.py
  • nemo_rl/models/policy/utils.py
  • nemo_rl/utils/venvs.py
  • nemo_rl/models/policy/__init__.py
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
  • nemo_rl/utils/automodel_checkpoint.py
  • tests/unit/utils/test_automodel_checkpoint.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes

Files:

  • nemo_rl/models/policy/workers/patches.py
  • nemo_rl/models/policy/lm_policy.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • nemo_rl/models/policy/utils.py
  • nemo_rl/utils/venvs.py
  • nemo_rl/models/policy/__init__.py
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
  • nemo_rl/utils/automodel_checkpoint.py
tests/test_suites/nightly.txt

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Files:

  • tests/test_suites/nightly.txt
🧠 Learnings (9)
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Applied to files:

  • examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Applied to files:

  • examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain

Applied to files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/nightly.txt : When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Applied to files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
  • tests/test_suites/nightly.txt
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Applied to files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to **/*.py : When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml

Applied to files:

  • tests/unit/models/policy/test_automodel_types.py
  • nemo_rl/models/policy/__init__.py
📚 Learning: 2025-09-17T01:52:21.399Z
Learnt from: ffrujeri
Repo: NVIDIA-NeMo/RL PR: 1023
File: nemo_rl/utils/checkpoint.py:58-65
Timestamp: 2025-09-17T01:52:21.399Z
Learning: model_state_dict_keys is not intended to be part of the nemo-rl CheckpointingConfig TypedDict - it's handled at the automodel implementation layer, not as a general checkpointing configuration parameter.

Applied to files:

  • tests/unit/models/policy/test_automodel_types.py
  • nemo_rl/models/policy/__init__.py
  • nemo_rl/utils/automodel_checkpoint.py
  • tests/unit/utils/test_automodel_checkpoint.py
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • tests/unit/models/policy/test_dtensor_worker_v2.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
🧬 Code graph analysis (5)
tests/unit/models/policy/test_patches.py (1)
nemo_rl/models/policy/workers/patches.py (2)
  • _get_transformer_engine_file (19-44)
  • apply_transformer_engine_patch (47-106)
tests/unit/models/policy/test_automodel_types.py (1)
nemo_rl/models/policy/__init__.py (1)
  • AutomodelBackendConfig (37-55)
nemo_rl/models/policy/workers/megatron_policy_worker.py (1)
nemo_rl/models/policy/workers/patches.py (1)
  • apply_transformer_engine_patch (47-106)
nemo_rl/utils/automodel_checkpoint.py (2)
nemo_rl/utils/checkpoint.py (1)
  • CheckpointingConfig (36-67)
tests/unit/utils/test_checkpoint.py (2)
  • checkpoint_config (31-38)
  • checkpoint_dir (26-27)
tests/unit/utils/test_automodel_checkpoint.py (1)
nemo_rl/utils/automodel_checkpoint.py (9)
  • AutomodelCheckpointManager (42-390)
  • _infer_checkpoint_root (428-443)
  • detect_checkpoint_format (393-425)
  • set_model_state_dict_keys (192-200)
  • save_checkpoint (246-329)
  • load_checkpoint (331-390)
  • load_base_model (202-244)
  • init_checkpointer (84-130)
  • update_checkpointer_config (132-169)
🪛 GitHub Actions: Automodel Integration and Submodule Checks
3rdparty/Automodel-workspace/Automodel

[error] 1-1: Submodule is BEHIND the r0.5.0 branch. PR commits are missing from the target branch; submodule needs to be updated to include recent changes from r0.5.0.

🪛 Ruff (0.14.10)
nemo_rl/models/policy/workers/patches.py

27-31: Avoid specifying long messages outside the exception class

(TRY003)


37-42: Avoid specifying long messages outside the exception class

(TRY003)


105-105: Do not catch blind exception: Exception

(BLE001)

nemo_rl/models/policy/lm_policy.py

116-116: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)

tests/unit/models/policy/test_patches.py

187-187: Unused function argument: path

(ARG001)

tests/unit/models/policy/test_automodel_types.py

21-21: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

tests/unit/models/policy/test_dtensor_worker.py

841-841: Prefer next(iter(info["parameter_sample"].values())) over single element slice

Replace with next(iter(info["parameter_sample"].values()))

(RUF015)


847-847: Prefer next(iter(info["parameter_sample"].keys())) over single element slice

Replace with next(iter(info["parameter_sample"].keys()))

(RUF015)


854-854: Prefer next(iter(info["parameter_sample"].values())) over single element slice

Replace with next(iter(info["parameter_sample"].values()))

(RUF015)


861-861: Unused method argument: use_v2

(ARG002)


873-873: Unused method argument: use_v2

(ARG002)


1095-1095: Loop control variable warmup_step not used within loop body

Rename unused warmup_step to _warmup_step

(B007)


1101-1101: Loop control variable train_step not used within loop body

Rename unused train_step to _train_step

(B007)

nemo_rl/utils/automodel_checkpoint.py

281-283: Avoid specifying long messages outside the exception class

(TRY003)

tests/unit/utils/test_automodel_checkpoint.py

114-115: try-except-pass detected, consider logging the exception

(S110)


114-114: Do not catch blind exception: Exception

(BLE001)


129-130: try-except-pass detected, consider logging the exception

(S110)


129-129: Do not catch blind exception: Exception

(BLE001)


760-760: Unused method argument: init_distributed

(ARG002)


811-811: Unused method argument: init_distributed

(ARG002)


861-861: Unused method argument: init_distributed

(ARG002)


939-939: Unused method argument: init_distributed

(ARG002)

🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
🔇 Additional comments (30)
tests/unit/models/dtensor/test_lora.py (2)

73-73: Migration to LinearLoRA.init_lora_weights is complete.

All usages of the private function _patched_init_lora_weights have been successfully removed from the codebase. The code at line 73 correctly uses the public API LinearLoRA.init_lora_weights(lora2, "kaiming") for kaiming initialization testing.


55-55: The migration to LinearLoRA.init_lora_weights is verified and complete.

LinearLoRA's static methods provide the LoRA functionality for both instance initialization and monkey-patching. The codebase confirms that LinearLoRA.init_lora_weights(module, init_method) exists in nemo_automodel and is properly imported and used at lines 55 and 73. The old _patched_init_lora_weights function has been completely removed with no remaining references. The test correctly validates initialization behavior including mean properties, standard deviation expectations, and zero-initialization of LoRA B weights.

tests/test_suites/nightly.txt (1)

90-91: No issues found. The test entry tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh appears only once in the nightly test suite at line 91. The addition correctly follows the coding guideline format for appending nightly test driver script paths.

nemo_rl/models/policy/utils.py (1)

32-36: No action required—the import path is correct and error handling is already in place.

The new import path nemo_automodel._transformers.auto_model is documented in the official NeMo-AutoModel API, and the code already implements a try-except wrapper (lines 31–44) that gracefully handles ImportError by setting the NeMo model classes to None with a NEMO_AUTOMODEL_AVAILABLE flag. This design ensures the code will not fail at runtime if the import fails or if nemo_automodel is not installed.

3rdparty/Automodel-workspace/Automodel (1)

1-1: This review comment is based on incorrect context and cannot be verified.

The claimed cherry-pick to r0.5.0 branch does not exist—this PR #1470 targets the main branch. No r0.5.0 branch is present in the repository. Additionally, the referenced old commit hash cannot be found in git history, making the regression comparison unverifiable.

The file 3rdparty/Automodel-workspace/Automodel is a git submodule pointer (160000 mode), not a Python or shell script, so the NVIDIA copyright header guideline does not apply.

Likely an incorrect or invalid review comment.

nemo_rl/utils/venvs.py (1)

17-17: LGTM - Import organization improvement.

Moving the shutil import to the top-level follows Python import conventions and improves readability.

pyrefly.toml (1)

112-116: LGTM - Pyrefly configuration updates.

The new project-includes entries are correctly added for the new modules introduced in this PR: workers __init__.py, patches.py, and automodel_checkpoint.py.

nemo_rl/models/policy/workers/megatron_policy_worker.py (2)

132-132: LGTM - Refactored patch import.

The Transformer Engine patching logic is now properly delegated to the external patches module, improving code organization and maintainability.


451-451: LGTM - Early patch application.

Applying the Transformer Engine patch early in __init__ before other initialization is appropriate to ensure the patched code is in effect before any TE imports occur.

examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml (1)

1-22: LGTM - Well-structured recipe configuration.

The YAML follows the naming convention <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers].yaml as per coding guidelines. The automodel backend configuration aligns with the AutomodelBackendConfig TypedDict structure. Based on learnings from coding guidelines.

Please verify that openai/gpt-oss-20b is the intended model identifier, as this appears to be an internal or placeholder model name.

tests/unit/models/policy/test_automodel_types.py (1)

35-71: LGTM - Comprehensive TypedDict validation tests.

The test class properly validates that AutomodelBackendConfig keys are defined and verifies instantiation compatibility with the actual BackendConfig class when nemo_automodel is available.

nemo_rl/models/policy/__init__.py (2)

37-62: LGTM - Well-documented TypedDict definitions.

The AutomodelBackendConfig and AutomodelKwargs TypedDicts are properly documented with inline comments explaining each key's purpose, valid values, and types. The use of NotRequired for optional fields follows Python 3.12+ patterns. As per coding guidelines, TypedDict keys are documented.


81-81: LGTM - Clean DTensorConfig extension.

Adding automodel_kwargs as an optional field to DTensorConfig provides a clean integration path for automodel configuration without breaking existing usage.

tests/unit/utils/test_automodel_checkpoint.py (4)

135-163: LGTM - Thorough distributed test fixture.

The init_distributed fixture properly handles cleanup of DCP planner caches and device mesh caches before and after each test, ensuring test isolation. The fixture's side-effect-only usage (appearing as unused in test signatures) is a standard pytest pattern.


186-295: LGTM - Comprehensive checkpoint format detection tests.

Good coverage of various checkpoint formats (safetensors, torch_save with distcp/bin/pt), PEFT adapter detection, and edge cases (empty/nonexistent directories, nested structures).


353-616: LGTM - Well-structured AutomodelCheckpointManager tests.

The test class properly validates manager initialization, state dict key setting, error handling for uninitialized checkpointer, and configuration updates. Mock usage is appropriate for unit testing without requiring full distributed setup.


755-1032: LGTM - Valuable integration tests.

The TestSaveLoadIntegration class provides end-to-end validation of save/load operations with both safetensors and torch_save formats, including optimizer state and LoRA weights. These integration tests verify the full checkpoint workflow.

tests/unit/models/policy/test_patches.py (2)

28-123: LGTM - Comprehensive path resolution tests.

Excellent coverage of edge cases for _get_transformer_engine_file: package not found, no submodule locations, empty locations, file not found, and successful lookup with various path segments.


357-447: LGTM - Valuable integration tests with real files.

The integration tests using real temporary files validate actual file operations and verify patch idempotency. The test_patch_idempotent test correctly verifies that applying the patch twice produces identical content with only one success message.

pyproject.toml (1)

61-67: LGTM - Automodel dependencies properly configured.

The new dependencies for automodel support (nv-grouped-gemm, transformer-engine, deep_ep) are appropriately pinned with specific versions/git revisions for reproducibility.

nemo_rl/models/policy/workers/patches.py (2)

19-44: LGTM - Robust path resolution for TE file location.

The function properly handles missing package and file scenarios with clear error messages.


105-106: Broad exception catch is acceptable here.

The static analysis flags this as BLE001, but in this context catching all exceptions is intentional to ensure patch failures don't crash the main application. The error is logged for debugging purposes.
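A minimal sketch of that pattern (the helper name is invented; the actual `patches.py` code differs): catch broadly, log, and continue, so a failed best-effort patch never takes down the worker.

```python
import logging

logger = logging.getLogger(__name__)


def apply_patch_best_effort(patch_fn) -> bool:
    """Run a best-effort patch; a failure is logged instead of propagated."""
    try:
        patch_fn()
        return True
    except Exception:  # noqa: BLE001 - intentional: patching must not crash the app
        logger.exception("patch failed; continuing without it")
        return False
```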

tests/unit/models/policy/test_dtensor_worker_v2.py (2)

299-381: Well-structured checkpoint save/load test with proper cleanup.

Good use of tempfile.TemporaryDirectory, proper GPU memory cleanup by shutting down the first policy before creating the second, and verification of loaded policy state.


384-457: Comprehensive mixed precision test covering training and logprobs.

Good coverage of both bfloat16 and float16 precisions, with appropriate assertions for NaN/Inf checks and dtype verification.

nemo_rl/utils/automodel_checkpoint.py (1)

246-329: Solid save_checkpoint implementation with PEFT support.

Good handling of optional components (optimizer, scheduler, tokenizer) and proper integration with the checkpointing configuration.

tests/unit/models/policy/test_dtensor_worker.py (1)

859-876: The use_v2 parameter is used indirectly via _get_use_v2.

The static analysis flags use_v2 as unused, but it's actually accessed through pytest's parametrization mechanism via _get_use_v2(request) in the fixture chain. The parameter name triggers the correct test configuration.

nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py (4)

128-129: TE patch applied early in initialization - good placement.

Applying the Transformer Engine patch before any model setup ensures compatibility is established early.


95-99: Clean precision mapping with explicit dtype lookup.

The STRING_TO_DTYPE dictionary provides a clear and maintainable way to convert precision strings to torch dtypes, with proper error handling at lines 165-168.


351-425: Well-structured FSDP2Manager integration and model parallelization.

Good separation between MoE model parallelization (using moe_parallelize_model) and standard HF model parallelization (using manager.parallelize). The device mesh references are properly stored for downstream usage.


1933-1957: Clean checkpoint manager initialization with lazy instantiation.

The _init_checkpoint_manager method properly handles lazy initialization and passes the required device meshes. Good use of getattr with default for model_state_dict_keys.

Comment on lines +151 to 161
         self.cpu_offload = self.cfg["dtensor_cfg"]["cpu_offload"]
         # torch distributed init. Envars for rank, world_size, and master_addr and master_port are set from the ray remote call
-        torch.distributed.init_process_group(backend="nccl")
+        backend = "nccl" if not self.cpu_offload else "cuda:nccl,cpu:gloo"
+        torch.distributed.init_process_group(backend=backend)
         self.rank = torch.distributed.get_rank()
         world_size = torch.distributed.get_world_size()
         model_name = self.cfg["model_name"]

+        self.checkpoint_manager: Optional[AutomodelCheckpointManager] = None
+
+        self.cpu_offload = self.cfg["dtensor_cfg"]["cpu_offload"]
⚠️ Potential issue | 🟡 Minor

Duplicate assignment of self.cpu_offload.

self.cpu_offload is assigned at line 151 and then again at line 161 with the same value. Remove the duplicate.

🔎 Proposed fix
         self.cpu_offload = self.cfg["dtensor_cfg"]["cpu_offload"]
         # torch distributed init. Envars for rank, world_size, and master_addr and master_port are set from the ray remote call
         backend = "nccl" if not self.cpu_offload else "cuda:nccl,cpu:gloo"
         torch.distributed.init_process_group(backend=backend)
         self.rank = torch.distributed.get_rank()
         world_size = torch.distributed.get_world_size()
         model_name = self.cfg["model_name"]

         self.checkpoint_manager: Optional[AutomodelCheckpointManager] = None

-        self.cpu_offload = self.cfg["dtensor_cfg"]["cpu_offload"]
         self.offload_optimizer_for_logprob = self.cfg["offload_optimizer_for_logprob"]
🤖 Prompt for AI Agents
In nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py around lines 151 to
161, there is a duplicate assignment of self.cpu_offload (first at line 151 and
again at line 161); remove the redundant assignment at line 161 so
self.cpu_offload is only set once from self.cfg["dtensor_cfg"]["cpu_offload"]
and no behavior changes occur.

Comment on lines +402 to +407
assert self.tp_size == 1, (
"Using custom implementation {self.model.__class__.__name__} for MoE model {model_name} which doesn't support tp_size > 1. Please use expert_parallel_size > 1 for custom implementation or set force_hf=True in your config at policy->dtensor_cfg->automodel_kwargs to use the HuggingFace implementation."
)
assert self.cp_size == 1, (
"Using custom implementation {self.model.__class__.__name__} for MoE model {model_name} which doesn't support cp_size > 1. Please set force_hf=True in your config at policy->dtensor_cfg->automodel_kwargs to use the HuggingFace implementation."
)
⚠️ Potential issue | 🟡 Minor

Assertion messages use wrong string formatting.

The assertion messages at lines 402-406 contain {self.model.__class__.__name__} but are not f-strings, so the variable won't be interpolated.

🔎 Proposed fix
-            assert self.tp_size == 1, (
-                "Using custom implementation {self.model.__class__.__name__} for MoE model {model_name} which doesn't support tp_size > 1. Please use expert_parallel_size > 1 for custom implementation or set force_hf=True in your config at policy->dtensor_cfg->automodel_kwargs to use the HuggingFace implementation."
-            )
-            assert self.cp_size == 1, (
-                "Using custom implementation {self.model.__class__.__name__} for MoE model {model_name} which doesn't support cp_size > 1. Please set force_hf=True in your config at policy->dtensor_cfg->automodel_kwargs to use the HuggingFace implementation."
-            )
+            assert self.tp_size == 1, (
+                f"Using custom implementation {self.model.__class__.__name__} for MoE model {model_name} which doesn't support tp_size > 1. Please use expert_parallel_size > 1 for custom implementation or set force_hf=True in your config at policy->dtensor_cfg->automodel_kwargs to use the HuggingFace implementation."
+            )
+            assert self.cp_size == 1, (
+                f"Using custom implementation {self.model.__class__.__name__} for MoE model {model_name} which doesn't support cp_size > 1. Please set force_hf=True in your config at policy->dtensor_cfg->automodel_kwargs to use the HuggingFace implementation."
+            )
🤖 Prompt for AI Agents
In nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py around lines 402 to
407, the assertion messages use brace placeholders like
{self.model.__class__.__name__} and {model_name} but are plain strings, so
variables are not interpolated; change each assertion message to use f-strings
(or .format()) so the actual values are substituted, e.g. prefix the string with
f and reference the variables inside the braces for both tp_size and cp_size
assertions.
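The underlying Python behavior can be demonstrated in isolation (generic example; the identifiers mirror but do not reproduce the worker code):

```python
cls_name = "GptOssForCausalLM"
model_name = "openai/gpt-oss-20b"

# A plain string keeps the braces literally; only the f-string interpolates.
plain = "Using custom implementation {cls_name} for MoE model {model_name}"
interp = f"Using custom implementation {cls_name} for MoE model {model_name}"

assert "{cls_name}" in plain
assert "GptOssForCausalLM" in interp
```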

Comment on lines +354 to +361
model_save_format, is_peft = detect_checkpoint_format(weights_path)

weights_dir = os.path.dirname(weights_path)
checkpoint_root = (
os.path.dirname(weights_dir)
if weights_dir.endswith("weights")
else weights_dir
)
🛠️ Refactor suggestion | 🟠 Major

Duplicate checkpoint_root inference logic - use _infer_checkpoint_root helper.

This logic duplicates what _infer_checkpoint_root does at lines 428-443. The function already exists and should be reused here.

🔎 Proposed fix
         model_save_format, is_peft = detect_checkpoint_format(weights_path)

-        weights_dir = os.path.dirname(weights_path)
-        checkpoint_root = (
-            os.path.dirname(weights_dir)
-            if weights_dir.endswith("weights")
-            else weights_dir
-        )
+        checkpoint_root = _infer_checkpoint_root(weights_path)

         # Update checkpointer configuration
🤖 Prompt for AI Agents
In nemo_rl/utils/automodel_checkpoint.py around lines 354 to 361, the code
re-implements checkpoint_root inference by taking dirname(dirname(weights_path))
when the weights dir ends with "weights" or else dirname(weights_path); replace
this duplicated logic by calling the existing _infer_checkpoint_root helper
(used at lines 428-443) instead: remove the manual dirname logic and invoke
_infer_checkpoint_root(weights_path) (or the helper's expected parameter) to
compute checkpoint_root, preserving behavior and any special cases handled by
the helper.
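The inference both sites implement can be sketched as a single helper (a hypothetical re-implementation based on the duplicated logic shown above; the real `_infer_checkpoint_root` may handle additional cases):

```python
import os


def infer_checkpoint_root(weights_path: str) -> str:
    # Strip a trailing "weights" directory so the checkpoint root
    # (containing weights/, optimizer/, etc.) is returned.
    weights_dir = os.path.dirname(weights_path)
    if weights_dir.endswith("weights"):
        return os.path.dirname(weights_dir)
    return weights_dir
```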

exit_if_max_steps_reached

# Run the experiment
cd $PROJECT_ROOT
⚠️ Potential issue | 🟡 Minor

Add error handling for directory change.

The cd command should handle failures to prevent the script from continuing in the wrong directory if $PROJECT_ROOT is invalid or inaccessible.

🔎 Proposed fix
-cd $PROJECT_ROOT
+cd "$PROJECT_ROOT" || exit 1

Based on static analysis hints.

🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🤖 Prompt for AI Agents
In tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh around line
16, the script runs cd $PROJECT_ROOT without handling failure; update the script
so the directory change is attempted with the variable quoted, then immediately
check the result and if it fails print a clear error to stderr and exit with a
non‑zero status (or enable safe script failure such as set -e at the top) to
prevent continuing in the wrong directory.

logger.tensorboard_enabled=True \
checkpointing.enabled=True \
checkpointing.checkpoint_dir=$CKPT_DIR \
$@ \
⚠️ Potential issue | 🟡 Minor

Quote array expansion to prevent argument splitting.

The unquoted $@ can cause issues with arguments containing spaces or special characters. Use "$@" to preserve argument boundaries correctly.

🔎 Proposed fix
     checkpointing.checkpoint_dir=$CKPT_DIR \
-    $@ \
+    "$@" \
     2>&1 | tee $RUN_LOG

Based on static analysis hints.

🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)

🤖 Prompt for AI Agents
In tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh around line
28 the unquoted $@ may split arguments containing spaces or special characters;
replace it with "$@" so all passed arguments are preserved as distinct
parameters and not word-split by the shell.

Comment on lines +136 to +147
**(
{
"labels": torch.randint(0, vocab_size, (batch_size, seq_len)),
"sample_mask": torch.ones(batch_size).cuda(),
}
if mode == "train"
else {}
),
}
)
data = data.to("cpu")
return data
⚠️ Potential issue | 🟡 Minor

Device inconsistency: sample_mask created on CUDA before moving data to CPU.

The sample_mask tensor is created directly on CUDA (line 139), but then the entire BatchedDataDict is moved to CPU (line 146). This works because .to("cpu") will move all tensors including sample_mask, but creating it on CUDA first is unnecessary and inconsistent with how other tensors in the batch are created.

🔎 Proposed fix
             **(
                 {
                     "labels": torch.randint(0, vocab_size, (batch_size, seq_len)),
-                    "sample_mask": torch.ones(batch_size).cuda(),
+                    "sample_mask": torch.ones(batch_size),
                 }
                 if mode == "train"
                 else {}
             ),
🤖 Prompt for AI Agents
In tests/unit/models/policy/test_dtensor_worker_v2.py around lines 136 to 147,
the sample_mask is created on CUDA (torch.ones(batch_size).cuda()) but the
BatchedDataDict is subsequently moved to CPU; create sample_mask on the CPU
instead (e.g., torch.ones(batch_size) or torch.ones(batch_size, device="cpu") /
use the same device as other tensors) so it’s consistent before calling data =
data.to("cpu"); update the sample_mask creation to not call .cuda() or to use a
shared device variable.

@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Dec 23, 2025
@yuki-97 yuki-97 enabled auto-merge (squash) December 23, 2025 05:24
@yuki-97 yuki-97 merged commit cd3b423 into r0.5.0 Dec 23, 2025
79 of 85 checks passed
@yuki-97 yuki-97 deleted the cherry-pick-1470-r0.5.0 branch December 23, 2025 09:21

Labels

cherry-pick CI:L1 Run doctests, unit tests, and functional tests Run CICD

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants