cp: feat: DTensorPolicyV2 GPT-OSS SFT support (1470) into r0.5.0#1690
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
❌ Submodule Fast-Forward Check Failed

Check based on commit: 0f41577 (PR #1690)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor

Please ensure all submodule commits are fast-forwards of the r0.5.0 branch before merging.
📝 Walkthrough

This PR introduces Automodel and DeepEP integration for DTensor-based LLM training, including a new checkpoint management system, refactored Transformer-Engine patching, and dynamic attention implementation selection. Configuration structures for Automodel backends and new test coverage for checkpoint management and policy worker flows are added.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ❌ 1 failed check (warning) | ✅ 3 passed checks
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/models/policy/test_dtensor_worker.py (1)
192-203: Same device inconsistency as in v2 tests.
`sample_mask` is created on CUDA (line 195) before the entire batch is moved to CPU (line 202). Consider creating it without specifying a device.

🔎 Proposed fix

```diff
 **(
     {
         "labels": torch.randint(0, vocab_size, (batch_size, seq_len)),
-        "sample_mask": torch.ones(batch_size).cuda(),
+        "sample_mask": torch.ones(batch_size),
     }
     if mode == "train"
     else {}
 ),
```
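A minimal, CPU-only sketch of the device-agnostic pattern the fix suggests (shapes and key names are illustrative):

```python
import torch

batch_size, seq_len, vocab_size = 4, 16, 100

# Create tensors without pinning them to a device up front...
batch = {
    "labels": torch.randint(0, vocab_size, (batch_size, seq_len)),
    "sample_mask": torch.ones(batch_size),  # no .cuda() here
}

# ...so a single batch-wide move keeps every tensor on the same device.
batch = {k: v.to("cpu") for k, v in batch.items()}
print(all(v.device.type == "cpu" for v in batch.values()))
```

Because nothing is eagerly placed on CUDA, the later batch-wide `.to(...)` cannot leave one tensor stranded on a different device.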
🧹 Nitpick comments (8)
nemo_rl/models/policy/lm_policy.py (1)
115-119: Add `stacklevel` parameter to warning for better debugging.

The warning correctly checks for `TORCH_CUDA_ARCH_LIST` and provides helpful guidance. However, adding `stacklevel=2` will ensure the warning points to the caller's location rather than this line, making it easier to debug.

🔎 Proposed fix

```diff
 if "TORCH_CUDA_ARCH_LIST" not in os.environ:
     warnings.warn(
         "TORCH_CUDA_ARCH_LIST is not set. This is needed if using DeepEP in DTensorPolicyWorker V2. This variable is set in our container, but "
         "if you are running a custom container or baremetal, you may need to set this variable manually. Example: export TORCH_CUDA_ARCH_LIST='9.0 10.0'",
+        stacklevel=2,
     )
```

Based on static analysis hints.
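A stdlib-only sketch of the behavior, where `check_arch_list` is a hypothetical helper standing in for the real check:

```python
import warnings

def check_arch_list(env):
    # stacklevel=2 attributes the warning to the caller of this helper,
    # not to the warnings.warn() line itself.
    if "TORCH_CUDA_ARCH_LIST" not in env:
        warnings.warn("TORCH_CUDA_ARCH_LIST is not set.", stacklevel=2)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_arch_list({})                               # triggers the warning
    check_arch_list({"TORCH_CUDA_ARCH_LIST": "9.0"})  # does not

print(len(caught))  # 1
```

With `stacklevel=2`, the reported filename and line number point at the call site of `check_arch_list`, which is usually the location a developer needs to inspect.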
tests/unit/models/policy/test_automodel_types.py (1)
20-25: Remove unnecessary `noqa` directive.

The static analysis indicates the `F401` (unused import) rule is not enabled, making the `noqa: F401` directive unnecessary. The `BackendConfig` import is actually used at line 65, so even if the rule were enabled, this wouldn't be flagged.

🔎 Proposed fix

```diff
 try:
-    from nemo_automodel.components.moe.utils import BackendConfig  # noqa: F401
+    from nemo_automodel.components.moe.utils import BackendConfig

     NEMO_AUTOMODEL_AVAILABLE = True
```

tests/unit/utils/test_automodel_checkpoint.py (1)
92-115: Intentional exception swallowing for cleanup resilience.

The broad exception handling in `_cleanup_dcp_planner_cache` is appropriate here since this is a test cleanup helper that should not cause test failures. Consider adding a brief comment explaining this is intentional for test isolation.

🔎 Optional: Add explanatory comment

```diff
 except Exception:
-    pass
+    pass  # Cleanup should not fail tests; errors are non-critical
```

tests/unit/models/policy/test_patches.py (1)
185-216: Consider using `_` for intentionally unused parameter.

The `path` parameter in `mock_open_func` is unused since the mock only needs to differentiate by `mode`. Using `_` or `_path` would clarify intent.

🔎 Proposed fix

```diff
-def mock_open_func(path, mode="r"):
+def mock_open_func(_path, mode="r"):
     call_count[0] += 1
     if mode == "r":
         mock_file_handle.read.return_value = self.UNPATCHED_CONTENT
         return mock_file_handle
```

nemo_rl/models/policy/workers/patches.py (1)
96-103: Consider moving imports to module level.

The `importlib` and `sys` imports inside the function could be moved to the top of the file for consistency with other imports.

🔎 Proposed fix

```diff
 import os
+import sys
+import importlib
 from importlib.util import find_spec
```

Then remove the local imports at lines 98-99.
nemo_rl/utils/automodel_checkpoint.py (1)
171-190: Accessing private `_addons` attribute is fragile.

This method directly manipulates `self.checkpointer._addons`, which is a private implementation detail of the `Checkpointer` class. If the underlying library changes this internal structure, this code will break silently. Consider adding a comment explaining why this is necessary and/or wrapping in a try-except to handle potential API changes gracefully.

🔎 Proposed documentation

```diff
 def _rebuild_checkpointer_addons(self) -> None:
     """Rebuild the checkpointer's _addons list based on current config.

     The Checkpointer's _addons list is populated during __init__ based on config.
     When config changes (e.g., model_save_format or is_peft), we need to rebuild
     the addons list to match the new config.
+
+    Note: This accesses the private _addons attribute of Checkpointer.
+    This coupling is necessary because the Checkpointer doesn't expose
+    a public API to update addons after initialization.
     """
```
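The defensive flavor of that coupling can be sketched in isolation; `Checkpointer` here is a stand-in class, not the real library type:

```python
class Checkpointer:
    # Stand-in for the third-party class; only the private attribute matters here.
    def __init__(self):
        self._addons = ["base"]

def rebuild_addons(checkpointer, new_addons):
    # Fail loudly if the library ever renames its private attribute,
    # instead of silently doing nothing.
    if not hasattr(checkpointer, "_addons"):
        raise AttributeError("Checkpointer no longer exposes _addons; update this shim")
    checkpointer._addons = list(new_addons)

cp = Checkpointer()
rebuild_addons(cp, ["peft", "safetensors"])
print(cp._addons)  # ['peft', 'safetensors']
```

The explicit `hasattr` guard turns a silent breakage after a library upgrade into an immediate, descriptive error.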
841-854: Use `next(iter(...))` for cleaner single-element access.

Static analysis correctly suggests using `next(iter(...))` instead of `list(...)[0]` for accessing the first element.

🔎 Proposed fix

```diff
-param_sample = list(info["parameter_sample"].values())[0]
+param_sample = next(iter(info["parameter_sample"].values()))
 ...
-param_names = [list(info["parameter_sample"].keys())[0] for info in gpu_infos]
+param_names = [next(iter(info["parameter_sample"].keys())) for info in gpu_infos]
 ...
-param_device = list(info["parameter_sample"].values())[0]["device"]
+param_device = next(iter(info["parameter_sample"].values()))["device"]
```
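The difference is easy to see in a tiny standalone snippet (dictionary contents are illustrative):

```python
parameter_sample = {"model.embed_tokens.weight": {"device": "cuda:0", "mean": 0.01}}

# list(...)[0] materializes every value just to read the first element;
# next(iter(...)) stops after yielding one item.
via_list = list(parameter_sample.values())[0]
via_iter = next(iter(parameter_sample.values()))

print(via_list is via_iter)  # True
```

Both return the same object; the iterator form simply avoids building a throwaway list, which matters when the mapping is large.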
1093-1103: Rename unused loop variables to indicate intent.Per static analysis, rename unused loop variables to underscore-prefixed names.
🔎 Proposed fix
```diff
-for warmup_step in range(2):
+for _warmup_step in range(2):
     results = policy.train(data, loss_fn)

 # Measure FLOPS on 3 iterations
 print("Measuring FLOPS on 3 iterations...")
 time_begin = time.time()
-for train_step in range(3):
+for _train_step in range(3):
     results = policy.train(data, loss_fn)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (21)
3rdparty/Automodel-workspace/Automodel
examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml
nemo_rl/models/policy/__init__.py
nemo_rl/models/policy/lm_policy.py
nemo_rl/models/policy/utils.py
nemo_rl/models/policy/workers/__init__.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
nemo_rl/models/policy/workers/megatron_policy_worker.py
nemo_rl/models/policy/workers/patches.py
nemo_rl/utils/automodel_checkpoint.py
nemo_rl/utils/venvs.py
pyproject.toml
pyrefly.toml
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
tests/test_suites/nightly.txt
tests/unit/models/dtensor/test_lora.py
tests/unit/models/policy/test_automodel_types.py
tests/unit/models/policy/test_dtensor_worker.py
tests/unit/models/policy/test_dtensor_worker_v2.py
tests/unit/models/policy/test_patches.py
tests/unit/utils/test_automodel_checkpoint.py
🧰 Additional context used
📓 Path-based instructions (9)
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Files:
examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml
examples/configs/recipes/llm/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Files:
examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
nemo_rl/models/policy/workers/patches.py
nemo_rl/models/policy/lm_policy.py
pyrefly.toml
tests/test_suites/nightly.txt
tests/unit/models/policy/test_patches.py
tests/unit/models/policy/test_automodel_types.py
tests/unit/models/policy/test_dtensor_worker_v2.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
tests/unit/models/policy/test_dtensor_worker.py
tests/unit/models/dtensor/test_lora.py
nemo_rl/models/policy/utils.py
3rdparty/Automodel-workspace/Automodel
nemo_rl/utils/venvs.py
nemo_rl/models/policy/__init__.py
nemo_rl/models/policy/workers/megatron_policy_worker.py
pyproject.toml
nemo_rl/utils/automodel_checkpoint.py
tests/unit/utils/test_automodel_checkpoint.py
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts
Files:
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
tests/test_suites/**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run
Files:
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
nemo_rl/models/policy/workers/patches.py
nemo_rl/models/policy/lm_policy.py
tests/unit/models/policy/test_patches.py
tests/unit/models/policy/test_automodel_types.py
tests/unit/models/policy/test_dtensor_worker_v2.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
tests/unit/models/policy/test_dtensor_worker.py
tests/unit/models/dtensor/test_lora.py
nemo_rl/models/policy/utils.py
nemo_rl/utils/venvs.py
nemo_rl/models/policy/__init__.py
nemo_rl/models/policy/workers/megatron_policy_worker.py
nemo_rl/utils/automodel_checkpoint.py
tests/unit/utils/test_automodel_checkpoint.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
nemo_rl/models/policy/workers/patches.py
nemo_rl/models/policy/lm_policy.py
tests/unit/models/policy/test_patches.py
tests/unit/models/policy/test_automodel_types.py
tests/unit/models/policy/test_dtensor_worker_v2.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
tests/unit/models/policy/test_dtensor_worker.py
tests/unit/models/dtensor/test_lora.py
nemo_rl/models/policy/utils.py
nemo_rl/utils/venvs.py
nemo_rl/models/policy/__init__.py
nemo_rl/models/policy/workers/megatron_policy_worker.py
nemo_rl/utils/automodel_checkpoint.py
tests/unit/utils/test_automodel_checkpoint.py
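The try-except guidance in the Python rules above (narrow exception lists, minimal try bodies, else blocks for the follow-up logic) can be sketched as:

```python
def as_int(value):
    """Coerce value to int via duck typing; only the risky call sits in the try."""
    try:
        result = int(value)          # smallest possible try body
    except (TypeError, ValueError):  # narrowest relevant exception set
        return None
    else:
        return result                # follow-up logic lives in the else block

print(as_int("42"), as_int("abc"), as_int(None))  # 42 None None
```

Keeping the try body to a single call means an unexpected exception from unrelated logic is never accidentally swallowed by the except clause.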
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes
Files:
nemo_rl/models/policy/workers/patches.py
nemo_rl/models/policy/lm_policy.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
nemo_rl/models/policy/utils.py
nemo_rl/utils/venvs.py
nemo_rl/models/policy/__init__.py
nemo_rl/models/policy/workers/megatron_policy_worker.py
nemo_rl/utils/automodel_checkpoint.py
tests/test_suites/nightly.txt
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
Files:
tests/test_suites/nightly.txt
🧠 Learnings (9)
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Applied to files:
examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Applied to files:
examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Applied to files:
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/nightly.txt : When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
Applied to files:
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
tests/test_suites/nightly.txt
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run
Applied to files:
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to **/*.py : When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Applied to files:
tests/unit/models/policy/test_automodel_types.py
nemo_rl/models/policy/__init__.py
📚 Learning: 2025-09-17T01:52:21.399Z
Learnt from: ffrujeri
Repo: NVIDIA-NeMo/RL PR: 1023
File: nemo_rl/utils/checkpoint.py:58-65
Timestamp: 2025-09-17T01:52:21.399Z
Learning: model_state_dict_keys is not intended to be part of the nemo-rl CheckpointingConfig TypedDict - it's handled at the automodel implementation layer, not as a general checkpointing configuration parameter.
Applied to files:
tests/unit/models/policy/test_automodel_types.py
nemo_rl/models/policy/__init__.py
nemo_rl/utils/automodel_checkpoint.py
tests/unit/utils/test_automodel_checkpoint.py
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
Applied to files:
tests/unit/models/policy/test_dtensor_worker_v2.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
🧬 Code graph analysis (5)
tests/unit/models/policy/test_patches.py (1)
nemo_rl/models/policy/workers/patches.py (2)
`_get_transformer_engine_file` (19-44)
`apply_transformer_engine_patch` (47-106)
tests/unit/models/policy/test_automodel_types.py (1)
nemo_rl/models/policy/__init__.py (1)
`AutomodelBackendConfig` (37-55)
nemo_rl/models/policy/workers/megatron_policy_worker.py (1)
nemo_rl/models/policy/workers/patches.py (1)
`apply_transformer_engine_patch` (47-106)
nemo_rl/utils/automodel_checkpoint.py (2)
nemo_rl/utils/checkpoint.py (1)
`CheckpointingConfig` (36-67)

tests/unit/utils/test_checkpoint.py (2)

`checkpoint_config` (31-38)
`checkpoint_dir` (26-27)
tests/unit/utils/test_automodel_checkpoint.py (1)
nemo_rl/utils/automodel_checkpoint.py (9)
`AutomodelCheckpointManager` (42-390)
`_infer_checkpoint_root` (428-443)
`detect_checkpoint_format` (393-425)
`set_model_state_dict_keys` (192-200)
`save_checkpoint` (246-329)
`load_checkpoint` (331-390)
`load_base_model` (202-244)
`init_checkpointer` (84-130)
`update_checkpointer_config` (132-169)
🪛 GitHub Actions: Automodel Integration and Submodule Checks
3rdparty/Automodel-workspace/Automodel
[error] 1-1: Submodule is BEHIND the r0.5.0 branch. PR commits are missing from the target branch; submodule needs to be updated to include recent changes from r0.5.0.
🪛 Ruff (0.14.10)
nemo_rl/models/policy/workers/patches.py
27-31: Avoid specifying long messages outside the exception class
(TRY003)
37-42: Avoid specifying long messages outside the exception class
(TRY003)
105-105: Do not catch blind exception: Exception
(BLE001)
nemo_rl/models/policy/lm_policy.py
116-116: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
tests/unit/models/policy/test_patches.py
187-187: Unused function argument: path
(ARG001)
tests/unit/models/policy/test_automodel_types.py
21-21: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
tests/unit/models/policy/test_dtensor_worker.py
841-841: Prefer next(iter(info["parameter_sample"].values())) over single element slice
Replace with next(iter(info["parameter_sample"].values()))
(RUF015)
847-847: Prefer next(iter(info["parameter_sample"].keys())) over single element slice
Replace with next(iter(info["parameter_sample"].keys()))
(RUF015)
854-854: Prefer next(iter(info["parameter_sample"].values())) over single element slice
Replace with next(iter(info["parameter_sample"].values()))
(RUF015)
861-861: Unused method argument: use_v2
(ARG002)
873-873: Unused method argument: use_v2
(ARG002)
1095-1095: Loop control variable warmup_step not used within loop body
Rename unused warmup_step to _warmup_step
(B007)
1101-1101: Loop control variable train_step not used within loop body
Rename unused train_step to _train_step
(B007)
nemo_rl/utils/automodel_checkpoint.py
281-283: Avoid specifying long messages outside the exception class
(TRY003)
tests/unit/utils/test_automodel_checkpoint.py
114-115: try-except-pass detected, consider logging the exception
(S110)
114-114: Do not catch blind exception: Exception
(BLE001)
129-130: try-except-pass detected, consider logging the exception
(S110)
129-129: Do not catch blind exception: Exception
(BLE001)
760-760: Unused method argument: init_distributed
(ARG002)
811-811: Unused method argument: init_distributed
(ARG002)
861-861: Unused method argument: init_distributed
(ARG002)
939-939: Unused method argument: init_distributed
(ARG002)
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[error] 28-28: Double quote array expansions to avoid re-splitting elements.
(SC2068)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Lint check
🔇 Additional comments (30)
tests/unit/models/dtensor/test_lora.py (2)
73-73: Migration to `LinearLoRA.init_lora_weights` is complete.

All usages of the private function `_patched_init_lora_weights` have been successfully removed from the codebase. The code at line 73 correctly uses the public API `LinearLoRA.init_lora_weights(lora2, "kaiming")` for kaiming initialization testing.
55-55: The migration to `LinearLoRA.init_lora_weights` is verified and complete.

LinearLoRA's static methods provide the LoRA functionality for both instance initialization and monkey-patching. The codebase confirms that `LinearLoRA.init_lora_weights(module, init_method)` exists in nemo_automodel and is properly imported and used at lines 55 and 73. The old `_patched_init_lora_weights` function has been completely removed with no remaining references. The test correctly validates initialization behavior including mean properties, standard deviation expectations, and zero-initialization of LoRA B weights.

tests/test_suites/nightly.txt (1)
90-91: No issues found. The test entry `tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh` appears only once in the nightly test suite at line 91. The addition correctly follows the coding guideline format for appending nightly test driver script paths.

nemo_rl/models/policy/utils.py (1)
32-36: No action required—the import path is correct and error handling is already in place.

The new import path `nemo_automodel._transformers.auto_model` is documented in the official NeMo-AutoModel API, and the code already implements a try-except wrapper (lines 31–44) that gracefully handles ImportError by setting the NeMo model classes to None with a `NEMO_AUTOMODEL_AVAILABLE` flag. This design ensures the code will not fail at runtime if the import fails or if nemo_automodel is not installed.

3rdparty/Automodel-workspace/Automodel (1)
1-1: This review comment is based on incorrect context and cannot be verified.

The claimed cherry-pick to the r0.5.0 branch does not exist—this PR #1470 targets the main branch. No r0.5.0 branch is present in the repository. Additionally, the referenced old commit hash cannot be found in git history, making the regression comparison unverifiable.

The file `3rdparty/Automodel-workspace/Automodel` is a git submodule pointer (160000 mode), not a Python or shell script, so the NVIDIA copyright header guideline does not apply.

Likely an incorrect or invalid review comment.
nemo_rl/utils/venvs.py (1)
17-17: LGTM - Import organization improvement.

Moving the `shutil` import to the top level follows Python import conventions and improves readability.

pyrefly.toml (1)
112-116: LGTM - Pyrefly configuration updates.

The new project-includes entries are correctly added for the new modules introduced in this PR: workers `__init__.py`, `patches.py`, and `automodel_checkpoint.py`.

nemo_rl/models/policy/workers/megatron_policy_worker.py (2)
132-132: LGTM - Refactored patch import.

The Transformer Engine patching logic is now properly delegated to the external `patches` module, improving code organization and maintainability.

451-451: LGTM - Early patch application.

Applying the Transformer Engine patch early in `__init__`, before other initialization, is appropriate to ensure the patched code is in effect before any TE imports occur.

examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml (1)
1-22: LGTM - Well-structured recipe configuration.

The YAML follows the naming convention `<algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers].yaml` as per coding guidelines. The automodel backend configuration aligns with the `AutomodelBackendConfig` TypedDict structure. Based on learnings from coding guidelines.

Please verify that `openai/gpt-oss-20b` is the intended model identifier, as this appears to be an internal or placeholder model name.

tests/unit/models/policy/test_automodel_types.py (1)
35-71: LGTM - Comprehensive TypedDict validation tests.

The test class properly validates that `AutomodelBackendConfig` keys are defined and verifies instantiation compatibility with the actual `BackendConfig` class when `nemo_automodel` is available.

nemo_rl/models/policy/__init__.py (2)
37-62: LGTM - Well-documented TypedDict definitions.

The `AutomodelBackendConfig` and `AutomodelKwargs` TypedDicts are properly documented with inline comments explaining each key's purpose, valid values, and types. The use of `NotRequired` for optional fields follows Python 3.12+ patterns. As per coding guidelines, TypedDict keys are documented.
81-81: LGTM - Clean DTensorConfig extension.

Adding `automodel_kwargs` as an optional field to `DTensorConfig` provides a clean integration path for automodel configuration without breaking existing usage.

tests/unit/utils/test_automodel_checkpoint.py (4)
135-163: LGTM - Thorough distributed test fixture.

The `init_distributed` fixture properly handles cleanup of DCP planner caches and device mesh caches before and after each test, ensuring test isolation. The fixture's side-effect-only usage (appearing as unused in test signatures) is a standard pytest pattern.

186-295: LGTM - Comprehensive checkpoint format detection tests.

Good coverage of various checkpoint formats (safetensors, torch_save with distcp/bin/pt), PEFT adapter detection, and edge cases (empty/nonexistent directories, nested structures).

353-616: LGTM - Well-structured AutomodelCheckpointManager tests.

The test class properly validates manager initialization, state dict key setting, error handling for uninitialized checkpointer, and configuration updates. Mock usage is appropriate for unit testing without requiring full distributed setup.

755-1032: LGTM - Valuable integration tests.

The `TestSaveLoadIntegration` class provides end-to-end validation of save/load operations with both safetensors and torch_save formats, including optimizer state and LoRA weights. These integration tests verify the full checkpoint workflow.

tests/unit/models/policy/test_patches.py (2)
28-123: LGTM - Comprehensive path resolution tests.

Excellent coverage of edge cases for `_get_transformer_engine_file`: package not found, no submodule locations, empty locations, file not found, and successful lookup with various path segments.
357-447: LGTM - Valuable integration tests with real files.

The integration tests using real temporary files validate actual file operations and verify patch idempotency. The `test_patch_idempotent` test correctly verifies that applying the patch twice produces identical content with only one success message.

pyproject.toml (1)
61-67: LGTM - Automodel dependencies properly configured.

The new dependencies for automodel support (nv-grouped-gemm, transformer-engine, deep_ep) are appropriately pinned with specific versions/git revisions for reproducibility.
nemo_rl/models/policy/workers/patches.py (2)
19-44: LGTM - Robust path resolution for TE file location.

The function properly handles missing package and file scenarios with clear error messages.

105-106: Broad exception catch is acceptable here.

The static analysis flags this as `BLE001`, but in this context catching all exceptions is intentional to ensure patch failures don't crash the main application. The error is logged for debugging purposes.

tests/unit/models/policy/test_dtensor_worker_v2.py (2)
299-381: Well-structured checkpoint save/load test with proper cleanup.

Good use of `tempfile.TemporaryDirectory`, proper GPU memory cleanup by shutting down the first policy before creating the second, and verification of loaded policy state.

384-457: Comprehensive mixed precision test covering training and logprobs.

Good coverage of both `bfloat16` and `float16` precisions, with appropriate assertions for NaN/Inf checks and dtype verification.

nemo_rl/utils/automodel_checkpoint.py (1)
246-329: Solid `save_checkpoint` implementation with PEFT support.

Good handling of optional components (optimizer, scheduler, tokenizer) and proper integration with the checkpointing configuration.
tests/unit/models/policy/test_dtensor_worker.py (1)
859-876: The `use_v2` parameter is used indirectly via `_get_use_v2`.

The static analysis flags `use_v2` as unused, but it's actually accessed through pytest's parametrization mechanism via `_get_use_v2(request)` in the fixture chain. The parameter name triggers the correct test configuration.

nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py (4)
128-129: TE patch applied early in initialization - good placement.Applying the Transformer Engine patch before any model setup ensures compatibility is established early.
95-99: Clean precision mapping with explicit dtype lookup.The STRING_TO_DTYPE dictionary provides a clear and maintainable way to convert precision strings to torch dtypes, with proper error handling at lines 165-168.
351-425: Well-structured FSDP2Manager integration and model parallelization.Good separation between MoE model parallelization (using
moe_parallelize_model) and standard HF model parallelization (usingmanager.parallelize). The device mesh references are properly stored for downstream usage.
1933-1957: Clean checkpoint manager initialization with lazy instantiation.

The _init_checkpoint_manager method properly handles lazy initialization and passes the required device meshes. Good use of getattr with default for model_state_dict_keys.
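The lazy-instantiation pattern can be sketched generically; all names here are illustrative stand-ins for the worker's attributes:

```python
from typing import Optional

class CheckpointManager:
    """Hypothetical manager; the real one also receives device meshes."""

    def __init__(self, state_dict_keys):
        self.state_dict_keys = state_dict_keys

class Worker:
    def __init__(self):
        self.checkpoint_manager: Optional[CheckpointManager] = None

    def _init_checkpoint_manager(self):
        # Build the manager only on first use; getattr with a default guards
        # against model_state_dict_keys not having been populated yet.
        if self.checkpoint_manager is None:
            keys = getattr(self, "model_state_dict_keys", None)
            self.checkpoint_manager = CheckpointManager(keys)
        return self.checkpoint_manager

w = Worker()
m1 = w._init_checkpoint_manager()
m2 = w._init_checkpoint_manager()
```

Repeated calls return the same instance, so the manager is constructed at most once per worker regardless of how many checkpoint operations run.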
self.cpu_offload = self.cfg["dtensor_cfg"]["cpu_offload"]
# torch distributed init. Envars for rank, world_size, and master_addr and master_port are set from the ray remote call
torch.distributed.init_process_group(backend="nccl")
backend = "nccl" if not self.cpu_offload else "cuda:nccl,cpu:gloo"
torch.distributed.init_process_group(backend=backend)
self.rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()
model_name = self.cfg["model_name"]
self.checkpoint_manager: Optional[AutomodelCheckpointManager] = None
self.cpu_offload = self.cfg["dtensor_cfg"]["cpu_offload"]
Duplicate assignment of self.cpu_offload.
self.cpu_offload is assigned at line 151 and then again at line 161 with the same value. Remove the duplicate.
🔎 Proposed fix
self.cpu_offload = self.cfg["dtensor_cfg"]["cpu_offload"]
# torch distributed init. Envars for rank, world_size, and master_addr and master_port are set from the ray remote call
backend = "nccl" if not self.cpu_offload else "cuda:nccl,cpu:gloo"
torch.distributed.init_process_group(backend=backend)
self.rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()
model_name = self.cfg["model_name"]
self.checkpoint_manager: Optional[AutomodelCheckpointManager] = None
- self.cpu_offload = self.cfg["dtensor_cfg"]["cpu_offload"]
self.offload_optimizer_for_logprob = self.cfg["offload_optimizer_for_logprob"]
🤖 Prompt for AI Agents
In nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py around lines 151 to
161, there is a duplicate assignment of self.cpu_offload (first at line 151 and
again at line 161); remove the redundant assignment at line 161 so
self.cpu_offload is only set once from self.cfg["dtensor_cfg"]["cpu_offload"]
and no behavior changes occur.
assert self.tp_size == 1, (
    "Using custom implementation {self.model.__class__.__name__} for MoE model {model_name} which doesn't support tp_size > 1. Please use expert_parallel_size > 1 for custom implementation or set force_hf=True in your config at policy->dtensor_cfg->automodel_kwargs to use the HuggingFace implementation."
)
assert self.cp_size == 1, (
    "Using custom implementation {self.model.__class__.__name__} for MoE model {model_name} which doesn't support cp_size > 1. Please set force_hf=True in your config at policy->dtensor_cfg->automodel_kwargs to use the HuggingFace implementation."
)
Assertion messages use wrong string formatting.
The assertion messages at lines 402-406 contain {self.model.__class__.__name__} but are not f-strings, so the variable won't be interpolated.
🔎 Proposed fix
- assert self.tp_size == 1, (
- "Using custom implementation {self.model.__class__.__name__} for MoE model {model_name} which doesn't support tp_size > 1. Please use expert_parallel_size > 1 for custom implementation or set force_hf=True in your config at policy->dtensor_cfg->automodel_kwargs to use the HuggingFace implementation."
- )
- assert self.cp_size == 1, (
- "Using custom implementation {self.model.__class__.__name__} for MoE model {model_name} which doesn't support cp_size > 1. Please set force_hf=True in your config at policy->dtensor_cfg->automodel_kwargs to use the HuggingFace implementation."
- )
+ assert self.tp_size == 1, (
+ f"Using custom implementation {self.model.__class__.__name__} for MoE model {model_name} which doesn't support tp_size > 1. Please use expert_parallel_size > 1 for custom implementation or set force_hf=True in your config at policy->dtensor_cfg->automodel_kwargs to use the HuggingFace implementation."
+ )
+ assert self.cp_size == 1, (
+ f"Using custom implementation {self.model.__class__.__name__} for MoE model {model_name} which doesn't support cp_size > 1. Please set force_hf=True in your config at policy->dtensor_cfg->automodel_kwargs to use the HuggingFace implementation."
+ )

🤖 Prompt for AI Agents
In nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py around lines 402 to
407, the assertion messages use brace placeholders like
{self.model.__class__.__name__} and {model_name} but are plain strings, so
variables are not interpolated; change each assertion message to use f-strings
(or .format()) so the actual values are substituted, e.g. prefix the string with
f and reference the variables inside the braces for both tp_size and cp_size
assertions.
model_save_format, is_peft = detect_checkpoint_format(weights_path)

weights_dir = os.path.dirname(weights_path)
checkpoint_root = (
    os.path.dirname(weights_dir)
    if weights_dir.endswith("weights")
    else weights_dir
)
🛠️ Refactor suggestion | 🟠 Major
Duplicate checkpoint_root inference logic - use _infer_checkpoint_root helper.
This logic duplicates what _infer_checkpoint_root does at lines 428-443. The function already exists and should be reused here.
🔎 Proposed fix
model_save_format, is_peft = detect_checkpoint_format(weights_path)
- weights_dir = os.path.dirname(weights_path)
- checkpoint_root = (
- os.path.dirname(weights_dir)
- if weights_dir.endswith("weights")
- else weights_dir
- )
+ checkpoint_root = _infer_checkpoint_root(weights_path)
# Update checkpointer configuration
🤖 Prompt for AI Agents
In nemo_rl/utils/automodel_checkpoint.py around lines 354 to 361, the code
re-implements checkpoint_root inference by taking dirname(dirname(weights_path))
when the weights dir ends with "weights" or else dirname(weights_path); replace
this duplicated logic by calling the existing _infer_checkpoint_root helper
(used at lines 428-443) instead: remove the manual dirname logic and invoke
_infer_checkpoint_root(weights_path) (or the helper's expected parameter) to
compute checkpoint_root, preserving behavior and any special cases handled by
the helper.
exit_if_max_steps_reached

# Run the experiment
cd $PROJECT_ROOT
Add error handling for directory change.
The cd command should handle failures to prevent the script from continuing in the wrong directory if $PROJECT_ROOT is invalid or inaccessible.
🔎 Proposed fix
-cd $PROJECT_ROOT
+cd "$PROJECT_ROOT" || exit 1

Based on static analysis hints.
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
🤖 Prompt for AI Agents
In tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh around line
16, the script runs cd $PROJECT_ROOT without handling failure; update the script
so the directory change is attempted with the variable quoted, then immediately
check the result and if it fails print a clear error to stderr and exit with a
non‑zero status (or enable safe script failure such as set -e at the top) to
prevent continuing in the wrong directory.
logger.tensorboard_enabled=True \
checkpointing.enabled=True \
checkpointing.checkpoint_dir=$CKPT_DIR \
$@ \
Quote array expansion to prevent argument splitting.
The unquoted $@ can cause issues with arguments containing spaces or special characters. Use "$@" to preserve argument boundaries correctly.
🔎 Proposed fix
checkpointing.checkpoint_dir=$CKPT_DIR \
- $@ \
+ "$@" \
2>&1 | tee $RUN_LOG

Based on static analysis hints.
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 28-28: Double quote array expansions to avoid re-splitting elements.
(SC2068)
🤖 Prompt for AI Agents
In tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh around line
28 the unquoted $@ may split arguments containing spaces or special characters;
replace it with "$@" so all passed arguments are preserved as distinct
parameters and not word-split by the shell.
**(
    {
        "labels": torch.randint(0, vocab_size, (batch_size, seq_len)),
        "sample_mask": torch.ones(batch_size).cuda(),
    }
    if mode == "train"
    else {}
),
}
)
data = data.to("cpu")
return data
Device inconsistency: sample_mask created on CUDA before moving data to CPU.
The sample_mask tensor is created directly on CUDA (line 139), but then the entire BatchedDataDict is moved to CPU (line 146). This works because .to("cpu") will move all tensors including sample_mask, but creating it on CUDA first is unnecessary and inconsistent with how other tensors in the batch are created.
🔎 Proposed fix
**(
{
"labels": torch.randint(0, vocab_size, (batch_size, seq_len)),
- "sample_mask": torch.ones(batch_size).cuda(),
+ "sample_mask": torch.ones(batch_size),
}
if mode == "train"
else {}
),
🤖 Prompt for AI Agents
In tests/unit/models/policy/test_dtensor_worker_v2.py around lines 136 to 147,
the sample_mask is created on CUDA (torch.ones(batch_size).cuda()) but the
BatchedDataDict is subsequently moved to CPU; create sample_mask on the CPU
instead (e.g., torch.ones(batch_size) or torch.ones(batch_size, device="cpu") /
use the same device as other tensors) so it’s consistent before calling data =
data.to("cpu"); update the sample_mask creation to not call .cuda() or to use a
shared device variable.
beep boop [🤖]: Hi @adil-a 👋,
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Chores