Conversation
📝 Walkthrough

Refactors config flag access in Policy.__init__ to explicit key checks for Megatron and DTensor. In save_checkpoint, distinguishes DTensor v2 from non-v2 via new use_dtensor/use_dtensor_v2 flags, routing to v2-specific checkpointing only when applicable; otherwise preserves the existing non-v2 path and safetensors validation.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Policy
    participant Config
    participant DTensorV2 as DTensor v2 Checkpointer
    participant LegacySave as Non-v2 Save Path
    participant SafeTensors as Safetensors Validator
    User->>Policy: save_checkpoint()
    Policy->>Config: Read dtensor_cfg.enabled and _v2
    alt DTensor enabled and _v2 true
        Policy->>DTensorV2: save checkpoint (v2)
        DTensorV2-->>Policy: result
    else DTensor not enabled or _v2 false
        Policy->>SafeTensors: validate (if safetensors)
        SafeTensors-->>Policy: ok/error
        Policy->>LegacySave: save checkpoint (non-v2)
        LegacySave-->>Policy: result
    end
    Policy-->>User: completion
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/models/policy/lm_policy.py (1)
649-667: Reuse __init__-stored DTensor flags and declare `_v2` in the TypedDict to avoid KeyError

`DTensorConfig` (nemo_rl/models/policy/__init__.py, `DTensorConfig` at ~lines 20–22) currently defines `enabled` and `env_vars` but not `_v2`. lm_policy.py (around lines 649–667) indexes into `cfg["dtensor_cfg"]["_v2"]` and recomputes flags, which risks a KeyError and duplicates validation.

- Replace the local recomputation in nemo_rl/models/policy/lm_policy.py with the __init__-stored flag:

```diff
- use_dtensor = "dtensor_cfg" in self.cfg and self.cfg["dtensor_cfg"]["enabled"]
- use_dtensor_v2 = use_dtensor and self.cfg["dtensor_cfg"]["_v2"]
+ use_dtensor_v2 = getattr(self, "dtensor_v2", False)
```

- Set the flags once in __init__ after computing dtensor_enable / use_v2:

```python
self.dtensor_enabled = dtensor_enable
self.dtensor_v2 = use_v2 if dtensor_enable else False
```

- Make `_v2` explicit in nemo_rl/models/policy/__init__.py::DTensorConfig: add either `_v2: bool` if mandatory, or `_v2: NotRequired[bool]` if optional; alternatively change uses to `cfg.get("_v2", False)`.
🧹 Nitpick comments (2)
nemo_rl/models/policy/lm_policy.py (2)
80-82: Raise clear errors when backend `enabled` flags are missing.

Direct indexing will raise a KeyError. Fail fast with a descriptive message to align with "no hidden defaults" while improving UX.

```diff
- megatron_enable = "megatron_cfg" in config and config["megatron_cfg"]["enabled"]
- dtensor_enable = "dtensor_cfg" in config and config["dtensor_cfg"]["enabled"]
+ if "megatron_cfg" in config and "enabled" not in config["megatron_cfg"]:
+     raise ValueError("policy.megatron_cfg.enabled must be set (true/false).")
+ if "dtensor_cfg" in config and "enabled" not in config["dtensor_cfg"]:
+     raise ValueError("policy.dtensor_cfg.enabled must be set (true/false).")
+ megatron_enable = "megatron_cfg" in config and config["megatron_cfg"]["enabled"]
+ dtensor_enable = "dtensor_cfg" in config and config["dtensor_cfg"]["enabled"]
```
103-106: Outdated comment and missing-key ergonomics for `_v2`.

The comment claims a default, but the code now requires explicit presence. Make the requirement explicit and raise a helpful error if `_v2` is absent (or not a bool).

```diff
- # Check if _v2 is enabled in dtensor_cfg (defaults to False for backward compatibility)
- use_v2 = config["dtensor_cfg"]["_v2"]
+ # Require explicit DTensor version selection to avoid silent fallback.
+ if "_v2" not in config["dtensor_cfg"]:
+     raise ValueError(
+         "policy.dtensor_cfg._v2 must be set: true for DTensor v2, false for v1."
+     )
+ use_v2 = config["dtensor_cfg"]["_v2"]
+ if not isinstance(use_v2, bool):
+     raise TypeError("policy.dtensor_cfg._v2 must be a bool.")
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_rl/models/policy/lm_policy.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
nemo_rl/models/policy/lm_policy.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
nemo_rl/models/policy/lm_policy.py
🧠 Learnings (1)
📚 Learning: 2025-09-20T14:58:45.460Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.460Z
Learning: Applies to nemo_rl/**/*.py : Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Applied to files:
nemo_rl/models/policy/lm_policy.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: build-container / main
- GitHub Check: sphinx-build / Build docs
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Force-pushed d18bc5c to 3bd4ae9
Closed as this will be done in #1024.
```diff
- megatron_enable = bool(config.get("megatron_cfg", {}).get("enabled", False))
- dtensor_enable = bool(config.get("dtensor_cfg", {}).get("enabled", False))
+ megatron_enable = "megatron_cfg" in config and config["megatron_cfg"]["enabled"]
```
Will `enabled` be sure to exist? This is different from the original code, which defaults to False if `enabled` doesn't exist.
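The reviewer's concern can be shown with a toy config that contains `megatron_cfg` but omits `enabled`:

```python
config = {"megatron_cfg": {}}  # "enabled" key missing

# Original pattern: a missing "enabled" silently defaults to False.
megatron_enable = bool(config.get("megatron_cfg", {}).get("enabled", False))
print(megatron_enable)  # False

# New pattern: megatron_cfg present without "enabled" raises KeyError.
try:
    megatron_enable = "megatron_cfg" in config and config["megatron_cfg"]["enabled"]
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'enabled'
```

So the two versions only agree when `enabled` is always present, which is exactly what the explicit-key approach forces the YAML to guarantee.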
```diff
- # Check if _v2 is enabled in dtensor_cfg (defaults to False for backward compatibility)
- use_v2 = config.get("dtensor_cfg", {}).get("_v2", False)
+ use_v2 = config["dtensor_cfg"]["_v2"]
```
Similar, I guess you want to ensure this field must be set.

Remove some default values in code. Raise an error when forgetting to set `_v2` in config instead of using DTensor v1 by default. `_v2` is set to true in example configs since it's the recommended one, and set to false in all recipes to keep them the same as before.