Skip to content

chore: remove some default value#1179

Closed
yuki-97 wants to merge 3 commits intomainfrom
yukih/remove-default
Closed

chore: remove some default value#1179
yuki-97 wants to merge 3 commits intomainfrom
yukih/remove-default

Conversation

@yuki-97
Copy link
Copy Markdown
Contributor

@yuki-97 yuki-97 commented Sep 22, 2025

Remove some default value in code.
Raise error when forgetting to set _v2 in config instead of using dtensor v1 by default.

_v2 is set to true in example configs since it's the recommend one, and set to false in all recipes to make them the same as before.

Summary by CodeRabbit

  • New Features
    • None
  • Bug Fixes
    • More reliable detection of Megatron/DTensor and DTensor v2 settings from configuration.
    • Ensures DTensor v2 checkpointing is used only when v2 is explicitly enabled; non‑v2 path remains unchanged with existing validations.
  • Refactor
    • Streamlined configuration checks for distributed features to reduce ambiguity and improve stability.
  • Chores
    • Internal consistency updates to flag handling.
  • Notes
    • No user‑visible API changes.

@yuki-97 yuki-97 requested review from a team as code owners September 22, 2025 14:05
@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Sep 22, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Sep 22, 2025

📝 Walkthrough

Walkthrough

Refactors config flag access in Policy.init to explicit key checks for Megatron and DTensor. In save_checkpoint, distinguishes DTensor v2 from non-v2 via new use_dtensor/use_dtensor_v2 flags, routing to v2-specific checkpointing only when applicable; otherwise preserves existing non-v2 path and safetensors validation.

Changes

Cohort / File(s) Summary of Changes
Policy config flags and checkpoint routing
nemo_rl/models/policy/lm_policy.py
Replaced dict.get-based flag reads with explicit key existence checks for Megatron/DTensor; removed default fallback for DTensor _v2 flag. In save_checkpoint, introduced use_dtensor and use_dtensor_v2 gating to call v2 checkpoint path only when _v2 is true; retained non-v2 path and safetensors validation logic.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Policy
  participant Config
  participant DTensorV2 as DTensor v2 Checkpointer
  participant LegacySave as Non-v2 Save Path
  participant SafeTensors as Safetensors Validator

  User->>Policy: save_checkpoint()
  Policy->>Config: Read dtensor_cfg.enabled and _v2
  alt DTensor enabled and _v2 true
    Policy->>DTensorV2: save checkpoint (v2)
    DTensorV2-->>Policy: result
  else DTensor not enabled or _v2 false
    Policy->>SafeTensors: validate (if safetensors)
    SafeTensors-->>Policy: ok/error
    Policy->>LegacySave: save checkpoint (non-v2)
    LegacySave-->>Policy: result
  end
  Policy-->>User: completion
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Test Results For Major Changes ⚠️ Warning This PR changes configuration semantics by removing defaults and raising an error when policy.dtensor_cfg._v2 is not set, which is a breaking behavior change even though the diff is small. The PR description and conversation do not document any test results or testing plan beyond a CI label; there is no evidence of targeted tests for the new error path or for checkpointing behavior under DTensor v2 vs non‑v2. Since this is a breaking change and could affect user workflows, testing information should be included to pass this check. (github.com) Please update the PR description with testing details: list the tests added or run (unit/integration), include links or summaries of CI runs, and show that (a) missing _v2 reliably raises the intended error, (b) DTensor v2 checkpointing still works, and (c) the non‑v2 path is unaffected. If any numerics or performance could be impacted, add quick before/after confirmations and the configs used; otherwise explicitly state that numerics/perf are unaffected.
Title Check ❓ Inconclusive The title "chore: remove some default value" is related to the changeset because the PR removes default configuration behavior, but it is too vague: "some default value" does not indicate which default was removed or the behavioral impact (for example, requiring dtensor._v2 instead of defaulting). Because it lacks sufficient specificity for a teammate scanning history, the title is inconclusive against the guideline of clearly summarizing the primary change. Please update the title to be specific and focused on the main behavioral change, for example "chore: require dtensor._v2 in config (remove default)" or "chore: remove default dtensor._v2 and raise error when missing"; this will make the PR intent clear to reviewers and in commit history.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch yukih/remove-default

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/models/policy/lm_policy.py (1)

649-667: Reuse init-stored DTensor flags and declare _v2 in the TypedDict to avoid KeyError

DTensorConfig (nemo_rl/models/policy/init.py — DTensorConfig at ~lines 20–22) currently defines enabled and env_vars but not _v2. lm_policy.py (around lines 649–667) indexes into cfg["dtensor_cfg"]["_v2"] and recomputes flags — this risks KeyError and duplicates validation.

  • Replace local recomputation in nemo_rl/models/policy/lm_policy.py with the init-stored flag:
-        use_dtensor = "dtensor_cfg" in self.cfg and self.cfg["dtensor_cfg"]["enabled"]
-        use_dtensor_v2 = use_dtensor and self.cfg["dtensor_cfg"]["_v2"]
+        use_dtensor_v2 = getattr(self, "dtensor_v2", False)
  • Set flags once in init after computing dtensor_enable / use_v2:
self.dtensor_enabled = dtensor_enable
self.dtensor_v2 = use_v2 if dtensor_enable else False
  • Make _v2 explicit in nemo_rl/models/policy/init.py::DTensorConfig — add either _v2: bool if mandatory, or _v2: NotRequired[bool] if optional; alternatively change uses to cfg.get("_v2", False).
🧹 Nitpick comments (2)
nemo_rl/models/policy/lm_policy.py (2)

80-82: Raise clear errors when backend enabled flags are missing.

Direct indexing will raise a KeyError. Fail fast with a descriptive message to align with “no hidden defaults” while improving UX.

-        megatron_enable = "megatron_cfg" in config and config["megatron_cfg"]["enabled"]
-        dtensor_enable = "dtensor_cfg" in config and config["dtensor_cfg"]["enabled"]
+        if "megatron_cfg" in config and "enabled" not in config["megatron_cfg"]:
+            raise ValueError("policy.megatron_cfg.enabled must be set (true/false).")
+        if "dtensor_cfg" in config and "enabled" not in config["dtensor_cfg"]:
+            raise ValueError("policy.dtensor_cfg.enabled must be set (true/false).")
+        megatron_enable = "megatron_cfg" in config and config["megatron_cfg"]["enabled"]
+        dtensor_enable = "dtensor_cfg" in config and config["dtensor_cfg"]["enabled"]

103-106: Outdated comment and missing-key ergonomics for _v2.

Comment claims a default, but the code now requires explicit presence. Make the requirement explicit and raise a helpful error if _v2 is absent (or not a bool).

-            # Check if _v2 is enabled in dtensor_cfg (defaults to False for backward compatibility)
-            use_v2 = config["dtensor_cfg"]["_v2"]
+            # Require explicit DTensor version selection to avoid silent fallback.
+            if "_v2" not in config["dtensor_cfg"]:
+                raise ValueError(
+                    "policy.dtensor_cfg._v2 must be set: true for DTensor v2, false for v1."
+                )
+            use_v2 = config["dtensor_cfg"]["_v2"]
+            if not isinstance(use_v2, bool):
+                raise TypeError("policy.dtensor_cfg._v2 must be a bool.")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d3c43c and ba021b5.

📒 Files selected for processing (1)
  • nemo_rl/models/policy/lm_policy.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/models/policy/lm_policy.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/models/policy/lm_policy.py
🧠 Learnings (1)
📚 Learning: 2025-09-20T14:58:45.460Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.460Z
Learning: Applies to nemo_rl/**/*.py : Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults

Applied to files:

  • nemo_rl/models/policy/lm_policy.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: build-container / main
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR

@yuki-97 yuki-97 requested review from a team as code owners September 22, 2025 15:31
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Sep 22, 2025
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Sep 22, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/remove-default branch from d18bc5c to 3bd4ae9 Compare September 23, 2025 00:33
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Sep 23, 2025
@yuki-97
Copy link
Copy Markdown
Contributor Author

yuki-97 commented Sep 23, 2025

closed as this will be done in #1024.

@yuki-97 yuki-97 closed this Sep 23, 2025

megatron_enable = bool(config.get("megatron_cfg", {}).get("enabled", False))
dtensor_enable = bool(config.get("dtensor_cfg", {}).get("enabled", False))
megatron_enable = "megatron_cfg" in config and config["megatron_cfg"]["enabled"]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will enabled be sure to exist? This is different from original code. It will default to False if enabled doesn't exist.


# Check if _v2 is enabled in dtensor_cfg (defaults to False for backward compatibility)
use_v2 = config.get("dtensor_cfg", {}).get("_v2", False)
use_v2 = config["dtensor_cfg"]["_v2"]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar, I guess you want to ensure this field must be set.

@yuki-97 yuki-97 deleted the yukih/remove-default branch September 30, 2025 02:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants