Conversation
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
📝 Walkthrough
The changes modify the Nemotron 3 Nano configuration setup.
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py (1)
54-62: Use a tuple (or immutable sequence) for `recompute_modules` to guard against accidental mutation.

The `["moe", "layernorm"]` list literal is stored inside the dataclass and shared by both `NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100_BF16_V1` and `NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100_FP8_CS_V1`. Even if the dataclass is frozen (preventing field reassignment), the list object itself remains mutable; an in-place `.append()`/`.remove()` call elsewhere would corrupt both aliases at once.

♻️ Proposed fix
```diff
 _NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100 = replace(
     BASE_NEMOTRON_3_NANO_CONFIG,
     num_gpus=16,
     global_batch_size=1024,
     micro_batch_size=1,
-    recompute_modules=["moe", "layernorm"],
+    recompute_modules=("moe", "layernorm"),
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py` around lines 54 - 62, The recompute_modules argument in the _NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100 replace call is using a mutable list literal ["moe", "layernorm"] that will be shared across the derived configs; change it to an immutable tuple ("moe", "layernorm") (or another immutable sequence) so the value stored in the dataclass cannot be mutated in-place; update the recompute_modules parameter in the replace call for _NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100 accordingly so both NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100_BF16_V1 and NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100_FP8_CS_V1 inherit the immutable sequence.

scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py (1)
52-53: Extract the repeated `moe_flex_dispatcher_backend` propagation into a shared helper.

The identical two-line conditional is copy-pasted verbatim into all five functions. Folding it into `set_nemotron_3_nano_common_configs` (by also accepting `base_cfg`) or a separate helper eliminates the duplication and ensures any future changes are applied consistently.

♻️ Proposed refactor
```diff
-def set_nemotron_3_nano_common_configs(cfg: ConfigContainer) -> None:
-    """Set common performance configurations for all Nemotron 3 Nano configs."""
+def set_nemotron_3_nano_common_configs(cfg: ConfigContainer, base_cfg=None) -> None:
+    """Set common performance configurations for all Nemotron 3 Nano configs."""
     cfg.mixed_precision.grad_reduce_in_fp32 = False
     cfg.ddp.grad_reduce_in_fp32 = False
+    if base_cfg is not None and base_cfg.moe_flex_dispatcher_backend is not None:
+        cfg.model.moe_flex_dispatcher_backend = base_cfg.moe_flex_dispatcher_backend
```

Then, in each function replace:
```diff
     set_nemotron_3_nano_common_configs(cfg)
     set_workload_base_configs(cfg, base_cfg)
-    if base_cfg.moe_flex_dispatcher_backend is not None:
-        cfg.model.moe_flex_dispatcher_backend = base_cfg.moe_flex_dispatcher_backend
+    set_nemotron_3_nano_common_configs(cfg, base_cfg)
```

Also applies to: 76-77, 100-101, 124-125, 148-149
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py` around lines 52 - 53, The two-line conditional that propagates base_cfg.moe_flex_dispatcher_backend into cfg.model.moe_flex_dispatcher_backend is duplicated across five functions; extract it into a single helper (e.g., add an argument base_cfg to set_nemotron_3_nano_common_configs or create a small function like propagate_moe_flex_dispatcher_backend(base_cfg, cfg)) and call that helper from each of the five places instead of repeating the conditional; ensure the helper checks base_cfg.moe_flex_dispatcher_backend is not None and assigns to cfg.model.moe_flex_dispatcher_backend, and update the five callers to invoke the helper (or to pass base_cfg into set_nemotron_3_nano_common_configs) so future changes are centralized.
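A minimal standalone sketch of the suggested helper, using stand-in dataclasses rather than the repository's real `ConfigContainer`/base-config types (the names `Cfg`, `BaseCfg`, and `Model` here are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Model:
    moe_flex_dispatcher_backend: Optional[str] = None


@dataclass
class Cfg:
    model: Model = field(default_factory=Model)


@dataclass
class BaseCfg:
    moe_flex_dispatcher_backend: Optional[str] = None


def propagate_moe_flex_dispatcher_backend(base_cfg: BaseCfg, cfg: Cfg) -> None:
    """Centralize the copy so all five factory functions stay consistent."""
    if base_cfg.moe_flex_dispatcher_backend is not None:
        cfg.model.moe_flex_dispatcher_backend = base_cfg.moe_flex_dispatcher_backend


cfg = Cfg()
propagate_moe_flex_dispatcher_backend(BaseCfg("flex"), cfg)
assert cfg.model.moe_flex_dispatcher_backend == "flex"

# A None backend must not clobber a value that was already set.
propagate_moe_flex_dispatcher_backend(BaseCfg(None), cfg)
assert cfg.model.moe_flex_dispatcher_backend == "flex"
```

Each of the five factories then makes a single helper call instead of repeating the conditional, so any future change to the propagation rule lands in one place.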
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py`:
- Around line 34-36: Remove the unused mock parameter from the five pretrain
config factory signatures: e.g., remove "mock: bool = True" from
nemotron_3_nano_pretrain_config_gb300 and the four other
nemotron_3_nano_*_pretrain_config_* functions in this file; update each function
signature to drop the mock parameter and ensure no internal references remain
(ruff ARG001 will be resolved). Also update any local references or tests that
call these factories with the mock positional/kwarg so callers pass only the
remaining parameters (precision and config_variant).
---
Nitpick comments:
In `@scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py`:
- Around line 52-53: The two-line conditional that propagates
base_cfg.moe_flex_dispatcher_backend into cfg.model.moe_flex_dispatcher_backend
is duplicated across five functions; extract it into a single helper (e.g., add
an argument base_cfg to set_nemotron_3_nano_common_configs or create a small
function like propagate_moe_flex_dispatcher_backend(base_cfg, cfg)) and call
that helper from each of the five places instead of repeating the conditional;
ensure the helper checks base_cfg.moe_flex_dispatcher_backend is not None and
assigns to cfg.model.moe_flex_dispatcher_backend, and update the five callers to
invoke the helper (or to pass base_cfg into set_nemotron_3_nano_common_configs)
so future changes are centralized.
In
`@scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py`:
- Around line 54-62: The recompute_modules argument in the
_NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100 replace call is using a mutable list
literal ["moe", "layernorm"] that will be shared across the derived configs;
change it to an immutable tuple ("moe", "layernorm") (or another immutable
sequence) so the value stored in the dataclass cannot be mutated in-place;
update the recompute_modules parameter in the replace call for
_NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100 accordingly so both
NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100_BF16_V1 and
NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100_FP8_CS_V1 inherit the immutable sequence.
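As an illustrative sketch of the aliasing hazard the review describes (a standalone toy dataclass, not the repository's actual config class), `dataclasses.replace` copies field references, so a shared list can be mutated through either alias:

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class WorkloadConfig:
    # frozen=True blocks field reassignment, but not in-place mutation of the list.
    recompute_modules: list = field(default_factory=list)


base = WorkloadConfig(recompute_modules=["moe", "layernorm"])

# replace() with no override reuses the same list object in both new instances.
bf16_cfg = replace(base)
fp8_cfg = replace(base)
assert bf16_cfg.recompute_modules is fp8_cfg.recompute_modules

bf16_cfg.recompute_modules.append("attn")          # in-place mutation...
assert fp8_cfg.recompute_modules == ["moe", "layernorm", "attn"]  # ...hits the sibling too

# Storing a tuple closes the hole: tuples have no append/remove.
safe = replace(base, recompute_modules=("moe", "layernorm"))
assert isinstance(safe.recompute_modules, tuple)
```

This is why the fix swaps the list literal for a tuple in the `replace` call.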
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py
scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py
```python
def nemotron_3_nano_pretrain_config_gb300(
    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
) -> ConfigContainer:
```
Remove the unused mock parameter from all five function signatures.
The `mock: bool = True` argument is added to every pretrain config factory but is never read inside any of them. The PR title "remove mock arg" directly contradicts leaving it in, and ruff flags it as ARG001 in all five locations. This will fail linting/CI.
🔧 Proposed fix (shown for one function; apply the same diff to all five)
```diff
-def nemotron_3_nano_pretrain_config_gb300(
-    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
-) -> ConfigContainer:
+def nemotron_3_nano_pretrain_config_gb300(
+    precision: str = "bf16", config_variant: str = "v1"
+) -> ConfigContainer:
```

Also applies to: 58-60, 82-84, 106-108, 130-132
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 35-35: Unused function argument: mock
(ARG001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py` around
lines 34 - 36, Remove the unused mock parameter from the five pretrain config
factory signatures: e.g., remove "mock: bool = True" from
nemotron_3_nano_pretrain_config_gb300 and the four other
nemotron_3_nano_*_pretrain_config_* functions in this file; update each function
signature to drop the mock parameter and ensure no internal references remain
(ruff ARG001 will be resolved). Also update any local references or tests that
call these factories with the mock positional/kwarg so callers pass only the
remaining parameters (precision and config_variant).
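The last point about updating callers matters because dropping a middle positional parameter shifts the binding of later positional arguments. A hypothetical stand-in sketch (simplified functions, not the real factories) shows why keyword arguments are the safe migration path:

```python
def pretrain_config_old(precision="bf16", mock=True, config_variant="v1"):
    """Old signature: mock sits between precision and config_variant."""
    return {"precision": precision, "config_variant": config_variant}


def pretrain_config_new(precision="bf16", config_variant="v1"):
    """New signature with the unused mock parameter removed."""
    return {"precision": precision, "config_variant": config_variant}


# A caller written positionally against the old signature:
old_call = pretrain_config_old("fp8", False, "v2")
assert old_call == {"precision": "fp8", "config_variant": "v2"}

# Replaying those positional args against the new signature mis-binds:
# the old mock value False now lands in config_variant.
misbound = pretrain_config_new("fp8", False)
assert misbound["config_variant"] is False

# Keyword callers survive the signature change unchanged:
safe = pretrain_config_new(precision="fp8", config_variant="v2")
assert safe == {"precision": "fp8", "config_variant": "v2"}
```

Hence the prompt's instruction to make every caller pass only `precision` and `config_variant`, ideally by keyword.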
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
What does this PR do ?
Updates Nemotron 3 Nano perf configs.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Summary by CodeRabbit