Conversation
…config logging

- Fix adam_eps default from 1e-5 to 1e-8 in optimizer_utils.py to match Mcore
- Remove redundant adam_eps overrides in recipe files (olmoe, nemotron_3_nano, glm45, llama3)
- Add log_non_default_values() method to ConfigContainer for proactive config debugging
- Replace verbose YAML config logging with focused non-default value comparison

The non-default logging compares optimizer, ddp, and model configs against their Megatron Core parent class defaults, making it easier to catch unintended deviations.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
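The default-comparison idea in this commit can be sketched with plain dataclasses. This is a hypothetical illustration, not the actual `ConfigContainer` implementation: the class names and the `non_default_values` helper are assumptions standing in for the real Mcore parent classes and `_get_non_default_values`.

```python
import dataclasses


@dataclasses.dataclass
class McoreOptimizerDefaults:
    # Stand-in for a Megatron Core parent class and its defaults.
    adam_eps: float = 1e-8
    lr: float = 1e-4


@dataclasses.dataclass
class OptimizerConfig(McoreOptimizerDefaults):
    lr: float = 3e-4  # deviates from the parent default


def non_default_values(config, parent):
    """Return {field: (current, default)} for fields that differ from the parent defaults."""
    diffs = {}
    for field in dataclasses.fields(parent):
        current = getattr(config, field.name)
        if current != field.default:
            diffs[field.name] = (current, field.default)
    return diffs


print(non_default_values(OptimizerConfig(), McoreOptimizerDefaults))
# → {'lr': (0.0003, 0.0001)}; adam_eps is silent because it matches the default
```

Because `adam_eps` matches the Mcore default of `1e-8`, it never appears in the report — exactly the property the new logging relies on to surface only unintended deviations.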
📝 Walkthrough

The pull request consolidates optimizer epsilon configuration by removing explicit adam_eps parameters across recipe files and updating the default value from 1e-5 to 1e-8 in utility functions. It also introduces configuration logging functionality to report non-default values at setup time.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Setup
    participant Config
    participant MCore as Megatron Core
    participant Helpers
    participant Logger
    Setup->>Config: log_non_default_values()
    activate Config
    Config->>Helpers: _get_mcore_transformer_parent(model_config)
    Helpers-->>Config: transformer_parent_class
    Config->>MCore: Get defaults for optimizer
    MCore-->>Config: optimizer_defaults
    Config->>Helpers: _get_non_default_values(optimizer, mcore_class)
    Helpers-->>Config: non_default_optimizer_values
    Config->>MCore: Get defaults for ddp
    MCore-->>Config: ddp_defaults
    Config->>Helpers: _get_non_default_values(ddp, mcore_class)
    Helpers-->>Config: non_default_ddp_values
    Config->>Helpers: _get_key_config_values(train_config)
    Helpers-->>Config: key_train_values
    Config->>Logger: Log non-default values summary
    Logger-->>Setup: Configuration report
    deactivate Config
```
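The diagram's first step, resolving the Megatron Core parent of the model config, can be sketched as a walk over the class's MRO. This is an illustrative assumption about how such a helper could work — `get_mcore_parent` and the module-prefix check are hypothetical, not the actual `_get_mcore_transformer_parent`:

```python
def get_mcore_parent(config_cls, mcore_prefix="megatron.core"):
    """Return the first base class defined in a Megatron Core module, or None."""
    for base in config_cls.__mro__[1:]:  # skip the config class itself
        if base.__module__.startswith(mcore_prefix):
            return base
    return None
```

A config class that inherits from an Mcore class resolves to that base; a plain class falls through to `None`, which is the branch where only key values would be logged instead of a full default comparison.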
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 1
In `@src/megatron/bridge/training/config.py`:
- Around line 1663-1723: Docstring for log_non_default_values overstates that
"For configs that don't inherit from Mcore, all values are logged" but the
implementation calls _get_key_config_values which filters out None/large values;
update the docstring in log_non_default_values to accurately state that
non‑Mcore configs will have "key" or filtered values logged (via
_get_key_config_values) rather than all fields. Mention the function names
_get_key_config_values and log_non_default_values so reviewers can locate the
change and keep the rest of the behavior unchanged.
🧹 Nitpick comments (1)
src/megatron/bridge/training/config.py (1)
1745-1820: Prefer built-in generics for new type hints.
Python 3.10+ guidelines prefer `dict`/`tuple` over `Dict`/`Tuple` in new code. Please align the new helper signatures.

♻️ Suggested update
```diff
-def _get_non_default_values(config_obj: Any, mcore_class: type) -> Dict[str, Tuple[Any, Any]]:
+def _get_non_default_values(config_obj: Any, mcore_class: type) -> dict[str, tuple[Any, Any]]:
@@
-def _get_key_config_values(config_obj: Any) -> Dict[str, Any]:
+def _get_key_config_values(config_obj: Any) -> dict[str, Any]:
```
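As a quick sanity check on the suggestion, built-in generics (PEP 585) are valid as annotations at runtime on Python 3.9+, so the `typing.Dict`/`typing.Tuple` imports can simply be dropped. The function name and placeholder body below are illustrative only:

```python
def non_default_values_sketch(config_obj: object) -> dict[str, tuple[object, object]]:
    # Placeholder body; only the PEP 585 signature style is the point here.
    return {}


print(non_default_values_sketch.__annotations__["return"])
# → dict[str, tuple[object, object]]
```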
Tests cover:
- _get_mcore_transformer_parent for GPT and DeepSeek models
- _get_non_default_values for optimizer, DDP, and model configs
- _get_key_config_values for non-Mcore configs
- ConfigContainer.log_non_default_values method
- Verify adam_eps is not logged when matching Mcore default

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
# Conflicts:
#	src/megatron/bridge/recipes/glm/glm45.py
#	src/megatron/bridge/recipes/llama/llama3.py
#	src/megatron/bridge/recipes/olmoe/olmoe_7b.py
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
/ok to test 7dbacf1
…plementation

The implementation now calls cfg.log_non_default_values() instead of cfg.print_yaml(), so update the tests to check for the new method calls.
/ok to test 20dc4e4
…config logging (NVIDIA-NeMo#2184) Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: sowmen <sowmendipta@gmail.com>
…config logging (#2184) Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
[recipe, training] fix: Correct adam_eps default and add non-default config logging
Description
Problem
The default for `adam_eps` in PyTorch and Megatron Core is `1e-8`. However, `src/megatron/bridge/recipes/utils/optimizer_utils.py` was overriding this default to `1e-5`. This caused the Deepseek v3 model to converge too slowly compared to implementations in other frameworks.

To compound the problem, many individual recipes were redundantly overriding `adam_eps` back to `1e-8`, while others kept `1e-5`.

Solution

- Changed the `adam_eps` default from `1e-5` to `1e-8` in `optimizer_utils.py` to match Megatron Core
- Removed the redundant recipe overrides that set `adam_eps` back to `1e-8`
- Added `log_non_default_values()` to `ConfigContainer` to help catch similar issues in the future

Proactive Logging Feature
The new logging feature compares config values against Megatron Core defaults at runtime, logging only the differences. This makes it much easier to spot unintended deviations from Megatron Core defaults before they affect training.
Example Output
Files Changed

- `src/megatron/bridge/recipes/utils/optimizer_utils.py`: changed the `adam_eps` default from `1e-5` to `1e-8`
- `src/megatron/bridge/recipes/olmoe/olmoe_7b.py`: removed redundant `adam_eps=1e-8` override
- `src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py`: removed redundant `adam_eps=1e-8` override
- `src/megatron/bridge/recipes/glm/glm45.py`: removed redundant `adam_eps=1e-8` override
- `src/megatron/bridge/recipes/llama/llama3.py`: removed redundant `adam_eps=1e-8` override
- `src/megatron/bridge/training/config.py`: added the `log_non_default_values()` method
- `src/megatron/bridge/training/setup.py`: calls `log_non_default_values()` instead of `print_yaml()`

Testing
Checklist
Summary by CodeRabbit

Improvements
- Consolidated optimizer epsilon configuration: the `adam_eps` default now matches Megatron Core (`1e-8`) and redundant recipe overrides were removed.

New Features
- Configuration logging that reports non-default values at setup time.