
[recipe, training] fix: Correct adam_eps default and add non-default config logging#2184

Merged

yaoyu-33 merged 6 commits into main from fix/adam-eps-default-and-config-logging on Feb 4, 2026

Conversation

yaoyu-33 (Contributor) commented Feb 3, 2026

[recipe, training] fix: Correct adam_eps default and add non-default config logging

Description

Problem

The default for adam_eps in PyTorch and Megatron Core is 1e-8. However, src/megatron/bridge/recipes/utils/optimizer_utils.py was overriding this default to 1e-5, which caused the DeepSeek-V3 model to converge noticeably more slowly than implementations in other frameworks.

To compound the problem, many individual recipes were overriding adam_eps back to 1e-8 (redundantly), while others kept 1e-5.
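To see why the epsilon value matters, recall that Adam scales each update by lr / (sqrt(v_hat) + eps). When the second-moment estimate v_hat is very small, a larger eps dominates the denominator and damps the step. The sketch below (illustrative only, not code from this PR) shows the effect can exceed an order of magnitude for tiny v_hat:

```python
import math

# Illustrative only: how Adam's epsilon affects the effective step size
# when second-moment estimates are small (e.g. rarely-updated parameters).
# Adam's update is lr * m_hat / (sqrt(v_hat) + eps).
def adam_step_scale(v_hat: float, eps: float, lr: float = 3e-4) -> float:
    """Effective multiplier applied to the first-moment estimate m_hat."""
    return lr / (math.sqrt(v_hat) + eps)

for v_hat in (1e-4, 1e-8, 1e-12):
    s_small_eps = adam_step_scale(v_hat, eps=1e-8)
    s_large_eps = adam_step_scale(v_hat, eps=1e-5)
    print(f"v_hat={v_hat:.0e}  eps=1e-8 -> {s_small_eps:.3e}  eps=1e-5 -> {s_large_eps:.3e}")
```

With v_hat = 1e-12, eps = 1e-5 shrinks the step by roughly 10x compared to eps = 1e-8, which is consistent with the slower convergence described above.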

Solution

  1. Fix adam_eps default: Changed the default from 1e-5 to 1e-8 in optimizer_utils.py to match Megatron Core
  2. Remove redundant overrides: Cleaned up recipes that were overriding adam_eps back to 1e-8
  3. Add proactive config logging: Implemented log_non_default_values() in ConfigContainer to help catch similar issues in the future

Proactive Logging Feature

The new logging feature compares config values against Megatron Core defaults at runtime, logging only the differences. This makes it much easier to:

  • Spot unintended configuration deviations
  • Debug training issues caused by config mismatches
  • Verify that recipes are using expected values

Example Output

======================================================================
Configuration Summary (Non-Default Values vs Megatron Core)
======================================================================

[optimizer] Non-default values (vs Mcore OptimizerConfig):
  adam_beta2: 0.95  (Mcore default: 0.999)
  bf16: True  (Mcore default: False)
  fp8_recipe: 'tensorwise'  (Mcore default: None)
  lr: 0.0003  (Mcore default: None)
  min_lr: 3e-05  (Mcore default: None)
  params_dtype: torch.bfloat16  (Mcore default: torch.float32)
  use_distributed_optimizer: True  (Mcore default: False)
  weight_decay: 0.1  (Mcore default: 0.01)

[ddp] Non-default values (vs Mcore DistributedDataParallelConfig):
  average_in_collective: True  (Mcore default: False)
  check_for_nan_in_grad: True  (Mcore default: False)
  data_parallel_sharding_strategy: 'optim_grads_params'  (Mcore default: 'no_shard')
  grad_reduce_in_fp32: True  (Mcore default: False)
  overlap_grad_reduce: True  (Mcore default: False)
  overlap_param_gather: True  (Mcore default: False)
  use_distributed_optimizer: True  (Mcore default: False)

[model] Non-default values (vs Mcore TransformerConfig):
  add_bias_linear: False  (Mcore default: True)
  attention_backend: None  (Mcore default: <AttnBackend.auto: 5>)
  attention_dropout: 0.0  (Mcore default: 0.1)
  attention_softmax_in_fp32: False  (Mcore default: True)
  autocast_dtype: torch.bfloat16  (Mcore default: None)
  bf16: True  (Mcore default: False)
  cross_entropy_fusion_impl: 'te'  (Mcore default: 'native')
  cross_entropy_loss_fusion: True  (Mcore default: False)
  cuda_graph_scope: []  (Mcore default: 'full')
  deallocate_pipeline_outputs: True  (Mcore default: False)
  embedding_init_method_std: 0.02  (Mcore default: None)
  expert_tensor_parallel_size: 2  (Mcore default: None)
  ffn_hidden_size: 3072  (Mcore default: None)
  fp8_recipe: 'tensorwise'  (Mcore default: 'delayed')
  gated_linear_unit: True  (Mcore default: False)
  gradient_accumulation_fusion: True  (Mcore default: False)
  hidden_dropout: 0.0  (Mcore default: 0.1)
  hidden_size: 1024  (Mcore default: 0)
  kv_channels: 128  (Mcore default: None)
  ...

----------------------------------------------------------------------
Other Configuration Values:
----------------------------------------------------------------------

[train]:
  exit_signal: <Signals.SIGTERM: 15>
  global_batch_size: 2
  manual_gc: True
  micro_batch_size: 1
  train_iters: 10

[scheduler]:
  lr_decay_iters: None
  lr_decay_samples: 126953125000
  lr_warmup_iters: 500
  lr_warmup_steps: 4000

[checkpoint]:
  async_save: False
  ckpt_format: 'torch_dist'
  pretrained_checkpoint: None
  save_interval: 1000

======================================================================

Files Changed

  • src/megatron/bridge/recipes/utils/optimizer_utils.py: fixed adam_eps default from 1e-5 to 1e-8
  • src/megatron/bridge/recipes/olmoe/olmoe_7b.py: removed redundant adam_eps=1e-8 override
  • src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py: removed redundant adam_eps=1e-8 override
  • src/megatron/bridge/recipes/glm/glm45.py: removed redundant adam_eps=1e-8 overrides
  • src/megatron/bridge/recipes/llama/llama3.py: removed redundant adam_eps=1e-8 override
  • src/megatron/bridge/training/config.py: added log_non_default_values() method
  • src/megatron/bridge/training/setup.py: replaced verbose YAML logging with focused non-default logging

Testing

  • Verified the new logging output on Qwen3 recipe training
  • Existing unit tests pass

Checklist

  • Commits are signed-off
  • Code follows existing style conventions
  • Pre-commit hooks pass (ruff linting/formatting)

Summary by CodeRabbit

  • Improvements

    • Updated default optimizer epsilon value in training recipes for improved consistency.
    • Simplified optimizer configurations across multiple model recipes by removing redundant explicit settings.
  • New Features

    • Added configuration logging to display parameter values that differ from framework defaults.
    • Enhanced visibility into training configuration with improved reporting at training startup.

…config logging

- Fix adam_eps default from 1e-5 to 1e-8 in optimizer_utils.py to match Mcore
- Remove redundant adam_eps overrides in recipe files (olmoe, nemotron_3_nano, glm45, llama3)
- Add log_non_default_values() method to ConfigContainer for proactive config debugging
- Replace verbose YAML config logging with focused non-default value comparison

The non-default logging compares optimizer, ddp, and model configs against their
Megatron Core parent class defaults, making it easier to catch unintended deviations.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot (Contributor) commented Feb 3, 2026

📝 Walkthrough

The pull request consolidates optimizer epsilon configuration by removing explicit adam_eps parameters across recipe files and updating the default value from 1e-5 to 1e-8 in utility functions. It also introduces configuration logging functionality to report non-default values at setup time.

Changes

Recipe optimizer parameter cleanup
  Files: src/megatron/bridge/recipes/glm/glm45.py, src/megatron/bridge/recipes/llama/llama3.py, src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py, src/megatron/bridge/recipes/olmoe/olmoe_7b.py
  Removed explicit adam_eps=1e-8 arguments from optimizer configurations in pretraining and finetuning setups, relying on the updated default value instead.

Optimizer utility defaults
  Files: src/megatron/bridge/recipes/utils/optimizer_utils.py
  Updated the default adam_eps parameter from 1e-5 to 1e-8 in the distributed_fused_adam_with_cosine_annealing and distributed_fused_adam_with_cosine_annealing_samples functions.

Configuration logging infrastructure
  Files: src/megatron/bridge/training/config.py
  Added the public method log_non_default_values() and three module-level helper functions that inspect and log configuration values differing from Megatron Core defaults, with structured formatting for Core-inherited and non-Core configs.

Setup integration
  Files: src/megatron/bridge/training/setup.py
  Moved the maybe_log_and_save_config() call to execute at setup start instead of end, and updated the docstring to clarify the non-default-only logging behavior.

Sequence Diagram

sequenceDiagram
    participant Setup
    participant Config
    participant MCore as Megatron Core
    participant Helpers
    participant Logger

    Setup->>Config: log_non_default_values()
    activate Config
    Config->>Helpers: _get_mcore_transformer_parent(model_config)
    Helpers-->>Config: transformer_parent_class
    
    Config->>MCore: Get defaults for optimizer
    MCore-->>Config: optimizer_defaults
    Config->>Helpers: _get_non_default_values(optimizer, mcore_class)
    Helpers-->>Config: non_default_optimizer_values
    
    Config->>MCore: Get defaults for ddp
    MCore-->>Config: ddp_defaults
    Config->>Helpers: _get_non_default_values(ddp, mcore_class)
    Helpers-->>Config: non_default_ddp_values
    
    Config->>Helpers: _get_key_config_values(train_config)
    Helpers-->>Config: key_train_values
    
    Config->>Logger: Log non-default values summary
    Logger-->>Setup: Configuration report
    deactivate Config

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • cuichenx
  • ananthsub
  • ko3n1g
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: correcting the adam_eps default and adding non-default config logging, which aligns with the core objectives of the PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed PR description includes testing information (Qwen3 recipe training verification, existing unit tests pass) for changes affecting optimizer parameters and convergence behavior.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

In `@src/megatron/bridge/training/config.py`:
- Around line 1663-1723: Docstring for log_non_default_values overstates that
"For configs that don't inherit from Mcore, all values are logged" but the
implementation calls _get_key_config_values which filters out None/large values;
update the docstring in log_non_default_values to accurately state that
non‑Mcore configs will have "key" or filtered values logged (via
_get_key_config_values) rather than all fields. Mention the function names
_get_key_config_values and log_non_default_values so reviewers can locate the
change and keep the rest of the behavior unchanged.
🧹 Nitpick comments (1)
src/megatron/bridge/training/config.py (1)

1745-1820: Prefer built-in generics for new type hints.
Python 3.10+ guideline prefers dict / tuple over Dict / Tuple in new code. Please align the new helper signatures.

♻️ Suggested update
-def _get_non_default_values(config_obj: Any, mcore_class: type) -> Dict[str, Tuple[Any, Any]]:
+def _get_non_default_values(config_obj: Any, mcore_class: type) -> dict[str, tuple[Any, Any]]:
@@
-def _get_key_config_values(config_obj: Any) -> Dict[str, Any]:
+def _get_key_config_values(config_obj: Any) -> dict[str, Any]:

Tests cover:
- _get_mcore_transformer_parent for GPT and DeepSeek models
- _get_non_default_values for optimizer, DDP, and model configs
- _get_key_config_values for non-Mcore configs
- ConfigContainer.log_non_default_values method
- Verify adam_eps is not logged when matching Mcore default

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
# Conflicts:
#	src/megatron/bridge/recipes/glm/glm45.py
#	src/megatron/bridge/recipes/llama/llama3.py
#	src/megatron/bridge/recipes/olmoe/olmoe_7b.py
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
yaoyu-33 (Contributor, Author) commented Feb 3, 2026

/ok to test 7dbacf1

yaoyu-33 and others added 2 commits February 3, 2026 13:15
…plementation

The implementation now calls cfg.log_non_default_values() instead of
cfg.print_yaml(), so update tests to check for the new method calls.
yaoyu-33 (Contributor, Author) commented Feb 3, 2026

/ok to test 20dc4e4

@yaoyu-33 yaoyu-33 enabled auto-merge (squash) February 4, 2026 19:10
@cuichenx cuichenx self-requested a review February 4, 2026 20:58
cuichenx (Contributor) left a comment


LGTM

@yaoyu-33 yaoyu-33 merged commit b273cd3 into main Feb 4, 2026
71 of 77 checks passed
@yaoyu-33 yaoyu-33 deleted the fix/adam-eps-default-and-config-logging branch February 4, 2026 21:02
sowmen pushed a commit to sowmen/Megatron-Bridge that referenced this pull request Feb 11, 2026
…config logging (NVIDIA-NeMo#2184)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: sowmen <sowmendipta@gmail.com>
ko3n1g pushed a commit that referenced this pull request Feb 24, 2026
…config logging (#2184)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g mentioned this pull request Feb 24, 2026
5 tasks