
[recipe, training] fix: Correct adam_eps default and add non-default config logging#2184

Merged

yaoyu-33 merged 6 commits into main from fix/adam-eps-default-and-config-logging on Feb 4, 2026

Conversation

yaoyu-33 (Contributor) commented Feb 3, 2026

[recipe, training] fix: Correct adam_eps default and add non-default config logging

Description

Problem

The default for adam_eps in PyTorch and Megatron Core is 1e-8. However, src/megatron/bridge/recipes/utils/optimizer_utils.py was overriding this default to 1e-5, which caused the DeepSeek-V3 model to converge noticeably more slowly than implementations in other frameworks.

To compound the problem, many individual recipes were overriding adam_eps back to 1e-8 (redundantly), while others kept 1e-5.
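To see why the epsilon value matters, recall that Adam scales each update by lr / (sqrt(v_hat) + eps). When the second-moment estimate v_hat is very small, a larger eps dominates the denominator and damps the step. The sketch below (illustrative only, not code from this PR) shows the effect can exceed an order of magnitude for tiny v_hat:

```python
import math

# Illustrative only: how Adam's epsilon affects the effective step size
# when second-moment estimates are small (e.g. rarely-updated parameters).
# Adam's update is lr * m_hat / (sqrt(v_hat) + eps).
def adam_step_scale(v_hat: float, eps: float, lr: float = 3e-4) -> float:
    """Effective multiplier applied to the first-moment estimate m_hat."""
    return lr / (math.sqrt(v_hat) + eps)

for v_hat in (1e-4, 1e-8, 1e-12):
    s_small_eps = adam_step_scale(v_hat, eps=1e-8)
    s_large_eps = adam_step_scale(v_hat, eps=1e-5)
    print(f"v_hat={v_hat:.0e}  eps=1e-8 -> {s_small_eps:.3e}  eps=1e-5 -> {s_large_eps:.3e}")
```

With v_hat = 1e-12, eps = 1e-5 shrinks the step by roughly 10x compared to eps = 1e-8, which is consistent with the slower convergence described above.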

Solution

  1. Fix adam_eps default: Changed the default from 1e-5 to 1e-8 in optimizer_utils.py to match Megatron Core
  2. Remove redundant overrides: Cleaned up recipes that were overriding adam_eps back to 1e-8
  3. Add proactive config logging: Implemented log_non_default_values() in ConfigContainer to help catch similar issues in the future

Proactive Logging Feature

The new logging feature compares config values against Megatron Core defaults at runtime, logging only the differences. This makes it much easier to:

  • Spot unintended configuration deviations
  • Debug training issues caused by config mismatches
  • Verify that recipes are using expected values

Example Output

======================================================================
Configuration Summary (Non-Default Values vs Megatron Core)
======================================================================

[optimizer] Non-default values (vs Mcore OptimizerConfig):
  adam_beta2: 0.95  (Mcore default: 0.999)
  bf16: True  (Mcore default: False)
  fp8_recipe: 'tensorwise'  (Mcore default: None)
  lr: 0.0003  (Mcore default: None)
  min_lr: 3e-05  (Mcore default: None)
  params_dtype: torch.bfloat16  (Mcore default: torch.float32)
  use_distributed_optimizer: True  (Mcore default: False)
  weight_decay: 0.1  (Mcore default: 0.01)

[ddp] Non-default values (vs Mcore DistributedDataParallelConfig):
  average_in_collective: True  (Mcore default: False)
  check_for_nan_in_grad: True  (Mcore default: False)
  data_parallel_sharding_strategy: 'optim_grads_params'  (Mcore default: 'no_shard')
  grad_reduce_in_fp32: True  (Mcore default: False)
  overlap_grad_reduce: True  (Mcore default: False)
  overlap_param_gather: True  (Mcore default: False)
  use_distributed_optimizer: True  (Mcore default: False)

[model] Non-default values (vs Mcore TransformerConfig):
  add_bias_linear: False  (Mcore default: True)
  attention_backend: None  (Mcore default: <AttnBackend.auto: 5>)
  attention_dropout: 0.0  (Mcore default: 0.1)
  attention_softmax_in_fp32: False  (Mcore default: True)
  autocast_dtype: torch.bfloat16  (Mcore default: None)
  bf16: True  (Mcore default: False)
  cross_entropy_fusion_impl: 'te'  (Mcore default: 'native')
  cross_entropy_loss_fusion: True  (Mcore default: False)
  cuda_graph_scope: []  (Mcore default: 'full')
  deallocate_pipeline_outputs: True  (Mcore default: False)
  embedding_init_method_std: 0.02  (Mcore default: None)
  expert_tensor_parallel_size: 2  (Mcore default: None)
  ffn_hidden_size: 3072  (Mcore default: None)
  fp8_recipe: 'tensorwise'  (Mcore default: 'delayed')
  gated_linear_unit: True  (Mcore default: False)
  gradient_accumulation_fusion: True  (Mcore default: False)
  hidden_dropout: 0.0  (Mcore default: 0.1)
  hidden_size: 1024  (Mcore default: 0)
  kv_channels: 128  (Mcore default: None)
  ...

----------------------------------------------------------------------
Other Configuration Values:
----------------------------------------------------------------------

[train]:
  exit_signal: <Signals.SIGTERM: 15>
  global_batch_size: 2
  manual_gc: True
  micro_batch_size: 1
  train_iters: 10

[scheduler]:
  lr_decay_iters: None
  lr_decay_samples: 126953125000
  lr_warmup_iters: 500
  lr_warmup_steps: 4000

[checkpoint]:
  async_save: False
  ckpt_format: 'torch_dist'
  pretrained_checkpoint: None
  save_interval: 1000

======================================================================

Files Changed

  • src/megatron/bridge/recipes/utils/optimizer_utils.py: fixed adam_eps default from 1e-5 to 1e-8
  • src/megatron/bridge/recipes/olmoe/olmoe_7b.py: removed redundant adam_eps=1e-8 override
  • src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py: removed redundant adam_eps=1e-8 override
  • src/megatron/bridge/recipes/glm/glm45.py: removed redundant adam_eps=1e-8 overrides
  • src/megatron/bridge/recipes/llama/llama3.py: removed redundant adam_eps=1e-8 override
  • src/megatron/bridge/training/config.py: added log_non_default_values() method
  • src/megatron/bridge/training/setup.py: replaced verbose YAML logging with focused non-default logging

Testing

  • Verified the new logging output on Qwen3 recipe training
  • Existing unit tests pass

Checklist

  • Commits are signed-off
  • Code follows existing style conventions
  • Pre-commit hooks pass (ruff linting/formatting)

Summary by CodeRabbit

  • Improvements

    • Updated default optimizer epsilon value in training recipes for improved consistency.
    • Simplified optimizer configurations across multiple model recipes by removing redundant explicit settings.
  • New Features

    • Added configuration logging to display parameter values that differ from framework defaults.
    • Enhanced visibility into training configuration with improved reporting at training startup.

…config logging

- Fix adam_eps default from 1e-5 to 1e-8 in optimizer_utils.py to match Mcore
- Remove redundant adam_eps overrides in recipe files (olmoe, nemotron_3_nano, glm45, llama3)
- Add log_non_default_values() method to ConfigContainer for proactive config debugging
- Replace verbose YAML config logging with focused non-default value comparison

The non-default logging compares optimizer, ddp, and model configs against their
Megatron Core parent class defaults, making it easier to catch unintended deviations.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot (Contributor) commented Feb 3, 2026

📝 Walkthrough

The pull request consolidates optimizer epsilon configuration by removing explicit adam_eps parameters across recipe files and updating the default value from 1e-5 to 1e-8 in utility functions. It also introduces configuration logging functionality to report non-default values at setup time.

Changes

Recipe optimizer parameter cleanup
  Files: src/megatron/bridge/recipes/glm/glm45.py, src/megatron/bridge/recipes/llama/llama3.py, src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py, src/megatron/bridge/recipes/olmoe/olmoe_7b.py
  Removed explicit adam_eps=1e-8 arguments from optimizer configurations in pretraining and finetuning setups, relying on the updated default value instead.

Optimizer utility defaults
  Files: src/megatron/bridge/recipes/utils/optimizer_utils.py
  Updated the default adam_eps parameter from 1e-5 to 1e-8 in the distributed_fused_adam_with_cosine_annealing and distributed_fused_adam_with_cosine_annealing_samples functions.

Configuration logging infrastructure
  Files: src/megatron/bridge/training/config.py
  Added the public method log_non_default_values() and three module-level helper functions that inspect and log configuration values differing from Megatron Core defaults, with structured formatting for Core-inherited and non-Core configs.

Setup integration
  Files: src/megatron/bridge/training/setup.py
  Moved the maybe_log_and_save_config() call to execute at setup start instead of end, and updated the docstring to clarify the non-default-only logging behavior.

Sequence Diagram

sequenceDiagram
    participant Setup
    participant Config
    participant MCore as Megatron Core
    participant Helpers
    participant Logger

    Setup->>Config: log_non_default_values()
    activate Config
    Config->>Helpers: _get_mcore_transformer_parent(model_config)
    Helpers-->>Config: transformer_parent_class
    
    Config->>MCore: Get defaults for optimizer
    MCore-->>Config: optimizer_defaults
    Config->>Helpers: _get_non_default_values(optimizer, mcore_class)
    Helpers-->>Config: non_default_optimizer_values
    
    Config->>MCore: Get defaults for ddp
    MCore-->>Config: ddp_defaults
    Config->>Helpers: _get_non_default_values(ddp, mcore_class)
    Helpers-->>Config: non_default_ddp_values
    
    Config->>Helpers: _get_key_config_values(train_config)
    Helpers-->>Config: key_train_values
    
    Config->>Logger: Log non-default values summary
    Logger-->>Setup: Configuration report
    deactivate Config

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • cuichenx
  • ananthsub
  • ko3n1g
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: correcting the adam_eps default and adding non-default config logging, which aligns with the core objectives of the PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed PR description includes testing information (Qwen3 recipe training verification, existing unit tests pass) for changes affecting optimizer parameters and convergence behavior.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

In `@src/megatron/bridge/training/config.py`:
- Around line 1663-1723: Docstring for log_non_default_values overstates that
"For configs that don't inherit from Mcore, all values are logged" but the
implementation calls _get_key_config_values which filters out None/large values;
update the docstring in log_non_default_values to accurately state that
non‑Mcore configs will have "key" or filtered values logged (via
_get_key_config_values) rather than all fields. Mention the function names
_get_key_config_values and log_non_default_values so reviewers can locate the
change and keep the rest of the behavior unchanged.
🧹 Nitpick comments (1)
src/megatron/bridge/training/config.py (1)

1745-1820: Prefer built-in generics for new type hints.
Python 3.10+ guideline prefers dict / tuple over Dict / Tuple in new code. Please align the new helper signatures.

♻️ Suggested update
-def _get_non_default_values(config_obj: Any, mcore_class: type) -> Dict[str, Tuple[Any, Any]]:
+def _get_non_default_values(config_obj: Any, mcore_class: type) -> dict[str, tuple[Any, Any]]:
@@
-def _get_key_config_values(config_obj: Any) -> Dict[str, Any]:
+def _get_key_config_values(config_obj: Any) -> dict[str, Any]:

Tests cover:
- _get_mcore_transformer_parent for GPT and DeepSeek models
- _get_non_default_values for optimizer, DDP, and model configs
- _get_key_config_values for non-Mcore configs
- ConfigContainer.log_non_default_values method
- Verify adam_eps is not logged when matching Mcore default

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
# Conflicts:
#	src/megatron/bridge/recipes/glm/glm45.py
#	src/megatron/bridge/recipes/llama/llama3.py
#	src/megatron/bridge/recipes/olmoe/olmoe_7b.py
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
yaoyu-33 (Contributor, Author) commented Feb 3, 2026

/ok to test 7dbacf1

yaoyu-33 and others added 2 commits February 3, 2026 13:15
…plementation

The implementation now calls cfg.log_non_default_values() instead of
cfg.print_yaml(), so update tests to check for the new method calls.
yaoyu-33 (Contributor, Author) commented Feb 3, 2026

/ok to test 20dc4e4

@yaoyu-33 yaoyu-33 enabled auto-merge (squash) February 4, 2026 19:10
@cuichenx cuichenx self-requested a review February 4, 2026 20:58
cuichenx (Contributor) left a comment


LGTM

@yaoyu-33 yaoyu-33 merged commit b273cd3 into main Feb 4, 2026
71 of 77 checks passed
@yaoyu-33 yaoyu-33 deleted the fix/adam-eps-default-and-config-logging branch February 4, 2026 21:02
sowmen pushed a commit to sowmen/Megatron-Bridge that referenced this pull request Feb 11, 2026
…config logging (NVIDIA-NeMo#2184)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: sowmen <sowmendipta@gmail.com>
ko3n1g pushed a commit that referenced this pull request Feb 24, 2026
…config logging (#2184)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g mentioned this pull request Feb 24, 2026
5 tasks