[sync] FP8 params support for megatron-fsdp (MXFP8/Blockwise) #2135

Merged
ananthsub merged 3 commits into NVIDIA-NeMo:main from ananthsub:sync-2239
Jan 30, 2026

Conversation

@ananthsub
Contributor

@ananthsub ananthsub commented Jan 30, 2026

What does this PR do ?

Sync argument validation from NVIDIA/Megatron-LM#2239

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Bug Fixes

    • When using Megatron FSDP for distributed training, certain advanced memory optimization settings are automatically disabled if present in the training configuration.
  • Tests

    • Added comprehensive test coverage verifying that Megatron FSDP properly detects and adjusts incompatible settings during configuration validation.


Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
@ananthsub ananthsub requested a review from BoxiangW January 30, 2026 00:14
@copy-pr-bot

copy-pr-bot bot commented Jan 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ananthsub
Copy link
Contributor Author

/ok to test 3f551eb

@ananthsub ananthsub marked this pull request as ready for review January 30, 2026 00:23
@coderabbitai
Contributor

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

Walkthrough

The changes add validation logic to Megatron FSDP configuration that disables two gradient buffer reuse flags (ddp.reuse_grad_buf_for_mxfp8_param_ag and optimizer.reuse_grad_buf_for_mxfp8_param_ag) during config validation when Megatron FSDP is active, along with a corresponding test to verify this enforcement.

Changes

Cohort / File(s) Summary
Megatron FSDP Configuration Validation
src/megatron/bridge/training/config.py
Added validation logic in validate() to disable reuse_grad_buf_for_mxfp8_param_ag flags in both DDP and optimizer configs when Megatron FSDP is active. DDP flag disabling includes a notice printout.
Configuration Validation Tests
tests/unit_tests/training/test_config.py
Added test_megatron_fsdp_forces_reuse_grad_buf_false() test method to verify that the two gradient buffer reuse flags are forced to False during Megatron FSDP config validation.
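The validation change described above can be sketched as follows. This is a hypothetical illustration, not the actual Bridge source: the class names (`DDPConfig`, `OptimizerConfig`, `ConfigContainer`) and the `use_megatron_fsdp` field are stand-ins for whatever `src/megatron/bridge/training/config.py` actually uses; only the two `reuse_grad_buf_for_mxfp8_param_ag` flag names come from the PR.

```python
# Hedged sketch of the Megatron FSDP validation described in the walkthrough.
# Assumed names: DDPConfig, OptimizerConfig, ConfigContainer, use_megatron_fsdp.
from dataclasses import dataclass, field


@dataclass
class DDPConfig:
    reuse_grad_buf_for_mxfp8_param_ag: bool = False


@dataclass
class OptimizerConfig:
    reuse_grad_buf_for_mxfp8_param_ag: bool = False


@dataclass
class ConfigContainer:
    use_megatron_fsdp: bool = False
    ddp: DDPConfig = field(default_factory=DDPConfig)
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)

    def validate(self) -> None:
        # When Megatron FSDP is active, force both grad-buffer reuse flags
        # off, since they are incompatible with FSDP's parameter all-gather.
        if self.use_megatron_fsdp:
            if self.ddp.reuse_grad_buf_for_mxfp8_param_ag:
                # Per the walkthrough, only the DDP flag gets a notice printout.
                print(
                    "Megatron FSDP is enabled; disabling "
                    "ddp.reuse_grad_buf_for_mxfp8_param_ag"
                )
                self.ddp.reuse_grad_buf_for_mxfp8_param_ag = False
            self.optimizer.reuse_grad_buf_for_mxfp8_param_ag = False


cfg = ConfigContainer(use_megatron_fsdp=True)
cfg.ddp.reuse_grad_buf_for_mxfp8_param_ag = True
cfg.optimizer.reuse_grad_buf_for_mxfp8_param_ag = True
cfg.validate()
```

Silently rewriting user configuration is a judgment call; printing a notice for the DDP flag, as the PR does, at least makes the override visible in training logs.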

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested Reviewers

  • yaoyu-33
  • gautham-kollu
  • malay-nagda
🚥 Pre-merge checks: ✅ 4 passed

  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title '[sync] FP8 params support for megatron-fsdp (MXFP8/Blockwise)' accurately reflects the main objective of adding FP8 parameter support, though it does not explicitly mention the specific config validation changes made to disable incompatible flags.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Test Results For Major Changes — ✅ Passed: PR contains only 7 lines of minor defensive validation code with appropriate test coverage (+36 lines), qualifying as minor changes per check criteria.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/unit_tests/training/test_config.py`:
- Around lines 1049-1084: test_megatron_fsdp_forces_reuse_grad_buf_false declares an unused monkeypatch parameter, which triggers ruff ARG002 (unused method argument). Remove the parameter so the signature becomes def test_megatron_fsdp_forces_reuse_grad_buf_false(self):. No other changes are needed, since monkeypatch is not referenced anywhere else in the test.
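The ARG002 fix above amounts to dropping the unused fixture from the test signature. A minimal illustration (the class name and test bodies here are stand-ins, not the real test in tests/unit_tests/training/test_config.py): pytest only injects fixtures that are named in a test's parameters, so removing an unused `monkeypatch` changes nothing about how the test runs.

```python
# Illustration of the ruff ARG002 fix: an unused fixture parameter can
# simply be removed from a pytest test method's signature.

class TestMegatronFSDPConfig:
    # Before: ruff flags `monkeypatch` with ARG002 because it is accepted
    # but never used inside the test body.
    def test_with_unused_fixture(self, monkeypatch):
        assert True

    # After: same test without the unused parameter; pytest behavior is
    # unchanged, since no fixture needs to be injected.
    def test_without_unused_fixture(self):
        assert True
```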

@ananthsub ananthsub merged commit aa30683 into NVIDIA-NeMo:main Jan 30, 2026
149 of 157 checks passed
@ananthsub ananthsub deleted the sync-2239 branch January 30, 2026 19:03