[sync] FP8 params support for megatron-fsdp (MXFP8/Blockwise) #2135

Merged
ananthsub merged 3 commits into NVIDIA-NeMo:main from ananthsub:sync-2239
Jan 30, 2026

Conversation

@ananthsub
Contributor

@ananthsub ananthsub commented Jan 30, 2026

What does this PR do ?

Sync argument validation from NVIDIA/Megatron-LM#2239

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Bug Fixes

    • When using Megatron FSDP for distributed training, certain advanced memory optimization settings are automatically disabled if present in the training configuration.
  • Tests

    • Added comprehensive test coverage verifying that Megatron FSDP properly detects and adjusts incompatible settings during configuration validation.


Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
@ananthsub ananthsub requested a review from BoxiangW January 30, 2026 00:14
@copy-pr-bot

copy-pr-bot bot commented Jan 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ananthsub
Copy link
Contributor Author

/ok to test 3f551eb

@ananthsub ananthsub marked this pull request as ready for review January 30, 2026 00:23
@coderabbitai
Contributor

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

Walkthrough

The changes add validation logic to Megatron FSDP configuration that disables two gradient buffer reuse flags (ddp.reuse_grad_buf_for_mxfp8_param_ag and optimizer.reuse_grad_buf_for_mxfp8_param_ag) during config validation when Megatron FSDP is active, along with a corresponding test to verify this enforcement.

Changes

Cohort / File(s) Summary
Megatron FSDP Configuration Validation
src/megatron/bridge/training/config.py
Added validation logic in validate() to disable reuse_grad_buf_for_mxfp8_param_ag flags in both DDP and optimizer configs when Megatron FSDP is active. DDP flag disabling includes a notice printout.
Configuration Validation Tests
tests/unit_tests/training/test_config.py
Added test_megatron_fsdp_forces_reuse_grad_buf_false() test method to verify that the two gradient buffer reuse flags are forced to False during Megatron FSDP config validation.
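The validation change described above can be sketched as follows. This is a hypothetical illustration, not the actual Bridge source: the class names (`DDPConfig`, `OptimizerConfig`, `ConfigContainer`) and the `use_megatron_fsdp` field are stand-ins for whatever `src/megatron/bridge/training/config.py` actually uses; only the two `reuse_grad_buf_for_mxfp8_param_ag` flag names come from the PR.

```python
# Hedged sketch of the Megatron FSDP validation described in the walkthrough.
# Assumed names: DDPConfig, OptimizerConfig, ConfigContainer, use_megatron_fsdp.
from dataclasses import dataclass, field


@dataclass
class DDPConfig:
    reuse_grad_buf_for_mxfp8_param_ag: bool = False


@dataclass
class OptimizerConfig:
    reuse_grad_buf_for_mxfp8_param_ag: bool = False


@dataclass
class ConfigContainer:
    use_megatron_fsdp: bool = False
    ddp: DDPConfig = field(default_factory=DDPConfig)
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)

    def validate(self) -> None:
        # When Megatron FSDP is active, force both grad-buffer reuse flags
        # off, since they are incompatible with FSDP's parameter all-gather.
        if self.use_megatron_fsdp:
            if self.ddp.reuse_grad_buf_for_mxfp8_param_ag:
                # Per the walkthrough, only the DDP flag gets a notice printout.
                print(
                    "Megatron FSDP is enabled; disabling "
                    "ddp.reuse_grad_buf_for_mxfp8_param_ag"
                )
                self.ddp.reuse_grad_buf_for_mxfp8_param_ag = False
            self.optimizer.reuse_grad_buf_for_mxfp8_param_ag = False


cfg = ConfigContainer(use_megatron_fsdp=True)
cfg.ddp.reuse_grad_buf_for_mxfp8_param_ag = True
cfg.optimizer.reuse_grad_buf_for_mxfp8_param_ag = True
cfg.validate()
```

Silently rewriting user configuration is a judgment call; printing a notice for the DDP flag, as the PR does, at least makes the override visible in training logs.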

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested Reviewers

  • yaoyu-33
  • gautham-kollu
  • malay-nagda
🚥 Pre-merge checks: ✅ 4 passed

  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title '[sync] FP8 params support for megatron-fsdp (MXFP8/Blockwise)' accurately reflects the main objective of adding FP8 parameter support, though it does not explicitly mention the specific config validation changes made to disable incompatible flags.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Test Results For Major Changes — ✅ Passed: PR contains only 7 lines of minor defensive validation code with appropriate test coverage (+36 lines), qualifying as minor changes per check criteria.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/unit_tests/training/test_config.py`:
- Around lines 1049-1084: test_megatron_fsdp_forces_reuse_grad_buf_false declares an unused monkeypatch parameter, which triggers ruff ARG002 (unused method argument). Remove the parameter so the signature becomes def test_megatron_fsdp_forces_reuse_grad_buf_false(self):. No other changes are needed, since monkeypatch is not referenced anywhere else in the test.
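The ARG002 fix above amounts to dropping the unused fixture from the test signature. A minimal illustration (the class name and test bodies here are stand-ins, not the real test in tests/unit_tests/training/test_config.py): pytest only injects fixtures that are named in a test's parameters, so removing an unused `monkeypatch` changes nothing about how the test runs.

```python
# Illustration of the ruff ARG002 fix: an unused fixture parameter can
# simply be removed from a pytest test method's signature.

class TestMegatronFSDPConfig:
    # Before: ruff flags `monkeypatch` with ARG002 because it is accepted
    # but never used inside the test body.
    def test_with_unused_fixture(self, monkeypatch):
        assert True

    # After: same test without the unused parameter; pytest behavior is
    # unchanged, since no fixture needs to be injected.
    def test_without_unused_fixture(self):
        assert True
```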

@ananthsub ananthsub merged commit aa30683 into NVIDIA-NeMo:main Jan 30, 2026
149 of 157 checks passed
@ananthsub ananthsub deleted the sync-2239 branch January 30, 2026 19:03