[Dev] FP8 params support for megatron-fsdp (MXFP8/Blockwise)#2086
shjwudp merged 30 commits into NVIDIA:dev
Conversation
shjwudp
left a comment
Please resolve the conflicts and make sure all tests pass.
I think naming the file that handles mixed precision as mixed_precision.py would be more appropriate.
Tested on Llama3-8B. Convergence was stable across all configurations. Performance results:
Full test report: https://api.wandb.ai/links/nvidia/z3nax4om
shjwudp
left a comment
Looks good to me, thanks!
…dec16_v2 Fix blockwise FP8 cast main weight to model weight shard
    if self.enable_fine_grained_param_gather_hook:
        param_list = list(module.parameters(recurse=False))
@kunlunl Basic question, why does MXFP8 need fine-grained AG again?
Good question. This logic was added to support MXFP8 + activation recompute:
1. It ensures the accuracy of the all-gathered sub-module parameters: the parameters used in the backward all-gather cannot be reused in the forward pass in MXFP8.
2. It improves the performance of gradient accumulation and gradient reduction: fine-grained handling allows earlier launches and prevents potential hangs during MoE training.
There seems to be an issue with point (1), and there may be alternative solutions to address it. :/
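The fine-grained path above can be sketched roughly as follows. This is a plain-Python illustration, not the Megatron-FSDP implementation; `Module`, `make_param_gather_hook`, and `gather_log` are hypothetical names. The only detail taken from the diff is the per-module `module.parameters(recurse=False)` call:

```python
class Module:
    """Toy stand-in for a torch.nn.Module (illustrative only)."""
    def __init__(self, name, params):
        self.name = name
        self._params = params

    def parameters(self, recurse=False):
        return iter(self._params)


gather_log = []


def make_param_gather_hook(module):
    # One hook per sub-module: it covers only that module's own
    # parameters, mirroring `module.parameters(recurse=False)` in the diff.
    param_list = list(module.parameters(recurse=False))

    def hook():
        # In the real code this would launch an all-gather for param_list;
        # here we just record which module gathered how many params.
        gather_log.append((module.name, len(param_list)))

    return hook


modules = [Module("attn", ["wq", "wk"]), Module("mlp", ["w1"])]
hooks = [make_param_gather_hook(m) for m in modules]
for h in hooks:  # in practice, fired by each module's pre-forward/pre-backward
    h()
print(gather_log)  # [('attn', 2), ('mlp', 1)]
```

Because each module gathers independently, the backward pass can re-gather fresh data for just the module it needs, and per-module gradient reduction can launch as soon as that module's backward finishes.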
Yes. Regarding "the parameters used in the backward all-gather cannot be reused in the forward pass in MXFP8": MXFP8 uses 1D (block) scaling, so the row-wise and column-wise quantized data have different values, and we cannot obtain the column-wise data by transposing the row-wise data.
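A rough illustration of why transposition fails under 1D scaling. This is a toy per-row quantizer on a float grid, not the actual MXFP8 format (which uses power-of-two E8M0 block scales); `quantize_rows` and `qmax` are made-up names:

```python
def quantize_rows(mat, qmax=6.0):
    """Quantize each row with its own scale (toy 1D row-wise scaling)."""
    out = []
    for row in mat:
        # One scale per row, chosen from that row's max magnitude.
        scale = max(abs(v) for v in row) / qmax or 1.0
        out.append([round(v / scale) * scale for v in row])
    return out


def transpose(mat):
    return [list(col) for col in zip(*mat)]


mat = [[1.0, 100.0],
       [3.0, 2.0]]

# Row-wise quantization vs. column-wise quantization (via transpose).
row_q = quantize_rows(mat)
col_q = transpose(quantize_rows(transpose(mat)))

# The scales differ per row vs. per column, so the two quantized tensors
# disagree: transposing row_q does NOT give the column-wise data.
print(row_q == col_q)  # False
```

With 2D (per-tensor) scaling the two results would coincide, which is why this reuse problem is specific to 1D-scaled recipes like MXFP8.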
What does this PR do?
main MR: #2239
(Merged) TE MR: NVIDIA/TransformerEngine#2055
Major changes:
Convergence Test
Comparison of convergence behavior across several configurations:
https://api.wandb.ai/links/nvidia/sga6o3t3