[Dev] FP8 params support for megatron-fsdp (MXFP8/Blockwise)#2086
shjwudp merged 30 commits into NVIDIA:dev
Conversation
shjwudp
left a comment
Please resolve the conflicts and make sure all tests pass.
I think naming the file that handles mixed precision as mixed_precision.py would be more appropriate.
Tested on Llama3-8B. Convergence was stable across all configurations. Performance results:
Full test report: https://api.wandb.ai/links/nvidia/z3nax4om
shjwudp
left a comment
Looks good to me, thanks!
…dec16_v2 Fix blockwise FP8 cast main weight to model weight shard
    if self.enable_fine_grained_param_gather_hook:
        param_list = list(module.parameters(recurse=False))
@kunlunl Basic question, why does MXFP8 need fine-grained AG again?
Good question. This logic was added to support MXFP8 + activation recompute:
1. It ensures the accuracy of the all-gathered sub-module parameters: the parameters used in the backward all-gather cannot be reused in the forward pass in MXFP8.
2. It improves the performance of gradient accumulation and gradient reduction: fine-grained handling allows earlier launches and prevents potential hangs during MoE training.
There seems to be an issue with point (1), and there may be alternative solutions to address it. :/
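The fine-grained path above can be sketched roughly as follows. This is a plain-Python illustration, not the Megatron-FSDP implementation; `Module`, `make_param_gather_hook`, and `gather_log` are hypothetical names. The only detail taken from the diff is the per-module `module.parameters(recurse=False)` call:

```python
class Module:
    """Toy stand-in for a torch.nn.Module (illustrative only)."""
    def __init__(self, name, params):
        self.name = name
        self._params = params

    def parameters(self, recurse=False):
        return iter(self._params)


gather_log = []


def make_param_gather_hook(module):
    # One hook per sub-module: it covers only that module's own
    # parameters, mirroring `module.parameters(recurse=False)` in the diff.
    param_list = list(module.parameters(recurse=False))

    def hook():
        # In the real code this would launch an all-gather for param_list;
        # here we just record which module gathered how many params.
        gather_log.append((module.name, len(param_list)))

    return hook


modules = [Module("attn", ["wq", "wk"]), Module("mlp", ["w1"])]
hooks = [make_param_gather_hook(m) for m in modules]
for h in hooks:  # in practice, fired by each module's pre-forward/pre-backward
    h()
print(gather_log)  # [('attn', 2), ('mlp', 1)]
```

Because each module gathers independently, the backward pass can re-gather fresh data for just the module it needs, and per-module gradient reduction can launch as soon as that module's backward finishes.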
Yes. Regarding "the parameters used in the backward all-gather cannot be reused in the forward pass in MXFP8": MXFP8 uses 1D (block) scaling, so the row-wise and column-wise quantized data have different values, and we cannot obtain the column-wise data by transposing the row-wise data.
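A rough illustration of why transposition fails under 1D scaling. This is a toy per-row quantizer on a float grid, not the actual MXFP8 format (which uses power-of-two E8M0 block scales); `quantize_rows` and `qmax` are made-up names:

```python
def quantize_rows(mat, qmax=6.0):
    """Quantize each row with its own scale (toy 1D row-wise scaling)."""
    out = []
    for row in mat:
        # One scale per row, chosen from that row's max magnitude.
        scale = max(abs(v) for v in row) / qmax or 1.0
        out.append([round(v / scale) * scale for v in row])
    return out


def transpose(mat):
    return [list(col) for col in zip(*mat)]


mat = [[1.0, 100.0],
       [3.0, 2.0]]

# Row-wise quantization vs. column-wise quantization (via transpose).
row_q = quantize_rows(mat)
col_q = transpose(quantize_rows(transpose(mat)))

# The scales differ per row vs. per column, so the two quantized tensors
# disagree: transposing row_q does NOT give the column-wise data.
print(row_q == col_q)  # False
```

With 2D (per-tensor) scaling the two results would coincide, which is why this reuse problem is specific to 1D-scaled recipes like MXFP8.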
What does this PR do?
main MR: #2239
(Merged) TE MR: NVIDIA/TransformerEngine#2055
Major changes:
Convergence Test
Comparison of convergence behavior across several configurations:
https://api.wandb.ai/links/nvidia/sga6o3t3