FP8 params support for megatron-fsdp (MXFP8/Blockwise)#2239
BoxiangW merged 4 commits into NVIDIA:main
Conversation
Is FP8 activations/grad support on Hopper FSDP with blockwise support on the roadmap as well? O-o
See the comment on the dev PR: #2086 (comment). Can we add a few simple unit tests?
return TE_VERSION > PkgVersion(vers)
def is_float8tensor(tensor: torch.Tensor) -> bool:

Suggested change:
- def is_float8tensor(tensor: torch.Tensor) -> bool:
+ def is_float8tensor(tensor: torch.Tensor) -> TypeGuard[FP8_TENSOR_CLASS]:
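For readers unfamiliar with the suggestion, here is a minimal, self-contained sketch of how a TypeGuard return annotation helps callers. The Float8Tensor import path and the FP8_TENSOR_CLASS alias below are assumptions for illustration (the exact path differs across Transformer Engine versions); this is not the PR's actual code.

    # Sketch of the reviewer's TypeGuard suggestion, not the PR's actual code.
    # Assumptions: the Float8Tensor import path and the FP8_TENSOR_CLASS alias.
    from typing import TypeGuard  # Python 3.10+; use typing_extensions on older versions

    import torch
    from transformer_engine.pytorch.tensor.float8_tensor import Float8Tensor

    FP8_TENSOR_CLASS = Float8Tensor


    def is_float8tensor(tensor: torch.Tensor) -> TypeGuard[Float8Tensor]:
        """Return True if `tensor` is a Transformer Engine FP8 tensor."""
        return isinstance(tensor, FP8_TENSOR_CLASS)


    def to_high_precision(tensor: torch.Tensor) -> torch.Tensor:
        if is_float8tensor(tensor):
            # TypeGuard narrows `tensor` to Float8Tensor in this branch, so
            # FP8-specific methods such as dequantize() type-check without casts.
            return tensor.dequantize()
        return tensor

Compared with a plain bool return type, the TypeGuard lets static type checkers narrow the tensor type inside the branch, which is the benefit the suggestion is after.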
Hi @Skylion007, do you mean FP8 activations + param support on Hopper? That has already been merged into the dev branch (PR #2086), and we'll look into merging it into the main branch when we have time.
/ok to test 42a6bdc
Signed-off-by: kunlunl <kunlunl@nvidia.com> Co-authored-by: jianbinc <shjwudp@gmail.com>
42a6bdc to feb6753
/ok to test feb6753
/ok to test 31624f7
Finally found time to prototype this backend implementation in fully_shard; I'm generally happy with this PR. I'll submit a follow-up PR directly to main that exposes FP8 parameter support to fully_shard, along with a unit test for it.
@shjwudp @kunlunl I do have a comment beyond the scope of this PR, though, pertaining to this fp8_model_init: https://github.com/NVIDIA/Megatron-LM/blob/dev/megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py#L3738
We should move this to mcore_fsdp_adapter.py during MegatronFSDP.__init__ so that both Megatron-LM and native Torch can use the same initialization pattern:
    # Construct toy model.
    with te.pytorch.quantized_model_init(
        enabled=True,
        recipe=fp8_recipe,
        # Needed for FP8 parameters with Megatron-FSDP.
        preserve_high_precision_init_val=True,
    ):
        toy_model = ToyTETransformer(
            model_dim=DIM_SIZE,
            num_heads=2,
            num_layers=NUM_LAYERS,
            output_dim=DIM_SIZE,
            device="meta",
        )

    # Fully shard the model.
    # NOTE: We do NOT need the quantized_model_init context manager for Megatron-FSDP,
    # because it has already set up the correct state during the root module FP8 init, I believe?
    mfsdp_model = fully_shard_model(
        module=toy_model,
        fsdp_unit_modules=[te.pytorch.TransformerLayer, te.pytorch.Linear],
        zero_dp_strategy=3,
        init_model_with_meta_device=True,
    )
This should not break Megatron-LM, right? (Testing...) I believe this also means we do not need an fp8_param_gather argument either!
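To make the proposal concrete, here is a rough sketch of what owning the FP8 init context inside the adapter could look like. The constructor signature, the fp8_recipe argument, and the _materialize_from_meta helper are all assumptions for illustration, not the actual mcore_fsdp_adapter.py code.

    # Illustrative sketch only: the constructor signature, fp8_recipe handling, and
    # _materialize_from_meta helper are assumptions, not the real mcore_fsdp_adapter.py.
    import torch
    import transformer_engine as te
    import transformer_engine.pytorch  # ensure te.pytorch is importable


    class MegatronFSDP(torch.nn.Module):
        def __init__(self, module: torch.nn.Module, fp8_recipe=None, **fsdp_kwargs):
            super().__init__()
            if fp8_recipe is not None:
                # Hypothetical: own the quantized-init context inside the adapter so
                # Megatron-LM and native Torch callers share one initialization pattern
                # and no longer need a separate fp8_param_gather-style argument.
                with te.pytorch.quantized_model_init(
                    enabled=True,
                    recipe=fp8_recipe,
                    # Keep a high-precision copy of the initial values so sharded
                    # FP8 params can be (re)quantized correctly afterwards.
                    preserve_high_precision_init_val=True,
                ):
                    module = self._materialize_from_meta(module)
            self.module = module

        def _materialize_from_meta(self, module: torch.nn.Module) -> torch.nn.Module:
            # Placeholder for materializing meta-device parameters on the real device.
            return module

Whether materializing meta-device parameters inside the context is sufficient for TE to register them as FP8 params is exactly the open question in the comment above; the sketch only shows where the context would live.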
/ok to test cba67e3
API compatibility check error is expected, just like in the dev PR, with the same violations.
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: kunlunl <kunlunl@nvidia.com> Signed-off-by: jianbinc <shjwudp@gmail.com> Co-authored-by: jianbinc <shjwudp@gmail.com> Co-authored-by: Cory Ye <44509866+cspades@users.noreply.github.com>
What does this PR do?
dev MR: #2086
Major changes:
Make Megatron-FSDP's all-gather (AG) pipeline support using different data-parallel buffers, because MXFP8 quantizes along different directions in the forward and backward passes.
Decouple the FP8-related logic from the main workflow and provide a unified abstraction to 1) operate on the raw data storage of different recipes and 2) create or discard the transpose cache for different recipes (a rough sketch of such an abstraction follows below).
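To illustrate the shape of that abstraction, here is a minimal sketch. All names (Fp8ParamShard, BlockwiseFp8Shard, and their methods) are hypothetical and are not the interface actually introduced by this PR.

    # Hypothetical sketch of a per-recipe FP8 parameter shard abstraction.
    # None of these names come from this PR; they only illustrate the idea of
    # "operate on raw storage" + "create/discard transpose cache" per recipe.
    from abc import ABC, abstractmethod

    import torch


    class Fp8ParamShard(ABC):
        """Uniform view over the raw storage of one FP8 parameter shard."""

        @abstractmethod
        def raw_data(self, forward: bool) -> torch.Tensor:
            """Return the byte buffer the all-gather pipeline should communicate.

            MXFP8 quantizes along different directions for the forward and
            backward passes, so the gathered buffer may differ between the two.
            """

        @abstractmethod
        def make_transpose_cache(self) -> None:
            """Create the transposed copy needed by the backward GEMM, if any."""

        @abstractmethod
        def discard_transpose_cache(self) -> None:
            """Free the transposed copy once it is no longer needed."""


    class BlockwiseFp8Shard(Fp8ParamShard):
        """Blockwise FP8 recipe: single quantized buffer plus a transpose cache."""

        def __init__(self, data: torch.Tensor):
            self._data = data      # quantized storage (e.g. uint8)
            self._data_t = None    # lazily created transpose cache

        def raw_data(self, forward: bool) -> torch.Tensor:
            # In this simplified sketch the same storage is gathered in both passes.
            return self._data

        def make_transpose_cache(self) -> None:
            if self._data_t is None:
                # Placeholder: a real recipe would requantize along the other axis.
                self._data_t = self._data.t().contiguous()

        def discard_transpose_cache(self) -> None:
            self._data_t = None

The point of such an interface is that the AG pipeline and optimizer code can stay recipe-agnostic, while each recipe (MXFP8, blockwise, delayed scaling, ...) decides which buffer to expose per pass and how to manage its transpose cache.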
Contribution process
Flowchart: Pre-checks → PR Tests → Code Review/Approval (Expert Review → Final Review) → Merge
Pre-checks
Code review
The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.
For MRs into `main` branch
(Step 1): Add PR label Expert Review
(Step 2): Collect the expert reviewers' reviews. Add the Expert Review label when your PR is ready for review. Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review. Add the Final Review label.
(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.
For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.
Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.