mixtral: drop training-branching hack for SFT segfault & add ZeRO-3 leaf utility #2185
The workaround that chose between `call_sparse_moe_op` (training) and `call_dynamic_moe_op` (inference) was introduced to avoid a segmentation fault during SFT training on earlier Synapse releases. (See PR huggingface#1798) The underlying bug is fixed in Synapse 1.21.0, so the hack is no longer needed. Replace the branching logic with the unified `torch.ops.hpu.mixture_of_experts` call for both training and inference, and remove the TODO comment.
This reverts commit f447155.
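The control-flow change in this commit can be pictured with a small stand-in sketch. The real kernels (`call_sparse_moe_op`, `call_dynamic_moe_op`, `torch.ops.hpu.mixture_of_experts`) need an HPU and their argument lists are not shown in this PR page, so the sketch below takes plain callables instead; the class names `BranchingMoeBlock` and `UnifiedMoeBlock` are hypothetical and only illustrate the shape of the code before and after.

```python
from typing import Callable, List

# Stand-in type for a MoE kernel. The real HPU ops take tensors and more
# arguments; this signature is an assumption made for illustration only.
Kernel = Callable[[List[float]], List[float]]


class BranchingMoeBlock:
    """Shape of the removed workaround: the kernel depends on self.training."""

    def __init__(self, sparse_op: Kernel, dynamic_op: Kernel) -> None:
        self.training = False
        self.sparse_op = sparse_op    # stand-in for call_sparse_moe_op
        self.dynamic_op = dynamic_op  # stand-in for call_dynamic_moe_op

    def forward(self, hidden_states: List[float]) -> List[float]:
        if self.training:
            return self.sparse_op(hidden_states)
        return self.dynamic_op(hidden_states)


class UnifiedMoeBlock:
    """Shape after this change: one kernel for training and inference."""

    def __init__(self, moe_op: Kernel) -> None:
        self.training = False
        self.moe_op = moe_op  # stand-in for torch.ops.hpu.mixture_of_experts

    def forward(self, hidden_states: List[float]) -> List[float]:
        # No dependence on self.training any more.
        return self.moe_op(hidden_states)
```

With the unified block, toggling `training` no longer changes which kernel runs, which is the behavior the commit message describes.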
- Introduced `apply_zero3_leaf_promotion` to mark model submodules as ZeRO-3 leaf modules
- The function is a no-op unless both:
  - `is_deepspeed_zero3_enabled=True` (caller asserts ZeRO-3 active)
  - `use_zero3_leaf_promotion=True` (user opt-in flag)
- Uses a registry-based approach for model-type-specific leaf class mapping
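A minimal sketch of how such a utility could look, assuming a name-based registry. This is not the code from the PR: the function name `apply_zero3_leaf_promotion_sketch`, the `model_type` and `set_leaf_modules` parameters, the `LEAF_CLASS_REGISTRY` dict, and the choice of `MixtralSparseMoeBlock` as the registered class are all assumptions made for illustration. The only real external call is DeepSpeed's `set_z3_leaf_modules(model, leaf_module_classes)`, imported lazily so the no-op path does not require DeepSpeed.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical registry: model type -> class names of submodules to promote.
# Which classes the PR actually registers is not shown; this entry is a guess.
LEAF_CLASS_REGISTRY: Dict[str, Tuple[str, ...]] = {
    "mixtral": ("MixtralSparseMoeBlock",),
}


def find_leaf_classes(model, model_type: str) -> List[type]:
    """Collect the distinct submodule classes registered for model_type."""
    wanted = set(LEAF_CLASS_REGISTRY.get(model_type, ()))
    found: List[type] = []
    # torch.nn.Module.modules() yields the module itself and all submodules.
    for module in model.modules():
        cls = type(module)
        if cls.__name__ in wanted and cls not in found:
            found.append(cls)
    return found


def apply_zero3_leaf_promotion_sketch(
    model,
    model_type: str,
    is_deepspeed_zero3_enabled: bool,
    use_zero3_leaf_promotion: bool,
    set_leaf_modules: Optional[Callable] = None,
) -> List[type]:
    """Return the promoted classes; do nothing unless both flags are True."""
    if not (is_deepspeed_zero3_enabled and use_zero3_leaf_promotion):
        return []
    leaf_classes = find_leaf_classes(model, model_type)
    if not leaf_classes:
        return []
    if set_leaf_modules is None:
        # Lazy import: only needed when promotion actually happens.
        from deepspeed.utils import set_z3_leaf_modules as set_leaf_modules
    set_leaf_modules(model, leaf_classes)
    return leaf_classes
```

Passing `set_leaf_modules` explicitly is only a testing convenience in this sketch; a real caller would leave it as `None` and let DeepSpeed do the marking.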
Replace inline DeepSpeed leaf-module patching with the new `optimum.habana.distributed.apply_zero3_leaf_promotion` utility. Activation is controlled by the existing `script_args` flag `use_zero3_leaf_promotion` together with the runtime ZeRO-3 status check.
- Enables ZeRO Stage 3 with overlap communication to support `torch.ops.hpu.mixture_of_experts`
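For orientation, a DeepSpeed config that turns on ZeRO Stage 3 with overlapped communication can be as small as the sketch below. It is not the content of the config file added in this PR, which is not shown here; the keys are standard DeepSpeed options, while the `bf16` setting and the `"auto"` placeholders (resolved by the Hugging Face Trainer integration) are assumptions about a typical setup.

```python
import json

# Minimal ZeRO-3 config sketch; values other than stage and overlap_comm
# are assumptions, not taken from the PR's config file.
zero3_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,            # partition parameters, gradients, optimizer state
        "overlap_comm": True,  # overlap communication with computation
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# Serialize to the JSON text that would be written to a config file.
config_text = json.dumps(zero3_config, indent=2)
```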
regisss
left a comment
Nice PR! I think it's worth adding a regression test in https://github.com/huggingface/optimum-habana/blob/main/tests/test_examples.py. You can use the same command you provided in this PR.
Thanks! It takes about 30 minutes to run, which is why I initially left it out. Please let me know if you'd like me to include it.
I think it's okay to include it. Worst case, I'll make it run fewer training steps later.
@regisss I added the test; I just need to double-check the reference numbers and then I will ping you. The G3 numbers look OK, I only need to fix G2. I also reduced `max_steps` so the test takes less time, and it now runs on 8 cards rather than 4.
@regisss The PR is ready for your review. The test commands are updated for 8 HPUs, so those and the reference numbers can be further optimized in the future. For now, I followed the steps outlined in the README to mimic the test setup. I also excluded perplexity due to the long runtime, but it can be added later if needed.
12cc695 to 491d626
491d626 to 48f9c0c
IlyasMoutawwakil
left a comment
LGTM! Thanks for iterating on this! I left one last nit, but it's not important.
…eaf utility (huggingface#2185) (huggingface#607) Co-authored-by: Yaser Afshar <yaser.afshar@intel.com>
What does this PR do?
1. Removes the temporary MOE-kernel workaround
   - Removes the `if self.training … else …` branch that selected different HPU MOE kernels.
   - Replaces `call_sparse_moe_op` and `call_dynamic_moe_op` with the single HPU-optimized `torch.ops.hpu.mixture_of_experts` after Synapse `1.21.0` fixed the segfault reported in PR "[1.20.0] Temporary workaround to avoid segmentation fault" #1798.
2. Adds reusable ZeRO-3 leaf-promotion utility
   - Adds `optimum/habana/distributed/zero3_utils.py` containing `apply_zero3_leaf_promotion(model)`. (See https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/utils/z3_leaf_module.py#L70)
   - Exposed through `optimum.habana.distributed`, so any training script must check that `deepspeed` is imported and ZeRO-3 is active before calling it.
3. Wires the utility into the `sft.py` script
4. Provides new ZeRO-3 config template for Mixtral
   - `examples/language-modeling/mixtral_zero3_config.json`, which enables ZeRO Stage 3 with overlap communication to support `torch.ops.hpu.mixture_of_experts`.

Tests:
main
this PR
📊 Training Performance Comparison
Before submitting