mixtral: drop training-branching hack for SFT segfault & add ZeRO-3 leaf utility by yafshar · Pull Request #2185 · huggingface/optimum-habana

yafshar · 2025-07-30T21:02:49Z

What does this PR do?

1. Removes the temporary MOE-kernel workaround

Deletes the TODO block containing the `if self.training … else …` branch that selected different HPU MOE kernels.
Replaces both `call_sparse_moe_op` and `call_dynamic_moe_op` with the single HPU-optimized
`torch.ops.hpu.mixture_of_experts`, now that Synapse 1.21.0 has fixed the segfault worked around in PR #1798 ("[1.20.0] Temporary workaround to avoid segmentation fault").
Keeps the post-kernel all-reduce for inference unchanged.
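
The control-flow change can be sketched roughly as follows. This is only an illustration: the three `*_stub` functions and the two `moe_forward_*` wrappers are hypothetical stand-ins that report which path ran. They are not the real HPU ops, whose signatures are not given in this description.

```python
from typing import List


def sparse_moe_stub(tokens: List[float]) -> str:
    """Hypothetical stand-in for the kernel formerly used while training."""
    return "sparse"


def dynamic_moe_stub(tokens: List[float]) -> str:
    """Hypothetical stand-in for the kernel formerly used at inference."""
    return "dynamic"


def unified_moe_stub(tokens: List[float]) -> str:
    """Hypothetical stand-in for the single kernel used after this PR."""
    return "unified"


def moe_forward_before(tokens: List[float], training: bool) -> str:
    # Old behaviour: the kernel depends on the module's training flag.
    if training:
        return sparse_moe_stub(tokens)
    return dynamic_moe_stub(tokens)


def moe_forward_after(tokens: List[float], training: bool) -> str:
    # New behaviour: the same kernel runs in both modes.
    return unified_moe_stub(tokens)


for mode in (True, False):
    print(mode, moe_forward_before([1.0], mode), moe_forward_after([1.0], mode))
```

The point of the sketch is that `training` no longer influences which kernel is selected, so training and inference exercise the same code path.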

2. Adds reusable ZeRO-3 leaf-promotion utility

Introduces optimum/habana/distributed/zero3_utils.py, which contains
`apply_zero3_leaf_promotion(model)`.
(See https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/utils/z3_leaf_module.py#L70)
Exports the helper from `optimum.habana.distributed` so that any training script can use it; the caller must first check that DeepSpeed is imported and ZeRO-3 is active.
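
A minimal sketch of how a registry-based leaf promotion with a caller-side guard might look. Everything here is illustrative: `ToyModel`, `Layer`, `ExpertBlock`, `LEAF_REGISTRY`, `promote_leaves`, and `maybe_promote` are hypothetical names, and `mark_leaf` is a stand-in callback for whatever actually marks a module as a ZeRO-3 leaf (in practice that logic lives in the linked DeepSpeed file). It is not the code in zero3_utils.py.

```python
from typing import Callable, Dict, Iterator, Tuple


class ExpertBlock:
    """Hypothetical stand-in for a model's MoE block class."""


class Layer:
    """Hypothetical decoder layer holding one MoE block."""

    def __init__(self) -> None:
        self.moe = ExpertBlock()


class ToyModel:
    """Stand-in model exposing modules(), as torch.nn.Module does."""

    model_type = "toy_moe"

    def __init__(self, num_layers: int = 2) -> None:
        self.layers = [Layer() for _ in range(num_layers)]

    def modules(self) -> Iterator[object]:
        yield self
        for layer in self.layers:
            yield layer
            yield layer.moe


# Hypothetical registry: model type -> classes to promote to ZeRO-3 leaves.
LEAF_REGISTRY: Dict[str, Tuple[type, ...]] = {"toy_moe": (ExpertBlock,)}


def promote_leaves(model, mark_leaf: Callable[[object], None]) -> int:
    """Mark every registered submodule as a leaf; return how many were marked."""
    leaf_classes = LEAF_REGISTRY.get(getattr(model, "model_type", ""), ())
    promoted = 0
    for module in model.modules():
        if leaf_classes and isinstance(module, leaf_classes):
            mark_leaf(module)
            promoted += 1
    return promoted


def maybe_promote(model, zero3_active: bool, opt_in: bool,
                  mark_leaf: Callable[[object], None]) -> int:
    """Caller-side guard: do nothing unless ZeRO-3 is active and the user opted in."""
    if not (zero3_active and opt_in):
        return 0
    return promote_leaves(model, mark_leaf)


marked = []
print(maybe_promote(ToyModel(), True, True, marked.append))
print(maybe_promote(ToyModel(), False, True, marked.append))
```

Keeping the guard in the caller means the utility itself needs no knowledge of how the script detects ZeRO-3, and model types missing from the registry are simply left untouched.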

3. Wires the utility into `sft.py` script

Updates the README.

4. Provides new ZeRO-3 config template for Mixtral

Adds examples/language-modeling/mixtral_ds_zero3_config.json.
This config enables ZeRO Stage 3 with overlap communication to support `torch.ops.hpu.mixture_of_experts`.
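
For orientation, a ZeRO Stage 3 config with communication overlap could look roughly like the fragment generated below. This is a sketch, not the contents of the file added in this PR: it assumes that "overlap communication" maps to DeepSpeed's `overlap_comm` option, and the `bf16` and `"auto"` entries are assumptions based on the usual Hugging Face Trainer integration.

```python
import json

# Assumed minimal ZeRO-3 configuration; the real template may set more options.
config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,            # partition parameters, gradients and optimizer states
        "overlap_comm": True,  # overlap communication with computation
    },
    # "auto" lets the Trainer fill these in from its own arguments.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

print(json.dumps(config, indent=2))
```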

Tests:

main

>>> PT_HPU_LAZY_MODE=1 PT_ENABLE_INT64_SUPPORT=1 python ../gaudi_spawn.py --world_size 4 --use_deepspeed sft.py \
--model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
--dataset_name "philschmid/dolly-15k-oai-style" \
--subset 'data/' \
--streaming False \
--deepspeed ../language-modeling/llama2_ds_zero3_config.json \
--output_dir="./model_mixtral" \
--do_train \
--max_steps=500 \
--logging_steps=10 \
--save_steps=100 \
--per_device_train_batch_size=2 \
--per_device_eval_batch_size=1 \
--gradient_accumulation_steps=2 \
--learning_rate=1e-4 \
--lr_scheduler_type="cosine" \
--warmup_steps=100 \
--weight_decay=0.05 \
--optim="paged_adamw_32bit" \
--lora_target_modules "q_proj" "v_proj" \
--bf16 \
--remove_unused_columns=False \
--max_seq_length 512 \
--run_name="sft_mixtral" \
--report_to=none \
--use_habana \
--use_lazy_mode


***** train metrics *****
  epoch                       =     1.3972
  max_memory_allocated (GB)   =      80.43
  memory_allocated (GB)       =      23.33
  total_flos                  =   107879GF
  total_memory_available (GB) =     126.54
  train_loss                  =     5.1121
  train_runtime               = 0:33:59.01
  train_samples_per_second    =      3.923
  train_steps_per_second      =      0.245

this PR

>>> PT_HPU_LAZY_MODE=1 PT_ENABLE_INT64_SUPPORT=1 python ../gaudi_spawn.py --world_size 4 --use_deepspeed sft.py \
--model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
--dataset_name "philschmid/dolly-15k-oai-style" \
--subset 'data/' \
--streaming False \
--deepspeed ../language-modeling/mixtral_ds_zero3_config.json \
--output_dir="./model_mixtral" \
--do_train \
--max_steps=500 \
--logging_steps=10 \
--save_steps=100 \
--per_device_train_batch_size=2 \
--per_device_eval_batch_size=1 \
--gradient_accumulation_steps=2 \
--learning_rate=1e-4 \
--lr_scheduler_type="cosine" \
--warmup_steps=100 \
--weight_decay=0.05 \
--optim="paged_adamw_32bit" \
--lora_target_modules "q_proj" "v_proj" \
--bf16 \
--remove_unused_columns=False \
--max_seq_length 512 \
--run_name="sft_mixtral" \
--report_to=none \
--use_habana \
--use_lazy_mode \
--use_zero3_leaf_promotion

***** train metrics *****
  epoch                       =     1.3972
  max_memory_allocated (GB)   =      46.23
  memory_allocated (GB)       =      23.49
  total_flos                  =   107879GF
  total_memory_available (GB) =     126.54
  train_loss                  =     5.1126
  train_runtime               = 0:21:16.90
  train_samples_per_second    =      6.265
  train_steps_per_second      =      0.392

📊 Training Performance Comparison

| Metric | Main | This PR | Improvement / Change |
|---|---|---|---|
| Train Runtime | 33 min 59 sec | 21 min 16 sec | ⬇️ ~37.4% shorter runtime |
| Train Samples/sec | 3.923 | 6.265 | ⬆️ ~59.7% increase |
| Train Steps/sec | 0.245 | 0.392 | ⬆️ ~60% increase |
| Max Memory Allocated (GB) | 80.43 | 46.23 | ⬇️ ~42.5% less memory usage |
| Train Loss | 5.1121 | 5.1126 | ⬆️ Negligible increase |
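
The relative changes can be rechecked from the raw train metrics reported above. The helper names `to_seconds` and `pct_change` are ad hoc and exist only for this check.

```python
def to_seconds(hms: str) -> float:
    """Convert an H:MM:SS.ss runtime string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Values copied from the two "train metrics" blocks.
MAIN = {
    "runtime_s": to_seconds("0:33:59.01"),
    "samples_per_s": 3.923,
    "steps_per_s": 0.245,
    "max_mem_gb": 80.43,
}
THIS_PR = {
    "runtime_s": to_seconds("0:21:16.90"),
    "samples_per_s": 6.265,
    "steps_per_s": 0.392,
    "max_mem_gb": 46.23,
}

changes = {name: pct_change(MAIN[name], THIS_PR[name]) for name in MAIN}
for name, change in changes.items():
    print(f"{name:>14}: {change:+.1f}%")
```

Negative values mean the metric went down (runtime, peak memory); positive values mean it went up (throughput).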

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

The workaround that chose between `call_sparse_moe_op` (training) and `call_dynamic_moe_op` (inference) was introduced to avoid a segmentation fault during SFT training on earlier Synapse releases. (See PR huggingface#1798) The underlying bug is fixed in Synapse 1.21.0, so the hack is no longer needed. Replace the branching logic with the unified `torch.ops.hpu.mixture_of_experts` call for both training and inference, and remove the TODO comment.

This reverts commit f447155.

- Introduced `apply_zero3_leaf_promotion` to mark model submodules as ZeRO-3 leaf modules
- The function is a no-op unless both:
  - `is_deepspeed_zero3_enabled=True` (caller asserts ZeRO-3 active)
  - `use_zero3_leaf_promotion=True` (user opt-in flag)
- Uses a registry-based approach for model-type-specific leaf class mapping

Replace inline DeepSpeed leaf-module patching with the new `optimum.habana.distributed.apply_zero3_leaf_promotion` utility. Activation is controlled by the existing script_args flag `use_zero3_leaf_promotion` together with the runtime ZeRO-3 status check.

- Enables ZeRO Stage 3 with overlap communication to support `torch.ops.hpu.mixture_of_experts`

regisss

Nice PR! I think it's worth adding a regression test in https://github.com/huggingface/optimum-habana/blob/main/tests/test_examples.py. You can use the same command you provided in this PR.

yafshar · 2025-08-11T12:10:36Z

> Nice PR! I think it's worth adding a regression test in https://github.com/huggingface/optimum-habana/blob/main/tests/test_examples.py. You can use the same command you provided in this PR.

Thanks! It takes about 30 minutes to run, which is why I initially left it out. Please let me know if you'd like me to include it.

regisss · 2025-08-11T12:14:43Z

> > Nice PR! I think it's worth adding a regression test in https://github.com/huggingface/optimum-habana/blob/main/tests/test_examples.py. You can use the same command you provided in this PR.
>
> Thanks! It takes about 30 minutes to run, which is why I initially left it out. Please let me know if you'd like me to include it.

I think it's okay to include it. Worst case, I'll make it run fewer training steps later.

yafshar · 2025-08-12T21:47:34Z

@regisss I added the test; I just need to double-check the reference numbers and then I will ping you. G3 looks OK, so I only need to fix G2. I also reduced max_steps and run the test on 8 cards rather than 4 so that it takes less time.

yafshar · 2025-08-13T16:14:45Z

@regisss The PR is ready for your review. The test commands are updated for 8 HPUs; both the commands and the reference numbers can be further optimized in the future. For now, I followed the steps outlined in the README to mimic the test setup. I also excluded perplexity due to the long runtime, but it can be added later if needed.

HuggingFaceDocBuilderDev · 2025-08-19T10:39:58Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

IlyasMoutawwakil

LGTM! Thanks for iterating on this! I left one last nit, but it's not important.

regisss

LGTM 🚀

yafshar added 8 commits July 23, 2025 10:25

Add recomp flag for lazy and torch.compile modes

f447155

Merge branch 'main' into mixtral/remove-sft-segfault-hack

8045f32

Revert "Add recomp flag for lazy and torch.compile modes"

95f366c

This reverts commit f447155.

feat(config): add DeepSpeed ZeRO-3 config

510fa78

- Enables ZeRO Stage 3 with overlap communication to support `torch.ops.hpu.mixture_of_experts`

Update the README

bfee838

yafshar requested a review from regisss as a code owner July 30, 2025 21:02

yafshar added 4 commits July 30, 2025 15:15

Merge branch 'main' into mixtral/remove-sft-segfault-hack

1b37314

Minor fix, update the zero config

d0ac1f5

Merge branch 'main' into mixtral/remove-sft-segfault-hack

016e200

Merge branch 'main' into mixtral/remove-sft-segfault-hack

b430a60

regisss approved these changes Aug 11, 2025

Adding a regression test

6649ae3

yafshar added 3 commits August 12, 2025 14:49

Merge branch 'main' into mixtral/remove-sft-segfault-hack

29cc3dd

Fix the env variable for sft-trl-mixtral

96110e8

Update reference for G2

743e60b

Merge branch 'main' into mixtral/remove-sft-segfault-hack

739853e

IlyasMoutawwakil reviewed Aug 19, 2025

Comment thread examples/trl/sft.py Outdated

IlyasMoutawwakil reviewed Aug 19, 2025

Comment thread examples/trl/sft.py Outdated

IlyasMoutawwakil reviewed Aug 19, 2025

Comment thread optimum/habana/distributed/zero3_utils.py Outdated

IlyasMoutawwakil reviewed Aug 19, 2025

Comment thread tests/test_examples.py Outdated

Rename ZeRO-3 availability flag

e699822

yafshar force-pushed the mixtral/remove-sft-segfault-hack branch from 12cc695 to 491d626 Compare August 19, 2025 12:24

Replace dynamic import with explicit class imports for clarity

48f9c0c

yafshar force-pushed the mixtral/remove-sft-segfault-hack branch from 491d626 to 48f9c0c Compare August 19, 2025 12:25

Avoid adding empty gaudi_config_name to cmd args

4ad5ae6

IlyasMoutawwakil reviewed Aug 19, 2025

Comment thread examples/trl/sft.py Outdated

Move ZeRO-3 leaf promotion check to caller

6b15860

IlyasMoutawwakil reviewed Aug 20, 2025

Comment thread examples/trl/README.md Outdated

IlyasMoutawwakil approved these changes Aug 20, 2025

yafshar added 2 commits August 20, 2025 07:51

Moved mixtral_ds_zero3_config.json to language-modeling folder

1771837

Correct the config path

3ed77d5

regisss approved these changes Aug 21, 2025

regisss merged commit d186356 into huggingface:main Aug 21, 2025
2 of 4 checks passed

yafshar deleted the mixtral/remove-sft-segfault-hack branch August 21, 2025 12:10

astachowiczhabana pushed a commit that referenced this pull request Aug 29, 2025

mixtral: drop training-branching hack for SFT segfault & add ZeRO-3 leaf utility (#2185)

c6482cf

astachowiczhabana pushed a commit that referenced this pull request Sep 17, 2025

mixtral: drop training-branching hack for SFT segfault & add ZeRO-3 leaf utility (#2185)

084e943

gplutop7 pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Oct 15, 2025

mixtral: drop training-branching hack for SFT segfault & add ZeRO-3 leaf utility (huggingface#2185) (huggingface#607)

859ccd5

Co-authored-by: Yaser Afshar <yaser.afshar@intel.com>

regisss merged 23 commits into huggingface:main from yafshar:mixtral/remove-sft-segfault-hack
