mixtral: drop training-branching hack for SFT segfault by yafshar · Pull Request #2169 · huggingface/optimum-habana

yafshar · 2025-07-23T17:30:21Z

What does this PR do?

What

Removes the temporary workaround that selected different MoE kernels depending on self.training.
As of Synapse 1.21.0, the segmentation fault that originally motivated the hack (see PR #1798) is resolved,
so a single HPU-optimized kernel can now be used for both training and inference.

Changes

Delete the TODO block and the if self.training … else … branch.
Replace both call_sparse_moe_op and call_dynamic_moe_op calls with torch.ops.hpu.mixture_of_experts, passing the weight lists (w1_list, w2_list, w3_list).
Move the all-reduce logic that runs only at inference time below the new kernel call.
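
For readers without the diff at hand, the resulting control flow can be sketched roughly as follows. This is a plain-Python illustration, not the code from the PR: `moe_forward`, `fake_kernel`, and `fake_all_reduce` are hypothetical stand-ins, the kernel callable only represents torch.ops.hpu.mixture_of_experts (its real arguments are not reproduced here), and the all-reduce callable represents whatever collective the model actually uses.

```python
from typing import Callable, List

Tensor = List[float]  # plain-Python stand-in for a tensor (assumption for illustration)


def moe_forward(
    hidden_states: Tensor,
    training: bool,
    fused_moe_kernel: Callable[[Tensor], Tensor],
    all_reduce: Callable[[Tensor], Tensor],
) -> Tensor:
    # One kernel call for both modes: no `if training ... else ...` branch.
    out = fused_moe_kernel(hidden_states)
    # The cross-rank reduction remains inference-only and runs after the kernel.
    if not training:
        out = all_reduce(out)
    return out


calls: List[str] = []


def fake_kernel(x: Tensor) -> Tensor:
    # Hypothetical kernel stub: records the call and doubles each value.
    calls.append("kernel")
    return [2.0 * v for v in x]


def fake_all_reduce(x: Tensor) -> Tensor:
    # Hypothetical single-rank all-reduce stub: records the call, returns input.
    calls.append("all_reduce")
    return x


train_out = moe_forward([1.0, 2.0], True, fake_kernel, fake_all_reduce)
calls_after_train = list(calls)
eval_out = moe_forward([1.0, 2.0], False, fake_kernel, fake_all_reduce)
```

In this sketch the training pass invokes only the kernel, while the inference pass invokes the same kernel followed by the reduction, which mirrors the three changes listed above.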

Tests:

>>> PT_HPU_LAZY_MODE=1 PT_ENABLE_INT64_SUPPORT=1 python ../gaudi_spawn.py --world_size 4 --use_deepspeed sft.py \
--model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
--dataset_name "philschmid/dolly-15k-oai-style" \
--subset 'data/' \
--streaming False \
--deepspeed ../language-modeling/llama2_ds_zero3_config.json \
--output_dir="./model_mixtral" \
--do_train \
--max_steps=500 \
--logging_steps=10 \
--save_steps=100 \
--per_device_train_batch_size=2 \
--per_device_eval_batch_size=1 \
--gradient_accumulation_steps=2 \
--learning_rate=1e-4 \
--lr_scheduler_type="cosine" \
--warmup_steps=100 \
--weight_decay=0.05 \
--optim="paged_adamw_32bit" \
--lora_target_modules "q_proj" "v_proj" \
--bf16 \
--remove_unused_columns=False \
--max_seq_length 512 \
--run_name="sft_mixtral" \
--report_to=none \
--use_habana \
--use_lazy_mode

main

***** train metrics *****
  epoch                       =     1.3972
  max_memory_allocated (GB)   =      80.43
  memory_allocated (GB)       =      23.33
  total_flos                  =   107879GF
  total_memory_available (GB) =     126.54
  train_loss                  =     5.1121
  train_runtime               = 0:33:59.01
  train_samples_per_second    =      3.923
  train_steps_per_second      =      0.245
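
As a sanity check, the throughput figures above can be recomputed from the flags in the test command. The sketch below assumes throughput is reported as total samples (or steps) divided by train_runtime, and that the effective batch is world_size × per_device_train_batch_size × gradient_accumulation_steps; both are assumptions about how the metrics were produced, not something stated in this PR.

```python
# Values taken from the SFT command and the reported metrics above.
world_size = 4
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
max_steps = 500

# train_runtime = 0:33:59.01 converted to seconds.
hours, minutes, seconds = 0, 33, 59.01
runtime_s = hours * 3600 + minutes * 60 + seconds

# Samples consumed per optimizer step across all devices (assumed definition).
samples_per_step = (
    world_size * per_device_train_batch_size * gradient_accumulation_steps
)
total_samples = max_steps * samples_per_step

samples_per_second = total_samples / runtime_s
steps_per_second = max_steps / runtime_s

print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Under those assumptions the recomputed values agree with the reported train_samples_per_second and train_steps_per_second to three decimal places.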

this PR

text-generation

>>> PT_HPU_LAZY_MODE=1 python3 run_generation.py \
--model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
--use_hpu_graphs \
--limit_hpu_graphs   \
--use_kv_cache \
--bucket_size 128 \
--max_new_tokens 1024  \
--max_input_tokens 2048  \
--batch_size 8 \
--bf16 \
--reuse_cache \
--bucket_internal \
--mlcommons_dataset <path to mlcommons dataset pickle file> \
--dataset_name mlcommons \
--n_iterations 1 \
--warmup 1 \
--output_dir .

main

this PR

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

The workaround that chose between `call_sparse_moe_op` (training) and `call_dynamic_moe_op` (inference) was introduced to avoid a segmentation fault during SFT training on earlier Synapse releases (see PR huggingface#1798). The underlying bug is fixed in Synapse 1.21.0, so the hack is no longer needed. Replace the branching logic with the unified `torch.ops.hpu.mixture_of_experts` call for both training and inference, and remove the TODO comment.

yafshar closed this Jul 23, 2025

yafshar wants to merge 1 commit into huggingface:main from yafshar:mixtral/remove-sft-segfault-hack