mixtral: drop training-branching hack for SFT segfault #2169

Closed
yafshar wants to merge 1 commit into huggingface:main from yafshar:mixtral/remove-sft-segfault-hack

yafshar commented Jul 23, 2025

What does this PR do?

What

Removes the temporary workaround that selected different MoE kernels depending on self.training.
The segmentation fault that originally motivated the hack (see PR #1798) is resolved as of Synapse 1.21.0,
so a single HPU-optimized kernel can now serve both training and inference.

Changes

  • Delete the TODO block and the if self.training … else … branch.
  • Replace both call_sparse_moe_op and call_dynamic_moe_op calls with torch.ops.hpu.mixture_of_experts, passing the weight lists (w1_list, w2_list, w3_list).
  • Move the all-reduce logic that runs only at inference time below the new kernel call.
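
The resulting control flow can be sketched roughly as below. This is a minimal illustration, not the code from this PR: `fake_moe_kernel`, `MoeBlockSketch`, and the `all_reduce` callback are hypothetical stand-ins so the snippet runs without an HPU, and the arithmetic inside the stub is not the real kernel math. In the actual change the kernel call is torch.ops.hpu.mixture_of_experts, whose full argument list is not reproduced here.

```python
from typing import Callable, List, Optional


def fake_moe_kernel(hidden_states, w1_list, w2_list, w3_list):
    """Hypothetical stand-in for the fused HPU MoE kernel (not the real math)."""
    # Collapse each expert's three weights into one scalar and sum over experts.
    scale = sum(w1 * w2 * w3 for w1, w2, w3 in zip(w1_list, w2_list, w3_list))
    return [h * scale for h in hidden_states]


class MoeBlockSketch:
    """Illustrative block: one kernel for both modes, all-reduce only at inference."""

    def __init__(
        self,
        w1_list: List[float],
        w2_list: List[float],
        w3_list: List[float],
        all_reduce: Optional[Callable[[List[float]], List[float]]] = None,
    ):
        self.w1_list = w1_list
        self.w2_list = w2_list
        self.w3_list = w3_list
        self.all_reduce = all_reduce  # hypothetical hook for the collective
        self.training = False

    def forward(self, hidden_states: List[float]) -> List[float]:
        # Single kernel call regardless of self.training (no if/else branch).
        out = fake_moe_kernel(
            hidden_states, self.w1_list, self.w2_list, self.w3_list
        )
        # The inference-only all-reduce now sits below the kernel call.
        if not self.training and self.all_reduce is not None:
            out = self.all_reduce(out)
        return out
```

The point of the sketch is only the ordering: the kernel call is unconditional, and the `self.training` check survives solely to gate the all-reduce.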

Tests:

>>> PT_HPU_LAZY_MODE=1 PT_ENABLE_INT64_SUPPORT=1 python ../gaudi_spawn.py --world_size 4 --use_deepspeed sft.py \
--model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
--dataset_name "philschmid/dolly-15k-oai-style" \
--subset 'data/' \
--streaming False \
--deepspeed ../language-modeling/llama2_ds_zero3_config.json \
--output_dir="./model_mixtral" \
--do_train \
--max_steps=500 \
--logging_steps=10 \
--save_steps=100 \
--per_device_train_batch_size=2 \
--per_device_eval_batch_size=1 \
--gradient_accumulation_steps=2 \
--learning_rate=1e-4 \
--lr_scheduler_type="cosine" \
--warmup_steps=100 \
--weight_decay=0.05 \
--optim="paged_adamw_32bit" \
--lora_target_modules "q_proj" "v_proj" \
--bf16 \
--remove_unused_columns=False \
--max_seq_length 512 \
--run_name="sft_mixtral" \
--report_to=none \
--use_habana \
--use_lazy_mode

main

***** train metrics *****
  epoch                       =     1.3972
  max_memory_allocated (GB)   =      80.43
  memory_allocated (GB)       =      23.33
  total_flos                  =   107879GF
  total_memory_available (GB) =     126.54
  train_loss                  =     5.1121
  train_runtime               = 0:33:59.01
  train_samples_per_second    =      3.923
  train_steps_per_second      =      0.245

this PR

text-generation

>>> PT_HPU_LAZY_MODE=1 python3 run_generation.py \
--model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
--use_hpu_graphs \
--limit_hpu_graphs   \
--use_kv_cache \
--bucket_size 128 \
--max_new_tokens 1024  \
--max_input_tokens 2048  \
--batch_size 8 \
--bf16 \
--reuse_cache \
--bucket_internal \
--mlcommons_dataset <path to mlcommons dataset pickle file> \
--dataset_name mlcommons \
--n_iterations 1 \
--warmup 1 \
--output_dir .

main

this PR

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

The workaround that chose between `call_sparse_moe_op` (training) and
`call_dynamic_moe_op` (inference) was introduced to avoid a segmentation
fault during SFT training on earlier Synapse releases (see PR
huggingface#1798). The underlying bug is fixed in Synapse 1.21.0, so the
hack is no longer needed.

Replace the branching logic with the unified
`torch.ops.hpu.mixture_of_experts` call for both training and
inference, and remove the TODO comment.

yafshar closed this Jul 23, 2025