[1.20.0] Temporary workaround to avoid segmentation fault by yafshar · Pull Request #1798 · huggingface/optimum-habana

yafshar · 2025-02-25T22:41:45Z

What does this PR do?

This is a temporary workaround to avoid a segmentation fault during SFT training.

- Added `call_sparse_moe_op` for training.
- Added conditional logic to use `call_sparse_moe_op` during training to prevent segmentation faults.
- TODO: This is a temporary solution. Remove this section after the issue is fixed.
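The conditional logic described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `MoeBlockSketch` is a hypothetical class, the two method bodies are placeholders that only tag their input, and `call_dynamic_moe_op` is assumed to be the inference path because the follow-up PR description names it that way.

```python
class MoeBlockSketch:
    """Hypothetical stand-in for a sparse MoE block (not the optimum-habana class)."""

    def __init__(self) -> None:
        # Mirrors the `training` flag that torch.nn.Module exposes.
        self.training = False

    def call_sparse_moe_op(self, hidden_states):
        # Placeholder for the path used during training.
        return ("sparse", hidden_states)

    def call_dynamic_moe_op(self, hidden_states):
        # Placeholder for the path assumed to be used for inference.
        return ("dynamic", hidden_states)

    def forward(self, hidden_states):
        # TODO: temporary workaround; remove once the segmentation fault is fixed.
        if self.training:
            return self.call_sparse_moe_op(hidden_states)
        return self.call_dynamic_moe_op(hidden_states)


block = MoeBlockSketch()
print(block.forward([1.0])[0])  # inference path
block.training = True
print(block.forward([1.0])[0])  # training path
```

Keeping the branch in a single place means the workaround can later be removed by deleting one `if` statement.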

Fixes # (issue)

The example below is taken from examples/trl, with fewer steps for faster reproduction.

main

>>> DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 4 --use_deepspeed sft.py   --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1     --dataset_name "philschmid/dolly-15k-oai-style"     --subset 'data/'  --streaming False     --deepspeed ../language-modeling/llama2_ds_zero3_config.json     --output_dir="./model_mixtral"     --do_train     --max_steps=50     --logging_steps=1     --save_steps=10     --per_device_train_batch_size=1     --per_device_eval_batch_size=1     --gradient_accumulation_steps=4     --learning_rate=1e-4     --lr_scheduler_type="cosine"     --warmup_steps=10     --weight_decay=0.05     --optim="paged_adamw_32bit"     --lora_target_modules "q_proj" "v_proj"     --bf16     --remove_unused_columns=False     --max_seq_length 512     --run_name="sft_mixtral"     --report_to=none     --use_habana     --use_lazy_mode

Internal Error: Received signal - Segmentation fault
Internal Error: Received signal - Segmentation fault
Internal Error: Received signal - Segmentation fault
Internal Error: Received signal - Segmentation fault

this PR

>>> DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 4 --use_deepspeed sft.py   --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1     --dataset_name "philschmid/dolly-15k-oai-style"     --subset 'data/'  --streaming False     --deepspeed ../language-modeling/llama2_ds_zero3_config.json     --output_dir="./model_mixtral"     --do_train     --max_steps=50     --logging_steps=1     --save_steps=10     --per_device_train_batch_size=1     --per_device_eval_batch_size=1     --gradient_accumulation_steps=4     --learning_rate=1e-4     --lr_scheduler_type="cosine"     --warmup_steps=10     --weight_decay=0.05     --optim="paged_adamw_32bit"     --lora_target_modules "q_proj" "v_proj"     --bf16     --remove_unused_columns=False     --max_seq_length 512     --run_name="sft_mixtral"     --report_to=none     --use_habana     --use_lazy_mode

***** train metrics *****
  epoch                       =       0.14
  max_memory_allocated (GB)   =      54.66
  memory_allocated (GB)       =      22.99
  total_flos                  =    10756GF
  total_memory_available (GB) =      94.62
  train_loss                  =     1.8162
  train_runtime               = 0:11:12.63
  train_samples_per_second    =      1.189
  train_steps_per_second      =      0.074

[2025-02-25 22:58:05,038] [INFO] [launch.py:351:main] Process 77734 exits successfully.
[2025-02-25 22:58:06,039] [INFO] [launch.py:351:main] Process 77733 exits successfully.
[2025-02-25 22:58:06,040] [INFO] [launch.py:351:main] Process 77732 exits successfully.
[2025-02-25 22:58:09,043] [INFO] [launch.py:351:main] Process 77731 exits successfully.
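As a sanity check, the reported throughput metrics can be recomputed from the command-line arguments. The sketch below assumes that the trainer derives `train_samples_per_second` as total samples divided by runtime, and that each optimizer step consumes one micro-batch per device per accumulation step. The variable names are illustrative.

```python
# Values copied from the command line and the train metrics above.
max_steps = 50
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
world_size = 4
train_runtime_s = 11 * 60 + 12.63  # train_runtime = 0:11:12.63

# Assumption: samples per optimizer step = micro-batch * accumulation steps * devices.
samples_per_step = (
    per_device_train_batch_size * gradient_accumulation_steps * world_size
)
total_samples = max_steps * samples_per_step

steps_per_second = max_steps / train_runtime_s
samples_per_second = total_samples / train_runtime_s

print(f"samples per optimizer step: {samples_per_step}")
print(f"train_steps_per_second:     {steps_per_second:.3f}")
print(f"train_samples_per_second:   {samples_per_second:.3f}")
```

Under these assumptions the results round to 0.074 steps per second and 1.189 samples per second, which agrees with the metrics in the log.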

Other cases tested

>>> QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py --model_name_or_path mistralai/Mixtral-8x7B-v0.1 --use_hpu_graphs --use_kv_cache --limit_hpu_graphs --bucket_size 128 --max_new_tokens 128 --batch_size 1 --bf16

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1.1: ('DeepSpeed is a machine learning framework that enables training of large models on a single machine with a single GPU. It is designed to be easy to use and efficient, and it can be used to train models on a variety of tasks.\n\n## Introduction\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with a single GPU. It is designed to be easy to use and efficient, and it can be used to train models on a variety of tasks.\n\n## What is DeepSpeed?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with a single GPU. It is designed',)


Stats:
-----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 62.832924872923535 tokens/second
Memory allocated                    = 87.83 GB
Max memory allocated                = 87.98 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 8.808805398002733 seconds

>>> QUANT_CONFIG=./quantization_config/maxabs_quant_mixtral.json python run_generation.py --model_name_or_path mistralai/Mixtral-8x7B-v0.1 --use_hpu_graphs --use_kv_cache --limit_hpu_graphs --bucket_size 128 --max_new_tokens 2048 --batch_size 16 --bf16

Stats:
----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 673.7834361721677 tokens/second
Memory allocated                    = 83.85 GB
Max memory allocated                = 88.14 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 256.92234525698586 seconds

>>> RUN_SLOW=1 GAUDI2_CI=1 pytest tests/test_text_generation_example.py -v -s -k "mistralai/Mixtral"

================= 4 passed, 59 deselected in 818.52s (0:13:38) =================

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

HuggingFaceDocBuilderDev · 2025-02-26T21:33:22Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

regisss

LGTM!

The workaround that chose between `call_sparse_moe_op` (training) and `call_dynamic_moe_op` (inference) was introduced in PR huggingface#1798 to avoid a segmentation fault during SFT training on earlier Synapse releases. The underlying bug is fixed in Synapse 1.21.0, so the hack is no longer needed. Replace the branching logic with the unified `torch.ops.hpu.mixture_of_experts` call for both training and inference, and remove the TODO comment.

yafshar changed the title ~~Temporary workaround to avoid segmentation fault~~ [1.20.0] Temporary workaround to avoid segmentation fault Feb 25, 2025

yafshar marked this pull request as ready for review February 25, 2025 23:33

yafshar requested a review from regisss as a code owner February 25, 2025 23:33

libinta added the run-test Run CI for PRs from external contributors label Feb 26, 2025

regisss approved these changes Feb 26, 2025

regisss merged commit 2691f25 into huggingface:main Feb 26, 2025

regisss pushed a commit that referenced this pull request Feb 26, 2025

Temporary workaround to avoid segmentation fault (#1798)

810ca45

yafshar deleted the mixtral_fix branch February 26, 2025 22:57

yafshar mentioned this pull request Jul 23, 2025

mixtral: drop training-branching hack for SFT segfault #2169

Closed

3 tasks

yafshar mentioned this pull request Jul 30, 2025

mixtral: drop training-branching hack for SFT segfault & add ZeRO-3 leaf utility #2185

Merged

3 tasks

regisss merged 1 commit into huggingface:main from yafshar:mixtral_fix
