
[1.20.0] Temporary workaround to avoid segmentation fault #1798

Merged: regisss merged 1 commit into huggingface:main from yafshar:mixtral_fix on Feb 26, 2025

@yafshar yafshar commented Feb 25, 2025

What does this PR do?

This is a temporary workaround to avoid a segmentation fault during SFT training.

  • Added call_sparse_moe_op for training
  • Added conditional logic to use call_sparse_moe_op during training to prevent segmentation faults.
  • TODO: This is a temporary solution. Remove this section after the issue is fixed.
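
To make the conditional logic above concrete, here is a minimal, hypothetical sketch of the dispatch. Only the names `call_sparse_moe_op` and `call_dynamic_moe_op` appear in this thread; the class, the method bodies and the return values are stand-ins for illustration and are not the optimum-habana implementation.

```python
# Hypothetical sketch: only the names call_sparse_moe_op and
# call_dynamic_moe_op come from this PR thread; the bodies are stand-ins.


class MoeBlockSketch:
    """Stand-in for a sparse-MoE block; not the optimum-habana class."""

    def __init__(self, training: bool) -> None:
        # Mirrors torch.nn.Module.training, which is True in train mode.
        self.training = training

    def call_sparse_moe_op(self, hidden_states):
        # Placeholder for the path this PR routes training through.
        return ("sparse", hidden_states)

    def call_dynamic_moe_op(self, hidden_states):
        # Placeholder for the path that stays in use for inference.
        return ("dynamic", hidden_states)

    def forward(self, hidden_states):
        # TODO (from the PR): temporary; remove once the segfault is fixed.
        if self.training:
            return self.call_sparse_moe_op(hidden_states)
        return self.call_dynamic_moe_op(hidden_states)


train_path, _ = MoeBlockSketch(training=True).forward([0.1, 0.2])
infer_path, _ = MoeBlockSketch(training=False).forward([0.1, 0.2])
print(train_path, infer_path)
```

Keying the branch on the module's training flag means inference keeps its existing path, so the workaround can later be dropped by deleting a single `if`.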

Fixes # (issue)

To reproduce faster, the runs below use the example from examples/trl with fewer steps.

On main:

>>> DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
    --world_size 4 --use_deepspeed sft.py \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --dataset_name "philschmid/dolly-15k-oai-style" \
    --subset 'data/' --streaming False \
    --deepspeed ../language-modeling/llama2_ds_zero3_config.json \
    --output_dir="./model_mixtral" \
    --do_train \
    --max_steps=50 \
    --logging_steps=1 \
    --save_steps=10 \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --gradient_accumulation_steps=4 \
    --learning_rate=1e-4 \
    --lr_scheduler_type="cosine" \
    --warmup_steps=10 \
    --weight_decay=0.05 \
    --optim="paged_adamw_32bit" \
    --lora_target_modules "q_proj" "v_proj" \
    --bf16 \
    --remove_unused_columns=False \
    --max_seq_length 512 \
    --run_name="sft_mixtral" \
    --report_to=none \
    --use_habana \
    --use_lazy_mode

Internal Error: Received signal - Segmentation fault
Internal Error: Received signal - Segmentation fault
Internal Error: Received signal - Segmentation fault
Internal Error: Received signal - Segmentation fault

With this PR:

>>> DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
    --world_size 4 --use_deepspeed sft.py \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --dataset_name "philschmid/dolly-15k-oai-style" \
    --subset 'data/' --streaming False \
    --deepspeed ../language-modeling/llama2_ds_zero3_config.json \
    --output_dir="./model_mixtral" \
    --do_train \
    --max_steps=50 \
    --logging_steps=1 \
    --save_steps=10 \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --gradient_accumulation_steps=4 \
    --learning_rate=1e-4 \
    --lr_scheduler_type="cosine" \
    --warmup_steps=10 \
    --weight_decay=0.05 \
    --optim="paged_adamw_32bit" \
    --lora_target_modules "q_proj" "v_proj" \
    --bf16 \
    --remove_unused_columns=False \
    --max_seq_length 512 \
    --run_name="sft_mixtral" \
    --report_to=none \
    --use_habana \
    --use_lazy_mode

***** train metrics *****
  epoch                       =       0.14
  max_memory_allocated (GB)   =      54.66
  memory_allocated (GB)       =      22.99
  total_flos                  =    10756GF
  total_memory_available (GB) =      94.62
  train_loss                  =     1.8162
  train_runtime               = 0:11:12.63
  train_samples_per_second    =      1.189
  train_steps_per_second      =      0.074

[2025-02-25 22:58:05,038] [INFO] [launch.py:351:main] Process 77734 exits successfully.
[2025-02-25 22:58:06,039] [INFO] [launch.py:351:main] Process 77733 exits successfully.
[2025-02-25 22:58:06,040] [INFO] [launch.py:351:main] Process 77732 exits successfully.
[2025-02-25 22:58:09,043] [INFO] [launch.py:351:main] Process 77731 exits successfully.
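
As a sanity check on the train metrics above, the reported rates can be re-derived from the command-line flags and `train_runtime`. This is a rough sketch: it assumes one optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps * world_size` samples, which is an assumption about how the trainer counts samples rather than something stated in this PR.

```python
# Values copied from the command line and the train metrics above.
max_steps = 50
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
world_size = 4

# train_runtime = 0:11:12.63, converted to seconds.
runtime_s = 11 * 60 + 12.63

# Assumption: samples consumed per optimizer step across all workers.
samples_per_step = (
    per_device_train_batch_size * gradient_accumulation_steps * world_size
)
total_samples = max_steps * samples_per_step

steps_per_second = max_steps / runtime_s
samples_per_second = total_samples / runtime_s
print(f"{steps_per_second:.3f} steps/s, {samples_per_second:.3f} samples/s")
# prints "0.074 steps/s, 1.189 samples/s"
```

Both values match the reported `train_steps_per_second` and `train_samples_per_second`, which suggests all 50 steps ran to completion across the 4 workers.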

Other cases tested

>>> QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py --model_name_or_path mistralai/Mixtral-8x7B-v0.1 --use_hpu_graphs --use_kv_cache --limit_hpu_graphs --bucket_size 128 --max_new_tokens 128 --batch_size 1 --bf16

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1.1: ('DeepSpeed is a machine learning framework that enables training of large models on a single machine with a single GPU. It is designed to be easy to use and efficient, and it can be used to train models on a variety of tasks.\n\n## Introduction\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with a single GPU. It is designed to be easy to use and efficient, and it can be used to train models on a variety of tasks.\n\n## What is DeepSpeed?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with a single GPU. It is designed',)


Stats:
-----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 62.832924872923535 tokens/second
Memory allocated                    = 87.83 GB
Max memory allocated                = 87.98 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 8.808805398002733 seconds

>>> QUANT_CONFIG=./quantization_config/maxabs_quant_mixtral.json python run_generation.py --model_name_or_path mistralai/Mixtral-8x7B-v0.1 --use_hpu_graphs --use_kv_cache --limit_hpu_graphs --bucket_size 128 --max_new_tokens 2048 --batch_size 16 --bf16

Stats:
----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 673.7834361721677 tokens/second
Memory allocated                    = 83.85 GB
Max memory allocated                = 88.14 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 256.92234525698586 seconds

>>> RUN_SLOW=1 GAUDI2_CI=1 pytest tests/test_text_generation_example.py -v -s -k "mistralai/Mixtral"

================= 4 passed, 59 deselected in 818.52s (0:13:38) =================

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@yafshar yafshar changed the title from "Temporary workaround to avoid segmentation fault" to "[1.20.0] Temporary workaround to avoid segmentation fault" Feb 25, 2025
@yafshar yafshar marked this pull request as ready for review February 25, 2025 23:33
@yafshar yafshar requested a review from regisss as a code owner February 25, 2025 23:33
@libinta libinta added the run-test Run CI for PRs from external contributors label Feb 26, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss regisss left a comment

LGTM!

@regisss regisss merged commit 2691f25 into huggingface:main Feb 26, 2025
@yafshar yafshar deleted the mixtral_fix branch February 26, 2025 22:57
yafshar added a commit to yafshar/optimum-habana that referenced this pull request Jul 23, 2025
The workaround that chose between `call_sparse_moe_op` (training) and
`call_dynamic_moe_op` (inference) was introduced to avoid a segmentation
fault during SFT training on earlier Synapse releases. (See PR huggingface#1798)
The underlying bug is fixed in Synapse 1.21.0, so the hack is no longer
needed.

Replace the branching logic with the unified
`torch.ops.hpu.mixture_of_experts` call for both training and
inference, and remove the TODO comment.
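
For illustration, the cleanup described in this commit message can be sketched as a forward pass with no branch on the training flag. Everything here is hypothetical: `mixture_of_experts_stub` merely stands in for `torch.ops.hpu.mixture_of_experts`, whose real signature is not shown in this thread, and the class is not the optimum-habana one.

```python
# Hypothetical sketch of the post-cleanup shape; the stub below stands in
# for torch.ops.hpu.mixture_of_experts and does not model its real signature.


def mixture_of_experts_stub(hidden_states):
    # Placeholder for the unified op used for training and inference alike.
    return ("unified", hidden_states)


class MoeBlockAfterCleanup:
    """Stand-in block; not the optimum-habana class."""

    def __init__(self, training: bool) -> None:
        self.training = training

    def forward(self, hidden_states):
        # No branch on self.training: both modes share one call.
        return mixture_of_experts_stub(hidden_states)


train_path, _ = MoeBlockAfterCleanup(training=True).forward([0.1])
infer_path, _ = MoeBlockAfterCleanup(training=False).forward([0.1])
print(train_path, infer_path)
```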