enable GPT-OSS #2214

Merged
regisss merged 12 commits into main from schoi/gpt_oss
Sep 17, 2025

Conversation

@schoi-habana
Collaborator

@schoi-habana schoi-habana commented Aug 21, 2025

Depends on #2209, since GPT-OSS was added in Hugging Face transformers 4.55.0.

Accuracy comparison to the baseline:

Hellaswag         Baseline (transformers)   This PR
acc               0.4153                    0.4161
acc_stderr        0.0049                    0.0049
acc_norm          0.5765                    0.5789
acc_norm_stderr   0.0049                    0.0049

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

fused RoPE not enabled yet
@github-actions

The code quality check failed, please run make style.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

For prefill, keep the original mask logic as much as possible.
For decode (with use_kv_cache), find the token index from attention_mask and mask tokens before (token_idx - sliding_window) as -inf.
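
The decode-path masking described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code merged in the PR: the function name `decode_sliding_window_mask`, the output shape, and the exact window boundary are assumptions made for the example.

```python
import torch


def decode_sliding_window_mask(attention_mask_2d: torch.Tensor, sliding_window: int) -> torch.Tensor:
    """Build an additive mask for a single decode step (hypothetical helper).

    attention_mask_2d: (batch, kv_len), 1 for filled positions, 0 otherwise.
    Returns a (batch, 1, 1, kv_len) float mask with 0 for visible tokens
    and -inf for masked ones.
    """
    kv_len = attention_mask_2d.shape[-1]
    # The cumulative sum stops growing after the last 1, and argmax returns
    # the first index of the maximum, i.e. the position of the last 1.
    token_idx = attention_mask_2d.cumsum(dim=1).argmax(dim=1, keepdim=True)
    positions = torch.arange(kv_len, device=attention_mask_2d.device).unsqueeze(0)
    # Tokens before (token_idx - sliding_window) fall outside the window.
    # Whether the boundary is inclusive is an assumption of this sketch.
    outside_window = positions < (token_idx - sliding_window)
    # Padding and not-yet-filled cache slots are masked as well.
    not_attendable = attention_mask_2d == 0
    mask = torch.zeros(attention_mask_2d.shape, dtype=torch.float32, device=attention_mask_2d.device)
    mask = mask.masked_fill(outside_window | not_attendable, float("-inf"))
    return mask[:, None, None, :]


attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 0, 0]])
mask = decode_sliding_window_mask(attention_mask, sliding_window=3)
visible = torch.isfinite(mask[0, 0, 0]).nonzero().flatten().tolist()
print(visible)  # [2, 3, 4, 5]
```

Here token_idx is 5, so positions 0 and 1 are masked by the window and positions 6 and 7 are masked because the attention mask is 0 there.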

@schoi-habana schoi-habana requested a review from mandy-li August 27, 2025 00:34
@schoi-habana schoi-habana marked this pull request as ready for review August 27, 2025 00:35
@imangohari1 imangohari1 mentioned this pull request Aug 28, 2025
3 tasks
Collaborator

@regisss regisss left a comment

Comment thread optimum/habana/transformers/modeling_utils.py Outdated
Comment thread optimum/habana/transformers/modeling_utils.py Outdated
Comment thread optimum/habana/transformers/models/gpt_oss/configuration_gpt_oss.py Outdated
# When sliding_window is not None, find the token_idx by checking the last idx of 1 in attention_mask_2d
if input_shape[-1] == 1:
    cumsum = attention_mask_2d.cumsum(dim=1)
    token_idx = cumsum.argmax(dim=1, keepdim=True)[0]
Contributor

Suggested change
- token_idx = cumsum.argmax(dim=1, keepdim=True)[0]
+ token_idx = cumsum.argmax(dim=1, keepdim=True)[0].item()

Extract the token index as an integer from the cumulative attention mask for later use in _make_causal_mask

Collaborator Author

This change causes a significant performance drop, so instead I updated the type hint from int to torch.Tensor:
token_idx: Optional[torch.Tensor] = None,
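
To illustrate the trade-off discussed in this thread, here is a small sketch of the two variants on a toy attention mask. A plausible reason for the slowdown (an assumption, not confirmed in this thread) is that `.item()` copies the value to the host, which forces a device synchronization on every decode step. The variable names `mask_from_tensor` and `mask_from_int` are made up for the example.

```python
import torch

attention_mask_2d = torch.tensor([[1, 1, 1, 1, 0, 0],
                                  [1, 1, 1, 1, 0, 0]])
cumsum = attention_mask_2d.cumsum(dim=1)  # each row: [1, 2, 3, 4, 4, 4]

# Variant kept in the PR: token_idx stays a tensor of shape (1,).
# Note that [0] takes the index from the first batch row only.
token_idx = cumsum.argmax(dim=1, keepdim=True)[0]

# Suggested variant: a Python int, obtained through a host copy.
token_idx_int = token_idx.item()

sliding_window = 2
positions = torch.arange(attention_mask_2d.shape[-1])

# Both forms broadcast the same way in the window comparison,
# so the tensor form does not need the conversion to int.
mask_from_tensor = positions < (token_idx - sliding_window)
mask_from_int = positions < (token_idx_int - sliding_window)

print(token_idx, token_idx_int)                   # tensor([3]) 3
print(torch.equal(mask_from_tensor, mask_from_int))  # True
```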

Comment thread optimum/habana/transformers/models/mistral/configuration_mistral.py
@schoi-habana
Collaborator Author

@regisss I added a basic functional test. Please let me know if you want more test cases!

Collaborator

@regisss regisss left a comment

Let's add GPT-OSS to the table in the README and in the docs, please.

Comment thread tests/test_text_generation_example.py Outdated
@pbielak
Collaborator

pbielak commented Sep 16, 2025

@schoi-habana What is the status of this PR? Is everything on your side done or do we still need some changes?

@schoi-habana
Collaborator Author

@pbielak It's ready. Please review.

Collaborator

@regisss regisss left a comment

LGTM!
I'll wait a bit before merging in case @pbielak has additional comments.

@pbielak
Collaborator

pbielak commented Sep 17, 2025

No additional comments from my side - please go ahead with the merge @regisss

@regisss
Collaborator

regisss commented Sep 17, 2025

I just pushed one more commit to fix the test name in test_text_generation_example.json, see #2262 for more context

@regisss regisss merged commit 9fffa78 into main Sep 17, 2025
6 of 8 checks passed
@regisss regisss deleted the schoi/gpt_oss branch September 17, 2025 09:30
astachowiczhabana pushed a commit that referenced this pull request Sep 22, 2025
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
astachowiczhabana pushed a commit that referenced this pull request Sep 23, 2025
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
astachowiczhabana pushed a commit that referenced this pull request Sep 25, 2025
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
gplutop7 pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Oct 15, 2025
Co-authored-by: Sun Choi <schoi@habana.ai>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>
gplutop7 pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Nov 6, 2025
Co-authored-by: Sun Choi <schoi@habana.ai>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>
@jaideepsai-narayan

jaideepsai-narayan commented Jan 27, 2026

Hi @regisss, does this PR only support the 20B model? I can run inference with the 20B model without any issues, but when I tried running the 120B model with gaudi_spawn.py I got the error below, which looks like an OOM.
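
As a rough sanity check on the failed allocation size reported in the log below (2123366400 bytes), the arithmetic can be sketched like this. The expert count and layer sizes are assumptions taken from the publicly posted gpt-oss-120b configuration, not from this thread, and the conclusion is only a hint about where to look, not a diagnosis.

```python
BYTES_PER_BF16 = 2

# Size reported by "PT_DEVMEM Allocation failed" in the log.
failed_alloc_bytes = 2_123_366_400
failed_alloc_mib = failed_alloc_bytes / 2**20
print(failed_alloc_mib)  # 2025.0

# Assumed model dimensions (not stated anywhere in this thread).
num_experts = 128
hidden_size = 2880
intermediate_size = 2880

# One down-projection tensor covering all experts of a single MoE layer.
down_proj_bytes = num_experts * hidden_size * intermediate_size * BYTES_PER_BF16
print(down_proj_bytes == failed_alloc_bytes)  # True

# If the experts were split evenly across the 8 ranks before the device copy,
# each rank would need only one eighth of that tensor.
world_size = 8
sharded_down_proj_bytes = down_proj_bytes // world_size
print(sharded_down_proj_bytes / 2**20)  # 253.125
```

Under these assumptions, the failing allocation has exactly the size of a full, unsharded expert tensor, which may indicate that each rank is moving all experts to the device rather than only its own share.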

Command used:

PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path unsloth/gpt-oss-120b-BF16 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --batch_size 1 --prompt "Hello world" "How are you?" --sdp_on_bf16 --parallel_strategy "ep"

Output LOG:

PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size
 8 run_generation.py --model_name_or_path unsloth/gpt-oss-120b-BF16 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample
 --batch_size 1 --prompt "Hello world" "How are you?" --sdp_on_bf16 --parallel_strategy "ep"
[WARNING|misc.py:214] 2026-01-27 06:23:26,231 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:26,386 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
DistributedRunner run(): command = deepspeed --num_nodes 1 --num_gpus 8 --no_local_rank --master_port 29500 run_generation.py --model_name_or_path unsloth/gpt-oss-120b-BF16 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --batch_size 1 --prompt "Hello world" "How are you?" --sdp_on_bf16 --parallel_strategy ep
[2026-01-27 06:23:29,126] [INFO] [real_accelerator.py:225:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2026-01-27 06:23:31,511] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2026-01-27 06:23:31,511] [INFO] [runner.py:607:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --no_local_rank --enable_each_rank_log=None run_generation.py --model_name_or_path unsloth/gpt-oss-120b-BF16 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --batch_size 1 --prompt Hello world How are you? --sdp_on_bf16 --parallel_strategy ep
[2026-01-27 06:23:34,270] [INFO] [real_accelerator.py:225:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2026-01-27 06:23:36,626] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2026-01-27 06:23:36,626] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0
[2026-01-27 06:23:36,626] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2026-01-27 06:23:36,626] [INFO] [launch.py:164:main] dist_world_size=8
[2026-01-27 06:23:36,626] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2026-01-27 06:23:36,627] [INFO] [launch.py:256:main] process 109953 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,627] [INFO] [launch.py:256:main] process 109954 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,627] [INFO] [launch.py:256:main] process 109955 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,628] [INFO] [launch.py:256:main] process 109956 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,628] [INFO] [launch.py:256:main] process 109957 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,628] [INFO] [launch.py:256:main] process 109958 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,628] [INFO] [launch.py:256:main] process 109959 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,629] [INFO] [launch.py:256:main] process 109960 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[WARNING|misc.py:214] 2026-01-27 06:23:41,067 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,069 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,071 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,080 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,152 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,154 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,156 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,222 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,252 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,284 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,326 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,348 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,360 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,372 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,408 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,417 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
Initializing conditional components...
Using HPU fused kernel for apply_rotary_pos_emb
Using HPU fused kernel for RMSNorm
Using HPU fused kernel for apply_rotary_pos_emb
Using HPU fused kernel for RMSNorm
Fetching 73 files: 100%|█████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 35145.11it/s]
Fetching 73 files: 100%|████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 396098.57it/s]
Fetching 73 files: 100%|████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 446984.22it/s]
Fetching 73 files: 100%|█████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 33065.25it/s]
Fetching 73 files: 100%|████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 254517.20it/s]
Fetching 73 files: 100%|█████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 39024.24it/s]
Fetching 73 files: 100%|█████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 23446.22it/s]
Fetching 73 files: 100%|████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 453606.21it/s]
Fetching 73 files: 100%|█████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 17383.00it/s]
01/27/2026 06:23:44 - INFO - __main__ - Multi-device ep run.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1303.58it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1339.31it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1369.61it/s]
Loading checkpoint shards:   0%|                                                                             | 0/73 [00:00<?, ?it/s]01/27/2026 06:23:45 - INFO - __main__ - Creating Model
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1442.36it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1328.48it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1438.96it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1438.35it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1405.39it/s]
============================= HPU PT BRIDGE CONFIGURATION ON RANK = 0 ============= 
 PT_HPU_LAZY_MODE = 1
 PT_HPU_RECIPE_CACHE_CONFIG = ,false,1024,false
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 0
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
 PT_HPU_ENABLE_LAZY_COLLECTIVES = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1763 GB
------------------------------------------------------------------------------
[rank4]: Traceback (most recent call last):
[rank4]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank4]:     main()
[rank4]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank4]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank4]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank4]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank4]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank4]:     model = model.eval().to(args.device)
[rank4]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank4]:     return super().to(*args, **kwargs)
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank4]:     return self._apply(convert)
[rank4]:            ^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank4]:     module._apply(fn)
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank4]:     module._apply(fn)
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank4]:     module._apply(fn)
[rank4]:   [Previous line repeated 2 more times]
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank4]:     param_applied = fn(param)
[rank4]:                     ^^^^^^^^^
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank4]:     return t.to(
[rank4]:            ^^^^^
[rank4]: RuntimeError: [Rank:4] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank6]: Traceback (most recent call last):
[rank6]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank6]:     main()
[rank6]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank6]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank6]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank6]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank6]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank6]:     model = model.eval().to(args.device)
[rank6]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank6]:     return super().to(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank6]:     return self._apply(convert)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank6]:     module._apply(fn)
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank6]:     module._apply(fn)
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank6]:     module._apply(fn)
[rank6]:   [Previous line repeated 2 more times]
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank6]:     param_applied = fn(param)
[rank6]:                     ^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank6]:     return t.to(
[rank6]:            ^^^^^
[rank6]: RuntimeError: [Rank:6] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank0]:     main()
[rank0]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank0]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank0]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank0]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank0]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank0]:     model = model.eval().to(args.device)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank0]:     return super().to(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank0]:     return self._apply(convert)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank0]:     module._apply(fn)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank0]:     module._apply(fn)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank0]:     module._apply(fn)
[rank0]:   [Previous line repeated 2 more times]
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank0]:     param_applied = fn(param)
[rank0]:                     ^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank0]:     return t.to(
[rank0]:            ^^^^^
[rank0]: RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank2]: Traceback (most recent call last):
[rank2]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank2]:     main()
[rank2]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank2]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank2]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank2]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank2]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank2]:     model = model.eval().to(args.device)
[rank2]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank2]:     return super().to(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank2]:     return self._apply(convert)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank2]:     module._apply(fn)
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank2]:     module._apply(fn)
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank2]:     module._apply(fn)
[rank2]:   [Previous line repeated 2 more times]
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank2]:     param_applied = fn(param)
[rank2]:                     ^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank2]:     return t.to(
[rank2]:            ^^^^^
[rank2]: RuntimeError: [Rank:2] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank7]: Traceback (most recent call last):
[rank7]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank7]:     main()
[rank7]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank7]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank7]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank7]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank7]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank7]:     model = model.eval().to(args.device)
[rank7]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank7]:     return super().to(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank7]:     return self._apply(convert)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank7]:     module._apply(fn)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank7]:     module._apply(fn)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank7]:     module._apply(fn)
[rank7]:   [Previous line repeated 2 more times]
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank7]:     param_applied = fn(param)
[rank7]:                     ^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank7]:     return t.to(
[rank7]:            ^^^^^
[rank7]: RuntimeError: [Rank:7] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank5]: Traceback (most recent call last):
[rank5]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank5]:     main()
[rank5]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank5]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank5]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank5]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank5]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank5]:     model = model.eval().to(args.device)
[rank5]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank5]:     return super().to(*args, **kwargs)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank5]:     return self._apply(convert)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank5]:     module._apply(fn)
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank5]:     module._apply(fn)
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank5]:     module._apply(fn)
[rank5]:   [Previous line repeated 2 more times]
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank5]:     param_applied = fn(param)
[rank5]:                     ^^^^^^^^^
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank5]:     return t.to(
[rank5]:            ^^^^^
[rank5]: RuntimeError: [Rank:5] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank3]: Traceback (most recent call last):
[rank3]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank3]:     main()
[rank3]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank3]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank3]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank3]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank3]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank3]:     model = model.eval().to(args.device)
[rank3]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank3]:     return super().to(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank3]:     return self._apply(convert)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank3]:     module._apply(fn)
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank3]:     module._apply(fn)
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank3]:     module._apply(fn)
[rank3]:   [Previous line repeated 2 more times]
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank3]:     param_applied = fn(param)
[rank3]:                     ^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank3]:     return t.to(
[rank3]:            ^^^^^
[rank3]: RuntimeError: [Rank:3] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank1]: Traceback (most recent call last):
[rank1]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank1]:     main()
[rank1]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank1]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank1]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank1]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank1]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank1]:     model = model.eval().to(args.device)
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank1]:     return super().to(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank1]:     return self._apply(convert)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank1]:     module._apply(fn)
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank1]:     module._apply(fn)
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank1]:     module._apply(fn)
[rank1]:   [Previous line repeated 2 more times]
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank1]:     param_applied = fn(param)
[rank1]:                     ^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank1]:     return t.to(
[rank1]:            ^^^^^
[rank1]: RuntimeError: [Rank:1] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[2026-01-27 06:24:16,634] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109953
[2026-01-27 06:24:16,772] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109954
[2026-01-27 06:24:16,868] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109955
[2026-01-27 06:24:16,869] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109956
[2026-01-27 06:24:16,869] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109957
[2026-01-27 06:24:16,870] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109958
[2026-01-27 06:24:16,871] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109959
[2026-01-27 06:24:16,872] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109960
[2026-01-27 06:24:16,873] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep'] exits with return code = 1
[ERROR|distributed_runner.py:222] 2026-01-27 06:24:18,236 >> deepspeed --num_nodes 1 --num_gpus 8 --no_local_rank --master_port 29500 run_generation.py --model_name_or_path unsloth/gpt-oss-120b-BF16 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --batch_size 1 --prompt "Hello world" "How are you?" --sdp_on_bf16 --parallel_strategy ep  exited with status = 1
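
As a rough aid for reading the failure above, the sketch below converts the reported allocation size to MiB and compares it with the size of a single BF16 tensor. The shape `(128, 2880, 2880)` is an assumption taken from the publicly published gpt-oss-120b configuration (128 experts, hidden and intermediate size 2880); it is not stated anywhere in this thread, and the helper names are hypothetical. If the assumption holds, the failing allocation may correspond to one full stacked expert weight, but this is an inference, not a confirmed diagnosis.

```python
# Size reported by the PT_DEVMEM error in the log above, in bytes.
FAILED_ALLOC_BYTES = 2_123_366_400

# ASSUMPTION: shape of one stacked expert projection in gpt-oss-120b
# (num_experts, hidden_size, intermediate_size). Not taken from this thread.
ASSUMED_EXPERT_WEIGHT_SHAPE = (128, 2880, 2880)


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 ** 2)


def tensor_bytes(shape, bytes_per_element: int = 2) -> int:
    """Bytes needed for a dense tensor; the default of 2 bytes matches BF16."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


if __name__ == "__main__":
    # The log prints "(2025)MB"; this shows the unit is MiB.
    print(f"failed allocation: {to_mib(FAILED_ALLOC_BYTES)} MiB")
    # Compare against the assumed BF16 expert tensor.
    same = tensor_bytes(ASSUMED_EXPERT_WEIGHT_SHAPE) == FAILED_ALLOC_BYTES
    print(f"matches assumed expert weight size: {same}")
```

The same two helpers can be reused to estimate how much device memory other tensors would need before calling `.to(device)`.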

@regisss
Collaborator

regisss commented Jan 27, 2026

@jaideepsai-narayan We don't test the 120B checkpoint in CI so I'm not sure it's supposed to work. Maybe @schoi-habana knows more about that?
I guess you're trying to run it on Gaudi3?

@jaideepsai-narayan

> @jaideepsai-narayan We don't test the 120B checkpoint in CI so I'm not sure it's supposed to work. Maybe @schoi-habana knows more about that? I guess you're trying to run it on Gaudi3?

Yes @regisss, we are trying to run it on Gaudi3.

@regisss
Collaborator

regisss commented Jan 27, 2026

@jaideepsai-narayan I just tried to run it on a Gaudi3 server with SynapseAI v1.22 and got the same error. I think the issue here is that the quantized checkpoints (which are actually the original checkpoints) rely on the mxfp4 data type, which is not supported by Gaudi3. And there don't seem to be any other quantized versions available.
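
One way to see which data type a checkpoint relies on is to look at the `quantization_config` entry of its `config.json`. The sketch below is a minimal illustration, assuming the usual Hugging Face layout where that entry carries a `quant_method` field; the helper name `quant_method` and both config fragments are hypothetical and are not copied from any real checkpoint.

```python
import json


def quant_method(config: dict):
    """Return the declared quantization method, or None if none is declared."""
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method")


# Hypothetical fragments for illustration only.
example_quantized = json.loads(
    '{"model_type": "gpt_oss", "quantization_config": {"quant_method": "mxfp4"}}'
)
example_unquantized = {"model_type": "gpt_oss"}

if __name__ == "__main__":
    print(quant_method(example_quantized))
    print(quant_method(example_unquantized))
```

In practice the dictionary would come from loading the checkpoint's own `config.json` with `json.load`, which makes it possible to check the declared method before attempting to load the weights on the device.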

@jaideepsai-narayan

Thank you so much @regisss. Do you have any timeline for when MXFP4 support for Gaudi3 (or compatible quantized checkpoints) will be implemented?
Is this planned for an upcoming release, and if so, for which version or with what ETA?

@regisss
Collaborator

regisss commented Jan 27, 2026

Unfortunately, I don't have any visibility into Gaudi's roadmap; maybe folks from Intel have more information. But I guess this is a hardware-related constraint, so I don't think Gaudi3 will ever be compatible with mxfp4...
There doesn't seem to be a GPTQ/AWQ checkpoint either: https://huggingface.co/openai/gpt-oss-120b/discussions/32
