enable GPT-OSS by schoi-habana · Pull Request #2214 · huggingface/optimum-habana

schoi-habana · 2025-08-21T06:13:34Z

Depends on #2209, since gpt-oss was added in Hugging Face transformers 4.55.0.

Accuracy comparison to the baseline:

HellaSwag	Baseline (transformers)	This PR
acc	0.4153	0.4161
acc_stderr	0.0049	0.0049
acc_norm	0.5765	0.5789
acc_norm_stderr	0.0049	0.0049

fused RoPE not enabled yet

github-actions · 2025-08-21T06:14:18Z

The code quality check failed, please run make style.

HuggingFaceDocBuilderDev · 2025-08-21T06:18:31Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

For prefill, keep the original as much as possible. For decode (+use_kv_cache), find the token idx from attention_mask and mask tokens before (token_idx - sliding_window) as -inf.
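The decode-time rule in this commit message can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the code from the PR: `decode_window_bias` is a hypothetical helper, the mask is a single 0/1 row rather than a batched tensor, and whether the position exactly at `token_idx - sliding_window` is kept or masked is an assumption (here it is kept).

```python
import math


def decode_window_bias(attention_mask_row, token_idx, sliding_window):
    """Additive attention bias for one decode step (hypothetical helper).

    0.0 keeps a key position, -inf hides it. A position is hidden when the
    attention mask marks it as invalid (0) or when it lies before
    token_idx - sliding_window.
    """
    window_start = token_idx - sliding_window
    return [
        0.0 if keep == 1 and pos >= window_start else -math.inf
        for pos, keep in enumerate(attention_mask_row)
    ]


# Six valid cache positions (0..5), two unused slots, window of 3:
bias = decode_window_bias([1, 1, 1, 1, 1, 1, 0, 0], token_idx=5, sliding_window=3)
# Positions 0 and 1 fall outside the window, positions 6 and 7 are unused.
```

Adding such a bias to the attention scores before the softmax gives the hidden positions zero weight, which is the usual reason for masking with -inf.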

regisss

@schoi-habana Let's add a test in https://github.com/huggingface/optimum-habana/blob/main/tests/test_text_generation_example.py too.

yafshar · 2025-09-06T11:30:57Z

+            # When sliding_window is not None, find the token_idx by checking the last idx of 1 in attention_mask_2d
+            if input_shape[-1] == 1:
+                cumsum = attention_mask_2d.cumsum(dim=1)
+                token_idx = cumsum.argmax(dim=1, keepdim=True)[0]


Suggested change

-                token_idx = cumsum.argmax(dim=1, keepdim=True)[0]
+                token_idx = cumsum.argmax(dim=1, keepdim=True)[0].item()

Extract the token index as an integer from the cumulative attention mask for later use in _make_causal_mask

This change causes a significant perf drop, so instead I updated the type hint from int to torch.Tensor:
token_idx: Optional[torch.Tensor] = None,
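For readers following this thread, here is a small pure-Python sketch of why a cumulative sum followed by argmax locates the last 1 in an attention-mask row. It assumes argmax returns the first maximal index on ties, as the torch.argmax documentation states; `last_one_index` is a hypothetical name, and plain lists stand in for the tensors used in the PR.

```python
from itertools import accumulate


def last_one_index(mask_row):
    """Index of the last 1 in a 0/1 mask row (hypothetical helper).

    The running sum stops growing after the last 1, so the first position
    where it reaches its maximum is exactly that last 1.
    """
    cumsum = list(accumulate(mask_row))
    # list.index returns the first occurrence, mirroring first-maximum argmax.
    return cumsum.index(max(cumsum))


right_padded = last_one_index([1, 1, 1, 0, 0])  # running sum: 1, 2, 3, 3, 3
left_padded = last_one_index([0, 0, 1, 1, 1])   # running sum: 0, 0, 1, 2, 3
```

Note that the quoted diff indexes the result with `[0]`, so it takes the value for the first row of the batch; the sketch handles one row at a time.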

schoi-habana · 2025-09-09T19:05:52Z

@regisss I added a basic functional test. Please let me know if you want to add more test cases!

regisss

Let's add GPT-OSS to the table in the README and in the docs, please.

…p() to prevent one large graph

pbielak · 2025-09-16T08:05:42Z

@schoi-habana What is the status of this PR? Is everything on your side done or do we still need some changes?

schoi-habana · 2025-09-16T16:52:59Z

@pbielak it's ready. please review

regisss

LGTM!
Waiting a bit before merging if @pbielak has additional comments.

pbielak · 2025-09-17T09:10:07Z

No additional comments from my side - please go ahead with the merge @regisss

regisss · 2025-09-17T09:19:01Z

I just pushed one more commit to fix the test name in test_text_generation_example.json, see #2262 for more context

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

Co-authored-by: Sun Choi <schoi@habana.ai> Co-authored-by: regisss <15324346+regisss@users.noreply.github.com> Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>

jaideepsai-narayan · 2026-01-27T06:46:38Z

Hi @regisss, does this PR only support the 20B model? I am able to run inference with the 20B model without any issues, but when I tried running the 120B model with gaudi_spawn.py I got the error below, which looks like an OOM.
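As a rough sanity check on the OOM suspicion, the arithmetic below converts the allocation size reported in the log further down and estimates the BF16 weight footprint. It is a back-of-the-envelope sketch only: the 120e9 parameter count is a nominal figure taken from the model name, the even 8-way split is an idealized assumption, and KV cache, activations and workspace memory are ignored. Both helper names are hypothetical.

```python
def bytes_to_mib(num_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 ** 2)


def weight_footprint_gib(num_params, bytes_per_param=2, num_shards=1):
    """Weights-only memory estimate in GiB, assuming an even split across shards."""
    return num_params * bytes_per_param / num_shards / (1024 ** 3)


# Allocation size from the RuntimeError lines in the log.
failed_alloc_mib = bytes_to_mib(2_123_366_400)

# Nominal 120e9 parameters at 2 bytes each (BF16); an assumption, not an exact count.
total_gib = weight_footprint_gib(120e9)
per_device_gib = weight_footprint_gib(120e9, num_shards=8)

print(f"failed allocation: {failed_alloc_mib:.0f} MiB")
print(f"weights, unsharded: {total_gib:.1f} GiB; split 8 ways: {per_device_gib:.1f} GiB")
```

Comparing these estimates with the memory available on each device is one way to judge whether the weights are actually being split across the 8 ranks before being moved to the device.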

Command used:

PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path unsloth/gpt-oss-120b-BF16 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --batch_size 1 --prompt "Hello world" "How are you?" --sdp_on_bf16 --parallel_strategy "ep"

Output LOG:

PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path unsloth/gpt-oss-120b-BF16 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --batch_size 1 --prompt "Hello world" "How are you?" --sdp_on_bf16 --parallel_strategy "ep"
[WARNING|misc.py:214] 2026-01-27 06:23:26,231 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:26,386 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
DistributedRunner run(): command = deepspeed --num_nodes 1 --num_gpus 8 --no_local_rank --master_port 29500 run_generation.py --model_name_or_path unsloth/gpt-oss-120b-BF16 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --batch_size 1 --prompt "Hello world" "How are you?" --sdp_on_bf16 --parallel_strategy ep
[2026-01-27 06:23:29,126] [INFO] [real_accelerator.py:225:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2026-01-27 06:23:31,511] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2026-01-27 06:23:31,511] [INFO] [runner.py:607:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --no_local_rank --enable_each_rank_log=None run_generation.py --model_name_or_path unsloth/gpt-oss-120b-BF16 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --batch_size 1 --prompt Hello world How are you? --sdp_on_bf16 --parallel_strategy ep
[2026-01-27 06:23:34,270] [INFO] [real_accelerator.py:225:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2026-01-27 06:23:36,626] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2026-01-27 06:23:36,626] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0
[2026-01-27 06:23:36,626] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2026-01-27 06:23:36,626] [INFO] [launch.py:164:main] dist_world_size=8
[2026-01-27 06:23:36,626] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2026-01-27 06:23:36,627] [INFO] [launch.py:256:main] process 109953 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,627] [INFO] [launch.py:256:main] process 109954 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,627] [INFO] [launch.py:256:main] process 109955 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,628] [INFO] [launch.py:256:main] process 109956 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,628] [INFO] [launch.py:256:main] process 109957 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,628] [INFO] [launch.py:256:main] process 109958 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,628] [INFO] [launch.py:256:main] process 109959 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[2026-01-27 06:23:36,629] [INFO] [launch.py:256:main] process 109960 spawned with command: ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep']
[WARNING|misc.py:214] 2026-01-27 06:23:41,067 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,069 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,071 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,080 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,152 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,154 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,156 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:214] 2026-01-27 06:23:41,222 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but habana-frameworks v1.23.0.695 was found, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,252 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,284 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,326 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,348 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,360 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,372 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,408 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
[WARNING|misc.py:227] 2026-01-27 06:23:41,417 >> optimum-habana v1.21.0.dev0 has been validated for SynapseAI v1.22.0 but the driver version is v1.23.0, this could lead to undefined behavior!
Initializing conditional components...
Using HPU fused kernel for apply_rotary_pos_emb
Using HPU fused kernel for RMSNorm
Using HPU fused kernel for apply_rotary_pos_emb
Using HPU fused kernel for RMSNorm
Fetching 73 files: 100%|█████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 35145.11it/s]
Fetching 73 files: 100%|████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 396098.57it/s]
Fetching 73 files: 100%|████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 446984.22it/s]
Fetching 73 files: 100%|█████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 33065.25it/s]
Fetching 73 files: 100%|████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 254517.20it/s]
Fetching 73 files: 100%|█████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 39024.24it/s]
Fetching 73 files: 100%|█████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 23446.22it/s]
Fetching 73 files: 100%|████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 453606.21it/s]
Fetching 73 files: 100%|█████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 17383.00it/s]
01/27/2026 06:23:44 - INFO - __main__ - Multi-device ep run.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1303.58it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1339.31it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1369.61it/s]
Loading checkpoint shards:   0%|                                                                             | 0/73 [00:00<?, ?it/s]01/27/2026 06:23:45 - INFO - __main__ - Creating Model
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1442.36it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1328.48it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1438.96it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1438.35it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 1405.39it/s]
============================= HPU PT BRIDGE CONFIGURATION ON RANK = 0 ============= 
 PT_HPU_LAZY_MODE = 1
 PT_HPU_RECIPE_CACHE_CONFIG = ,false,1024,false
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 0
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
 PT_HPU_ENABLE_LAZY_COLLECTIVES = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1763 GB
------------------------------------------------------------------------------
[rank4]: Traceback (most recent call last):
[rank4]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank4]:     main()
[rank4]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank4]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank4]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank4]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank4]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank4]:     model = model.eval().to(args.device)
[rank4]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank4]:     return super().to(*args, **kwargs)
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank4]:     return self._apply(convert)
[rank4]:            ^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank4]:     module._apply(fn)
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank4]:     module._apply(fn)
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank4]:     module._apply(fn)
[rank4]:   [Previous line repeated 2 more times]
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank4]:     param_applied = fn(param)
[rank4]:                     ^^^^^^^^^
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank4]:     return t.to(
[rank4]:            ^^^^^
[rank4]: RuntimeError: [Rank:4] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank6]: Traceback (most recent call last):
[rank6]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank6]:     main()
[rank6]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank6]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank6]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank6]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank6]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank6]:     model = model.eval().to(args.device)
[rank6]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank6]:     return super().to(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank6]:     return self._apply(convert)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank6]:     module._apply(fn)
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank6]:     module._apply(fn)
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank6]:     module._apply(fn)
[rank6]:   [Previous line repeated 2 more times]
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank6]:     param_applied = fn(param)
[rank6]:                     ^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank6]:     return t.to(
[rank6]:            ^^^^^
[rank6]: RuntimeError: [Rank:6] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank0]:     main()
[rank0]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank0]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank0]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank0]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank0]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank0]:     model = model.eval().to(args.device)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank0]:     return super().to(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank0]:     return self._apply(convert)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank0]:     module._apply(fn)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank0]:     module._apply(fn)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank0]:     module._apply(fn)
[rank0]:   [Previous line repeated 2 more times]
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank0]:     param_applied = fn(param)
[rank0]:                     ^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank0]:     return t.to(
[rank0]:            ^^^^^
[rank0]: RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank2]: Traceback (most recent call last):
[rank2]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank2]:     main()
[rank2]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank2]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank2]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank2]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank2]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank2]:     model = model.eval().to(args.device)
[rank2]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank2]:     return super().to(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank2]:     return self._apply(convert)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank2]:     module._apply(fn)
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank2]:     module._apply(fn)
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank2]:     module._apply(fn)
[rank2]:   [Previous line repeated 2 more times]
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank2]:     param_applied = fn(param)
[rank2]:                     ^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank2]:     return t.to(
[rank2]:            ^^^^^
[rank2]: RuntimeError: [Rank:2] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank7]: Traceback (most recent call last):
[rank7]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank7]:     main()
[rank7]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank7]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank7]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank7]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank7]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank7]:     model = model.eval().to(args.device)
[rank7]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank7]:     return super().to(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank7]:     return self._apply(convert)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank7]:     module._apply(fn)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank7]:     module._apply(fn)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank7]:     module._apply(fn)
[rank7]:   [Previous line repeated 2 more times]
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank7]:     param_applied = fn(param)
[rank7]:                     ^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank7]:     return t.to(
[rank7]:            ^^^^^
[rank7]: RuntimeError: [Rank:7] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank5]: Traceback (most recent call last):
[rank5]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank5]:     main()
[rank5]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank5]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank5]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank5]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank5]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank5]:     model = model.eval().to(args.device)
[rank5]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank5]:     return super().to(*args, **kwargs)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank5]:     return self._apply(convert)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank5]:     module._apply(fn)
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank5]:     module._apply(fn)
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank5]:     module._apply(fn)
[rank5]:   [Previous line repeated 2 more times]
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank5]:     param_applied = fn(param)
[rank5]:                     ^^^^^^^^^
[rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank5]:     return t.to(
[rank5]:            ^^^^^
[rank5]: RuntimeError: [Rank:5] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank3]: Traceback (most recent call last):
[rank3]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank3]:     main()
[rank3]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank3]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank3]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank3]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank3]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank3]:     model = model.eval().to(args.device)
[rank3]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank3]:     return super().to(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank3]:     return self._apply(convert)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank3]:     module._apply(fn)
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank3]:     module._apply(fn)
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank3]:     module._apply(fn)
[rank3]:   [Previous line repeated 2 more times]
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank3]:     param_applied = fn(param)
[rank3]:                     ^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank3]:     return t.to(
[rank3]:            ^^^^^
[rank3]: RuntimeError: [Rank:3] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[rank1]: Traceback (most recent call last):
[rank1]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 1161, in <module>
[rank1]:     main()
[rank1]:   File "/root/optimum-habana/examples/text-generation/run_generation.py", line 530, in main
[rank1]:     model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
[rank1]:                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 824, in initialize_model
[rank1]:     else setup_distributed_model_ep(args, model_dtype, model_kwargs, logger)
[rank1]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/optimum-habana/examples/text-generation/utils.py", line 476, in setup_distributed_model_ep
[rank1]:     model = model.eval().to(args.device)
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py", line 4346, in to
[rank1]:     return super().to(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1374, in to
[rank1]:     return self._apply(convert)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank1]:     module._apply(fn)
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank1]:     module._apply(fn)
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 930, in _apply
[rank1]:     module._apply(fn)
[rank1]:   [Previous line repeated 2 more times]
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 957, in _apply
[rank1]:     param_applied = fn(param)
[rank1]:                     ^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1360, in convert
[rank1]:     return t.to(
[rank1]:            ^^^^^
[rank1]: RuntimeError: [Rank:1] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2123366400 (2025)MB
[2026-01-27 06:24:16,634] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109953
[2026-01-27 06:24:16,772] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109954
[2026-01-27 06:24:16,868] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109955
[2026-01-27 06:24:16,869] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109956
[2026-01-27 06:24:16,869] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109957
[2026-01-27 06:24:16,870] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109958
[2026-01-27 06:24:16,871] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109959
[2026-01-27 06:24:16,872] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 109960
[2026-01-27 06:24:16,873] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', 'unsloth/gpt-oss-120b-BF16', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100', '--do_sample', '--batch_size', '1', '--prompt', 'Hello world', 'How are you?', '--sdp_on_bf16', '--parallel_strategy', 'ep'] exits with return code = 1
[ERROR|distributed_runner.py:222] 2026-01-27 06:24:18,236 >> deepspeed --num_nodes 1 --num_gpus 8 --no_local_rank --master_port 29500 run_generation.py --model_name_or_path unsloth/gpt-oss-120b-BF16 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --batch_size 1 --prompt "Hello world" "How are you?" --sdp_on_bf16 --parallel_strategy ep  exited with status = 1
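
The failed allocation size reported in the log above can be sanity-checked with a little arithmetic. The sketch below converts the byte count to MiB and, as an assumption that is not stated anywhere in this thread, compares it with the size of a single BF16 tensor of shape (128, 2880, 2880). That is what one per-layer expert projection would occupy if the checkpoint uses 128 experts with hidden and intermediate sizes of 2880. The shape values and variable names are illustrative guesses, not something confirmed by the PR.

```python
# Size reported by the PT_DEVMEM error in the log above.
failed_alloc_bytes = 2_123_366_400

# The log prints the size in units of 1024 * 1024 bytes.
failed_alloc_mib = failed_alloc_bytes / (1024 * 1024)
print(f"failed allocation: {failed_alloc_mib:.0f} MiB")

# ASSUMPTION (not from the thread): one MoE weight tensor holding all experts
# of a layer, with shape (num_experts, hidden_size, intermediate_size) in BF16.
num_experts = 128          # hypothetical value
hidden_size = 2880         # hypothetical value
intermediate_size = 2880   # hypothetical value
bytes_per_bf16 = 2         # bfloat16 is a 16-bit type

assumed_tensor_bytes = num_experts * hidden_size * intermediate_size * bytes_per_bf16
print("matches assumed tensor size:", assumed_tensor_bytes == failed_alloc_bytes)
```

If the assumption holds, the failing call would be moving one full, unsharded expert weight to the device, which is at least consistent with the traceback ending in `model.eval().to(args.device)`.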

regisss · 2026-01-27T08:20:56Z

@jaideepsai-narayan We don't test the 120B checkpoint in CI so I'm not sure it's supposed to work. Maybe @schoi-habana knows more about that?
I guess you're trying to run it on Gaudi3?

jaideepsai-narayan · 2026-01-27T08:43:59Z

> @jaideepsai-narayan We don't test the 120B checkpoint in CI so I'm not sure it's supposed to work. Maybe @schoi-habana knows more about that? I guess you're trying to run it on Gaudi3?

Yes @regisss, we are trying to run it on Gaudi3.

regisss · 2026-01-27T14:21:13Z

@jaideepsai-narayan I just tried to run it on a Gaudi3 server with SynapseAI v1.22 and got the same error. I think the issue here is that the quantized checkpoints (which actually are the original checkpoints) rely on the mxfp4 data type, which is not supported by Gaudi3. And there don't seem to be any other quantized versions available.

jaideepsai-narayan · 2026-01-27T14:25:42Z

Thank you so much @regisss. Do you have any timeline for when MXFP4 support for Gaudi3 (or compatible quantized checkpoints) will be implemented?
Is this something planned for an upcoming release, and if so, for which version or with what ETA?

regisss · 2026-01-27T14:34:10Z

Unfortunately, I don't have any visibility into Gaudi's roadmap; maybe folks from Intel have more information. But I guess this is a hardware-related constraint, so I don't think Gaudi3 will ever be compatible with mxfp4...
There doesn't seem to be a GPTQ/AWQ checkpoint either: https://huggingface.co/openai/gpt-oss-120b/discussions/32

working version

f25c71c

fused RoPE not enabled yet

clean up the masking with sliding window attention

fdd5205

for prefill, keep the original as much as possible; for decode (+use_kv_cache), find the token idx from attention_mask and mask tokens before (token_idx - sliding_window) as -inf
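
The decode-time masking described in this commit message can be sketched in a few lines. The example below is a framework-free illustration, not the code from the PR: the helper name `decode_sliding_window_mask` is hypothetical, and the exact window boundary (positions strictly before `token_idx - sliding_window` are masked) is an assumption read off the wording of the message.

```python
import math

def decode_sliding_window_mask(attention_mask_row, sliding_window):
    """Build an additive mask for one decode step of one sequence.

    attention_mask_row: list of 0/1 flags over the KV cache positions.
    sliding_window: number of positions to look back (assumed boundary).
    """
    # Index of the last position whose flag is 1, i.e. the current token.
    token_idx = max(i for i, flag in enumerate(attention_mask_row) if flag == 1)
    window_start = token_idx - sliding_window

    mask = []
    for pos, flag in enumerate(attention_mask_row):
        # Mask padding and everything before the window start.
        visible = flag == 1 and pos >= window_start
        mask.append(0.0 if visible else -math.inf)
    return token_idx, mask

token_idx, mask = decode_sliding_window_mask([1, 1, 1, 1, 1, 1, 0, 0], sliding_window=3)
print(token_idx)
print(mask)
```

Adding such a mask to the attention scores before the softmax drives the masked positions to zero weight, so only the tokens inside the window contribute to the current step.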

remove unnecessary code

46c1475

schoi-habana force-pushed the schoi/gpt_oss branch from b4d368e to d611aef Compare August 27, 2025 00:30

schoi-habana force-pushed the schoi/gpt_oss branch from d611aef to 3e88b29 Compare August 27, 2025 00:32

code cleaned

97725a2

schoi-habana force-pushed the schoi/gpt_oss branch from 3e88b29 to 97725a2 Compare August 27, 2025 00:33

schoi-habana requested a review from mandy-li August 27, 2025 00:34

schoi-habana marked this pull request as ready for review August 27, 2025 00:35

schoi-habana requested review from regisss and vivekgoe as code owners August 27, 2025 00:35

imangohari1 mentioned this pull request Aug 28, 2025

Added the SWA to Gemma2. #2210

Merged

3 tasks

regisss reviewed Sep 3, 2025

Comment thread optimum/habana/transformers/modeling_utils.py Outdated

Comment thread optimum/habana/transformers/modeling_utils.py Outdated

Comment thread optimum/habana/transformers/models/gpt_oss/configuration_gpt_oss.py Outdated

karol-brejna-i assigned pbielak Sep 4, 2025

schoi-habana added 2 commits September 5, 2025 21:15

enabled DeepSpeed

9e0a3fd

DS cleaned

2f09e05

yafshar reviewed Sep 6, 2025

pbielak reviewed Sep 8, 2025

Comment thread optimum/habana/transformers/models/mistral/configuration_mistral.py

Fixed accuracy issue with hpu graph and dynamicity

bd4a77d

schoi-habana force-pushed the schoi/gpt_oss branch from cb82e5c to bd4a77d Compare September 9, 2025 19:01

schoi-habana force-pushed the schoi/gpt_oss branch from e799e23 to bd4a77d Compare September 10, 2025 17:54

huggingface deleted a comment from github-actions Bot Sep 10, 2025

regisss reviewed Sep 15, 2025

Comment thread tests/test_text_generation_example.py Outdated

schoi-habana added 4 commits September 15, 2025 18:58

replace torch.bmm() with Matmul for INC quantization and add mark_step() to prevent one large graph

58c96aa

Merge branch 'main' into schoi/gpt_oss

5e71382

formatting

3ea071a

adding GPT-OSS to README and docs

024178b

regisss approved these changes Sep 17, 2025

Fix test name in

93f3127

regisss merged commit 9fffa78 into main Sep 17, 2025
6 of 8 checks passed

regisss deleted the schoi/gpt_oss branch September 17, 2025 09:30

astachowiczhabana pushed a commit that referenced this pull request Sep 22, 2025

Enable GPT-OSS (#2214)

f9598d2

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

astachowiczhabana pushed a commit that referenced this pull request Sep 23, 2025

Enable GPT-OSS (#2214)

1b7d798

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

astachowiczhabana pushed a commit that referenced this pull request Sep 25, 2025

Enable GPT-OSS (#2214)

a736729

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

	token_idx = cumsum.argmax(dim=1, keepdim=True)[0]
	token_idx = cumsum.argmax(dim=1, keepdim=True)[0].item()
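
The two lines above differ only in the trailing `.item()`, which in PyTorch converts a one-element tensor into a plain Python number; the `[0]` before it selects the first row of the batch. Why `cumsum` followed by `argmax` yields the index of the last 1 in the mask can be illustrated without PyTorch. The sketch below is a pure-Python stand-in (the helper name `last_one_index` is hypothetical). It relies on taking the first index of the maximum, which is how `torch.argmax` is documented to break ties.

```python
from itertools import accumulate

def last_one_index(mask_row):
    """Return the index of the last 1 in a 0/1 mask via cumsum + first argmax."""
    cumsum = list(accumulate(mask_row))   # running count of ones
    # list.index returns the first position of the maximum, so the result is
    # the position where the running count reaches its final value.
    return cumsum.index(max(cumsum))

right_padded = last_one_index([1, 1, 1, 0, 0])
left_padded = last_one_index([0, 0, 1, 1, 1])
print(right_padded, left_padded)
```

The running count stops growing after the last 1, so the first position at which it reaches its maximum is exactly that last 1, regardless of whether the padding sits on the left or on the right.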

6 participants
