fix: bypass _forward_impl for dp_size==1 to fix DeepSeek R1 FP8 crash by kamil-kaczor · Pull Request #1441 · vllm-project/vllm-gaudi

kamil-kaczor · 2026-05-12T09:07:48Z

Problem

DeepSeek R1 (671B) crashes during warmup on G3 with FP8 quantization (GAUDISW-248418).

Two error manifestations:

RuntimeError: Incompatible input shapes, broadcast not possible. Tensor1 Size: 7168 30720 Tensor2 Size: 256 1
RuntimeError: Attempting to broadcast a dimension of length 256 at -1! Mismatching argument at index 1 had torch.Size([1, 256]); but expected shape should be broadcastable to [8192, 7168]

Both crash at hpu_grouped_topk_router.py:64 during MoE gate application.
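
To make the second message concrete, here is a minimal pure-Python sketch of the standard trailing-dimension broadcasting rule, applied to the shapes quoted in the error. The helper name `broadcast_shape` is hypothetical and is not part of vllm-gaudi or PyTorch. The reading that the router received a 7168-wide tensor where a 256-wide one was expected is an inference from the message, not something the traceback states.

```python
def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError.

    Hypothetical helper for illustration only; it mirrors the usual
    rule: align trailing dimensions, and each pair must be equal or
    contain a 1.
    """
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1  # missing leading dims act as 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"cannot broadcast {da} with {db} at dim -{i}")
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


# A [1, 256] tensor combines fine with an [8192, 256] tensor ...
print(broadcast_shape((8192, 256), (1, 256)))

# ... but not with [8192, 7168], the target shape named in the error.
try:
    broadcast_shape((8192, 7168), (1, 256))
except ValueError as exc:
    print("broadcast failed:", exc)
```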

Root Cause

_forward_impl introduces graph breaks via _sequence_parallel_context(), which calls get_forward_context(). On top of that, the gate is applied twice: once in patched_fused_moe_forward and again inside _forward_impl. With both in play, Dynamo miscompiles the graph on HPU Synapse, which produces the shape mismatches above.

Regression window: Build 254 (good) → Build 260 (broken), introduced by commit 98863a7 (MoE dynamo recompilation fix).

Fix

For dp_size==1 (the common single-node case), bypass _forward_impl entirely and call _apply_quant_method + _maybe_combine directly. This:

Eliminates graph breaks from _sequence_parallel_context() and get_forward_context()
Skips the no-op _maybe_dispatch() (only needed for dp_size > 1)
Prevents double gate application
Adds a RuntimeError guard for pcp_size > 1 (unsupported in fast path)

The dp_size > 1 fallback via _forward_entry is unchanged.
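
The control flow described above can be sketched with a toy runner. This is an illustration under assumptions, not the code from the PR: `ToyRunner`, `patched_forward`, the method signatures and the error text are hypothetical, and only the method names and the dp_size / pcp_size conditions come from the description.

```python
class ToyRunner:
    """Stand-in for the MoE runner; it only records which methods ran."""

    def __init__(self, dp_size, pcp_size=1):
        self.dp_size = dp_size
        self.pcp_size = pcp_size
        self.calls = []

    def _apply_quant_method(self, hidden_states, router_logits):
        self.calls.append("_apply_quant_method")
        return hidden_states  # placeholder for the expert computation

    def _maybe_combine(self, output):
        self.calls.append("_maybe_combine")
        return output

    def _forward_entry(self, hidden_states, router_logits):
        self.calls.append("_forward_entry")  # unchanged dp_size > 1 path
        return hidden_states


def patched_forward(runner, hidden_states, router_logits):
    if runner.dp_size == 1:
        # Fast path: skip _forward_impl (and its context lookups) entirely.
        if runner.pcp_size > 1:
            raise RuntimeError("pcp_size > 1 is not supported in the fast path")
        out = runner._apply_quant_method(hidden_states, router_logits)
        return runner._maybe_combine(out)
    # Fallback for data parallelism: behaviour is left as it was.
    return runner._forward_entry(hidden_states, router_logits)


single = ToyRunner(dp_size=1)
patched_forward(single, [[0.0]], [[0.0]])
print(single.calls)

multi = ToyRunner(dp_size=2)
patched_forward(multi, [[0.0]], [[0.0]])
print(multi.calls)
```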

Testing

Tested on G3 (8x HL-325L) with DeepSeek R1 671B FP8 TP=8:

✅ Prompt warmup: 54/54 items completed (crash site in original bug)
✅ Decode warmup: 25/25 items completed
✅ End-to-end inference: valid completions returned

Fixes: GAUDISW-248418

Signed-off-by: Kamil Kaczor <kamil.kaczor@intel.com>

github-actions · 2026-05-12T09:08:29Z

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

Copilot

Pull request overview

This PR addresses a DeepSeek R1 FP8 warmup crash on Gaudi by avoiding Dynamo graph-break/miscompile conditions in the MoE runner path when dp_size == 1. It does so by bypassing _forward_impl in the single-DP case and directly invoking _apply_quant_method + _maybe_combine, while keeping the dp_size > 1 path unchanged.

Changes:

For dp_size == 1, bypass _forward_impl and call _apply_quant_method + _maybe_combine directly to avoid graph breaks and redundant gate/stream-sync behavior.
Add a runtime guard that raises for pcp_size > 1 in the dp_size == 1 fast path.
Minor formatting/style cleanup in the patching section (quote style and line wrapping).

github-actions · 2026-05-13T03:16:22Z

✅ CI Passed

All checks passed successfully against the following vllm commit:
54f548e9e58087f0155e4e164e416ad7efdfde6d

INC's _sync_shared_moe_gates() sets runner.gate = None after FP8 conversion to force external routing. However, when the model still uses the internal router path (passing hidden_states as router_logits), our patched forward needs the gate to convert hidden_states → logits. Stash the gate reference as _hpu_gate_ref at init time (before INC clears it) and use it as fallback in patched_fused_moe_forward.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kamil Kaczor <kamil.kaczor@intel.com>
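
A minimal sketch of the stash-and-fallback idea from this commit message, using toy objects. Everything here except the attribute names `gate` and `_hpu_gate_ref` is hypothetical: `ToyGate`, `ToyRunner`, `clear_gate` (standing in for what the message says INC does) and `resolve_gate` are illustrative names, not vllm-gaudi or INC APIs.

```python
class ToyGate:
    """Pretend router gate: maps each hidden-state row to one logit."""

    def __call__(self, hidden_states):
        return [sum(row) for row in hidden_states]


class ToyRunner:
    def __init__(self, gate):
        self.gate = gate
        # Stash a second reference at init time, before anything clears
        # self.gate later on.
        self._hpu_gate_ref = gate


def clear_gate(runner):
    # Stand-in for the post-FP8-conversion step described above.
    runner.gate = None


def resolve_gate(runner):
    # Prefer the live gate; fall back to the stashed reference.
    gate = runner.gate
    if gate is None:
        gate = getattr(runner, "_hpu_gate_ref", None)
    return gate


runner = ToyRunner(ToyGate())
clear_gate(runner)
gate = resolve_gate(runner)
print(gate([[1, 2], [3, 4]]))
```

Note that the stashed reference still points at the module as it was at init time; the follow-up fix in #1469 exists because that can be the wrong object to call for some models.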

github-actions · 2026-05-13T11:11:21Z

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

github-actions · 2026-05-14T06:14:27Z

✅ CI Passed

All checks passed successfully against the following vllm commit:
54f548e9e58087f0155e4e164e416ad7efdfde6d

github-actions · 2026-05-18T12:12:41Z

✅ CI Passed

All checks passed successfully against the following vllm commit:
54f548e9e58087f0155e4e164e416ad7efdfde6d

…fast path (#1469)

PR #1441 added an _hpu_gate_ref fallback in the dp_size==1 fast path that unconditionally re-invoked a runner-owned gate, overwriting router_logits supplied by the caller. For SharedFusedMoE models (Qwen3 MoE, ernie45, ...) the block's mlp.gate(...) has already produced router_logits and _sync_shared_moe_gates sets runner.gate=None post-INC; the cached _hpu_gate_ref still points at the pre-INC module and produced shape/dtype mismatches under fp8.

Only invoke the runner-owned gate when the caller did not provide router_logits, preserving the DeepSeek R1 internal-router fast path from #1441.

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
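
The rule in this follow-up can be sketched as a small selection function. How the real code detects "the caller did not provide router_logits" is not shown on this page. The check below assumes the internal-router path is recognisable because router_logits is the same object as hidden_states (the earlier commit message describes that path as "passing hidden_states as router_logits"), or is None. `select_router_logits` is a hypothetical name.

```python
def select_router_logits(hidden_states, router_logits, gate):
    """Return the logits to route with (illustrative sketch only).

    Assumption: the caller supplied real logits unless router_logits is
    None or is the hidden_states object itself.
    """
    caller_provided = (
        router_logits is not None and router_logits is not hidden_states
    )
    if caller_provided or gate is None:
        # Never overwrite logits the model block already computed.
        return router_logits
    # Internal-router path: the runner-owned gate produces the logits.
    return gate(hidden_states)


def toy_gate(hidden_states):
    return [sum(row) for row in hidden_states]


hidden = [[1.0, 2.0], [3.0, 4.0]]

# Internal-router style call: gate is invoked.
print(select_router_logits(hidden, hidden, toy_gate))

# Caller already ran its own gate: its logits pass through untouched.
external = [[0.1, 0.9], [0.7, 0.3]]
print(select_router_logits(hidden, external, toy_gate) is external)
```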

fix: bypass _forward_impl for dp_size==1 to fix DeepSeek R1 FP8 crash

9f156fc

Signed-off-by: Kamil Kaczor <kamil.kaczor@intel.com>

Copilot AI review requested due to automatic review settings May 12, 2026 09:07

kamil-kaczor requested review from PatrykWo, adobrzyn, afierka-intel, iboiko-habana, jbyczkow, ksmusz, mgawarkiewicz-intel, michalkuligowski and xuechendi as code owners May 12, 2026 09:07

Copilot started reviewing on behalf of kamil-kaczor May 12, 2026 09:08

Copilot AI reviewed May 12, 2026

Comment thread vllm_gaudi/ops/hpu_fused_moe.py

Comment thread vllm_gaudi/ops/hpu_fused_moe.py

Merge branch 'main' into fix/deepseek-r1-fp8-warmup-crash

d0cf3f6

github-actions Bot mentioned this pull request May 12, 2026

🚦 Team Review Dashboard #701

Open

Merge branch 'main' into fix/deepseek-r1-fp8-warmup-crash

cf13482

kamil-kaczor and others added 2 commits May 13, 2026 09:33

Merge branch 'main' into fix/deepseek-r1-fp8-warmup-crash

95a5ff2

Merge branch 'main' into fix/deepseek-r1-fp8-warmup-crash

8ed5b89

Merge branch 'main' into fix/deepseek-r1-fp8-warmup-crash

44617ef

iboiko-habana approved these changes May 18, 2026

iboiko-habana merged commit 4d6d38c into vllm-project:main May 18, 2026
2 checks passed

iboiko-habana mentioned this pull request May 20, 2026

Fix stale gate ref overriding caller router_logits in dp_size==1 MoE fast path #1469

Merged

mgawarkiewicz-intel pushed a commit that referenced this pull request May 25, 2026

Port of: fix: bypass _forward_impl for dp_size==1 to fix DeepSeek R1 …

de9f2fb

…FP8 crash #1441 (#1459) Signed-off-by: Iryna Boiko <iboiko@habana.ai>

iboiko-habana merged 7 commits into vllm-project:main from kamil-kaczor:fix/deepseek-r1-fp8-warmup-crash


3 participants
