
fix: bypass _forward_impl for dp_size==1 to fix DeepSeek R1 FP8 crash #1441

Merged
iboiko-habana merged 7 commits into vllm-project:main from kamil-kaczor:fix/deepseek-r1-fp8-warmup-crash on May 18, 2026


@kamil-kaczor
Collaborator

Problem

DeepSeek R1 (671B) crashes during warmup on G3 with FP8 quantization (GAUDISW-248418).

Two error manifestations:

  • RuntimeError: Incompatible input shapes, broadcast not possible. Tensor1 Size: 7168 30720 Tensor2 Size: 256 1
  • RuntimeError: Attempting to broadcast a dimension of length 256 at -1! Mismatching argument at index 1 had torch.Size([1, 256]); but expected shape should be broadcastable to [8192, 7168]

Both crash at hpu_grouped_topk_router.py:64 during MoE gate application.
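
The shapes in the second message can be checked against the standard broadcasting rule (align trailing dimensions; each pair must be equal or contain a 1). The following is a minimal stdlib-only sketch of that rule, not code from this PR; `broadcast_shape` is a hypothetical helper written for illustration.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of a and b, or raise ValueError."""
    result = []
    # Align from the trailing dimension; missing leading dims count as 1.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da == db or da == 1 or db == 1:
            result.append(max(da, db))
        else:
            raise ValueError(
                f"cannot broadcast {tuple(a)} with {tuple(b)}: {da} vs {db}")
    return tuple(reversed(result))


# Shapes from the second error message: trailing dims 7168 and 256 clash.
try:
    broadcast_shape((8192, 7168), (1, 256))
    compatible = True
except ValueError:
    compatible = False

# A [tokens, 256] tensor would broadcast with [1, 256] without error.
ok_shape = broadcast_shape((8192, 256), (1, 256))
```

The trailing dimension 7168 against 256 is what fails, which suggests the router received a tensor with the wrong last dimension rather than a wrong token count.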

Root Cause

_forward_impl introduces graph breaks via _sequence_parallel_context(), which calls get_forward_context(). Combined with double gate application (the gate is called in patched_fused_moe_forward and again inside _forward_impl), this causes Dynamo to miscompile the graph on HPU Synapse, producing the shape mismatches above.

Regression window: Build 254 (good) → Build 260 (broken), introduced by commit 98863a7 (MoE dynamo recompilation fix).

Fix

For dp_size==1 (the common single-node case), bypass _forward_impl entirely and call _apply_quant_method + _maybe_combine directly. This:

  1. Eliminates graph breaks from _sequence_parallel_context() and get_forward_context()
  2. Skips the no-op _maybe_dispatch() (only needed for dp_size > 1)
  3. Prevents double gate application
  4. Adds a RuntimeError guard for pcp_size > 1 (unsupported in fast path)

The dp_size > 1 fallback via _forward_entry is unchanged.
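
The control flow described above can be sketched roughly as follows. This is an illustrative model only: `ToyMoERunner` and `patched_forward` are hypothetical stand-ins, and the argument lists of `_apply_quant_method`, `_maybe_combine`, and `_forward_entry` are assumptions, not the signatures used in vllm_gaudi/ops/hpu_fused_moe.py.

```python
class ToyMoERunner:
    """Hypothetical stand-in for the patched runner; records call order."""

    def __init__(self, dp_size=1, pcp_size=1):
        self.dp_size = dp_size
        self.pcp_size = pcp_size
        self.calls = []

    def _apply_quant_method(self, hidden_states, router_logits):
        self.calls.append("_apply_quant_method")
        return hidden_states  # placeholder for the expert computation

    def _maybe_combine(self, out):
        self.calls.append("_maybe_combine")
        return out

    def _forward_entry(self, hidden_states, router_logits):
        self.calls.append("_forward_entry")
        return hidden_states


def patched_forward(runner, hidden_states, router_logits):
    if runner.dp_size == 1:
        # Guard: the fast path does not handle pcp_size > 1.
        if runner.pcp_size > 1:
            raise RuntimeError(
                "pcp_size > 1 is not supported in the dp_size == 1 fast path")
        # Fast path: skip _forward_impl and call the two steps directly.
        out = runner._apply_quant_method(hidden_states, router_logits)
        return runner._maybe_combine(out)
    # dp_size > 1: unchanged fallback.
    return runner._forward_entry(hidden_states, router_logits)


fast = ToyMoERunner(dp_size=1)
fast_out = patched_forward(fast, [1.0], [0.5])

slow = ToyMoERunner(dp_size=2)
slow_out = patched_forward(slow, [1.0], [0.5])
```

In this toy model, `fast.calls` records only the two direct calls, while `slow.calls` records the `_forward_entry` fallback.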

Testing

Tested on G3 (8x HL-325L) with DeepSeek R1 671B FP8 TP=8:

  • ✅ Prompt warmup: 54/54 items completed (crash site in original bug)
  • ✅ Decode warmup: 25/25 items completed
  • ✅ End-to-end inference: valid completions returned

Fixes: GAUDISW-248418

Signed-off-by: Kamil Kaczor <kamil.kaczor@intel.com>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a DeepSeek R1 FP8 warmup crash on Gaudi by avoiding Dynamo graph-break/miscompile conditions in the MoE runner path when dp_size == 1. It does so by bypassing _forward_impl in the single-DP case and directly invoking _apply_quant_method + _maybe_combine, while keeping the dp_size > 1 path unchanged.

Changes:

  • For dp_size == 1, bypass _forward_impl and call _apply_quant_method + _maybe_combine directly to avoid graph breaks and redundant gate/stream-sync behavior.
  • Add a runtime guard that raises for pcp_size > 1 in the dp_size == 1 fast path.
  • Minor formatting/style cleanup in the patching section (quote style and line wrapping).

Comment thread vllm_gaudi/ops/hpu_fused_moe.py
Comment thread vllm_gaudi/ops/hpu_fused_moe.py
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
54f548e9e58087f0155e4e164e416ad7efdfde6d

kamil-kaczor and others added 2 commits May 13, 2026 09:33
INC's _sync_shared_moe_gates() sets runner.gate = None after FP8
conversion to force external routing. However, when the model still
uses the internal router path (passing hidden_states as router_logits),
our patched forward needs the gate to convert hidden_states → logits.

Stash the gate reference as _hpu_gate_ref at init time (before INC
clears it) and use it as fallback in patched_fused_moe_forward.
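
A minimal sketch of the stash-and-fallback idea, under stated assumptions: `ToyGate`, `ToyRunner`, `simulate_inc_sync`, and `resolve_gate` are hypothetical names for illustration. Only `gate` and `_hpu_gate_ref` come from the commit message, and the real INC hook is not reproduced here.

```python
class ToyGate:
    """Hypothetical gate: maps hidden_states to a single toy logit."""

    def __call__(self, hidden_states):
        return [sum(hidden_states)]


class ToyRunner:
    def __init__(self, gate):
        self.gate = gate
        # Stash a second reference at init time, before anything clears gate.
        self._hpu_gate_ref = gate


def simulate_inc_sync(runner):
    # Stand-in for the post-conversion step that sets runner.gate = None.
    runner.gate = None


def resolve_gate(runner):
    # Prefer the live gate; fall back to the stashed reference.
    return runner.gate if runner.gate is not None else runner._hpu_gate_ref


gate = ToyGate()
runner = ToyRunner(gate)
simulate_inc_sync(runner)
logits = resolve_gate(runner)([1.0, 2.0])
```

The stash works because clearing `runner.gate` only drops one reference; the module itself stays alive through `_hpu_gate_ref`.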

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kamil Kaczor <kamil.kaczor@intel.com>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
54f548e9e58087f0155e4e164e416ad7efdfde6d

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
54f548e9e58087f0155e4e164e416ad7efdfde6d

@iboiko-habana iboiko-habana merged commit 4d6d38c into vllm-project:main May 18, 2026
2 checks passed
iboiko-habana added a commit that referenced this pull request May 22, 2026
…fast path (#1469)

PR #1441 added an _hpu_gate_ref fallback in the dp_size==1 fast path
that unconditionally re-invoked a runner-owned gate, overwriting
router_logits supplied by the caller. For SharedFusedMoE models
(Qwen3 MoE, ernie45, ...) the block's mlp.gate(...) has already
produced router_logits and _sync_shared_moe_gates sets
runner.gate=None post-INC; the cached _hpu_gate_ref still points at
the pre-INC module and produced shape/dtype mismatches under fp8.

Only invoke the runner-owned gate when the caller did not provide
router_logits, preserving the DeepSeek R1 internal-router fast path
from #1441.
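
The conditional described in this commit message might look roughly like the sketch below. How the real code detects that the caller did not provide router_logits is not shown on this page; here it is assumed, purely for illustration, that the internal-router path passes the same `hidden_states` object as `router_logits`. `select_router_logits` and `CountingGate` are hypothetical names.

```python
class CountingGate:
    """Hypothetical gate that counts how often it is invoked."""

    def __init__(self):
        self.calls = 0

    def __call__(self, hidden_states):
        self.calls += 1
        return [0.0] * 4  # toy logits for 4 experts


def select_router_logits(hidden_states, router_logits, gate):
    # Assumption: the caller signals "no logits provided" by passing
    # hidden_states itself as router_logits (identity, not equality).
    caller_provided = router_logits is not hidden_states
    if caller_provided or gate is None:
        # Never overwrite logits the caller already computed.
        return router_logits
    # Internal-router path: convert hidden_states to logits here.
    return gate(hidden_states)


hs = [1.0, 2.0]
g = CountingGate()

internal = select_router_logits(hs, hs, g)         # gate runs once
provided = [0.1, 0.9, 0.0, 0.0]
external = select_router_logits(hs, provided, g)   # gate is skipped
```

With this ordering the caller-supplied logits pass through unchanged, and the gate runs only on the internal-router path.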

---------

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
mgawarkiewicz-intel pushed a commit that referenced this pull request May 25, 2026
…FP8 crash #1441 (#1459)

Signed-off-by: Iryna Boiko <iboiko@habana.ai>