fix: bypass _forward_impl for dp_size==1 to fix DeepSeek R1 FP8 crash #1441
Signed-off-by: Kamil Kaczor <kamil.kaczor@intel.com>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Pull request overview
This PR addresses a DeepSeek R1 FP8 warmup crash on Gaudi by avoiding Dynamo graph-break/miscompile conditions in the MoE runner path when dp_size == 1. It does so by bypassing _forward_impl in the single-DP case and directly invoking _apply_quant_method + _maybe_combine, while keeping the dp_size > 1 path unchanged.
Changes:
- For dp_size == 1, bypass _forward_impl and call _apply_quant_method + _maybe_combine directly to avoid graph breaks and redundant gate/stream-sync behavior.
- Add a runtime guard that raises for pcp_size > 1 in the dp_size == 1 fast path.
- Minor formatting/style cleanup in the patching section (quote style and line wrapping).
✅ CI Passed: All checks passed successfully against the following vllm commit:
INC's _sync_shared_moe_gates() sets runner.gate = None after FP8 conversion to force external routing. However, when the model still uses the internal router path (passing hidden_states as router_logits), our patched forward needs the gate to convert hidden_states → logits. Stash the gate reference as _hpu_gate_ref at init time (before INC clears it) and use it as fallback in patched_fused_moe_forward.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kamil Kaczor <kamil.kaczor@intel.com>
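The stash-and-fallback idea from this commit message can be illustrated with a minimal, torch-free sketch. Everything here is hypothetical scaffolding: `StubRunner`, `sync_shared_moe_gates`, `resolve_gate`, and `toy_gate` are stand-ins invented for the example and are not the real vLLM or INC classes; only the attribute names `gate` and `_hpu_gate_ref` come from the text above.

```python
class StubRunner:
    """Hypothetical stand-in for the MoE runner (not the real class)."""

    def __init__(self, gate):
        self.gate = gate
        # Stash a second reference at init time so the gate stays reachable
        # even if some later step sets `self.gate = None`.
        self._hpu_gate_ref = gate


def sync_shared_moe_gates(runner):
    """Mimics the described post-FP8 step that clears the runner's gate."""
    runner.gate = None


def resolve_gate(runner):
    """Return the live gate if present, otherwise the stashed reference."""
    if runner.gate is not None:
        return runner.gate
    return getattr(runner, "_hpu_gate_ref", None)


def toy_gate(hidden_states):
    # Toy "gate": reduce each hidden vector to a single logit (its sum).
    return [sum(row) for row in hidden_states]


runner = StubRunner(toy_gate)
sync_shared_moe_gates(runner)      # runner.gate is now None
gate = resolve_gate(runner)        # falls back to runner._hpu_gate_ref
logits = gate([[1.0, 2.0], [3.0, 4.0]])
```

The point of the sketch is only the ordering: the reference has to be captured before the clearing step runs, otherwise there is nothing left to fall back to.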
…fast path (#1469)
PR #1441 added an _hpu_gate_ref fallback in the dp_size==1 fast path that unconditionally re-invoked a runner-owned gate, overwriting router_logits supplied by the caller. For SharedFusedMoE models (Qwen3 MoE, ernie45, ...) the block's mlp.gate(...) has already produced router_logits and _sync_shared_moe_gates sets runner.gate=None post-INC; the cached _hpu_gate_ref still points at the pre-INC module and produced shape/dtype mismatches under fp8.
Only invoke the runner-owned gate when the caller did not provide router_logits, preserving the DeepSeek R1 internal-router fast path from #1441.
---------
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
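A rough sketch of the guard described in this follow-up commit is shown below. It assumes that "the caller did not provide router_logits" is signalled by `None`; how the actual patch detects that case is not shown in this conversation, so treat that as an assumption. `select_router_logits` and the toy gate are hypothetical names used only for illustration.

```python
def select_router_logits(hidden_states, router_logits, runner_gate):
    """Use caller-supplied logits when present; otherwise run the gate.

    Assumption: `router_logits is None` means the caller did not
    compute logits itself (hypothetical convention for this sketch).
    """
    if router_logits is not None:
        # Caller (e.g. the block's own gate) already produced logits:
        # do not overwrite them with a runner-owned gate.
        return router_logits
    if runner_gate is None:
        raise ValueError("no router_logits supplied and no gate available")
    return runner_gate(hidden_states)


gate_calls = []


def counting_gate(hidden_states):
    # Toy gate that records each invocation and returns one logit per row.
    gate_calls.append(len(hidden_states))
    return [sum(row) for row in hidden_states]


hidden = [[1.0, 1.0], [2.0, 2.0]]
external = select_router_logits(hidden, [0.5, 0.25], counting_gate)
internal = select_router_logits(hidden, None, counting_gate)
```

In this sketch the gate runs once, for the second call only, which mirrors the stated intent of leaving caller-provided logits untouched.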
Problem
DeepSeek R1 (671B) crashes during warmup on G3 with FP8 quantization (GAUDISW-248418).
Two error manifestations:
- RuntimeError: Incompatible input shapes, broadcast not possible. Tensor1 Size: 7168 30720 Tensor2 Size: 256 1
- RuntimeError: Attempting to broadcast a dimension of length 256 at -1! Mismatching argument at index 1 had torch.Size([1, 256]); but expected shape should be broadcastable to [8192, 7168]
Both crash at hpu_grouped_topk_router.py:64 during MoE gate application.
Root Cause
_forward_impl introduces graph breaks via _sequence_parallel_context() (calls get_forward_context()). Combined with double gate application (gate called in patched_fused_moe_forward AND again inside _forward_impl), Dynamo miscompiles the graph on HPU Synapse, causing shape mismatches.
Regression window: Build 254 (good) → Build 260 (broken), introduced by commit 98863a7 (MoE dynamo recompilation fix).
Fix
For dp_size==1 (the common single-node case), bypass _forward_impl entirely and call _apply_quant_method + _maybe_combine directly. This:
- Avoids the graph breaks from _sequence_parallel_context() and get_forward_context()
- Skips _maybe_dispatch() (only needed for dp_size > 1)
- Raises for pcp_size > 1 (unsupported in fast path)
The dp_size > 1 fallback via _forward_entry is unchanged.
Testing
Tested on G3 (8x HL-325L) with DeepSeek R1 671B FP8 TP=8:
Fixes: GAUDISW-248418
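For illustration, the dp_size == 1 fast path described under Fix might be sketched as follows. This is a torch-free approximation under stated assumptions: `StubRunner` and `patched_forward` are hypothetical, the method signatures are invented for the example (the real ones are not shown in this PR text), and the choice of `NotImplementedError` for the pcp_size guard is a guess. Only the method names and the branching on dp_size and pcp_size come from the description.

```python
class StubRunner:
    """Hypothetical runner that records which methods were called."""

    def __init__(self, dp_size, pcp_size=1):
        self.dp_size = dp_size
        self.pcp_size = pcp_size
        self.calls = []

    def _apply_quant_method(self, hidden_states, router_logits):
        self.calls.append("_apply_quant_method")
        return [h * 2 for h in hidden_states]  # placeholder computation

    def _maybe_combine(self, out):
        self.calls.append("_maybe_combine")
        return out

    def _forward_entry(self, hidden_states, router_logits):
        self.calls.append("_forward_entry")
        return [h * 2 for h in hidden_states]  # placeholder computation


def patched_forward(runner, hidden_states, router_logits):
    if runner.dp_size == 1:
        # Fast path: the guard rejects the unsupported configuration up front.
        if runner.pcp_size > 1:
            raise NotImplementedError(
                "pcp_size > 1 is not supported in the dp_size == 1 fast path"
            )
        out = runner._apply_quant_method(hidden_states, router_logits)
        return runner._maybe_combine(out)
    # dp_size > 1: keep the existing entry point unchanged.
    return runner._forward_entry(hidden_states, router_logits)


single = StubRunner(dp_size=1)
multi = StubRunner(dp_size=2)
out_single = patched_forward(single, [1, 2], None)
out_multi = patched_forward(multi, [1, 2], None)
```

Here `single.calls` records only the two direct calls, while `multi.calls` records the unchanged `_forward_entry` route, which is the split the PR description lays out.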