Port of: fix: bypass _forward_impl for dp_size==1 to fix DeepSeek R1 FP8 crash #1441 #1459

Merged
mgawarkiewicz-intel merged 1 commit into vllm-project:releases/v0.21.0 from iboiko-habana:port1441
May 25, 2026
Conversation

@iboiko-habana
Collaborator

No description provided.

…FP8 crash vllm-project#1441

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Copilot AI review requested due to automatic review settings May 19, 2026 09:23
Contributor

Copilot AI left a comment

Pull request overview

Ports an upstream fix to the Gaudi fused MoE runner to avoid a DeepSeek R1 FP8 crash by changing the dp_size == 1 execution path to bypass _forward_impl and directly run quant/application + output combination steps.

Changes:

  • Implement a dp_size == 1 fast path that calls _apply_quant_method and _maybe_combine directly instead of _forward_impl.
  • Persist additional runner references (_hpu_layer_ref, _hpu_gate_ref) during FusedMoE.__init__ to avoid relying on potentially detached gate modules at runtime.
  • Minor style/formatting adjustments to patch assignments.
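The control flow described in the change list can be sketched roughly as follows. This is a toy illustration only, not the code from the PR: the classes `ToyMoEConfig`, `ToyLayer`, and `ToyRunner`, the method signatures, and the list-based "tensors" are all assumptions made for the example. Only the names `_forward_impl`, `_apply_quant_method`, `_maybe_combine`, `_hpu_layer_ref`, `dp_size`, and `pcp_size` come from the PR text.

```python
from dataclasses import dataclass


@dataclass
class ToyMoEConfig:
    # Hypothetical stand-in for the runner's moe_config.
    dp_size: int = 1
    pcp_size: int = 1


class ToyLayer:
    """Hypothetical stand-in for the FusedMoE layer that owns the runner."""

    def __init__(self, scale):
        self.scale = scale


class ToyRunner:
    def __init__(self, layer, moe_config):
        self.moe_config = moe_config
        # Reference stashed once at construction time (assumed signature).
        self._hpu_layer_ref = layer
        self.calls = []  # records which steps ran, for illustration only

    def _forward_impl(self, layer, hidden):
        # General path; in this toy it simply wraps the same two steps.
        self.calls.append("_forward_impl")
        return self._maybe_combine(self._apply_quant_method(layer, hidden))

    def _apply_quant_method(self, layer, hidden):
        self.calls.append("_apply_quant_method")
        return [h * layer.scale for h in hidden]

    def _maybe_combine(self, expert_out):
        self.calls.append("_maybe_combine")
        return expert_out

    def forward(self, hidden):
        if self.moe_config.dp_size == 1:
            # Fast path: skip _forward_impl entirely.
            if self.moe_config.pcp_size > 1:
                raise RuntimeError(
                    "dp_size==1 fast path does not support pcp_size > 1"
                )
            layer = self._hpu_layer_ref
            return self._maybe_combine(self._apply_quant_method(layer, hidden))
        return self._forward_impl(self._hpu_layer_ref, hidden)
```

With `dp_size=1`, `ToyRunner.forward` records only `_apply_quant_method` and `_maybe_combine`; with any other `dp_size` it goes through `_forward_impl` first.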

Comment on lines +306 to +308
    if self.moe_config.pcp_size > 1:
        raise RuntimeError("dp_size==1 fast path does not support pcp_size > 1")
    layer = self._hpu_layer_ref
Comment on lines 283 to 289
  Instead of calling forward_dispatch (which uses get_layer_from_name,
  ensure_moe_quant_config_init, and _sequence_parallel_context — all of
  which access ForwardContext and cause torch.compile graph breaks), we
  use a layer reference stashed on the runner at FusedMoE.__init__ time
- (self._hpu_layer_ref) and call _forward_impl directly. This also
+ (self._hpu_layer_ref) and bypass _forward_impl for dp_size==1,
+ calling _apply_quant_method + _maybe_combine directly. This also
  bypasses self.layer_name (a per-layer string) so dynamo no longer
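
The idea in the quoted docstring, stashing object references at construction time instead of resolving the layer by name on every call, can be illustrated with a small toy. Everything here is hypothetical: `LAYER_REGISTRY` merely stands in for a per-forward context lookup, and `ToyGate`, `ToyFusedMoE`, and `ToyRunner` are not the real vLLM classes or signatures. Only `_hpu_layer_ref`, `_hpu_gate_ref`, and `layer_name` are names taken from the PR.

```python
# Hypothetical string-keyed registry, standing in for a context lookup.
LAYER_REGISTRY = {}


class ToyGate:
    """Hypothetical gate module; identity function for the example."""

    def __call__(self, x):
        return x


class ToyRunner:
    def __init__(self):
        self._hpu_layer_ref = None
        self._hpu_gate_ref = None

    def layer_by_name(self, layer_name):
        # Name-based style: resolve the layer from a string on each call.
        return LAYER_REGISTRY[layer_name]

    def layer_by_ref(self):
        # Reference-based style: no string key and no registry access.
        return self._hpu_layer_ref


class ToyFusedMoE:
    def __init__(self, name, runner, gate):
        self.layer_name = name
        self.gate = gate
        self.runner = runner
        LAYER_REGISTRY[name] = self
        # Stash direct references on the runner once, at construction time.
        runner._hpu_layer_ref = self
        runner._hpu_gate_ref = gate
```

In this toy, clearing `layer.gate` after construction leaves `runner._hpu_gate_ref` pointing at the original gate object, which mirrors the motivation given in the review summary for persisting the gate reference.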
Collaborator

@kamil-kaczor kamil-kaczor left a comment

lgtm

@github-actions
✅ CI Passed

All checks passed successfully against the following vllm commit:
ad7125a431e176d4161099480a66f0169609a690

@mgawarkiewicz-intel mgawarkiewicz-intel merged commit de9f2fb into vllm-project:releases/v0.21.0 May 25, 2026
2 of 6 checks passed
4 participants