Enable slicing for fp8 FusedSDPA by yangulei · Pull Request #1285 · vllm-project/vllm-gaudi

yangulei · 2026-04-01T06:51:22Z

No description provided.

Copilot

Pull request overview

This PR extends the existing FusedSDPA slicing mechanism to the FP8 FusedSDPA path on HPU, so long-context “chunked prefill” can use the sliced dispatch when FP8 attention is enabled.

Changes:

- Refactors common slicing setup into a shared `ModuleFusedSDPABase`.
- Adds a sliced forward path for `ModuleFP8FusedSDPA` (FP8 FusedSDPA) and routes into it under the same gating conditions as BF16.
- Updates the HPU attention backend to prefer the FP8 FusedSDPA kernel when FP8 attention is enabled and the kernel is available.
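
For readers unfamiliar with slicing, the idea can be sketched in plain Python: causal attention is computed one query chunk at a time, and each chunk only needs the keys and values up to its own last position. This is an illustrative sketch only. The helper names (`causal_attention`, `sliced_causal_attention`), the chunk size, and the toy data are assumptions, and the slicing in `ModuleFusedSDPABase` may partition the work differently.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v, q_offset=0):
    """Scaled dot-product attention where query i (global index q_offset + i)
    attends to keys 0..q_offset + i."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_row in enumerate(q):
        last_key = q_offset + i  # causal limit for this query
        scores = [scale * sum(a * b for a, b in zip(q_row, k[j]))
                  for j in range(last_key + 1)]
        probs = softmax(scores)
        out.append([sum(p * v[j][d] for j, p in enumerate(probs))
                    for d in range(len(v[0]))])
    return out

def sliced_causal_attention(q, k, v, chunk_size):
    """Process queries chunk by chunk; each chunk sees keys up to its own end."""
    out = []
    for start in range(0, len(q), chunk_size):
        end = min(start + chunk_size, len(q))
        out.extend(causal_attention(q[start:end], k[:end], v[:end], q_offset=start))
    return out

# Toy data (hypothetical): 5 tokens, head dimension 2.
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0], [0.2, -0.3]]
k = [[0.3, 0.1], [0.0, 1.0], [1.0, 0.0], [0.5, -0.5], [0.7, 0.2]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = causal_attention(q, k, v)
sliced = sliced_causal_attention(q, k, v, chunk_size=2)
```

In this sketch the sliced result matches the unsliced one row for row, because each query row is independent of the others; slicing only bounds how much work is dispatched at once.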

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `vllm_gaudi/extension/utils.py` | Adds shared slicing setup base class and implements FP8 sliced FusedSDPA forward path. |
| `vllm_gaudi/attention/backends/hpu_attn.py` | Switches `fsdpa_op` to FP8 FusedSDPA when FP8 attention is enabled and import succeeds. |

Copilot · 2026-04-01T06:58:41Z

+            False,  # is_amax_s
+            False,  # is_amax_o
+            None,  # valid_seq_len
+            "right",  # seq_padding_type


fp8_fsdpa_fwd hardcodes seq_padding_type to "right" and ignores the padding_side argument flowing into forward/_sliced_fsdpa_fwd. This can silently break correctness for left-padded inputs and is inconsistent with the BF16 slicing path (which passes padding_side). Thread padding_side through to fp8_fsdpa_fwd and use it instead of the constant.
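
The difference between hardcoding and threading the argument can be sketched with a stand-in kernel. Everything here is hypothetical: `fake_fp8_kernel` only records the padding type it receives, and `fwd_hardcoded` / `fwd_threaded` are illustrative names, not the PR's functions or the real kernel signature.

```python
def fake_fp8_kernel(q, k, v, seq_padding_type):
    """Hypothetical stand-in for the fused kernel: it only reports which
    padding type it was asked to assume."""
    return {"seq_padding_type": seq_padding_type}

def fwd_hardcoded(q, k, v, padding_side="right"):
    # Mirrors the reviewed pattern: the caller's padding_side is ignored.
    return fake_fp8_kernel(q, k, v, "right")

def fwd_threaded(q, k, v, padding_side="right"):
    # Mirrors the requested fix: the caller's padding_side reaches the kernel.
    return fake_fp8_kernel(q, k, v, padding_side)

hardcoded = fwd_hardcoded(None, None, None, padding_side="left")
threaded = fwd_threaded(None, None, None, padding_side="left")
```

With a left-padded batch, the hardcoded variant still tells the kernel the padding is on the right, while the threaded variant forwards `"left"`.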

Copilot · 2026-04-01T06:58:41Z

+                    mask_chunk = mask_chunk.clone() if mask_chunk is not None else None
+                    self.break_graph()
+
+                chunk_res = self.fp8_fsdpa_fwd(q_chunk, k_chunk, v_chunk, None, dropout_p, scale, False, softmax_mode)


In the FP8 sliced path (context part), mask_chunk is computed (and even cloned under with_graph_breaks) but fp8_fsdpa_fwd is called with attn_mask=None.
This drops the padding mask for context chunks and can produce incorrect attention results when padding is present; pass mask_chunk through instead of None.

Suggested change

-                chunk_res = self.fp8_fsdpa_fwd(q_chunk, k_chunk, v_chunk, None, dropout_p, scale, False, softmax_mode)
+                chunk_res = self.fp8_fsdpa_fwd(q_chunk, k_chunk, v_chunk, mask_chunk, dropout_p, scale, False, softmax_mode)
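
A small plain-Python sketch shows why dropping the mask matters. The helper `attention_row` and the toy numbers are assumptions for illustration; the real kernel operates on tensors and its mask convention may differ.

```python
import math

def attention_row(scores, values, mask=None):
    """Softmax-weighted sum of scalar values for one query.
    mask[j] is False for padded keys (assumed convention for this sketch)."""
    if mask is not None:
        scores = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w / total * val for w, val in zip(weights, values))

scores = [0.0, 0.0, 0.0, 0.0]
values = [1.0, 3.0, 100.0, 100.0]        # the last two keys are padding
mask_chunk = [True, True, False, False]

with_mask = attention_row(scores, values, mask_chunk)  # padding excluded
without_mask = attention_row(scores, values, None)     # padding leaks in
```

Here `with_mask` averages only the two real keys, while `without_mask` also averages in the padded values, which is the kind of silent error the comment describes.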

Copilot · 2026-04-01T06:58:41Z

@@ -413,6 +419,7 @@ def __init__(self, fusedSDPA):
        self.d_scale_q = torch.tensor(1.0)
        self.d_scale_k = torch.tensor(1.0)
        self.d_scale_v = torch.tensor(1.0)
+        self.d_scale_output = torch.tensor(1.0)


d_scale_output is introduced and used to dequantize sliced FP8 outputs, but it is not configured anywhere outside this module (unlike scale_q/k/v and d_scale_q/k/v, which are set during weight/scale processing). As-is, slicing will always dequant with the default 1.0 scale. Consider wiring d_scale_output into the same scale-setup path (e.g., wherever fused_scaled_dot_product_attention.d_scale_q/k/v is set) or deriving it from the kernel’s output scale contract.
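
To illustrate the concern, here is a minimal scalar sketch of quantize/dequantize scaling. It assumes, hypothetically, that the kernel output is stored multiplied by a non-unit `scale_output`; whether the FP8 FusedSDPA kernel actually does this depends on its output scale contract, which this page does not state.

```python
def quantize(x, scale):
    """Map a real value into the low-precision range by multiplying with scale."""
    return x * scale

def dequantize(x_q, d_scale):
    """Recover the real value; d_scale should be the inverse of the quantize scale."""
    return x_q * d_scale

scale_output = 0.25                  # hypothetical output scale
d_scale_output = 1.0 / scale_output  # matching dequantization scale

true_value = 8.0
stored = quantize(true_value, scale_output)

recovered = dequantize(stored, d_scale_output)  # scale wired through
stale = dequantize(stored, 1.0)                 # default 1.0 left in place
```

Under that assumption, leaving `d_scale_output` at its default returns the stored value rather than the original one.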

Copilot · 2026-04-01T06:58:41Z

+        enable_slicing = enable_slicing and slice_thld >= slice_thld_default
+        if not enable_slicing and slice_thld > 0:
+            logger().warning('Invalid slice sequence length threshold, the threshold should be '
+                             f'>= min(max_num_batched_tokens, 8192), falling back to default {slice_thld_default}.')
+            slice_thld = slice_thld_default


When VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD is set to an invalid value (< default), the code logs that it is “falling back to default”, but enable_slicing stays False and the fallback threshold is never used. Either re-enable slicing after resetting slice_thld, or update the log/message to reflect that slicing remains disabled for invalid thresholds.

Suggested change

-        enable_slicing = enable_slicing and slice_thld >= slice_thld_default
-        if not enable_slicing and slice_thld > 0:
-            logger().warning('Invalid slice sequence length threshold, the threshold should be '
-                             f'>= min(max_num_batched_tokens, 8192), falling back to default {slice_thld_default}.')
-            slice_thld = slice_thld_default
+        if 0 < slice_thld < slice_thld_default:
+            logger().warning('Invalid slice sequence length threshold, the threshold should be '
+                             f'>= min(max_num_batched_tokens, 8192), falling back to default {slice_thld_default}.')
+            slice_thld = slice_thld_default
+        elif slice_thld <= 0:
+            enable_slicing = False
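
The behavioral difference between the two versions can be checked in isolation. The functions below are a sketch that lifts the gating logic out of the diff into standalone helpers; the names `resolve_original` and `resolve_suggested` are hypothetical, and the warning log call is omitted.

```python
def resolve_original(enable_slicing, slice_thld, slice_thld_default):
    """Logic as submitted: an invalid threshold resets slice_thld
    but leaves slicing disabled."""
    enable_slicing = enable_slicing and slice_thld >= slice_thld_default
    if not enable_slicing and slice_thld > 0:
        slice_thld = slice_thld_default  # the PR logs a warning here
    return enable_slicing, slice_thld

def resolve_suggested(enable_slicing, slice_thld, slice_thld_default):
    """Logic as suggested: a too-small positive threshold falls back to the
    default and slicing stays on; a non-positive threshold disables slicing."""
    if 0 < slice_thld < slice_thld_default:
        slice_thld = slice_thld_default  # the suggestion logs a warning here
    elif slice_thld <= 0:
        enable_slicing = False
    return enable_slicing, slice_thld

# A threshold of 1024 against a default of 8192 is "invalid" in both versions.
original = resolve_original(True, 1024, 8192)
suggested = resolve_suggested(True, 1024, 8192)
```

With these inputs the original logic returns slicing disabled alongside the default threshold, while the suggested logic keeps slicing enabled with the default threshold.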


czhu15

LGTM, and the example test code passed.
Please check the comments from Copilot.

The boolean mask handling for attn_bias was accidentally removed in commit f337029 (Enable slicing for fp8 FusedSDPA vllm-project#1285). When attn_bias is a boolean tensor, the code should use masked_fill to set invalid positions to -inf, but instead it was using add_ which only adds 0/1 to the attention weights. This causes incorrect attention scores and accuracy degradation, especially for long prompts where proper masking of padded positions is critical.

Signed-off-by: copilot <copilot@github.com>
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: JyhWind <40982453+JyhWind@users.noreply.github.com>
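
The effect described in that commit message can be reproduced with a plain-Python sketch. The helper names are hypothetical, and the convention that `True` marks an invalid position is an assumption made for this example; the real code operates on tensors with `masked_fill` and `add_`.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def bias_by_add(scores, invalid):
    # Buggy pattern: the boolean mask is treated as 0/1 and added to the scores.
    return [s + float(flag) for s, flag in zip(scores, invalid)]

def bias_by_masked_fill(scores, invalid):
    # Intended pattern: invalid positions are overwritten with -inf.
    return [float("-inf") if flag else s for s, flag in zip(scores, invalid)]

scores = [1.0, 1.0, 1.0]
invalid = [False, False, True]  # assumed: True marks a padded position

probs_add = softmax(bias_by_add(scores, invalid))
probs_fill = softmax(bias_by_masked_fill(scores, invalid))
```

In this sketch the additive variant gives the padded position the largest weight, whereas the fill variant assigns it zero weight.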

Copilot AI review requested due to automatic review settings April 1, 2026 06:51

Copilot started reviewing on behalf of yangulei April 1, 2026 06:51 View session

Copilot AI reviewed Apr 1, 2026

yangulei added 4 commits April 2, 2026 07:45

Revert "Use Boolean attention mask (vllm-project#1032)"

e3f89b2

This reverts commit 9271c08.

use finite numbers for attention mask value

f3ff913

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

enable slicing for fp8 FusedSDPA

1a004e0

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

fix gc error

6c3c91a

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

yangulei force-pushed the fp8_slice branch from 950d941 to 6c3c91a Compare April 3, 2026 00:18

yangulei marked this pull request as ready for review April 3, 2026 08:37

yangulei requested review from Wei-Lin-Intel, czhu15, mgawarkiewicz-intel, piotrbocian, taotod and wpyszka as code owners April 3, 2026 08:37

github-actions Bot mentioned this pull request Apr 3, 2026

🚦 Team Review Dashboard #701

Open

use detach() instead of clone() to solve the accuracy issue

9f186b8

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

czhu15 approved these changes Apr 8, 2026

czhu15 merged commit f337029 into vllm-project:aice Apr 8, 2026
1 check passed

yangulei deleted the fp8_slice branch April 8, 2026 05:11

yangulei mentioned this pull request Apr 14, 2026

Use finite numbers for the attention mask #1290

Closed

Copilot AI mentioned this pull request May 19, 2026

Fix: restore boolean attention mask handling in _naive_prompt_attention JyhWind/vllm-gaudi#1

Draft

Enable slicing for fp8 FusedSDPA #1285
czhu15 merged 5 commits into vllm-project:aice from yangulei:fp8_slice
