Enable slicing for the FusedSDPA #1155
Merged
✅ CI Passed — All checks passed successfully against the following vllm commit:
Upstreaming the slicing for fp8 FusedSDPA.
@afierka-intel cc: @czhu15
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
FusedSDPA can be split into smaller chunks to improve performance when using the padding-aware bucketing strategy, which guarantees the maximum absolute padding in the sequence and context dimensions.
Usage

| Environment variable | Default / notes |
| --- | --- |
| `VLLM_HPU_FSDPA_SLICE_ENABLED` | `True` for the padding-aware bucketing strategy |
| `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD` | `min(max_num_batched_tokens, 8192)` |
| `VLLM_HPU_FSDPA_SLICE_CHUNK_SIZE` | `q_len` and `kv_len` in each chunk; rounded up to the next multiple of 1024. Defaults to `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD // 2` |
| `VLLM_HPU_FSDPA_SLICE_WITH_GRAPH_BREAKS` | `true` for lazy mode and `false` otherwise |

Important: these parameters are effective only with the padding-aware bucketing strategy set by `VLLM_BUCKETING_STRATEGY="pad"`.

Implementation
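To make the defaults above concrete, here is a minimal sketch of how the effective configuration could be resolved. The helper names (`_env_flag`, `resolve_slice_config`) are hypothetical and the parsing is an illustration of the defaults listed in the table, not the actual upstream code:

```python
import os

def _env_flag(name: str, default: bool) -> bool:
    """Hypothetical helper: parse a boolean environment variable."""
    val = os.environ.get(name)
    if val is None:
        return default
    return val.lower() in ("1", "true", "yes")

def resolve_slice_config(max_num_batched_tokens: int, lazy_mode: bool) -> dict:
    # Defaults as described in this PR; the exact parsing is a sketch.
    seq_len_thld = int(os.environ.get(
        "VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD",
        min(max_num_batched_tokens, 8192)))
    chunk_size = int(os.environ.get(
        "VLLM_HPU_FSDPA_SLICE_CHUNK_SIZE", seq_len_thld // 2))
    # The chunk size is rounded up to the next multiple of 1024.
    chunk_size = (chunk_size + 1023) // 1024 * 1024
    return {
        "enabled": _env_flag("VLLM_HPU_FSDPA_SLICE_ENABLED", True),
        "seq_len_thld": seq_len_thld,
        "chunk_size": chunk_size,
        # Graph breaks default to on in lazy mode, off otherwise.
        "graph_breaks": _env_flag(
            "VLLM_HPU_FSDPA_SLICE_WITH_GRAPH_BREAKS", lazy_mode),
    }
```

For example, with `max_num_batched_tokens=4096` and no environment overrides, the threshold resolves to 4096 and the chunk size to 2048.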
Take a prefix-prefill with `[bs, query, context] = [1, 9037, 8832]` as an example. Bucketing first pads the prefill shape to `[1, 10880, 11008]`, and the attention mask covers this padded shape:

[figure: attention mask over the padded shape, omitted]

The original implementation passes the full attention mask to the FusedSDPA kernel.
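As a quick sanity check on the example shapes, the absolute padding added by bucketing is pure arithmetic on the numbers above:

```python
# Padding introduced by bucketing in the example above.
query, context = 9037, 8832
padded_query, padded_context = 10880, 11008

query_pad = padded_query - query        # padded query tokens
context_pad = padded_context - context  # padded context tokens
```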
This PR introduces an implementation that computes the FSDPA in chunks by slicing `Q`, `K`, and `V` as below:

[figure: chunked Q/K/V slicing, omitted]

The color of each rectangle indicates the `is_causal` and `attn_mask` parameters passed to the FusedSDPA kernel:

- `rgb(255,0,0)` (red): `is_causal=False` and `attn_mask is not None`
- `rgb(255,255,0)` (yellow): `is_causal=True` and `attn_mask=None`
- `rgb(255,0,255)` (magenta): `is_causal=False` and `attn_mask=None`

This way, most chunks call FusedSDPA without an attention mask for better performance, and the graph compiled for a chunk can be reused across different buckets to reduce the warmup duration.
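To illustrate the slicing idea (not the HPU kernel code), the sketch below chunks the query dimension of a prefix-prefill and uses PyTorch's `scaled_dot_product_attention` as a stand-in for FusedSDPA. For clarity it folds the per-tile `is_causal`/`attn_mask` distinction into one explicit boolean mask per chunk; the function name and signature are hypothetical:

```python
import torch
import torch.nn.functional as F

def chunked_prefill_attention(q, k, v, ctx_len, chunk_size):
    """Sketch: prefix-prefill attention computed per query chunk.

    q: [B, H, q_len, D]; k, v: [B, H, ctx_len + q_len, D].
    Each query chunk only sees KV positions up to its own end, so the
    KV tensors can be sliced per chunk as in this PR.
    """
    q_len = q.shape[2]
    outs = []
    for s in range(0, q_len, chunk_size):
        cq = min(chunk_size, q_len - s)
        vis = ctx_len + s + cq  # KV positions visible to this chunk
        # In the PR, the tile over kv[: ctx_len + s] is fully visible
        # (is_causal=False, attn_mask=None) and the diagonal tile runs
        # with is_causal=True; here both are expressed as one mask.
        i = torch.arange(cq).unsqueeze(1)   # query offset within chunk
        j = torch.arange(vis).unsqueeze(0)  # absolute KV position
        mask = j <= (ctx_len + s + i)       # causal w.r.t. the context
        outs.append(F.scaled_dot_product_attention(
            q[:, :, s:s + cq], k[:, :, :vis], v[:, :, :vis],
            attn_mask=mask))
    return torch.cat(outs, dim=2)
```

Under these assumptions the chunked result matches a single full-mask call, which is the correctness property the tiling relies on.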
Dependencies

- `PAD_MAX` for the query and context.

Thanks to @Wei-Lin-Intel for the original idea and for detailing the behavior of the FusedSDPA kernel.