
Enable slicing for the FusedSDPA#1155

Merged
kamil-kaczor merged 40 commits into
vllm-project:mainfrom
yangulei:slice_fsdpa_main
May 11, 2026

Conversation

@yangulei
Collaborator

@yangulei yangulei commented Mar 13, 2026

FusedSDPA can be split into smaller chunks to improve performance when using the padding-aware bucketing strategy, which guarantees a bounded maximum absolute padding in the sequence and context dimensions.

Usage

| Parameter name | Description | Default value |
| --- | --- | --- |
| `VLLM_HPU_FSDPA_SLICE_ENABLED` | Enable the slicing. | `True` for the padding-aware bucketing strategy |
| `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD` | KV length threshold above which slicing is applied. | `min(max_num_batched_tokens, 8192)` |
| `VLLM_HPU_FSDPA_SLICE_CHUNK_SIZE` | Chunk size for q_len and kv_len in each chunk; rounded up to the next multiple of 1024. | `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD // 2` |
| `VLLM_HPU_FSDPA_SLICE_WITH_GRAPH_BREAKS` | Places each chunk in a separate graph to reduce compilation time. | `true` for lazy mode, `false` otherwise |

Important

These parameters are effective only with the padding-aware bucketing strategy set by VLLM_BUCKETING_STRATEGY="pad".
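Based on the defaults listed in the table, the effective threshold and chunk size could be sketched as follows. This is a hypothetical helper for illustration, not code from this PR; the rounding rule follows the table's description:

```python
def fsdpa_slice_defaults(max_num_batched_tokens: int) -> tuple[int, int]:
    # Default KV-length threshold above which slicing kicks in:
    # capped at 8192 per the table.
    seq_len_thld = min(max_num_batched_tokens, 8192)
    # Default chunk size is half the threshold, rounded up
    # to the next multiple of 1024.
    chunk_size = seq_len_thld // 2
    chunk_size = ((chunk_size + 1023) // 1024) * 1024
    return seq_len_thld, chunk_size

print(fsdpa_slice_defaults(16384))  # (8192, 4096)
print(fsdpa_slice_defaults(6000))   # (6000, 3072)
```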

Implementation

Take a prefix-prefill with [bs, query, context] = [1, 9037, 8832] as an example. The bucketing first pads the prefill shape to [1, 10880, 11008]. The attention mask then looks like:

(figure: attention mask for bs=1, seq=9037, context=8832 padded to seq=10880, context=11008)

Note that there is padding in both the query and context dimensions.

The original implementation passes the full attention mask to the FusedSDPA kernel.

This PR introduces an implementation that computes the FusedSDPA in chunks by slicing Q, K, and V as shown below:

(figure: sliced Q, K, and V chunks for the same padded shapes)

The color of each rectangle indicates the is_causal and attn_mask parameters passed to the FusedSDPA kernel:

  • red, rgb(255,0,0): is_causal=False and attn_mask is not None
  • yellow, rgb(255,255,0): is_causal=True and attn_mask=None
  • magenta, rgb(255,0,255): is_causal=False and attn_mask=None

This way, most chunks call FusedSDPA without an attention mask for better performance, and the graph compiled for a chunk may be reused across different buckets to reduce the warmup duration.
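To illustrate how partial attention results over disjoint KV slices can be combined, here is a minimal NumPy sketch (not the PR's HPU implementation): one fully visible context chunk is computed with no mask, one diagonal chunk is computed causally, and the two are merged with an online-softmax combine. The shapes and helper names are illustrative assumptions:

```python
import numpy as np

def full_attn(q, k, v, causal=False):
    """Reference attention over full [s_q, d] x [s_k, d] inputs."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        s_q, s_k = s.shape
        # Causal mask with the diagonal aligned at the end of the keys.
        s = np.where(np.triu(np.ones((s_q, s_k), bool), k=s_k - s_q + 1), -np.inf, s)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def attn_chunk(q, k, v, causal=False):
    """One chunked 'kernel call': unnormalized output plus softmax stats."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        s_q, s_k = s.shape
        s = np.where(np.triu(np.ones((s_q, s_k), bool), k=s_k - s_q + 1), -np.inf, s)
    m = s.max(axis=-1, keepdims=True)          # per-row running max
    w = np.exp(s - m)
    return w @ v, w.sum(axis=-1, keepdims=True), m

def merge(o1, l1, m1, o2, l2, m2):
    """Online-softmax combine of two partials over disjoint KV slices."""
    m = np.maximum(m1, m2)
    a1, a2 = np.exp(m1 - m), np.exp(m2 - m)
    return o1 * a1 + o2 * a2, l1 * a1 + l2 * a2, m

rng = np.random.default_rng(0)
d, ctx, s_q = 8, 16, 12
q = rng.standard_normal((s_q, d))
k = rng.standard_normal((ctx + s_q, d))
v = rng.standard_normal((ctx + s_q, d))

# Chunk 1: q attends to the fully visible context -> no mask needed.
o1, l1, m1 = attn_chunk(q, k[:ctx], v[:ctx])
# Chunk 2: q attends to its own keys -> causal diagonal block.
o2, l2, m2 = attn_chunk(q, k[ctx:], v[ctx:], causal=True)
o, l, _ = merge(o1, l1, m1, o2, l2, m2)
out_chunked = o / l

out_ref = full_attn(q, k, v, causal=True)
assert np.allclose(out_chunked, out_ref)
```

The merge rescales each partial by the difference between its local row max and the global one, so normalizing by the combined sum reproduces the softmax over the concatenated keys exactly.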

Dependencies


Thanks @Wei-Lin-Intel for the original idea and the detailed behavior of the FusedSDPA kernel.


@yangulei yangulei requested a review from Copilot March 13, 2026 07:00


@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
f296a1966dca96cd69e5c1fa1264edbf611a1bd6

@yangulei
Collaborator Author

yangulei commented Apr 9, 2026

Upstreaming the slicing for fp8 FusedSDPA.

@yangulei yangulei changed the title Enable slicing for the BF16 FusedSDPA Enable slicing for the FusedSDPA Apr 13, 2026
@yangulei yangulei marked this pull request as ready for review April 13, 2026 06:10
@yangulei
Collaborator Author

@afierka-intel
This PR has been re-implemented to enable slicing for both the bf16 and fp8 FusedSDPA. Please help review.

cc: @czhu15

@yangulei yangulei requested a review from Copilot April 13, 2026 06:32
@vllm-project vllm-project deleted a comment from github-actions Bot Apr 13, 2026
@vllm-project vllm-project deleted a comment from github-actions Bot Apr 13, 2026
yangulei added 21 commits May 9, 2026 01:02
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
@yangulei yangulei force-pushed the slice_fsdpa_main branch from d7d8231 to db7e4c5 Compare May 9, 2026 01:21
@yangulei yangulei force-pushed the slice_fsdpa_main branch 2 times, most recently from 2eac778 to 4d9e65f Compare May 9, 2026 01:32
yangulei added 2 commits May 9, 2026 04:58
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
8eb401134e750781a202c0b6dc4059616cdb4954

@kamil-kaczor kamil-kaczor merged commit 3a0e975 into vllm-project:main May 11, 2026
2 checks passed
@yangulei yangulei deleted the slice_fsdpa_main branch May 12, 2026 00:11
5 participants