Enable slicing for the BF16 FusedSDPA (#1034)
Conversation
Pull request overview
This PR adds an optional “sliced” forward path for the HPU BF16 FusedSDPA kernel to improve long-context performance when using linear bucketing, and updates bucketing padding defaults/documentation to better align with attention-mask usage.
Changes:
- Add a chunked/sliced FusedSDPA forward path (with optional graph breaks) for the causal + attention-mask case.
- Adjust linear bucketing defaults for `*_BUCKET_PAD_MAX` (reduce default absolute padding).
- Update environment variable documentation and remove redundant causal+attn_bias handling from the ops wrapper.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `vllm_gaudi/extension/utils.py` | Introduces sliced FusedSDPA execution plus slicing setup/config via env vars. |
| `vllm_gaudi/extension/ops.py` | Removes local workaround; relies on ModuleFusedSDPA behavior. |
| `vllm_gaudi/extension/bucketing/linear.py` | Changes default pad_max values to reduce padding. |
| `docs/configuration/env_variables.md` | Documents updated defaults and adds new FusedSDPA slicing tuning parameters. |
taotod
left a comment
LGTM, please fix the typos that Copilot found or suggested.
@yupengzh-intel @testdig
```python
        f'falling back to default {qkv_chunk_size_default}.')
    qkv_chunk_size = qkv_chunk_size_default
if qkv_chunk_size % 1024 != 0:
    qkv_chunk_size = (qkv_chunk_size + 1023) // 1024 * 1024
```
Would `qkv_chunk_size = math.ceil(qkv_chunk_size / 1024) * 1024` be easier to read?
That would require `qkv_chunk_size = int(math.ceil(qkv_chunk_size / 1024)) * 1024` and an extra `import math`. I prefer the integer arithmetic here.
I rechecked this and found that `math.ceil()` already returns an `int`, so I now use it for all the ceiling operations in the code.
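For reference, the two rounding styles debated above are equivalent for positive integers; a quick self-contained check (function names are illustrative, not from the PR):

```python
import math

def round_up_int(x: int, multiple: int = 1024) -> int:
    # Integer-arithmetic rounding, as in the PR's original code.
    return (x + multiple - 1) // multiple * multiple

def round_up_ceil(x: int, multiple: int = 1024) -> int:
    # math.ceil already returns an int in Python 3, so no int() cast is needed.
    return math.ceil(x / multiple) * multiple

for x in (1, 1023, 1024, 1025, 4096, 5000):
    assert round_up_int(x) == round_up_ceil(x)

print(round_up_int(5000))  # -> 5120
```

One practical argument for the integer form: `x / multiple` goes through float division, which can lose precision for very large integers, while `//` stays exact.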
| Parameter name | Description | Default value |
| --- | --- | --- |
| `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD` | KV length threshold above which slicing is enabled. Set to `-1` to disable slicing. | `min(max_num_batched_tokens, 8192)` |
| `VLLM_HPU_FSDPA_SLICE_CHUNK_SIZE` | Chunk size for `q_len` and `kv_len` in each chunk. Rounded up to the nearest multiple of 1024. | `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD // 2` |
Rounded up to the "next" multiple reads better than "nearest".
Will do, thank you.
| Parameter name | Description | Default value |
| --- | --- | --- |
| `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD` | KV length threshold above which slicing is enabled. Set to `-1` to disable slicing. | `min(max_num_batched_tokens, 8192)` |
Does it mean "enable slicing when the KV length exceeds this threshold"?
The enabled/disabled wording here is ambiguous. Setting `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD` to a negative value disables slicing entirely, while setting it to a valid value enables slicing, and slicing is then applied only when the KV length exceeds this threshold.
I will change "enabled" to "applied".
```python
    return False

max_num_batched_tokens = bucketing_manager.max_num_batched_tokens
qkv_slice_thld_default = min(max_num_batched_tokens, 8192)
```
In the case where `max_num_batched_tokens` is 8k or 16k and I set `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD` to 4k, slicing cannot be enabled. Is that by design?
`qkv_slice_thld` will be reset to `min(max_num_batched_tokens, 8192) = 8192` in those cases. The optimization only yields a performance gain for `kv_len >= (1 + num_padded_ctx_chunks) * max_num_batched_tokens`, so setting `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD` lower than `max_num_batched_tokens` would hurt performance.
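The clamping behavior described in this reply could be sketched as a small config-resolution helper (the function name and the env-var parsing are illustrative assumptions, not the PR's actual implementation):

```python
import os

def resolve_slice_thld(max_num_batched_tokens: int) -> int:
    """Illustrative sketch: resolve the slicing threshold from the env var.

    Returns -1 when slicing is disabled.
    """
    default = min(max_num_batched_tokens, 8192)
    raw = int(os.environ.get('VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD', default))
    if raw < 0:
        return -1  # a negative value disables slicing entirely
    # Values below the default hurt performance, so reset them to the default.
    return max(raw, default)
```

With `max_num_batched_tokens = 8192` and the env var set to 4096, the threshold is reset to 8192, matching the behavior described in the reply.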
```python
bs = query.shape[0]
q_len = query.shape[-2]
kv_len = key.shape[-2]
if (self.enable_slicing and bs == 1 and q_len != kv_len and kv_len >= self.qkv_slice_thld and is_causal
```
If slicing works only when `bs == 1`, shall we document it?
`bs == 1` is always satisfied for prefills with context. Actually, I'm not sure whether prefills with context even work for `bs > 1`.
```
| Prompt | query length step (`VLLM_PROMPT_QUERY_BUCKET_STEP`) | `block_size` |
| Prompt | query length max (`VLLM_PROMPT_QUERY_BUCKET_MAX`) | `max_num_batched_tokens` |
| Prompt | query length max abs padding (`VLLM_PROMPT_QUERY_BUCKET_PAD_MAX`) | `max_num_batched_tokens` |
| Prompt | query length max abs padding (`VLLM_PROMPT_QUERY_BUCKET_PAD_MAX`) | `max_num_batched_tokens // 4` |
```
IMHO, it would be better to change the two default max padding values in a standalone PR.
Yes, but `max_query_pad` and `max_ctx_pad` share the same env vars and default values in linear bucketing, so they have to be modified simultaneously anyway.
In fact, the `prompt_buckets` have not been generated yet when the `__init__()` of FusedSDPA is called, so we have to derive the maximum possible padding from the configuration instead of from the actual buckets.
czhu15
left a comment
A very important PR to further improve performance on Gaudi.
Some of my comments were made on the old version of the PR; please kindly check whether they still apply to the new one.
```
| `VLLM_HPU_FSDPA_SLICE_WITH_GRAPH_BREAKS` | Places each chunk in a separate graph to reduce compilation time. | `true` |

!!! note
    These parameters are effective only with the linear bucketing strategy, where the max absolute padding for `query` and `context` determines attention-mask usage.
```
What is the meaning of "where the max absolute padding for query and context determines attention-mask usage"? If it is purely an internal implementation detail, I suggest not exposing it to external users in this document.
The max absolute padding for the query and context determines the chunks for which FusedSDPA has to be called with an attention mask to handle the padding in them.
It is a note message: what should the user do when enabling slicing with the linear bucketing strategy?
To me, "where the max absolute padding for query and context determines attention-mask usage" is a pure implementation detail, and I guess most users won't understand it :) I would suggest removing it from the readme file; keeping only "These parameters are effective only with the linear bucketing strategy" is enough.
I see, will make it clearer, thanks.
```python
bs = query.shape[0]
q_len = query.shape[-2]
kv_len = key.shape[-2]
if (self.enable_slicing and bs == 1 and q_len != kv_len and kv_len >= self.qkv_slice_thld and is_causal
```
There are too many checks in `self.enable_slicing and bs == 1 and q_len != kv_len and kv_len >= self.qkv_slice_thld and is_causal ...`; it may be better to wrap them in an internal function for readability.
BTW, why can't the sliced FSDPA be used when `q_len == kv_len`?
I will add comments here to make it clearer.
For `q_len == kv_len` it's a normal causal FSDPA, and the default call is the most efficient path: it calls FusedSDPA with `is_causal=True` and `valid_seq_length`, avoiding the re-scaling overhead.
Got it, thanks for the explanation.
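The refactor suggested in this thread (wrapping the eligibility checks in one helper) might look roughly like the following sketch; `should_use_sliced_path` and the flat `cfg` object are illustrative, not the PR's actual code:

```python
from types import SimpleNamespace

def should_use_sliced_path(cfg, query_shape, key_shape, is_causal: bool) -> bool:
    """Group the slicing eligibility checks in one readable place.

    q_len != kv_len selects the prefill-with-context case; the plain causal
    prefill (q_len == kv_len) stays on the default FusedSDPA path, which is
    already efficient with is_causal=True and valid_seq_length.
    """
    bs = query_shape[0]
    q_len = query_shape[-2]
    kv_len = key_shape[-2]
    return bool(cfg.enable_slicing and bs == 1 and q_len != kv_len
                and kv_len >= cfg.qkv_slice_thld and is_causal)

cfg = SimpleNamespace(enable_slicing=True, qkv_slice_thld=8192)
print(should_use_sliced_path(cfg, (1, 8, 2048, 128), (1, 8, 16384, 128), True))  # -> True
```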
```python
        last_m = new_m

    if self.with_graph_breaks:
        self.break_graph()
```
```
| Prompt | batch size step (`VLLM_PROMPT_BS_BUCKET_STEP`) | `1` |
| Prompt | batch size max (`VLLM_PROMPT_BS_BUCKET_MAX`) | `max_num_prefill_seqs` |
| Prompt | batch size max abs padding (`VLLM_PROMPT_BS_BUCKET_PAD_MAX`) | `16` |
| Prompt | batch size max abs padding (`VLLM_PROMPT_BS_BUCKET_PAD_MAX`) | `max_num_prefill_seqs / 4` |
```
Why does slicing change the prompt batch-size padding?
It's not for slicing. I aligned `*_BUCKET_PAD_MAX = math.ceil(BUCKET_MAX / 4)` for all the possible configurations.
```
| Prompt | sequence ctx max (`VLLM_PROMPT_CTX_BUCKET_MAX`) | `(max_model_len - block_size) // block_size` |
| Prompt | sequence ctx max abs padding (`VLLM_PROMPT_CTX_BUCKET_PAD_MAX`) | `max_num_batched_tokens // block_size` |
| Prompt | sequence ctx step (`VLLM_PROMPT_CTX_BUCKET_STEP`) | `2` |
| Prompt | sequence ctx max (`VLLM_PROMPT_CTX_BUCKET_MAX`) | `(max_model_len - block_size) / block_size` |
```
`//` should be correct here, not `/`.
It's actually `math.ceil((max_model_len - block_size) / block_size)`. Do you think we should submit the bucketing changes in a separate PR?
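The difference matters whenever `max_model_len - block_size` is not a multiple of `block_size`: floor division drops the partial tail bucket, while the ceiling covers it. A quick check with made-up numbers:

```python
import math

max_model_len, block_size = 2050, 128
tokens = max_model_len - block_size          # 1922
floored = tokens // block_size               # floor division drops the remainder
ceiled = math.ceil(tokens / block_size)      # one extra bucket covers the tail

print(floored, ceiled)  # -> 15 16
```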
Yes, I prefer separate PRs for the bucketing optimization; otherwise this PR contains too many changes unrelated to the slicing feature.
```
| Prompt | sequence ctx max padding percent (`VLLM_PROMPT_CTX_BUCKET_PAD_PERCENT`) | `25` |
| Decode | batch size min (`VLLM_DECODE_BS_BUCKET_MIN`) | `1` |
| Decode | batch size step (`VLLM_DECODE_BS_BUCKET_STEP`) | `32` |
| Decode | batch size step (`VLLM_DECODE_BS_BUCKET_STEP`) | `2` |
```
Why change `VLLM_DECODE_BS_BUCKET_STEP` to such a small value?
The previous step of 32 introduced too much padding in some cases with max-concurrency < 32. The value in the code was already changed to 2 in the PR for linear bucketing with limits, while the doc here was unintentionally left unchanged.
```
| Decode | block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `max_model_len * max_num_seqs // block_size` by default, or `max_blocks` if `VLLM_CONTIGUOUS_PA = True` |
| Decode | block size max abs padding (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`) | `max_num_batched_tokens * max_num_seqs // block_size` |
| Decode | block size max (`VLLM_DECODE_BLOCK_BCUKET_MAX`) | `max_model_len * max_num_seqs / block_size` by default, or `max_blocks` if `VLLM_CONTIGUOUS_PA = True` |
| Decode | block size max abs padding (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`) | `VLLM_DECODE_BLOCK_BCUKET_MAX / 4` |
```
```diff
     pad_max=math.ceil(max_decode_blocks / 4),
     pad_percent=25)
-if decode_block_bucket_cfg[2] > max_blocks:
+if contiguous_pa and decode_block_bucket_cfg[2] > max_blocks:
```
Why is the `contiguous_pa` check needed? Does the slice feature depend on contiguous PA?
`decode_block_bucket_cfg[2]` is `VLLM_DECODE_BLOCK_BUCKET_MAX`, and `VLLM_DECODE_BLOCK_BUCKET_MAX <= max_blocks` is guaranteed only for contiguous PA. The original code could leave some shapes not warmed up in the non-contiguous PA case.
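Assuming the guarded branch clamps the configured max down to `max_blocks` (the hunk above only shows the condition, not the body), the reasoning in this thread could be sketched like this; the function name and the clamping body are assumptions:

```python
def clamp_decode_block_max(bucket_cfg, max_blocks, contiguous_pa):
    """Illustrative sketch: clamp the decode block-bucket max to max_blocks
    only under contiguous PA, where bucket_max <= max_blocks must hold.
    Clamping unconditionally could shrink buckets below shapes actually
    seen at runtime without contiguous PA, leaving them not warmed up.

    bucket_cfg is assumed to be a (min, step, max) triple.
    """
    cfg = list(bucket_cfg)
    if contiguous_pa and cfg[2] > max_blocks:
        cfg[2] = max_blocks
    return tuple(cfg)

print(clamp_decode_block_max((128, 128, 4096), 2048, True))   # -> (128, 128, 2048)
print(clamp_decode_block_max((128, 128, 4096), 2048, False))  # -> (128, 128, 4096)
```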
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Note that this PR depends on:
- the **Boolean** attention mask introduced by vllm-project#1032 to get valid `m` and `linv` from the FusedSDPA kernel,
- the default query/ctx bucketing config modified in vllm-project#1086

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Port of PR vllm-project#1034 from the aice branch to main. Splits the FusedSDPA kernel into smaller chunks for long sequences to:
- Fit chunks into SRAM for better performance
- Improve TPC/MME pipelining
- Reduce attention-mask usage for padded regions

New env vars:
- VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD: KV length threshold for slicing
- VLLM_HPU_FSDPA_SLICE_CHUNK_SIZE: chunk size (rounded to 1024)
- VLLM_HPU_FSDPA_SLICE_WITH_GRAPH_BREAKS: graph break control

Only active with the linear bucketing strategy and boolean attention masks.
Depends on: Boolean attention mask (port of vllm-project#1032)
Ref: GAUDISW-245533

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: yangulei <24203353+yangulei@users.noreply.github.com>