refine linear bucketing defaults #1086
Conversation
Pull request overview
Updates the Gaudi linear bucketing strategy defaults and configuration handling to better scale bucket padding with workload size, while aligning documentation with the new defaults.
Changes:
- Refine linear bucketing defaults (`steps`, `pad_max`, and several max calculations) for prompt/decode configs.
- Adjust decode config checking to only clamp `DECODE_BLOCK_BUCKET_MAX` when contiguous PA is enabled, and fix `pad_max == 0` handling in warmup range generation.
- Update environment-variable documentation to reflect the revised defaults.
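The `pad_max == 0` edge case can be illustrated with a small standalone helper. This is a hedged sketch, not the actual `vllm_gaudi` warmup code: `padded_max` and its padding rule (absolute cap combined with a percentage cap) are hypothetical names modeled on the `pad_max`/`pad_percent` settings visible in the diffs below.

```python
# Hedged sketch of the pad_max == 0 edge case in warmup range generation:
# with an absolute padding cap of zero, no padding beyond the bucket max
# should be applied. Names are illustrative, not the real implementation.
def padded_max(bucket_max, pad_max, pad_percent):
    if pad_max == 0:
        return bucket_max  # zero cap means no extra padding at all
    # Otherwise pad by the smaller of the absolute and percentage caps.
    extra = min(pad_max, bucket_max * pad_percent // 100)
    return bucket_max + extra

assert padded_max(1024, 0, 25) == 1024    # pad_max == 0: unchanged
assert padded_max(1024, 128, 25) == 1152  # absolute cap binds
assert padded_max(1024, 512, 25) == 1280  # percentage cap binds
```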
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `vllm_gaudi/extension/bucketing/linear.py` | Updates default bucket config calculations, contiguous-PA-specific clamping, and warmup padding edge-case handling. |
| `docs/configuration/env_variables.md` | Updates documented defaults for linear bucketing env vars to match current behavior. |
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
```diff
  pad_max=math.ceil(max_decode_blocks / 4),
  pad_percent=25)
- if decode_block_bucket_cfg[2] > max_blocks:
+ if contiguous_pa and decode_block_bucket_cfg[2] > max_blocks:
```
decode_block_bucket_cfg[2] (max) is only capped to max_blocks when contiguous_pa is true. In non-contiguous PA mode this can leave VLLM_DECODE_BLOCK_BUCKET_MAX above the KV cache capacity (num_hpu_blocks), which can lead to warmup generating decode buckets with num_blocks larger than what’s allocated. Consider capping decode_block_bucket_cfg[2] to max_blocks regardless of contiguous_pa (similar to how ExponentialBucketingStrategy.get_decode_cfgs uses max_blocks even when not contiguous) or adding an explicit guard that decode num_blocks never exceeds max_blocks.
Suggested change:

```diff
- if contiguous_pa and decode_block_bucket_cfg[2] > max_blocks:
+ if decode_block_bucket_cfg[2] > max_blocks:
```
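The unconditional clamp the reviewer proposes can be sketched as a standalone helper. This is a hedged illustration under assumed names: `decode_block_bucket_cfg` is treated as a `(min, step, max)` tuple as in the diff above, and `max_blocks` as the allocated KV-cache block count; the real config handling in `linear.py` may differ.

```python
# Hedged sketch: cap the decode block-bucket max at KV-cache capacity
# regardless of contiguous PA, so warmup never generates decode buckets
# with more blocks than are allocated. Names mirror the diff above.
def clamp_decode_block_max(decode_block_bucket_cfg, max_blocks):
    lo, step, hi = decode_block_bucket_cfg
    return (lo, step, min(hi, max_blocks))

# Capacity below the configured max: the max is clamped.
assert clamp_decode_block_max((128, 128, 4096), 2048) == (128, 128, 2048)
# Ample capacity: the config is left unchanged.
assert clamp_decode_block_max((128, 128, 4096), 8192) == (128, 128, 4096)
```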
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
```diff
  pad_max=math.ceil(max_num_batched_tokens / 4),
  pad_percent=25)
- max_ctx = math.ceil((max_model_len - prompt_query_bucket_cfg[0]) // block_size)
+ max_ctx = math.ceil((max_model_len - prompt_query_bucket_cfg[0]) / block_size)
```
max_ctx is computed with ceil((max_model_len - prompt_query_bucket_cfg[0]) / block_size), which can overestimate by 1 when max_model_len isn’t divisible by block_size. That produces a ctx bucket where query + ctx * block_size > max_model_len, which is then filtered out later but still expands the warmup range unnecessarily. Use floor-division for the bound (and clamp at 0) so the configured max ctx is always feasible for the minimum query length.
Suggested change:

```diff
- max_ctx = math.ceil((max_model_len - prompt_query_bucket_cfg[0]) / block_size)
+ available_ctx_tokens = max_model_len - prompt_query_bucket_cfg[0]
+ max_ctx = max(0, available_ctx_tokens // block_size)
```
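The overestimate the comment describes is easy to demonstrate with concrete numbers. The values below are illustrative (not taken from the PR); `min_query` stands in for `prompt_query_bucket_cfg[0]`.

```python
import math

# Illustrative numbers: max_model_len not divisible by block_size.
block_size = 128
max_model_len = 2050
min_query = 128  # stand-in for prompt_query_bucket_cfg[0]

available_ctx_tokens = max_model_len - min_query          # 1922
ceil_ctx = math.ceil(available_ctx_tokens / block_size)   # 16
floor_ctx = max(0, available_ctx_tokens // block_size)    # 15

# ceil overshoots by one block: 128 + 16*128 = 2176 > 2050
assert min_query + ceil_ctx * block_size > max_model_len
# floor-division stays feasible: 128 + 15*128 = 2048 <= 2050
assert min_query + floor_ctx * block_size <= max_model_len
```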
| Prompt | sequence ctx max padding percent (`VLLM_PROMPT_CTX_BUCKET_PAD_PERCENT`) | `25` |
| Decode | batch size min (`VLLM_DECODE_BS_BUCKET_MIN`) | `1` |
| Decode | batch size step (`VLLM_DECODE_BS_BUCKET_STEP`) | `2` |
| Decode | batch size max (`VLLM_DECODE_BS_BUCKET_MAX`) | `max_num_seqs` |
| Decode | batch size max abs padding (`VLLM_DECODE_BS_BUCKET_PAD_MAX`) | `ceil(max_num_seqs / 4)` |
| Decode | batch size max padding percent (`VLLM_DECODE_BS_BUCKET_PAD_PERCENT`) | `25` |
| Decode | block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`) | `block_size` |
| Decode | block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`) | `block_size` |
| Decode | block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `ceil(max_model_len * max_num_seqs / block_size)` <br>by default or `max_blocks` <br>if `VLLM_CONTIGUOUS_PA = True` |
| Decode | block size max abs padding (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`) | `ceil(VLLM_DECODE_BLOCK_BUCKET_MAX / 4)` |
| Decode | block size max padding percent (`VLLM_DECODE_BLOCK_BUCKET_PAD_PERCENT`) | `25` |
The updated defaults describe DECODE BLOCK buckets in terms of block_size (tokens-per-block), but the runtime API uses this dimension as a number of blocks (e.g., callers pass len(block_list) / sum(num_blocks)). This makes the table misleading and will be wrong if the code is corrected to keep everything in block-count units. Please clarify the unit in the docs (blocks vs tokens) and align the default expressions accordingly.
Suggested change:

```diff
  | Prompt | sequence ctx max padding percent (`VLLM_PROMPT_CTX_BUCKET_PAD_PERCENT`) | `25` |
  | Decode | batch size min (`VLLM_DECODE_BS_BUCKET_MIN`) | `1` |
  | Decode | batch size step (`VLLM_DECODE_BS_BUCKET_STEP`) | `2` |
  | Decode | batch size max (`VLLM_DECODE_BS_BUCKET_MAX`) | `max_num_seqs` |
  | Decode | batch size max abs padding (`VLLM_DECODE_BS_BUCKET_PAD_MAX`) | `ceil(max_num_seqs / 4)` |
  | Decode | batch size max padding percent (`VLLM_DECODE_BS_BUCKET_PAD_PERCENT`) | `25` |
- | Decode | block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`) | `block_size` |
- | Decode | block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`) | `block_size` |
- | Decode | block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `ceil(max_model_len * max_num_seqs / block_size)` <br>by default or `max_blocks` <br>if `VLLM_CONTIGUOUS_PA = True` |
- | Decode | block size max abs padding (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`) | `ceil(VLLM_DECODE_BLOCK_BUCKET_MAX / 4)` |
- | Decode | block size max padding percent (`VLLM_DECODE_BLOCK_BUCKET_PAD_PERCENT`) | `25` |
+ | Decode | block count min, in blocks (`VLLM_DECODE_BLOCK_BUCKET_MIN`) | `1` |
+ | Decode | block count step, in blocks (`VLLM_DECODE_BLOCK_BUCKET_STEP`) | `1` |
+ | Decode | block count max, in blocks (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `ceil(max_model_len * max_num_seqs / block_size)` <br>by default or `max_blocks` <br>if `VLLM_CONTIGUOUS_PA = True` |
+ | Decode | block count max abs padding, in blocks (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`) | `ceil(VLLM_DECODE_BLOCK_BUCKET_MAX / 4)` |
+ | Decode | block count max padding percent (`VLLM_DECODE_BLOCK_BUCKET_PAD_PERCENT`) | `25` |
```
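The blocks-vs-tokens distinction the comment raises can be checked with the table's own default expression. The numbers below are hypothetical, chosen only to show that the formula yields a block count rather than a token count.

```python
import math

# Hypothetical values to show the block-count arithmetic from the table.
max_model_len, max_num_seqs, block_size = 4096, 64, 128

# Default DECODE_BLOCK_BUCKET_MAX: measured in KV-cache blocks, not tokens.
decode_block_max = math.ceil(max_model_len * max_num_seqs / block_size)
assert decode_block_max == 2048  # 4096 * 64 / 128 blocks

# The default abs padding cap derives from that block count.
assert math.ceil(decode_block_max / 4) == 512
```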
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Note that this PR depends on:
- the **Boolean** attention mask introduced by vllm-project#1032 to get valid `m` and `linv` from the FusedSDPA kernel,
- the default query/ctx bucketing config modified in vllm-project#1086

Signed-off-by: Youlei Yang <youlei.yang@intel.com>