
refine linear bucketing defaults #1086

Merged: czhu15 merged 1 commit into vllm-project:aice from yangulei:lin_bkt on Mar 5, 2026
Conversation

@yangulei (Collaborator) commented Mar 4, 2026

No description provided.

Copilot AI (Contributor) left a comment

Pull request overview

Updates the Gaudi linear bucketing strategy defaults and configuration handling to better scale bucket padding with workload size, while aligning documentation with the new defaults.

Changes:

  • Refine linear bucketing defaults (steps, pad_max, and several max calculations) for prompt/decode configs.
  • Adjust decode config checking to only clamp DECODE_BLOCK_BUCKET_MAX when contiguous PA is enabled, and fix pad_max==0 handling in warmup range generation.
  • Update environment-variable documentation to reflect the revised defaults.
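The `pad_max==0` edge case mentioned above can be sketched with a hypothetical helper (`warmup_pad`, its signature, and its semantics are assumptions for illustration; the actual warmup-range code in `linear.py` differs):

```python
def warmup_pad(value: int, pad_max: int, pad_percent: int = 25) -> int:
    """Illustrative sketch, not the vllm-gaudi implementation: pad a bucket
    boundary by pad_percent of its value, capped by an absolute pad_max.
    Here pad_max == 0 is read as "no absolute cap" rather than zero padding."""
    pad = value * pad_percent // 100
    if pad_max > 0:
        pad = min(pad, pad_max)
    return value + pad

print(warmup_pad(1000, pad_max=100))  # percent pad 250, capped at 100 -> 1100
print(warmup_pad(1000, pad_max=0))    # no absolute cap -> 1250
```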

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `vllm_gaudi/extension/bucketing/linear.py` | Updates default bucket config calculations, contiguous-PA-specific clamping, and warmup padding edge-case handling. |
| `docs/configuration/env_variables.md` | Updates documented defaults for linear bucketing env vars to match current behavior. |

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.



```diff
  pad_max=math.ceil(max_decode_blocks / 4),
  pad_percent=25)
- if decode_block_bucket_cfg[2] > max_blocks:
+ if contiguous_pa and decode_block_bucket_cfg[2] > max_blocks:
```
Copilot AI commented Mar 4, 2026

decode_block_bucket_cfg[2] (max) is only capped to max_blocks when contiguous_pa is true. In non-contiguous PA mode this can leave VLLM_DECODE_BLOCK_BUCKET_MAX above the KV cache capacity (num_hpu_blocks), which can lead to warmup generating decode buckets with num_blocks larger than what’s allocated. Consider capping decode_block_bucket_cfg[2] to max_blocks regardless of contiguous_pa (similar to how ExponentialBucketingStrategy.get_decode_cfgs uses max_blocks even when not contiguous) or adding an explicit guard that decode num_blocks never exceeds max_blocks.

Suggested change:

```diff
- if contiguous_pa and decode_block_bucket_cfg[2] > max_blocks:
+ if decode_block_bucket_cfg[2] > max_blocks:
```
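A minimal sketch of the unconditional clamp the reviewer is asking for (the numbers and the `clamp_bucket_max` helper are illustrative assumptions, not the codebase's API):

```python
def clamp_bucket_max(bucket_max: int, max_blocks: int) -> int:
    # Cap the decode block-bucket max to the KV cache capacity so warmup
    # never generates decode buckets with more blocks than are allocated.
    return min(bucket_max, max_blocks)

# Hypothetical numbers: the default ceil(max_model_len * max_num_seqs / block_size)
# can exceed the allocated num_hpu_blocks when KV cache memory is the
# binding constraint, regardless of whether contiguous PA is enabled.
print(clamp_bucket_max(8192, 4096))  # -> 4096 (clamped to capacity)
print(clamp_bucket_max(2048, 4096))  # -> 2048 (already within capacity)
```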

@yangulei yangulei changed the title refine linear bucketing defaults and fix config checking refine linear bucketing defaults Mar 4, 2026
@yangulei yangulei requested a review from Copilot March 4, 2026 04:02
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.



```diff
  pad_max=math.ceil(max_num_batched_tokens / 4),
  pad_percent=25)
- max_ctx = math.ceil((max_model_len - prompt_query_bucket_cfg[0]) // block_size)
+ max_ctx = math.ceil((max_model_len - prompt_query_bucket_cfg[0]) / block_size)
```
Copilot AI commented Mar 4, 2026

max_ctx is computed with ceil((max_model_len - prompt_query_bucket_cfg[0]) / block_size), which can overestimate by 1 when max_model_len isn’t divisible by block_size. That produces a ctx bucket where query + ctx * block_size > max_model_len, which is then filtered out later but still expands the warmup range unnecessarily. Use floor-division for the bound (and clamp at 0) so the configured max ctx is always feasible for the minimum query length.

Suggested change:

```diff
- max_ctx = math.ceil((max_model_len - prompt_query_bucket_cfg[0]) / block_size)
+ available_ctx_tokens = max_model_len - prompt_query_bucket_cfg[0]
+ max_ctx = max(0, available_ctx_tokens // block_size)
```
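The off-by-one the reviewer describes is easy to reproduce with hypothetical numbers (the values below are illustrative, not taken from the PR):

```python
import math

block_size = 128
max_model_len = 2050   # hypothetical: not divisible by block_size
min_query = 1          # hypothetical prompt_query_bucket_cfg[0]

available = max_model_len - min_query          # 2049 tokens left for context
ceil_ctx = math.ceil(available / block_size)   # 17; 17 * 128 = 2176 > 2049, infeasible
floor_ctx = max(0, available // block_size)    # 16; 16 * 128 = 2048 <= 2049, feasible
print(ceil_ctx, floor_ctx)  # -> 17 16
```

With ceiling division, the topmost ctx bucket can never actually be scheduled for the minimum query length, so it only inflates the warmup range; floor division keeps every configured bucket feasible.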

Comment on lines +79 to +89
| Prompt | sequence ctx max padding percent (`VLLM_PROMPT_CTX_BUCKET_PAD_PERCENT`) | `25` |
| Decode | batch size min (`VLLM_DECODE_BS_BUCKET_MIN`) | `1` |
| Decode | batch size step (`VLLM_DECODE_BS_BUCKET_STEP`) | `2` |
| Decode | batch size max (`VLLM_DECODE_BS_BUCKET_MAX`) | `max_num_seqs` |
| Decode | batch size max abs padding (`VLLM_DECODE_BS_BUCKET_PAD_MAX`) | `ceil(max_num_seqs / 4)` |
| Decode | batch size max padding percent (`VLLM_DECODE_BS_BUCKET_PAD_PERCENT`) | `25` |
| Decode | block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`) | `block_size` |
| Decode | block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`) | `block_size` |
| Decode | block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `ceil(max_model_len * max_num_seqs / block_size)` <br>by default or `max_blocks` <br>if `VLLM_CONTIGUOUS_PA = True` |
| Decode | block size max abs padding (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`) | `ceil(VLLM_DECODE_BLOCK_BUCKET_MAX / 4)` |
| Decode | block size max padding percent (`VLLM_DECODE_BLOCK_BUCKET_PAD_PERCENT`) | `25` |
Copilot AI commented Mar 4, 2026

The updated defaults describe DECODE BLOCK buckets in terms of block_size (tokens-per-block), but the runtime API uses this dimension as a number of blocks (e.g., callers pass len(block_list) / sum(num_blocks)). This makes the table misleading and will be wrong if the code is corrected to keep everything in block-count units. Please clarify the unit in the docs (blocks vs tokens) and align the default expressions accordingly.

Suggested change:

```diff
 | Prompt | sequence ctx max padding percent (`VLLM_PROMPT_CTX_BUCKET_PAD_PERCENT`) | `25` |
 | Decode | batch size min (`VLLM_DECODE_BS_BUCKET_MIN`) | `1` |
 | Decode | batch size step (`VLLM_DECODE_BS_BUCKET_STEP`) | `2` |
 | Decode | batch size max (`VLLM_DECODE_BS_BUCKET_MAX`) | `max_num_seqs` |
 | Decode | batch size max abs padding (`VLLM_DECODE_BS_BUCKET_PAD_MAX`) | `ceil(max_num_seqs / 4)` |
 | Decode | batch size max padding percent (`VLLM_DECODE_BS_BUCKET_PAD_PERCENT`) | `25` |
-| Decode | block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`) | `block_size` |
-| Decode | block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`) | `block_size` |
-| Decode | block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `ceil(max_model_len * max_num_seqs / block_size)` <br>by default or `max_blocks` <br>if `VLLM_CONTIGUOUS_PA = True` |
-| Decode | block size max abs padding (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`) | `ceil(VLLM_DECODE_BLOCK_BUCKET_MAX / 4)` |
-| Decode | block size max padding percent (`VLLM_DECODE_BLOCK_BUCKET_PAD_PERCENT`) | `25` |
+| Decode | block count min, in blocks (`VLLM_DECODE_BLOCK_BUCKET_MIN`) | `1` |
+| Decode | block count step, in blocks (`VLLM_DECODE_BLOCK_BUCKET_STEP`) | `1` |
+| Decode | block count max, in blocks (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `ceil(max_model_len * max_num_seqs / block_size)` <br>by default or `max_blocks` <br>if `VLLM_CONTIGUOUS_PA = True` |
+| Decode | block count max abs padding, in blocks (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`) | `ceil(VLLM_DECODE_BLOCK_BUCKET_MAX / 4)` |
+| Decode | block count max padding percent (`VLLM_DECODE_BLOCK_BUCKET_PAD_PERCENT`) | `25` |
```
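For intuition, resolving env-var defaults like the ones in the table might look like the following sketch (`env_int`, the variable names, and the workload parameters are assumptions for illustration; the real config handling in `linear.py` differs):

```python
import math
import os

def env_int(name: str, default: int) -> int:
    # Read an integer env var, falling back to a computed default.
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default

# Hypothetical workload parameters.
max_model_len, max_num_seqs, block_size = 4096, 128, 128

# Default block-bucket max scales with workload size; the pad default is
# then derived from whatever max value is in effect.
default_max = math.ceil(max_model_len * max_num_seqs / block_size)
bucket_max = env_int("VLLM_DECODE_BLOCK_BUCKET_MAX", default_max)
pad_max = env_int("VLLM_DECODE_BLOCK_BUCKET_PAD_MAX", math.ceil(bucket_max / 4))
print(default_max)
```

Deriving `pad_max` from the resolved `bucket_max` (rather than from the built-in default) keeps an explicit user override of the max consistent with its padding cap.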
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
@czhu15 (Collaborator) left a comment

LGTM

@czhu15 czhu15 merged commit b42e6ae into vllm-project:aice Mar 5, 2026
1 check passed
czhu15 pushed a commit that referenced this pull request Mar 5, 2026
Note that this PR depends on:
- the **Boolean** attention mask introduced by
#1032 to get valid `m`
and `linv` from the FusedSDPA kernel,
- the default query/ctx bucketing config modified in
#1086

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
@yangulei yangulei deleted the lin_bkt branch March 5, 2026 02:57
tvoas pushed a commit to tvoas/vllm-gaudi that referenced this pull request Mar 11, 2026
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
tvoas pushed a commit to tvoas/vllm-gaudi that referenced this pull request Mar 11, 2026

3 participants