refine linear bucketing defaults #1086
Conversation
Pull request overview
Updates the Gaudi linear bucketing strategy defaults and configuration handling to better scale bucket padding with workload size, while aligning documentation with the new defaults.
Changes:
- Refine linear bucketing defaults (`steps`, `pad_max`, and several max calculations) for prompt/decode configs.
- Adjust decode config checking to only clamp `DECODE_BLOCK_BUCKET_MAX` when contiguous PA is enabled, and fix `pad_max == 0` handling in warmup range generation.
- Update environment-variable documentation to reflect the revised defaults.
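The `pad_max == 0` edge case can be illustrated with a small standalone helper. This is a hedged sketch, not the actual `vllm_gaudi` warmup code: `padded_max` and its padding rule (absolute cap combined with a percentage cap) are hypothetical names modeled on the `pad_max`/`pad_percent` settings visible in the diffs below.

```python
# Hedged sketch of the pad_max == 0 edge case in warmup range generation:
# with an absolute padding cap of zero, no padding beyond the bucket max
# should be applied. Names are illustrative, not the real implementation.
def padded_max(bucket_max, pad_max, pad_percent):
    if pad_max == 0:
        return bucket_max  # zero cap means no extra padding at all
    # Otherwise pad by the smaller of the absolute and percentage caps.
    extra = min(pad_max, bucket_max * pad_percent // 100)
    return bucket_max + extra

assert padded_max(1024, 0, 25) == 1024    # pad_max == 0: unchanged
assert padded_max(1024, 128, 25) == 1152  # absolute cap binds
assert padded_max(1024, 512, 25) == 1280  # percentage cap binds
```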
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `vllm_gaudi/extension/bucketing/linear.py` | Updates default bucket config calculations, contiguous-PA-specific clamping, and warmup padding edge-case handling. |
| `docs/configuration/env_variables.md` | Updates documented defaults for linear bucketing env vars to match current behavior. |
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
```diff
  pad_max=math.ceil(max_decode_blocks / 4),
  pad_percent=25)
- if decode_block_bucket_cfg[2] > max_blocks:
+ if contiguous_pa and decode_block_bucket_cfg[2] > max_blocks:
```
decode_block_bucket_cfg[2] (max) is only capped to max_blocks when contiguous_pa is true. In non-contiguous PA mode this can leave VLLM_DECODE_BLOCK_BUCKET_MAX above the KV cache capacity (num_hpu_blocks), which can lead to warmup generating decode buckets with num_blocks larger than what’s allocated. Consider capping decode_block_bucket_cfg[2] to max_blocks regardless of contiguous_pa (similar to how ExponentialBucketingStrategy.get_decode_cfgs uses max_blocks even when not contiguous) or adding an explicit guard that decode num_blocks never exceeds max_blocks.
Suggested change:

```diff
- if contiguous_pa and decode_block_bucket_cfg[2] > max_blocks:
+ if decode_block_bucket_cfg[2] > max_blocks:
```
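The unconditional clamp the reviewer proposes can be sketched as a standalone helper. This is a hedged illustration under assumed names: `decode_block_bucket_cfg` is treated as a `(min, step, max)` tuple as in the diff above, and `max_blocks` as the allocated KV-cache block count; the real config handling in `linear.py` may differ.

```python
# Hedged sketch: cap the decode block-bucket max at KV-cache capacity
# regardless of contiguous PA, so warmup never generates decode buckets
# with more blocks than are allocated. Names mirror the diff above.
def clamp_decode_block_max(decode_block_bucket_cfg, max_blocks):
    lo, step, hi = decode_block_bucket_cfg
    return (lo, step, min(hi, max_blocks))

# Capacity below the configured max: the max is clamped.
assert clamp_decode_block_max((128, 128, 4096), 2048) == (128, 128, 2048)
# Ample capacity: the config is left unchanged.
assert clamp_decode_block_max((128, 128, 4096), 8192) == (128, 128, 4096)
```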
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
```diff
  pad_max=math.ceil(max_num_batched_tokens / 4),
  pad_percent=25)
- max_ctx = math.ceil((max_model_len - prompt_query_bucket_cfg[0]) // block_size)
+ max_ctx = math.ceil((max_model_len - prompt_query_bucket_cfg[0]) / block_size)
```
max_ctx is computed with ceil((max_model_len - prompt_query_bucket_cfg[0]) / block_size), which can overestimate by 1 when max_model_len isn’t divisible by block_size. That produces a ctx bucket where query + ctx * block_size > max_model_len, which is then filtered out later but still expands the warmup range unnecessarily. Use floor-division for the bound (and clamp at 0) so the configured max ctx is always feasible for the minimum query length.
Suggested change:

```diff
- max_ctx = math.ceil((max_model_len - prompt_query_bucket_cfg[0]) / block_size)
+ available_ctx_tokens = max_model_len - prompt_query_bucket_cfg[0]
+ max_ctx = max(0, available_ctx_tokens // block_size)
```
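The overestimate the comment describes is easy to demonstrate with concrete numbers. The values below are illustrative (not taken from the PR); `min_query` stands in for `prompt_query_bucket_cfg[0]`.

```python
import math

# Illustrative numbers: max_model_len not divisible by block_size.
block_size = 128
max_model_len = 2050
min_query = 128  # stand-in for prompt_query_bucket_cfg[0]

available_ctx_tokens = max_model_len - min_query          # 1922
ceil_ctx = math.ceil(available_ctx_tokens / block_size)   # 16
floor_ctx = max(0, available_ctx_tokens // block_size)    # 15

# ceil overshoots by one block: 128 + 16*128 = 2176 > 2050
assert min_query + ceil_ctx * block_size > max_model_len
# floor-division stays feasible: 128 + 15*128 = 2048 <= 2050
assert min_query + floor_ctx * block_size <= max_model_len
```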
| Prompt | sequence ctx max padding percent (`VLLM_PROMPT_CTX_BUCKET_PAD_PERCENT`) | `25` |
| Decode | batch size min (`VLLM_DECODE_BS_BUCKET_MIN`) | `1` |
| Decode | batch size step (`VLLM_DECODE_BS_BUCKET_STEP`) | `2` |
| Decode | batch size max (`VLLM_DECODE_BS_BUCKET_MAX`) | `max_num_seqs` |
| Decode | batch size max abs padding (`VLLM_DECODE_BS_BUCKET_PAD_MAX`) | `ceil(max_num_seqs / 4)` |
| Decode | batch size max padding percent (`VLLM_DECODE_BS_BUCKET_PAD_PERCENT`) | `25` |
| Decode | block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`) | `block_size` |
| Decode | block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`) | `block_size` |
| Decode | block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `ceil(max_model_len * max_num_seqs / block_size)` <br>by default or `max_blocks` <br>if `VLLM_CONTIGUOUS_PA = True` |
| Decode | block size max abs padding (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`) | `ceil(VLLM_DECODE_BLOCK_BUCKET_MAX / 4)` |
| Decode | block size max padding percent (`VLLM_DECODE_BLOCK_BUCKET_PAD_PERCENT`) | `25` |
The updated defaults describe DECODE BLOCK buckets in terms of block_size (tokens-per-block), but the runtime API uses this dimension as a number of blocks (e.g., callers pass len(block_list) / sum(num_blocks)). This makes the table misleading and will be wrong if the code is corrected to keep everything in block-count units. Please clarify the unit in the docs (blocks vs tokens) and align the default expressions accordingly.
Suggested change:

```diff
  | Prompt | sequence ctx max padding percent (`VLLM_PROMPT_CTX_BUCKET_PAD_PERCENT`) | `25` |
  | Decode | batch size min (`VLLM_DECODE_BS_BUCKET_MIN`) | `1` |
  | Decode | batch size step (`VLLM_DECODE_BS_BUCKET_STEP`) | `2` |
  | Decode | batch size max (`VLLM_DECODE_BS_BUCKET_MAX`) | `max_num_seqs` |
  | Decode | batch size max abs padding (`VLLM_DECODE_BS_BUCKET_PAD_MAX`) | `ceil(max_num_seqs / 4)` |
  | Decode | batch size max padding percent (`VLLM_DECODE_BS_BUCKET_PAD_PERCENT`) | `25` |
- | Decode | block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`) | `block_size` |
- | Decode | block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`) | `block_size` |
- | Decode | block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `ceil(max_model_len * max_num_seqs / block_size)` <br>by default or `max_blocks` <br>if `VLLM_CONTIGUOUS_PA = True` |
- | Decode | block size max abs padding (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`) | `ceil(VLLM_DECODE_BLOCK_BUCKET_MAX / 4)` |
- | Decode | block size max padding percent (`VLLM_DECODE_BLOCK_BUCKET_PAD_PERCENT`) | `25` |
+ | Decode | block count min, in blocks (`VLLM_DECODE_BLOCK_BUCKET_MIN`) | `1` |
+ | Decode | block count step, in blocks (`VLLM_DECODE_BLOCK_BUCKET_STEP`) | `1` |
+ | Decode | block count max, in blocks (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `ceil(max_model_len * max_num_seqs / block_size)` <br>by default or `max_blocks` <br>if `VLLM_CONTIGUOUS_PA = True` |
+ | Decode | block count max abs padding, in blocks (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`) | `ceil(VLLM_DECODE_BLOCK_BUCKET_MAX / 4)` |
+ | Decode | block count max padding percent (`VLLM_DECODE_BLOCK_BUCKET_PAD_PERCENT`) | `25` |
```
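The blocks-vs-tokens distinction the comment raises can be checked with the table's own default expression. The numbers below are hypothetical, chosen only to show that the formula yields a block count rather than a token count.

```python
import math

# Hypothetical values to show the block-count arithmetic from the table.
max_model_len, max_num_seqs, block_size = 4096, 64, 128

# Default DECODE_BLOCK_BUCKET_MAX: measured in KV-cache blocks, not tokens.
decode_block_max = math.ceil(max_model_len * max_num_seqs / block_size)
assert decode_block_max == 2048  # 4096 * 64 / 128 blocks

# The default abs padding cap derives from that block count.
assert math.ceil(decode_block_max / 4) == 512
```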
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Note that this PR depends on:
- the **Boolean** attention mask introduced by vllm-project#1032 to get valid `m` and `linv` from the FusedSDPA kernel,
- the default query/ctx bucketing config modified in vllm-project#1086

Signed-off-by: Youlei Yang <youlei.yang@intel.com>