Add the padding-aware bucketing strategy #762
Conversation
Pull request overview
This PR introduces configurable absolute and relative padding limits to the linear bucketing algorithm to better balance warmup time and runtime performance. The change replaces exponential bucketing as the default strategy.
- Adds new environment variables for controlling padding limits (`PAD_MAX` and `PAD_PERCENT`) across all bucket dimensions
- Implements a new `warmup_range_with_limits` function that generates buckets respecting these padding constraints
- Changes the default bucketing strategy from exponential to linear with limits
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vllm_gaudi/extension/features.py | Adds new environment variables for padding limits and switches default bucketing strategy to linear |
| vllm_gaudi/extension/bucketing/linear.py | Implements padding-aware bucket generation with new warmup_range_with_limits function and updated configuration handling |
| vllm_gaudi/extension/bucketing/common.py | Simplifies bucketing strategy selection and adds debug logging for bucket ranges |
| tests/unit_tests/test_bucketing.py | Updates tests to accommodate new padding parameters in bucket configuration |
| docs/configuration/env_variables.md | Documents new padding-related environment variables and updated defaults |
Comments suppressed due to low confidence (1)
vllm_gaudi/extension/bucketing/linear.py:1
- The `BUCKET_PAD_PERCENT` environment variables are defined as `int` type, but they represent percentages. This could lead to confusion, as the documentation shows `25` meaning 25%, but users might expect values like `0.25`. Consider using a float type or clearly documenting that the value should be specified as an integer percentage (0-100).
🚧 CI Blocked — The main CI workflow was not started for the following reason:
*Force-pushed from 0e742bc to 575e7d1*
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
vllm_gaudi/extension/bucketing/linear.py:1
- The `PAD_PERCENT` parameter is stored as an integer but represents a percentage value (0-50). Consider using a float type or renaming it to indicate it's in integer percentage points to avoid confusion.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
*Force-pushed from faf3e47 to aa9d30f*
Submitted #780 to solve the OOM issue in CI.
✅ CI Passed — All checks passed successfully against the following vllm commit:
*Force-pushed from e4eabf6 to fefd207*
✅ CI Passed — All checks passed successfully against the following vllm commit:
*Force-pushed from 4b01722 to 8c7c399*
✅ CI Passed — All checks passed successfully against the following vllm commit:
*Force-pushed from 0de2420 to 4a8ca80*
**afierka-intel** left a comment
Hi @yangulei ,
Thank you for your patience and for your PR: #762. We now have nearly all the benchmark results and have made a decision on the path forward.
We've decided not to change the default bucketing strategy to the one proposed in your PR. You're right that it improves performance in long-context scenarios, and it doesn't degrade performance in other cases either. However, the warmup time is unacceptable for virtually all scenarios — with your PR we see 2.5×–3.5× longer warmup compared to main. This is visible not only on MoE models, but also on simple, small models like granite-3.3-2b.
That said, we do like your approach and would like to incorporate it — but as a separate bucketing strategy. Currently we have linear and exponential bucketing strategies. We'd be happy to add yours as a third option, targeted at long-context scenarios. It would not be the default, but could be enabled via an environment variable or a configuration parameter.
Thank you for understanding,
Artur
*Force-pushed from 4a8ca80 to 1debb37*
@afierka-intel Moved the impl to
*Force-pushed from 0400124 to 1debb37*
🚧 CI Blocked — The main CI workflow was not started for the following reason:
*Force-pushed from 1debb37 to 84f94f7*
Hi @yangulei 👋 Thank you for reworking this PR to make padding-aware bucketing an opt-in strategy via `VLLM_BUCKETING_STRATEGY="pad"`.

After reviewing the full diff against the current codebase, I found a couple of things that need attention before we can merge. I want to be upfront: one of them is a pattern that is easy to miss because it is buried in each strategy independently rather than centralized.

**🔴 Needs fix before merge**

1. **Missing `merged_prefill` handling.** Both the linear and exponential strategies handle `merged_prefill`; the new `PaddingAwareBucketingStrategy` does not. Without it, using merged prefill together with the `pad` strategy would break. Suggested fix: add a similar `merged_prefill` path to `PaddingAwareBucketingStrategy`.

2. **`VLLM_EXPONENTIAL_BUCKETING` backward compatibility.** The current codebase actively uses `VLLM_EXPONENTIAL_BUCKETING`. Suggested fix: add a deprecation warning when it is set. Optionally, you could auto-map `VLLM_EXPONENTIAL_BUCKETING` to the corresponding `VLLM_BUCKETING_STRATEGY` value.

**🟡 Minor suggestions (non-blocking)**

3. **`decode_query_bucket_cfg` values.** This is correct (decode query is always 1 token in autoregressive mode), but a reader unfamiliar with the code might wonder why all five values are 1. A short comment would help.

4. **Default strategy test.** A test covering the default strategy selection would be a welcome addition.

**✅ Things that look good**

Thank you again for the quality work here. Looking forward to the next iteration! 🙏
*Force-pushed from 84f94f7 to 7866ce5*
Hi @afierka-intel
**afierka-intel** left a comment
Thanks for addressing all the feedback! The changes look great:
- ✅ `merged_prefill` handling is now properly implemented in `PaddingAwareBucketingStrategy`, consistent with the linear and exponential strategies
- ✅ `VLLM_EXPONENTIAL_BUCKETING` backward compatibility with deprecation warning — clean implementation
- ✅ Helpful comment on `decode_query_bucket_cfg` and default strategy test added
Nice work on the comprehensive test coverage too. 👍
Could you please rebase on top of the latest main so we can trigger a fresh CI run?
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
*Force-pushed from 7866ce5 to e5d3474*
Done.
✅ CI Passed — All checks passed successfully against the following vllm commit:
Signed-off-by: copilot <copilot@github.com>

Tests cover the four PRs addressing long-context bucketing:
- PR #762: Padding-aware bucketing strategy (warmup ranges, configs, generation)
- PR #1122: Exponential decode block formula, limit cap, filter, linear fix
- PR #1155: FusedSDPA slicing contract (pad_max bounds, strategy selection)
- PR #1346: HPU graph capture skip (cudagraph size, warmup clamp scenarios)
- Cross-PR integration: end-to-end 256K scenario, fallback, regressions

49 test functions organized in 6 test classes.

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Introduce the new `PaddingAwareBucketingStrategy`, which can be enabled
by setting `VLLM_BUCKETING_STRATEGY="pad"` and further tuned via
`VLLM_{phase}_{dim}_BUCKET_PAD_MAX` and
`VLLM_{phase}_{dim}_BUCKET_PAD_PERCENT`, which set the max absolute and
relative padding limits respectively.
## Motivation
Exponential bucketing was introduced to significantly reduce the
number of buckets. Take an example with `max_num_batched_tokens=8192`,
`max_model_len=32768`, `max_num_seqs=256` and `# hpu blocks: 4127`:
exponential bucketing generates **120** prompt buckets and **81** decode
buckets, while linear bucketing generates **14368** prompt buckets and
**4042** decode buckets. The exponential buckets are filtered
combinations of the following ranges:
```
Prompt query range: [128, 256, 384, 512, 640, 1792, 2816, 3968, 4992, 6144, 7168, 8192]
Prompt context range: [0, 1, 3, 8, 22, 56, 90, 124, 158, 192]
Decode BS range: [1, 2, 4, 8, 14, 24, 42, 78, 140, 256]
Decode context range: [1, 256, 512, 768, 1024, 1280, 1536, 1792, 2304, 2816, 3584, 4352]
```
The max absolute padding (`max(bucket[i]-bucket[i-1]-1)`) grows in
proportion to the bucket max when left unlimited, and the max relative
padding (`(bucket[i]-bucket[i-1]-1)/bucket[i]`) approaches **50%** for a
large bucket max. Such large padding causes significant overhead,
especially for long sequences.
We need a bucketing algorithm that **balances** the number of buckets
(warmup time) against runtime performance (padding overhead).
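To make these two padding metrics concrete, here is a small self-contained helper (an illustration, not code from the PR) that computes the worst-case absolute and relative padding of a sorted bucket list, applied to a pure power-of-two bucket sequence:

```python
def worst_case_padding(buckets):
    """Worst-case padding of an ascending bucket list.

    A size just above buckets[i-1] gets rounded up to buckets[i],
    wasting buckets[i] - buckets[i-1] - 1 slots, so:
      max absolute padding = max(bucket[i] - bucket[i-1] - 1)
      max relative padding = max((bucket[i] - bucket[i-1] - 1) / bucket[i])
    """
    abs_pad = max(b - a - 1 for a, b in zip(buckets, buckets[1:]))
    rel_pad = max((b - a - 1) / b for a, b in zip(buckets, buckets[1:]))
    return abs_pad, rel_pad


# Pure doubling (exponential) buckets: relative padding approaches 50%
exp_buckets = [2 ** i for i in range(13)]  # 1, 2, 4, ..., 4096
abs_pad, rel_pad = worst_case_padding(exp_buckets)
print(abs_pad, round(rel_pad, 4))  # 2047 0.4998
```

This shows the trend described above: with unbounded doubling, the absolute padding scales with the largest bucket and the relative padding converges toward 50%.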
## Changes
- Enhance `warmup_range` in linear bucketing into
`warmup_range_with_limits`, which generates a range that ensures the
absolute and relative padding do not exceed the specified limits.
- Introduce new environment variables `VLLM_{phase}_{dim}_BUCKET_PAD_MAX` and
`VLLM_{phase}_{dim}_BUCKET_PAD_PERCENT` to set the absolute and relative
padding limits respectively.
- Introduce the new environment variable `VLLM_BUCKETING_STRATEGY` to select the
bucketing strategy: `exp` for the default exponential, `lin` for linear,
and `pad` for the padding-aware strategy.
For the above example with default settings:
```
Prompt query range: [128, 256, 384, 512, 640, 768, 1024, 1280, 1664, 2176, 2816, 3712, 4864, 6400, 8192]
Prompt context range: [0, 1, 2, 4, 6, 8, 12, 16, 22, 30, 40, 54, 64, 86, 116, 128, 172, 192, 255]
Decode BS range: [1, 2, 4, 6, 8, 12, 16, 22, 30, 32, 44, 60, 64, 86, 96, 128, 160, 192, 224, 256]
Decode context range: [128, 256, 384, 512, 640, 768, 1024, 1280, 1664, 2176, 2816, 3712, 4127]
```
This results in **284** prompt buckets and **222** decode buckets with
much less padding.
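The padding-limited generation rule can be sketched as follows. This is a hypothetical illustration under the constraints stated above, not the actual `warmup_range_with_limits` implementation (the real code also applies step alignment, which is omitted here):

```python
def padded_range(bmin, bmax, pad_max, pad_percent):
    """Ascending bucket boundaries in [bmin, bmax] such that rounding any
    value up to the next boundary never pads by more than pad_max slots
    (absolute limit) or pad_percent% of that boundary (relative limit)."""
    buckets = [bmin]
    while buckets[-1] < bmax:
        prev = buckets[-1]
        nxt = prev + 1 + pad_max  # absolute limit: nxt - prev - 1 <= pad_max
        if pad_percent < 100:
            # relative limit: nxt - prev - 1 <= pad_percent/100 * nxt
            #             <=> nxt <= (prev + 1) / (1 - pad_percent/100)
            nxt = min(nxt, int((prev + 1) / (1 - pad_percent / 100)))
        buckets.append(min(max(nxt, prev + 1), bmax))
    return buckets
```

Note how the limiting cases fall out of the rule: `pad_percent=0` forces a step of 1 (no padding at all, i.e. every value is its own bucket), while a large `pad_max` with `pad_percent=50` lets buckets roughly double, approximating exponential growth.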
## Benefits
- The exponential bucketing can be approximated by setting a large
`VLLM_{phase}_{dim}_BUCKET_PAD_MAX` and
`VLLM_{phase}_{dim}_BUCKET_PAD_PERCENT=50`.
- The original linear bucketing can be recovered by setting
`VLLM_{phase}_{dim}_BUCKET_PAD_PERCENT=0`.
- Users can further tune the absolute and relative padding limits to
balance warmup time against runtime performance.
- Setting `VLLM_{phase}_{dim}_BUCKET_PAD_MAX` to a multiple of
`PT_HPU_SDPA_BR_FACTOR` and `PT_HPU_SDPA_BC_FACTOR` generates
buckets aligned with the slicing chunk size, which gives better
performance.
---------
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
FusedSDPA can be split into smaller chunks to improve performance when using the padding-aware bucketing strategy, which guarantees the max absolute padding in the sequence and context dimensions.

## Usage

| Parameter name | Description | Default value |
| --- | --- | --- |
| `VLLM_HPU_FSDPA_SLICE_ENABLED` | Enable the slicing. | `True` for the padding-aware bucketing strategy |
| `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD` | KV length threshold above which slicing is applied. | `min(max_num_batched_tokens, 8192)` |
| `VLLM_HPU_FSDPA_SLICE_CHUNK_SIZE` | Chunk size for `q_len` and `kv_len` in each chunk. Rounded up to the next multiple of 1024. | `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD // 2` |
| `VLLM_HPU_FSDPA_SLICE_WITH_GRAPH_BREAKS` | Places each chunk in a separate graph to reduce compilation time. | `true` for lazy mode and `false` otherwise |

> [!IMPORTANT]
> These parameters are effective only with the padding-aware bucketing strategy set by `VLLM_BUCKETING_STRATEGY="pad"`.

## Implementation

Take the prefix-prefill with `[bs, query, context] = [1, 9037, 8832]` as an example. The prefill shape is first padded to `[1, 10880, 11008]` by the bucketing. The attention mask looks like:



> Note that there is padding in both the query and context dimensions.

The original implementation passes the full attention mask to the FusedSDPA kernel.



This PR introduces an implementation that calculates the FSDPA in chunks by slicing `Q`, `K` and `V` as below:



The color of each rectangle indicates the `is_causal` and `attn_mask` parameters passed to the FusedSDPA kernel:
- `rgb(255,0,0)`: `is_causal=False` and `attn_mask is not None`
- `rgb(255,255,0)`: `is_causal=True` and `attn_mask=None`
- `rgb(255,0,255)`: `is_causal=False` and `attn_mask=None`

In this way, most of the chunks call FusedSDPA without an attention mask for better performance, and the graphs for the chunks may be reused across different buckets to reduce the warmup duration.

## Dependencies

- #762, as the number of chunks with padding is determined by the `PAD_MAX` for the query and context.

---

### Thanks

@Wei-Lin-Intel for the original idea and the detailed behavior of the FusedSDPA kernel.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
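The per-chunk `is_causal`/`attn_mask` decision can be sketched with a hypothetical classifier. The helper name and the half-open chunk intervals are illustrative assumptions, not the PR's actual code, and the padded tail chunks (the ones that still need an explicit mask) are omitted for brevity:

```python
def chunk_kind(q0, q1, k0, k1):
    """Classify a (query-chunk, kv-chunk) pair for purely causal attention,
    where query position i may attend to kv positions <= i.
    Chunks are half-open intervals [q0, q1) and [k0, k1).

    Returns the FusedSDPA call shape this pair needs:
      'skip'   - every kv position lies after every query: nothing to compute
      'full'   - every kv position is visible to every query:
                 is_causal=False, attn_mask=None
      'causal' - the pair straddles the diagonal:
                 is_causal=True, attn_mask=None
    """
    if k0 > q1 - 1:   # kv chunk entirely in the future of the last query
        return 'skip'
    if k1 - 1 <= q0:  # kv chunk entirely visible, even to the first query
        return 'full'
    return 'causal'
```

With 1024-token chunks, the diagonal blocks classify as `'causal'`, blocks below the diagonal as `'full'`, and blocks above it as `'skip'`, so only a small fraction of calls ever needs an explicit attention mask.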