Fixes for the decode bucketing in the non-contiguous PA scenario (#1122)
Conversation
Pull request overview
This PR addresses incorrect/unsafe decode bucketing behavior in non-contiguous PA scenarios, aiming to prevent oversized decode bucket selection and reduce warmup-time OOM risk.
Changes:
- Adjust decode bucket sizing/limits (notably for the exponential strategy) to better reflect `max_model_len` and PA mode.
- Add a decode-bucket filter to omit buckets whose batched context exceeds the batched `max_model_len`.
- Clamp dummy decode warmup inputs to KV-cache block capacity to reduce OOM likelihood.
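The warmup clamp in the last bullet can be illustrated with a minimal sketch. The function name `clamp_warmup_blocks` and the greedy shrink-the-longest strategy are illustrative assumptions, not the actual `hpu_model_runner` code:

```python
import math

def clamp_warmup_blocks(seq_lens, block_size, num_kv_blocks):
    """Hypothetical sketch: cap dummy decode warmup sequence lengths so their
    total KV-cache block demand never exceeds the cache capacity."""
    blocks_per_seq = [math.ceil(l / block_size) for l in seq_lens]
    total = sum(blocks_per_seq)
    if total <= num_kv_blocks:
        return list(seq_lens)  # already fits, nothing to clamp
    clamped = list(seq_lens)
    # Shrink the longest sequences first until the total fits the cache.
    order = sorted(range(len(clamped)), key=lambda i: clamped[i], reverse=True)
    for i in order:
        if total <= num_kv_blocks:
            break
        need = math.ceil(clamped[i] / block_size)
        give_back = min(need - 1, total - num_kv_blocks)  # keep >= 1 block
        clamped[i] = (need - give_back) * block_size
        total -= give_back
    return clamped

# Two 1000-token dummy seqs (8 blocks each) clamped into a 10-block cache:
print(clamp_warmup_blocks([1000, 1000], block_size=128, num_kv_blocks=10))
```

Oversized warmup inputs are reduced rather than rejected, so warmup still exercises a representative shape without requesting more blocks than the KV cache holds.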
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `vllm_gaudi/v1/worker/hpu_model_runner.py` | Clamp dummy decode warmup block usage when generating per-seq lengths. |
| `vllm_gaudi/extension/bucketing/linear.py` | Adjust when the decode block MAX is overridden based on contiguous PA; move the MIN sanity check. |
| `vllm_gaudi/extension/bucketing/exponential.py` | Recompute `max_decode_blocks`/`decode_blocks_limit` based on `max_model_len` and PA mode. |
| `vllm_gaudi/extension/bucketing/common.py` | Add decode bucket filtering by batched max-model-len and debug logging for omitted buckets. |
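The `exponential.py` change recomputes the decode-block cap from `max_model_len` and the PA mode. A minimal sketch of how an exponential decode-block series could be capped at the batched `max_model_len` limit (the helper name and doubling series are assumptions; the real `exponential.py` strategy differs in detail):

```python
import math

def exponential_decode_block_buckets(min_blocks, max_model_len, block_size,
                                     max_num_seqs):
    """Hypothetical sketch: doubling series of decode block buckets,
    capped at the batched max_model_len block limit."""
    # Largest block count any decode batch can legitimately need.
    limit = math.ceil(max_model_len / block_size) * max_num_seqs
    buckets, b = [], min_blocks
    while b < limit:
        buckets.append(b)
        b *= 2
    buckets.append(limit)  # always include the cap itself
    return buckets

print(exponential_decode_block_buckets(128, 16384, 128, 128))
# [128, 256, 512, 1024, 2048, 4096, 8192, 16384]
```

Capping the series this way keeps buckets that can never be reached out of warmup, which is the warmup-time saving the PR is after.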
```diff
 max_blocks=max_blocks)
-expected_max = max_blocks * 3  # 10779
+expected_max = math.ceil(max_model_len / block_size) * max_num_seqs
```
What is the maximum `expected_max` value here? This value is calculated; in fact, the calculation is copied from the algorithm. The test should check specific values and verify that the filter actually reduces the number of buckets.
The `expected_max` is calculated based on the corner decoding case with `context_lens = [max_model_len - 1] + [1] * (max_num_seqs - 1)`. The context lens are padded to the longest one, so the total number of decode blocks can be calculated as `math.ceil((max_model_len - 1) / block_size) * max_num_seqs`, which simplifies to `math.ceil(max_model_len / block_size) * max_num_seqs` since `max_model_len >> block_size` in common cases.
For example, with `max_model_len=16384`, `max_num_seqs=128` and `block_size=128`, the corner decoding case with `context_lens = [16383, 1, 1, ..., 1]` needs `ceil(16383 / 128) * 128 = ceil(16384 / 128) * 128 = 16384` blocks after padding.
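The corner case above can be checked with a short script (the helper name `worst_case_decode_blocks` is illustrative, not from the test file):

```python
import math

def worst_case_decode_blocks(max_model_len, max_num_seqs, block_size):
    # Corner case: one sequence near max_model_len, the rest minimal; every
    # context is padded to the longest one before block accounting.
    context_lens = [max_model_len - 1] + [1] * (max_num_seqs - 1)
    padded = max(context_lens)
    return math.ceil(padded / block_size) * max_num_seqs

exact = worst_case_decode_blocks(16384, 128, 128)
simplified = math.ceil(16384 / 128) * 128  # the formula used in the test
print(exact, simplified)  # 16384 16384
```

The exact and simplified formulas agree whenever `max_model_len - 1` and `max_model_len` round up to the same block count, which holds for any `max_model_len` that is not one more than a multiple of `block_size`.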
And a UT for the newly added filter has been added. Thanks for the reminder.
```diff
 max_decode_blocks = max_blocks
 decode_block_bucket_cfg = read_bucket_settings('decode', 'block', min=1, step=block_size, max=max_decode_blocks)
-if decode_block_bucket_cfg[2] > max_blocks:
+if contiguous_pa and decode_block_bucket_cfg[2] > max_blocks:
```
What is the decode bucket count now?
This is a bug fix for the missing buckets that cause "not warmed-up" warnings in non-contiguous PA cases. The actual number of decode buckets for linear bucketing is sensitive to the `*_BUCKET_STEP_*` configuration, and the default settings usually produce too many buckets, which can take hours or even days to warm up for cases with a long `max_model_len`.
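The filter added in `common.py` can be sketched as follows, assuming decode buckets are `(batch_size, num_blocks)` pairs; the actual vllm_gaudi bucket representation and filter code may differ:

```python
import math

def filter_decode_buckets(buckets, max_model_len, block_size):
    """Hypothetical sketch: keep a (batch_size, num_blocks) decode bucket only
    if its block count can actually be reached, i.e. it does not exceed
    batch_size * ceil(max_model_len / block_size)."""
    blocks_per_seq_cap = math.ceil(max_model_len / block_size)
    kept = []
    for batch_size, num_blocks in buckets:
        if num_blocks <= batch_size * blocks_per_seq_cap:
            kept.append((batch_size, num_blocks))
        # omitted buckets could be debug-logged here, as the PR does
    return kept

buckets = [(1, 2), (1, 3), (2, 4), (2, 8)]
print(filter_decode_buckets(buckets, max_model_len=256, block_size=128))
# [(1, 2), (2, 4)]
```

Every dropped bucket is one fewer graph to warm up, which is where the hours-to-days warmup reduction comes from.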
Signed-off-by: copilot <copilot@github.com>

Tests cover the four PRs addressing long-context bucketing:
- PR #762: Padding-aware bucketing strategy (warmup ranges, configs, generation)
- PR #1122: Exponential decode block formula, limit cap, filter, linear fix
- PR #1155: FusedSDPA slicing contract (pad_max bounds, strategy selection)
- PR #1346: HPU graph capture skip (cudagraph size, warmup clamp scenarios)
- Cross-PR integration: end-to-end 256K scenario, fallback, regressions

49 test functions organized in 6 test classes.

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Remove all production code changes from PRs #1122, #1155, #1346 and keep only the two test files created for issue #1347:
- tests/unit_tests/test_bucketing_issue_1347.py
- tests/unit_tests/test_bucketing_warmup_time.py

Signed-off-by: GitHub Copilot <copilot@github.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
@adobrzyn @kamil-kaczor |
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
…size * bs Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Changes:
- Recompute `max_decode_blocks` and `decode_blocks_limit`.
- Clamp `num_blocks` to avoid OOM.