Fixes for the decode bucketing in non-contiguous pa scenario #1122

Merged
kamil-kaczor merged 8 commits into vllm-project:main from yangulei:decode_bucket_main on May 11, 2026

Conversation

@yangulei
Collaborator

Changes:

  • Fix the calculation of max_decode_blocks and decode_blocks_limit.
  • Add a filter that drops decode buckets whose batched context exceeds the batched max_model_len.
  • Create dummy decode warmup inputs with num_blocks capped, to avoid OOM during warmup.
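The second bullet can be sketched as a standalone filter. This is a minimal illustration, not the actual vllm-gaudi code: the helper name and the (batch_size, query_len, num_blocks) bucket tuple layout are assumptions.

```python
import math

def filter_decode_buckets(buckets, max_model_len, block_size):
    """Drop decode buckets whose block count exceeds what a batch of that
    size can ever need when every sequence is padded to max_model_len.
    Sketch only; bucket layout assumed to be (batch_size, query_len, num_blocks)."""
    kept = []
    for bs, query_len, num_blocks in buckets:
        # Upper bound on blocks for `bs` sequences, each padded to max_model_len.
        batched_limit = math.ceil(max_model_len / block_size) * bs
        if num_blocks <= batched_limit:
            kept.append((bs, query_len, num_blocks))
    return kept
```

With max_model_len=16384 and block_size=128, a bucket of 256 blocks at batch size 1 exceeds the per-batch limit of 128 blocks and would be dropped, while the same 256 blocks at batch size 2 would be kept.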

Contributor

Copilot AI left a comment


Pull request overview

This PR addresses incorrect/unsafe decode bucketing behavior in non-contiguous PA scenarios, aiming to prevent oversized decode bucket selection and reduce warmup-time OOM risk.

Changes:

  • Adjust decode bucket sizing/limits (notably for exponential strategy) to better reflect max_model_len and PA mode.
  • Add a decode-bucket filter to omit buckets whose batched context exceeds the batched max_model_len.
  • Clamp dummy decode warmup inputs to KV-cache block capacity to reduce OOM likelihood.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
vllm_gaudi/v1/worker/hpu_model_runner.py Clamp dummy decode warmup block usage when generating per-seq lengths.
vllm_gaudi/extension/bucketing/linear.py Adjust when decode block MAX is overridden based on contiguous PA; move MIN sanity check.
vllm_gaudi/extension/bucketing/exponential.py Recompute max_decode_blocks/decode_blocks_limit based on max_model_len and PA mode.
vllm_gaudi/extension/bucketing/common.py Add decode bucket filtering by batched max-model-len and debug logging for omitted buckets.
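The warmup clamp in the first row can be illustrated with a hypothetical helper. The name and the even block split are assumptions for illustration; the actual runner logic differs.

```python
import math

def dummy_decode_context_lens(bs, target_blocks, max_blocks, block_size):
    """Generate per-sequence context lengths for a dummy decode warmup,
    clamping total block usage to the KV-cache capacity (max_blocks)
    so the warmup itself cannot OOM. Hypothetical sketch."""
    blocks = min(target_blocks, max_blocks)
    # Distribute blocks evenly across the batch, at least one per sequence.
    per_seq_blocks = max(blocks // bs, 1)
    return [per_seq_blocks * block_size] * bs
```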


Comment thread vllm_gaudi/v1/worker/hpu_model_runner.py
Comment thread vllm_gaudi/extension/bucketing/linear.py Outdated
@yangulei yangulei force-pushed the decode_bucket_main branch from 2155657 to 074ea9f Compare March 12, 2026 02:35
@yangulei yangulei requested a review from PatrykWo as a code owner March 12, 2026 02:35
@yangulei yangulei changed the title fixes for decode bucketing in non-contiguous pa scenario fixes for the decode bucketing in non-contiguous pa scenario Mar 12, 2026
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
2a68464c5bf1a26821afe76cf49dc53f75b87e98

@yangulei yangulei changed the title fixes for the decode bucketing in non-contiguous pa scenario Fixes for the decode bucketing in non-contiguous pa scenario Mar 17, 2026
@yangulei yangulei force-pushed the decode_bucket_main branch from 074ea9f to d78c614 Compare March 19, 2026 05:25
@yangulei yangulei force-pushed the decode_bucket_main branch from d78c614 to ae8cfc6 Compare March 24, 2026 05:16
@yangulei yangulei force-pushed the decode_bucket_main branch 2 times, most recently from 093bf08 to 942ad5d Compare April 14, 2026 00:31
max_blocks=max_blocks)

-expected_max = max_blocks * 3  # 10779
+expected_max = math.ceil(max_model_len / block_size) * max_num_seqs
Collaborator

@michalkuligowski michalkuligowski Apr 14, 2026


What is the maximum expected_max value here? This value is calculated; in fact, the calculation is copied from the algorithm itself. The test should check specific values and verify that the filter actually reduces the number of buckets.

Collaborator Author

@yangulei yangulei Apr 15, 2026


The expected_max is calculated based on the corner decoding case with context_lens = [max_model_len - 1] + [1] * (max_num_seqs - 1). The context lens will be padded up to the longest one, so the total number of decoding blocks is math.ceil((max_model_len - 1) / block_size) * max_num_seqs, which can be simplified to math.ceil(max_model_len / block_size) * max_num_seqs since max_model_len >> block_size in common cases.
For example, with max_model_len=16384, max_num_seqs=128 and block_size=128, the corner decoding case with context_lens = [16383, 1, 1, ..., 1] needs ceil(16383 / 128) * 128 = ceil(16384 / 128) * 128 = 16384 blocks after padding.
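The arithmetic above can be checked with a small standalone script. corner_case_blocks and simplified_bound are illustrative helpers, not code from the PR.

```python
import math

def corner_case_blocks(max_model_len, max_num_seqs, block_size):
    # Corner decode case: one sequence at context max_model_len - 1,
    # the rest at context 1; padding brings all of them up to the
    # longest context in the batch.
    context_lens = [max_model_len - 1] + [1] * (max_num_seqs - 1)
    padded = max(context_lens)
    return math.ceil(padded / block_size) * max_num_seqs

def simplified_bound(max_model_len, max_num_seqs, block_size):
    # The simplified form used for expected_max, valid when
    # max_model_len >> block_size.
    return math.ceil(max_model_len / block_size) * max_num_seqs
```

With max_model_len=16384, max_num_seqs=128 and block_size=128, both functions return 16384, matching the worked example above.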

Collaborator Author


And a unit test for the newly added filter has been added. Thanks for the reminder.

 max_decode_blocks = max_blocks
 decode_block_bucket_cfg = read_bucket_settings('decode', 'block', min=1, step=block_size, max=max_decode_blocks)
-if decode_block_bucket_cfg[2] > max_blocks:
+if contiguous_pa and decode_block_bucket_cfg[2] > max_blocks:
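A simplified standalone sketch of what the corrected override might look like. The function name and signature are assumptions; the real code reads the configured value through read_bucket_settings.

```python
import math

def decode_block_bucket_max(cfg_max, max_blocks, max_model_len, block_size,
                            max_num_seqs, contiguous_pa):
    """Sketch: with contiguous PA the configured bucket MAX is clamped to
    the total number of KV-cache blocks; without it, the bound is derived
    from max_model_len so buckets needed by long contexts are not lost."""
    if contiguous_pa:
        return min(cfg_max, max_blocks)
    batched_limit = math.ceil(max_model_len / block_size) * max_num_seqs
    return min(cfg_max, batched_limit)
```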
Collaborator


What is the decode bucket count now?

Collaborator Author


This is a bug fix for the missing buckets which caused "not warmed-up" warnings in non-contiguous PA cases. The actual number of decode buckets for linear bucketing is sensitive to the *_BUCKET_STEP_* configuration, and the default settings usually produce too many buckets, which need hours or even days to warm up for cases with a long max_model_len.

Copilot AI added a commit that referenced this pull request Apr 14, 2026
Signed-off-by: copilot <copilot@github.com>

Tests cover the four PRs addressing long-context bucketing:
- PR #762:  Padding-aware bucketing strategy (warmup ranges, configs, generation)
- PR #1122: Exponential decode block formula, limit cap, filter, linear fix
- PR #1155: FusedSDPA slicing contract (pad_max bounds, strategy selection)
- PR #1346: HPU graph capture skip (cudagraph size, warmup clamp scenarios)
- Cross-PR integration: end-to-end 256K scenario, fallback, regressions

49 test functions organized in 6 test classes.

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
f976e3b98ba45677a2213673a442c6cbff141e8e

Copilot AI added a commit that referenced this pull request Apr 14, 2026
)

Signed-off-by: GitHub Copilot <copilot@github.com>

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
@yangulei yangulei force-pushed the decode_bucket_main branch from 4544644 to 76b1cfa Compare April 15, 2026 02:49
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
0e39202ca911319c7747a2f9d5a0c162fdff4fd9

Copilot AI added a commit that referenced this pull request Apr 16, 2026
Remove all production code changes from PRs #1122, #1155, #1346 and keep
only the two test files created for issue #1347:
- tests/unit_tests/test_bucketing_issue_1347.py
- tests/unit_tests/test_bucketing_warmup_time.py

Signed-off-by: GitHub Copilot <copilot@github.com>

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
@yangulei yangulei force-pushed the decode_bucket_main branch from 1473712 to 5fc3493 Compare April 24, 2026 01:18
@yangulei yangulei requested a review from jbyczkow as a code owner April 24, 2026 01:18
@yangulei yangulei force-pushed the decode_bucket_main branch 2 times, most recently from e47c1ec to ced814b Compare April 27, 2026 00:43
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
d886c26d4d4fef7d079696beb4ece1cfb4b008a8

@yangulei
Collaborator Author

@adobrzyn @kamil-kaczor
Could you help review this PR? It could solve issue GAUDISW-247226.

Collaborator

@kamil-kaczor kamil-kaczor left a comment


lgtm

@yangulei yangulei force-pushed the decode_bucket_main branch 2 times, most recently from 85b61e8 to 961a2bc Compare May 7, 2026 08:02
@github-actions

github-actions Bot commented May 7, 2026

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@yangulei yangulei force-pushed the decode_bucket_main branch from 961a2bc to 15addf4 Compare May 8, 2026 00:51
yangulei added 8 commits May 9, 2026 01:24
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
…size * bs

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
@yangulei yangulei force-pushed the decode_bucket_main branch from c44cf5e to 2c3b44b Compare May 9, 2026 01:29
@github-actions

github-actions Bot commented May 9, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
8eb401134e750781a202c0b6dc4059616cdb4954

@kamil-kaczor kamil-kaczor merged commit f24f3f9 into vllm-project:main May 11, 2026
2 checks passed
@yangulei yangulei deleted the decode_bucket_main branch May 12, 2026 00:11
