Fixes for the decode bucketing in the non-contiguous PA scenario (#1122)
Conversation
Pull request overview
This PR addresses incorrect/unsafe decode bucketing behavior in non-contiguous PA scenarios, aiming to prevent oversized decode bucket selection and reduce warmup-time OOM risk.
Changes:
- Adjust decode bucket sizing/limits (notably for the exponential strategy) to better reflect `max_model_len` and PA mode.
- Add a decode-bucket filter to omit buckets whose batched context exceeds the batched `max_model_len`.
- Clamp dummy decode warmup inputs to KV-cache block capacity to reduce OOM likelihood.
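The warmup clamp in the last bullet can be illustrated with a minimal sketch. The function name `clamp_warmup_blocks` and the greedy shrink-the-longest strategy are illustrative assumptions, not the actual `hpu_model_runner` code:

```python
import math

def clamp_warmup_blocks(seq_lens, block_size, num_kv_blocks):
    """Hypothetical sketch: cap dummy decode warmup sequence lengths so their
    total KV-cache block demand never exceeds the cache capacity."""
    blocks_per_seq = [math.ceil(l / block_size) for l in seq_lens]
    total = sum(blocks_per_seq)
    if total <= num_kv_blocks:
        return list(seq_lens)  # already fits, nothing to clamp
    clamped = list(seq_lens)
    # Shrink the longest sequences first until the total fits the cache.
    order = sorted(range(len(clamped)), key=lambda i: clamped[i], reverse=True)
    for i in order:
        if total <= num_kv_blocks:
            break
        need = math.ceil(clamped[i] / block_size)
        give_back = min(need - 1, total - num_kv_blocks)  # keep >= 1 block
        clamped[i] = (need - give_back) * block_size
        total -= give_back
    return clamped

# Two 1000-token dummy seqs (8 blocks each) clamped into a 10-block cache:
print(clamp_warmup_blocks([1000, 1000], block_size=128, num_kv_blocks=10))
```

Oversized warmup inputs are reduced rather than rejected, so warmup still exercises a representative shape without requesting more blocks than the KV cache holds.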
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `vllm_gaudi/v1/worker/hpu_model_runner.py` | Clamp dummy decode warmup block usage when generating per-seq lengths. |
| `vllm_gaudi/extension/bucketing/linear.py` | Adjust when the decode block MAX is overridden based on contiguous PA; move the MIN sanity check. |
| `vllm_gaudi/extension/bucketing/exponential.py` | Recompute `max_decode_blocks`/`decode_blocks_limit` based on `max_model_len` and PA mode. |
| `vllm_gaudi/extension/bucketing/common.py` | Add decode bucket filtering by batched max-model-len and debug logging for omitted buckets. |
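The `exponential.py` change recomputes the decode-block cap from `max_model_len` and the PA mode. A minimal sketch of how an exponential decode-block series could be capped at the batched `max_model_len` limit (the helper name and doubling series are assumptions; the real `exponential.py` strategy differs in detail):

```python
import math

def exponential_decode_block_buckets(min_blocks, max_model_len, block_size,
                                     max_num_seqs):
    """Hypothetical sketch: doubling series of decode block buckets,
    capped at the batched max_model_len block limit."""
    # Largest block count any decode batch can legitimately need.
    limit = math.ceil(max_model_len / block_size) * max_num_seqs
    buckets, b = [], min_blocks
    while b < limit:
        buckets.append(b)
        b *= 2
    buckets.append(limit)  # always include the cap itself
    return buckets

print(exponential_decode_block_buckets(128, 16384, 128, 128))
# [128, 256, 512, 1024, 2048, 4096, 8192, 16384]
```

Capping the series this way keeps buckets that can never be reached out of warmup, which is the warmup-time saving the PR is after.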
```diff
 max_blocks=max_blocks)
-expected_max = max_blocks * 3  # 10779
+expected_max = math.ceil(max_model_len / block_size) * max_num_seqs
```
What is the maximum `expected_max` value here? This value is calculated; in fact, the calculation is copied from the algorithm. The test should check specific values and verify that the filter actually reduces the number of buckets.
The `expected_max` is calculated based on the corner decoding case with `context_lens = [max_model_len - 1] + [1] * (max_num_seqs - 1)`. The context lens are padded to the longest one, so the total number of decode blocks can be calculated as `math.ceil((max_model_len - 1) / block_size) * max_num_seqs`, which simplifies to `math.ceil(max_model_len / block_size) * max_num_seqs` since `max_model_len >> block_size` in common cases.
For example, with `max_model_len=16384`, `max_num_seqs=128` and `block_size=128`, the corner decoding case with `context_lens = [16383, 1, 1, ..., 1]` needs `ceil(16383 / 128) * 128 = ceil(16384 / 128) * 128 = 16384` blocks after padding.
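The corner case above can be checked with a short script (the helper name `worst_case_decode_blocks` is illustrative, not from the test file):

```python
import math

def worst_case_decode_blocks(max_model_len, max_num_seqs, block_size):
    # Corner case: one sequence near max_model_len, the rest minimal; every
    # context is padded to the longest one before block accounting.
    context_lens = [max_model_len - 1] + [1] * (max_num_seqs - 1)
    padded = max(context_lens)
    return math.ceil(padded / block_size) * max_num_seqs

exact = worst_case_decode_blocks(16384, 128, 128)
simplified = math.ceil(16384 / 128) * 128  # the formula used in the test
print(exact, simplified)  # 16384 16384
```

The exact and simplified formulas agree whenever `max_model_len - 1` and `max_model_len` round up to the same block count, which holds for any `max_model_len` that is not one more than a multiple of `block_size`.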
And a UT for the newly added filter has been added. Thanks for the reminder.
```diff
 max_decode_blocks = max_blocks
 decode_block_bucket_cfg = read_bucket_settings('decode', 'block', min=1, step=block_size, max=max_decode_blocks)
-if decode_block_bucket_cfg[2] > max_blocks:
+if contiguous_pa and decode_block_bucket_cfg[2] > max_blocks:
```
What is the decode bucket count now?
This is a bug fix for the missing buckets that cause "not warmed-up" warnings in non-contiguous PA cases. The actual number of decode buckets for linear bucketing is sensitive to the `*_BUCKET_STEP_*` configuration, and the default settings usually produce too many buckets, which can take hours or even days to warm up for cases with a long `max_model_len`.
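The filter added in `common.py` can be sketched as follows, assuming decode buckets are `(batch_size, num_blocks)` pairs; the actual vllm_gaudi bucket representation and filter code may differ:

```python
import math

def filter_decode_buckets(buckets, max_model_len, block_size):
    """Hypothetical sketch: keep a (batch_size, num_blocks) decode bucket only
    if its block count can actually be reached, i.e. it does not exceed
    batch_size * ceil(max_model_len / block_size)."""
    blocks_per_seq_cap = math.ceil(max_model_len / block_size)
    kept = []
    for batch_size, num_blocks in buckets:
        if num_blocks <= batch_size * blocks_per_seq_cap:
            kept.append((batch_size, num_blocks))
        # omitted buckets could be debug-logged here, as the PR does
    return kept

buckets = [(1, 2), (1, 3), (2, 4), (2, 8)]
print(filter_decode_buckets(buckets, max_model_len=256, block_size=128))
# [(1, 2), (2, 4)]
```

Every dropped bucket is one fewer graph to warm up, which is where the hours-to-days warmup reduction comes from.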
Signed-off-by: copilot <copilot@github.com>

Tests cover the four PRs addressing long-context bucketing:
- PR #762: Padding-aware bucketing strategy (warmup ranges, configs, generation)
- PR #1122: Exponential decode block formula, limit cap, filter, linear fix
- PR #1155: FusedSDPA slicing contract (pad_max bounds, strategy selection)
- PR #1346: HPU graph capture skip (cudagraph size, warmup clamp scenarios)
- Cross-PR integration: end-to-end 256K scenario, fallback, regressions

49 test functions organized in 6 test classes.

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Remove all production code changes from PRs #1122, #1155, #1346 and keep only the two test files created for issue #1347:
- tests/unit_tests/test_bucketing_issue_1347.py
- tests/unit_tests/test_bucketing_warmup_time.py

Signed-off-by: GitHub Copilot <copilot@github.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
@adobrzyn @kamil-kaczor |
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
…size * bs Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Changes:
- Recompute `max_decode_blocks` and `decode_blocks_limit`.
- Clamp `num_blocks` to avoid OOM.