Fix decode bucket generation for hybrid models with mismatched block sizes by yangulei · Pull Request #1485 · vllm-project/vllm-gaudi

yangulei · 2026-05-24T04:14:31Z

Problem

For hybrid models like Qwen3.5 (GDN + attention), _align_hybrid_block_size() sets block_size=640 (unified KV-cache page for mamba/attention alignment), while HPU kernels use attn_block_size=128.

The decode bucket generation (introduced by f24f3f9) uses the formula:

max_decode_blocks = ceil(max_model_len / block_size) * max_num_seqs
                  = ceil(262144 / 640) * 45 = 18450

But the runtime decode path (_create_decode_input_data) computes num_blocks using attn_block_size=128, producing values up to ceil(262144/128) * 45 = 92160.

This causes hundreds of "Configuration was not warmed-up" warnings and costly HPU graph recompilation on every decode step.
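
The mismatch can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the PR: the helper name `max_decode_blocks` is hypothetical and simply restates the formula quoted above with the values from the description.

```python
def max_decode_blocks(max_model_len: int, block_size: int, max_num_seqs: int) -> int:
    """Upper bound on decode blocks: ceil(max_model_len / block_size) * max_num_seqs."""
    blocks_per_seq = (max_model_len + block_size - 1) // block_size  # integer ceil
    return blocks_per_seq * max_num_seqs


# Values taken from the PR description.
warmed_up_max = max_decode_blocks(262144, block_size=640, max_num_seqs=45)
runtime_max = max_decode_blocks(262144, block_size=128, max_num_seqs=45)

print(warmed_up_max)  # 18450
print(runtime_max)    # 92160
```

Any runtime `num_blocks` value above 18450 falls outside the warmed-up range, which is where the warnings come from.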

Root Cause

Two different block_size semantics coexist:

self.block_size = 640: KV-cache management page size (unified for hybrid mamba/attention)
self.attn_block_size = 128: HPU attention kernel page size (what hardware actually uses)

Decode bucket generation used block_size but should use attn_block_size to match the runtime.

Fix

Temporarily scope bucketing_manager.block_size to attn_block_size during decode bucket generation in warmup_model(), then restore the original value so prompt fallback paths remain unaffected.
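
One way to express "scope, then restore" is a small context manager. The sketch below is an illustration under stated assumptions, not the code in `warmup_model()`: `FakeBucketingManager`, `scoped_block_size`, and `max_decode_blocks` are hypothetical stand-ins, and only the `block_size` attribute name comes from the description.

```python
from contextlib import contextmanager


class FakeBucketingManager:
    """Hypothetical stand-in; only the block_size attribute mirrors the PR text."""

    def __init__(self, block_size: int):
        self.block_size = block_size

    def max_decode_blocks(self, max_model_len: int, max_num_seqs: int) -> int:
        # Same formula as in the Problem section, using the current block_size.
        return -(-max_model_len // self.block_size) * max_num_seqs


@contextmanager
def scoped_block_size(manager, block_size: int):
    """Temporarily override manager.block_size, restoring it even on error."""
    original = manager.block_size
    manager.block_size = block_size
    try:
        yield manager
    finally:
        manager.block_size = original


manager = FakeBucketingManager(block_size=640)
with scoped_block_size(manager, 128):
    decode_max = manager.max_decode_blocks(262144, 45)  # uses attn_block_size

print(decode_max, manager.block_size)
```

The `finally` clause means the original value comes back even if bucket generation raises, so later prompt-side code still sees the KV-cache page size.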

Testing

Verified with Qwen3.5-35B-A3B on 4x Gaudi3 (TP=4, max_model_len=262144, max_num_seqs=45)
Decode buckets now correctly cover runtime num_blocks range
No more "Configuration was not warmed-up" warnings during serving

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

Copilot

Pull request overview

Adjusts Gaudi bucketing warmup to generate decode buckets using the HPU attention kernel block granularity (attn_block_size) for hybrid models where it differs from the KV-cache management block_size, preventing “not warmed-up” warnings and repeated HPU graph recompilation during decode.

Changes:

Temporarily overrides bucketing_manager.block_size to attn_block_size when generating decode buckets in warmup_model().
Restores the original bucketing_manager.block_size afterward to avoid impacting prompt fallback behavior.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

adobrzyn · 2026-05-25T08:23:28Z

Finding 1 🟡 Medium · tests/unit_tests/test_decode_bucket_hybrid.py:L122-141

The test file defines a local _generate_seq_lengths() that reimplements the production method instead of importing/patching the real HPUModelRunner._generate_seq_lengths. As written, the tests will pass even if the production method in vllm_gaudi/v1/worker/hpu_model_runner.py regresses (e.g., someone re-introduces the unconditional cap). The same risk applies to _MockModelRunner — it only encodes the assumed invariants, not the real ones.

Suggestion: Either (a) instantiate the real HPUModelRunner with mocks and call runner._generate_seq_lengths(...), or (b) import the function (or extract it into a pure helper shared by production and tests). Otherwise this file behaves more like documentation than a regression test.

[- Reviewed by Awesome ChlOpus]

For hybrid models like Qwen3.5 where block_size (640) differs from attn_block_size (128), two issues caused 'not warmed-up' warnings:

1. Decode bucket generation used block_size=640 instead of attn_block_size=128, producing too few/small buckets. Fix: scope bucketing_manager.block_size to attn_block_size during decode bucket generation (with try/finally for safe restoration).
2. Warmup execution capped num_blocks at kv_cache_config.num_blocks (physical pool), preventing large decode buckets from being warmed. At runtime, prefix-sharing can produce sum(block_table_entries) > physical blocks. Fix: only cap for contiguous PA where block_id must be valid; non-contiguous PA uses block_id=0 (always safe).

Fixes regression introduced in f24f3f9.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
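
The conditional cap described in the second fix could look roughly like the following. This is a hedged sketch only: the function `warmup_num_blocks`, its parameters, and the pool size of 20000 are hypothetical and chosen for illustration; the commit message states the rule but not the actual code.

```python
def warmup_num_blocks(bucket_num_blocks: int,
                      physical_num_blocks: int,
                      use_contiguous_pa: bool) -> int:
    """Pick the num_blocks value to use when warming up one decode bucket.

    Assumption taken from the commit message: contiguous PA needs valid
    block ids, so it is capped at the physical pool; non-contiguous PA
    fills dummy data with block_id=0 and therefore needs no cap.
    """
    if use_contiguous_pa:
        return min(bucket_num_blocks, physical_num_blocks)
    return bucket_num_blocks


# Illustrative numbers: 92160 is the runtime maximum from the description,
# 20000 is a made-up physical pool size.
print(warmup_num_blocks(92160, 20000, use_contiguous_pa=True))   # 20000
print(warmup_num_blocks(92160, 20000, use_contiguous_pa=False))  # 92160
```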

kamil-kaczor

lgtm

…hed block sizes (#1486)

## Problem

Backport of #1485 to releases/v0.21.0.

For hybrid models like Qwen3.5 (GDN + attention), `_align_hybrid_block_size()` sets `block_size=640` (unified KV-cache page), while HPU kernels use `attn_block_size=128`.

Decode bucket generation uses the formula:

```
max_decode_blocks = ceil(max_model_len / block_size) * max_num_seqs
                  = ceil(262144 / 640) * 45 = 18450
```

But the runtime decode path computes `num_blocks` using `attn_block_size=128`, producing values up to `92160`, causing hundreds of **"Configuration was not warmed-up"** warnings and HPU graph recompilation.

## Fix

1. Temporarily scope `bucketing_manager.block_size` to `attn_block_size` during decode bucket generation in `warmup_model()`, then restore.
2. Use `attn_block_size` in `_prepare_dummy_scenario()` for decode dummy data so warmup shapes match the generated buckets.

## Testing

- Verified with Qwen3.5-35B-A3B on 4x Gaudi3 (TP=4, max_model_len=262144, max_num_seqs=45)
- No more "Configuration was not warmed-up" warnings during serving

Fixes regression introduced by f24f3f9.

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

---------

Signed-off-by: Agata Dobrzyniewicz <agata.dobrzyniewicz@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Yang Lei <yang.lei@intel.com>
Signed-off-by: Iryna Boiko <iryna.boiko@intel.com>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>

github-actions · 2026-05-26T17:07:33Z

✅ CI Passed

All checks passed successfully against the following vllm commit:
0a54df28471be07b3d668ea21c5e411569d3baea

github-actions · 2026-05-27T00:51:46Z

✅ CI Passed

All checks passed successfully against the following vllm commit:
0a54df28471be07b3d668ea21c5e411569d3baea

Copilot AI review requested due to automatic review settings May 24, 2026 04:14

yangulei requested review from PatrykWo, adobrzyn, afierka-intel, iboiko-habana, jbyczkow, kamil-kaczor, ksmusz, mgawarkiewicz-intel, michalkuligowski and xuechendi as code owners May 24, 2026 04:14

yangulei had a problem deploying to pre-merge-approval May 24, 2026 04:14 — with GitHub Actions Error

Copilot started reviewing on behalf of yangulei May 24, 2026 04:14

Copilot AI reviewed May 24, 2026

Comment thread vllm_gaudi/v1/worker/hpu_model_runner.py Outdated

yangulei mentioned this pull request May 24, 2026

[v0.21.0] Fix decode bucket generation for hybrid models with mismatched block sizes #1486

Merged

yangulei marked this pull request as draft May 24, 2026 04:18

yangulei force-pushed the fix/decode-bucket-attn-block-size branch from f028957 to d7691b3 Compare May 24, 2026 04:30

yangulei had a problem deploying to pre-merge-approval May 24, 2026 04:30 — with GitHub Actions Error

yangulei force-pushed the fix/decode-bucket-attn-block-size branch from d7691b3 to c3fc144 Compare May 24, 2026 05:40

yangulei had a problem deploying to pre-merge-approval May 24, 2026 05:40 — with GitHub Actions Error

yangulei force-pushed the fix/decode-bucket-attn-block-size branch from c3fc144 to 85e7f46 Compare May 24, 2026 05:49

yangulei had a problem deploying to pre-merge-approval May 24, 2026 05:49 — with GitHub Actions Error

yangulei requested a review from Copilot May 24, 2026 05:50

Copilot started reviewing on behalf of yangulei May 24, 2026 05:50

Copilot AI reviewed May 24, 2026

Comment thread tests/unit_tests/test_decode_bucket_hybrid.py

Comment thread tests/unit_tests/test_decode_bucket_hybrid.py Outdated

Comment thread tests/unit_tests/test_decode_bucket_hybrid.py

yangulei force-pushed the fix/decode-bucket-attn-block-size branch from 85e7f46 to 15351d6 Compare May 24, 2026 06:06

yangulei had a problem deploying to pre-merge-approval May 24, 2026 06:06 — with GitHub Actions Error

yangulei force-pushed the fix/decode-bucket-attn-block-size branch from 15351d6 to 2265f34 Compare May 24, 2026 06:36

yangulei had a problem deploying to pre-merge-approval May 24, 2026 06:36 — with GitHub Actions Error

yangulei marked this pull request as ready for review May 24, 2026 06:36

yangulei temporarily deployed to pre-merge-approval May 24, 2026 06:36 — with GitHub Actions Inactive

yangulei force-pushed the fix/decode-bucket-attn-block-size branch from 2265f34 to 9a67618 Compare May 24, 2026 06:47

yangulei temporarily deployed to pre-merge-approval May 24, 2026 06:47 — with GitHub Actions Inactive

github-actions Bot mentioned this pull request May 24, 2026

🚦 Team Review Dashboard #701

Open

yangulei force-pushed the fix/decode-bucket-attn-block-size branch from 9a67618 to 6000bce Compare May 24, 2026 07:41

yangulei temporarily deployed to pre-merge-approval May 24, 2026 07:41 — with GitHub Actions Inactive

kamil-kaczor had a problem deploying to pre-merge-approval May 25, 2026 07:57 — with GitHub Actions Error

kamil-kaczor requested changes May 25, 2026

Comment thread tests/unit_tests/test_decode_bucket_hybrid.py Outdated

Comment thread tests/unit_tests/test_decode_bucket_hybrid.py

yangulei force-pushed the fix/decode-bucket-attn-block-size branch from 06a0a41 to 1357c00 Compare May 25, 2026 13:55

yangulei temporarily deployed to pre-merge-approval May 25, 2026 13:55 — with GitHub Actions Inactive

kamil-kaczor approved these changes May 26, 2026

yangulei temporarily deployed to pre-merge-approval May 26, 2026 13:23 — with GitHub Actions Inactive

yangulei temporarily deployed to pre-merge-approval May 26, 2026 13:31 — with GitHub Actions Inactive

yangulei temporarily deployed to pre-merge-approval May 26, 2026 21:14 — with GitHub Actions Inactive

iboiko-habana merged commit bc4f535 into vllm-project:main May 27, 2026
6 of 9 checks passed

yangulei deleted the fix/decode-bucket-attn-block-size branch May 28, 2026 01:01

iboiko-habana merged 1 commit into vllm-project:main from yangulei:fix/decode-bucket-attn-block-size

5 participants
