
Fix decode bucket generation for hybrid models with mismatched block sizes #1485

Merged
iboiko-habana merged 1 commit into vllm-project:main from yangulei:fix/decode-bucket-attn-block-size
May 27, 2026

@yangulei
Collaborator

Problem

For hybrid models like Qwen3.5 (GDN + attention), _align_hybrid_block_size() sets block_size=640 (unified KV-cache page for mamba/attention alignment), while HPU kernels use attn_block_size=128.

The decode bucket generation (introduced by f24f3f9) uses the formula:

max_decode_blocks = ceil(max_model_len / block_size) * max_num_seqs
                  = ceil(262144 / 640) * 45 = 18450

But the runtime decode path (_create_decode_input_data) computes num_blocks using attn_block_size=128, producing values up to ceil(262144/128) * 45 = 92160.

This leaves runtime num_blocks values above 18450 without a warmed-up bucket, causing hundreds of "Configuration was not warmed-up" warnings and costly HPU graph recompilation on every decode step.
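
The arithmetic behind the mismatch can be checked with a short sketch. The helper name `max_decode_blocks` is hypothetical and only restates the formula quoted above; the numbers are the ones given in this description.

```python
def max_decode_blocks(max_model_len: int, block_size: int, max_num_seqs: int) -> int:
    """Blocks needed by one full-length sequence, times the batch size.

    Hypothetical helper that mirrors the formula in the description;
    it is not a function from the vllm-gaudi code base.
    """
    blocks_per_seq = -(-max_model_len // block_size)  # integer ceiling division
    return blocks_per_seq * max_num_seqs


MAX_MODEL_LEN = 262144
MAX_NUM_SEQS = 45

# Bound used when generating decode buckets (unified KV-cache page size).
warmup_bound = max_decode_blocks(MAX_MODEL_LEN, 640, MAX_NUM_SEQS)
# Bound reachable at runtime (HPU attention kernel page size).
runtime_bound = max_decode_blocks(MAX_MODEL_LEN, 128, MAX_NUM_SEQS)

print(warmup_bound)   # 18450
print(runtime_bound)  # 92160
```

Any runtime `num_blocks` value between the two bounds falls outside the warmed-up range, which is consistent with the warnings reported here.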

Root Cause

Two different block_size semantics coexist:

  • self.block_size = 640: KV-cache management page size (unified for hybrid mamba/attention)
  • self.attn_block_size = 128: HPU attention kernel page size (what hardware actually uses)

Decode bucket generation used block_size but should use attn_block_size to match the runtime.

Fix

Temporarily scope bucketing_manager.block_size to attn_block_size during decode bucket generation in warmup_model(), then restore the original value so prompt fallback paths remain unaffected.
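
A minimal sketch of the override-and-restore pattern, assuming only that the bucketing manager exposes a mutable `block_size` attribute. The context manager `scoped_block_size` and the `SimpleNamespace` stand-in are illustrative assumptions; the commit message in this thread describes the actual change as a try/finally inside `warmup_model()`.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def scoped_block_size(bucketing_manager, block_size):
    """Temporarily set bucketing_manager.block_size, restoring it on exit.

    Hypothetical helper: the restore runs in `finally`, so the original
    value comes back even if bucket generation raises.
    """
    original = bucketing_manager.block_size
    bucketing_manager.block_size = block_size
    try:
        yield bucketing_manager
    finally:
        bucketing_manager.block_size = original


# Stand-in object; the real bucketing manager is not reproduced here.
manager = SimpleNamespace(block_size=640)

with scoped_block_size(manager, 128):
    inside = manager.block_size  # decode buckets would be generated here
after = manager.block_size

print(inside, after)  # 128 640
```

Restoring in `finally` is what keeps prompt fallback paths on the original value when decode bucket generation fails partway through.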

Testing

  • Verified with Qwen3.5-35B-A3B on 4x Gaudi3 (TP=4, max_model_len=262144, max_num_seqs=45)
  • Decode buckets now correctly cover runtime num_blocks range
  • No more "Configuration was not warmed-up" warnings during serving

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

Contributor

Copilot AI left a comment


Pull request overview

Adjusts Gaudi bucketing warmup to generate decode buckets using the HPU attention kernel block granularity (attn_block_size) for hybrid models where it differs from the KV-cache management block_size, preventing “not warmed-up” warnings and repeated HPU graph recompilation during decode.

Changes:

  • Temporarily overrides bucketing_manager.block_size to attn_block_size when generating decode buckets in warmup_model().
  • Restores the original bucketing_manager.block_size afterward to avoid impacting prompt fallback behavior.

Comment thread vllm_gaudi/v1/worker/hpu_model_runner.py Outdated
@yangulei yangulei marked this pull request as draft May 24, 2026 04:18
@yangulei yangulei force-pushed the fix/decode-bucket-attn-block-size branch from f028957 to d7691b3 Compare May 24, 2026 04:30
@yangulei yangulei had a problem deploying to pre-merge-approval May 24, 2026 04:30 — with GitHub Actions Error
@yangulei yangulei force-pushed the fix/decode-bucket-attn-block-size branch from d7691b3 to c3fc144 Compare May 24, 2026 05:40
@yangulei yangulei had a problem deploying to pre-merge-approval May 24, 2026 05:40 — with GitHub Actions Error
@yangulei yangulei force-pushed the fix/decode-bucket-attn-block-size branch from c3fc144 to 85e7f46 Compare May 24, 2026 05:49
@yangulei yangulei had a problem deploying to pre-merge-approval May 24, 2026 05:49 — with GitHub Actions Error
@yangulei yangulei requested a review from Copilot May 24, 2026 05:50
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread tests/unit_tests/test_decode_bucket_hybrid.py
Comment thread tests/unit_tests/test_decode_bucket_hybrid.py Outdated
Comment thread tests/unit_tests/test_decode_bucket_hybrid.py
@yangulei yangulei force-pushed the fix/decode-bucket-attn-block-size branch from 85e7f46 to 15351d6 Compare May 24, 2026 06:06
@yangulei yangulei had a problem deploying to pre-merge-approval May 24, 2026 06:06 — with GitHub Actions Error
@yangulei yangulei force-pushed the fix/decode-bucket-attn-block-size branch from 15351d6 to 2265f34 Compare May 24, 2026 06:36
@yangulei yangulei had a problem deploying to pre-merge-approval May 24, 2026 06:36 — with GitHub Actions Error
@yangulei yangulei marked this pull request as ready for review May 24, 2026 06:36
@yangulei yangulei temporarily deployed to pre-merge-approval May 24, 2026 06:36 — with GitHub Actions Inactive
@yangulei yangulei force-pushed the fix/decode-bucket-attn-block-size branch from 2265f34 to 9a67618 Compare May 24, 2026 06:47
@yangulei yangulei temporarily deployed to pre-merge-approval May 24, 2026 06:47 — with GitHub Actions Inactive
@yangulei yangulei force-pushed the fix/decode-bucket-attn-block-size branch from 9a67618 to 6000bce Compare May 24, 2026 07:41
@yangulei yangulei temporarily deployed to pre-merge-approval May 24, 2026 07:41 — with GitHub Actions Inactive
Comment thread tests/unit_tests/test_decode_bucket_hybrid.py Outdated
Comment thread tests/unit_tests/test_decode_bucket_hybrid.py
@adobrzyn
Collaborator

Finding 1 🟡 Medium · tests/unit_tests/test_decode_bucket_hybrid.py:L122-141

The test file defines a local _generate_seq_lengths() that reimplements the production method instead of importing/patching the real HPUModelRunner._generate_seq_lengths. As written, the tests will pass even if the production method in vllm_gaudi/v1/worker/hpu_model_runner.py regresses (e.g., someone re-introduces the unconditional cap). The same risk applies to _MockModelRunner — it only encodes the assumed invariants, not the real ones.

Suggestion: Either (a) instantiate the real HPUModelRunner with mocks and call runner._generate_seq_lengths(...), or (b) import the function (or extract it into a pure helper shared by production and tests). Otherwise this file behaves more like documentation than a regression test.
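
A rough sketch of option (a) in lightweight form: call the production method unbound and pass a small fake `self`, so the test exercises the real code path without building a full runner. `RunnerStandIn`, `_cap_num_blocks`, and the attribute names are hypothetical placeholders; the real `HPUModelRunner._generate_seq_lengths` signature is not shown in this thread.

```python
from types import SimpleNamespace


class RunnerStandIn:
    """Hypothetical stand-in for the production class (not HPUModelRunner)."""

    def _cap_num_blocks(self, num_blocks: int) -> int:
        # Intended behavior per this PR: cap only for contiguous PA.
        if self.use_contiguous_pa:
            return min(num_blocks, self.physical_blocks)
        return num_blocks


def test_non_contiguous_is_not_capped():
    fake_self = SimpleNamespace(use_contiguous_pa=False, physical_blocks=20000)
    # Unbound call: runs the class's own method body against the fake self.
    assert RunnerStandIn._cap_num_blocks(fake_self, 92160) == 92160


def test_contiguous_is_capped():
    fake_self = SimpleNamespace(use_contiguous_pa=True, physical_blocks=20000)
    assert RunnerStandIn._cap_num_blocks(fake_self, 92160) == 20000
```

Because the test calls the class's method rather than a local copy, reintroducing an unconditional cap in that method would make `test_non_contiguous_is_not_capped` fail.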


[- Reviewed by Awesome ChlOpus]

@yangulei yangulei force-pushed the fix/decode-bucket-attn-block-size branch from 06a0a41 to 1357c00 Compare May 25, 2026 13:55
@yangulei yangulei temporarily deployed to pre-merge-approval May 25, 2026 13:55 — with GitHub Actions Inactive
For hybrid models like Qwen3.5 where block_size (640) differs from
attn_block_size (128), two issues caused 'not warmed-up' warnings:

1. Decode bucket generation used block_size=640 instead of
   attn_block_size=128, producing too few/small buckets. Fix: scope
   bucketing_manager.block_size to attn_block_size during decode
   bucket generation (with try/finally for safe restoration).

2. Warmup execution capped num_blocks at kv_cache_config.num_blocks
   (physical pool), preventing large decode buckets from being warmed.
   At runtime, prefix-sharing can produce sum(block_table_entries) >
   physical blocks. Fix: only cap for contiguous PA where block_id
   must be valid; non-contiguous PA uses block_id=0 (always safe).

Fixes regression introduced in f24f3f9.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
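
The prefix-sharing claim in item 2 can be illustrated with a small count. The numbers below are made up for illustration and `count_block_table_entries` is a hypothetical helper; the point is only that summed block-table entries can far exceed the number of distinct physical blocks.

```python
def count_block_table_entries(block_tables):
    """Return (total entries across all tables, number of distinct block ids)."""
    total_entries = sum(len(table) for table in block_tables)
    distinct_blocks = len({block_id for table in block_tables for block_id in table})
    return total_entries, distinct_blocks


# Illustrative only: 45 sequences share a 1000-block prefix,
# and each sequence owns one private block after it.
shared_prefix = list(range(1000))
block_tables = [shared_prefix + [1000 + seq] for seq in range(45)]

total_entries, distinct_blocks = count_block_table_entries(block_tables)

print(total_entries)    # 45045
print(distinct_blocks)  # 1045
```

In this toy case a pool of 1045 physical blocks backs 45045 block-table entries, so capping warmup at the pool size would skip buckets that the runtime can still request.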
Collaborator

@kamil-kaczor kamil-kaczor left a comment


lgtm

mgawarkiewicz-intel pushed a commit that referenced this pull request May 26, 2026
…hed block sizes (#1486)

## Problem

Backport of #1485 to releases/v0.21.0.

For hybrid models like Qwen3.5 (GDN + attention),
`_align_hybrid_block_size()` sets `block_size=640` (unified KV-cache
page), while HPU kernels use `attn_block_size=128`.

Decode bucket generation uses the formula:
```
max_decode_blocks = ceil(max_model_len / block_size) * max_num_seqs
                  = ceil(262144 / 640) * 45 = 18450
```

But the runtime decode path computes `num_blocks` using
`attn_block_size=128`, producing values up to `92160`, causing hundreds
of **"Configuration was not warmed-up"** warnings and HPU graph
recompilation.

## Fix

1. Temporarily scope `bucketing_manager.block_size` to `attn_block_size`
during decode bucket generation in `warmup_model()`, then restore.
2. Use `attn_block_size` in `_prepare_dummy_scenario()` for decode dummy
data so warmup shapes match the generated buckets.

## Testing

- Verified with Qwen3.5-35B-A3B on 4x Gaudi3 (TP=4,
max_model_len=262144, max_num_seqs=45)
- No more "Configuration was not warmed-up" warnings during serving

Fixes regression introduced by f24f3f9.

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

---------

Signed-off-by: Agata Dobrzyniewicz <agata.dobrzyniewicz@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Yang Lei <yang.lei@intel.com>
Signed-off-by: Iryna Boiko <iryna.boiko@intel.com>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
@yangulei yangulei temporarily deployed to pre-merge-approval May 26, 2026 13:23 — with GitHub Actions Inactive
@yangulei yangulei temporarily deployed to pre-merge-approval May 26, 2026 13:31 — with GitHub Actions Inactive
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
0a54df28471be07b3d668ea21c5e411569d3baea

@yangulei yangulei temporarily deployed to pre-merge-approval May 26, 2026 21:14 — with GitHub Actions Inactive
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
0a54df28471be07b3d668ea21c5e411569d3baea

@iboiko-habana iboiko-habana merged commit bc4f535 into vllm-project:main May 27, 2026
6 of 9 checks passed
@yangulei yangulei deleted the fix/decode-bucket-attn-block-size branch May 28, 2026 01:01

5 participants