Fix decode bucket generation for hybrid models with mismatched block sizes #1485
Pull request overview
Adjusts Gaudi bucketing warmup to generate decode buckets using the HPU attention kernel block granularity (attn_block_size) for hybrid models where it differs from the KV-cache management block_size, preventing “not warmed-up” warnings and repeated HPU graph recompilation during decode.
Changes:
- Temporarily overrides `bucketing_manager.block_size` to `attn_block_size` when generating decode buckets in `warmup_model()`.
- Restores the original `bucketing_manager.block_size` afterward to avoid impacting prompt fallback behavior.
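The override-and-restore pattern listed above can be sketched as a small context manager. This is a minimal illustration, not the code from the PR: `scoped_block_size` and `FakeBucketingManager` are hypothetical names, and the only assumption about the real bucketing manager is that it exposes a writable `block_size` attribute.

```python
from contextlib import contextmanager


@contextmanager
def scoped_block_size(bucketing_manager, attn_block_size):
    """Temporarily set bucketing_manager.block_size, restoring it on exit.

    Hypothetical helper: assumes only a writable `block_size` attribute.
    """
    original = bucketing_manager.block_size
    bucketing_manager.block_size = attn_block_size
    try:
        yield bucketing_manager
    finally:
        # Restore even if decode bucket generation raises, so prompt
        # fallback paths keep seeing the KV-cache block size.
        bucketing_manager.block_size = original


class FakeBucketingManager:
    """Hypothetical stand-in for the real bucketing manager."""

    def __init__(self, block_size):
        self.block_size = block_size


manager = FakeBucketingManager(block_size=640)
with scoped_block_size(manager, attn_block_size=128):
    # Decode bucket generation would run here.
    size_during_decode_generation = manager.block_size
size_after = manager.block_size
```

Wrapping the restore in `finally` is what keeps a failure during bucket generation from leaving the manager stuck on the attention block size.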
Finding 1 🟡 Medium · The test file defines a local Suggestion: Either (a) instantiate the real [- Reviewed by Awesome ChlOpus] |
For hybrid models like Qwen3.5 where block_size (640) differs from attn_block_size (128), two issues caused 'not warmed-up' warnings:

1. Decode bucket generation used block_size=640 instead of attn_block_size=128, producing too few/small buckets. Fix: scope bucketing_manager.block_size to attn_block_size during decode bucket generation (with try/finally for safe restoration).
2. Warmup execution capped num_blocks at kv_cache_config.num_blocks (physical pool), preventing large decode buckets from being warmed. At runtime, prefix-sharing can produce sum(block_table_entries) > physical blocks. Fix: only cap for contiguous PA where block_id must be valid; non-contiguous PA uses block_id=0 (always safe).

Fixes regression introduced in f24f3f9.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
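The second fix in the commit message above amounts to a conditional cap. The sketch below is illustrative only: the function name `warmup_num_blocks`, the flag name `use_contiguous_pa`, and the physical pool size of 20000 are assumptions made for the example, not values or identifiers taken from the PR.

```python
def warmup_num_blocks(requested_blocks, physical_blocks, use_contiguous_pa):
    """Pick the num_blocks value used for a warmup run (hypothetical helper).

    With contiguous PA, block ids must point into the physical pool, so the
    request is capped. Otherwise dummy data can reuse block_id=0, so the full
    bucket size can be warmed up.
    """
    if use_contiguous_pa:
        return min(requested_blocks, physical_blocks)
    return requested_blocks


# Assumed physical pool of 20000 blocks, purely for illustration.
capped = warmup_num_blocks(92160, 20000, use_contiguous_pa=True)
uncapped = warmup_num_blocks(92160, 20000, use_contiguous_pa=False)
```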
…hed block sizes (#1486)

## Problem

Backport of #1485 to releases/v0.21.0.

For hybrid models like Qwen3.5 (GDN + attention), `_align_hybrid_block_size()` sets `block_size=640` (unified KV-cache page), while HPU kernels use `attn_block_size=128`. Decode bucket generation uses the formula:

```
max_decode_blocks = ceil(max_model_len / block_size) * max_num_seqs
                  = ceil(262144 / 640) * 45 = 18450
```

But the runtime decode path computes `num_blocks` using `attn_block_size=128`, producing values up to `92160`, causing hundreds of **"Configuration was not warmed-up"** warnings and HPU graph recompilation.

## Fix

1. Temporarily scope `bucketing_manager.block_size` to `attn_block_size` during decode bucket generation in `warmup_model()`, then restore.
2. Use `attn_block_size` in `_prepare_dummy_scenario()` for decode dummy data so warmup shapes match the generated buckets.

## Testing

- Verified with Qwen3.5-35B-A3B on 4x Gaudi3 (TP=4, max_model_len=262144, max_num_seqs=45)
- No more "Configuration was not warmed-up" warnings during serving

Fixes regression introduced by f24f3f9.

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

---------

Signed-off-by: Agata Dobrzyniewicz <agata.dobrzyniewicz@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Yang Lei <yang.lei@intel.com>
Signed-off-by: Iryna Boiko <iryna.boiko@intel.com>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
✅ CI Passed
All checks passed successfully against the following vllm commit:
Problem
For hybrid models like Qwen3.5 (GDN + attention), `_align_hybrid_block_size()` sets `block_size=640` (unified KV-cache page for mamba/attention alignment), while HPU kernels use `attn_block_size=128`.

The decode bucket generation (introduced by f24f3f9) uses the formula:
`max_decode_blocks = ceil(max_model_len / block_size) * max_num_seqs = ceil(262144 / 640) * 45 = 18450`

But the runtime decode path (`_create_decode_input_data`) computes `num_blocks` using `attn_block_size=128`, producing values up to `ceil(262144/128) * 45 = 92160`.

This causes hundreds of "Configuration was not warmed-up" warnings and costly HPU graph recompilation on every decode step.
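The gap between the warmed-up and runtime block counts can be checked with a few lines of Python. This is only a worked calculation using the numbers quoted in this PR (`max_model_len=262144`, `max_num_seqs=45`); the helper name `max_decode_blocks` is chosen here for illustration and is not necessarily a function in the codebase.

```python
import math


def max_decode_blocks(max_model_len, block_size, max_num_seqs):
    """Upper bound on decode blocks: blocks per sequence times sequences."""
    return math.ceil(max_model_len / block_size) * max_num_seqs


# Bucket generation used the KV-cache page size (640).
warmed_up_limit = max_decode_blocks(262144, 640, 45)

# The runtime decode path counts blocks in attention-kernel pages (128).
runtime_limit = max_decode_blocks(262144, 128, 45)

print(warmed_up_limit, runtime_limit)  # prints: 18450 92160
```

Any runtime `num_blocks` above the warmed-up limit falls outside the generated buckets, which is where the warnings come from.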
Root Cause
Two different block_size semantics coexist:
- `self.block_size = 640`: KV-cache management page size (unified for hybrid mamba/attention)
- `self.attn_block_size = 128`: HPU attention kernel page size (what hardware actually uses)

Decode bucket generation used `block_size` but should use `attn_block_size` to match the runtime.

Fix
Temporarily scope `bucketing_manager.block_size` to `attn_block_size` during decode bucket generation in `warmup_model()`, then restore the original value so prompt fallback paths remain unaffected.

Testing
Signed-off-by: Youlei Yang <youlei.yang@intel.com>