
[Bugfix] Translate hybrid block_size for Metal paged attention kernel#235

Merged
ericcurtin merged 3 commits into vllm-project:main from ricky-chaoju:fix/hybrid-paged-attention-block-size
Apr 7, 2026
Conversation

@ricky-chaoju
Contributor

Summary

  • Fix RuntimeError when running Qwen3.5 hybrid model with paged attention: `Unable to load function paged_attention_..._bs544_...`
  • vLLM inflates block_size to 544 to align attention pages with mamba pages in hybrid models, but the Metal kernel only has instantiations for [8, 16, 32]
  • Add block-size translation in attention_sdpa.py: reshape cache (zero-copy) and expand block tables so the kernel sees a compatible block_size

For hybrid models (e.g. Qwen3.5), vLLM sets block_size=544 to align attention page size with mamba page size. The Metal paged attention kernel is template-instantiated for block sizes [8, 16, 32] only.

The fix picks the largest kernel-supported block size that divides evenly into the cache block size (544 % 32 = 0, so kernel uses 32 with ratio=17), then:

  1. Reshapes cache: [num_blocks, 544, heads, hd] -> [num_blocks*17, 32, heads, hd] (zero-copy, same physical memory)
  2. Expands block tables: each vLLM block b becomes 17 kernel blocks [b*17, ..., b*17+16]

Non-hybrid models (block_size=16) are unaffected (fast path skips translation).

Test

  • pytest tests/test_attention_dispatch.py -v -m "not slow" -- 4/4 passed
  • pytest tests/test_attention_dispatch.py::test_qwen35_paged_attention_hybrid -- passed (previously RuntimeError)
  • Unit tests for _pick_kernel_block_size and _build_block_tables translation logic

… attention

Signed-off-by: RickyChen / 陳昭儒 <rickychen@infinirc.com>
@ricky-chaoju ricky-chaoju marked this pull request as ready for review April 7, 2026 04:11
@ricky-chaoju ricky-chaoju force-pushed the fix/hybrid-paged-attention-block-size branch from 6417c9a to 251ece6 on April 7, 2026
Collaborator

@LxYuan0420 LxYuan0420 left a comment


do we have any unit test to cover this changes?

Comment thread on vllm_metal/metal_kernel_backend/attention_sdpa.py (outdated)
Signed-off-by: RickyChen / 陳昭儒 <rickychen@infinirc.com>
@ricky-chaoju
Contributor Author

ricky-chaoju commented Apr 7, 2026

do we have any unit test to cover this changes?

Added tests/test_block_size_translation.py with 10 test cases covering `_pick_kernel_block_size` and `_build_block_tables` (translation, padding, exact match, indivisible error).

Signed-off-by: RickyChen / 陳昭儒 <rickychen@infinirc.com>
@ricky-chaoju ricky-chaoju requested a review from LxYuan0420 April 7, 2026 08:42
@ericcurtin ericcurtin merged commit 0f6f76b into vllm-project:main Apr 7, 2026
5 checks passed
@ricky-chaoju ricky-chaoju deleted the fix/hybrid-paged-attention-block-size branch April 7, 2026 10:03
WindChimeRan pushed a commit that referenced this pull request Apr 8, 2026
## Summary
- Add Qwen3.5-0.8B smoke test alongside the existing Qwen3-0.6B test,
covering the hybrid SDPA + GDN linear attention paged path end-to-end
- Fix `json.load` → `json.loads(strict=False)` for both smoke tests —
responses containing newlines (e.g. Qwen3.5 output) cause `Invalid
control character` with strict parsing
- Pin model revision to `2fc06364715b967f1860aea9cf38778875588b17`
- Use longer health check timeout for Qwen3.5 (`--retry 30 --retry-delay
10`)
- Use `--max-num-seqs 1` and `VLLM_METAL_MEMORY_FRACTION=0.8` for
Qwen3.5 to fit within the CI runner's ~5GB Metal memory (hybrid models
allocate GDN linear state per slot, default 256 slots would exceed
budget)

Depends on #235 (merged) for the block_size translation fix.
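The `strict=False` fix addresses responses containing literal newlines, which Python's default JSON parser rejects as unescaped control characters. A minimal reproduction:

```python
import json

# A response body with a literal newline inside a JSON string value,
# as produced by multi-line Qwen3.5 output.
raw = '{"text": "line one\nline two"}'

# Default strict parsing rejects unescaped control characters:
try:
    json.loads(raw)
except json.JSONDecodeError:
    pass  # "Invalid control character at ..."

# strict=False tolerates control characters inside strings:
obj = json.loads(raw, strict=False)
```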

---------

Signed-off-by: RickyChen / 陳昭儒 <rickychen@infinirc.com>
Alex-ai-future pushed a commit to Alex-ai-future/vllm-metal that referenced this pull request Apr 8, 2026
Alex-ai-future pushed a commit to Alex-ai-future/vllm-metal that referenced this pull request Apr 8, 2026
Alex-ai-future added a commit to Alex-ai-future/vllm-metal that referenced this pull request Apr 8, 2026
Explains the block-size translation mechanism (PR vllm-project#235) when users
enable paged attention for hybrid models like Qwen3.5.

The warning describes:
- Why translation is needed (vLLM requires block_size=160, Metal kernel
  only supports {8, 16, 32})
- How it works (each vLLM block splits into multiple kernel blocks,
  cache is reshaped zero-copy)
- That the default MLX path is recommended for hybrid models

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Alex-ai-future added a commit to Alex-ai-future/vllm-metal that referenced this pull request Apr 8, 2026
…ence estimate

Updates test expectations to match the implementation changes:
- test_hybrid_with_paged_attention_logs_warning: Verify warning is logged
  instead of ValueError (PR vllm-project#235 made hybrid + paged attention supported)
- test_determine_available_memory_single_sequence_mode: Restore to test
  one-sequence estimate (PR vllm-project#229 design) instead of 80% memory fraction

Also fixes test fixtures to include required vllm_config attribute.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
