[Bugfix] Translate hybrid block_size for Metal paged attention kernel#235
Merged
ericcurtin merged 3 commits into vllm-project:main on Apr 7, 2026
Merged
Conversation
… attention Signed-off-by: RickyChen / 陳昭儒 <rickychen@infinirc.com>
Force-pushed from 6417c9a to 251ece6
LxYuan0420
requested changes
Apr 7, 2026
Collaborator
LxYuan0420
left a comment
Do we have any unit tests to cover these changes?
Signed-off-by: RickyChen / 陳昭儒 <rickychen@infinirc.com>
Contributor
Author
Added.
Signed-off-by: RickyChen / 陳昭儒 <rickychen@infinirc.com>
LxYuan0420
approved these changes
Apr 7, 2026
Merged
ericcurtin
approved these changes
Apr 7, 2026
WindChimeRan
pushed a commit
that referenced
this pull request
Apr 8, 2026
## Summary

- Add a Qwen3.5-0.8B smoke test alongside the existing Qwen3-0.6B test, covering the hybrid SDPA + GDN linear attention paged path end-to-end
- Fix `json.load` → `json.loads(strict=False)` for both smoke tests; responses containing newlines (e.g. Qwen3.5 output) cause `Invalid control character` with strict parsing
- Pin the model revision to `2fc06364715b967f1860aea9cf38778875588b17`
- Use a longer health-check timeout for Qwen3.5 (`--retry 30 --retry-delay 10`)
- Use `--max-num-seqs 1` and `VLLM_METAL_MEMORY_FRACTION=0.8` for Qwen3.5 to fit within the CI runner's ~5GB Metal memory (hybrid models allocate GDN linear state per slot; the default 256 slots would exceed the budget)

Depends on #235 (merged) for the block_size translation fix.

---------

Signed-off-by: RickyChen / 陳昭儒 <rickychen@infinirc.com>
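The `json.load` → `json.loads(strict=False)` change can be reproduced in isolation. With default strict parsing, a raw newline inside a JSON string value raises `Invalid control character`; `strict=False` accepts it. This is a minimal sketch with a made-up response body, not the smoke test's actual payload:

```python
import json

# A response body whose string value contains a literal newline,
# like the Qwen3.5 output mentioned above.
raw = '{"text": "line one\nline two"}'

# Default strict parsing rejects raw control characters in strings.
try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    print("strict parse failed:", exc.msg)

# strict=False permits control characters inside string values.
data = json.loads(raw, strict=False)
print(repr(data["text"]))  # 'line one\nline two'
```

The same `strict` parameter exists on `json.load`; the commit switches to `json.loads` on the already-read body so the flag can be passed.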
Alex-ai-future
pushed a commit
to Alex-ai-future/vllm-metal
that referenced
this pull request
Apr 8, 2026
…vllm-project#235)

## Summary

- Fix a RuntimeError when running the Qwen3.5 hybrid model with paged attention: `Unable to load function paged_attention_..._bs544_...`
- vLLM inflates block_size to 544 to align attention pages with mamba pages in hybrid models, but the Metal kernel only has instantiations for [8, 16, 32]
- Add block-size translation in attention_sdpa.py: reshape the cache (zero-copy) and expand block tables so the kernel sees a compatible block_size

For hybrid models (e.g. Qwen3.5), vLLM sets block_size=544 to align the attention page size with the mamba page size. The Metal paged attention kernel is template-instantiated for block sizes [8, 16, 32] only. The fix picks the largest kernel-supported block size that divides evenly into the cache block size (544 % 32 = 0, so the kernel uses 32 with ratio=17), then:

1. Reshapes the cache: [num_blocks, 544, heads, hd] -> [num_blocks*17, 32, heads, hd] (zero-copy, same physical memory)
2. Expands block tables: each vLLM block b becomes the 17 kernel blocks [b*17, ..., b*17+16]

Non-hybrid models (block_size=16) are unaffected (a fast path skips the translation).

## Test

- [x] `pytest tests/test_attention_dispatch.py -v -m "not slow"` -- 4/4 passed
- [x] `pytest tests/test_attention_dispatch.py::test_qwen35_paged_attention_hybrid` -- passed (previously RuntimeError)
- [x] Unit tests for _pick_kernel_block_size and _build_block_tables translation logic

---------

Signed-off-by: RickyChen / 陳昭儒 <rickychen@infinirc.com>
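The selection and block-table expansion described above can be sketched as follows. The helper names echo `_pick_kernel_block_size` and `_build_block_tables` from the PR, but these signatures and the plain-list representation are assumptions for illustration, not the actual attention_sdpa.py API:

```python
# Metal kernel template instantiations, per the PR description.
KERNEL_BLOCK_SIZES = (8, 16, 32)

def pick_kernel_block_size(cache_block_size: int) -> int:
    """Largest kernel-supported block size dividing the cache block size."""
    for bs in sorted(KERNEL_BLOCK_SIZES, reverse=True):
        if cache_block_size % bs == 0:
            return bs
    raise ValueError(f"no supported block size divides {cache_block_size}")

def expand_block_table(block_table: list[int], ratio: int) -> list[int]:
    """Each vLLM block b becomes the ratio kernel blocks b*ratio .. b*ratio+ratio-1."""
    return [b * ratio + i for b in block_table for i in range(ratio)]

# Hybrid Qwen3.5 case: vLLM block_size=544 -> kernel block_size=32, ratio=17.
kernel_bs = pick_kernel_block_size(544)
ratio = 544 // kernel_bs
print(kernel_bs, ratio)                        # 32 17
print(expand_block_table([3], ratio)[:3])      # [51, 52, 53]
```

For non-hybrid models with block_size=16 the picked size equals the cache block size (ratio=1), which is the fast path that skips translation.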
Alex-ai-future
pushed a commit
to Alex-ai-future/vllm-metal
that referenced
this pull request
Apr 8, 2026
…oject#239)

## Summary

- Add a Qwen3.5-0.8B smoke test alongside the existing Qwen3-0.6B test, covering the hybrid SDPA + GDN linear attention paged path end-to-end
- Fix `json.load` → `json.loads(strict=False)` for both smoke tests; responses containing newlines (e.g. Qwen3.5 output) cause `Invalid control character` with strict parsing
- Pin the model revision to `2fc06364715b967f1860aea9cf38778875588b17`
- Use a longer health-check timeout for Qwen3.5 (`--retry 30 --retry-delay 10`)
- Use `--max-num-seqs 1` and `VLLM_METAL_MEMORY_FRACTION=0.8` for Qwen3.5 to fit within the CI runner's ~5GB Metal memory (hybrid models allocate GDN linear state per slot; the default 256 slots would exceed the budget)

Depends on vllm-project#235 (merged) for the block_size translation fix.

---------

Signed-off-by: RickyChen / 陳昭儒 <rickychen@infinirc.com>
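Putting the flags above together, the Qwen3.5 smoke-test setup looks roughly like this. The env var, flags, and revision come from the commit message; the model id, port, and `vllm serve` invocation are assumptions about the CI script, not its actual contents:

```shell
# Cap Metal memory to fit the CI runner's ~5GB budget (value from the PR).
export VLLM_METAL_MEMORY_FRACTION=0.8

# One sequence slot: hybrid models allocate GDN linear state per slot,
# so the default 256 slots would exceed the budget.
vllm serve Qwen/Qwen3.5-0.8B \
  --revision 2fc06364715b967f1860aea9cf38778875588b17 \
  --max-num-seqs 1 &

# Longer health-check window for the hybrid model's slower startup.
curl --retry 30 --retry-delay 10 --retry-connrefused \
  http://localhost:8000/health
```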
Alex-ai-future
added a commit
to Alex-ai-future/vllm-metal
that referenced
this pull request
Apr 8, 2026
Explains the block-size translation mechanism (PR vllm-project#235) when users enable paged attention for hybrid models like Qwen3.5. The warning describes:

- Why translation is needed (vLLM requires block_size=160, the Metal kernel only supports {8, 16, 32})
- How it works (each vLLM block splits into multiple kernel blocks, the cache is reshaped zero-copy)
- That the default MLX path is recommended for hybrid models

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Alex-ai-future
added a commit
to Alex-ai-future/vllm-metal
that referenced
this pull request
Apr 8, 2026
…ence estimate

Updates test expectations to match the implementation changes:

- test_hybrid_with_paged_attention_logs_warning: verify a warning is logged instead of a ValueError raised (PR vllm-project#235 made hybrid + paged attention supported)
- test_determine_available_memory_single_sequence_mode: restore the one-sequence estimate test (PR vllm-project#229 design) instead of the 80% memory fraction

Also fixes the test fixtures to include the required vllm_config attribute.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
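The behavioral change that test asserts (warn instead of raise on a non-native hybrid block size) can be illustrated self-contained. The real test presumably uses pytest's `caplog` fixture against the actual module; the function, logger name, and handler here are stand-ins:

```python
import logging

logger = logging.getLogger("vllm_metal.attention")

def check_hybrid_paged_config(block_size: int) -> None:
    """Stand-in for the post-#235 behavior: warn and translate, don't raise."""
    if block_size not in (8, 16, 32):
        # Before PR #235 this configuration raised ValueError; now it
        # logs a warning and relies on block-size translation.
        logger.warning(
            "hybrid block_size=%d is not kernel-native; translating", block_size
        )

# Capture log records the way a test would, without pytest machinery.
records: list[logging.LogRecord] = []

class ListHandler(logging.Handler):
    def emit(self, record: logging.LogRecord) -> None:
        records.append(record)

logger.addHandler(ListHandler())
logger.setLevel(logging.WARNING)

check_hybrid_paged_config(160)  # completes with no exception
assert any(r.levelno == logging.WARNING for r in records)
print("warning logged, no ValueError")
```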
Summary

Fix a RuntimeError when running the Qwen3.5 hybrid model with paged attention: `Unable to load function paged_attention_..._bs544_...`

For hybrid models (e.g. Qwen3.5), vLLM sets block_size=544 to align the attention page size with the mamba page size. The Metal paged attention kernel is template-instantiated for block sizes [8, 16, 32] only.

The fix picks the largest kernel-supported block size that divides evenly into the cache block size (544 % 32 = 0, so the kernel uses 32 with ratio=17), then:

1. Reshapes the cache: [num_blocks, 544, heads, hd] -> [num_blocks*17, 32, heads, hd] (zero-copy, same physical memory)
2. Expands block tables: each vLLM block b becomes the 17 kernel blocks [b*17, ..., b*17+16]

Non-hybrid models (block_size=16) are unaffected (a fast path skips the translation).

Test

- [x] `pytest tests/test_attention_dispatch.py -v -m "not slow"` -- 4/4 passed
- [x] `pytest tests/test_attention_dispatch.py::test_qwen35_paged_attention_hybrid` -- passed (previously RuntimeError)
- [x] Unit tests for _pick_kernel_block_size and _build_block_tables translation logic
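The two translation steps above can be sketched with NumPy, which makes the zero-copy claim checkable: a contiguous reshape returns a view over the same buffer. Shapes here are toy values except for the 544/32 split; this mirrors the described mechanism, not the actual attention_sdpa.py code:

```python
import numpy as np

num_blocks, heads, hd = 4, 2, 8
cache_bs, kernel_bs = 544, 32
ratio = cache_bs // kernel_bs  # 17

# Step 1: zero-copy reshape of the paged KV cache.
cache = np.zeros((num_blocks, cache_bs, heads, hd), dtype=np.float16)
kernel_cache = cache.reshape(num_blocks * ratio, kernel_bs, heads, hd)
assert kernel_cache.base is cache  # a view: same physical memory

# Step 2: expand the block table so vLLM block b maps to
# kernel blocks [b*17, ..., b*17+16].
block_table = np.array([3, 0])  # vLLM-level blocks for one sequence
kernel_table = (block_table[:, None] * ratio + np.arange(ratio)).reshape(-1)
print(kernel_table[:3], kernel_table[-3:])  # [51 52 53] [14 15 16]
```

Because the kernel-visible cache is a view, no data moves; only the block table (small integer metadata) is rewritten per step.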