[Qwen3.5] C++ Metal kernel for hybrid SDPA + GDN paged inference (Stage C) #226
ericcurtin merged 16 commits into vllm-project:main
Conversation
Signed-off-by: RickyChen / 陳昭儒 <rickychen@infinirc.com>
@ricky-chaoju @WindChimeRan I'm ok with this, but are we sure we want to do C++? Maybe we'd have to remove the mlx dependency, but it might be worth it:
Dead `_orig_init` variable in `compat.py`; no `Dk > 256` guard in the GDN kernel; the golden test's `generate()` lacks subprocess isolation.
@ericcurtin Thanks for the pointer to inferrs. I want to make sure I understand your suggestion correctly. Are you proposing that we replace MLX with Candle as the tensor/model framework? That would mean rewriting the model files from mlx_lm to candle-transformers and using Candle's Metal backend instead of MLX's. BTW, I found this: https://github.com/EricLBuehler/candle-vllm; I haven't fully digested it yet. Or are you suggesting something narrower, like using Rust only for the Metal kernel dispatch layer (the C++ nanobind bridge in this PR), while keeping MLX for model execution? I want to understand the scope before we discuss tradeoffs.
Pretty much; it would also be ok to consider doing our own kernels based on Candle. The creator of MLX left Apple, which makes me unsure about it. I'm sure it will continue as a project, but that leaves a dent. Also, I think C/C++ for new code in 2026 is a bad idea; Rust is simply better, and the compiler errors the Rust compiler generates are so valuable. But curious what @WindChimeRan @LxYuan0420 @mgoin @robertgshaw2-redhat think...
Merging to keep things moving, but I'm still curious about people's thoughts on the above.
My view is:
so if we do want to seriously explore that direction, it might be good to first agree on what the starting repo should be, so the effort stays focused? CC: @ericcurtin
Actually I have the same feeling ... but we don't have much choice. My take is that the tensor framework and model ecosystem choice matters more than the C++ vs Rust question. A few thoughts:
The benefits of Candle over MLX are unclear to me right now, and the applebench result is concerning. vllm-metal is still too slow; I'm looking into fine-grained profiling now.
Resolve conflict in paged_ops.cpp: keep both paged_attention_primitive (ours) and gdn_linear_attention (upstream vllm-project#226) bindings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ran <hzz5361@psu.edu>
@ricky-chaoju This PR modified Qwen3's hot path. I'm not sure if this is the right way to patch it. Maybe we'll have other models to hack, and then `attention_sdpa` will balloon. At least we need some benchmark to show whether Qwen3 has a performance regression.
The improvement comes from #226's unified varlen kernel, not from #232. A direct A/B test of #232 alone is not possible, because main without #232 crashes on Qwen3 (head_dim AttributeError). A micro-benchmark confirms the per-call overhead of the fix is 0.25 µs on a ~ms-scale forward pass, so there is no regression.
## Summary
- Fix `AttributeError: 'Attention' object has no attribute 'head_dim'` on Qwen3 models when using paged attention
- Qwen3.5 saves `self.head_dim` in its Attention class, but Qwen3 does not; derive it from `k_proj.weight.shape[0] // n_kv_heads` as a fallback

## Regression
Introduced by #226 (`attention_sdpa.py:68`). All Qwen3 models on the paged KV path crash.

## Test
VLLM_METAL_USE_PAGED_ATTENTION=1 VLLM_METAL_MEMORY_FRACTION=0.2 \
python -m pytest tests/test_paged_deterministic.py -v -s -m slow

6/6 passed (Qwen3-0.6B, paged path).

Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com>
Signed-off-by: RickyChen <rickychen@RickyChens-Mac-Pro.local>
Co-authored-by: RickyChen <rickychen@RickyChens-Mac-Pro.local>
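The fallback described in the summary can be sketched as follows. The attribute names (`k_proj.weight.shape`, `n_kv_heads`) mirror the mlx_lm Qwen3 Attention module and should be treated as assumptions, not the exact patch:

```python
def resolve_head_dim(attn) -> int:
    """Return head_dim, deriving it when the Attention module lacks one.

    Qwen3.5 stores self.head_dim directly; Qwen3 does not, so we fall
    back to the k_proj output size: k_proj projects to
    n_kv_heads * head_dim, hence the integer division below.
    """
    head_dim = getattr(attn, "head_dim", None)
    if head_dim is None:
        head_dim = attn.k_proj.weight.shape[0] // attn.n_kv_heads
    return head_dim
```

The per-call cost is a `getattr` plus, on the Qwen3 path, one integer division, which matches the sub-microsecond overhead reported above.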
## Summary
This PR enables Qwen3.5 hybrid models (SDPA + GDN layers) to run on
Metal by implementing `update_block_size_for_backend()` to unify KV
cache page sizes, adding MLA (Multi-head Latent Attention) support, and
improving memory reporting for the MLX path.
## Problem
When trying to run Qwen3.5 (a hybrid model with SDPA + GDN layers) on
Metal, the following issues were encountered:
### 1. Hybrid Model Page Size Alignment Failure
vLLM's KV cache validation fails because SDPA page size and Mamba page
size are not divisible:
```
NotImplementedError: The page size of the layer is not divisible by the maximum page size.
Cannot unify by adjusting block_size.
```
**Test failure on main branch:**
```
FAILED tests/test_attention_dispatch.py::test_qwen35_paged_attention_hybrid -
NotImplementedError: The page size of the layer is not divisible by the maximum page size.
```
### 2. Forced Paged Attention Causes OOM
Previously, paged attention was auto-enabled for hybrid models, which
allocates a large contiguous memory buffer that exceeds the capacity of
smaller Metal devices.
### 3. Inaccurate Memory Reporting
The MLX path reported memory based on `max_model_len`, which gave
misleadingly small values for the scheduler.
## Solution
### 1. Add `update_block_size_for_backend()` to MetalPlatform
Implements a 5-step process to unify page sizes for hybrid models:
1. **Compute attention page size per token** - Uses `MLAAttentionSpec`
for MLA models or `FullAttentionSpec` otherwise
2. **Get Mamba page size** - Queries model class for mamba state shape
and dtype
3. **Calculate block_size** - Ensures SDPA page_size >= Mamba
page_size using `kernel_block_alignment_size=32`
4. **Sync mamba_block_size** - If using align mode
5. **Pad mamba_page_size** - Matches SDPA page_size exactly
**Key insight:** This is a "logical" fix for vLLM's scheduler validation
only. The Metal plugin manages KV cache internally via MLX's
`make_prompt_cache()`, independent of vLLM's calculations.
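Steps 1–3 above boil down to a small calculation. This is an illustrative sketch under the constraints stated in this PR; the function and parameter names are assumptions, not the actual `platform.py` code:

```python
import math

def unify_block_size(attn_page_size_1_token: int,
                     mamba_page_size: int,
                     kernel_block_alignment_size: int = 32) -> int:
    """Smallest aligned block_size whose SDPA page covers the Mamba page.

    Picks block_size so that block_size * attn_page_size_1_token >=
    mamba_page_size, rounded up to a multiple of the kernel alignment.
    """
    # Tokens needed for the SDPA page to be at least as large as the Mamba page
    needed = math.ceil(mamba_page_size / attn_page_size_1_token)
    # Round up to the kernel alignment (32 here)
    blocks = math.ceil(needed / kernel_block_alignment_size)
    return blocks * kernel_block_alignment_size
```

Step 5 then pads the Mamba page up to `block_size * attn_page_size_1_token` so both page sizes match exactly and vLLM's divisibility validation passes.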
### 2. Add MLA Support
Hybrid models with MLA (e.g., DeepSeek variants) now use
`MLAAttentionSpec` for correct page size calculation:
```python
if getattr(model_config, "use_mla", False):
attn_page_size_1_token = MLAAttentionSpec(...).page_size_bytes
else:
attn_page_size_1_token = FullAttentionSpec(...).page_size_bytes
```
### 3. Add Early Error for Hybrid + Paged Attention
Raises a clear `ValueError` when users attempt to use paged attention
with hybrid models:
```
ValueError: Hybrid models (e.g., Qwen3.5) are not supported with paged attention on Metal.
The Metal paged attention kernel only supports block_size in {8, 16, 32}, but hybrid models
require block_size=160. Please remove VLLM_METAL_USE_PAGED_ATTENTION=1.
```
**Root cause:** Metal paged attention kernels only support `block_size ∈
{8, 16, 32}`, but Qwen3.5 requires `block_size=160`.
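The guard can be sketched like this; the env-var name, the supported block sizes, and the error text come from this PR, while the function name and signature are illustrative assumptions:

```python
import os

# Block sizes the Metal paged attention kernels instantiate (per the PR text)
SUPPORTED_PAGED_BLOCK_SIZES = {8, 16, 32}

def check_hybrid_paged_support(is_hybrid: bool, required_block_size: int) -> None:
    """Fail fast with a clear error instead of a cryptic kernel failure."""
    paged_enabled = os.environ.get("VLLM_METAL_USE_PAGED_ATTENTION") == "1"
    if (is_hybrid and paged_enabled
            and required_block_size not in SUPPORTED_PAGED_BLOCK_SIZES):
        raise ValueError(
            "Hybrid models (e.g., Qwen3.5) are not supported with paged "
            "attention on Metal. The Metal paged attention kernel only "
            f"supports block_size in {sorted(SUPPORTED_PAGED_BLOCK_SIZES)}, "
            f"but hybrid models require block_size={required_block_size}. "
            "Please remove VLLM_METAL_USE_PAGED_ATTENTION=1."
        )
```

Raising at platform setup time surfaces the misconfiguration before any Metal kernel is dispatched.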
### 4. Improve Memory Reporting
Changed MLX path to report 80% of remaining Metal memory instead of one
max-length sequence:
```python
_MLX_MEMORY_BUDGET_FRACTION = 0.8
metal_limit = mx.device_info()["max_recommended_working_set_size"]
model_memory = self._get_model_memory_usage()
available = int((metal_limit - model_memory) * _MLX_MEMORY_BUDGET_FRACTION)
```
**Before:** `reporting 4.29GB for scheduler admission control (one
max-length sequence, max_model_len=2048)`
**After:** `reporting 11.20 GB for scheduler (Metal limit: 16.00 GB,
Model: 2.00 GB, Remaining: 14.00 GB, KV budget: 11.20 GB)`
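The "After" numbers follow directly from the budget formula; this worked example just plugs in the values from the log line above:

```python
# Reproduce the "After" log line with its own values (in GB)
_MLX_MEMORY_BUDGET_FRACTION = 0.8
metal_limit = 16.00   # max_recommended_working_set_size
model_memory = 2.00   # weights already resident

remaining = metal_limit - model_memory                # 14.00 GB
kv_budget = remaining * _MLX_MEMORY_BUDGET_FRACTION   # 11.20 GB reported to the scheduler
```
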
### 5. Remove Auto-Enable Paged Attention for Hybrid Models
MLX's `make_prompt_cache()` handles hybrid KV cache natively. Paged
attention is now opt-in rather than forced.
## Changes
### New Files
| File | Lines | Description |
| --- | --- | --- |
| `tests/test_platform_update_block_size.py` | 576 | Comprehensive unit
tests (16 test cases) |
### Modified Files
| File | Changes | Description |
| --- | --- | --- |
| `vllm_metal/platform.py` | +172 | Add
`update_block_size_for_backend()` with MLA support |
| `vllm_metal/v1/worker.py` | +41, -23 | Improve memory reporting,
remove auto-enable paged |
| `vllm_metal/v1/model_runner.py` | +1, -1 | Fix MLX API:
`mx.device_info()` |
| `tests/test_v1_worker.py` | +22 | Update memory reporting test |
## Test Results
### Unit Tests (16/16 Pass)
```bash
$ source .venv-vllm-metal/bin/activate && python -m pytest tests/test_platform_update_block_size.py -v
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_hybrid_model_success PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_hybrid_model_block_size_already_sufficient PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_non_hybrid_model_skipped PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_model_config_none PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_model_resolution_failure PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_get_mamba_state_shape_failure PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_get_mamba_state_dtype_failure PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_mamba_page_size_zero PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_invalid_architecture PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_block_size_increased_to_minimum PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_mamba_cache_mode_align PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_hybrid_with_paged_attention_raises_error PASSED
tests/test_platform_update_block_size.py::TestMLAModels::test_mla_hybrid_model_uses_mla_spec PASSED
tests/test_platform_update_block_size.py::TestMLAModels::test_mla_non_hybrid_skipped PASSED
tests/test_platform_update_block_size.py::TestMLAModels::test_mla_with_cache_dtype[bfloat16] PASSED
tests/test_platform_update_block_size.py::TestMLAModels::test_mla_with_cache_dtype[float16] PASSED
16 passed, 2 warnings in 2.69s
```
### Lint Checks
```bash
$ ruff check vllm_metal/ tests/
All checks passed!
$ ruff format --check vllm_metal/ tests/
5 files already formatted
```
## Usage
### Default Path (Recommended for Hybrid Models)
```bash
# No env var needed - uses MLX's native KV cache via make_prompt_cache()
vllm serve Qwen/Qwen3.5-0.8B
vllm serve Qwen/Qwen3.5-4B
vllm serve Qwen/Qwen3.5-14B
vllm serve Qwen/Qwen3.5-32B
```
### Paged Attention Path (Non-Hybrid Models Only)
```bash
# Only for non-hybrid models
VLLM_METAL_USE_PAGED_ATTENTION=1 vllm serve HuggingFaceTB/SmolLM2-135M-Instruct
```
### Unsupported Configuration (Will Raise Clear Error)
```bash
# This will now raise a helpful ValueError instead of cryptic kernel failure
VLLM_METAL_USE_PAGED_ATTENTION=1 vllm serve Qwen/Qwen3.5-0.8B
# ValueError: Hybrid models (e.g., Qwen3.5) are not supported with paged attention on Metal...
```
## Known Limitations
**Hybrid + Paged Attention is unsupported on Metal** due to kernel
limitations:
- Metal paged attention kernels only instantiate `block_size ∈ {8, 16,
32}`
- Hybrid models require `block_size=160` to satisfy vLLM's page size
divisibility validation
- Users should use the native MLX KV cache path (default) for hybrid
models
## Related Issues
- Fixes #184 (hybrid model initialization failure)
- Addresses reviewer feedback from #230 (@ricky-chaoju, @ericcurtin)
- Related to #232 (head_dim AttributeError on Qwen3)
- Related to #226 (regression fix)
## Commits
This PR includes the following commits:
```
964e9df [Metal] Fix inaccurate docstring and test comments
c87263b [Metal] Fix import order and add test for hybrid + paged error case
f7161cc [Metal] Address reviewer feedback on hybrid + paged attention
774c8bd use new device
fb5e2d5 [Metal] Fix ruff format issues
5c7379b [Metal] Fix lint issues in MLA support changes
80b8d19 [Metal] Add MLA support to update_block_size_for_backend
733612b [Metal] Improve error handling in update_block_size_for_backend
36d8cb7 [Metal] Add unit tests for update_block_size_for_backend and improve error handling
b2973f0 Fix test for determine_available_memory single_sequence mode
a292980 Restore model_runner.py to upstream version
3f22b84 [Metal] Fix Qwen3.5 hybrid model initialization
d49d0fb [Metal] Fix hybrid model KV cache page size alignment
```
---------
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
This PR is:
- To fix under-reported GDN recurrent memory sizing by accounting for float32 recurrent state (introduced in #226).
- To add a unit test covering the corrected linear cache byte calculation.

Notes:
- Scope is limited to hybrid linear attention accounting in `linear_cache_bytes_per_slot()`.
- SDPA KV sizing is unchanged.

Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>
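The corrected accounting can be sketched as below. Only the function name `linear_cache_bytes_per_slot()` and the float32-recurrent-state fact come from the PR; the signature, the conv-state term, and the default dtype size are assumptions for illustration:

```python
def linear_cache_bytes_per_slot(recurrent_state_numel: int,
                                conv_state_numel: int = 0,
                                model_dtype_size: int = 2) -> int:
    """Bytes per slot for a GDN linear-attention layer's cache.

    The GDN recurrent state is held in float32 (4 bytes per element)
    regardless of the model dtype; billing it at the model dtype size
    (e.g., 2 bytes for bf16) under-reports memory by 2x on that term.
    """
    FLOAT32_SIZE = 4
    return (recurrent_state_numel * FLOAT32_SIZE
            + conv_state_numel * model_dtype_size)
```
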

## Summary
- Enables `vllm serve` for Qwen3.5 hybrid models (75% GDN + 25% SDPA)
- Benchmark: Qwen3.5-0.8B, 10 reqs, input=128, output=64, 3-run avg

## Design
- `support_hybrid_kv_cache() = True` enables vLLM's hybrid KV cache manager
- Ref: #194 (Stage C)