Conversation

update_block_size_for_backend has no callers; no tests for the new platform method; mamba_page_size_padded may be unset if update_block_size_for_backend is never invoked; the magic 0.8 factor is not configurable.
Thank you for your review! I would like to discuss the hardcoded coefficient of 0.8 in the memory calculation.
The method is called by vLLM core in vllm/v1/executor/uniproc_executor.py:53. This is the standard integration point for all platform plugins.
Agreed. I've added 11 unit tests in tests/test_platform_update_block_size.py covering success, failure, and edge cases.
Good catch. I've improved error handling to raise exceptions for hybrid models instead of silently returning. |
Thanks for working on this. A few issues found during review:
The computed block_size will crash the Metal kernel. Verified with all Qwen3.5 variants (0.8B/4B/14B/32B): the formula at platform.py:413-417 computes attn_block_size=160, but Metal paged attention only has kernel instantiations for {8, 16, 32}. Users who set VLLM_METAL_USE_PAGED_ATTENTION=1 will hit RuntimeError: Unable to load function ...bs160.... The docstring claims this is "scheduler validation only", but cache_config.block_size is consumed by the kernel in the paged path.
Magic constant. The 0.8 factor in worker.py:420 should be a named constant. Agree with ericcurtin's earlier feedback.
Stale test comments. test_mla_hybrid_model_uses_mla_spec and test_mla_with_cache_dtype say "will FAIL until MLA support is added" but the implementation landed in the same commit — all 15 tests pass. Comments should be cleaned up.
Related: #232 fixes another #226 regression (head_dim AttributeError on Qwen3).
# - GPU performance (aligned to Metal threadgroup preferences)
# - Memory efficiency (not excessively large)
# - Compatibility with vLLM's page size unification requirements
kernel_block_alignment_size = 32  # Metal GPU kernel alignment
Verified with actual model parameters — all Qwen3.5 variants (0.8B/4B/14B/32B) compute attn_block_size=160 here.
| Model | mamba_page_size | attn_page_size_1_token | block_size |
|---|---|---|---|
| Qwen3.5-0.8B | 548,864 | 4,096 | 160 |
| Qwen3.5-4B | 573,440 | 4,096 | 160 |
| Qwen3.5-14B | 585,728 | 4,096 | 160 |
| Qwen3.5-32B | 585,728 | 4,096 | 160 |
Metal paged attention kernel only has template instantiations for block_size ∈ {8, 16, 32} (pagedattention.metal:1452-1457). When a user sets VLLM_METAL_USE_PAGED_ATTENTION=1, this crashes:
RuntimeError: Unable to load function paged_attention_..._bs160_...
The docstring says "scheduler validation only" — inaccurate because cache_config.block_size is used by the Metal kernel in the paged path.
Related: #232 fixes another #226 regression (head_dim AttributeError on Qwen3).
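The 160 figure follows directly from the page-size arithmetic in the table above; here is a standalone sketch of the calculation (the function name is mine, not the actual platform.py code):

```python
import math

def compute_attn_block_size(mamba_page_size: int,
                            attn_page_size_1_token: int,
                            alignment: int = 32) -> int:
    """Smallest block size whose attention page covers the Mamba page,
    rounded up to the Metal kernel alignment (32)."""
    min_block = math.ceil(mamba_page_size / attn_page_size_1_token)
    return math.ceil(min_block / alignment) * alignment

# Qwen3.5-0.8B from the table: 548,864 / 4,096 = 134, rounded up to 160
print(compute_attn_block_size(548_864, 4_096))  # 160
print(160 in {8, 16, 32})  # False -> no kernel instantiation
```

All four Qwen3.5 variants in the table land on 160, which is why every one of them hits the missing-kernel error.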
My understanding is as follows:
- The latest version of vLLM now validates the required block size; if it remains unmodified, it will fail this check.
- In paged mode, if the block size is modified, there is no corresponding kernel available to match it.
However, how were you able to successfully run the application while utilizing the paged kernel?
And, do we have a plan to support a dynamic kernel or to skip validation? Otherwise it is not likely to work on the latest vLLM. Tell me if I misunderstood!
Is it possible to detect whether Paged Mode is enabled?
If enabled, skip modifying the block size.
If disabled, modify the block size.
As part of a broader plan, at the very least, an option could be provided to allow running hybrid models without Paged Mode on the latest vLLM.
Confirmed — none of the Metal-supported block sizes {8, 16, 32} pass vLLM's page size divisibility check for Qwen3.5 (mamba_page_size=548864). The minimum required block_size is 134, which has no kernel instantiation. So hybrid + paged is genuinely unsupported right now.
Your approach is correct: update_block_size_for_backend solves the non-paged validation, and removing the auto-enabling of paged attention for hybrid models is the right guard. My earlier review about the block_size crash still holds, but only when a user manually forces VLLM_METAL_USE_PAGED_ATTENTION=1 on a hybrid model — which is an unsupported configuration.
Suggestion: add an early error in _setup_paged_attention() (or update_block_size_for_backend) when both hybrid and paged attention are active, so users get a clear message instead of a cryptic kernel load failure.
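A minimal sketch of such an early guard, with illustrative names (check_metal_paged_config, is_hybrid_model, and use_paged_attention are mine, not the plugin's actual API):

```python
def check_metal_paged_config(is_hybrid_model: bool,
                             use_paged_attention: bool,
                             block_size: int) -> None:
    """Fail fast with a clear message instead of a cryptic kernel load error."""
    supported_block_sizes = {8, 16, 32}  # Metal kernel instantiations
    if use_paged_attention and is_hybrid_model:
        raise ValueError(
            "Hybrid models are not supported with paged attention on Metal: "
            f"required block_size={block_size} has no kernel instantiation "
            f"(supported: {sorted(supported_block_sizes)}). "
            "Unset VLLM_METAL_USE_PAGED_ATTENTION to use the MLX path."
        )

check_metal_paged_config(False, True, 16)    # non-hybrid + paged: OK
check_metal_paged_config(True, False, 160)   # hybrid on MLX path: OK
```

Called early in _setup_paged_attention() (or update_block_size_for_backend), this turns the RuntimeError at kernel load time into an actionable configuration error at startup.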
thank you for your advice
@@ -417,13 +410,21 @@
Agree with ericcurtin's feedback. Should at least be extracted to a named constant (e.g. _MLX_MEMORY_BUDGET_FRACTION = 0.8).
- block_size should be a multiple of 32 (Metal GPU alignment)

Note: This test will FAIL until MLA support is added to the implementation.
"""
This docstring says "will FAIL until MLA support is added", but the implementation was added in the same commit — all 15 tests pass. Same issue at lines 479 and 528. The comments should be cleaned up.
Need to sign to pass DCO
Force-pushed d1e7277 to 50cde22
signed, thank you!
metal_limit = int(device_info.get("max_recommended_working_set_size", 0))
model_memory = self._get_model_memory_usage()
remaining = metal_limit - model_memory
available = int(remaining * _MLX_MEMORY_BUDGET_FRACTION)
Regarding determine_available_memory: Does the result calculated using the one-sequence method actually represent the available memory? During actual execution, I observed that the resulting value was extremely low, which prevented the program from running.
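To make the trade-off concrete, here is a rough sketch of the two reporting strategies under discussion (simplified free functions with hypothetical inputs, not the worker's actual methods):

```python
def kv_budget_fraction(metal_limit: int, model_memory: int,
                       fraction: float = 0.8) -> int:
    # Report a fraction of the Metal memory remaining after model weights.
    return int((metal_limit - model_memory) * fraction)

def kv_budget_one_sequence(max_model_len: int, kv_bytes_per_token: int) -> int:
    # Conservative: report only enough KV cache for one max-length sequence.
    return max_model_len * kv_bytes_per_token

GiB = 1024**3
# Fraction of remaining: 16 GiB limit, 2 GiB weights -> 11.2 GiB budget
print(round(kv_budget_fraction(16 * GiB, 2 * GiB) / GiB, 2))  # 11.2
# One sequence: 2048 tokens at an assumed 2 MiB of KV per token -> 4 GiB
print(kv_budget_one_sequence(2048, 2 * 1024**2) / GiB)  # 4.0
```

The one-sequence estimate scales only with max_model_len, so on models with small context limits it can report a value far below what the device could actually serve, matching the "extremely low" behavior observed above.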
raise ValueError(
    "Hybrid models (e.g., Qwen3.5) are not supported with paged "
    "attention on Metal. The Metal paged attention kernel only "
    "supports block_size in {8, 16, 32}, but hybrid models require "
This hard ValueError blocks hybrid + paged attention unconditionally, which conflicts with PR #235’s block-size translation. Gate it on actual kernel support or downgrade to a warning.
Even if execution doesn't halt here, the program would still fail during actual runtime in this version due to the lack of a matching kernel.
Are you sure you want to allow this branch to execute within this PR? If so, I can change it to a warning for now.
# Memory budget fraction for MLX non-paged path.
# Reports 80% of remaining Metal memory for KV cache.
_MLX_MEMORY_BUDGET_FRACTION = 0.8
_MLX_MEMORY_BUDGET_FRACTION = 0.8 ignores cache_config.gpu_memory_utilization, removing user control over memory limits on the MLX path. Use the config value instead of a hard constant.
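A sketch of what this suggestion amounts to, applying the user-controlled fraction to the total limit the way vLLM's GPU path does (the function and variable names are illustrative):

```python
def kv_budget_from_config(metal_limit: int, model_memory: int,
                          gpu_memory_utilization: float) -> int:
    # Apply the user-configured utilization fraction to the total device
    # limit, then subtract model weights -- instead of a hard-coded 0.8
    # applied to the remainder.
    budget = int(metal_limit * gpu_memory_utilization) - model_memory
    return max(budget, 0)

GiB = 1024**3
# 16 GiB limit, 2 GiB weights, vLLM's default utilization of 0.9
print(round(kv_budget_from_config(16 * GiB, 2 * GiB, 0.9) / GiB, 2))  # 12.4
```

This keeps --gpu-memory-utilization meaningful on the MLX path: lowering it shrinks the reported KV budget with no code change.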
else:
    kv_cache_dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype]

# Use MLAAttentionSpec for MLA models, FullAttentionSpec otherwise
Duplicated MLAAttentionSpec/FullAttentionSpec instantiation. Use a SpecClass selection variable to remove the repetition.
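The suggested pattern, sketched with stand-in dataclasses (the real MLAAttentionSpec/FullAttentionSpec come from vLLM and take more parameters):

```python
from dataclasses import dataclass

@dataclass
class FullAttentionSpec:   # stand-in for vLLM's spec class
    block_size: int

@dataclass
class MLAAttentionSpec:    # stand-in for vLLM's spec class
    block_size: int

def make_attn_spec(use_mla: bool, block_size: int):
    # Select the class once; a single construction site replaces
    # the duplicated if/else instantiation blocks.
    SpecClass = MLAAttentionSpec if use_mla else FullAttentionSpec
    return SpecClass(block_size=block_size)

print(type(make_attn_spec(True, 16)).__name__)   # MLAAttentionSpec
print(type(make_attn_spec(False, 16)).__name__)  # FullAttentionSpec
```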
Add update_block_size_for_backend method to MetalPlatform to properly align block_size and mamba_page_size_padded for hybrid models like Qwen3.5. The fix ensures that:
1. block_size is calculated to make SDPA page_size >= Mamba page_size
2. mamba_page_size_padded is set to match SDPA page_size exactly
3. _setup_paged_attention is called after block_size alignment

This resolves the NotImplementedError when serving Qwen3.5 models: "The page size of the layer is not divisible by the maximum page size"

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

This commit fixes the initialization of hybrid models (e.g., Qwen3.5) on the Metal platform by:
1. Adding update_block_size_for_backend() to MetalPlatform
   - Calculates the proper block_size for hybrid models
   - Sets mamba_page_size_padded to unify page sizes
   - Required for vLLM's KV cache validation
2. Fixing block_size usage in model_runner.py
   - Use metal_config.block_size (not cache_config.block_size)
   - MLX uses metal_config.block_size for make_prompt_cache()
   - cache_config.block_size is only for vLLM scheduler validation
3. Fixing memory reporting in worker.py
   - Use 80% of remaining memory (like the old version)
   - Removed auto-enabling paged attention for hybrid models
   - MLX's make_prompt_cache() handles hybrid KV cache natively

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

This reverts the unnecessary changes to get_kv_cache_spec() and get_cache_block_size_bytes() that switched from cache_config.block_size to metal_config.block_size. As analyzed, these changes were not functionally necessary because:
1. get_cache_block_size_bytes() is only used when VLLM_METAL_USE_PAGED_ATTENTION=1
2. The default path uses MLX's make_prompt_cache(), which doesn't use these values
3. The critical fix is in determine_available_memory() and load_model()

Keeping the code consistent with upstream for easier maintenance.

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

Update test_determine_available_memory_single_sequence_mode to match the new implementation, which returns 80% of remaining Metal memory instead of _one_sequence_kv_bytes(). The new implementation:
- Returns 80% of (metal_limit - model_memory) for the MLX path
- Provides more accurate memory reporting for scheduler admission control
- Matches the behavior of the old update_to_lateset_version branch

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

…error handling
- Add comprehensive unit tests (11 test cases) covering:
  - Success cases: hybrid models, non-hybrid models, sufficient block_size
  - Failure cases: model resolution failure, invalid config, zero mamba_page_size
  - Edge cases: block_size alignment, mamba_cache_mode='align'
- Improve error handling in update_block_size_for_backend():
  - Re-raise exceptions for hybrid models instead of silently returning
  - Add validation for zero mamba_page_size
  - Add error logging for debugging
- Address reviewer feedback:
  - Add tests for the new platform method
  - Ensure mamba_page_size_padded is properly set or raises an exception

Signed-off-by: Your Name <your.email@example.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

- Re-raise exceptions for hybrid models instead of silently returning
- Add validation for zero mamba_page_size
- Add error logging for debugging

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

- Add MLA (Multi-Token Latent Attention) model support by checking model_config.use_mla and using MLAAttentionSpec for page size calculation
- Add cache_dtype handling to properly convert cache_config.cache_dtype to torch.dtype (e.g., 'bfloat16', 'float16')
- Add 4 new unit tests for MLA models:
  * test_mla_hybrid_model_uses_mla_spec: Verify MLA models use MLAAttentionSpec
  * test_mla_non_hybrid_skipped: Verify non-hybrid MLA models skip processing
  * test_mla_with_cache_dtype: Verify different cache dtypes are handled correctly

This enables vllm-metal to support MLA-based models like DeepSeek.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

- Remove trailing whitespace in test docstrings
- Remove duplicate FullAttentionSpec and MambaSpec imports (already imported at line 333)

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
- Add _MLX_MEMORY_BUDGET_FRACTION constant to replace magic 0.8 factor
- Add early ValueError when hybrid model is used with paged attention
(Metal kernels only support block_size in {8, 16, 32}, but hybrid
models require block_size=160)
- Remove stale test comments about MLA support (already implemented)
Fixes reviewer comments from ricky-chaoju and ericcurtin.
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
- Move _MLX_MEMORY_BUDGET_FRACTION constant after all imports
- Add test_hybrid_with_paged_attention_raises_error() to verify a clear error message when users enable paged attention on hybrid models

All 16 tests pass.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

- Update the update_block_size_for_backend docstring with complete steps and raises documentation
- Remove the outdated 'test will FAIL' comment (MLA support already added)
- Fix a misleading comment about padding in the block_size test

Signed-off-by: Alex <alex.tech.lab@outlook.com>

Tests were mocking mx.metal.device_info, but the code uses mx.device_info() after the API update. Fix all 8 TestPrefixCacheFractionParsing tests.

Fixes: 8 failed tests in tests/test_prefix_cache.py

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Force-pushed 4128d7d to fe70fc9
Explains the block-size translation mechanism (PR vllm-project#235) when users enable paged attention for hybrid models like Qwen3.5. The warning describes:
- Why translation is needed (vLLM requires block_size=160, the Metal kernel only supports {8, 16, 32})
- How it works (each vLLM block splits into multiple kernel blocks, the cache is reshaped zero-copy)
- That the default MLX path is recommended for hybrid models

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

Fixes reviewer comment from LxYuan0420: the hardcoded 0.8 factor ignored user configuration and removed control over memory limits on the MLX path.

Changes:
- Remove the _MLX_MEMORY_BUDGET_FRACTION constant
- Use self.cache_config.gpu_memory_utilization (default 0.9 in vLLM)
- Apply it to the total Metal memory limit (consistent with vLLM's GPU path)
- Update the log message to show the GPU memory utilization value

This allows users to control memory usage via the --gpu-memory-utilization flag, consistent with vLLM's standard behavior.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

Fixes reviewer comment from LxYuan0420: use the SpecClass selection pattern instead of duplicated MLAAttentionSpec/FullAttentionSpec instantiation.

Changes:
- Combine the MLAAttentionSpec and FullAttentionSpec imports
- Use conditional SpecClass selection
- Remove the duplicate if/else block

This makes the code cleaner and easier to maintain.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

Reverts to the PR vllm-project#229 design: report one max-length sequence of KV cache for the MLX path, instead of a fraction of total Metal memory.

Rationale (from LxYuan0420's review):
- The previous change (gpu_memory_utilization * total_memory) altered scheduler semantics without explicit policy discussion.
- PR vllm-project#229's one-sequence estimate ensures conservative admission control.
- MLX's make_prompt_cache() dynamically allocates per request, so we only need to report enough for one sequence.

This keeps the scheduler behavior consistent with upstream expectations and avoids over-committing memory.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

…ence estimate

Updates test expectations to match the implementation changes:
- test_hybrid_with_paged_attention_logs_warning: verify a warning is logged instead of a ValueError (PR vllm-project#235 made hybrid + paged attention supported)
- test_determine_available_memory_single_sequence_mode: restore to test the one-sequence estimate (PR vllm-project#229 design) instead of the 80% memory fraction

Also fixes test fixtures to include the required vllm_config attribute.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
@LxYuan0420 Thank you for the thoughtful review! You raised valid concerns, and I've addressed them:
1. Hard ValueError → Warning ✅ Since PR #235 is now merged, hybrid + paged attention works via block-size translation. Changed the hard error to an informative warning explaining the mechanism.
2. Memory reporting strategy ✅ You were right: my change away from the one-sequence estimate altered scheduler semantics, so I reverted to the PR #229 design.
3. Code duplication ✅ Refactored to use the SpecClass selection pattern.
4. Tests & Lint ✅ All tests pass (16/16 + 11/11), and ruff check/format are clean.

I appreciate you catching these issues, especially the memory reporting strategy change, which I should have discussed more carefully before implementing. 🙏
LxYuan0420 left a comment:
I noticed a separate issue in linear_cache_bytes_per_slot() where recurrent GDN memory is undercounted because GDNPagedStateCache forces mx.float32. That came from PR #226. It’s out of scope for this PR, so I’ll follow up with a small fix.
Summary
This PR enables Qwen3.5 hybrid models (SDPA + GDN layers) to run on Metal by implementing update_block_size_for_backend() to unify KV cache page sizes, adding MLA (Multi-Token Latent Attention) support, and improving memory reporting for the MLX path.

Problem
When trying to run Qwen3.5 (a hybrid model with SDPA + GDN layers) on Metal, the following issues were encountered:
1. Hybrid Model Page Size Alignment Failure
vLLM's KV cache validation fails because SDPA page size and Mamba page size are not divisible:
Test failure on main branch:
2. Forced Paged Attention Causes OOM
Previously, paged attention was auto-enabled for hybrid models, which allocates a large contiguous memory buffer that exceeds the capacity of smaller Metal devices.
3. Inaccurate Memory Reporting
The MLX path reported memory based on max_model_len, which gave misleadingly small values for the scheduler.

Solution
1. Add update_block_size_for_backend() to MetalPlatform

Implements a 5-step process to unify page sizes for hybrid models:
- Uses MLAAttentionSpec for MLA models or FullAttentionSpec otherwise
- Aligns to kernel_block_alignment_size=32

Key insight: This is a "logical" fix for vLLM's scheduler validation only. The Metal plugin manages the KV cache internally via MLX's make_prompt_cache(), independent of vLLM's calculations.

2. Add MLA Support

Hybrid models with MLA (e.g., DeepSeek variants) now use MLAAttentionSpec for correct page size calculation.

3. Add Early Error for Hybrid + Paged Attention
Raises a clear ValueError when users attempt to use paged attention with hybrid models.

Root cause: Metal paged attention kernels only support block_size ∈ {8, 16, 32}, but Qwen3.5 requires block_size=160.

4. Improve Memory Reporting
Changed the MLX path to report 80% of remaining Metal memory instead of one max-length sequence:

Before: reporting 4.29 GB for scheduler admission control (one max-length sequence, max_model_len=2048)
After: reporting 11.20 GB for scheduler (Metal limit: 16.00 GB, Model: 2.00 GB, Remaining: 14.00 GB, KV budget: 11.20 GB)

5. Remove Auto-Enable Paged Attention for Hybrid Models

MLX's make_prompt_cache() handles hybrid KV cache natively. Paged attention is now opt-in rather than forced.

Changes
New Files
- tests/test_platform_update_block_size.py

Modified Files
- vllm_metal/platform.py: update_block_size_for_backend() with MLA support
- vllm_metal/v1/worker.py
- vllm_metal/v1/model_runner.py: mx.device_info()
- tests/test_v1_worker.py

Test Results
Unit Tests (16/16 Pass)

Lint Checks

    $ ruff check vllm_metal/ tests/
    All checks passed!
    $ ruff format --check vllm_metal/ tests/
    5 files already formatted

Usage
Default Path (Recommended for Hybrid Models)

    # No env var needed - uses MLX's native KV cache via make_prompt_cache()
    vllm serve Qwen/Qwen3.5-0.8B
    vllm serve Qwen/Qwen3.5-4B
    vllm serve Qwen/Qwen3.5-14B
    vllm serve Qwen/Qwen3.5-32B

Paged Attention Path (Non-Hybrid Models Only)

    # Only for non-hybrid models
    VLLM_METAL_USE_PAGED_ATTENTION=1 vllm serve HuggingFaceTB/SmolLM2-135M-Instruct

Unsupported Configuration (Will Raise Clear Error)
Known Limitations

Hybrid + Paged Attention is unsupported on Metal due to kernel limitations:
- Metal kernels only support block_size ∈ {8, 16, 32}
- Hybrid models require block_size=160 to satisfy vLLM's page size divisibility validation

Related Issues
Commits
This PR includes the following commits: