Fix/qwen3.5 init failed #230

Merged
LxYuan0420 merged 19 commits into vllm-project:main from Alex-ai-future:fix/qwen3.5-init-failed
Apr 8, 2026

Conversation

@Alex-ai-future
Contributor

@Alex-ai-future Alex-ai-future commented Apr 5, 2026

Summary

This PR enables Qwen3.5 hybrid models (SDPA + GDN layers) to run on Metal by implementing update_block_size_for_backend() to unify KV cache page sizes, adding MLA (Multi-head Latent Attention) support, and improving memory reporting for the MLX path.

Problem

When trying to run Qwen3.5 (a hybrid model with SDPA + GDN layers) on Metal, the following issues were encountered:

1. Hybrid Model Page Size Alignment Failure

vLLM's KV cache validation fails because SDPA page size and Mamba page size are not divisible:

NotImplementedError: The page size of the layer is not divisible by the maximum page size.
Cannot unify by adjusting block_size.

Test failure on main branch:

FAILED tests/test_attention_dispatch.py::test_qwen35_paged_attention_hybrid - 
NotImplementedError: The page size of the layer is not divisible by the maximum page size.

2. Forced Paged Attention Causes OOM

Previously, paged attention was auto-enabled for hybrid models, which allocates a large contiguous memory buffer that exceeds the capacity of smaller Metal devices.

3. Inaccurate Memory Reporting

The MLX path reported memory based on max_model_len, which gave misleadingly small values for the scheduler.

Solution

1. Add update_block_size_for_backend() to MetalPlatform

Implements a 5-step process to unify page sizes for hybrid models:

  1. Compute attention page size per token - Uses MLAAttentionSpec for MLA models or FullAttentionSpec otherwise
  2. Get Mamba page size - Queries model class for mamba state shape and dtype
  3. Calculate block_size - Ensures SDPA page_size >= Mamba page_size using kernel_block_alignment_size=32
  4. Sync mamba_block_size - If using align mode
  5. Pad mamba_page_size - Matches SDPA page_size exactly

Key insight: This is a "logical" fix for vLLM's scheduler validation only. The Metal plugin manages KV cache internally via MLX's make_prompt_cache(), independent of vLLM's calculations.
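The unification arithmetic of steps 3–5 can be sketched with the Qwen3.5-0.8B figures that appear later in the review thread. The helper name and tuple return are illustrative; the real logic lives in update_block_size_for_backend():

```python
import math

# Sketch of the page-size unification, assuming the constants described
# in this PR. Not the plugin's actual code.
KERNEL_BLOCK_ALIGNMENT_SIZE = 32  # Metal GPU kernel alignment

def unify_block_size(attn_page_size_1_token: int, mamba_page_size: int) -> tuple[int, int]:
    """Return (block_size, mamba_page_size_padded) such that the SDPA page
    size (block_size * attn_page_size_1_token) covers the Mamba page size
    and block_size is a multiple of the kernel alignment."""
    # Step 3: smallest block_size with SDPA page_size >= Mamba page_size
    min_blocks = math.ceil(mamba_page_size / attn_page_size_1_token)
    block_size = math.ceil(min_blocks / KERNEL_BLOCK_ALIGNMENT_SIZE) * KERNEL_BLOCK_ALIGNMENT_SIZE
    # Step 5: pad the Mamba page size to match the SDPA page size exactly
    mamba_page_size_padded = block_size * attn_page_size_1_token
    return block_size, mamba_page_size_padded

# Qwen3.5-0.8B: 548864 / 4096 = 134, rounded up to a multiple of 32 -> 160
print(unify_block_size(4096, 548_864))  # (160, 655360)
```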

2. Add MLA Support

Hybrid models with MLA (e.g., DeepSeek variants) now use MLAAttentionSpec for correct page size calculation:

if getattr(model_config, "use_mla", False):
    attn_page_size_1_token = MLAAttentionSpec(...).page_size_bytes
else:
    attn_page_size_1_token = FullAttentionSpec(...).page_size_bytes

3. Add Early Error for Hybrid + Paged Attention

Raises a clear ValueError when users attempt to use paged attention with hybrid models:

ValueError: Hybrid models (e.g., Qwen3.5) are not supported with paged attention on Metal. 
The Metal paged attention kernel only supports block_size in {8, 16, 32}, but hybrid models 
require block_size=160. Please remove VLLM_METAL_USE_PAGED_ATTENTION=1.

Root cause: Metal paged attention kernels only support block_size ∈ {8, 16, 32}, but Qwen3.5 requires block_size=160.

4. Improve Memory Reporting

Changed MLX path to report 80% of remaining Metal memory instead of one max-length sequence:

_MLX_MEMORY_BUDGET_FRACTION = 0.8
metal_limit = mx.device_info()["max_recommended_working_set_size"]
model_memory = self._get_model_memory_usage()
available = int((metal_limit - model_memory) * _MLX_MEMORY_BUDGET_FRACTION)

Before: reporting 4.29 GB for scheduler admission control (one max-length sequence, max_model_len=2048)

After: reporting 11.20 GB for scheduler (Metal limit: 16.00 GB, Model: 2.00 GB, Remaining: 14.00 GB, KV budget: 11.20 GB)

5. Remove Auto-Enable Paged Attention for Hybrid Models

MLX's make_prompt_cache() handles hybrid KV cache natively. Paged attention is now opt-in rather than forced.
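The resulting opt-in behavior, combined with the guard from solution 3, can be sketched as follows. The env var name comes from this PR; the helper names and return values are illustrative, not the plugin's API:

```python
import os

# Hedged sketch: paged attention is opt-in via the env var, and the
# hybrid + paged combination raises early instead of crashing in the kernel.
def select_kv_cache_path(is_hybrid: bool) -> str:
    paged = os.environ.get("VLLM_METAL_USE_PAGED_ATTENTION", "0") == "1"
    if not paged:
        # Default path: MLX's make_prompt_cache() manages the KV cache natively
        return "mlx_native"
    if is_hybrid:
        # The unsupported combination this PR guards against
        raise ValueError("Hybrid models are not supported with paged attention on Metal")
    return "paged"

os.environ.pop("VLLM_METAL_USE_PAGED_ATTENTION", None)
print(select_kv_cache_path(is_hybrid=True))  # mlx_native
```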

Changes

New Files

File                                      Lines  Description
tests/test_platform_update_block_size.py  576    Comprehensive unit tests (16 test cases)

Modified Files

File                           Changes   Description
vllm_metal/platform.py         +172      Add update_block_size_for_backend() with MLA support
vllm_metal/v1/worker.py        +41, -23  Improve memory reporting, remove auto-enable paged
vllm_metal/v1/model_runner.py  +1, -1    Fix MLX API: mx.device_info()
tests/test_v1_worker.py        +22       Update memory reporting test

Test Results

Unit Tests (16/16 Pass)

$ source .venv-vllm-metal/bin/activate && python -m pytest tests/test_platform_update_block_size.py -v

tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_hybrid_model_success PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_hybrid_model_block_size_already_sufficient PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_non_hybrid_model_skipped PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_model_config_none PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_model_resolution_failure PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_get_mamba_state_shape_failure PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_get_mamba_state_dtype_failure PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_mamba_page_size_zero PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_invalid_architecture PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_block_size_increased_to_minimum PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_mamba_cache_mode_align PASSED
tests/test_platform_update_block_size.py::TestUpdateBlockSizeForBackend::test_hybrid_with_paged_attention_raises_error PASSED
tests/test_platform_update_block_size.py::TestMLAModels::test_mla_hybrid_model_uses_mla_spec PASSED
tests/test_platform_update_block_size.py::TestMLAModels::test_mla_non_hybrid_skipped PASSED
tests/test_platform_update_block_size.py::TestMLAModels::test_mla_with_cache_dtype[bfloat16] PASSED
tests/test_platform_update_block_size.py::TestMLAModels::test_mla_with_cache_dtype[float16] PASSED

16 passed, 2 warnings in 2.69s

Lint Checks

$ ruff check vllm_metal/ tests/
All checks passed!

$ ruff format --check vllm_metal/ tests/
5 files already formatted

Usage

Default Path (Recommended for Hybrid Models)

# No env var needed - uses MLX's native KV cache via make_prompt_cache()
vllm serve Qwen/Qwen3.5-0.8B
vllm serve Qwen/Qwen3.5-4B
vllm serve Qwen/Qwen3.5-14B
vllm serve Qwen/Qwen3.5-32B

Paged Attention Path (Non-Hybrid Models Only)

# Only for non-hybrid models
VLLM_METAL_USE_PAGED_ATTENTION=1 vllm serve HuggingFaceTB/SmolLM2-135M-Instruct

Unsupported Configuration (Will Raise Clear Error)

# This will now raise a helpful ValueError instead of cryptic kernel failure
VLLM_METAL_USE_PAGED_ATTENTION=1 vllm serve Qwen/Qwen3.5-0.8B
# ValueError: Hybrid models (e.g., Qwen3.5) are not supported with paged attention on Metal...

Known Limitations

Hybrid + Paged Attention is unsupported on Metal due to kernel limitations:

  • Metal paged attention kernels only instantiate block_size ∈ {8, 16, 32}
  • Hybrid models require block_size=160 to satisfy vLLM's page size divisibility validation
  • Users should use the native MLX KV cache path (default) for hybrid models
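The arithmetic behind this limitation can be checked directly with the Qwen3.5-0.8B figures from the review thread:

```python
mamba_page_size = 548_864        # Qwen3.5-0.8B Mamba (GDN) page size in bytes
attn_page_size_1_token = 4_096   # SDPA page size per token
supported_block_sizes = {8, 16, 32}  # Metal kernel instantiations

# Smallest block_size whose SDPA page size covers the Mamba page size
min_block_size = -(-mamba_page_size // attn_page_size_1_token)  # ceil division
print(min_block_size)  # 134

# None of the supported kernel block sizes is large enough
print(any(b >= min_block_size for b in supported_block_sizes))  # False

# Rounding 134 up to the next multiple of 32 yields the block_size=160
# quoted in the error message
print(((min_block_size + 31) // 32) * 32)  # 160
```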

Related Issues

Commits

This PR includes the following commits:

964e9df [Metal] Fix inaccurate docstring and test comments
c87263b [Metal] Fix import order and add test for hybrid + paged error case
f7161cc [Metal] Address reviewer feedback on hybrid + paged attention
774c8bd  use new device
fb5e2d5 [Metal] Fix ruff format issues
5c7379b [Metal] Fix lint issues in MLA support changes
80b8d19 [Metal] Add MLA support to update_block_size_for_backend
733612b [Metal] Improve error handling in update_block_size_for_backend
36d8cb7 [Metal] Add unit tests for update_block_size_for_backend and improve error handling
b2973f0 Fix test for determine_available_memory single_sequence mode
a292980 Restore model_runner.py to upstream version
3f22b84 [Metal] Fix Qwen3.5 hybrid model initialization
d49d0fb [Metal] Fix hybrid model KV cache page size alignment

@ericcurtin
Collaborator

update_block_size_for_backend has no callers, no tests for new platform method, mamba_page_size_padded may be unset if update_block_size_for_backend is never invoked, magic 0.8 factor not configurable

@Alex-ai-future
Contributor Author

> update_block_size_for_backend has no callers, no tests for new platform method, mamba_page_size_padded may be unset if update_block_size_for_backend is never invoked, magic 0.8 factor not configurable

Thank you for your review! I would like to discuss the hardcoded 0.8 coefficient in determine_available_memory().
Issue: the GPU path measures memory usage via actual profiling, whereas the MLX path relies on a hardcoded estimate. This is inconsistent and prevents users from tuning the budget for their workload.
Question: should this function report the actual memory capacity (like the GPU path), or establish a configurable memory budget? Does it need to be user-adjustable at all, or should it simply return a measured value?

@Alex-ai-future
Contributor Author

  1. "update_block_size_for_backend has no callers"

The method is called by vLLM core in:

  • File: vllm/v1/executor/uniproc_executor.py
  • Function: UniProcExecutor.__init__() (line 53)
  • Context: During executor initialization, after load_model() completes

# vllm/v1/executor/uniproc_executor.py:53
self.driver_worker.load_model()
current_platform.update_block_size_for_backend(self.vllm_config)  # ← called here

This is the standard integration point for all platform plugins.

  2. "no tests for new platform method"

Agreed. I've added 11 unit tests in tests/test_platform_update_block_size.py covering success, failure, and edge cases.

  3. "mamba_page_size_padded may be unset"

Good catch. I've improved error handling to raise exceptions for hybrid models instead of silently returning.

Contributor

@ricky-chaoju ricky-chaoju left a comment


Thanks for working on this. A few issues found during review:

block_size will crash the Metal kernel. Verified with all Qwen3.5 variants (0.8B/4B/14B/32B) the formula at platform.py:413-417 computes attn_block_size=160, but Metal paged attention only has kernel instantiations for {8, 16, 32}. Users who set VLLM_METAL_USE_PAGED_ATTENTION=1 will hit RuntimeError: Unable to load function ...bs160.... The docstring claims this is "scheduler validation only" but cache_config.block_size is consumed by the kernel in the paged path.

Magic constant. The 0.8 factor in worker.py:420 should be a named constant. Agree with ericcurtin's earlier feedback.

Stale test comments. test_mla_hybrid_model_uses_mla_spec and test_mla_with_cache_dtype say "will FAIL until MLA support is added" but the implementation landed in the same commit — all 15 tests pass. Comments should be cleaned up.

Related: #232 fixes another #226 regression (head_dim AttributeError on Qwen3).

Comment thread vllm_metal/platform.py
# - GPU performance (aligned to Metal threadgroup preferences)
# - Memory efficiency (not excessively large)
# - Compatibility with vLLM's page size unification requirements
kernel_block_alignment_size = 32 # Metal GPU kernel alignment
Contributor


Verified with actual model parameters — all Qwen3.5 variants (0.8B/4B/14B/32B) compute attn_block_size=160 here.

Model         mamba_page_size  attn_page_size_1_token  block_size
Qwen3.5-0.8B  548,864          4,096                   160
Qwen3.5-4B    573,440          4,096                   160
Qwen3.5-14B   585,728          4,096                   160
Qwen3.5-32B   585,728          4,096                   160

Metal paged attention kernel only has template instantiations for block_size ∈ {8, 16, 32} (pagedattention.metal:1452-1457). When a user sets VLLM_METAL_USE_PAGED_ATTENTION=1, this crashes:

RuntimeError: Unable to load function paged_attention_..._bs160_...

The docstring says "scheduler validation only" — inaccurate because cache_config.block_size is used by the Metal kernel in the paged path.

Related: #232 fixes another #226 regression (head_dim AttributeError on Qwen3).

Contributor Author


My understanding is as follows:

  1. The latest vLLM now validates the required block size; if block_size is left unmodified, it fails this check.
  2. In paged mode, if the block size is modified, there is no corresponding kernel to match it.

However, how were you able to run the model successfully while using the paged kernel?

Contributor Author


Also, do we have a plan to support dynamic kernels or to skip validation? Otherwise it is unlikely to work on the latest vLLM. Tell me if I misunderstood!

Contributor Author

@Alex-ai-future Alex-ai-future Apr 6, 2026


Is it possible to detect whether Paged Mode is enabled?

If enabled, skip modifying the block size.

If disabled, modify the block size.

As part of a broader plan, at the very least, an option could be provided to allow running hybrid models without Paged Mode on latest vllm.

Contributor


Confirmed — none of the Metal-supported block sizes {8, 16, 32} pass vLLM's page size divisibility check for Qwen3.5 (mamba_page_size=548864). The minimum required block_size is 134, which has no kernel instantiation. So hybrid + paged is genuinely unsupported right now.

Your approach is correct: update_block_size_for_backend solves the non-paged validation, and removing auto-enable paged for hybrid is the right guard. My earlier review about the block_size crash still holds, but only when a user manually forces VLLM_METAL_USE_PAGED_ATTENTION=1 on a hybrid model — which is an unsupported configuration.

Suggestion: add an early error in _setup_paged_attention() (or update_block_size_for_backend) when both hybrid and paged attention are active, so users get a clear message instead of a cryptic kernel load failure.

Contributor Author


Thank you for your advice!

Comment thread vllm_metal/v1/worker.py
@@ -417,13 +410,21 @@
Contributor


Agree with ericcurtin's feedback. Should at least be extracted to a named constant (e.g. _MLX_MEMORY_BUDGET_FRACTION = 0.8).

- block_size should be a multiple of 32 (Metal GPU alignment)

Note: This test will FAIL until MLA support is added to the implementation.
"""
Contributor


This docstring says "will FAIL until MLA support is added" but the implementation was added in the same commit — all 15 tests pass. Same issue at line 479 and 528. Comments should be cleaned up.

@ericcurtin
Collaborator

Need to sign to pass DCO

@Alex-ai-future Alex-ai-future force-pushed the fix/qwen3.5-init-failed branch from d1e7277 to 50cde22 Compare April 6, 2026 15:32
@Alex-ai-future
Contributor Author

signed, thank you!

Comment thread vllm_metal/v1/worker.py Outdated
metal_limit = int(device_info.get("max_recommended_working_set_size", 0))
model_memory = self._get_model_memory_usage()
remaining = metal_limit - model_memory
available = int(remaining * _MLX_MEMORY_BUDGET_FRACTION)
Collaborator


This replaces the one-sequence admission logic (30eb33b, refined in #229) with a heuristic '80% of remaining memory.' That changes scheduler semantics and makes the block-alignment fix irrelevant on the MLX path. Keep the one-sequence estimate unless this is an explicit policy change.

Contributor Author


Regarding determine_available_memory: Does the result calculated using the one-sequence method actually represent the available memory? During actual execution, I observed that the resulting value was extremely low, which prevented the program from running.
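For context, a one-sequence estimate of the kind PR #229 describes might look like the sketch below. The formula and parameter values are assumptions for illustration, not the plugin's actual _one_sequence_kv_bytes():

```python
def one_sequence_kv_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                          max_model_len: int, dtype_size: int = 2) -> int:
    """Bytes to hold K and V for one max-length sequence across all
    attention layers (dtype_size=2 assumes fp16/bf16)."""
    return 2 * num_layers * num_kv_heads * head_dim * max_model_len * dtype_size

# Illustrative small-model shape with max_model_len=2048: the estimate is
# tiny relative to a 16 GB device, consistent with the observation above.
print(one_sequence_kv_bytes(24, 8, 128, 2048) / 2**20)  # 192.0 (MiB)
```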

Comment thread vllm_metal/platform.py Outdated
raise ValueError(
"Hybrid models (e.g., Qwen3.5) are not supported with paged "
"attention on Metal. The Metal paged attention kernel only "
"supports block_size in {8, 16, 32}, but hybrid models require "
Collaborator


This hard ValueError blocks hybrid + paged attention unconditionally, which conflicts with PR #235’s block-size translation. Gate it on actual kernel support or downgrade to a warning.

Contributor Author


Even if execution doesn't halt here, the program would still fail during actual runtime in this version due to the lack of a matching kernel.
Are you sure you want to allow this branch to execute within this PR? If so, I can change it to a warning for now.

Comment thread vllm_metal/v1/worker.py Outdated

# Memory budget fraction for MLX non-paged path.
# Reports 80% of remaining Metal memory for KV cache.
_MLX_MEMORY_BUDGET_FRACTION = 0.8
Collaborator


_MLX_MEMORY_BUDGET_FRACTION = 0.8 ignores cache_config.gpu_memory_utilization, removing user control over memory limits on the MLX path. Use the config value instead of a hard constant.
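A minimal sketch of that suggestion; names other than gpu_memory_utilization are illustrative:

```python
def kv_budget(metal_limit: int, model_memory: int,
              gpu_memory_utilization: float) -> int:
    """Derive the KV budget from the user-configurable
    --gpu-memory-utilization instead of a hard-coded fraction."""
    return int((metal_limit - model_memory) * gpu_memory_utilization)

GiB = 1 << 30
# With vLLM's default gpu_memory_utilization=0.9 on a 16 GB device:
print(round(kv_budget(16 * GiB, 2 * GiB, 0.9) / GiB, 1))  # 12.6
```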

Comment thread vllm_metal/platform.py
else:
kv_cache_dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype]

# Use MLAAttentionSpec for MLA models, FullAttentionSpec otherwise
Collaborator


Duplicated MLAAttentionSpec/FullAttentionSpec instantiation. Use a SpecClass selection to remove repetition.
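The suggested pattern, shown with stand-in classes (MLAAttentionSpec and FullAttentionSpec are vLLM's; these minimal versions only illustrate the shape of the refactor):

```python
# Stand-ins for the vLLM spec classes, for illustration only
class FullAttentionSpec:
    def __init__(self, **kwargs):
        self.kwargs = kwargs

class MLAAttentionSpec(FullAttentionSpec):
    pass

def make_attn_spec(use_mla: bool, **shared_kwargs):
    # Select the class once; instantiate with the shared arguments once
    spec_class = MLAAttentionSpec if use_mla else FullAttentionSpec
    return spec_class(**shared_kwargs)

print(type(make_attn_spec(use_mla=True, block_size=160)).__name__)   # MLAAttentionSpec
print(type(make_attn_spec(use_mla=False, block_size=160)).__name__)  # FullAttentionSpec
```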

Alex-ai-future and others added 14 commits April 8, 2026 11:42
Add update_block_size_for_backend method to MetalPlatform to properly
align block_size and mamba_page_size_padded for hybrid models like
Qwen3.5.

The fix ensures that:
1. block_size is calculated to make SDPA page_size >= Mamba page_size
2. mamba_page_size_padded is set to match SDPA page_size exactly
3. _setup_paged_attention is called after block_size alignment

This resolves the NotImplementedError when serving Qwen3.5 models:
  "The page size of the layer is not divisible by the maximum page size"

Signed-off-by: Alex <alex.tech.lab@outlook.com>

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
This commit fixes the initialization of hybrid models (e.g., Qwen3.5) on
Metal platform by:

1. Adding update_block_size_for_backend() to MetalPlatform
   - Calculates proper block_size for hybrid models
   - Sets mamba_page_size_padded to unify page sizes
   - Required for vLLM's KV cache validation

2. Fixing block_size usage in model_runner.py
   - Use metal_config.block_size (not cache_config.block_size)
   - MLX uses metal_config.block_size for make_prompt_cache()
   - cache_config.block_size is only for vLLM scheduler validation

3. Fixing memory reporting in worker.py
   - Use 80% of remaining memory (like the old version)
   - Removed auto-enabling paged attention for hybrid models
   - MLX's make_prompt_cache() handles hybrid KV cache natively

Signed-off-by: Alex <alex.tech.lab@outlook.com>

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
This reverts the unnecessary changes to get_kv_cache_spec() and
get_cache_block_size_bytes() that switched from cache_config.block_size
to metal_config.block_size.

As analyzed, these changes were not functionally necessary because:
1. get_cache_block_size_bytes() is only used when VLLM_METAL_USE_PAGED_ATTENTION=1
2. The default path uses MLX's make_prompt_cache() which doesn't use these values
3. The critical fix is in determine_available_memory() and load_model()

Keeping code consistent with upstream for easier maintenance.

Signed-off-by: Alex <alex.tech.lab@outlook.com>

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Update test_determine_available_memory_single_sequence_mode to match
the new implementation that returns 80% of remaining Metal memory
instead of _one_sequence_kv_bytes().

The new implementation:
- Returns 80% of (metal_limit - model_memory) for MLX path
- Provides more accurate memory reporting for scheduler admission control
- Matches the behavior of the old update_to_lateset_version branch

Signed-off-by: Alex <alex.tech.lab@outlook.com>

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
…error handling

- Add comprehensive unit tests (11 test cases) covering:
  - Success cases: hybrid models, non-hybrid models, sufficient block_size
  - Failure cases: model resolution failure, invalid config, zero mamba_page_size
  - Edge cases: block_size alignment, mamba_cache_mode='align'

- Improve error handling in update_block_size_for_backend():
  - Re-raise exceptions for hybrid models instead of silently returning
  - Add validation for zero mamba_page_size
  - Add error logging for debugging

- Address reviewer feedback:
  - Add tests for new platform method
  - Ensure mamba_page_size_padded is properly set or raises exception

Signed-off-by: Your Name <your.email@example.com>

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
- Re-raise exceptions for hybrid models instead of silently returning
- Add validation for zero mamba_page_size
- Add error logging for debugging

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
- Add MLA (Multi-Token Latent Attention) model support by checking
  model_config.use_mla and using MLAAttentionSpec for page size calculation
- Add cache_dtype handling to properly convert cache_config.cache_dtype
  to torch.dtype (e.g., 'bfloat16', 'float16')
- Add 4 new unit tests for MLA models:
  * test_mla_hybrid_model_uses_mla_spec: Verify MLA models use MLAAttentionSpec
  * test_mla_non_hybrid_skipped: Verify non-hybrid MLA models skip processing
  * test_mla_with_cache_dtype: Verify different cache dtypes are handled correctly

This enables vllm-metal to support MLA-based models like DeepSeek.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
- Remove trailing whitespace in test docstrings
- Remove duplicate FullAttentionSpec and MambaSpec imports
  (already imported at line 333)

Signed-off-by: Alex <alex.tech.lab@outlook.com>

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
- Add _MLX_MEMORY_BUDGET_FRACTION constant to replace magic 0.8 factor
- Add early ValueError when hybrid model is used with paged attention
  (Metal kernels only support block_size in {8, 16, 32}, but hybrid
  models require block_size=160)
- Remove stale test comments about MLA support (already implemented)

Fixes reviewer comments from ricky-chaoju and ericcurtin.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
- Move _MLX_MEMORY_BUDGET_FRACTION constant after all imports
- Add test_hybrid_with_paged_attention_raises_error() to verify
  clear error message when users enable paged attention on hybrid models

All 16 tests pass.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
- Update update_block_size_for_backend docstring with complete steps
  and raises documentation
- Remove outdated 'test will FAIL' comment (MLA support already added)
- Fix misleading comment about padding in block_size test

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Tests were mocking mx.metal.device_info but the code uses mx.device_info()
after the API update. Fix all 8 TestPrefixCacheFractionParsing tests.

Fixes: 8 failed tests in tests/test_prefix_cache.py

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
@Alex-ai-future Alex-ai-future force-pushed the fix/qwen3.5-init-failed branch from 4128d7d to fe70fc9 Compare April 8, 2026 03:42
Explains the block-size translation mechanism (PR vllm-project#235) when users
enable paged attention for hybrid models like Qwen3.5.

The warning describes:
- Why translation is needed (vLLM requires block_size=160, Metal kernel
  only supports {8, 16, 32})
- How it works (each vLLM block splits into multiple kernel blocks,
  cache is reshaped zero-copy)
- That the default MLX path is recommended for hybrid models

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Alex-ai-future and others added 4 commits April 8, 2026 12:56
Fixes reviewer comment from LxYuan0420: The hardcoded 0.8 factor ignored
user configuration and removed control over memory limits on the MLX path.

Changes:
- Remove _MLX_MEMORY_BUDGET_FRACTION constant
- Use self.cache_config.gpu_memory_utilization (default 0.9 in vLLM)
- Apply to total Metal memory limit (consistent with vLLM GPU path)
- Update log message to show GPU memory utilization value

This allows users to control memory usage via --gpu-memory-utilization
flag, consistent with vLLM's standard behavior.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Fixes reviewer comment from LxYuan0420: Use SpecClass selection pattern
instead of duplicated MLAAttentionSpec/FullAttentionSpec instantiation.

Changes:
- Combine MLAAttentionSpec and FullAttentionSpec imports
- Use conditional SpecClass selection
- Remove duplicate if/else block

This makes the code cleaner and easier to maintain.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Reverts to the PR vllm-project#229 design: report one max-length sequence of KV cache
for the MLX path, instead of a fraction of total Metal memory.

Rationale (from LxYuan0420's review):
- The previous change (gpu_memory_utilization * total_memory) altered
  scheduler semantics without explicit policy discussion.
- PR vllm-project#229's one-sequence estimate ensures conservative admission control.
- MLX's make_prompt_cache() dynamically allocates per request, so we only
  need to report enough for one sequence.

This keeps the scheduler behavior consistent with upstream expectations
and avoids over-committing memory.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
…ence estimate

Updates test expectations to match the implementation changes:
- test_hybrid_with_paged_attention_logs_warning: Verify warning is logged
  instead of ValueError (PR vllm-project#235 made hybrid + paged attention supported)
- test_determine_available_memory_single_sequence_mode: Restore to test
  one-sequence estimate (PR vllm-project#229 design) instead of 80% memory fraction

Also fixes test fixtures to include required vllm_config attribute.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
@Alex-ai-future
Contributor Author

@LxYuan0420 Thank you for the thoughtful review! You raised valid concerns, and I've addressed them:

1. Hard ValueError → Warning ✅ Since PR #235 is now merged, hybrid + paged attention works via block-size translation. Changed the hard error to an informative warning explaining the mechanism.

2. Memory reporting strategy ✅ You were right — my change from one-sequence estimate to gpu_memory_utilization altered scheduler semantics without proper discussion. I've reverted to PR #229's design (one max-length sequence) to maintain conservative admission control for MLX's dynamic allocation.

3. Code duplication ✅ Refactored to use spec_class selection pattern instead of duplicating MLAAttentionSpec/FullAttentionSpec instantiation.

4. Tests & Lint ✅ All tests pass (16/16 + 11/11), and ruff check/format are clean.

I appreciate you catching these issues — especially the memory reporting strategy change, which I should have discussed more carefully before implementing. 🙏

Collaborator

@LxYuan0420 LxYuan0420 left a comment


I noticed a separate issue in linear_cache_bytes_per_slot() where recurrent GDN memory is undercounted because GDNPagedStateCache forces mx.float32. That came from PR #226. It’s out of scope for this PR, so I’ll follow up with a small fix.

@LxYuan0420 LxYuan0420 merged commit 987feac into vllm-project:main Apr 8, 2026
5 checks passed


Development

Successfully merging this pull request may close these issues.

Engine crashes when prefix caching hits new complete prefill in Metal paged unified path

4 participants