Conversation

update_block_size_for_backend has no callers; no tests for the new platform method; mamba_page_size_padded may be unset if update_block_size_for_backend is never invoked; the magic 0.8 factor is not configurable.
Thank you for your review! I would like to discuss the hardcoded coefficient of 0.8 in the memory calculation.
The method is called by vLLM core in vllm/v1/executor/uniproc_executor.py:53. This is the standard integration point for all platform plugins.
Agreed. I've added 11 unit tests in tests/test_platform_update_block_size.py covering success, failure, and edge cases.
Good catch. I've improved error handling to raise exceptions for hybrid models instead of silently returning. |
Thanks for working on this. A few issues found during review:
The computed block_size will crash the Metal kernel. Verified with all Qwen3.5 variants (0.8B/4B/14B/32B): the formula at platform.py:413-417 computes attn_block_size=160, but Metal paged attention only has kernel instantiations for {8, 16, 32}. Users who set VLLM_METAL_USE_PAGED_ATTENTION=1 will hit RuntimeError: Unable to load function ...bs160.... The docstring claims this is "scheduler validation only", but cache_config.block_size is consumed by the kernel in the paged path.
Magic constant. The 0.8 factor in worker.py:420 should be a named constant. Agree with ericcurtin's earlier feedback.
Stale test comments. test_mla_hybrid_model_uses_mla_spec and test_mla_with_cache_dtype say "will FAIL until MLA support is added" but the implementation landed in the same commit — all 15 tests pass. Comments should be cleaned up.
Related: #232 fixes another #226 regression (head_dim AttributeError on Qwen3).
# - GPU performance (aligned to Metal threadgroup preferences)
# - Memory efficiency (not excessively large)
# - Compatibility with vLLM's page size unification requirements
kernel_block_alignment_size = 32  # Metal GPU kernel alignment
Verified with actual model parameters — all Qwen3.5 variants (0.8B/4B/14B/32B) compute attn_block_size=160 here.
| Model | mamba_page_size | attn_page_size_1_token | block_size |
|---|---|---|---|
| Qwen3.5-0.8B | 548,864 | 4,096 | 160 |
| Qwen3.5-4B | 573,440 | 4,096 | 160 |
| Qwen3.5-14B | 585,728 | 4,096 | 160 |
| Qwen3.5-32B | 585,728 | 4,096 | 160 |
Metal paged attention kernel only has template instantiations for block_size ∈ {8, 16, 32} (pagedattention.metal:1452-1457). When a user sets VLLM_METAL_USE_PAGED_ATTENTION=1, this crashes:
RuntimeError: Unable to load function paged_attention_..._bs160_...
The docstring says "scheduler validation only" — inaccurate because cache_config.block_size is used by the Metal kernel in the paged path.
Related: #232 fixes another #226 regression (head_dim AttributeError on Qwen3).
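The 160 figure follows directly from the page-size arithmetic in the table above; here is a standalone sketch of the calculation (the function name is mine, not the actual platform.py code):

```python
import math

def compute_attn_block_size(mamba_page_size: int,
                            attn_page_size_1_token: int,
                            alignment: int = 32) -> int:
    """Smallest block size whose attention page covers the Mamba page,
    rounded up to the Metal kernel alignment (32)."""
    min_block = math.ceil(mamba_page_size / attn_page_size_1_token)
    return math.ceil(min_block / alignment) * alignment

# Qwen3.5-0.8B from the table: 548,864 / 4,096 = 134, rounded up to 160
print(compute_attn_block_size(548_864, 4_096))  # 160
print(160 in {8, 16, 32})  # False -> no kernel instantiation
```

All four Qwen3.5 variants in the table land on 160, which is why every one of them hits the missing-kernel error.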
My understanding is as follows:
- The latest version of vLLM now validates the required block size; if it remains unmodified, it will fail this check.
- In paged mode, if the block size is modified, there is no corresponding kernel available to match it.
However, how were you able to successfully run the application while utilizing the paged kernel?
And, do we have a plan to support a dynamic kernel or to skip validation? Otherwise it is not likely to work on the latest vLLM. Tell me if I misunderstood!
Is it possible to detect whether Paged Mode is enabled?
If enabled, skip modifying the block size.
If disabled, modify the block size.
As part of a broader plan, at the very least, an option could be provided to allow running hybrid models without Paged Mode on the latest vLLM.
Confirmed — none of the Metal-supported block sizes {8, 16, 32} pass vLLM's page size divisibility check for Qwen3.5 (mamba_page_size=548864). The minimum required block_size is 134, which has no kernel instantiation. So hybrid + paged is genuinely unsupported right now.
Your approach is correct: update_block_size_for_backend solves the non-paged validation, and removing the auto-enabling of paged attention for hybrid models is the right guard. My earlier review about the block_size crash still holds, but only when a user manually forces VLLM_METAL_USE_PAGED_ATTENTION=1 on a hybrid model — which is an unsupported configuration.
Suggestion: add an early error in _setup_paged_attention() (or update_block_size_for_backend) when both hybrid and paged attention are active, so users get a clear message instead of a cryptic kernel load failure.
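A minimal sketch of such an early guard, with illustrative names (check_metal_paged_config, is_hybrid_model, and use_paged_attention are mine, not the plugin's actual API):

```python
def check_metal_paged_config(is_hybrid_model: bool,
                             use_paged_attention: bool,
                             block_size: int) -> None:
    """Fail fast with a clear message instead of a cryptic kernel load error."""
    supported_block_sizes = {8, 16, 32}  # Metal kernel instantiations
    if use_paged_attention and is_hybrid_model:
        raise ValueError(
            "Hybrid models are not supported with paged attention on Metal: "
            f"required block_size={block_size} has no kernel instantiation "
            f"(supported: {sorted(supported_block_sizes)}). "
            "Unset VLLM_METAL_USE_PAGED_ATTENTION to use the MLX path."
        )

check_metal_paged_config(False, True, 16)    # non-hybrid + paged: OK
check_metal_paged_config(True, False, 160)   # hybrid on MLX path: OK
```

Called early in _setup_paged_attention() (or update_block_size_for_backend), this turns the RuntimeError at kernel load time into an actionable configuration error at startup.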
thank you for your advice
@@ -417,13 +410,21 @@
Agree with ericcurtin's feedback. Should at least be extracted to a named constant (e.g. _MLX_MEMORY_BUDGET_FRACTION = 0.8).
- block_size should be a multiple of 32 (Metal GPU alignment)

Note: This test will FAIL until MLA support is added to the implementation.
"""
This docstring says "will FAIL until MLA support is added", but the implementation was added in the same commit — all 15 tests pass. Same issue at lines 479 and 528. The comments should be cleaned up.
Need to sign to pass DCO
Force-pushed d1e7277 to 50cde22
signed, thank you!
metal_limit = int(device_info.get("max_recommended_working_set_size", 0))
model_memory = self._get_model_memory_usage()
remaining = metal_limit - model_memory
available = int(remaining * _MLX_MEMORY_BUDGET_FRACTION)
Regarding determine_available_memory: Does the result calculated using the one-sequence method actually represent the available memory? During actual execution, I observed that the resulting value was extremely low, which prevented the program from running.
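To make the trade-off concrete, here is a rough sketch of the two reporting strategies under discussion (simplified free functions with hypothetical inputs, not the worker's actual methods):

```python
def kv_budget_fraction(metal_limit: int, model_memory: int,
                       fraction: float = 0.8) -> int:
    # Report a fraction of the Metal memory remaining after model weights.
    return int((metal_limit - model_memory) * fraction)

def kv_budget_one_sequence(max_model_len: int, kv_bytes_per_token: int) -> int:
    # Conservative: report only enough KV cache for one max-length sequence.
    return max_model_len * kv_bytes_per_token

GiB = 1024**3
# Fraction of remaining: 16 GiB limit, 2 GiB weights -> 11.2 GiB budget
print(round(kv_budget_fraction(16 * GiB, 2 * GiB) / GiB, 2))  # 11.2
# One sequence: 2048 tokens at an assumed 2 MiB of KV per token -> 4 GiB
print(kv_budget_one_sequence(2048, 2 * 1024**2) / GiB)  # 4.0
```

The one-sequence estimate scales only with max_model_len, so on models with small context limits it can report a value far below what the device could actually serve, matching the "extremely low" behavior observed above.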
raise ValueError(
    "Hybrid models (e.g., Qwen3.5) are not supported with paged "
    "attention on Metal. The Metal paged attention kernel only "
    "supports block_size in {8, 16, 32}, but hybrid models require "
This hard ValueError blocks hybrid + paged attention unconditionally, which conflicts with PR #235’s block-size translation. Gate it on actual kernel support or downgrade to a warning.
Even if execution doesn't halt here, the program would still fail during actual runtime in this version due to the lack of a matching kernel.
Are you sure you want to allow this branch to execute within this PR? If so, I can change it to a warning for now.
# Memory budget fraction for MLX non-paged path.
# Reports 80% of remaining Metal memory for KV cache.
_MLX_MEMORY_BUDGET_FRACTION = 0.8
_MLX_MEMORY_BUDGET_FRACTION = 0.8 ignores cache_config.gpu_memory_utilization, removing user control over memory limits on the MLX path. Use the config value instead of a hard constant.
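A sketch of what this suggestion amounts to, applying the user-controlled fraction to the total limit the way vLLM's GPU path does (the function and variable names are illustrative):

```python
def kv_budget_from_config(metal_limit: int, model_memory: int,
                          gpu_memory_utilization: float) -> int:
    # Apply the user-configured utilization fraction to the total device
    # limit, then subtract model weights -- instead of a hard-coded 0.8
    # applied to the remainder.
    budget = int(metal_limit * gpu_memory_utilization) - model_memory
    return max(budget, 0)

GiB = 1024**3
# 16 GiB limit, 2 GiB weights, vLLM's default utilization of 0.9
print(round(kv_budget_from_config(16 * GiB, 2 * GiB, 0.9) / GiB, 2))  # 12.4
```

This keeps --gpu-memory-utilization meaningful on the MLX path: lowering it shrinks the reported KV budget with no code change.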
else:
    kv_cache_dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype]

# Use MLAAttentionSpec for MLA models, FullAttentionSpec otherwise
Duplicated MLAAttentionSpec/FullAttentionSpec instantiation. Use a SpecClass selection variable to remove the repetition.
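The suggested pattern, sketched with stand-in dataclasses (the real MLAAttentionSpec/FullAttentionSpec come from vLLM and take more parameters):

```python
from dataclasses import dataclass

@dataclass
class FullAttentionSpec:   # stand-in for vLLM's spec class
    block_size: int

@dataclass
class MLAAttentionSpec:    # stand-in for vLLM's spec class
    block_size: int

def make_attn_spec(use_mla: bool, block_size: int):
    # Select the class once; a single construction site replaces
    # the duplicated if/else instantiation blocks.
    SpecClass = MLAAttentionSpec if use_mla else FullAttentionSpec
    return SpecClass(block_size=block_size)

print(type(make_attn_spec(True, 16)).__name__)   # MLAAttentionSpec
print(type(make_attn_spec(False, 16)).__name__)  # FullAttentionSpec
```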
Add update_block_size_for_backend method to MetalPlatform to properly align block_size and mamba_page_size_padded for hybrid models like Qwen3.5. The fix ensures that:
1. block_size is calculated to make SDPA page_size >= Mamba page_size
2. mamba_page_size_padded is set to match SDPA page_size exactly
3. _setup_paged_attention is called after block_size alignment

This resolves the NotImplementedError when serving Qwen3.5 models: "The page size of the layer is not divisible by the maximum page size"

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

This commit fixes the initialization of hybrid models (e.g., Qwen3.5) on the Metal platform by:
1. Adding update_block_size_for_backend() to MetalPlatform
   - Calculates the proper block_size for hybrid models
   - Sets mamba_page_size_padded to unify page sizes
   - Required for vLLM's KV cache validation
2. Fixing block_size usage in model_runner.py
   - Use metal_config.block_size (not cache_config.block_size)
   - MLX uses metal_config.block_size for make_prompt_cache()
   - cache_config.block_size is only for vLLM scheduler validation
3. Fixing memory reporting in worker.py
   - Use 80% of remaining memory (like the old version)
   - Removed auto-enabling paged attention for hybrid models
   - MLX's make_prompt_cache() handles hybrid KV cache natively

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

This reverts the unnecessary changes to get_kv_cache_spec() and get_cache_block_size_bytes() that switched from cache_config.block_size to metal_config.block_size. As analyzed, these changes were not functionally necessary because:
1. get_cache_block_size_bytes() is only used when VLLM_METAL_USE_PAGED_ATTENTION=1
2. The default path uses MLX's make_prompt_cache(), which doesn't use these values
3. The critical fix is in determine_available_memory() and load_model()

Keeping the code consistent with upstream for easier maintenance.

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

Update test_determine_available_memory_single_sequence_mode to match the new implementation, which returns 80% of remaining Metal memory instead of _one_sequence_kv_bytes(). The new implementation:
- Returns 80% of (metal_limit - model_memory) for the MLX path
- Provides more accurate memory reporting for scheduler admission control
- Matches the behavior of the old update_to_lateset_version branch

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

…error handling
- Add comprehensive unit tests (11 test cases) covering:
  - Success cases: hybrid models, non-hybrid models, sufficient block_size
  - Failure cases: model resolution failure, invalid config, zero mamba_page_size
  - Edge cases: block_size alignment, mamba_cache_mode='align'
- Improve error handling in update_block_size_for_backend():
  - Re-raise exceptions for hybrid models instead of silently returning
  - Add validation for zero mamba_page_size
  - Add error logging for debugging
- Address reviewer feedback:
  - Add tests for the new platform method
  - Ensure mamba_page_size_padded is properly set or raises an exception

Signed-off-by: Your Name <your.email@example.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

- Re-raise exceptions for hybrid models instead of silently returning
- Add validation for zero mamba_page_size
- Add error logging for debugging

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

- Add MLA (Multi-Token Latent Attention) model support by checking model_config.use_mla and using MLAAttentionSpec for page size calculation
- Add cache_dtype handling to properly convert cache_config.cache_dtype to torch.dtype (e.g., 'bfloat16', 'float16')
- Add 4 new unit tests for MLA models:
  * test_mla_hybrid_model_uses_mla_spec: Verify MLA models use MLAAttentionSpec
  * test_mla_non_hybrid_skipped: Verify non-hybrid MLA models skip processing
  * test_mla_with_cache_dtype: Verify different cache dtypes are handled correctly

This enables vllm-metal to support MLA-based models like DeepSeek.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

- Remove trailing whitespace in test docstrings
- Remove duplicate FullAttentionSpec and MambaSpec imports (already imported at line 333)

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
- Add _MLX_MEMORY_BUDGET_FRACTION constant to replace magic 0.8 factor
- Add early ValueError when hybrid model is used with paged attention
(Metal kernels only support block_size in {8, 16, 32}, but hybrid
models require block_size=160)
- Remove stale test comments about MLA support (already implemented)
Fixes reviewer comments from ricky-chaoju and ericcurtin.
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
- Move _MLX_MEMORY_BUDGET_FRACTION constant after all imports
- Add test_hybrid_with_paged_attention_raises_error() to verify a clear error message when users enable paged attention on hybrid models

All 16 tests pass.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

- Update the update_block_size_for_backend docstring with complete steps and raises documentation
- Remove the outdated 'test will FAIL' comment (MLA support already added)
- Fix a misleading comment about padding in the block_size test

Signed-off-by: Alex <alex.tech.lab@outlook.com>

Tests were mocking mx.metal.device_info, but the code uses mx.device_info() after the API update. Fix all 8 TestPrefixCacheFractionParsing tests.

Fixes: 8 failed tests in tests/test_prefix_cache.py

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Force-pushed 4128d7d to fe70fc9
Explains the block-size translation mechanism (PR vllm-project#235) when users enable paged attention for hybrid models like Qwen3.5. The warning describes:
- Why translation is needed (vLLM requires block_size=160, the Metal kernel only supports {8, 16, 32})
- How it works (each vLLM block splits into multiple kernel blocks, the cache is reshaped zero-copy)
- That the default MLX path is recommended for hybrid models

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

Fixes reviewer comment from LxYuan0420: the hardcoded 0.8 factor ignored user configuration and removed control over memory limits on the MLX path.

Changes:
- Remove the _MLX_MEMORY_BUDGET_FRACTION constant
- Use self.cache_config.gpu_memory_utilization (default 0.9 in vLLM)
- Apply it to the total Metal memory limit (consistent with vLLM's GPU path)
- Update the log message to show the GPU memory utilization value

This allows users to control memory usage via the --gpu-memory-utilization flag, consistent with vLLM's standard behavior.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

Fixes reviewer comment from LxYuan0420: use the SpecClass selection pattern instead of duplicated MLAAttentionSpec/FullAttentionSpec instantiation.

Changes:
- Combine the MLAAttentionSpec and FullAttentionSpec imports
- Use conditional SpecClass selection
- Remove the duplicate if/else block

This makes the code cleaner and easier to maintain.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

Reverts to the PR vllm-project#229 design: report one max-length sequence of KV cache for the MLX path, instead of a fraction of total Metal memory.

Rationale (from LxYuan0420's review):
- The previous change (gpu_memory_utilization * total_memory) altered scheduler semantics without explicit policy discussion.
- PR vllm-project#229's one-sequence estimate ensures conservative admission control.
- MLX's make_prompt_cache() dynamically allocates per request, so we only need to report enough for one sequence.

This keeps the scheduler behavior consistent with upstream expectations and avoids over-committing memory.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>

…ence estimate

Updates test expectations to match the implementation changes:
- test_hybrid_with_paged_attention_logs_warning: verify a warning is logged instead of a ValueError (PR vllm-project#235 made hybrid + paged attention supported)
- test_determine_available_memory_single_sequence_mode: restore to test the one-sequence estimate (PR vllm-project#229 design) instead of the 80% memory fraction

Also fixes test fixtures to include the required vllm_config attribute.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
@LxYuan0420 Thank you for the thoughtful review! You raised valid concerns, and I've addressed them:
1. Hard ValueError → Warning ✅ Since PR #235 is now merged, hybrid + paged attention works via block-size translation. Changed the hard error to an informative warning explaining the mechanism.
2. Memory reporting strategy ✅ You were right: my change away from the one-sequence estimate altered scheduler semantics, so I reverted to the PR #229 design.
3. Code duplication ✅ Refactored to use the SpecClass selection pattern.
4. Tests & Lint ✅ All tests pass (16/16 + 11/11), and ruff check/format are clean.

I appreciate you catching these issues, especially the memory reporting strategy change, which I should have discussed more carefully before implementing. 🙏
LxYuan0420 left a comment:
I noticed a separate issue in linear_cache_bytes_per_slot() where recurrent GDN memory is undercounted because GDNPagedStateCache forces mx.float32. That came from PR #226. It’s out of scope for this PR, so I’ll follow up with a small fix.
Summary
This PR enables Qwen3.5 hybrid models (SDPA + GDN layers) to run on Metal by implementing update_block_size_for_backend() to unify KV cache page sizes, adding MLA (Multi-Token Latent Attention) support, and improving memory reporting for the MLX path.

Problem
When trying to run Qwen3.5 (a hybrid model with SDPA + GDN layers) on Metal, the following issues were encountered:
1. Hybrid Model Page Size Alignment Failure
vLLM's KV cache validation fails because SDPA page size and Mamba page size are not divisible:
Test failure on main branch:
2. Forced Paged Attention Causes OOM
Previously, paged attention was auto-enabled for hybrid models, which allocates a large contiguous memory buffer that exceeds the capacity of smaller Metal devices.
3. Inaccurate Memory Reporting
The MLX path reported memory based on max_model_len, which gave misleadingly small values for the scheduler.

Solution
1. Add update_block_size_for_backend() to MetalPlatform

Implements a 5-step process to unify page sizes for hybrid models:
- Uses MLAAttentionSpec for MLA models or FullAttentionSpec otherwise
- Aligns to kernel_block_alignment_size=32

Key insight: This is a "logical" fix for vLLM's scheduler validation only. The Metal plugin manages the KV cache internally via MLX's make_prompt_cache(), independent of vLLM's calculations.

2. Add MLA Support

Hybrid models with MLA (e.g., DeepSeek variants) now use MLAAttentionSpec for correct page size calculation.

3. Add Early Error for Hybrid + Paged Attention
Raises a clear ValueError when users attempt to use paged attention with hybrid models.

Root cause: Metal paged attention kernels only support block_size ∈ {8, 16, 32}, but Qwen3.5 requires block_size=160.

4. Improve Memory Reporting
Changed the MLX path to report 80% of remaining Metal memory instead of one max-length sequence:

Before: reporting 4.29 GB for scheduler admission control (one max-length sequence, max_model_len=2048)
After: reporting 11.20 GB for scheduler (Metal limit: 16.00 GB, Model: 2.00 GB, Remaining: 14.00 GB, KV budget: 11.20 GB)

5. Remove Auto-Enable Paged Attention for Hybrid Models

MLX's make_prompt_cache() handles hybrid KV cache natively. Paged attention is now opt-in rather than forced.

Changes
New Files
- tests/test_platform_update_block_size.py

Modified Files
- vllm_metal/platform.py: update_block_size_for_backend() with MLA support
- vllm_metal/v1/worker.py
- vllm_metal/v1/model_runner.py: mx.device_info()
- tests/test_v1_worker.py

Test Results
Unit Tests (16/16 Pass)

Lint Checks

    $ ruff check vllm_metal/ tests/
    All checks passed!
    $ ruff format --check vllm_metal/ tests/
    5 files already formatted

Usage
Default Path (Recommended for Hybrid Models)

    # No env var needed - uses MLX's native KV cache via make_prompt_cache()
    vllm serve Qwen/Qwen3.5-0.8B
    vllm serve Qwen/Qwen3.5-4B
    vllm serve Qwen/Qwen3.5-14B
    vllm serve Qwen/Qwen3.5-32B

Paged Attention Path (Non-Hybrid Models Only)

    # Only for non-hybrid models
    VLLM_METAL_USE_PAGED_ATTENTION=1 vllm serve HuggingFaceTB/SmolLM2-135M-Instruct

Unsupported Configuration (Will Raise Clear Error)
Known Limitations

Hybrid + Paged Attention is unsupported on Metal due to kernel limitations:
- Metal kernels only support block_size ∈ {8, 16, 32}
- Hybrid models require block_size=160 to satisfy vLLM's page size divisibility validation

Related Issues
Commits
This PR includes the following commits: