[Bugfix] Fix KV cache undercount in MLX path for large block sizes (#229)
Merged
ericcurtin merged 1 commit into vllm-project:main on Apr 5, 2026
Conversation
`_one_sequence_kv_bytes` used `max_model_len` directly as the token count, but the upstream `_check_enough_kv_cache_memory` uses block-aligned sizes via `cdiv(max_model_len, block_size) * page_size_bytes`. When `block_size` is large (e.g. 400 for Mamba-hybrid models, where the attention block size is padded to match the mamba page size), the rounding overhead causes the needed memory to exceed the reported available memory, failing server startup with:

```
ValueError: 0.34 GiB KV cache is needed, which is larger than the available KV cache memory (0.31 GiB)
```

This affects models like Granite 4.0-H (`GraniteMoeHybridForCausalLM`), which mix Mamba and attention layers and therefore trigger `block_size=400` alignment.

Fix: round `max_model_len` up to the nearest `block_size` boundary in `_one_sequence_kv_bytes` so both sides use the same token count.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Samuel Warren <samuel@sketchpro.ai>
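The mismatch above can be reproduced with a few lines of arithmetic. This is a minimal sketch (not the vLLM source) showing how the scheduler's block-aligned token count diverges from a raw `max_model_len` count once `block_size` no longer divides the sequence length; `page_size_bytes` is omitted since the token counts alone expose the gap:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the rounding used by the upstream accounting."""
    return -(-a // b)

max_model_len = 2048
block_size = 400  # padded attention block size on Mamba-hybrid models

# Upstream _check_enough_kv_cache_memory counts whole blocks:
needed_tokens = cdiv(max_model_len, block_size) * block_size  # 6 blocks = 2400

# The buggy _one_sequence_kv_bytes counted raw tokens:
reported_tokens = max_model_len  # 2048

print(needed_tokens, reported_tokens)  # 2400 2048

# With the default block_size=16, the two counts coincide, so the bug hides:
assert cdiv(max_model_len, 16) * 16 == max_model_len
```

The 352-token overhead (2400 vs 2048) is exactly the padding that made "needed" exceed "available" at startup.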
c919906 to d0086fd
ericcurtin approved these changes on Apr 5, 2026
Alex-ai-future pushed a commit to Alex-ai-future/vllm-metal that referenced this pull request on Apr 8, 2026
…llm-project#229)

## Summary
- Fix `_one_sequence_kv_bytes` in `MetalWorker` to use block-aligned token counts, matching the upstream scheduler's `cdiv(max_model_len, block_size) * page_size_bytes` accounting
- Prevents server startup failure on Mamba-hybrid models (e.g. Granite 4.0-H) where `block_size` is padded to 400 to match the mamba page size

## Problem
`_one_sequence_kv_bytes` computes KV cache bytes using `max_model_len` directly (e.g. 2048 tokens). But the upstream `_check_enough_kv_cache_memory` in vLLM core uses block-aligned sizes: `cdiv(2048, 400) = 6 blocks = 2400 tokens`. This causes "needed > available" even though the intent is to report exactly enough for one sequence:

```
ValueError: To serve at least one request with the model's max seq len (2048), 0.18 GiB KV cache is needed, which is larger than the available KV cache memory (0.16 GiB).
```

For the default `block_size=16`, `cdiv(2048, 16) * 16 = 2048` — no padding, so this never triggers. It only manifests with large block sizes like 400, which occurs on Mamba-hybrid models (`GraniteMoeHybridForCausalLM`) where the attention block size is padded to match the mamba page size.
## Fix
Round `max_model_len` up to the nearest `block_size` boundary in `_one_sequence_kv_bytes`:

```python
block_size = self.vllm_config.cache_config.block_size
max_tokens = -(-self.model_config.max_model_len // block_size) * block_size
```

## Reproduction
```bash
vllm serve mlx-community/granite-4.0-h-tiny-3bit-MLX --max-model-len 2048 --enforce-eager
# Before fix: fails with KV cache memory error
# After fix: server starts successfully
```

## Test plan
- [x] Added `test_block_alignment_rounds_up_token_count` — verifies block-aligned calculation with `block_size=400`
- [x] Updated existing `test_non_hybrid_counts_all_layers` and `test_hybrid_adds_linear_state` to include `vllm_config.cache_config.block_size` in mocks
- [x] All 10 tests in `test_v1_worker.py` pass
- [x] Verified `vllm serve mlx-community/granite-4.0-h-tiny-3bit-MLX --max-model-len 4096 --enforce-eager` starts and serves requests on M4 Pro 48GB

Signed-off-by: Samuel Warren <samuel@sketchpro.ai>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
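The ceiling-division idiom in the fix can be exercised standalone. The helper below is a sketch of the rounding behavior under the values discussed in this PR (it mirrors the fix's arithmetic but is not the actual `MetalWorker` method, and `round_up_to_block` is a hypothetical name):

```python
def round_up_to_block(max_model_len: int, block_size: int) -> int:
    # Same -(-a // b) ceiling-division idiom used by the fix.
    return -(-max_model_len // block_size) * block_size

# Default block size: 16 divides 2048 evenly, so no padding is added.
assert round_up_to_block(2048, 16) == 2048

# Mamba-hybrid padding to 400: 2048 tokens round up to 6 blocks.
assert round_up_to_block(2048, 400) == 2400

# The 4096-token serve command from the test plan: 11 blocks.
assert round_up_to_block(4096, 400) == 4400
```

With both sides counting block-aligned tokens, the "needed" and "available" figures are computed over the same quantity, so the startup check passes whenever one max-length sequence actually fits.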
Alex-ai-future added a commit to Alex-ai-future/vllm-metal that referenced this pull request on Apr 8, 2026
Reverts to the PR vllm-project#229 design: report one max-length sequence of KV cache for the MLX path, instead of a fraction of total Metal memory.

Rationale (from LxYuan0420's review):
- The previous change (`gpu_memory_utilization * total_memory`) altered scheduler semantics without explicit policy discussion.
- PR vllm-project#229's one-sequence estimate ensures conservative admission control.
- MLX's `make_prompt_cache()` dynamically allocates per request, so we only need to report enough for one sequence.

This keeps the scheduler behavior consistent with upstream expectations and avoids over-committing memory.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
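For context on what a "one-sequence estimate" amounts to, here is an illustrative calculation of KV cache bytes for a single block-aligned max-length sequence. This is not the actual `_one_sequence_kv_bytes` implementation; the function name, parameters, and example model shape are all hypothetical, and it covers only attention layers (the real worker also accounts for Mamba linear state):

```python
def one_sequence_kv_bytes_sketch(
    max_model_len: int,
    block_size: int,
    num_attn_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # fp16 / bf16
) -> int:
    """Bytes of KV cache for one block-aligned max-length sequence."""
    # Block-align the token count, matching the upstream accounting.
    tokens = -(-max_model_len // block_size) * block_size
    # Per token, per layer: one K and one V vector per KV head.
    return tokens * num_attn_layers * 2 * num_kv_heads * head_dim * dtype_bytes

# Hypothetical small hybrid model: 4 attention layers, 8 KV heads of dim 64.
print(one_sequence_kv_bytes_sketch(2048, 400, 4, 8, 64))  # 19660800
```

Reporting only this per-sequence figure is conservative by construction: the scheduler admits one request at a time against it, while `make_prompt_cache()` grows allocations on demand.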
Alex-ai-future added a commit to Alex-ai-future/vllm-metal that referenced this pull request on Apr 8, 2026
…ence estimate

Updates test expectations to match the implementation changes:
- `test_hybrid_with_paged_attention_logs_warning`: verify a warning is logged instead of a ValueError (PR vllm-project#235 made hybrid + paged attention supported)
- `test_determine_available_memory_single_sequence_mode`: restore the one-sequence estimate (PR vllm-project#229 design) instead of the 80% memory fraction

Also fixes test fixtures to include the required `vllm_config` attribute.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>