Revert "[BugFix] Correct max memory usage for multiple KV-cache groups" (#36030)#37584

Draft
zhewenl wants to merge 1 commit intovllm-project:mainfrom
zhewenl:auto-revert/pr-36030

Conversation

@zhewenl (Collaborator) commented Mar 19, 2026

Revert of PR #36030

This reverts #36030 (merge commit 45f526d).

Reason: CI failure in Distributed Torchrun + Examples (4 GPUs). The KV cache memory calculation change caused insufficient KV cache memory (0.44 GiB available vs. 0.5 GiB needed) for microsoft/Phi-mini-MoE-instruct on L4 GPUs, breaking the test_torchrun_example_moe.py test with TP_SIZE=2, DP_SIZE=2, ENABLE_EP=1.

Linked build: https://buildkite.com/vllm/ci/builds/56956
New failures linked: 1

Auto-generated by CI failure analyzer.

@mergify bot added the v1 and bug (Something isn't working) labels on Mar 19, 2026
@gemini-code-assist bot (Contributor) left a comment


Code Review

The pull request successfully reverts the changes introduced in PR #36030, addressing the CI failures related to KV cache memory calculation. While the revert restores a previously stable state, the logic for calculating blocks_needed in _max_memory_usage_bytes_from_groups in vllm/v1/core/kv_cache_utils.py appears to reintroduce a potential bug. It currently only considers the memory usage of the first KV cache group, which could lead to insufficient memory allocation if other groups have different or higher memory requirements. This is a critical issue that should be addressed to prevent future runtime errors, especially in hybrid models with diverse KV cache specifications.

```python
    for group in kv_cache_groups
)
any_spec = kv_cache_groups[0].kv_cache_spec
blocks_needed = cdiv(any_spec.max_memory_usage_bytes(vllm_config), page_size)
```

critical

The reverted logic for calculating blocks_needed uses any_spec = kv_cache_groups[0].kv_cache_spec and then cdiv(any_spec.max_memory_usage_bytes(vllm_config), page_size). This assumes that the max_memory_usage_bytes is uniform across all kv_cache_spec objects within kv_cache_groups for the "General case" (i.e., when not UniformTypeKVCacheSpecs).

However, different KVCacheSpec types (e.g., FullAttentionSpec vs. SlidingWindowSpec) can have different max_memory_usage_bytes calculations, even if their page_size_bytes are unified. By only considering kv_cache_groups[0].kv_cache_spec, this calculation might underestimate the total blocks needed if subsequent groups have higher memory requirements. This could lead to insufficient memory allocation and runtime failures.

To correctly account for all groups, blocks_needed should be derived from the maximum memory usage among all individual kv_cache_spec objects in the groups.

Suggested change:

```diff
- blocks_needed = cdiv(any_spec.max_memory_usage_bytes(vllm_config), page_size)
+ blocks_needed = cdiv(max(group.kv_cache_spec.max_memory_usage_bytes(vllm_config) for group in kv_cache_groups), page_size)
```
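
To make the failure mode concrete, here is a minimal, runnable sketch of the two calculations. The `FakeSpec`/`FakeGroup` classes and the byte counts are stand-ins invented for illustration, not vLLM's real `KVCacheSpec` types:

```python
def cdiv(a: int, b: int) -> int:
    # Ceiling division, matching the cdiv helper used in the snippet above.
    return -(-a // b)


class FakeSpec:
    """Stand-in for a KVCacheSpec; only max_memory_usage_bytes matters here."""

    def __init__(self, max_bytes: int):
        self._max_bytes = max_bytes

    def max_memory_usage_bytes(self, vllm_config=None) -> int:
        return self._max_bytes


class FakeGroup:
    """Stand-in for a KV cache group holding a single spec."""

    def __init__(self, spec: FakeSpec):
        self.kv_cache_spec = spec


page_size = 16 * 1024 * 1024  # 16 MiB per block (illustrative)

# Two groups with different per-group maxima, e.g. a sliding-window group
# (smaller) listed first and a full-attention group (larger) second.
kv_cache_groups = [
    FakeGroup(FakeSpec(100 * 1024 * 1024)),  # 100 MiB
    FakeGroup(FakeSpec(400 * 1024 * 1024)),  # 400 MiB
]

# Reverted logic: only the first group's spec is consulted.
any_spec = kv_cache_groups[0].kv_cache_spec
blocks_first_only = cdiv(any_spec.max_memory_usage_bytes(), page_size)

# Suggested logic: take the max over all groups.
blocks_all_groups = cdiv(
    max(g.kv_cache_spec.max_memory_usage_bytes() for g in kv_cache_groups),
    page_size,
)

print(blocks_first_only)  # 7  -> underestimates the larger group
print(blocks_all_groups)  # 25 -> sized for the worst-case group
```

With the first group's spec alone, the budget here comes out 18 blocks short of what the larger group needs; taking the max over all groups sizes the budget for the worst case.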

Labels

bug (Something isn't working), v1
