Revert "[BugFix] Correct max memory usage for multiple KV-cache groups" (#36030) #37584

zhewenl wants to merge 1 commit into vllm-project:main from
Conversation
Revert "[BugFix] Correct max memory usage for multiple KV-cache groups (vllm-project#36030)"

This reverts commit 45f526d.
Code Review
The pull request successfully reverts the changes introduced in PR #36030, addressing the CI failures related to KV cache memory calculation. While the revert restores a previously stable state, the logic for calculating blocks_needed in _max_memory_usage_bytes_from_groups in vllm/v1/core/kv_cache_utils.py appears to reintroduce a potential bug. It currently only considers the memory usage of the first KV cache group, which could lead to insufficient memory allocation if other groups have different or higher memory requirements. This is a critical issue that should be addressed to prevent future runtime errors, especially in hybrid models with diverse KV cache specifications.
```python
    for group in kv_cache_groups
)
any_spec = kv_cache_groups[0].kv_cache_spec
blocks_needed = cdiv(any_spec.max_memory_usage_bytes(vllm_config), page_size)
```
The reverted logic for calculating blocks_needed uses any_spec = kv_cache_groups[0].kv_cache_spec and then cdiv(any_spec.max_memory_usage_bytes(vllm_config), page_size). This assumes that the max_memory_usage_bytes is uniform across all kv_cache_spec objects within kv_cache_groups for the "General case" (i.e., when not UniformTypeKVCacheSpecs).
However, different KVCacheSpec types (e.g., FullAttentionSpec vs. SlidingWindowSpec) can have different max_memory_usage_bytes calculations, even if their page_size_bytes are unified. By only considering kv_cache_groups[0].kv_cache_spec, this calculation might underestimate the total blocks needed if subsequent groups have higher memory requirements. This could lead to insufficient memory allocation and runtime failures.
To correctly account for all groups, blocks_needed should be derived from the maximum memory usage among all individual kv_cache_spec objects in the groups.
```diff
- blocks_needed = cdiv(any_spec.max_memory_usage_bytes(vllm_config), page_size)
+ blocks_needed = cdiv(max(group.kv_cache_spec.max_memory_usage_bytes(vllm_config) for group in kv_cache_groups), page_size)
```
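To make the underestimation concrete, here is a minimal, self-contained sketch of the two calculations. The spec classes, byte counts, and page size below are invented for illustration and are not vLLM's actual `KVCacheSpec` API; only `cdiv` mirrors the ceiling-division helper used in the snippet above.

```python
# Hypothetical illustration of the bug: all names and numbers here are
# made up, not taken from vLLM's real KVCacheSpec / KVCacheGroup classes.
from dataclasses import dataclass


def cdiv(a: int, b: int) -> int:
    """Ceiling division, matching the cdiv helper used in the snippet."""
    return -(-a // b)


@dataclass
class FakeSpec:
    max_bytes: int

    def max_memory_usage_bytes(self, vllm_config=None) -> int:
        return self.max_bytes


@dataclass
class FakeGroup:
    kv_cache_spec: FakeSpec


page_size = 2 * 1024 * 1024  # assumed 2 MiB page size

# Group 0 (e.g. a sliding-window layer group) needs less memory than
# group 1 (e.g. a full-attention layer group).
kv_cache_groups = [
    FakeGroup(FakeSpec(10 * page_size)),
    FakeGroup(FakeSpec(25 * page_size)),
]

# Reverted logic: looks only at the first group.
first_only = cdiv(
    kv_cache_groups[0].kv_cache_spec.max_memory_usage_bytes(None), page_size
)

# Suggested logic: take the maximum over all groups.
max_over_groups = cdiv(
    max(g.kv_cache_spec.max_memory_usage_bytes(None) for g in kv_cache_groups),
    page_size,
)

print(first_only, max_over_groups)  # 10 25 -- first_only underestimates
```

With these (invented) numbers, the first-group-only calculation reserves 10 blocks while the second group actually needs 25, which is exactly the kind of shortfall that would surface as an insufficient-memory failure at runtime.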
Revert of PR #36030
This reverts #36030 (merge commit 45f526d).
Reason: CI failure in Distributed Torchrun + Examples (4 GPUs): the KV cache memory calculation change caused insufficient KV cache memory (0.44 GiB available vs 0.5 GiB needed) for microsoft/Phi-mini-MoE-instruct on L4 GPUs, breaking the test_torchrun_example_moe.py test with TP_SIZE=2 DP_SIZE=2 ENABLE_EP=1.

Linked build: https://buildkite.com/vllm/ci/builds/56956
New failures linked: 1
Auto-generated by CI failure analyzer.