v1/kv_cache_utils: Respect num_gpu_blocks_override in memory check #27238

khaled-wsa wants to merge 2 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the `ready` label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request correctly fixes a bug where num_gpu_blocks_override was ignored during memory checks, which could lead to engine initialization with insufficient KV blocks. The changes in check_enough_kv_cache_memory are logical and now correctly validate the override value and use it to cap the effective available memory. The new unit tests are comprehensive, covering both the basic case and a more complex heterogeneous model scenario, ensuring the fix is robust.
I have one high-severity comment regarding a latent bug due to a function call with side effects within the modified code block. While it doesn't cause an issue in the current execution path, it's a potential source of future bugs and should be addressed for better code maintainability and correctness.
```diff
  estimated_max_len = estimate_max_model_len(
-     vllm_config, kv_cache_spec, available_memory
+     vllm_config, kv_cache_spec, effective_available_memory
  )
```
The function estimate_max_model_len modifies the vllm_config.model_config.max_model_len attribute as a side effect of its binary search implementation. While this is not currently causing a bug because this code path always raises an exception, it is a latent bug that could cause issues in the future if this function is called in a context that doesn't terminate.
A function should not have hidden side effects on its arguments. It would be best to refactor estimate_max_model_len to not modify vllm_config, for example by restoring the original value before returning or by working on a copy.
Since the definition of estimate_max_model_len is not in this diff, I'm pointing this out here at the call site. A fix could look like this inside estimate_max_model_len:
```python
def estimate_max_model_len(...):
    original_max_len = vllm_config.model_config.max_model_len
    try:
        # ... existing logic ...
        return result
    finally:
        vllm_config.model_config.max_model_len = original_max_len
```

khaled-wsa force-pushed from 53e6b20 to 2aad22e.
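As a runnable illustration of the restore-on-exit fix suggested in the review comment above, here is a toy stand-in. The config classes, the `memory_per_token` model, and the binary search are simplified assumptions for the sketch, not vLLM's real `estimate_max_model_len`:

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    max_model_len: int


@dataclass
class VllmConfig:
    model_config: ModelConfig


def estimate_max_model_len_toy(vllm_config: VllmConfig, memory_per_token: int,
                               available_memory: int) -> int:
    """Binary-search the largest model length whose KV cache fits in memory,
    restoring the caller's config on every exit path (toy stand-in for the
    real estimate_max_model_len)."""
    original_max_len = vllm_config.model_config.max_model_len
    try:
        lo, hi = 0, original_max_len
        while lo < hi:
            mid = (lo + hi + 1) // 2
            # The real function mutates the config so downstream sizing
            # helpers see the candidate length; we mimic that mutation here.
            vllm_config.model_config.max_model_len = mid
            if mid * memory_per_token <= available_memory:
                lo = mid
            else:
                hi = mid - 1
        return lo
    finally:
        # The key fix: the caller's config is untouched even if we raise.
        vllm_config.model_config.max_model_len = original_max_len
```

After the call, `vllm_config.model_config.max_model_len` is back at its original value regardless of how the search terminated.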
See PR body for details. Signed-off-by: khaled-wsa <102720886+khaled-wsa@users.noreply.github.com>
Hi @khaled-wsa, please see discussion under PR #26939. In addition to the …
heheda12345 left a comment:
Thanks for the quick bugfix! Actually I'm thinking of merging `check_enough_kv_cache_memory` and `_report_kv_cache_config`. Basically, in `check_enough_kv_cache_memory` we only do:

```python
if available_memory <= 0:
    raise ValueError(
        "No available memory for the cache blocks. "
        "Try increasing `gpu_memory_utilization` when "
        "initializing the engine."
    )
```

and in `_report_kv_cache_config`, if `max_concurrency < 1`, raise an error:

```python
raise ValueError(
    f"To serve at least one request with the model's max seq len "
    f"({max_model_len}), {needed_memory / GiB_bytes:.2f} GiB KV "
    f"cache is needed, which is larger than the available KV cache "
    f"memory ({available_memory / GiB_bytes:.2f} GiB). "
    f"{estimated_msg} "
    f"Try increasing `gpu_memory_utilization` or decreasing "
    f"`max_model_len` when initializing the engine."
)
```
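Merging the two checks would leave a single function roughly like this sketch. The merged name `check_kv_cache_memory` and the flat integer arguments are hypothetical, and `estimated_msg` is dropped for brevity:

```python
GiB_bytes = 1 << 30  # bytes per GiB, matching the GiB_bytes used above


def check_kv_cache_memory(available_memory: int, needed_memory: int,
                          max_model_len: int) -> None:
    """Hypothetical merged version of the two checks quoted above."""
    if available_memory <= 0:
        raise ValueError(
            "No available memory for the cache blocks. "
            "Try increasing `gpu_memory_utilization` when "
            "initializing the engine."
        )
    # From _report_kv_cache_config: max_concurrency < 1 means even one
    # request at the model's max seq len does not fit in the KV cache.
    max_concurrency = available_memory / needed_memory
    if max_concurrency < 1:
        raise ValueError(
            f"To serve at least one request with the model's max seq len "
            f"({max_model_len}), {needed_memory / GiB_bytes:.2f} GiB KV "
            f"cache is needed, which is larger than the available KV cache "
            f"memory ({available_memory / GiB_bytes:.2f} GiB). "
            f"Try increasing `gpu_memory_utilization` or decreasing "
            f"`max_model_len` when initializing the engine."
        )
```

One function then owns both failure modes, so callers cannot pass the first check and silently skip the second.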
Hi @khaled-wsa, checking in here; any chance we can move forward on this PR?
Summary
- `check_enough_kv_cache_memory` ignored `num_gpu_blocks_override`, allowing engine initialization with an insufficient number of KV blocks for `max_model_len`.
- With this fix, if `num_gpu_blocks_override` is too small (e.g., 1), initialization raises a clear error even if the raw `available_memory` is large.

Context
- `check_enough_kv_cache_memory` didn't consider `num_gpu_blocks_override` (#27181).
- `vllm serve facebook/opt-125m --num_gpu_blocks_override=1` appears to start, but no request with length > block size can be scheduled. Expect an early failure during initialization.

Technical Details
- In `check_enough_kv_cache_memory`, compute `ceil(spec.max_memory_usage_bytes / spec.page_size_bytes)` for each layer and ensure `num_gpu_blocks_override >= max(required_blocks_per_layer)`. This closes the hole for heterogeneous specs (e.g., cross-attn vs self-attn).
- Cap `available_memory` by `sum(page_size_bytes) * num_gpu_blocks_override` to form `effective_available_memory`.
- Compare `needed_memory` to `effective_available_memory`, and pass the effective capacity to `estimate_max_model_len` for accurate guidance.

Files Changed
- Validate `num_gpu_blocks_override` and apply the memory cap; adjust the error message.
- Add `test_check_enough_kv_cache_memory_respects_num_gpu_blocks_override`.
- Add `test_override_must_cover_worst_layer_blocks_in_heterogeneous_model` to cover the cross-attn vs self-attn scenario.

How To Test
- `pytest -q tests/v1/core/test_kv_cache_utils.py::test_check_enough_kv_cache_memory_respects_num_gpu_blocks_override`
- Serve with `--num_gpu_blocks_override=1` and confirm initialization fails early.

Notes
- `page_size_bytes`.
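The override validation and memory cap described under Technical Details can be sketched as follows. The `(max_memory_usage_bytes, page_size_bytes)` tuples and the function name `cap_available_memory` are simplifications for illustration, not vLLM's actual per-layer KV cache spec types:

```python
import math


def cap_available_memory(layer_specs, available_memory, num_gpu_blocks_override):
    """Sketch of the override-aware check: layer_specs is a list of
    (max_memory_usage_bytes, page_size_bytes) pairs, one per layer."""
    if num_gpu_blocks_override is None:
        return available_memory
    # Each layer needs enough blocks to hold one max-length request.
    required_blocks = max(
        math.ceil(max_mem / page_size) for max_mem, page_size in layer_specs
    )
    if num_gpu_blocks_override < required_blocks:
        raise ValueError(
            f"num_gpu_blocks_override={num_gpu_blocks_override} is smaller "
            f"than the {required_blocks} blocks the worst-case layer needs "
            f"for max_model_len."
        )
    # The override caps how much of the raw memory budget is usable:
    # every block consumes one page from each layer's pool.
    total_page_size = sum(page_size for _, page_size in layer_specs)
    return min(available_memory, total_page_size * num_gpu_blocks_override)
```

For example, with a single layer whose page size is 4096 bytes and an override of 2 blocks, only 8192 bytes count as available no matter how large the raw budget is, and an override of 1 block fails the worst-case-layer check.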