[BugFix] Correct max memory usage for multiple KV-cache groups #36030
heheda12345 merged 4 commits into vllm-project:main
Conversation
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Code Review
This pull request addresses a bug in the maximum memory usage calculation, which is particularly relevant for hybrid models. The change in vllm/v1/core/kv_cache_utils.py correctly sums the memory requirements across all KV cache groups, whereas the previous implementation only considered the first group. This ensures accurate memory estimation, for instance when auto-fitting max_model_len. The addition of a new test case for hybrid KV cache specs in tests/v1/core/test_kv_cache_utils.py is a good measure to prevent regressions.
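As a rough illustration of why the per-group estimate matters for auto-fitting (the helper and all numbers below are hypothetical, not vLLM's actual auto-fit code): if the per-token memory of some cache groups is dropped from the estimate, the fitted `max_model_len` comes out too large.

```python
def fit_max_model_len(available_bytes: int, per_token_bytes_per_group: list[int]) -> int:
    # Hypothetical auto-fit: the longest context whose total cache
    # memory across ALL KV-cache groups fits in the available budget.
    per_token_total = sum(per_token_bytes_per_group)
    return available_bytes // per_token_total

available = 8 * 1024**3          # 8 GiB budget (made up)
groups = [64 * 1024, 16 * 1024]  # bytes/token for two cache groups (made up)

# Counting only the first group inflates the fitted length.
wrong = fit_max_model_len(available, groups[:1])
right = fit_max_model_len(available, groups)
print(wrong > right)  # True
```

The bug fixed by this PR is the single-group variant of this calculation: the missing groups shrink the estimated footprint, so the engine believes a longer context fits than actually does.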
Hi @peakcrosser7, thank you for your work on this. I've been testing this PR without issues for the past week or so.
Hello, I'm the author of #37124. I also fixed `_max_memory_usage_bytes_from_groups` as part of a broader set of changes to hybrid Mamba/attention KV cache handling (reporting, allocation, and concurrency estimation), so there is overlap on that one function. Let me know how we should proceed :) (My apologies, I did not see this PR earlier; the issue that brought me here concerned memory overestimation for the Qwen 3.5 series.)

Hi @repne, thanks for pointing that out! There is indeed some overlap between this PR and #37124. Hi @swtb3, great work on #37124! It looks like a more comprehensive solution to the problem. I'm not entirely sure about the best way to move forward with my PR, since auto-merge wasn't triggered as expected earlier. @heheda12345, what are your thoughts on this?
```python
blocks_needed = sum(
    cdiv(group.kv_cache_spec.max_memory_usage_bytes(vllm_config), page_size)
    for group in kv_cache_groups
)
```
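To see what the summation changes, here is a standalone sketch with made-up group sizes (`cdiv` below is a local stand-in for vLLM's ceiling-division helper; the byte counts are illustrative, not taken from any real model):

```python
def cdiv(a: int, b: int) -> int:
    # Ceiling division: pages needed to hold `a` bytes in pages of `b` bytes.
    return -(-a // b)

# Hypothetical per-group memory requirements for a hybrid model:
# a small Mamba-state group and a larger full-attention KV group.
group_memory_bytes = [3_000_000, 13_000_000]
page_size = 4_000_000

# Old (buggy) behavior: only the first group (index 0) was counted.
blocks_old = cdiv(group_memory_bytes[0], page_size)

# Fixed behavior: sum the block requirements of every KV-cache group.
blocks_new = sum(cdiv(m, page_size) for m in group_memory_bytes)

print(blocks_old)  # 1
print(blocks_new)  # 5
```

With only the first group counted, four blocks of the full-attention group's requirement go unaccounted for, which is exactly the underestimation this PR corrects.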
I've done some digging on the cause of the OOM. I think getting the proper allocation for Qwen3.5 will send me back to the drawing board; it may not be as simple as I first thought. I would say, if this PR is ready and tested then go for it. I will rebase my PR on top and continue figuring it out. If you've any thoughts on the OOM, let's discuss over on #37124.
Will use this PR for the time being, ping me when you have something you want me to test. Thank you!
@repne new changes pushed, could you test? cheers!
…project#36030) Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
vllm-project#36030)" This reverts commit 45f526d.
…project#36030) Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Purpose
This PR fixes a calculation error in `_max_memory_usage_bytes_from_groups()` that led to underestimating total memory usage in multi-group cases.

Root Cause

The original implementation calculated `blocks_needed` based only on the first KV-cache group (index 0). In models with multiple KV-cache groups (e.g., hybrid Mamba models), the block usage of the other groups was neglected. This produced a lower-than-actual total memory estimate and caused the engine to over-estimate `max_model_len`.

For example, in `test_auto_fit_max_model_len_with_hybrid()`, the model consists of both Mamba and FullAttn groups. Since the original logic only considered the Mamba group (index 0), which occupies a fixed number of blocks when prefix caching is disabled, it ignored the memory usage of the FullAttn group and produced an excessively large `max_model_len`.

Solution

Updated the logic to sum the blocks from all KV-cache groups, ensuring the total memory usage accurately reflects the requirements of the entire model.

Test Plan
Added a new unit test: `test_auto_fit_max_model_len_with_hybrid()`.

Test Result
Before fix:
After fix:
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.