[Metrics] Add group-aware KV cache capacity to vllm:cache_config_info#42206
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces group-aware KV cache capacity metrics, specifically vllm:kv_cache_size_tokens and vllm:kv_cache_max_concurrency, to provide more accurate reporting for hybrid models where naive block-based calculations overestimate capacity. The changes include updates to the cache configuration, engine initialization logic, and Prometheus logging, along with new unit tests. Review feedback identifies a critical logic error in the Data Parallel (DP) implementation where these metrics are not summed across engines, leading to an under-reporting of total system capacity. Actionable suggestions were provided to aggregate these values in the client and update the test suite to reflect the expected summed totals.
| if ( | ||
| vllm_config.cache_config.kv_cache_size_tokens is None | ||
| and response.kv_cache_size_tokens is not None | ||
| ): | ||
| vllm_config.cache_config.kv_cache_size_tokens = ( | ||
| response.kv_cache_size_tokens | ||
| ) | ||
| if ( | ||
| vllm_config.cache_config.kv_cache_max_concurrency is None | ||
| and response.kv_cache_max_concurrency is not None | ||
| ): | ||
| vllm_config.cache_config.kv_cache_max_concurrency = ( | ||
| response.kv_cache_max_concurrency | ||
| ) |
There was a problem hiding this comment.
The group-aware capacity metrics (kv_cache_size_tokens and kv_cache_max_concurrency) should be summed across Data Parallel (DP) engines, similar to how num_gpu_blocks is handled at line 681. Currently, the code only takes the value from the first engine response. In a DP setup with multiple engines, this results in a reported capacity that is only a fraction of the actual system capacity, creating a significant inconsistency with the num_gpu_blocks reported in vllm:cache_config_info (which is summed). Summing these values ensures that the Prometheus gauges correctly represent the total capacity of the served model.
| if ( | |
| vllm_config.cache_config.kv_cache_size_tokens is None | |
| and response.kv_cache_size_tokens is not None | |
| ): | |
| vllm_config.cache_config.kv_cache_size_tokens = ( | |
| response.kv_cache_size_tokens | |
| ) | |
| if ( | |
| vllm_config.cache_config.kv_cache_max_concurrency is None | |
| and response.kv_cache_max_concurrency is not None | |
| ): | |
| vllm_config.cache_config.kv_cache_max_concurrency = ( | |
| response.kv_cache_max_concurrency | |
| ) | |
| # Group-aware capacity: sum across DP engines to match num_gpu_blocks. | |
| if response.kv_cache_size_tokens is not None: | |
| vllm_config.cache_config.kv_cache_size_tokens = ( | |
| (vllm_config.cache_config.kv_cache_size_tokens or 0) + | |
| response.kv_cache_size_tokens | |
| ) | |
| if response.kv_cache_max_concurrency is not None: | |
| vllm_config.cache_config.kv_cache_max_concurrency = ( | |
| (vllm_config.cache_config.kv_cache_max_concurrency or 0.0) + | |
| response.kv_cache_max_concurrency | |
| ) |
|
Ok, some high level thoughts:
|
|
Thanks for the detailed review, @markmc. I really appreciate the architectural perspective here. A lot of your comments genuinely gave me an “ah, that makes sense” moment. I'm going to work through a cleaner revision aligned with the longer-term direction you mentioned and come back with an updated diff. |
a69b48e to
9fbe2d5
Compare
|
Updated per review. I agree this should move out of RFC #38147 looks like a better long-term direction for this kind of runtime data. So I see this as an interim solution aligned with the current metrics/config path. Please let me know if I missed anything or if there are still issues with the implementation. |
935f2a8 to
f6985d1
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
f6985d1 to
116960a
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
116960a to
551439c
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
551439c to
316db03
Compare
markmc
left a comment
There was a problem hiding this comment.
lgtm, thanks!
PTAL @heheda12345
|
Hi @chfeng-cs, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, |
ab31dc4 to
9f91679
Compare
Compute KV cache token capacity from the resolved group-aware concurrency. Store the post-init capacity and max concurrency on CacheConfig for metrics. Propagate them through EngineCoreReadyResponse while keeping cache_config_info values per DP engine; num_gpu_blocks remains the DP-aggregated block count. Add tests for capacity calculation and cache_config_info metric labels. Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>
9f91679 to
0363c9b
Compare
|
Hi @chfeng-cs, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, |
…vllm-project#42206) The startup log already reports the correct group-aware KV cache capacity for hybrid models, but Prometheus did not expose matching info in 'vllm:cache_config_info`. This PR adds kv_cache_size_tokens and kv_cache_max_concurrency. Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>
Purpose
Addresses the Prometheus vs. startup-log discrepancy for KV cache capacity in #42024.
The startup log already reports the correct group-aware KV cache capacity for hybrid models, but Prometheus did not expose matching info in 'vllm:cache_config_info`. This PR adds two:
kv_cache_size_tokens: Per-DP-engine KV cache capacity in tokens (group-aware). Uses group-aware capacity sincenum_gpu_blocks * block_sizecan be wrong for hybrid models where requests occupy multiple KV cache groups.kv_cache_max_concurrency: Per-DP-engine maximum concurrency atmax_model_lentokens.Both values are computed from the same group-aware KV cache path used at startup, and are propagated to the frontend process in multiprocess deployments. These values are per-engine, not cluster totals, so they are not summed across DP ranks.