
[Metrics] Add group-aware KV cache capacity to vllm:cache_config_info #42206

Merged
markmc merged 4 commits into vllm-project:main from chfeng-cs:group-aware-kv-cache-capacity
Jun 12, 2026

Conversation

@chfeng-cs

@chfeng-cs chfeng-cs commented May 10, 2026

Contributor

Purpose

Addresses the Prometheus vs. startup-log discrepancy for KV cache capacity in #42024.

The startup log already reports the correct group-aware KV cache capacity for hybrid models, but Prometheus did not expose matching information in `vllm:cache_config_info`. This PR adds two labels:

  • kv_cache_size_tokens : Per-DP-engine KV cache capacity in tokens (group-aware). Uses group-aware capacity since num_gpu_blocks * block_size can be wrong for hybrid models where requests occupy multiple KV cache groups.
  • kv_cache_max_concurrency : Per-DP-engine maximum concurrency at max_model_len tokens.

Both values are computed from the same group-aware KV cache path used at startup, and are propagated to the frontend process in multiprocess deployments. These values are per-engine, not cluster totals, so they are not summed across DP ranks.
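
For illustration, the gap between the naive figure and the group-aware figure can be sketched as below. This is a minimal sketch and not the code in this PR: the function names and the example numbers are hypothetical, and it assumes the token capacity is derived as max concurrency times max_model_len, which is the relationship discussed later in this thread.

```python
# Hypothetical helpers; names and numbers are illustrative assumptions only.

def naive_capacity_tokens(num_gpu_blocks: int, block_size: int) -> int:
    """Uniform-cache estimate: every block holds block_size tokens."""
    return num_gpu_blocks * block_size


def group_aware_capacity_tokens(max_concurrency: float, max_model_len: int) -> int:
    """Capacity derived from how many max-length requests fit at once.

    Assumes max_concurrency was already resolved by group-aware logic,
    so requests that span several KV cache groups are accounted for.
    """
    return int(max_concurrency * max_model_len)


# Made-up example values for one DP engine.
naive = naive_capacity_tokens(num_gpu_blocks=1000, block_size=16)
group_aware = group_aware_capacity_tokens(max_concurrency=2.5, max_model_len=4096)
print(f"naive={naive} tokens, group-aware={group_aware} tokens")
```

With these made-up inputs the naive estimate is larger than the group-aware one, which is the kind of mismatch between Prometheus and the startup log that the PR description refers to.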

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the v1 label May 10, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces group-aware KV cache capacity metrics, specifically vllm:kv_cache_size_tokens and vllm:kv_cache_max_concurrency, to provide more accurate reporting for hybrid models where naive block-based calculations overestimate capacity. The changes include updates to the cache configuration, engine initialization logic, and Prometheus logging, along with new unit tests. Review feedback identifies a critical logic error in the Data Parallel (DP) implementation where these metrics are not summed across engines, leading to an under-reporting of total system capacity. Actionable suggestions were provided to aggregate these values in the client and update the test suite to reflect the expected summed totals.

Comment thread vllm/v1/engine/core_client.py Outdated
Comment on lines +686 to +699
if (
    vllm_config.cache_config.kv_cache_size_tokens is None
    and response.kv_cache_size_tokens is not None
):
    vllm_config.cache_config.kv_cache_size_tokens = (
        response.kv_cache_size_tokens
    )
if (
    vllm_config.cache_config.kv_cache_max_concurrency is None
    and response.kv_cache_max_concurrency is not None
):
    vllm_config.cache_config.kv_cache_max_concurrency = (
        response.kv_cache_max_concurrency
    )

Contributor


high

The group-aware capacity metrics (kv_cache_size_tokens and kv_cache_max_concurrency) should be summed across Data Parallel (DP) engines, similar to how num_gpu_blocks is handled at line 681. Currently, the code only takes the value from the first engine response. In a DP setup with multiple engines, this results in a reported capacity that is only a fraction of the actual system capacity, creating a significant inconsistency with the num_gpu_blocks reported in vllm:cache_config_info (which is summed). Summing these values ensures that the Prometheus gauges correctly represent the total capacity of the served model.

Suggested change
- if (
-     vllm_config.cache_config.kv_cache_size_tokens is None
-     and response.kv_cache_size_tokens is not None
- ):
-     vllm_config.cache_config.kv_cache_size_tokens = (
-         response.kv_cache_size_tokens
-     )
- if (
-     vllm_config.cache_config.kv_cache_max_concurrency is None
-     and response.kv_cache_max_concurrency is not None
- ):
-     vllm_config.cache_config.kv_cache_max_concurrency = (
-         response.kv_cache_max_concurrency
-     )
+ # Group-aware capacity: sum across DP engines to match num_gpu_blocks.
+ if response.kv_cache_size_tokens is not None:
+     vllm_config.cache_config.kv_cache_size_tokens = (
+         (vllm_config.cache_config.kv_cache_size_tokens or 0) +
+         response.kv_cache_size_tokens
+     )
+ if response.kv_cache_max_concurrency is not None:
+     vllm_config.cache_config.kv_cache_max_concurrency = (
+         (vllm_config.cache_config.kv_cache_max_concurrency or 0.0) +
+         response.kv_cache_max_concurrency
+     )

Comment thread tests/v1/metrics/test_kv_cache_metrics.py Outdated
Comment thread tests/v1/metrics/test_kv_cache_metrics.py Outdated
@markmc

markmc commented May 13, 2026

Member

Ok, some high level thoughts:

  • Problem statement - the naive num_gpu_blocks * block_size formula that made sense with uniform KV cache does not work for hybrid models. Users need an alternative to observe/introspect "KV cache capacity in tokens"
  • Question - do we want to expose max_model_len and max_concurrency (allowing users to do num_tokens = max_concurrency * max_model_len), or do we simply want to report kv_cache_size_tokens?
  • Either way, we need to propagate the info reported in _report_kv_cache_config() from the engine core process to the frontend process - that means adding at least max_concurrency to EngineCoreReadyResponse, but maybe num_tokens too
  • I really don't like the way we store runtime state like num_gpu_tokens in VllmConfig, but I guess I'm fine with also adding max_concurrency (and maybe num_tokens) to it for now. See [KV Connector] Make KVCacheConfig an explicit constructor argument #27887 (comment)
  • I'm also not a huge fan of a Prometheus metric that is only set once on startup and never changes - it may be common enough practice, but it seems like a poor fit for a time-series DB to me
  • We already have the vllm:cache_config_info metric that does this, which I also don't like, but I'd prefer we add this new info there rather than add a new metric
  • [RFC]: Endpoints to Retrieve User Configuration and Runtime Data #38147 is a more promising long-term direction - an explicit API for introspecting the server, which might have something like /server_info/kv_cache
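
To make the "add it to vllm:cache_config_info" option concrete, here is a rough sketch of what an info-style sample could look like in the Prometheus text exposition format: a constant value of 1 with the configuration carried as labels. The helper names and label values are hypothetical, and this hand-rolled formatter is only an assumption for illustration; it is not how vLLM registers the metric.

```python
# Hypothetical formatter for an info-style metric sample (illustration only).

def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_info_sample(name: str, labels: dict[str, object]) -> str:
    """Render one sample line: name{label="value",...} 1"""
    pairs = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} 1"


# Made-up label values for one DP engine.
line = render_info_sample(
    "vllm:cache_config_info",
    {"kv_cache_size_tokens": 10240, "kv_cache_max_concurrency": 2.5},
)
print(line)
```

Because the values live in labels rather than in the sample value, a dashboard would read them from the label set instead of graphing them over time, which matches the "set once on startup" nature noted above.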

@markmc markmc left a comment

Member


More concrete feedback inline

Comment thread vllm/v1/metrics/loggers.py Outdated
Comment thread vllm/v1/engine/core.py Outdated
Comment thread tests/v1/metrics/test_kv_cache_metrics.py Outdated
@chfeng-cs

Contributor Author

Thanks for the detailed review, @markmc. I really appreciate the architectural perspective here. A lot of your comments genuinely gave me an “ah, that makes sense” moment. I'm going to work through a cleaner revision aligned with the longer-term direction you mentioned and come back with an updated diff.

@chfeng-cs

Contributor Author

Updated per review.

I agree this should move out of VllmConfig if we later split runtime KV cache state from user config. For now, I kept it in CacheConfig, following the existing num_gpu_blocks post-init state pattern.

RFC #38147 looks like a better long-term direction for this kind of runtime data. So I see this as an interim solution aligned with the current metrics/config path.

Please let me know if I missed anything or if there are still issues with the implementation.

@chfeng-cs chfeng-cs force-pushed the group-aware-kv-cache-capacity branch from 935f2a8 to f6985d1 Compare May 22, 2026 11:55
@mergify

mergify Bot commented May 22, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chfeng-cs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 22, 2026
@chfeng-cs chfeng-cs force-pushed the group-aware-kv-cache-capacity branch from f6985d1 to 116960a Compare May 22, 2026 12:15
@mergify mergify Bot removed the needs-rebase label May 22, 2026
@mergify

mergify Bot commented May 29, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chfeng-cs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 29, 2026
@chfeng-cs chfeng-cs force-pushed the group-aware-kv-cache-capacity branch from 116960a to 551439c Compare June 1, 2026 16:24
@mergify mergify Bot removed the needs-rebase label Jun 1, 2026
@chfeng-cs

Contributor Author

Hi @markmc, I’ve updated the implementation in the latest revision. If you have a moment, could you take another look?

I also noticed that the related PR #42967 is still unmerged. If that lands first, I’m happy to rebase and resolve any resulting conflicts.

@mergify

mergify Bot commented Jun 3, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chfeng-cs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 3, 2026
@chfeng-cs chfeng-cs force-pushed the group-aware-kv-cache-capacity branch from 551439c to 316db03 Compare June 6, 2026 12:36
@mergify mergify Bot removed the needs-rebase label Jun 6, 2026

@markmc markmc left a comment

Member


lgtm, thanks!

PTAL @heheda12345

@markmc markmc added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 11, 2026
@markmc markmc moved this from In Review to Ready in Prometheus Metrics Jun 11, 2026
@mergify

mergify Bot commented Jun 11, 2026

Contributor

Hi @chfeng-cs, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@chfeng-cs chfeng-cs force-pushed the group-aware-kv-cache-capacity branch from ab31dc4 to 9f91679 Compare June 12, 2026 02:58
@chfeng-cs

chfeng-cs commented Jun 12, 2026

Contributor Author

Drive-by fix: Updated fake_allocate_slots_fn in test_mamba_prefix_cache.py to include has_scheduled_reqs added in #44594 — required for CI to pass.

======
Reverted — this is already fixed in #45345. Will rebase once that merges.

Compute KV cache token capacity from the resolved group-aware concurrency.
Store the post-init capacity and max concurrency on CacheConfig for metrics.
Propagate them through EngineCoreReadyResponse while keeping cache_config_info
values per DP engine; num_gpu_blocks remains the DP-aggregated block count.
Add tests for capacity calculation and cache_config_info metric labels.

Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>
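
The propagation rule described in the commit message above can be sketched as follows. This is a simplified illustration under stated assumptions, not the merged implementation: `ReadyResponse` and `CacheState` are hypothetical stand-ins for EngineCoreReadyResponse and the runtime fields on CacheConfig, and `merge_ready_response` is a made-up helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadyResponse:
    """Hypothetical stand-in for what one DP engine reports when ready."""
    num_gpu_blocks: int
    kv_cache_size_tokens: Optional[int] = None
    kv_cache_max_concurrency: Optional[float] = None


@dataclass
class CacheState:
    """Hypothetical stand-in for the post-init fields kept for metrics."""
    num_gpu_blocks: int = 0
    kv_cache_size_tokens: Optional[int] = None
    kv_cache_max_concurrency: Optional[float] = None


def merge_ready_response(state: CacheState, response: ReadyResponse) -> None:
    # Block count is aggregated across DP engines.
    state.num_gpu_blocks += response.num_gpu_blocks
    # Group-aware values stay per engine: keep the first reported value.
    if state.kv_cache_size_tokens is None:
        state.kv_cache_size_tokens = response.kv_cache_size_tokens
    if state.kv_cache_max_concurrency is None:
        state.kv_cache_max_concurrency = response.kv_cache_max_concurrency


# Two DP engines reporting identical, made-up values.
state = CacheState()
for _ in range(2):
    merge_ready_response(state, ReadyResponse(1000, 10240, 2.5))
print(state)
```

In this sketch the block count doubles with two engines while the token capacity and max concurrency keep their per-engine values, so the two kinds of fields are intentionally not comparable by simple multiplication.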
@chfeng-cs chfeng-cs force-pushed the group-aware-kv-cache-capacity branch from 9f91679 to 0363c9b Compare June 12, 2026 03:30
@mergify

mergify Bot commented Jun 12, 2026

Contributor

Hi @chfeng-cs, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

chfeng-cs and others added 2 commits June 12, 2026 16:25
Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>
@markmc markmc changed the title from "[Metrics] Add group-aware KV cache capacity Prometheus gauges" to "[Metrics] Add group-aware KV cache capacity to vllm:cache_config_info" Jun 12, 2026
@markmc markmc enabled auto-merge (squash) June 12, 2026 11:09
@markmc markmc merged commit b7f9b6a into vllm-project:main Jun 12, 2026
81 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in Prometheus Metrics Jun 12, 2026
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
…vllm-project#42206)

The startup log already reports the correct group-aware KV cache capacity for
hybrid models, but Prometheus did not expose matching info in `vllm:cache_config_info`.

This PR adds kv_cache_size_tokens and kv_cache_max_concurrency.

Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1


2 participants