[Metrics] Add group-aware KV cache capacity to vllm:cache_config_info by chfeng-cs · Pull Request #42206 · vllm-project/vllm

chfeng-cs · 2026-05-10T05:28:41Z

Purpose

Addresses the Prometheus vs. startup-log discrepancy for KV cache capacity in #42024.

The startup log already reports the correct group-aware KV cache capacity for hybrid models, but Prometheus did not expose matching info in `vllm:cache_config_info`. This PR adds two labels:

kv_cache_size_tokens : Per-DP-engine KV cache capacity in tokens (group-aware). Uses group-aware capacity since num_gpu_blocks * block_size can be wrong for hybrid models where requests occupy multiple KV cache groups.
kv_cache_max_concurrency : Per-DP-engine maximum concurrency at max_model_len tokens.

Both values are computed from the same group-aware KV cache path used at startup, and are propagated to the frontend process in multiprocess deployments. These values are per-engine, not cluster totals, so they are not summed across DP ranks.
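As a rough illustration of the discrepancy described above, the sketch below contrasts the naive formula with a capacity derived from max concurrency. The numbers and all function names are hypothetical and are not vLLM APIs. The toy model assumes every request needs blocks in each KV cache group, which is a deliberate simplification of how hybrid models allocate. The relation tokens = max_concurrency * max_model_len is the one raised in the review discussion; truncating to an integer is an assumption.

```python
import math


def naive_capacity_tokens(num_gpu_blocks: int, block_size: int) -> int:
    """Capacity if every block held tokens for a single KV cache group."""
    return num_gpu_blocks * block_size


def toy_max_concurrency(
    num_gpu_blocks: int, block_size: int, max_model_len: int, num_groups: int
) -> float:
    """Toy model (assumption): a max_model_len request needs
    ceil(max_model_len / block_size) blocks in EACH of num_groups groups."""
    blocks_per_request = num_groups * math.ceil(max_model_len / block_size)
    return num_gpu_blocks / blocks_per_request


def group_aware_capacity_tokens(max_concurrency: float, max_model_len: int) -> int:
    """Token capacity derived from the resolved max concurrency."""
    return int(max_concurrency * max_model_len)


# Hypothetical engine: 1000 blocks of 16 tokens, two KV cache groups.
naive = naive_capacity_tokens(num_gpu_blocks=1000, block_size=16)
concurrency = toy_max_concurrency(1000, 16, max_model_len=4000, num_groups=2)
group_aware = group_aware_capacity_tokens(concurrency, max_model_len=4000)
print(naive, concurrency, group_aware)
```

In this toy setup the naive formula reports twice the capacity that the concurrency-based calculation does, which is the kind of mismatch between metrics and the startup log that the PR targets.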

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request introduces group-aware KV cache capacity metrics, specifically vllm:kv_cache_size_tokens and vllm:kv_cache_max_concurrency, to provide more accurate reporting for hybrid models where naive block-based calculations overestimate capacity. The changes include updates to the cache configuration, engine initialization logic, and Prometheus logging, along with new unit tests. Review feedback identifies a critical logic error in the Data Parallel (DP) implementation where these metrics are not summed across engines, leading to an under-reporting of total system capacity. Actionable suggestions were provided to aggregate these values in the client and update the test suite to reflect the expected summed totals.

gemini-code-assist · 2026-05-10T05:37:50Z

+        if (
+            vllm_config.cache_config.kv_cache_size_tokens is None
+            and response.kv_cache_size_tokens is not None
+        ):
+            vllm_config.cache_config.kv_cache_size_tokens = (
+                response.kv_cache_size_tokens
+            )
+        if (
+            vllm_config.cache_config.kv_cache_max_concurrency is None
+            and response.kv_cache_max_concurrency is not None
+        ):
+            vllm_config.cache_config.kv_cache_max_concurrency = (
+                response.kv_cache_max_concurrency
+            )


The group-aware capacity metrics (kv_cache_size_tokens and kv_cache_max_concurrency) should be summed across Data Parallel (DP) engines, similar to how num_gpu_blocks is handled at line 681. Currently, the code only takes the value from the first engine response. In a DP setup with multiple engines, this results in a reported capacity that is only a fraction of the actual system capacity, creating a significant inconsistency with the num_gpu_blocks reported in vllm:cache_config_info (which is summed). Summing these values ensures that the Prometheus gauges correctly represent the total capacity of the served model.

Suggested change

-        if (
-            vllm_config.cache_config.kv_cache_size_tokens is None
-            and response.kv_cache_size_tokens is not None
-        ):
-            vllm_config.cache_config.kv_cache_size_tokens = (
-                response.kv_cache_size_tokens
-            )
-        if (
-            vllm_config.cache_config.kv_cache_max_concurrency is None
-            and response.kv_cache_max_concurrency is not None
-        ):
-            vllm_config.cache_config.kv_cache_max_concurrency = (
-                response.kv_cache_max_concurrency
-            )
+        # Group-aware capacity: sum across DP engines to match num_gpu_blocks.
+        if response.kv_cache_size_tokens is not None:
+            vllm_config.cache_config.kv_cache_size_tokens = (
+                (vllm_config.cache_config.kv_cache_size_tokens or 0) +
+                response.kv_cache_size_tokens
+            )
+        if response.kv_cache_max_concurrency is not None:
+            vllm_config.cache_config.kv_cache_max_concurrency = (
+                (vllm_config.cache_config.kv_cache_max_concurrency or 0.0) +
+                response.kv_cache_max_concurrency
+            )
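To make the difference between the two approaches concrete, here is a minimal sketch of both merge strategies applied to responses from two DP engines. `ReadyResponse` and `CacheState` are hypothetical stand-ins, not the real `EngineCoreReadyResponse` or `CacheConfig`, and the numbers are made up. Which behaviour is preferable depends on whether the value is meant as a per-engine figure or a cluster total.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadyResponse:
    """Hypothetical stand-in for one engine's ready response."""
    num_gpu_blocks: int
    kv_cache_size_tokens: Optional[int] = None


@dataclass
class CacheState:
    """Hypothetical stand-in for the frontend's cache state."""
    num_gpu_blocks: int = 0
    kv_cache_size_tokens: Optional[int] = None


def merge_first_wins(state: CacheState, response: ReadyResponse) -> None:
    """Keep the first engine's token capacity (a per-engine value)."""
    state.num_gpu_blocks += response.num_gpu_blocks
    if state.kv_cache_size_tokens is None and response.kv_cache_size_tokens is not None:
        state.kv_cache_size_tokens = response.kv_cache_size_tokens


def merge_summed(state: CacheState, response: ReadyResponse) -> None:
    """Sum token capacity across engines (a cluster total)."""
    state.num_gpu_blocks += response.num_gpu_blocks
    if response.kv_cache_size_tokens is not None:
        state.kv_cache_size_tokens = (
            (state.kv_cache_size_tokens or 0) + response.kv_cache_size_tokens
        )


# Two DP engines with identical, made-up capacities.
responses = [ReadyResponse(1000, 8000), ReadyResponse(1000, 8000)]
per_engine, summed = CacheState(), CacheState()
for response in responses:
    merge_first_wins(per_engine, response)
    merge_summed(summed, response)
print(per_engine, summed)
```

Both strategies aggregate `num_gpu_blocks` here; they differ only in whether the token capacity describes one engine or all of them.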

markmc · 2026-05-13T15:13:16Z

Ok, some high level thoughts:

Problem statement - the naive num_gpu_blocks * block_size formula that made sense with uniform KV cache does not work for hybrid models. Users need an alternative to observe/introspect "KV cache capacity in tokens"
Question - do we want to expose max_model_len and max_concurrency (allowing users to do num_tokens = max_concurrency * max_model_len), or do we simply want to report kv_cache_size_tokens ?
Either way, we need to propagate the info reported in _report_kv_cache_config() from the engine core process to the frontend process - that means adding at least max_concurrency to EngineCoreReadyResponse, but maybe num_tokens too
I really don't like the way we store runtime state like num_gpu_tokens in VllmConfig, but I guess I'm fine with also adding max_concurrency (and maybe num_tokens) to it for now. See [KV Connector] Make KVCacheConfig an explicit constructor argument #27887 (comment)
I'm also not a huge fan of a Prometheus metric that is only set once on startup and never changes - it may be common enough practise, but it seems like a poor fit for a time-series DB to me
We already have the vllm:cache_config_info metric that does this, which I also don't like, but I'd prefer we add this new info there rather than add a new metric
[RFC]: Endpoints to Retrieve User Configuration and Runtime Data #38147 is a more promising long-term direction - an explicit API for introspecting the server, which might have something like /server_info/kv_cache
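For readers unfamiliar with info-style metrics, the sketch below shows roughly what adding fields to such a metric looks like in the Prometheus text exposition format: a constant sample whose data lives entirely in its labels. The helper `render_info_metric` is hypothetical and uses only the standard library, and the label set and values are illustrative rather than the exact labels vLLM emits.

```python
def render_info_metric(name: str, labels: dict[str, object]) -> str:
    """Render one info-style sample line in Prometheus text format."""

    def escape(value: object) -> str:
        # Label values escape backslash, double quote, and newline.
        return (
            str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )

    body = ",".join(f'{key}="{escape(value)}"' for key, value in labels.items())
    # The sample value is a constant; the information is in the labels.
    return f"{name}{{{body}}} 1.0"


# Illustrative labels only; values are made up.
line = render_info_metric(
    "vllm:cache_config_info",
    {
        "block_size": 16,
        "kv_cache_size_tokens": 8000,
        "kv_cache_max_concurrency": 2.0,
    },
)
print(line)
```

Because the sample value never changes after startup, the series carries no time-varying signal, which is the poor fit for a time-series database mentioned above.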

markmc

More concrete feedback inline

chfeng-cs · 2026-05-13T16:20:20Z

Thanks for the detailed review, @markmc. I really appreciate the architectural perspective here. A lot of your comments genuinely gave me an “ah, that makes sense” moment. I'm going to work through a cleaner revision aligned with the longer-term direction you mentioned and come back with an updated diff.

chfeng-cs · 2026-05-19T19:02:42Z

Updated per review.

I agree this should move out of VllmConfig if we later split runtime KV cache state from user config. For now, I kept it in CacheConfig, following the existing num_gpu_blocks post-init state pattern.

RFC #38147 looks like a better long-term direction for this kind of runtime data. So I see this as an interim solution aligned with the current metrics/config path.

Please let me know if I missed anything or if there are still issues with the implementation.

mergify · 2026-05-22T11:56:41Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chfeng-cs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-05-29T01:08:41Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chfeng-cs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

chfeng-cs · 2026-06-01T16:56:15Z

Hi @markmc, I’ve updated the implementation in the latest revision. If you have a moment, could you take another look?

I also noticed that the related PR #42967 is still unmerged. If that lands first, I’m happy to rebase and resolve any resulting conflicts.

mergify · 2026-06-03T15:44:55Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chfeng-cs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

markmc

lgtm, thanks!

PTAL @heheda12345

mergify · 2026-06-11T18:15:56Z

Hi @chfeng-cs, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

chfeng-cs · 2026-06-12T03:00:37Z

Drive-by fix: Updated fake_allocate_slots_fn in test_mamba_prefix_cache.py to include has_scheduled_reqs added in #44594 — required for CI to pass.

======
Reverted — this is already fixed in #45345. Will rebase once that merges.

Compute KV cache token capacity from the resolved group-aware concurrency. Store the post-init capacity and max concurrency on CacheConfig for metrics. Propagate them through EngineCoreReadyResponse while keeping cache_config_info values per DP engine; num_gpu_blocks remains the DP-aggregated block count. Add tests for capacity calculation and cache_config_info metric labels.

Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>

mergify · 2026-06-12T05:07:24Z

Hi @chfeng-cs, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>

…vllm-project#42206) The startup log already reports the correct group-aware KV cache capacity for hybrid models, but Prometheus did not expose matching info in `vllm:cache_config_info`. This PR adds kv_cache_size_tokens and kv_cache_max_concurrency. Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>

chfeng-cs requested review from DarkLight1337, NickLucche, aarnphm, heheda12345, markmc, njhill and robertgshaw2-redhat as code owners May 10, 2026 05:28

claude Bot reviewed May 10, 2026

mergify Bot added the v1 label May 10, 2026

gemini-code-assist Bot reviewed May 10, 2026

markmc added this to Prometheus Metrics May 11, 2026

github-project-automation Bot moved this to Backlog in Prometheus Metrics May 11, 2026

markmc mentioned this pull request May 13, 2026

[RFC]: Endpoints to Retrieve User Configuration and Runtime Data #38147

Open

1 task

markmc moved this from Backlog to In Review in Prometheus Metrics May 13, 2026

markmc requested changes May 13, 2026

Comment thread vllm/v1/metrics/loggers.py Outdated

Comment thread vllm/v1/engine/core.py Outdated

Comment thread tests/v1/metrics/test_kv_cache_metrics.py Outdated

markmc mentioned this pull request May 18, 2026

[Bugfix] Sync block_size from EngineCore to frontend for hybrid Mamba… #42967

Merged

4 tasks

chfeng-cs force-pushed the group-aware-kv-cache-capacity branch from a69b48e to 9fbe2d5 Compare May 19, 2026 18:47

chfeng-cs requested review from ApostaC, WoosukKwon, alexm-redhat, orozery and ywang96 as code owners May 19, 2026 18:47

chfeng-cs force-pushed the group-aware-kv-cache-capacity branch from 935f2a8 to f6985d1 Compare May 22, 2026 11:55

mergify Bot added the needs-rebase label May 22, 2026

chfeng-cs force-pushed the group-aware-kv-cache-capacity branch from f6985d1 to 116960a Compare May 22, 2026 12:15

mergify Bot removed the needs-rebase label May 22, 2026

mergify Bot added the needs-rebase label May 29, 2026

chfeng-cs force-pushed the group-aware-kv-cache-capacity branch from 116960a to 551439c Compare June 1, 2026 16:24

chfeng-cs requested a review from AndreasKaratzas as a code owner June 1, 2026 16:24

mergify Bot removed the needs-rebase label Jun 1, 2026

mergify Bot added the needs-rebase label Jun 3, 2026

chfeng-cs force-pushed the group-aware-kv-cache-capacity branch from 551439c to 316db03 Compare June 6, 2026 12:36

mergify Bot removed the needs-rebase label Jun 6, 2026

markmc approved these changes Jun 11, 2026

markmc added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 11, 2026

markmc moved this from In Review to Ready in Prometheus Metrics Jun 11, 2026

chfeng-cs force-pushed the group-aware-kv-cache-capacity branch from ab31dc4 to 9f91679 Compare June 12, 2026 02:58

chfeng-cs force-pushed the group-aware-kv-cache-capacity branch from 9f91679 to 0363c9b Compare June 12, 2026 03:30

Merge branch 'main' into group-aware-kv-cache-capacity

2bdc6f4

chfeng-cs and others added 2 commits June 12, 2026 16:25

Merge branch 'main' into group-aware-kv-cache-capacity

e14f93d

ci: rerun

8ae4298

Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>

markmc changed the title ~~[Metrics] Add group-aware KV cache capacity Prometheus gauges~~ [Metrics] Add group-aware KV cache capacity to vllm:cache_config_info Jun 12, 2026

markmc enabled auto-merge (squash) June 12, 2026 11:09

markmc merged commit b7f9b6a into vllm-project:main Jun 12, 2026
81 checks passed

github-project-automation Bot moved this from Ready to Done in Prometheus Metrics Jun 12, 2026
