
[Feature] Add KV cache usage metrics to iteration logging #34860

Open
maxyanghu wants to merge 7 commits into vllm-project:main from CentML:max/add-kvcache-usage

Conversation

@maxyanghu
Contributor

@maxyanghu maxyanghu commented Feb 19, 2026

Purpose

Report KV cache usage metrics in iteration-level logging.

This reports the usage percentage, used/free block counts, and the number of used tokens:

Iteration(268): ... , kv cache: 1.1% (792/68959 blocks, 12672 tokens)

We used these KV cache metrics to analyze the best MBNT for the Qwen3-VL MLPerf submission.
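As a rough illustration of how the logged suffix relates to the numbers above, here is a minimal sketch (the function name is illustrative, not vLLM's actual API; the block size of 16 is an assumption that happens to be consistent with 792 blocks → 12672 tokens in the example line):

```python
# Hypothetical sketch: derive the "kv cache: X% (used/total blocks, N tokens)"
# suffix from block counts. Names and block size are illustrative assumptions.

BLOCK_SIZE = 16  # assumed tokens per KV cache block


def format_kv_cache_usage(used_blocks: int, total_blocks: int,
                          block_size: int = BLOCK_SIZE) -> str:
    """Render the kv cache suffix shown in the iteration log lines."""
    usage_pct = 100.0 * used_blocks / total_blocks if total_blocks else 0.0
    used_tokens = used_blocks * block_size
    return (f"kv cache: {usage_pct:.1f}% "
            f"({used_blocks}/{total_blocks} blocks, {used_tokens} tokens)")


print(format_kv_cache_usage(792, 68959))
# → kv cache: 1.1% (792/68959 blocks, 12672 tokens)
```

This reproduces both example lines in this PR: 792/68959 blocks → 1.1% and 12672 tokens, and 3500/68959 blocks → 5.1% and 56000 tokens.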

Test Plan

vllm serve --enable-logging-iteration-details  --model Qwen/Qwen3-VL-8B-Instruct
vllm serve --enable-logging-iteration-details  --async-scheduling --model Qwen/Qwen3-VL-8B-Instruct

then

vllm bench serve --base-url http://127.0.0.1:8000 --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50

Test Result

(EngineCore_DP0 pid=2733609) INFO 02-18 16:21:20 [core.py:387] Iteration(188): 0 context requests, 0 context tokens, 50 generation requests, 50 generation tokens, iteration elapsed time: 10.25 ms, kv cache: 5.1% (3500/68959 blocks, 56000 tokens)
(EngineCore_DP0 pid=2733609) INFO 02-18 16:21:20 [core.py:387] Iteration(189): 0 context requests, 0 context tokens, 50 generation requests, 50 generation tokens, iteration elapsed time: 0.72 ms, kv cache: 5.1% (3500/68959 blocks, 56000 tokens)
(EngineCore_DP0 pid=2733609) INFO 02-18 16:21:20 [core.py:387] Iteration(190): 0 context requests, 0 context tokens, 50 generation requests, 50 generation tokens, iteration elapsed time: 10.37 ms, kv cache: 5.1% (3500/68959 blocks, 56000 tokens)
(EngineCore_DP0 pid=2733609) INFO 02-18 16:21:20 [core.py:387] Iteration(191): 0 context requests,


Signed-off-by: Max Hu <hyoung2991@gmail.com>
@mergify

mergify bot commented Feb 19, 2026

Hi @maxyanghu, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@gemini-code-assist
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@maxyanghu
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds KV cache usage metrics to the iteration-level logging, which is a great enhancement for observability. The implementation is mostly solid, introducing a new dataclass for the metrics, extending the scheduler interface, and updating the logging mechanism. I have one suggestion to improve the robustness and consistency of the metric calculation in the scheduler.

@wangshangsam
Collaborator

@robertgshaw2-redhat @nvpohanh @pavanimajety could you guys take a look at this PR please? Getting the KV cache usage along with the prefill/decode requests/tokens within the iteration logging was quite useful when we were doing profiling during the MLPerf Qwen3-VL efforts.

@markmc
Member

markmc commented Feb 19, 2026

@robertgshaw2-redhat I only noticed #31193 now. Before we further build on this approach, why did you go with log_iteration_details() in the engine core when we've gone to such lengths to do other such logging in the frontend?

@vadiklyutiy
Collaborator

I'd propose making the iteration-level logging look more similar to vLLM's 10-second logging:

(ApiServer_6 pid=717044) INFO 02-20 06:15:45 [loggers.py:259] Engine 003: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 402.3 tokens/s, Running: 821 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.5%, Prefix cache hit rate: 0.0%
(ApiServer_6 pid=717044) INFO 02-20 06:15:45 [loggers.py:259] Engine 004: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 402.3 tokens/s, Running: 820 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.5%, Prefix cache hit rate: 0.0%
(ApiServer_6 pid=717044) INFO 02-20 06:15:45 [loggers.py:259] Engine 005: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 402.4 tokens/s, Running: 821 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.5%, Prefix cache hit rate: 0.0%

@wangshangsam
Collaborator

@robertgshaw2-redhat I only noticed #31193 now. Before we further build on this approach, why did you go with log_iteration_details() in the engine core when we've gone to such lengths to do other such logging in the frontend?

I'm not entirely sure what you meant by "when we've gone to such lengths to do other such logging in the frontend" but, if you're asking about the context of #31193, this PR, and #35072: inside NV we have iteration-level perf projection tools that are quite critical to (especially kernel-level) perf optimization work, and the information logged here is needed for perf projection to work correctly.

@markmc
Member

markmc commented Feb 23, 2026

@robertgshaw2-redhat I only noticed #31193 now. Before we further build on this approach, why did you go with log_iteration_details() in the engine core when we've gone to such lengths to do other such logging in the frontend?

I'm not particularly sure about what you meant by "when we've gone to such lengths to do other such logging in the frontend"

I'm referring to https://docs.vllm.ai/en/latest/design/metrics/#metrics-collection

In v1, we wish to move computation and overhead out of the engine core process to minimize the time between each forward pass. [...] AsyncLLM.output_handler_loop() is the ideal place for the metrics bookkeeping if possible.
We will achieve this by collecting metrics in the frontend API server, and base these metrics on information we can glean from the EngineCoreOutputs returned by the engine core process to the frontend.

So e.g. this logging happens in the frontend process, not the engine core. And yes, it currently aggregates per-iteration stats, but if we wanted to log each iteration ... we could just not aggregate them.

Since @robertgshaw2-redhat came up with this design, I'm assuming there's a good reason for doing it differently in this case ... I'm just asking what that is

@njhill
Member

njhill commented Feb 25, 2026

I also missed #31193 originally and also don't love it. Agree with @markmc that it's hacking in some logging in a way that differs from the rest of our existing logging/metrics, and it's also quite intrusive in terms of the changes made to core parts.

@robertgshaw2-redhat
Collaborator

robertgshaw2-redhat commented Feb 25, 2026

@robertgshaw2-redhat I only noticed #31193 now. Before we further build on this approach, why did you go with log_iteration_details() in the engine core when we've gone to such lengths to do other such logging in the frontend?

Sorry Mark - I reviewed this while you were out; I should not have merged it. The reason I merged #31193 was that these iteration-level logs are used for debugging performance and annotate the nvtx markers, which are intended to be consumed by a profiler --- and thus need to be emitted in the GPU worker. I thought this was a valid use case for debugging [different from the operator use case for the typical logs we do in the frontend], but I see now that it could open up a can of worms, and also we could have added the annotations without doing the logging in the engine core.

@njhill
Member

njhill commented Feb 25, 2026

I think we can refactor for it to be cleaner.

@mergify

mergify bot commented Feb 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxyanghu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 26, 2026