
[Feature] Add KV cache usage metrics to iteration logging #34860

Open
maxyanghu wants to merge 7 commits into vllm-project:main from CentML:max/add-kvcache-usage

Conversation

@maxyanghu
Contributor

@maxyanghu maxyanghu commented Feb 19, 2026

Purpose

Report KV cache usage metrics in iteration-level logging.

This reports the usage percentage, used/free block counts, and the number of used tokens:

Iteration(268): ... , kv cache: 1.1% (792/68959 blocks, 12672 tokens)

We used these KV cache metrics to analyze the best MBNT for the Qwen3-VL MLPerf submission.
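As a rough illustration of how the logged suffix relates to the numbers above, here is a minimal sketch (the function name is illustrative, not vLLM's actual API; the block size of 16 is an assumption that happens to be consistent with 792 blocks → 12672 tokens in the example line):

```python
# Hypothetical sketch: derive the "kv cache: X% (used/total blocks, N tokens)"
# suffix from block counts. Names and block size are illustrative assumptions.

BLOCK_SIZE = 16  # assumed tokens per KV cache block


def format_kv_cache_usage(used_blocks: int, total_blocks: int,
                          block_size: int = BLOCK_SIZE) -> str:
    """Render the kv cache suffix shown in the iteration log lines."""
    usage_pct = 100.0 * used_blocks / total_blocks if total_blocks else 0.0
    used_tokens = used_blocks * block_size
    return (f"kv cache: {usage_pct:.1f}% "
            f"({used_blocks}/{total_blocks} blocks, {used_tokens} tokens)")


print(format_kv_cache_usage(792, 68959))
# → kv cache: 1.1% (792/68959 blocks, 12672 tokens)
```

This reproduces both example lines in this PR: 792/68959 blocks → 1.1% and 12672 tokens, and 3500/68959 blocks → 5.1% and 56000 tokens.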

Test Plan

vllm serve --enable-logging-iteration-details  --model Qwen/Qwen3-VL-8B-Instruct
vllm serve --enable-logging-iteration-details  --async-scheduling --model Qwen/Qwen3-VL-8B-Instruct

then

vllm bench serve --base-url http://127.0.0.1:8000 --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50

Test Result

(EngineCore_DP0 pid=2733609) INFO 02-18 16:21:20 [core.py:387] Iteration(188): 0 context requests, 0 context tokens, 50 generation requests, 50 generation tokens, iteration elapsed time: 10.25 ms, kv cache: 5.1% (3500/68959 blocks, 56000 tokens)
(EngineCore_DP0 pid=2733609) INFO 02-18 16:21:20 [core.py:387] Iteration(189): 0 context requests, 0 context tokens, 50 generation requests, 50 generation tokens, iteration elapsed time: 0.72 ms, kv cache: 5.1% (3500/68959 blocks, 56000 tokens)
(EngineCore_DP0 pid=2733609) INFO 02-18 16:21:20 [core.py:387] Iteration(190): 0 context requests, 0 context tokens, 50 generation requests, 50 generation tokens, iteration elapsed time: 10.37 ms, kv cache: 5.1% (3500/68959 blocks, 56000 tokens)
(EngineCore_DP0 pid=2733609) INFO 02-18 16:21:20 [core.py:387] Iteration(191): 0 context requests,


Signed-off-by: Max Hu <hyoung2991@gmail.com>
@mergify

mergify bot commented Feb 19, 2026

Hi @maxyanghu, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@gemini-code-assist
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@maxyanghu
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds KV cache usage metrics to the iteration-level logging, which is a great enhancement for observability. The implementation is mostly solid, introducing a new dataclass for the metrics, extending the scheduler interface, and updating the logging mechanism. I have one suggestion to improve the robustness and consistency of the metric calculation in the scheduler.

@wangshangsam
Collaborator

@robertgshaw2-redhat @nvpohanh @pavanimajety could you guys take a look at this PR please? Getting the KV cache usage along with the prefill/decode requests/tokens within the iteration logging was quite useful when we were doing profiling during the MLPerf Qwen3-VL efforts.

@markmc
Member

markmc commented Feb 19, 2026

@robertgshaw2-redhat I only noticed #31193 now. Before we further build on this approach, why did you go with log_iteration_details() in the engine core when we've gone to such lengths to do other such logging in the frontend?

@vadiklyutiy
Collaborator

I'd propose making the iteration-level logging look more similar to vLLM's 10-second logging:

(ApiServer_6 pid=717044) INFO 02-20 06:15:45 [loggers.py:259] Engine 003: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 402.3 tokens/s, Running: 821 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.5%, Prefix cache hit rate: 0.0%
(ApiServer_6 pid=717044) INFO 02-20 06:15:45 [loggers.py:259] Engine 004: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 402.3 tokens/s, Running: 820 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.5%, Prefix cache hit rate: 0.0%
(ApiServer_6 pid=717044) INFO 02-20 06:15:45 [loggers.py:259] Engine 005: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 402.4 tokens/s, Running: 821 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.5%, Prefix cache hit rate: 0.0%

@wangshangsam
Collaborator

@robertgshaw2-redhat I only noticed #31193 now. Before we further build on this approach, why did you go with log_iteration_details() in the engine core when we've gone to such lengths to do other such logging in the frontend?

I'm not entirely sure what you meant by "when we've gone to such lengths to do other such logging in the frontend" but, if you're asking about the context of #31193, this PR, and #35072: inside NV we have iteration-level perf projection tools that are quite critical to (especially kernel-level) perf optimization work, and the information logged here is needed for perf projection to work correctly.

@markmc
Member

markmc commented Feb 23, 2026

@robertgshaw2-redhat I only noticed #31193 now. Before we further build on this approach, why did you go with log_iteration_details() in the engine core when we've gone to such lengths to do other such logging in the frontend?

I'm not particularly sure about what you meant by "when we've gone to such lengths to do other such logging in the frontend"

I'm referring to https://docs.vllm.ai/en/latest/design/metrics/#metrics-collection

In v1, we wish to move computation and overhead out of the engine core process to minimize the time between each forward pass. [...] AsyncLLM.output_handler_loop() is the ideal place for the metrics bookkeeping if possible.
We will achieve this by collecting metrics in the frontend API server, and base these metrics on information we can glean from the EngineCoreOutputs returned by the engine core process to the frontend.

So e.g. this logging happens in the frontend process, not the engine core. And yes, it currently aggregates per-iteration stats, but if we wanted to log each iteration ... we could just not aggregate them.

Since @robertgshaw2-redhat came up with this design, I'm assuming there's a good reason for doing it differently in this case ... I'm just asking what that is

@njhill
Member

njhill commented Feb 25, 2026

I also missed #31193 originally and also don't love it. Agree with @markmc that it's hacking in some logging in a way that differs from the rest of our existing logging/metrics, and it's also quite intrusive in terms of the changes made to core parts.

@robertgshaw2-redhat
Collaborator

robertgshaw2-redhat commented Feb 25, 2026

@robertgshaw2-redhat I only noticed #31193 now. Before we further build on this approach, why did you go with log_iteration_details() in the engine core when we've gone to such lengths to do other such logging in the frontend?

Sorry Mark - I reviewed this while you were out; I should not have merged it. The reason I merged #31193 was that these iteration-level logs are used for debugging performance and annotate the nvtx markers, which are intended to be consumed by a profiler --- and thus need to be emitted in the GPU worker. I thought this was a valid use case for debugging [different from the operator use case for the typical logs we do in the frontend], but I see now that it could open up a can of worms, and also we could have added the annotations without doing the logging in the engine core.

@njhill
Member

njhill commented Feb 25, 2026

I think we can refactor for it to be cleaner.

@mergify

mergify bot commented Feb 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxyanghu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 26, 2026