[Feature] Add KV cache usage metrics to iteration logging #34860
maxyanghu wants to merge 7 commits into vllm-project:main
Conversation
Hi @maxyanghu, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Warning: Gemini encountered an error creating the review. You can try again by commenting `/gemini review`.
Signed-off-by: Max Hu <hyoung2991@gmail.com>
/gemini review
Code Review
This pull request adds KV cache usage metrics to the iteration-level logging, which is a great enhancement for observability. The implementation is mostly solid, introducing a new dataclass for the metrics, extending the scheduler interface, and updating the logging mechanism. I have one suggestion to improve the robustness and consistency of the metric calculation in the scheduler.
@robertgshaw2-redhat @nvpohanh @pavanimajety could you take a look at this PR, please? Getting the KV cache usage, along with the prefill/decode requests/tokens, within the iteration logging was quite useful when we were doing profiling during the MLPerf Qwen3-VL efforts.
@robertgshaw2-redhat I only noticed #31193 now. Before we further build on this approach, why did you go with
I'd propose making the iteration-level logging look more similar to vLLM's 10-second logging:
I'm not entirely sure what you meant by "when we've gone to such lengths to do other such logging in the frontend", but if you are asking about the context of #31193, this PR, and #35072: inside NV we have iteration-level perf projection tools that are quite critical to (especially kernel-level) perf optimization work, and the information being logged here is needed for perf projection to work correctly.
I'm referring to https://docs.vllm.ai/en/latest/design/metrics/#metrics-collection
So e.g. this logging happens in the frontend process, not the engine core. And yes, it currently aggregates per-iteration stats, but if we wanted to log each iteration ... we could just not aggregate them. Since @robertgshaw2-redhat came up with this design, I'm assuming there's a good reason for doing it differently in this case ... I'm just asking what that is.
Sorry Mark - I reviewed this while you were out; I should not have merged it. The reason I merged #31193 was that these iteration-level logs are used for debugging performance and for annotating the NVTX markers, which are intended to be used by a profiler --- and thus need to be annotated in the GPU worker. I thought this was a valid use case for debugging [different from the operator use case for the typical logs we do in the frontend], but I see now that it could open a can of worms, and also we could have added the annotations without doing the logging in the engine core.
I think we can refactor it to be cleaner.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Report KV cache usage metrics in iteration-level logging.
This reports the usage percentage, used/free block counts, and the number of used tokens:
We used these KV cache metrics to analyze the best MBNT for the Qwen3-VL MLPerf submission.
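For concreteness, the reported quantities can be derived from block-level counters roughly as below. This is a sketch with hypothetical names (`KVCacheUsageStats` and its fields are invented for illustration); the PR's actual dataclass and scheduler fields may differ.

```python
from dataclasses import dataclass


@dataclass
class KVCacheUsageStats:
    """Illustrative per-iteration KV cache usage snapshot."""
    used_blocks: int
    free_blocks: int
    block_size: int  # tokens held per KV cache block

    @property
    def total_blocks(self) -> int:
        return self.used_blocks + self.free_blocks

    @property
    def usage_pct(self) -> float:
        # Guard against a zero-block configuration.
        if self.total_blocks == 0:
            return 0.0
        return 100.0 * self.used_blocks / self.total_blocks

    @property
    def used_tokens(self) -> int:
        # Upper bound: the last block of each sequence may be partially filled.
        return self.used_blocks * self.block_size


stats = KVCacheUsageStats(used_blocks=768, free_blocks=256, block_size=16)
print(f"KV cache: {stats.usage_pct:.1f}% used "
      f"({stats.used_blocks}/{stats.total_blocks} blocks, "
      f"{stats.used_tokens} tokens)")
```

Note that the token count derived from used blocks is an upper bound, since partially filled tail blocks count as fully used at the block level.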
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.