[Feature] Add iteration level logging and enhance NVTX marker #31193
robertgshaw2-redhat merged 13 commits into vllm-project:main
Conversation
Code Review
This pull request introduces iteration-level logging and enhances NVTX markers by providing more detailed information about context and generation requests and tokens. The changes are well-structured, adding a new environment variable to control the feature, a utility function to compute iteration details, and integrating it into the engine core and GPU worker. My review focuses on a performance optimization for the new utility function. Overall, this is a valuable addition for debugging and performance analysis.
💡 Codex Review
Here are some automated review suggestions for this pull request.
I actually feel that we should turn this off by default. This feature is really meant for GPU experts doing performance projection and profiling, which represents only a minor fraction of vLLM's overall user base. For general users, too much logging can be confusing.
I am a little worried about the CPU overhead this causes and whether vLLM users would think the logging is too verbose. Should we turn it off by default and add a scheduler flag to enable it?
This should not be on by default.
vllm/v1/worker/gpu_worker.py (outdated diff excerpt):

```
return self.profiler.annotate_context_manager(
    f"execute_new_{num_new}_cached_{num_cached}"
annotation = (
```
Using f-strings in core-loop logging can add a lot of overhead; this is something we ran into issues with in the past.
Changed all string building to join() to minimize overhead.
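As a hedged illustration of the change discussed above (the names `num_new` and `num_cached` follow the diff excerpt; the actual helper in the PR may differ), building the annotation with `str.join()` on a list of parts avoids the f-string formatting machinery in a hot loop:

```python
# Hypothetical sketch of the join()-based annotation builder; not the
# exact code merged in the PR.
def build_annotation(num_new: int, num_cached: int) -> str:
    # join() over pre-stringified parts is cheap compared to repeated
    # f-string formatting inside the core execution loop.
    return "".join(
        ["execute_new_", str(num_new), "_cached_", str(num_cached)]
    )

print(build_annotation(3, 5))  # execute_new_3_cached_5
```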
@wangshangsam @nvpohanh @robertgshaw2-redhat It's turned off by default now.
vllm/envs.py (outdated diff excerpt):

```
@@ -46,6 +46,7 @@
 VLLM_LOGGING_COLOR: str = "auto"
 NO_COLOR: bool = False
 VLLM_LOG_STATS_INTERVAL: float = 10.0
+VLLM_LOG_ITERATION_DETAILS: bool = False
```
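For context, a boolean environment variable like `VLLM_LOG_ITERATION_DETAILS` is typically parsed along these lines. This is a simplified stand-alone sketch, not vLLM's actual `envs.py` registry (which uses a lambda table):

```python
# Simplified sketch of boolean env-var parsing; hypothetical helper,
# not the real vLLM implementation.
import os

def env_flag(name: str, default: bool = False) -> bool:
    raw = os.environ.get(name)
    if raw is None:
        return default
    # treat common truthy spellings as True
    return raw.strip().lower() in ("1", "true", "yes")

os.environ["VLLM_LOG_ITERATION_DETAILS"] = "1"
print(env_flag("VLLM_LOG_ITERATION_DETAILS"))  # True
```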
@robertgshaw2-redhat do you think this should be controlled via env var or a server flag instead?
@robertgshaw2-redhat Could you help take another look? Thanks!
can we have it as a flag? we should put it in the profiler config that @benchislett made I think
How about putting it inside ObservabilityConfig? This is a logging option, whereas ProfilerConfig is for profilers.
I changed the code this way; please take another look. Thanks! @robertgshaw2-redhat
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Max Hu <hyoung2991@gmail.com>
Signed-off-by: Max Hu <maxhu@nvidia.com>
vllm/v1/engine/core.py (outdated diff excerpt):

```
        ]
    )
)
self.iteration_index += 1
```
Should we put some cap on this? I know Python can have arbitrarily sized ints, but I wonder if there is some footgun for billions of iterations.
So even at 1,000 iterations/second (which is very fast), reaching 1 billion would take ~11.5 days, and I think putting a cap on it or wrapping it around would be confusing.
Plus, this isn't turned on by default; it's a debugging feature.
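A quick back-of-envelope check of the figure quoted above:

```python
# At 1,000 iterations/second, how long until the counter reaches
# one billion iterations?
iterations = 1_000_000_000
rate_per_second = 1_000
days = iterations / rate_per_second / 86_400  # 86,400 seconds per day
print(f"{days:.1f} days")  # ~11.6 days, consistent with the ~11.5 estimate
```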
yeah, sounds reasonable to me
One small nit: can you create self.iteration_index inside this context manager?
I want to avoid people using it except for this debug functionality.
```
self.iteration_index = getattr(self, "iteration_index", 0)
iteration_details = compute_iteration_details(scheduler_output)
```

Signed-off-by: Max Hu <maxhu@nvidia.com>
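The snippet above uses a lazy-init pattern: the counter attribute is created only inside the debug path, so it never exists unless iteration logging actually runs. A minimal stand-alone sketch (the class name is a hypothetical stand-in for vLLM's EngineCore):

```python
# Hypothetical sketch of the getattr-based lazy-init pattern.
class EngineCoreSketch:
    def debug_step(self) -> int:
        # getattr with a default creates the counter on first use only
        self.iteration_index = getattr(self, "iteration_index", 0)
        current = self.iteration_index
        self.iteration_index += 1
        return current

core = EngineCoreSketch()
print(core.debug_step(), core.debug_step())  # 0 1
```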
…roject#31193) Signed-off-by: Max Hu <maxhu@nvidia.com> Signed-off-by: Max Hu <hyoung2991@gmail.com> Co-authored-by: Max Hu <maxhu@nvidia.com>
Purpose
This PR adds iteration-level logging for each scheduled iteration. It computes iteration details such as the number of context/generation requests and the number of context/generation tokens. A context request is defined as any request that is still in the prefill phase, i.e. no output token has been generated for it yet. The PR also logs the elapsed time of each iteration and keeps an index of the current iteration. Since the iteration index is recorded per EngineCore, it is kept per data parallel instance. Here is a logging example:
And this is a screenshot of the enriched NVTX iteration marker:
(screenshot not captured in this export)
This feature is turned off by default. You'll see a significant increase in server-side logging if it's turned on. A flag, --enable-logging-iteration-details, is added to turn it on/off.

Test Plan
and then:
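The original test commands are not included in this capture. A plausible invocation of the flag, with a placeholder model name, might look like the following (the serve command form is an assumption; only the flag name comes from the PR):

```shell
# Hedged sketch: enabling iteration-detail logging when serving a model.
# "MODEL_NAME" is a placeholder, not from the PR.
vllm serve MODEL_NAME --enable-logging-iteration-details
```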
Test Result
correct logging
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.