[Metrics] Deprecate TPOT in favor of ITL #24110
Merged
DarkLight1337 merged 2 commits into vllm-project:main on Sep 2, 2025
Conversation
The only case where we don't want to assert the existence of a metric is when it is deprecated and we're not showing hidden deprecated metrics.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
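A minimal sketch of that rule as a test helper; the names and signature here are illustrative, not vLLM's actual test code:

```python
def assert_metric_exists(
    exported: set[str],
    name: str,
    deprecated: set[str],
    show_hidden_deprecated: bool,
) -> None:
    """Assert that a metric is exported, except in the one case where it
    is deprecated and hidden deprecated metrics are not being shown."""
    if name in deprecated and not show_hidden_deprecated:
        return  # hidden deprecated metric: existence is not asserted
    assert name in exported, f"expected metric {name!r} to be exported"
```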
Contributor
Code Review
This pull request correctly deprecates the vllm:time_per_output_token_seconds (TPOT) metric in favor of the more accurately named vllm:inter_token_latency_seconds (ITL). The changes are consistently applied across the codebase, including metrics definitions, logging, tests, and the Grafana dashboard example. The deprecation strategy of retaining the old metric for backward compatibility while introducing the new one is sound. I've found one minor issue with the documentation of the new metric, which appears to be a copy-paste error.
As per vllm-project#24015, what we currently call TPOT should instead be called ITL, since what we are actually measuring is the time between iterations, and a single iteration can produce multiple tokens.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
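A rough sketch of the resulting metric pair, assuming prometheus_client-style histograms; the helper name and documentation strings are illustrative, not copied from vLLM:

```python
from prometheus_client import Histogram

# New, accurately named metric.
inter_token_latency = Histogram(
    "vllm:inter_token_latency_seconds",
    "Inter-token latency in seconds (time between engine iterations).",
)
# Old name retained for backward compatibility during the deprecation period.
time_per_output_token = Histogram(
    "vllm:time_per_output_token_seconds",
    "DEPRECATED: use vllm:inter_token_latency_seconds instead.",
)

def observe_iteration_gap(seconds: float) -> None:
    # A single iteration can emit multiple tokens (e.g. speculative
    # decoding), so this measures inter-iteration time, i.e. ITL,
    # not true per-token time.
    inter_token_latency.observe(seconds)
    time_per_output_token.observe(seconds)
```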
Force-pushed from b176439 to 09dbc43
DarkLight1337 approved these changes Sep 2, 2025
Member
LGTM, thanks for updating
845473182 pushed a commit to 845473182/vllm that referenced this pull request Sep 3, 2025
* 'main' of https://github.com/845473182/vllm: (457 commits)
  [BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models (vllm-project#24132)
  [Misc] Add check for dual_chunk_attention (vllm-project#24070)
  [Doc]: fix typos in Python comments (vllm-project#24115)
  [Doc]: fix typos in Python comments (vllm-project#24093)
  [Compile] Fix Compile Warning for `w4a8_mm_entry.cu` (vllm-project#23660)
  fix some typos (vllm-project#24071)
  [V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing (vllm-project#23656)
  Upgrade xgrammar to 0.1.23 (vllm-project#22988)
  Update release pipeline post PyTorch 2.8.0 update (vllm-project#24073)
  [XPU] Fix the bug of LoRA logits on the XPU platform (vllm-project#24081)
  [CI/Build] Disable SiluMul NVFP4 quant fusion tests (vllm-project#24121)
  [Bug] R1 Accuracy: Fix `routed_scaling_factor` Double Mul Issue (vllm-project#24119)
  [AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault (vllm-project#23692)
  [CI] Enable all hf transformers baselines in test_hybrid (vllm-project#23936)
  [Log] Only Print Profiler Results on Rank 0 (vllm-project#23370)
  Fix weights loading for Apertus (vllm-project#24100)
  [Metrics] Deprecate TPOT in favor of ITL (vllm-project#24110)
  [Bugfix] Fix packed_factor missing attribute error (vllm-project#23902)
  Run ruff format on a few files. (vllm-project#24075)
  [Bugfix] Fix transform_config parsing in Compressed Tensors (vllm-project#23945)
  ...
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
markmc added a commit to markmc/vllm that referenced this pull request Nov 24, 2025
The following are due for removal:

- `vllm:gpu_cache_usage_perc`
- `vllm:gpu_prefix_cache_queries`
- `vllm:gpu_prefix_cache_hits`

See vllm-project#18354

And the following is due to be hidden:

- `vllm:time_per_output_token_seconds`

See vllm-project#24110

The deprecation policy is documented [here](https://docs.vllm.ai/en/latest/usage/metrics/):

> when metrics are deprecated in version X.Y, they are hidden in version X.Y+1 but can be re-enabled using the --show-hidden-metrics-for-version=X.Y escape hatch, and are then removed in version X.Y+2.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
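A small sketch of how that X.Y / X.Y+1 / X.Y+2 policy could be gated in code; the dataclass and function are hypothetical, and only the flag name and the policy itself come from the quote above:

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    deprecated_in: str | None = None  # e.g. "0.11"

def is_hidden(metric: Metric, current_version: str,
              show_hidden_for: str | None) -> bool:
    """Deprecated in X.Y -> hidden in X.Y+1, unless re-enabled with
    --show-hidden-metrics-for-version=X.Y; removed entirely in X.Y+2."""
    if metric.deprecated_in is None:
        return False
    if show_hidden_for == metric.deprecated_in:
        return False  # escape hatch re-enables the hidden metric
    dep = tuple(map(int, metric.deprecated_in.split(".")))
    cur = tuple(map(int, current_version.split(".")))
    return cur >= (dep[0], dep[1] + 1)
```

Under this sketch, `is_hidden(Metric("vllm:time_per_output_token_seconds", "0.11"), "0.12", None)` is `True`, and passing `show_hidden_for="0.11"` flips it back to visible.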
As per #24015, what we currently call TPOT should instead be called ITL, since what we are actually measuring is the time between iterations, and a single iteration can produce multiple tokens.
I'm flagging the TPOT metric as deprecated from 0.11. Even if this lands in a 0.10.x release, I think the deprecation period should only start when it ships in a new minor 0.N.0 release.