[Core][Metrics][BugFix] Replace num_cached_tokens/num_external_computed_tokens with PrefillStats #37460
Conversation
I've just realized an obvious gap: the prefill stats aren't updated on the failed-KV-transfer path.
Code Review
This pull request introduces PrefillStats to more accurately track and report prefill-related metrics, which is a valuable improvement for clarity and correctness, especially in scenarios involving preemption. The refactoring to use a dedicated PrefillStats object instead of separate numerical values is a good design choice. The changes are consistently applied across the scheduler, engine outputs, request objects, and tests. I have found one critical issue in the metrics calculation logic that leads to double-counting of tokens, for which I've provided a detailed comment and a suggested fix.
Right now we basically duplicate these fields into the prefill stats. And I guess once we do that, we can think about whether we want to keep the old fields (probably not?).
IMO, the PR basically does that (sets the stats directly)?
I'm going to take a look at this, and I expect I can remove/rework these fields without affecting
Right, my bad.
@orozery from #36859 (comment)
I've incorporated this approach from that earlier revision now, PTAL

I think it's better to create a separate PR for these changes in the failure recovery flow.

Moved to draft since #38096 now needs to merge first
This pull request has merge conflicts that must be resolved before it can be merged.
[Core][Metrics][BugFix] Replace num_cached_tokens/num_external_computed_tokens with PrefillStats

In `OutputProcessor`, we take the first `EngineCoreOutput` as a signal that prefill has completed, and record certain statistics about it. On the scheduler side, because of preemption, we might have prefills that are scheduled but never completed, or we might need to recompute an already completed prefill. To add clarity, we use `PrefillStats` to track the first scheduled prefill so that the stats can be returned to the frontend via `EngineCoreOutput`.

`num_cached_tokens` was previously used for KV transfer failure recovery, but this is no longer true as of vllm-project#38096. We also no longer attempt to correct these prefill metrics if KV transfers fail, since this introduced unjustified brittleness to an already brittle code path.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
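The "track the first scheduled prefill, hand the stats over once" idea can be sketched in a few lines. This is a minimal, hypothetical sketch (class, field, and method names are illustrative, not the actual vLLM API): stats are captured only the first time a prefill is scheduled, so a recompute after preemption cannot overwrite them, and they are consumed exactly once.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrefillStats:
    """Token accounting captured the first time a prefill is scheduled."""
    num_cached_tokens: int = 0
    num_external_computed_tokens: int = 0


class Request:
    """Hypothetical request object; the real vLLM Request differs."""

    def __init__(self) -> None:
        self.prefill_stats: Optional[PrefillStats] = None

    def record_prefill_stats(self, cached: int, external: int) -> None:
        # Only the first scheduled prefill counts; a recompute after
        # preemption must not overwrite the original numbers.
        if self.prefill_stats is None:
            self.prefill_stats = PrefillStats(cached, external)

    def take_prefill_stats(self) -> Optional[PrefillStats]:
        # Consumed once, e.g. when building the first EngineCoreOutput,
        # so the frontend sees the stats exactly one time.
        stats, self.prefill_stats = self.prefill_stats, None
        return stats
```

With this shape, a preempted-and-recomputed prefill reports the numbers from its first scheduling, and a second `take_prefill_stats` call returns `None` rather than double-reporting.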
FTR, this is a significant change in the most recent update: we now track stats from the first time the prefill is scheduled (previously: the first time the prefill completed), since the proven-error-prone effort to update the stats in the KV-transfer error-handling path seems unjustified.
Or points out:

> we call take_prefill_stats at update_from_output in the same step the request was first scheduled to run. So I don't think there's a way the request will be preempted before we set Request.prefill_stats to None.

Co-authored-by: Or Ozeri <or@ozery.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Note: this PR changed the prefill stats reported in the multi-KV-connector case (NIXL connector + CPUOffloading connector). It impacts a couple of test cases; will post a PR that updates them accordingly.

Yep, it broke CI.
[Core][Metrics][BugFix] Replace num_cached_tokens/num_external_computed_tokens with PrefillStats (vllm-project#37460)

Related to the `Counters can only be incremented by non-negative amounts` error with the `vllm:prompt_tokens_by_source_total` metric.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
…different vllm version (#8426)

### What this PR does / why we need it?
This fixes breakage caused by vllm-project/vllm#37460. It introduces version-specific logic to handle `num_cached_tokens` and `num_external_computed_tokens` in the scheduler, ensuring compatibility with vLLM 0.19.0 and maintaining legacy support for older versions via `prefill_stats`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.19.0
- vLLM main: vllm-project/vllm@6f786f2

Signed-off-by: wangli <wangli858794774@gmail.com>
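The version-specific logic described above could look roughly like this stdlib-only sketch (function and attribute names are assumptions for illustration; the actual downstream patch may differ): gate on the installed vLLM version and read from `prefill_stats` on 0.19.0+, falling back to the legacy per-request fields otherwise.

```python
from types import SimpleNamespace


def _parse_version(v: str) -> tuple:
    # Minimal "major.minor.patch" parser; enough for a gate like this.
    return tuple(int(part) for part in v.split("."))


def get_prefill_token_counts(request, vllm_version: str):
    """Read token counts from PrefillStats on vLLM >= 0.19.0,
    falling back to the legacy per-request fields on older versions."""
    if _parse_version(vllm_version) >= (0, 19, 0):
        stats = request.prefill_stats
        return stats.num_cached_tokens, stats.num_external_computed_tokens
    return request.num_cached_tokens, request.num_external_computed_tokens


# Stand-in request objects for the two vLLM generations:
new_style = SimpleNamespace(prefill_stats=SimpleNamespace(
    num_cached_tokens=8, num_external_computed_tokens=4))
old_style = SimpleNamespace(num_cached_tokens=8, num_external_computed_tokens=4)
```

A single call site can then serve both vLLM generations without scattering version checks through the scheduler.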
Related to the discussion in #36859 and the `Counters can only be incremented by non-negative amounts` error with the `vllm:prompt_tokens_by_source_total` metric.

In `OutputProcessor`, we take the first `EngineCoreOutput` as a signal that prefill has completed, and record certain statistics about it. On the scheduler side, because of preemption, we might have prefills that are scheduled but never completed, or we might need to recompute an already completed prefill. To add clarity, we use `PrefillStats` to track the first scheduled prefill so that the stats can be returned to the frontend via `EngineCoreOutput`.

`num_cached_tokens` was previously used for KV transfer failure recovery, but this is no longer true as of #38096. We also no longer attempt to correct these prefill metrics if KV transfers fail, since this introduced unjustified brittleness to an already brittle code path.
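For context, the error quoted above is the guard a Prometheus-style counter applies: counters are monotonic, so a negative delta (e.g. tokens double-counted and then "corrected" downward) is rejected. A minimal stdlib-only sketch mirroring that behavior (this is not vLLM's or prometheus_client's actual code):

```python
class Counter:
    """Toy monotonic counter mirroring the Prometheus-client guard."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        # Counters must only ever go up; a negative increment means the
        # bookkeeping upstream (here, the prefill stats) went wrong.
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts")
        self.value += amount


prompt_tokens = Counter("vllm:prompt_tokens_by_source_total")
prompt_tokens.inc(128)          # normal accounting
try:
    prompt_tokens.inc(-16)      # a buggy "correction" triggers the error
    raised = False
except ValueError:
    raised = True
```

This is why reporting the prefill token counts once, from a single snapshot, is safer than incrementing and then trying to adjust the counter after a failed KV transfer.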