[BugFix] Scheduler: Only set num_external_computed_tokens once#36859
[BugFix] Scheduler: Only set num_external_computed_tokens once#36859orozery wants to merge 1 commit intovllm-project:mainfrom
Conversation
|
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Code Review
This pull request addresses a bug where request-level cache hit statistics (num_external_computed_tokens) were being incorrectly reset upon request rescheduling (e.g., after preemption). The fix ensures that both num_cached_tokens and num_external_computed_tokens are set only once, during the initial scheduling of a request, by checking if request.num_preemptions == 0. The logic for updating these stats upon external cache load failures has also been corrected to respect this new rule. The accompanying changes in the test suite correctly verify this new behavior. The changes appear correct and effectively resolve the described issue.
f4b6538 to
a3ac960
Compare
|
Hi @orozery, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Request.num_cached_tokens and Request.num_external_computed_tokens are two fields used for reporting request level cache hit stats. While num_cached_tokens is only set for the first time a request gets schedule, num_external_computed_tokens gets re-set whenever a request tries to gets re-scheduled, in case the request is preempted or when initial allocation fails. This creates a possible inconsistency between the two fields, which can yield to wrongful deduction of the derived stat local_cache_hit, which can cause vLLM to crash in case the wrong value is negative. This commit fixes it by properly setting these two fields only after a request gets scheduled for the first time (by checking Request.num_preemptions == 0). This fields may be updated only in the case of an error reported by the connector loading external tokens, We modify a scheduler unit-test for preemptions with KV connector to verify this fields are only set once. Signed-off-by: Or Ozeri <oro@il.ibm.com>
a3ac960 to
0889687
Compare
|
Still looking, but some initial thoughts ... Preemption scenario (#36533)A request is scheduled for the first time -
If a later Failed transfer scenario (#36638)#36638 describes a scenario where ReproducerI have failed to find a reliable reproducer using KV offloading and llama-3.1-8b-instruct, and an automated sweep of variations like KV offloading size, long/short inputs, long/short outputs, random and sharegpt datasets, different levels of high concurrency, and more … it did reproduce once, but I couldn’t repeat it. Of course, we could artificially recreate the scenario in a carefully controlled unit test, but that wasn’t my goal. Metrics PurposeLet’s consider the metrics flow in isolation, since the error relates to metrics updates.
On the frontend side, in the Note that in the case of preemption, the frontend only considers prefill to have been completed once, and so these two values are irrelevant for metrics once that initial This is all quite challenging to validate 100% from reading the code. It could make things a lot more clear if we separated any integral scheduler accounting use of these two values from the metrics-related information associated with the “prefill completed event”. KV Transfer FailuresIt seems like KV transfer failure handling it the other purpose for tracking these two values on Minimal fixVery similar to Or's proposed fix, the simplest thing we can do is to ensure both values only get updated once (except for KV transfer failure handling), and at the same time e.g. I don't love But I also wonder whether we should just drop the "only set the first time" thing, since we do use the values in KV transfer failure handling, even after preemptions |
The reason I prefer this condition over initializing fields to
These fields are used to reporting stats to the user. If we allow these fields to be re-set I think we lose the desirable semantics of these stats fields. |
Agree, and I think this could be done in a way that more clearly reflects the intended semantics - see #37460
I agree it would be much better to not depend on these fields in the error handling code 👍 |
|
We've since iterated on #37460 to resolve this, so closing |
Request.num_cached_tokensandRequest.num_external_computed_tokensare two fields used for reporting request level cache hit stats. Whilenum_cached_tokensis only set for the first time a request gets schedule,num_external_computed_tokensgets re-set whenever a request tries to gets re-scheduled, in case the request is preempted or when initial allocation fails. This creates a possible inconsistency between the two fields, which can yield to wrongful deduction of the derived statlocal_cache_hit, which can cause vLLM to crash in case the wrong value is negative.This PR fixes it by properly setting these two fields only after a request gets scheduled for the first time (by checking
Request.num_preemptions == 0).This fields may be updated only in the case of an error reported by the connector loading external tokens, We modify a scheduler unit-test for preemptions with KV connector to verify this fields are only set once.