[BugFix] Eagerly abort cancelled final-step requests#29987
markmc merged 6 commits into vllm-project:main from
Conversation
Could you provide some more detailed explanation about what was happening before + why this fixes it? This is pretty complicated logic, so I think this will be valuable for posterity.
Signed-off-by: Nick Hill <nhill@redhat.com>
@robertgshaw2-redhat I've now added some explanations.
Signed-off-by: Nick Hill <nhill@redhat.com>
```python
if TYPE_CHECKING:
    from vllm.model_executor.model_loader.tensorizer import TensorizerConfig
    from vllm.v1.worker.gpu_model_runner import GPUModelRunner
```
The changes in this file avoid CUDA being initialized at import time (lazy import of `GPUModelRunner`).
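A minimal, self-contained sketch of that pattern: imports guarded by `typing.TYPE_CHECKING` are seen only by static type checkers and skipped at runtime, so a module with heavy import-time side effects (such as one that initializes CUDA) is deferred until it is actually needed. The `decimal` module here is just a stand-in for such a heavy module.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by type checkers; never executed at runtime, so the
    # "heavy" module is not imported just to reference its types.
    from decimal import Decimal  # stand-in for an import with side effects


def to_decimal(value: str) -> "Decimal":
    # Deferred, call-time import: the module loads only on first use,
    # not when this file is imported.
    from decimal import Decimal
    return Decimal(value)
```

The string annotation `"Decimal"` (or `from __future__ import annotations`) keeps the signature type-checkable even though the name is unbound at runtime.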
Hello @njhill! I tried your PR and it seems to work as expected. Now that I think about it, my PR was more a patch than a proper fix. Yours is the proper fix. I'll let you know if we encounter this issue again, but hopefully not 😉 Any chance to see this PR merged quickly?
markmc
left a comment
Thanks, Nick. Looks good to me.
As discussed, I was trying to quantify the size of the window this closes, versus the window remaining
Here's Claude's take on that, FWIW:
- Window CLOSED by this fix (Large) - Chunked prefill: Hundreds of milliseconds to several seconds
- Window REMAINING (Tiny) - Duration: Pure Python operations, no GPU wait, sub-millisecond (microseconds in practice)
Currently, when requests are cancelled while executing their final step, "completion" of those requests is subsequently handled via normal stop processing (e.g. length or stop token), so the abort essentially has no effect.
This is typically not a problem, since the final output would be ignored/discarded in this case anyway. When a KV connector is involved, however, the connector will think the request completed successfully rather than being aborted.
This has turned out to be problematic for disaggregated prefill, which frees the KV cache blocks if the request was aborted but not if it thinks the request completed successfully. Since the top-level request was cancelled, it will never be sent to the decode side, so the KV cache blocks remain pinned unnecessarily until the fall-back timeout expires.
The problem is exacerbated when a large number of requests are cancelled and/or there are large prefills whose forward pass takes a long time (since the window for this to occur is bigger).
This PR fixes the problem by explicitly processing any pending aborts immediately before processing the model output each step. We process only the aborts and not new requests, since for latency reasons it's still preferable to process the model outputs before new incoming requests.
Fixes #26400.
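A schematic sketch of the fix (hypothetical names, not vLLM's actual scheduler API): the engine drains queued abort notifications and removes those requests *before* consuming the step's model output, so a request cancelled during its final forward pass is reported as aborted rather than finished.

```python
from collections import deque


class Scheduler:
    """Toy model of the engine step loop, for illustration only."""

    def __init__(self):
        self.running = {}               # req_id -> request state
        self.pending_aborts = deque()   # abort notifications from the frontend

    def abort(self, req_id: str) -> None:
        # Called asynchronously when a client cancels a request.
        self.pending_aborts.append(req_id)

    def _process_aborts(self) -> list[str]:
        # Drain all queued aborts; drop those requests from the running set
        # (in real code, this is where a KV connector would be told "aborted").
        aborted = []
        while self.pending_aborts:
            req_id = self.pending_aborts.popleft()
            if self.running.pop(req_id, None) is not None:
                aborted.append(req_id)
        return aborted

    def step(self, model_output: dict[str, str]) -> dict[str, str]:
        # Key point of the fix: process aborts *before* the model output,
        # so a request cancelled mid-final-step never looks "completed".
        aborted = self._process_aborts()
        finished = {req_id: "aborted" for req_id in aborted}
        for req_id, token in model_output.items():
            if req_id in self.running and token == "<eos>":
                del self.running[req_id]
                finished[req_id] = "completed"
        return finished
```

With the old ordering (output first, aborts second), a request that was cancelled while its final step was in flight would hit the normal stop path and be marked "completed"; draining the abort queue first makes the abort win.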