[BugFix] Eagerly abort cancelled final-step requests#29987

Merged
markmc merged 6 commits intovllm-project:mainfrom
njhill:eager-aborts
Dec 5, 2025
Conversation

@njhill
Member

@njhill njhill commented Dec 3, 2025

Currently, when requests are cancelled while executing their final step, "completion" of those requests is subsequently handled via normal stop processing (e.g. length limit or stop token), and so the abort essentially has no effect.

This is typically not a problem, since the final output would be ignored or discarded in that case anyway. When a kv connector is involved, however, it means that the connector will think the request completed successfully rather than being aborted.

This has turned out to be problematic for disaggregated prefill, which frees the kv cache blocks if the request was aborted but not if it thinks the request completed successfully. Since the top-level request was cancelled, it will never be sent to the decode side, and so the kv cache blocks remain pinned unnecessarily until the fall-back timeout expires.

The problem is exacerbated when a large number of requests are cancelled and/or there are large prefills whose forward pass takes a long time (since the window for this to occur is bigger).


This PR fixes the problem by explicitly processing any pending aborts immediately before processing the model output each step. We process only the aborts and not new requests since, for latency reasons, it's still preferable to process the model outputs before new incoming requests.

Fixes #26400.
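For intuition, here is a minimal sketch of the step ordering the PR establishes: drain pending aborts first, then interpret the model output, so a request cancelled during its final step ends up aborted rather than "finished" by stop processing. All names here (MiniScheduler, pending_aborts, step) are illustrative, not vLLM's actual scheduler API.

```python
from collections import deque

class MiniScheduler:
    """Toy sketch of the abort-before-output ordering (hypothetical API)."""

    def __init__(self):
        self.running = {}             # req_id -> request state
        self.pending_aborts = deque() # cancellations queued asynchronously

    def abort(self, req_id):
        # Client cancellations can arrive while a forward pass is running.
        self.pending_aborts.append(req_id)

    def _drain_aborts(self):
        # Applied *before* interpreting model output, so a request
        # cancelled during its final step is recorded as aborted
        # instead of being "finished" by normal stop processing.
        while self.pending_aborts:
            req = self.running.pop(self.pending_aborts.popleft(), None)
            if req is not None:
                req["status"] = "aborted"  # e.g. lets a kv connector free blocks

    def step(self, model_output):
        self._drain_aborts()
        finished = []
        for req_id, token in model_output.items():
            req = self.running.get(req_id)
            if req is None:
                continue  # aborted above; discard its final output
            req["tokens"].append(token)
            if token == req["stop_token"]:
                req["status"] = "finished"
                finished.append(self.running.pop(req_id))
        return finished
```

If `_drain_aborts()` instead ran after the output loop, a request whose final token was a stop token would be marked "finished" even though it had been cancelled, which is exactly the window described above.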

@robertgshaw2-redhat
Collaborator

Could you provide some more detailed explanation about what was happening before + why this fixes it?

This is pretty complicated logic, so I think a written record will be valuable for posterity.

Signed-off-by: Nick Hill <nhill@redhat.com>
@njhill
Member Author

njhill commented Dec 3, 2025

@robertgshaw2-redhat I've now added some explanations.

Signed-off-by: Nick Hill <nhill@redhat.com>
@njhill njhill marked this pull request as ready for review December 3, 2025 20:02
@njhill njhill changed the title [BugFix] Eagerly abort final-step requests [BugFix] Eagerly abort cancelled final-step requests Dec 3, 2025
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 4, 2025

if TYPE_CHECKING:
from vllm.model_executor.model_loader.tensorizer import TensorizerConfig
from vllm.v1.worker.gpu_model_runner import GPUModelRunner
Member Author

The changes in this file avoid CUDA being initialized at import time (via lazy import of GPUModelRunner).
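For readers unfamiliar with the pattern, this is the standard `typing.TYPE_CHECKING` idiom: the heavy module is imported only when a static type checker analyzes the file, never at runtime, so merely importing the config module doesn't pull in the GPU stack. A generic sketch (the `attach_runner` function is illustrative, not code from this PR):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers (TYPE_CHECKING is False
    # at runtime), so importing this file never initializes CUDA.
    from vllm.v1.worker.gpu_model_runner import GPUModelRunner

def attach_runner(runner: "GPUModelRunner") -> None:
    # The annotation is a string, so it is not resolved at import
    # time; mypy/pyright still see the real type.
    ...
```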

@hasB4K
Contributor

hasB4K commented Dec 5, 2025

Hello @njhill! I tried your PR and it seems to work as expected. Now that I think about it, my PR was more a patch than a proper fix; yours is the proper fix. I'll let you know if we encounter this issue again, but hopefully not 😉

Any chance to see this PR merged quickly?

Member

@markmc markmc left a comment


Thanks, Nick. Looks good to me.

As discussed, I was trying to quantify the size of the window this closes, versus the window remaining

Here's Claude's take on that, FWIW:

  • Window CLOSED by this fix (Large) - Chunked prefill: Hundreds of milliseconds to several seconds
  • Window REMAINING (Tiny) - Duration: Pure Python operations, no GPU wait, sub-millisecond (microseconds in practice)

@markmc markmc merged commit dc264bc into vllm-project:main Dec 5, 2025
48 of 49 checks passed
@njhill njhill deleted the eager-aborts branch December 5, 2025 17:31
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1


Development

Successfully merging this pull request may close these issues.

[Engine Core] Process pending requests in-between model execution and update_from_outputs

4 participants