[BugFix] Eagerly abort cancelled final-step requests#29987
markmc merged 6 commits into vllm-project:main from
Conversation
Could you provide some more detailed explanation about what was happening before + why this fixes it? This is pretty complicated logic, so I think this will be valuable for posterity.
Signed-off-by: Nick Hill <nhill@redhat.com>
@robertgshaw2-redhat I've now added some explanations.
Signed-off-by: Nick Hill <nhill@redhat.com>
```python
if TYPE_CHECKING:
    from vllm.model_executor.model_loader.tensorizer import TensorizerConfig
    from vllm.v1.worker.gpu_model_runner import GPUModelRunner
```
The changes in this file avoid CUDA being initialized at import time (lazy import of `GPUModelRunner`).
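A minimal, self-contained sketch of that pattern: imports guarded by `typing.TYPE_CHECKING` are seen only by static type checkers and skipped at runtime, so a module with heavy import-time side effects (such as one that initializes CUDA) is deferred until it is actually needed. The `decimal` module here is just a stand-in for such a heavy module.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by type checkers; never executed at runtime, so the
    # "heavy" module is not imported just to reference its types.
    from decimal import Decimal  # stand-in for an import with side effects


def to_decimal(value: str) -> "Decimal":
    # Deferred, call-time import: the module loads only on first use,
    # not when this file is imported.
    from decimal import Decimal
    return Decimal(value)
```

The string annotation `"Decimal"` (or `from __future__ import annotations`) keeps the signature type-checkable even though the name is unbound at runtime.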
Hello @njhill! I tried your PR and it seems to work as expected. Now that I think about it, my PR was more a patch than a proper fix. Yours is the proper fix. I'll let you know if we encounter this issue again, but hopefully not 😉 Any chance to see this PR merged quickly?
markmc
left a comment
Thanks, Nick. Looks good to me.
As discussed, I was trying to quantify the size of the window this closes, versus the window remaining
Here's Claude's take on that, FWIW:
- Window CLOSED by this fix (Large) - Chunked prefill: Hundreds of milliseconds to several seconds
- Window REMAINING (Tiny) - Duration: Pure Python operations, no GPU wait, sub-millisecond (microseconds in practice)
Currently, when requests are cancelled while executing their final step, "completion" of those requests is subsequently handled via normal stop processing (e.g. length or stop token), so the abort essentially has no effect.
This is typically not a problem, since the final output would be ignored/discarded in this case anyway. When a KV connector is involved, however, the connector will think the request completed successfully rather than being aborted.
This has turned out to be problematic for disaggregated prefill, which frees the KV cache blocks if the request was aborted but not if it thinks the request completed successfully. Since the top-level request was cancelled, it will never be sent to the decode side, so the KV cache blocks remain pinned unnecessarily until the fall-back timeout expires.
The problem is exacerbated when a large number of requests are cancelled and/or there are large prefills whose forward pass takes a long time (since the window for this to occur is bigger).
This PR fixes the problem by explicitly processing any pending aborts immediately before processing the model output each step. We process only the aborts and not new requests, since for latency reasons it's still preferable to process the model outputs before new incoming requests.
Fixes #26400.
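A schematic sketch of the fix (hypothetical names, not vLLM's actual scheduler API): the engine drains queued abort notifications and removes those requests *before* consuming the step's model output, so a request cancelled during its final forward pass is reported as aborted rather than finished.

```python
from collections import deque


class Scheduler:
    """Toy model of the engine step loop, for illustration only."""

    def __init__(self):
        self.running = {}               # req_id -> request state
        self.pending_aborts = deque()   # abort notifications from the frontend

    def abort(self, req_id: str) -> None:
        # Called asynchronously when a client cancels a request.
        self.pending_aborts.append(req_id)

    def _process_aborts(self) -> list[str]:
        # Drain all queued aborts; drop those requests from the running set
        # (in real code, this is where a KV connector would be told "aborted").
        aborted = []
        while self.pending_aborts:
            req_id = self.pending_aborts.popleft()
            if self.running.pop(req_id, None) is not None:
                aborted.append(req_id)
        return aborted

    def step(self, model_output: dict[str, str]) -> dict[str, str]:
        # Key point of the fix: process aborts *before* the model output,
        # so a request cancelled mid-final-step never looks "completed".
        aborted = self._process_aborts()
        finished = {req_id: "aborted" for req_id in aborted}
        for req_id, token in model_output.items():
            if req_id in self.running and token == "<eos>":
                del self.running[req_id]
                finished[req_id] = "completed"
        return finished
```

With the old ordering (output first, aborts second), a request that was cancelled while its final step was in flight would hit the normal stop path and be marked "completed"; draining the abort queue first makes the abort win.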