[P/D] bugfix for p node force free request #5431
wangxiyuan merged 5 commits into vllm-project:main
Conversation
Code Review
This pull request aims to fix an issue with force-freeing requests. However, the changes introduce a critical race condition in `KVCacheTaskTracker` that can lead to a thread crash (a `KeyError`) and a memory leak, which would cause requests to be incorrectly force-freed later. The previous logic for handling the race between `add_delayed_request` and `update_done_task_count` has been removed, leading to this bug. I've provided a comment with a detailed explanation of the issue.
```python
if self.is_kv_producer:
    self.finished_requests.add(request_id)
    self._remove_delayed_requests(request_id)
else:
    self.record_finished_requests.add(request_id)
    self.finished_requests.add(request_id)
```
The refactoring of `KVCacheTaskTracker` across `__init__`, `update_done_task_count`, and `add_delayed_request` has introduced a critical race condition with two major issues:

1. Potential crash: in `update_done_task_count`, `_remove_delayed_requests` is now called unconditionally for a producer. This method uses `dict.pop()`, which will raise a `KeyError` if `update_done_task_count` runs before `add_delayed_request` for the same `request_id`. This will crash the `KVCacheSendingThread`.
2. Memory leak: removing `record_finished_requests` breaks the mechanism that handled this race. If `update_done_task_count` runs first (and is modified not to crash), and `add_delayed_request` runs afterwards, the request is added to `delayed_free_requests` and is never removed. This leads to a memory leak, and the request is incorrectly force-freed on timeout.
The previous implementation using `record_finished_requests` appeared to correctly handle this race condition. This logic should be restored to ensure correctness and prevent these critical issues.
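The race-safe pattern described above can be sketched as follows. This is a minimal illustration, not the real vLLM Ascend class: the field names (`finished_requests`, `record_finished_requests`, `delayed_free_requests`, `done_task_lock`) mirror the discussion, but the actual tracker carries more state.

```python
import threading

class KVCacheTaskTracker:
    """Minimal sketch of the race-safe bookkeeping discussed above."""

    def __init__(self):
        self.done_task_lock = threading.Lock()
        self.finished_requests = set()
        # Every request that ever finished, kept so a late
        # add_delayed_request can detect the completed request.
        self.record_finished_requests = set()
        self.delayed_free_requests = {}  # request_id -> delay_start_time

    def update_done_task_count(self, request_id: str):
        with self.done_task_lock:
            self.finished_requests.add(request_id)
            self.record_finished_requests.add(request_id)
            # pop with a default so a missing key cannot raise KeyError
            self.delayed_free_requests.pop(request_id, None)

    def add_delayed_request(self, request_id: str, delay_start_time: float):
        with self.done_task_lock:
            # If the request already finished, do not re-add it; otherwise
            # it would linger in delayed_free_requests and be force-freed.
            if request_id not in self.record_finished_requests:
                self.delayed_free_requests[request_id] = delay_start_time
```

With this shape, both orderings of the two calls leave `delayed_free_requests` empty for a finished request, which is exactly the property the reviewer asks to restore.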
Force-pushed 72c52d6 to dff079d
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
jianzs left a comment:
Could you provide more details about the issue you're trying to resolve?
```python
def update_done_task_count(self, request_id: str):
    with self.done_task_lock:
        self.finished_requests.add(request_id)
```
Will placing this line of code back in its original position cause any issues?

For a P node, if the request has been force-freed, it will not be in `delayed_free_requests`. This indicates it was previously marked as finished and does not need to be marked again.

When a request is force-freed by a P node and then happens to complete because a D node pulls it, the request enters the `get_finished` interface of `MooncakeConnectorWorker` twice, and also enters the Scheduler's `_update_from_kv_xfer_finished` function twice. The second time causes an assertion failure because `req_id` is no longer in `self.requests`.
If the timeout for forced release is longer than the timeout for aborting a request, this issue should not occur. That said, resolving it properly in the code logic would also be a good approach.
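One way to resolve it in code logic, sketched here as a hypothetical standalone function rather than the real vLLM Scheduler method, is to make the finish handler idempotent: a second DONE notification for an already-freed request is logged and skipped instead of tripping an assertion.

```python
import logging

logger = logging.getLogger(__name__)

def update_from_kv_xfer_finished(finished_ids, requests):
    """Hedged sketch (not the real vLLM signature): free each request
    whose KV transfer completed, tolerating duplicate notifications."""
    freed = []
    for req_id in finished_ids:
        if req_id not in requests:
            # second notification for an already force-freed request
            logger.warning("Duplicate KV-transfer finish for %s; ignoring",
                           req_id)
            continue
        requests.pop(req_id)
        freed.append(req_id)
    return freed
```

Calling it twice with the same id frees the request once and ignores the repeat, which is the failure path described in the comment above.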
```python
def add_delayed_request(self, request_id: str, delay_start_time: float):
    """Add a delayed free request."""
    with self.done_task_lock:
        if request_id not in self.record_finished_requests:
```
`add_delayed_request` occurs in the next forward pass, so it is possible that `DONE_RECVING_MSG` is received before `add_delayed_request`. In that case, this modification would cause normally released requests to be forcibly released after the timeout.
The `add_delayed_request` operation occurs within the `execute_model` function of the current scheduling round. Requests are forwarded to the D node only after the P node has completed its execution of `execute_model`. Consequently, any request received by the D node is necessarily present in the P node's `delayed_requests` collection, so the situation you describe should not arise.
@liziyu179 We've previously considered the scenario described in this PR #2899, so please check whether the current changes are compatible with it.
Force-pushed dff079d to 1c78ad1
```python
    self.record_finished_requests.add(request_id)
else:
    self.record_finished_requests.add(request_id)
    self.forced_free_requests.discard(request_id)
```
Here, you only add forced free requests but never remove them. Line 129 only removes requests that became abnormal after force free, which will lead to memory leaks over time.
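One way to avoid that leak, sketched here under the assumption that force-free and completion both go through the tracker's lock (the class and method names are hypothetical), is to clear the forced-free record on every terminal path:

```python
import threading

class ForcedFreeBook:
    """Hypothetical minimal bookkeeping: every terminal event clears the
    forced-free record so the set cannot grow without bound."""

    def __init__(self):
        self.lock = threading.Lock()
        self.forced_free_requests = set()

    def mark_forced_free(self, request_id: str):
        with self.lock:
            self.forced_free_requests.add(request_id)

    def on_request_done(self, request_id: str) -> bool:
        """Return True if this completion follows a force-free, i.e. it is
        the duplicate notification that must be suppressed downstream."""
        with self.lock:
            was_forced = request_id in self.forced_free_requests
            self.forced_free_requests.discard(request_id)
            return was_forced
```

Because `on_request_done` always discards the id, the set only ever holds requests that were force-freed and have not yet produced a late completion.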
…equest due to timeout and then receives the completed kv cache pulled by the D-node again. Signed-off-by: liziyu <liziyu16@huawei.com>
Force-pushed 1c78ad1 to 21bf2e6
```python
if request_id in self.reqs_to_process:
    self.finished_requests.add(request_id)
    self.reqs_to_process.discard(request_id)
    self.delayed_free_requests.pop(request_id, None)
```
You can add an `else` branch to log an error. If the `req_id` received by `update_done_task_count` is not in `reqs_to_process`, that indicates a bookkeeping problem and should be surfaced.

Yes, we need to remind users about this here.
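The suggested `else` branch could look like the following reduction (the class is a hypothetical stand-in for the real tracker; only the branch structure matters):

```python
import logging
import threading

logger = logging.getLogger(__name__)

class TaskTrackerSketch:
    """Hypothetical reduction of update_done_task_count with the suggested
    else branch: unknown ids are logged instead of silently ignored."""

    def __init__(self):
        self.done_task_lock = threading.Lock()
        self.reqs_to_process = set()
        self.finished_requests = set()
        self.delayed_free_requests = {}

    def update_done_task_count(self, request_id: str) -> bool:
        with self.done_task_lock:
            if request_id in self.reqs_to_process:
                self.finished_requests.add(request_id)
                self.reqs_to_process.discard(request_id)
                self.delayed_free_requests.pop(request_id, None)
                return True
            # an id we never scheduled points at a bookkeeping bug upstream
            logger.error("update_done_task_count: unknown request %s",
                         request_id)
            return False
```

Returning a boolean is an optional extra so callers (or tests) can observe the anomaly as well as the log line.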
```python
                         self._prefill_pp_size - 1))

if self.kv_send_thread is not None:
    for req_id, meta in metadata.requests.items():
```
The prefill node might lack meta information. If you need the `req_id`, you need to bring it down from the scheduler side; I didn't see that operation in this PR.

You're right, we added a `req_in_batch` variable to pass the request from the scheduler to the worker.
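The scheduler-to-worker handoff could be sketched like this. All names here (`SchedulerOutputSketch`, `scheduler_build`, `worker_consume`) are hypothetical illustrations of the `req_in_batch` idea, not the actual vLLM Ascend API:

```python
from dataclasses import dataclass, field

@dataclass
class SchedulerOutputSketch:
    """Hypothetical carrier for req_in_batch: the scheduler records every
    request id in the current batch so the worker-side connector can
    register them before any KV transfer can finish."""
    req_in_batch: set = field(default_factory=set)

def scheduler_build(request_ids):
    # scheduler side: stamp the batch's request ids into the metadata
    out = SchedulerOutputSketch()
    out.req_in_batch.update(request_ids)
    return out

def worker_consume(out, reqs_to_process):
    # worker side: register the batch before starting transfers, so a
    # DONE notification can never arrive for an unregistered request
    reqs_to_process.update(out.req_in_batch)
```

Registering the ids before any transfer starts is what guarantees the `request_id in self.reqs_to_process` check in the tracker can succeed.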
Signed-off-by: liziyu <liziyu16@huawei.com>
@LCAIZJ Please merge if this change is fine

@jianzs cc

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>

Force merge. The CI failure isn't related to this PR.
### What this PR does / why we need it?
Fix the bug where the P-node's scheduler dies after it force-frees a request due to timeout and then receives the completed KV cache pulled by the D-node again, by adding a list to record all requests.

- vLLM version: release/v0.13.0
- vLLM main: vllm-project/vllm@81786c8

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>