Skip to content

[0.13.0][cherry-pick][P/D] bugfix for p node force free requset (#5431)#5871

Merged
LCAIZJ merged 1 commit intovllm-project:releases/v0.13.0from
liziyu179:fix_force_free_v0_13_0
Jan 14, 2026
Merged

[0.13.0][cherry-pick][P/D] bugfix for p node force free requset (#5431)#5871
LCAIZJ merged 1 commit intovllm-project:releases/v0.13.0from
liziyu179:fix_force_free_v0_13_0

Conversation

@liziyu179
Copy link
Copy Markdown
Collaborator

What this PR does / why we need it?

Fix the bug where the P-node's schedule dead after it force-frees a request due to timeout and then receives the completed kv cache pulled by the D-node again. By add list to recode all requests.


Does this PR introduce any user-facing change?

How was this patch tested?

### What this PR does / why we need it?
Fix the bug where the P-node's schedule dead after it force-frees a
request due to timeout and then receives the completed kv cache pulled
by the D-node again. By add list to recode all requests.


- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@81786c8

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
@liziyu179 liziyu179 changed the title [P/D] bugfix for p node force free requset (#5431) [0.13.0][cherry-pick][P/D] bugfix for p node force free requset (#5431) Jan 14, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a bug that could cause a scheduler hang on a P-node when a request is force-freed due to a timeout. The fix introduces a mechanism to track all requests currently being processed. By checking against this set of active requests, the system can now correctly handle completion signals for requests that have already been freed, preventing the hang. The overall logic is sound and the changes appear to correctly resolve the issue. I have added one high-severity comment to improve an error log message, making it more effective for debugging by including the relevant request ID and improving clarity.

Comment on lines +130 to +132
logger.error(
"MooncakeConnector finish req not in reqs to process.If it is a P node, this request may have been force freed."
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This error message is not very helpful for debugging as it's missing the request_id. Also, it has a minor typo and could be formatted for better readability. Consider using % formatting to include the request_id and rephrasing the message to be clearer.

Suggested change
logger.error(
"MooncakeConnector finish req not in reqs to process.If it is a P node, this request may have been force freed."
)
logger.error(
"MooncakeConnector: Finished request '%s' not in requests to process. "
"If this is a P-node, it may have been force-freed due to a timeout.",
request_id,
)

@LCAIZJ LCAIZJ added ready read for review ready-for-test start test by label for PR labels Jan 14, 2026
@LCAIZJ LCAIZJ merged commit 1feac65 into vllm-project:releases/v0.13.0 Jan 14, 2026
16 checks passed
@liziyu179 liziyu179 deleted the fix_force_free_v0_13_0 branch January 14, 2026 08:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready read for review ready-for-test start test by label for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants