[0.13.0][cherry-pick][P/D] bugfix for p node force free requset (#5431)#5871
Conversation
### What this PR does / why we need it? Fix the bug where the P-node's schedule dead after it force-frees a request due to timeout and then receives the completed kv cache pulled by the D-node again. By add list to recode all requests. - vLLM version: release/v0.13.0 - vLLM main: vllm-project/vllm@81786c8 --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
There was a problem hiding this comment.
Code Review
This pull request addresses a bug that could cause a scheduler hang on a P-node when a request is force-freed due to a timeout. The fix introduces a mechanism to track all requests currently being processed. By checking against this set of active requests, the system can now correctly handle completion signals for requests that have already been freed, preventing the hang. The overall logic is sound and the changes appear to correctly resolve the issue. I have added one high-severity comment to improve an error log message, making it more effective for debugging by including the relevant request ID and improving clarity.
| logger.error( | ||
| "MooncakeConnector finish req not in reqs to process.If it is a P node, this request may have been force freed." | ||
| ) |
There was a problem hiding this comment.
This error message is not very helpful for debugging as it's missing the request_id. Also, it has a minor typo and could be formatted for better readability. Consider using % formatting to include the request_id and rephrasing the message to be clearer.
| logger.error( | |
| "MooncakeConnector finish req not in reqs to process.If it is a P node, this request may have been force freed." | |
| ) | |
| logger.error( | |
| "MooncakeConnector: Finished request '%s' not in requests to process. " | |
| "If this is a P-node, it may have been force-freed due to a timeout.", | |
| request_id, | |
| ) |
What this PR does / why we need it?
Fix the bug where the P-node's schedule dead after it force-frees a request due to timeout and then receives the completed kv cache pulled by the D-node again. By add list to recode all requests.
Does this PR introduce any user-facing change?
How was this patch tested?