[Nixl] make deletion atomic in nixl request timeout handling #24268
wseaton wants to merge 3 commits into vllm-project:main
Conversation
Signed-off-by: Will Eaton <weaton@redhat.com>
Code Review
This pull request provides a hotfix for a race condition in NIXL timeout handling by making a dictionary deletion atomic. The change is correct, but as noted in the detailed comment, it appears to be incomplete, as a similar race condition persists in another part of the code that modifies the same shared state.
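The race the review describes is the classic check-then-act pattern on a shared dict. A minimal illustration (the dict name mirrors `_reqs_to_send`, but this is a standalone sketch, not the connector code):

```python
reqs = {"req-1": 123.0}

# Racy pattern (check-then-act): between the membership test and the
# delete, another thread can remove the key, and `del` then raises
# KeyError -- the double-free style failure reported in the bug.
if "req-1" in reqs:
    del reqs["req-1"]

# Atomic pattern: dict.pop with a default combines the check and the
# removal in a single operation, so at most one caller ever observes
# the entry and late callers get the sentinel instead of an exception.
entry = reqs.pop("req-1", None)  # None here: the key was already removed
```

The fix in this PR applies the same idea: collapse the separate existence check and deletion into one operation so only one code path can win.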
Signed-off-by: Will Eaton <weaton@redhat.com>
Force-pushed from 4922107 to 857953c
/gemini review
Code Review
This pull request aims to fix a race condition leading to a double-free by making the request deletion from _reqs_to_send atomic. The introduction of try_remove_request is a good approach for this. However, the changes in _get_new_notifs introduce a memory leak by processing notifications for untracked requests. I have added a review comment with a suggested fix to prevent this leak while maintaining the atomicity of the deletion.
```diff
             self.consumer_notification_counts_by_req[req_id] += 1
             # Wait all consumers (D) to be done reading before freeing.
             if self.consumer_notification_counts_by_req[req_id] == int(
                     tp_ratio):
-                notified_req_ids.add(req_id)
-                del self.consumer_notification_counts_by_req[req_id]
-                del self._reqs_to_send[req_id]
+                del self.consumer_notification_counts_by_req[req_id]
+                if self.try_remove_request(req_id, "consumer_complete"):
+                    notified_req_ids.add(req_id)
+                else:
+                    logger.debug(
+                        "Request %s completed by all consumers but was "
+                        "already removed (likely timed out)", req_id)
```
This change introduces a potential memory leak. By removing the check for req_id in self._reqs_to_send, a notification for a request that has already been completed or timed out will now increment self.consumer_notification_counts_by_req.
If tp_ratio > 1 and this is a late notification for an already-handled request, the condition self.consumer_notification_counts_by_req[req_id] == int(tp_ratio) may not be met. In this case, the entry for req_id will remain in self.consumer_notification_counts_by_req, causing a memory leak.
A check for req_id in self._reqs_to_send should be restored at the beginning of the loop to prevent this. This will also make the else branch of try_remove_request an exceptional case that should be logged as an error.
Suggested change:

```diff
-            self.consumer_notification_counts_by_req[req_id] += 1
-            # Wait all consumers (D) to be done reading before freeing.
-            if self.consumer_notification_counts_by_req[req_id] == int(
-                    tp_ratio):
-                del self.consumer_notification_counts_by_req[req_id]
-                if self.try_remove_request(req_id, "consumer_complete"):
-                    notified_req_ids.add(req_id)
-                else:
-                    logger.debug(
-                        "Request %s completed by all consumers but was "
-                        "already removed (likely timed out)", req_id)
+            if req_id not in self._reqs_to_send:
+                logger.debug("Ignoring notification for untracked request %s",
+                             req_id)
+                continue
+            self.consumer_notification_counts_by_req[req_id] += 1
+            if self.consumer_notification_counts_by_req[req_id] == int(
+                    tp_ratio):
+                del self.consumer_notification_counts_by_req[req_id]
+                if self.try_remove_request(req_id, "consumer_complete"):
+                    notified_req_ids.add(req_id)
+                else:
+                    logger.error("BUG: Failed to remove request %s", req_id)
```
Signed-off-by: Will Eaton <weaton@redhat.com>
Note the reporter was running v0.10.0. Here's what I see: each prefill server failed with the same pattern. Both prefill TP workers received 2 notifications for the same request, apparently after the request had already been deleted from _reqs_to_send. The number of notifications checks out: decode is TP=4, so the TP ratio is 2. The most straightforward explanation is that the request had already been timed out before we received these notifications. And #21753 was merged in v0.10.1, with what looks like a guard against this situation. With v0.10.1, I expect we'd see…
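The TP-ratio arithmetic in the analysis above can be checked directly. Assuming (as the comment implies) prefill TP=2 and decode TP=4, each prefill worker expects tp_ratio consumer notifications per request before freeing it:

```python
decode_tp = 4
prefill_tp = 2
tp_ratio = decode_tp // prefill_tp  # 2 notifications expected per request

# Each decode worker responsible for this prefill rank sends one
# notification; the request is freed once the count reaches tp_ratio.
counts: dict[str, int] = {}
for req_id in ["req-1", "req-1"]:  # two decode workers notify
    counts[req_id] = counts.get(req_id, 0) + 1

assert counts["req-1"] == tp_ratio  # request is now eligible to be freed
```

This matches the report: 2 notifications per request arriving at each prefill worker, after the timeout path had already deleted the entry.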
@markmc good catch. I'll ask the reproducer to do a bisect and test on v0.10.1 to see if the expiry warnings pop up; we have an image we can use for this.
I think we can close this in favor of #25067
This pull request has merge conflicts that must be resolved before it can be merged.
Hotfix for a race condition (double-free) reported in the llm-d project: llm-d/llm-d#187
Test plan: generating a standalone image build and testing under load; this issue only comes up in high-load P/D scenarios.