[Bugfix][NIXL] Fix Async Scheduler timeout issue #25808
tlrmchlsmth merged 1 commit into vllm-project:main
Conversation
Signed-off-by: NickLucche <nlucches@redhat.com>
Code Review
This pull request addresses a race condition in the asynchronous scheduler that could lead to requests being incorrectly marked as expired. The fix introduces a new state to track requests within a processing batch, preventing premature expiration and erroneous error logging. The logic appears sound and effectively resolves the described issue. I have one suggestion to enhance the robustness of the implementation.
    "Releasing expired KV blocks for request %s which were "
    "retrieved by %d decode worker(s) within %d seconds.", req_id,
    count, envs.VLLM_NIXL_ABORT_REQUEST_TIMEOUT)
self._reqs_to_process.remove(req_id)
Using discard() instead of remove() would be more robust. In a complex distributed system with asynchronous operations, it is safer to use discard() to avoid a potential KeyError if the req_id is unexpectedly absent from the set due to a race condition. This would make the worker more resilient.
A similar change would be beneficial on line 1126.
- self._reqs_to_process.remove(req_id)
+ self._reqs_to_process.discard(req_id)
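To illustrate the suggestion, a minimal standalone sketch (set contents are hypothetical) of why discard() is safer when a concurrent path may already have removed the entry:

```python
# remove() raises KeyError if the element is gone; discard() is a no-op.
reqs_to_process = {"req-1", "req-2"}

reqs_to_process.discard("req-1")  # removes the element
reqs_to_process.discard("req-1")  # already gone: silently does nothing

try:
    reqs_to_process.remove("req-2")  # removes the element
    reqs_to_process.remove("req-2")  # already gone: raises KeyError
except KeyError:
    print("remove() raised KeyError for a missing element")
```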
tlrmchlsmth left a comment
Tested in a WideEP setup -- fixes crashes I was seeing
Thanks for this @NickLucche. I'm wondering whether there's a possibility of a "leak" of entries in that set. In this case the expiration won't be set and there will be no nixl notifications, so I can't see where the corresponding req_id would ever be removed from the set.
To keep prefill (P) KV blocks from being stranded when a request is aborted during decode, a TTL/timeout was introduced in #20139.
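As a rough sketch of that TTL mechanism (hypothetical names, not the vLLM implementation; in the spirit of VLLM_NIXL_ABORT_REQUEST_TIMEOUT), the P side can record a per-request deadline and periodically sweep for requests whose blocks the decode (D) side never read in time:

```python
import time

ABORT_TIMEOUT_S = 120.0  # assumed timeout value, for illustration only

# Map req_id -> absolute deadline (monotonic clock) by which D must
# have read the KV blocks produced by prefill.
reqs_to_send: dict[str, float] = {}

def mark_prefill_done(req_id: str) -> None:
    # P side: prefill finished, start the clock for D to read the blocks.
    reqs_to_send[req_id] = time.monotonic() + ABORT_TIMEOUT_S

def expire_stale_requests() -> list[str]:
    # Periodic sweep: release KV blocks for requests D never read in time.
    now = time.monotonic()
    expired = [r for r, deadline in reqs_to_send.items() if now >= deadline]
    for r in expired:
        del reqs_to_send[r]  # the KV blocks would be freed here
    return expired
```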
With async scheduling and high request concurrency, this log line (and the block-release logic that follows it) can be observed firing for requests that are still being served. The current workflow on main is something like this:
Now, if D is allowed to decode the requests and READ blocks from P between iter_i and iter_i+1 (this happens on a separate async channel, not tied to the scheduler step), the log.error can be observed because self._reqs_to_send has not yet been updated! Basically, async scheduling widens the misalignment between the moment request expiration is set (P side) and the moment blocks are read from D, by letting the scheduler run one step ahead and the runner pick up the next batch right away.
FIX #25777
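The shape of the fix described above can be sketched as follows (hypothetical helper names, not the actual vLLM code): a second set tracks requests still inside an in-flight batch, and a read notification arriving early, on the async channel, prevents a bogus timeout from ever being armed when the batch completes.

```python
import time

ABORT_TIMEOUT_S = 120.0  # assumed timeout value, for illustration only

reqs_to_send: dict[str, float] = {}  # req_id -> expiry deadline (P side)
reqs_to_process: set[str] = set()    # requests still inside an async batch

def start_processing(req_id: str) -> None:
    # Scheduler handed the request to the runner; it is now in flight.
    reqs_to_process.add(req_id)

def notify_blocks_read(req_id: str) -> None:
    # D read the blocks on the side-channel. This may land while the
    # request is still in an in-flight batch on P: drop it from both
    # structures so no timeout is (or remains) armed.
    reqs_to_send.pop(req_id, None)
    reqs_to_process.discard(req_id)

def finish_processing(req_id: str) -> None:
    # Batch completed: arm the abort timeout only if D hasn't read yet.
    if req_id in reqs_to_process:
        reqs_to_process.discard(req_id)
        reqs_to_send[req_id] = time.monotonic() + ABORT_TIMEOUT_S

def expire_stale_requests() -> list[str]:
    # Periodic sweep over armed deadlines only.
    now = time.monotonic()
    expired = [r for r, d in reqs_to_send.items() if now >= d]
    for r in expired:
        del reqs_to_send[r]  # KV blocks would be released here
    return expired
```

With this ordering, a READ that races ahead of the scheduler step simply clears the request from reqs_to_process, so finish_processing never starts the expiration clock for it.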