[P/D] Refactor mooncake connector sender thread using async coroutines#31573
[P/D] Refactor mooncake connector sender thread using async coroutines#31573NickLucche merged 3 commits intovllm-project:mainfrom
Conversation
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
There was a problem hiding this comment.
Code Review
This pull request refactors the Mooncake connector's sender thread to use asyncio coroutines. This is a significant improvement that simplifies the code by removing manual lock management and leveraging modern asynchronous patterns. The state related to sending operations is now managed within a single event loop, which is a robust way to prevent race conditions. The changes are well-structured and follow best practices for integrating asyncio with threads and blocking operations. However, I found a critical issue where a silent failure can occur if a requested transfer ID is not found, potentially leading to data inconsistencies. I've provided a suggestion to fix this by raising an exception instead of returning silently.
| send_meta = self.reqs_need_send.get(req_id) | ||
| if send_meta is None: | ||
| logger.warning("Request %s not found in reqs_need_send", req_id) | ||
| return |
There was a problem hiding this comment.
Returning silently when a request ID is not found can lead to silent failures. The client that initiated the transfer will receive a TRANS_DONE message, making it believe the transfer was successful, while it was actually aborted. This can lead to data inconsistencies or hangs. It's better to raise an exception to signal an error condition, which will then be caught by the _sender_worker and result in a TRANS_ERROR message being sent back to the client.
| return | |
| raise ValueError(f"Request {req_id} not found in reqs_need_send") |
NickLucche
left a comment
There was a problem hiding this comment.
Thanks a lot for separating this change @dtcccc , this looks like good!
Only left a few minor comments for style+clarity
vllm/distributed/kv_transfer/kv_connector/v1/mooncake_connector.py
Outdated
Show resolved
Hide resolved
vllm/distributed/kv_transfer/kv_connector/v1/mooncake_connector.py
Outdated
Show resolved
Hide resolved
vllm/distributed/kv_transfer/kv_connector/v1/mooncake_connector.py
Outdated
Show resolved
Hide resolved
vllm/distributed/kv_transfer/kv_connector/v1/mooncake_connector.py
Outdated
Show resolved
Hide resolved
| ) | ||
| asyncio.run_coroutine_threadsafe( | ||
| self.record_send_reqs(metadata), self.sender_loop | ||
| ) |
There was a problem hiding this comment.
Race condition between async scheduling and sender workers
Medium Severity
The refactoring replaces synchronous lock-protected updates with asyncio.run_coroutine_threadsafe, which schedules record_send_reqs but returns immediately before execution. Since _sender_worker tasks are already running on sender_loop and processing incoming ZMQ requests, if a decoder request arrives before record_send_reqs has executed, send_kv_to_decode will fail to find the request in reqs_need_send, log a warning, and return early without transferring data. The old code used threading.Lock to ensure the dict was updated before being read. The new code provides no such guarantee.
Additional Locations (1)
vllm-project#31573) Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
vllm-project#31573) Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
vllm-project#31573) Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
vllm-project#31573) Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
vllm-project#31573) Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Purpose
This is a separate PR for #31034 to help review.
This PR refactored sender thread using async coroutines. All related data are in the same thread so that we can drop their locks. This makes the sender thread simple and easy to maintain.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.Note
Cursor Bugbot is generating a summary for commit a64dc77. Configure here.