
[P/D] Refactor mooncake connector sender thread using async coroutines#31573

Merged
NickLucche merged 3 commits into vllm-project:main from openanolis:dtcccc/p_con
Jan 12, 2026

Conversation

@dtcccc dtcccc commented Dec 31, 2025

Purpose

This PR is split out from #31034 to make review easier.
It refactors the sender thread to use async coroutines. All sender-related state now lives in a single thread, so the locks that protected it can be dropped. This makes the sender thread simpler and easier to maintain.
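
As a rough illustration of the idea, and not the connector's actual code, the sketch below runs an asyncio event loop in a dedicated thread and lets only coroutines on that loop touch the shared dict. The `Sender` class, `pop_req`, `submit`, and `close` are hypothetical names; `reqs_need_send` and `record_send_reqs` are borrowed from the PR discussion, but their bodies here are assumptions.

```python
import asyncio
import threading
from typing import Optional


class Sender:
    """Hypothetical sender whose state lives entirely on one event loop."""

    def __init__(self) -> None:
        # Only coroutines running on self.loop touch this dict,
        # so no threading.Lock is needed to protect it.
        self.reqs_need_send: dict = {}
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.thread.start()

    async def record_send_reqs(self, reqs: dict) -> None:
        self.reqs_need_send.update(reqs)

    async def pop_req(self, req_id: str) -> Optional[list]:
        return self.reqs_need_send.pop(req_id, None)

    def submit(self, coro):
        """Schedule a coroutine on the sender loop from any thread."""
        return asyncio.run_coroutine_threadsafe(coro, self.loop)

    def close(self) -> None:
        # Stop the loop from outside its thread, then clean up.
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join()
        self.loop.close()


sender = Sender()
# .result() blocks the calling thread until the coroutine has run on the loop.
sender.submit(sender.record_send_reqs({"req-0": [1, 2, 3]})).result(timeout=5)
blocks = sender.submit(sender.pop_req("req-0")).result(timeout=5)
missing = sender.submit(sender.pop_req("req-0")).result(timeout=5)
sender.close()
```

Because every read and write of `reqs_need_send` happens inside a coroutine on the same loop, accesses are serialized by the loop itself rather than by a lock.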

Test Plan

Test Result



Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the Mooncake connector's sender thread to use asyncio coroutines. This is a significant improvement that simplifies the code by removing manual lock management and leveraging modern asynchronous patterns. The state related to sending operations is now managed within a single event loop, which is a robust way to prevent race conditions. The changes are well-structured and follow best practices for integrating asyncio with threads and blocking operations. However, I found a critical issue where a silent failure can occur if a requested transfer ID is not found, potentially leading to data inconsistencies. I've provided a suggestion to fix this by raising an exception instead of returning silently.

send_meta = self.reqs_need_send.get(req_id)
if send_meta is None:
    logger.warning("Request %s not found in reqs_need_send", req_id)
    return

critical

Returning silently when a request ID is not found can lead to silent failures. The client that initiated the transfer will receive a TRANS_DONE message, making it believe the transfer was successful, while it was actually aborted. This can lead to data inconsistencies or hangs. It's better to raise an exception to signal an error condition, which will then be caught by the _sender_worker and result in a TRANS_ERROR message being sent back to the client.

Suggested change
-     return
+     raise ValueError(f"Request {req_id} not found in reqs_need_send")
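
The effect of the suggestion can be sketched as follows. This is a simplified stand-in without ZMQ, not the connector's real worker: `handle_request` is a hypothetical helper, the byte values of `TRANS_DONE` and `TRANS_ERROR` are placeholders, and `send_kv_to_decode` here takes the dict as an argument purely for illustration.

```python
import asyncio

TRANS_DONE = b"trans_done"    # placeholder values; the real constants may differ
TRANS_ERROR = b"trans_error"


async def send_kv_to_decode(reqs_need_send: dict, req_id: str) -> None:
    send_meta = reqs_need_send.get(req_id)
    if send_meta is None:
        # Raising (rather than returning) lets the caller report the failure.
        raise ValueError(f"Request {req_id} not found in reqs_need_send")
    await asyncio.sleep(0)  # stand-in for the actual KV transfer


async def handle_request(reqs_need_send: dict, req_id: str) -> bytes:
    """Return the status a worker would send back to the decoder."""
    try:
        await send_kv_to_decode(reqs_need_send, req_id)
    except Exception:
        return TRANS_ERROR
    return TRANS_DONE


pending = {"req-0": object()}
ok_status = asyncio.run(handle_request(pending, "req-0"))
bad_status = asyncio.run(handle_request(pending, "req-missing"))
```

With a bare `return` in place of the `raise`, `handle_request` would report `TRANS_DONE` for the missing request as well, which is the silent failure the review comment describes.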


@NickLucche NickLucche left a comment


Thanks a lot for separating this change @dtcccc, this looks good!
I only left a few minor comments on style and clarity.

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>

@NickLucche NickLucche left a comment


LGTM

@NickLucche NickLucche enabled auto-merge (squash) January 12, 2026 10:37
@github-actions github-actions bot added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Jan 12, 2026
@NickLucche NickLucche disabled auto-merge January 12, 2026 10:40
@NickLucche NickLucche enabled auto-merge (squash) January 12, 2026 10:47
)
asyncio.run_coroutine_threadsafe(
    self.record_send_reqs(metadata), self.sender_loop
)

Race condition between async scheduling and sender workers

Medium Severity

The refactoring replaces synchronous lock-protected updates with asyncio.run_coroutine_threadsafe, which schedules record_send_reqs but returns immediately before execution. Since _sender_worker tasks are already running on sender_loop and processing incoming ZMQ requests, if a decoder request arrives before record_send_reqs has executed, send_kv_to_decode will fail to find the request in reqs_need_send, log a warning, and return early without transferring data. The old code used threading.Lock to ensure the dict was updated before being read. The new code provides no such guarantee.
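
One conceivable way to tolerate this ordering, offered only as a sketch and not as what the PR does, is to let the lookup wait (with a timeout) until the request has been recorded. `PendingSends`, `record`, and `wait_for` are hypothetical names, and the demo runs on a single loop to keep the ordering deterministic.

```python
import asyncio


class PendingSends:
    """Hypothetical registry; lookups wait until the request is recorded."""

    def __init__(self) -> None:
        self._ready: dict = {}

    def _slot(self, req_id: str) -> asyncio.Future:
        # Create the future lazily so either side may arrive first.
        if req_id not in self._ready:
            self._ready[req_id] = asyncio.get_running_loop().create_future()
        return self._ready[req_id]

    async def record(self, req_id: str, meta: object) -> None:
        fut = self._slot(req_id)
        if not fut.done():
            fut.set_result(meta)

    async def wait_for(self, req_id: str, timeout: float) -> object:
        # shield() keeps the shared future alive if this waiter times out.
        return await asyncio.wait_for(asyncio.shield(self._slot(req_id)), timeout)


async def demo() -> tuple:
    pending = PendingSends()
    # The decoder's request arrives first and starts waiting.
    waiter = asyncio.create_task(pending.wait_for("req-0", timeout=1.0))
    await asyncio.sleep(0)  # let the waiter run up to its await
    await pending.record("req-0", "meta-0")
    found = await waiter
    try:
        await pending.wait_for("req-unknown", timeout=0.01)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return found, timed_out


found, timed_out = asyncio.run(demo())
```

The trade-off is that a request which is never recorded now costs a timeout instead of failing immediately, so the timeout value would need to be chosen with the decoder's retry behavior in mind.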


@NickLucche NickLucche merged commit 0565f1f into vllm-project:main Jan 12, 2026
55 checks passed
TomerBN-Nvidia pushed a commit to TomerBN-Nvidia/vllm that referenced this pull request Jan 13, 2026
vllm-project#31573)

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
@dtcccc dtcccc deleted the dtcccc/p_con branch January 15, 2026 05:39
sammysun0711 pushed a commit to sammysun0711/vllm that referenced this pull request Jan 16, 2026
vllm-project#31573)

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
vllm-project#31573)

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
vllm-project#31573)

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
vllm-project#31573)

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>

Labels

kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed)
