[P/D] Refactor mooncake connector sender thread using async coroutines by dtcccc · Pull Request #31573 · vllm-project/vllm

dtcccc · 2025-12-31T09:54:54Z

Purpose

This PR is split out from #31034 to make review easier.
It refactors the sender thread to use async coroutines. All related data now lives in a single thread, so the locks that protected it can be dropped. This makes the sender thread simpler and easier to maintain.
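
To illustrate the idea described above, here is a minimal sketch (not the connector's actual code) of an event loop running in a dedicated thread, where shared state is only touched by coroutines on that loop and therefore needs no lock. The class `SenderSketch` and the helpers `submit`, `pending_ids`, and `close` are hypothetical names chosen for illustration; only `reqs_need_send`, `record_send_reqs`, and `sender_loop` appear in the PR.

```python
import asyncio
import threading


class SenderSketch:
    """Hypothetical stand-in for the connector's sender side."""

    def __init__(self) -> None:
        # Touched only by coroutines running on sender_loop, so no lock.
        self.reqs_need_send: dict[str, list[int]] = {}
        self.sender_loop = asyncio.new_event_loop()
        self._thread = threading.Thread(
            target=self.sender_loop.run_forever, daemon=True
        )
        self._thread.start()

    async def record_send_reqs(self, reqs: dict[str, list[int]]) -> None:
        # Runs on the sender thread; a plain dict update is safe here.
        self.reqs_need_send.update(reqs)

    async def pending_ids(self) -> list[str]:
        return sorted(self.reqs_need_send)

    def submit(self, coro):
        # Callable from any thread; returns a concurrent.futures.Future.
        return asyncio.run_coroutine_threadsafe(coro, self.sender_loop)

    def close(self) -> None:
        # Stop the loop from outside, then wait for the thread to exit.
        self.sender_loop.call_soon_threadsafe(self.sender_loop.stop)
        self._thread.join()
        self.sender_loop.close()


sender = SenderSketch()
sender.submit(sender.record_send_reqs({"req-1": [0, 1, 2]})).result(timeout=5)
ids = sender.submit(sender.pending_ids()).result(timeout=5)
sender.close()
```

Because every read and write of `reqs_need_send` happens inside a coroutine on the same loop, the accesses are serialized by the event loop itself rather than by a `threading.Lock`.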

Test Plan

Test Result

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>

gemini-code-assist

Code Review

This pull request refactors the Mooncake connector's sender thread to use asyncio coroutines. This is a significant improvement that simplifies the code by removing manual lock management and leveraging modern asynchronous patterns. The state related to sending operations is now managed within a single event loop, which is a robust way to prevent race conditions. The changes are well-structured and follow best practices for integrating asyncio with threads and blocking operations. However, I found a critical issue where a silent failure can occur if a requested transfer ID is not found, potentially leading to data inconsistencies. I've provided a suggestion to fix this by raising an exception instead of returning silently.

gemini-code-assist · 2025-12-31T09:57:46Z

vllm/distributed/kv_transfer/kv_connector/v1/mooncake_connector.py

+            send_meta = self.reqs_need_send.get(req_id)
+            if send_meta is None:
+                logger.warning("Request %s not found in reqs_need_send", req_id)
+                return


Returning silently when a request ID is not found can lead to silent failures. The client that initiated the transfer will receive a TRANS_DONE message, making it believe the transfer was successful, while it was actually aborted. This can lead to data inconsistencies or hangs. It's better to raise an exception to signal an error condition, which will then be caught by the _sender_worker and result in a TRANS_ERROR message being sent back to the client.

Suggested change

-                return
+                raise ValueError(f"Request {req_id} not found in reqs_need_send")
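
The effect of this suggestion can be sketched as follows. This is a simplified illustration, not the connector's code: `handle_one` is a hypothetical stand-in for one iteration of `_sender_worker`, the byte values of `TRANS_DONE` and `TRANS_ERROR` are placeholders, and the transfer itself is replaced by a no-op.

```python
import asyncio

# Placeholder values; the real constants in the connector may differ.
TRANS_DONE = b"done"
TRANS_ERROR = b"error"


async def send_kv_to_decode(reqs_need_send: dict, req_id: str) -> None:
    send_meta = reqs_need_send.get(req_id)
    if send_meta is None:
        # Raising (instead of returning) lets the caller see the failure.
        raise ValueError(f"Request {req_id} not found in reqs_need_send")
    await asyncio.sleep(0)  # stand-in for the actual KV transfer


async def handle_one(reqs_need_send: dict, req_id: str) -> bytes:
    """Hypothetical single iteration of a sender worker."""
    try:
        await send_kv_to_decode(reqs_need_send, req_id)
    except Exception:
        return TRANS_ERROR
    return TRANS_DONE


async def demo() -> list[bytes]:
    table = {"known": object()}
    return [
        await handle_one(table, "known"),
        await handle_one(table, "missing"),
    ]


replies = asyncio.run(demo())
```

With a bare `return` in place of the `raise`, the second call would also fall through to `TRANS_DONE`, which is the silent failure the review describes.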

NickLucche

Thanks a lot for separating this change @dtcccc, this looks good!
Only left a few minor comments for style+clarity

vllm/distributed/kv_transfer/kv_connector/v1/mooncake_connector.py

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>

NickLucche

LGTM

cursor · 2026-01-12T10:57:54Z

vllm/distributed/kv_transfer/kv_connector/v1/mooncake_connector.py

-                        )
+            asyncio.run_coroutine_threadsafe(
+                self.record_send_reqs(metadata), self.sender_loop
+            )


Race condition between async scheduling and sender workers

Medium Severity

The refactoring replaces synchronous lock-protected updates with asyncio.run_coroutine_threadsafe, which schedules record_send_reqs but returns immediately before execution. Since _sender_worker tasks are already running on sender_loop and processing incoming ZMQ requests, if a decoder request arrives before record_send_reqs has executed, send_kv_to_decode will fail to find the request in reqs_need_send, log a warning, and return early without transferring data. The old code used threading.Lock to ensure the dict was updated before being read. The new code provides no such guarantee.

Additional Locations (1)

vllm/distributed/kv_transfer/kv_connector/v1/mooncake_connector.py#L568-L573
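
One possible way to restore the ordering guarantee described in this comment, sketched under assumptions and not taken from the PR, is to wait on the `concurrent.futures.Future` that `asyncio.run_coroutine_threadsafe` returns. The `lookup` coroutine and the module-level `loop`, `thread`, and `reqs_need_send` are hypothetical stand-ins for the connector's state.

```python
import asyncio
import threading

# Event loop running in its own thread, standing in for sender_loop.
loop = asyncio.new_event_loop()
thread = threading.Thread(target=loop.run_forever, daemon=True)
thread.start()

reqs_need_send: dict[str, str] = {}


async def record_send_reqs(metadata: dict[str, str]) -> None:
    reqs_need_send.update(metadata)


async def lookup(req_id: str):
    # Stand-in for a worker reading the dict when a request arrives.
    return reqs_need_send.get(req_id)


# run_coroutine_threadsafe only schedules the coroutine and returns at once.
fut = asyncio.run_coroutine_threadsafe(
    record_send_reqs({"req-1": "meta"}), loop
)

# Blocking on the future guarantees the dict was updated before we go on.
fut.result(timeout=5)

found = asyncio.run_coroutine_threadsafe(lookup("req-1"), loop).result(timeout=5)

# Shut the loop down cleanly.
loop.call_soon_threadsafe(loop.stop)
thread.join()
loop.close()
```

Blocking the calling thread has a latency cost, so this is only one option; another would be for the worker to wait until the entry appears rather than returning early.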

vllm-project#31573) Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>

vllm-project#31573) Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>

vllm-project#31573) Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>

[P/D] Refactor mooncake connector sender thread using async coroutines

ecec81f

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>

dtcccc requested review from ApostaC and NickLucche as code owners December 31, 2025 09:54

mergify bot added the kv-connector label Dec 31, 2025

gemini-code-assist bot reviewed Dec 31, 2025

NickLucche reviewed Jan 7, 2026

fixup

662def2

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>

NickLucche approved these changes Jan 12, 2026

NickLucche enabled auto-merge (squash) January 12, 2026 10:37

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jan 12, 2026

NickLucche disabled auto-merge January 12, 2026 10:40

Merge branch 'main' into dtcccc/p_con

a64dc77

NickLucche enabled auto-merge (squash) January 12, 2026 10:47

cursor bot reviewed Jan 12, 2026

NickLucche merged commit 0565f1f into vllm-project:main Jan 12, 2026
55 checks passed

dtcccc deleted the dtcccc/p_con branch January 15, 2026 05:39

NickLucche mentioned this pull request Feb 3, 2026

[Roadmap]: PD Disaggregation with NixlConnector Roadmap #33702

Open

44 tasks

NickLucche merged 3 commits into vllm-project:main from openanolis:dtcccc/p_con
