[Feature][Disaggregated] Support XpYd disaggregated prefill with MooncakeStore #12957
Signed-off-by: Shangming Cai <[email protected]>
@KuntaiDu Hello, this is the initial PR for the XpYd design we discussed before the Chinese New Year. I think it is ready for review :)
@Second222None Please raise a new issue in the Mooncake repo, and we will find someone to help you.
@DarkLight1337 Any idea why v1-test keeps failing?
It's a broken test on main. I'll just force-merge this.
```python
load_kvcache_key = f"{load_key_prefix}_{self.local_tp_rank}"
remote_kv = self.kv_store.get(load_kvcache_key)
hidden_key = f"{load_key_prefix}_hidden_{self.local_tp_rank}"
hidden = self.kv_store.get(hidden_key)
```
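The review snippet reads each tensor back from the store under a key built from a per-request prefix and the local tensor-parallel (TP) rank. A minimal sketch of that naming scheme follows, assuming a plain dict as a stand-in for `self.kv_store` (the real MooncakeStore client is not shown in this thread). `DictKVStore`, `kv_keys`, `store_kv`, and `load_kv` are hypothetical helpers, and the prefill-side `store_kv` is an assumed counterpart, not code from the PR.

```python
from typing import Any, Dict, Tuple


class DictKVStore:
    """Hypothetical in-memory stand-in for the kv_store used in the snippet."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get(self, key: str) -> Any:
        # Return None for a missing key (assumption; the real client may differ).
        return self._data.get(key)


def kv_keys(key_prefix: str, tp_rank: int) -> Tuple[str, str]:
    # Same f-string layout as the review snippet: one KV key and one
    # hidden-state key per request prefix and tensor-parallel rank.
    return f"{key_prefix}_{tp_rank}", f"{key_prefix}_hidden_{tp_rank}"


def store_kv(store: DictKVStore, key_prefix: str, tp_rank: int, kv: Any, hidden: Any) -> None:
    # Hypothetical prefill-side counterpart: each TP rank writes its own shard.
    kv_key, hidden_key = kv_keys(key_prefix, tp_rank)
    store.put(kv_key, kv)
    store.put(hidden_key, hidden)


def load_kv(store: DictKVStore, key_prefix: str, tp_rank: int) -> Tuple[Any, Any]:
    # Decode-side read, mirroring the two get() calls in the snippet.
    kv_key, hidden_key = kv_keys(key_prefix, tp_rank)
    return store.get(kv_key), store.get(hidden_key)


if __name__ == "__main__":
    store = DictKVStore()
    for rank in range(2):
        store_kv(store, "req42", rank, kv=f"kv-shard-{rank}", hidden=f"hidden-{rank}")
    print(load_kv(store, "req42", 1))  # ('kv-shard-1', 'hidden-1')
```

Because the TP rank is part of the key, a decode instance only lines up with the prefill shards if both sides use the same prefix and the same TP layout.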
Hi, can I ask a question? Do we need `kv_store.remove(hidden_key)` to free the memory in Mooncake?
Nope. We enable auto GC because prefix-based KV cache reuse has not been implemented yet. But this is a temporary state; we are working on the v1 design, which will include KV cache reuse and eviction.
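The answer above relies on the store reclaiming objects on its own rather than on an explicit `remove`. MooncakeStore's actual eviction policy is not described in this thread; the sketch below assumes a simple time-to-live (TTL) purge purely to illustrate why a caller can skip `kv_store.remove(hidden_key)` when automatic GC is enabled. `AutoGCStore` and its `ttl_s` parameter are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class AutoGCStore:
    """Hypothetical store that drops entries older than ttl_s seconds."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_s = ttl_s
        self._clock = clock
        self._data: Dict[str, Tuple[float, Any]] = {}

    def _purge(self) -> None:
        # Automatic GC: evict every entry whose age exceeds the TTL.
        now = self._clock()
        expired = [k for k, (t, _) in self._data.items() if now - t > self.ttl_s]
        for k in expired:
            del self._data[k]

    def put(self, key: str, value: Any) -> None:
        self._purge()
        self._data[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        self._purge()
        entry = self._data.get(key)
        return None if entry is None else entry[1]

    def __len__(self) -> int:
        return len(self._data)


if __name__ == "__main__":
    fake_now = [0.0]
    store = AutoGCStore(ttl_s=5.0, clock=lambda: fake_now[0])
    store.put("req42_hidden_0", "hidden")
    print(store.get("req42_hidden_0"))  # hidden
    fake_now[0] = 10.0                  # advance the fake clock past the TTL
    print(store.get("req42_hidden_0"))  # None: reclaimed without remove()
```

A fixed TTL is the crudest possible policy; the v1 design mentioned above plans to pair KV cache reuse with eviction, which this sketch does not model.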
I ran into trouble with XpYd disaggregated prefill with MooncakeStore. Any clue?
@dalong2hongmei You can provide a detailed log and open an issue in the Mooncake repo, and we will help you find the root cause.
The logs are stored on our corporate intranet, so they are hard to export here. I found an issue that posts error logs exactly the same as mine. I also want to know whether there are any prerequisites for the Docker container TCP scenario (or Kubernetes); for example, does it require host networking between the P/D instances?
@dalong2hongmei I followed the doc https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/vllm-integration-v1.md but got an error. What's wrong? I tested the connection to the Mooncake master and it is OK. The context is:
@morty-zxb, can you provide the full logs of the prefill node and the decode node, and open an issue in the Mooncake repo?
It turned out to be my fault: I had mistaken the IP parameters. Mooncake is great work! I ran into another problem: when I use vLLM's serving_benchmark script to benchmark a vLLM server with Mooncake, I often see a large prompt throughput in my decode instance's log, along with some pending requests. I suspect the decode instance is doing prefill itself; is that normal? @ShangmingCai P.S. I found that vLLM's `--enable-prefix-caching` conflicts with Mooncake: when I set this launch parameter, the instance sometimes crashes during benchmarking.
@ShangmingCai Hello, I noticed that #15343 has stalled. Do you have any further plans for MooncakeStore v1 integration, for example, a MooncakeStore v1 implementation, layer-wise transfer, or global KV cache reuse? I would highly appreciate any feedback from you.
@Zhangmj0621 Can you check this doc: https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/lmcacheV1-deployment.md. We have already integrated some implementations through LMCache.
Thanks for your reply! It was my mistake to think that LMCache and Mooncake are two different systems.
I have just one remaining question regarding the KV cache transfer mechanism: when using both LMCache and MooncakeStore, is the transfer handled by NIXL or by the Mooncake Transfer Engine?
@Zhangmj0621 Actually, both LMCache and NIXL have integrated Mooncake as a backend. You can try LMCache + MooncakeStore (which uses the Mooncake Transfer Engine to transfer KV cache), or LMCache + NIXL (with the Mooncake Transfer Engine as the backend).
Thanks for your detailed and kind response!

This is the initial PR to implement XpYd as described in the design doc, which has gone through several rounds of discussion in the #feat-prefill-disaggregation Slack channel and reached a preliminary consensus.
This PR mainly includes the changes related to `KVTransferParams`. In the XpYd implementation, we will use this parameter to coordinate metadata among all vLLM instances (prefill nodes and decode nodes), the KVStore Server, and the global proxy. With this parameter, the global proxy can attach the disaggregated prefill metadata to the `CompletionRequest` or `ChatCompletionRequest`, and all KV-transfer-related modules can access it through the `model_input` in `model_runner.execute_model` and use it to interact with the KVStore Server at the right place under the current 1P1D implementation.

Based on this initial PR, the Mooncake team will implement the `KVStoreConnector` in the next PR. It utilizes `KVTransferParams` and supports several third-party key-value stores, such as MooncakeStore, Valkey, and LMCache, acting as the `KVLookupBuffer`. With `KVStoreConnector`, we can decouple the connections between the prefill nodes and the decode nodes, making the entire XpYd implementation simpler and more stable.

The metadata required in `KVTransferParams` will be provided by a global proxy (which we can also call the XpYd orchestration mechanism). Therefore, a global proxy interface will also be added in the next PR. The Mooncake team will implement a simple round-robin version. In production, service providers and users can implement their own global proxy strategies according to their needs.

Since some parties seem to be concerned about the OpenAI API changes, we refactored this PR to support disaggregated prefill with MooncakeStore, a new distributed object store for XpYd PD disaggregation. We also provide a simple disaggregation proxy demo as an example of XpYd disaggregated prefill.
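To make the `KVTransferParams` metadata flow concrete, here is a minimal sketch of a proxy attaching disaggregation metadata to an OpenAI-style completion request body, which a worker then reads back. It reflects the original request-level design described above, not the refactored MooncakeStore version. `KVTransferParamsSketch`, its fields, the `kv_transfer_params` body key, and both helper functions are hypothetical; the real field names are not listed in this PR description.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class KVTransferParamsSketch:
    """Hypothetical stand-in for KVTransferParams; field names are illustrative."""
    request_id: str
    prefill_instance: str   # address of the prefill (P) node chosen by the proxy
    decode_instance: str    # address of the decode (D) node chosen by the proxy
    kv_key_prefix: str      # prefix both sides use to name objects in the KV store


def attach_params(request_body: Dict[str, Any], params: KVTransferParamsSketch) -> Dict[str, Any]:
    # Proxy side: copy the request and add the metadata under a hypothetical key.
    enriched = dict(request_body)
    enriched["kv_transfer_params"] = asdict(params)
    return enriched


def read_params(request_body: Dict[str, Any]) -> KVTransferParamsSketch:
    # Worker side: rebuild the params so KV-transfer code can find the right keys.
    return KVTransferParamsSketch(**request_body["kv_transfer_params"])


if __name__ == "__main__":
    body = {"model": "my-model", "prompt": "Hello", "max_tokens": 16}
    params = KVTransferParamsSketch("req-1", "10.0.0.1:8100", "10.0.0.2:8200", "req-1")
    sent = attach_params(body, params)
    assert read_params(sent) == params
    print(sent["kv_transfer_params"]["decode_instance"])  # 10.0.0.2:8200
```

Copying the body instead of mutating it keeps the client's original request untouched, so the proxy can reuse it for retries.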
Integration guide: https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/vllm-integration-v1.md
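The PR description says the Mooncake team will start with a round-robin global proxy. Below is a minimal sketch of that routing decision alone, independent of the proxy demo shipped with the PR (whose code is not reproduced here). `RoundRobinXpYdProxy` and the instance URLs are hypothetical. Because `KVStoreConnector` decouples prefill and decode nodes, the sketch picks the P node and the D node independently.

```python
import itertools
from typing import Iterator, List, Sequence, Tuple


class RoundRobinXpYdProxy:
    """Hypothetical global proxy that pairs X prefill and Y decode instances round-robin."""

    def __init__(self, prefill_urls: Sequence[str], decode_urls: Sequence[str]) -> None:
        if not prefill_urls or not decode_urls:
            raise ValueError("need at least one prefill and one decode instance")
        # Independent cycles, so X and Y can differ (e.g. 2P3D).
        self._prefill: Iterator[str] = itertools.cycle(prefill_urls)
        self._decode: Iterator[str] = itertools.cycle(decode_urls)

    def route(self) -> Tuple[str, str]:
        # Each request gets the next prefill node and the next decode node.
        return next(self._prefill), next(self._decode)


if __name__ == "__main__":
    proxy = RoundRobinXpYdProxy(["p0", "p1"], ["d0", "d1", "d2"])
    routes: List[Tuple[str, str]] = [proxy.route() for _ in range(4)]
    print(routes)  # [('p0', 'd0'), ('p1', 'd1'), ('p0', 'd2'), ('p1', 'd0')]
```

Round-robin ignores load. As the description notes, production deployments can replace `route` with their own strategy, for example one that picks the least-loaded instance.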
Because the communication strategy among the P node, D node, global proxy, and KV store is still under discussion, we will support layer-by-layer KV cache transfer to further optimize TTFT once we figure out how to coordinate with the proxy in the next PR.
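A back-of-the-envelope model shows why layer-by-layer transfer helps TTFT. It rests on simplifying assumptions, not measurements: prefill spends `compute_ms` per layer, moving one layer's KV cache costs `transfer_ms` (lumping the store round trip into this one number), and sending layer i can overlap with computing layer i+1. Sending everything after prefill finishes adds the full transfer time to compute. Pipelining turns this into a two-stage pipeline whose makespan is `compute_ms + transfer_ms + (L - 1) * max(compute_ms, transfer_ms)`. The function `ttft_ms` is hypothetical.

```python
def ttft_ms(num_layers: int, compute_ms: float, transfer_ms: float, layerwise: bool) -> float:
    """Toy estimate of prefill plus KV transfer time before decode can start."""
    if not layerwise:
        # Finish all layers, then send the whole KV cache.
        return num_layers * compute_ms + num_layers * transfer_ms
    # Two-stage pipeline: transfer of layer i overlaps compute of layer i+1,
    # so only the slower stage repeats for the remaining num_layers - 1 layers.
    return compute_ms + transfer_ms + (num_layers - 1) * max(compute_ms, transfer_ms)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    print(ttft_ms(32, 2.0, 1.0, layerwise=False))  # 96.0
    print(ttft_ms(32, 2.0, 1.0, layerwise=True))   # 65.0
```

The model ignores proxy and store latency, so it only shows the shape of the gain: the savings grow with layer count and vanish when the model has a single layer.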
CC list: @KuntaiDu, @youkaichao, @james0zan, @alogfans, @stmatengss, @zeroorhero, @pizhenwei