
Conversation

@ShangmingCai (Contributor) commented Feb 8, 2025

This is the initial PR to implement XpYd as described in the design doc, which has gone through several rounds of discussion in the #feat-prefill-disaggregation Slack channel and reached a preliminary consensus.

This PR mainly includes the changes related to KVTransferParams. In the XpYd implementation, we will use this parameter to coordinate metadata among all vLLM instances (prefill nodes and decode nodes), the KVStore Server, and the global proxy. With this param, the global proxy can attach the disaggregated prefill metadata to the CompletionRequest or ChatCompletionRequest, and all kv-transfer-related modules can access it through model_input in model_runner.execute_model and use these params to interact with the KVStore Server at the right place under the current 1P1D implementation.
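To make the flow concrete, here is a minimal illustrative sketch of how a global proxy could attach such metadata to a request. The field names (kv_transfer_role, kv_store_key_prefix) are hypothetical and are not the exact schema introduced by this PR:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KVTransferParams:
    # Hypothetical fields, for illustration only.
    kv_transfer_role: Optional[str] = None       # "prefill" or "decode"
    kv_store_key_prefix: Optional[str] = None    # key prefix for this request's KV cache

# The global proxy would attach these fields to the request body it forwards,
# e.g. as an extra field of a CompletionRequest:
request_body = {
    "model": "some-model",
    "prompt": "Hello",
    "kv_transfer_params": {
        "kv_transfer_role": "prefill",
        "kv_store_key_prefix": "req-42",
    },
}
```

On the engine side, the same params would then be visible to the kv-transfer modules alongside model_input when execute_model runs.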

Based on this initial PR, the Mooncake team will implement the KVStoreConnector in the next PR, which utilizes KVTransferParams and supports several third-party key-value stores, such as MooncakeStore, Valkey, and LMCache, acting as the KVLookupBuffer. With KVStoreConnector, we can decouple the connections between the prefill nodes and the decode nodes, making the entire XpYd implementation simpler and more stable.
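The connector idea is sketched below under stated assumptions: the class and method names are illustrative (not the actual KVStoreConnector API), and an in-memory dict stands in for MooncakeStore/Valkey/LMCache. The point is that prefill and decode nodes only ever talk to the store, never directly to each other:

```python
class InMemoryKVStore:
    """Stands in for MooncakeStore / Valkey / LMCache in this sketch."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class KVStoreConnectorSketch:
    """Illustrative only; not the real connector interface."""
    def __init__(self, store, tp_rank: int):
        self.store = store
        self.tp_rank = tp_rank

    def send_kv(self, key_prefix: str, kv_cache, hidden_states):
        # Prefill side: publish KV cache and hidden states under per-rank keys.
        self.store.put(f"{key_prefix}_{self.tp_rank}", kv_cache)
        self.store.put(f"{key_prefix}_hidden_{self.tp_rank}", hidden_states)

    def recv_kv(self, key_prefix: str):
        # Decode side: fetch whatever the prefill node published for this request.
        kv_cache = self.store.get(f"{key_prefix}_{self.tp_rank}")
        hidden_states = self.store.get(f"{key_prefix}_hidden_{self.tp_rank}")
        return kv_cache, hidden_states
```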

The metadata required by KVTransferParams will be provided by a global proxy (or we can call it the XpYd orchestration mechanism). Therefore, a global proxy interface will be added in the next PR as well. The Mooncake team will implement a simple version based on round-robin. In production, service providers and users can also implement their own global proxy strategies according to their needs.
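For intuition, a round-robin proxy can be as small as the following sketch (illustrative only; the names and the routing payload are assumptions, not the demo shipped with this PR):

```python
from itertools import cycle

class RoundRobinProxy:
    """Picks the next prefill and decode instance for each request."""
    def __init__(self, prefill_urls, decode_urls):
        self._prefill = cycle(prefill_urls)
        self._decode = cycle(decode_urls)

    def route(self, request_id: str):
        # Tag the request so both sides agree on the KVStore key prefix.
        return {
            "prefill_url": next(self._prefill),
            "decode_url": next(self._decode),
            "kv_transfer_params": {"kv_store_key_prefix": request_id},
        }

proxy = RoundRobinProxy(["http://p0:8000", "http://p1:8000"], ["http://d0:8000"])
print(proxy.route("req-42"))
```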

Since some parties seem to be concerned about the OpenAI API changes, we have refactored this PR to support disaggregated prefill with MooncakeStore, a new distributed object store for XpYd PD disaggregation. We also provide a simple disaggregation proxy demo as an example of supporting XpYd disaggregated prefill.

Integration guide: https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/vllm-integration-v1.md

Since the communication strategy among the prefill node, decode node, global proxy, and KVStore is still under discussion, we will support layer-by-layer KV cache transfer to further optimize TTFT in the next PR, once we have figured out how to coordinate with the proxy.

CC list: @KuntaiDu, @youkaichao, @james0zan, @alogfans, @stmatengss, @zeroorhero, @pizhenwei

@github-actions bot commented Feb 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Feb 8, 2025
@ShangmingCai force-pushed the add_kv_transfer_params branch from 6cdcc91 to a45736e on February 11, 2025 04:03
@ShangmingCai changed the title from [WIP] Add KVTransferParams for disaggregated prefill feature to [Feature][Frontend] Add KVTransferParams for disaggregated prefill feature on Feb 11, 2025
@ShangmingCai (Contributor, Author)

@KuntaiDu Hello, this is the initial PR for the XpYd design as we discussed before the Chinese New Year. I think it is ready for review :)

@ShangmingCai force-pushed the add_kv_transfer_params branch from 70db613 to 1f194db on February 12, 2025 11:15
@KuntaiDu KuntaiDu self-assigned this Feb 12, 2025
@Second222None commented Mar 28, 2025

Nice work. But we ran into the following problems while reproducing. Can anyone give some advice?
(screenshot of the error attached)

@ShangmingCai (Contributor, Author)

> Nice work. But we ran into the following problems while reproducing. Can anyone give some advice?

@Second222None Please raise a new issue in the Mooncake repo; we will find someone to help you.

@ShangmingCai (Contributor, Author)

@DarkLight1337 Any idea why v1-test keeps failing?

@DarkLight1337 (Member)

It's a broken test on main. I'll just force-merge this

@vllm-bot vllm-bot merged commit 6fa7cd3 into vllm-project:main Mar 29, 2025
37 of 39 checks passed
# Load the KV cache and hidden states that the prefill node stored in the
# KV store, keyed by the request's key prefix and the local TP rank.
load_kvcache_key = f"{load_key_prefix}_{self.local_tp_rank}"
remote_kv = self.kv_store.get(load_kvcache_key)
hidden_key = f"{load_key_prefix}_hidden_{self.local_tp_rank}"
hidden = self.kv_store.get(hidden_key)

Hi, can I ask a question? Do we need kv_store.remove(hidden_key) to free the corresponding memory in Mooncake?

@ShangmingCai (Contributor, Author)

Nope. We enable auto GC since prefix-based KV cache reuse has not been implemented yet. But this is a temporary state; we are working on the v1 design, which will include KV cache reuse and eviction.

Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
@dalong2hongmei
I ran into trouble with XpYd disaggregated prefill with MooncakeStore:

1. It works perfectly following this tutorial (TCP protocol, vLLM 0.8.3, 1P1D):
https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/vllm-integration-v1.md
2. Following the above, I deployed 2 pods using k8s, one pod per prefill or decode role, also with the TCP protocol, but it did not work this time.
The prefill pod could always send the KV cache successfully,
but the decode pod could not receive it. An error I always see in the decode pod's log:
client.cpp:428] Transfer failed for task0
.....
client.cpp:177]transfer_read_failed_key=2580549111205550000_0

Any clue?

@ShangmingCai (Contributor, Author)

> Any clue?

@dalong2hongmei You can provide a detailed log and open an issue in the Mooncake repo; we will help you find the root cause.

@dalong2hongmei
> Any clue?
>
> @dalong2hongmei You can provide a detailed log and open an issue in the Mooncake repo; we will help you find the root cause.

The logs are stored on the corporate intranet; it is hard to export them here...

I found an issue posting error logs exactly the same as mine:
kvcache-ai/Mooncake#180

I also want to figure out whether there are any prerequisites in the Docker container TCP scenario (or k8s), for example, whether host networking is needed between the P/D instances?

@XiaobinZhao commented Apr 18, 2025

@dalong2hongmei I followed the doc https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/vllm-integration-v1.md but got this error:

2025-04-17 19:46:42.566927 ERROR    [975] [coro_rpc_client.hpp:978] read rpc body failed, error msg:End of file. close the socket.
E0417 19:46:42.567027   481 master_client.cpp:192] Failed to mount segment due to rpc error
E0417 19:46:42.567034   481 store_py.cpp:191] Failed to mount segment: UNKNOWN_ERROR
E0417 19:46:42.567068   484 master_client.cpp:192] Failed to mount segment due to rpc error
E0417 19:46:42.567075   484 store_py.cpp:191] Failed to mount segment: UNKNOWN_ERROR
2025-04-17 19:46:42.566977 ERROR    [953] [coro_rpc_client.hpp:978] read rpc body failed, error msg:End of file. close the socket.
I0417 19:46:42.567740   483 client.cpp:48] transport_type=tcp
I0417 19:46:42.568464   482 client.cpp:48] transport_type=tcp
I0417 19:46:42.568614   483 tcp_transport.cpp:224] TcpTransport: listen on port 16053
I0417 19:46:42.568714   483 allocator.cpp:99] initializing_simple_allocator size=1073741824
I0417 19:46:42.568770   483 allocator.cpp:138] simple_allocator_initialized pool_id=0
I0417 19:46:42.569110   482 tcp_transport.cpp:224] TcpTransport: listen on port 16517
I0417 19:46:42.569218   482 allocator.cpp:99] initializing_simple_allocator size=1073741824
I0417 19:46:42.569280   482 allocator.cpp:138] simple_allocator_initialized pool_id=0
E0417 19:46:42.569809   483 master_client.cpp:192] Failed to mount segment due to rpc error

What's wrong?

I tested the connection to the Mooncake master, and it is OK:

root@zp-nc05:/data/deepseek-ai# telnet -e quit 172.16.0.11 50001
Telnet escape character is 'q'.
Trying 172.16.0.11...
Connected to 172.16.0.11.
Escape character is 'q'.
@@ xterm-256color
Connection closed by foreign host.

The context is:

1. Pull the latest vLLM docker image public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:9dbf7a2dc1448d6657adfb2daba36be270dcebcd
2. Install Mooncake in the container: pip install mooncake-transfer-engine
3. mooncake.json:
{
    "local_hostname": "172.16.0.11",
    "metadata_server": "etcd://172.16.0.11:12379",
    "protocol": "tcp",
    "device_name": "",
    "master_server_address": "172.16.0.11:50001"
}
4. Run etcd
5. Run the Mooncake master with the docker image: docker pull alogfans/mooncake
6. Run vLLM with mooncake.json

@ShangmingCai (Contributor, Author)

@morty-zxb, can you provide the full logs of the prefill node and the decode node, and open an issue in the Mooncake repo?

@dalong2hongmei
> Any clue?
>
> @dalong2hongmei You can provide a detailed log and open an issue in the Mooncake repo; we will help you find the root cause.
>
> The logs are stored on the corporate intranet; it is hard to export them here...
>
> I found an issue posting error logs exactly the same as mine: kvcache-ai/Mooncake#180
>
> I also want to figure out whether there are any prerequisites in the Docker container TCP scenario (or k8s), for example, whether host networking is needed between the P/D instances?

It turned out to be my fault; I mistook the IP parameters. Mooncake is great work!

I ran into another problem. When I use vLLM's serving benchmark script to benchmark the vLLM server with Mooncake, I often see a large prompt throughput in my decoding instance's log, and it also shows some pending requests. I suspect prefilling is happening in the decoding instance. Is that normal? @ShangmingCai

PS: I found that vLLM's --enable-prefix-caching conflicts with Mooncake; when I set this launch parameter, the instance would somehow crash during the benchmark.

lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
@Zhangmj0621 commented Jul 14, 2025

@ShangmingCai Hello, I found that #15343 has stalled. Do you have any further plans for the Mooncake Store v1 integration, for example, the Mooncake Store v1 implementation, layerwise transfer, or global KV cache reuse? I would highly appreciate any feedback from you.

@ShangmingCai (Contributor, Author)

> @ShangmingCai Hello, I found that #15343 has stalled. Do you have any further plans for the Mooncake Store v1 integration, for example, the Mooncake Store v1 implementation, layerwise transfer, or global KV cache reuse? I would highly appreciate any feedback from you.

@Zhangmj0621 Can you check this: https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/lmcacheV1-deployment.md We have already integrated part of the implementation through LMCache.

@Zhangmj0621
> @ShangmingCai Hello, I found that #15343 has stalled. Do you have any further plans for the Mooncake Store v1 integration, for example, the Mooncake Store v1 implementation, layerwise transfer, or global KV cache reuse? I would highly appreciate any feedback from you.
>
> @Zhangmj0621 Can you check this: https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/lmcacheV1-deployment.md We have already integrated part of the implementation through LMCache.

Thanks for your reply! It was my mistake to think that LMCache and Mooncake are two different systems.

@Zhangmj0621
> pp:48] transport_type=tcp
> I0417 19:46:42.568464 482 client.c

I have just one remaining question regarding the KV cache transfer mechanism: when utilizing both LMCache and Mooncake Store, is the transfer handled by NIXL or the Mooncake Transfer Engine?
I would highly appreciate any feedback from you.

@ShangmingCai (Contributor, Author)

> pp:48] transport_type=tcp
> I0417 19:46:42.568464 482 client.c
>
> I have just one remaining question regarding the KV cache transfer mechanism: when utilizing both LMCache and Mooncake Store, is the transfer handled by NIXL or the Mooncake Transfer Engine? I would highly appreciate any feedback from you.

@Zhangmj0621 Actually, both LMCache and NIXL have integrated Mooncake as a backend. You can try LMCache + MooncakeStore (which utilizes the Mooncake Transfer Engine to transfer the KV cache), or you can try LMCache + NIXL (and choose the Mooncake Transfer Engine as the backend).

@Zhangmj0621
> pp:48] transport_type=tcp
> I0417 19:46:42.568464 482 client.c
>
> I have just one remaining question regarding the KV cache transfer mechanism: when utilizing both LMCache and Mooncake Store, is the transfer handled by NIXL or the Mooncake Transfer Engine? I would highly appreciate any feedback from you.
>
> @Zhangmj0621 Actually, both LMCache and NIXL have integrated Mooncake as a backend. You can try LMCache + MooncakeStore (which utilizes the Mooncake Transfer Engine to transfer the KV cache), or you can try LMCache + NIXL (and choose the Mooncake Transfer Engine as the backend).

Thanks for your detailed and kind response!

