
Conversation

Contributor

@billishyahao billishyahao commented Apr 3, 2025

This patch provides:

  1. A unified model-aware KV helper to handle KV cache reordering and invocation for both non-MLA and MLA models (a minimal sketch follows the list below).
  2. MLA support for the mooncake store connector ([Feature][Disaggregated] Support XpYd disaggregated prefill with MooncakeStore #12957).
  3. Some minor code clean-ups for both the simple connector and the mooncake store connector.
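
To illustrate item 1: the helper has to branch on the cache layout, because MLA models keep a single compressed latent tensor per layer while standard attention keeps separate key and value tensors. A minimal sketch of that branching (illustrative only; the function name is hypothetical, not the PR's actual code):

import torch

def split_kv_for_transfer(kv_cache: torch.Tensor, is_mla: bool):
    # Return the per-layer tensors that need to be sent through the connector.
    if is_mla:
        # MLA: one compressed latent cache per layer, no separate value cache.
        return (kv_cache,)
    # Standard attention: the per-layer cache stacks key and value tensors.
    key_cache, value_cache = kv_cache[0], kv_cache[1]
    return key_cache, value_cache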

Tested on both AMD and NVIDIA datacenter GPUs to verify correctness for both the simple connector (1P1D) and the mooncake store connector (XPYD) cases.

XPYD:

# 1. Start the etcd server
etcd --listen-client-urls http://{IP}:2379 --advertise-client-urls http://{IP}:2379

# 2. Start the mooncake_master server
LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH \
mooncake_master --port 50001
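
# The vLLM instances below read their Mooncake settings from ./mooncake.json.
# An illustrative config, with placeholder values to adapt to your network;
# the exact schema is defined by the Mooncake transfer engine, so treat the
# field set here as an assumption rather than a reference:

{
    "local_hostname": "{IP}",
    "metadata_server": "etcd://{IP}:2379",
    "global_segment_size": 3355443200,
    "local_buffer_size": 1073741824,
    "protocol": "rdma",
    "device_name": "erdma_0",
    "master_server_address": "{IP}:50001"
}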

# 3. Run multiple vllm instances
# kv_producer role
CUDA_VISIBLE_DEVICES=1 \
LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH \
MOONCAKE_CONFIG_PATH=./mooncake.json \
vllm serve deepseek-ai/DeepSeek-V2-Lite \
		--port 8100 \
		--trust-remote-code \
		--max-model-len 10000 \
		--gpu-memory-utilization 0.8 \
		--kv-transfer-config '{"kv_connector":"MooncakeStoreConnector", "kv_role":"kv_producer", "kv_rank":0, "kv_parallel_size":2}'


# kv_consumer role
CUDA_VISIBLE_DEVICES=2 \
LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH \
MOONCAKE_CONFIG_PATH=./mooncake.json \
vllm serve deepseek-ai/DeepSeek-V2-Lite \
		--port 8200 \
		--trust-remote-code \
		--max-model-len 10000 \
		--gpu-memory-utilization 0.8 \
		--kv-transfer-config '{"kv_connector":"MooncakeStoreConnector", "kv_role":"kv_consumer", "kv_rank":1, "kv_parallel_size":2}'

CUDA_VISIBLE_DEVICES=3 \
LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH \
MOONCAKE_CONFIG_PATH=./mooncake.json \
vllm serve deepseek-ai/DeepSeek-V2-Lite \
		--port 8201 \
		--trust-remote-code \
		--max-model-len 10000 \
		--gpu-memory-utilization 0.8 \
		--kv-transfer-config '{"kv_connector":"MooncakeStoreConnector", "kv_role":"kv_consumer", "kv_rank":1, "kv_parallel_size":2}'

# 4. Run round robin proxy server
python3 examples/online_serving/disagg_examples/disagg_proxy_demo.py \
		--model deepseek-ai/DeepSeek-V2-Lite \
		--prefill localhost:8100 \
		--decode localhost:8200 localhost:8201 \
		--port 8000
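
# The demo proxy routes every request to the prefill instance and round-robins
# across the decode instances. A hedged sketch of the selection logic
# (illustrative only; see disagg_proxy_demo.py for the real implementation):

import itertools

decode_backends = ["localhost:8200", "localhost:8201"]
_next_decode = itertools.cycle(decode_backends)

def pick_decode_backend() -> str:
    # Each incoming request is forwarded to the next decode instance in turn.
    return next(_next_decode)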

# 5. Send request
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "deepseek-ai/DeepSeek-V2-Lite",
  "prompt": "San Francisco is a",
  "max_tokens": 100}'

{"id":"cmpl-163f1756892949d1b914d0756cd98eab","object":"text_completion","created":1743687181,"model":"deepseek-ai/DeepSeek-V2-Lite","choices":[{"index":0,"text":" city that is known for its diversity, and the same can be said for the city’s food scene. From the iconic clam chowder to the deliciously sweet and sour Chinese food, San Francisco has a wide variety of cuisines to choose from.\nOne of the most popular dishes in San Francisco is the clam chowder. This creamy soup is made with clams, potatoes, and bacon, and is typically served with a side of sourdough bread. The clam ch","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":105,"completion_tokens":100,"prompt_tokens_details":null}}

1P1D:

# 1. Run prefill instance
CUDA_VISIBLE_DEVICES=1 vllm serve deepseek-ai/DeepSeek-V2-Lite \
    --port 8100 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'

# 2. Run decode instance
CUDA_VISIBLE_DEVICES=2 vllm serve deepseek-ai/DeepSeek-V2-Lite \
    --port 8200 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'
	
# 3. Run simple proxy server	
python3 benchmarks/disagg_benchmarks/disagg_prefill_proxy_server.py
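
# The simple proxy implements disaggregated prefill as two hops: it first
# replays the request on the prefill instance with max_tokens forced to 1
# (so only the prompt's KV cache is produced and transferred), then runs the
# original request on the decode instance. A hedged sketch of that pattern,
# assuming the ports above (not the actual script):

import requests

PREFILL_URL = "http://localhost:8100/v1/completions"
DECODE_URL = "http://localhost:8200/v1/completions"

def proxy_completion(payload: dict) -> dict:
    # Hop 1: prefill-only pass; max_tokens=1 computes and publishes the
    # prompt's KV cache through the configured KV connector.
    prefill_payload = dict(payload, max_tokens=1)
    requests.post(PREFILL_URL, json=prefill_payload).raise_for_status()
    # Hop 2: the decode instance pulls the transferred KV cache and generates
    # the full completion without recomputing the prefill.
    response = requests.post(DECODE_URL, json=payload)
    response.raise_for_status()
    return response.json()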

# 4. Send request 
curl -X POST -s http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V2-Lite",
"prompt": "San Francisco is a",
"max_tokens": 100
}'
{"id":"cmpl-959481be2f694efca9e574693c965753","object":"text_completion","created":1743688337,"model":"deepseek-ai/DeepSeek-V2-Lite","choices":[{"index":0,"text":" city of many neighborhoods, each with its own unique character and charm. From the bustling streets of Chinatown to the quiet streets of the Haight-Ashbury district, there is something for everyone in this vibrant city. One of the most popular neighborhoods in San Francisco is the Castro district, which is known for its vibrant LGBTQ+ community and its many bars and restaurants. Another popular neighborhood is the Mission district, which is known for its diverse population and its many restaurants and shops. No matter what","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":105,"completion_tokens":100,"prompt_tokens_details":null}}


github-actions bot commented Apr 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@simon-mo simon-mo requested a review from KuntaiDu April 3, 2025 16:30
Collaborator

@KuntaiDu KuntaiDu left a comment


I like this PR! Some comments on naming stuff, but functionality LGTM!

Contributor

@ShangmingCai ShangmingCai left a comment


I like that this PR modularizes this part of the code to reduce duplication and adapt to all connectors, but the filename model_aware_kv_ops.py seems a bit confusing; maybe this code should live in a utils.py file instead. Otherwise, LGTM.

Signed-off-by: billishyahao <[email protected]>
Contributor

@ShangmingCai ShangmingCai left a comment


LGTM. But maybe a shorter name like "utils.py" would be better? That way we can put more utility functions and helpers in this file instead of creating so many new files in the future. I suggest this because I see a "utils.py" in many sub-directories of vllm.

@billishyahao
Contributor Author

> LGTM. But maybe a shorter name like "utils.py" would be better? That way we can put more utility functions and helpers in this file instead of creating so many new files in the future. I suggest this because I see a "utils.py" in many sub-directories of vllm.

Yes, that makes sense. I renamed it in the latest commit 07c73ea. Thanks!

Signed-off-by: billishyahao <[email protected]>
Contributor

@ShangmingCai ShangmingCai left a comment


@billishyahao LGTM now. You can ping @KuntaiDu to review it again.

Collaborator

@KuntaiDu KuntaiDu left a comment


LGTM!

@KuntaiDu KuntaiDu enabled auto-merge (squash) April 14, 2025 17:48
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 14, 2025
@DarkLight1337
Member

Can you merge from main to fix the CI failures?

@ShangmingCai
Contributor

@DarkLight1337 This could probably use a force-merge since it only changes files under the vllm/distributed/kv_transfer/kv_connector directory, and the disaggregated serving feature doesn't have CI coverage yet.

@vllm-bot vllm-bot merged commit 3ac98ed into vllm-project:main Apr 16, 2025
65 of 69 checks passed
lionelvillard pushed a commit to lionelvillard/vllm that referenced this pull request Apr 17, 2025
zhenwei-intel pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 20, 2025
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025