
Conversation

Contributor

@billishyahao billishyahao commented Apr 3, 2025

This patch provides:

  1. A unified model-aware KV helper to handle KV cache reordering and invocation for both non-MLA and MLA models (a minimal sketch follows the list below).
  2. MLA support for the mooncake store connector ([Feature][Disaggregated] Support XpYd disaggregated prefill with MooncakeStore #12957).
  3. Some minor code clean-ups for both the simple connector and the mooncake store connector.
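
To illustrate item 1: the helper has to branch on the cache layout, because MLA models keep a single compressed latent tensor per layer while standard attention keeps separate key and value tensors. A minimal sketch of that branching (illustrative only; the function name is hypothetical, not the PR's actual code):

import torch

def split_kv_for_transfer(kv_cache: torch.Tensor, is_mla: bool):
    # Return the per-layer tensors that need to be sent through the connector.
    if is_mla:
        # MLA: one compressed latent cache per layer, no separate value cache.
        return (kv_cache,)
    # Standard attention: the per-layer cache stacks key and value tensors.
    key_cache, value_cache = kv_cache[0], kv_cache[1]
    return key_cache, value_cache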

Tested on both AMD and NVIDIA datacenter GPUs to verify correctness for both the simple connector (1P1D) and the mooncake store connector (XPYD) cases.

XPYD:

# 1. Start the etcd server
etcd --listen-client-urls http://{IP}:2379 --advertise-client-urls http://{IP}:2379

# 2. Start the mooncake_master server
LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH \
mooncake_master --port 50001
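
# The vLLM instances below read their Mooncake settings from ./mooncake.json.
# An illustrative config, with placeholder values to adapt to your network;
# the exact schema is defined by the Mooncake transfer engine, so treat the
# field set here as an assumption rather than a reference:

{
    "local_hostname": "{IP}",
    "metadata_server": "etcd://{IP}:2379",
    "global_segment_size": 3355443200,
    "local_buffer_size": 1073741824,
    "protocol": "rdma",
    "device_name": "erdma_0",
    "master_server_address": "{IP}:50001"
}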

# 3. Run multiple vllm instances
# kv_producer role
CUDA_VISIBLE_DEVICES=1 \
LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH \
MOONCAKE_CONFIG_PATH=./mooncake.json \
vllm serve deepseek-ai/DeepSeek-V2-Lite \
		--port 8100 \
		--trust-remote-code \
		--max-model-len 10000 \
		--gpu-memory-utilization 0.8 \
		--kv-transfer-config '{"kv_connector":"MooncakeStoreConnector", "kv_role":"kv_producer", "kv_rank":0, "kv_parallel_size":2}'


# kv_consumer role
CUDA_VISIBLE_DEVICES=2 \
LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH \
MOONCAKE_CONFIG_PATH=./mooncake.json \
vllm serve deepseek-ai/DeepSeek-V2-Lite \
		--port 8200 \
		--trust-remote-code \
		--max-model-len 10000 \
		--gpu-memory-utilization 0.8 \
		--kv-transfer-config '{"kv_connector":"MooncakeStoreConnector", "kv_role":"kv_consumer", "kv_rank":1, "kv_parallel_size":2}'

CUDA_VISIBLE_DEVICES=3 \
LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH \
MOONCAKE_CONFIG_PATH=./mooncake.json \
vllm serve deepseek-ai/DeepSeek-V2-Lite \
		--port 8201 \
		--trust-remote-code \
		--max-model-len 10000 \
		--gpu-memory-utilization 0.8 \
		--kv-transfer-config '{"kv_connector":"MooncakeStoreConnector", "kv_role":"kv_consumer", "kv_rank":1, "kv_parallel_size":2}'

# 4. Run round robin proxy server
python3 examples/online_serving/disagg_examples/disagg_proxy_demo.py \
		--model deepseek-ai/DeepSeek-V2-Lite \
		--prefill localhost:8100 \
		--decode localhost:8200 localhost:8201 \
		--port 8000
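
# The demo proxy routes every request to the prefill instance and round-robins
# across the decode instances. A hedged sketch of the selection logic
# (illustrative only; see disagg_proxy_demo.py for the real implementation):

import itertools

decode_backends = ["localhost:8200", "localhost:8201"]
_next_decode = itertools.cycle(decode_backends)

def pick_decode_backend() -> str:
    # Each incoming request is forwarded to the next decode instance in turn.
    return next(_next_decode)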

# 5. Send request
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "deepseek-ai/DeepSeek-V2-Lite",
  "prompt": "San Francisco is a",
  "max_tokens": 100}'

{"id":"cmpl-163f1756892949d1b914d0756cd98eab","object":"text_completion","created":1743687181,"model":"deepseek-ai/DeepSeek-V2-Lite","choices":[{"index":0,"text":" city that is known for its diversity, and the same can be said for the city’s food scene. From the iconic clam chowder to the deliciously sweet and sour Chinese food, San Francisco has a wide variety of cuisines to choose from.\nOne of the most popular dishes in San Francisco is the clam chowder. This creamy soup is made with clams, potatoes, and bacon, and is typically served with a side of sourdough bread. The clam ch","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":105,"completion_tokens":100,"prompt_tokens_details":null}}

1P1D:

# 1. Run prefill instance
CUDA_VISIBLE_DEVICES=1 vllm serve deepseek-ai/DeepSeek-V2-Lite \
    --port 8100 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'

# 2. Run decode instance
CUDA_VISIBLE_DEVICES=2 vllm serve deepseek-ai/DeepSeek-V2-Lite \
    --port 8200 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'
	
# 3. Run simple proxy server	
python3 benchmarks/disagg_benchmarks/disagg_prefill_proxy_server.py
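
# The simple proxy implements disaggregated prefill as two hops: it first
# replays the request on the prefill instance with max_tokens forced to 1
# (so only the prompt's KV cache is produced and transferred), then runs the
# original request on the decode instance. A hedged sketch of that pattern,
# assuming the ports above (not the actual script):

import requests

PREFILL_URL = "http://localhost:8100/v1/completions"
DECODE_URL = "http://localhost:8200/v1/completions"

def proxy_completion(payload: dict) -> dict:
    # Hop 1: prefill-only pass; max_tokens=1 computes and publishes the
    # prompt's KV cache through the configured KV connector.
    prefill_payload = dict(payload, max_tokens=1)
    requests.post(PREFILL_URL, json=prefill_payload).raise_for_status()
    # Hop 2: the decode instance pulls the transferred KV cache and generates
    # the full completion without recomputing the prefill.
    response = requests.post(DECODE_URL, json=payload)
    response.raise_for_status()
    return response.json()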

# 4. Send request 
curl -X POST -s http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V2-Lite",
"prompt": "San Francisco is a",
"max_tokens": 100
}'
{"id":"cmpl-959481be2f694efca9e574693c965753","object":"text_completion","created":1743688337,"model":"deepseek-ai/DeepSeek-V2-Lite","choices":[{"index":0,"text":" city of many neighborhoods, each with its own unique character and charm. From the bustling streets of Chinatown to the quiet streets of the Haight-Ashbury district, there is something for everyone in this vibrant city. One of the most popular neighborhoods in San Francisco is the Castro district, which is known for its vibrant LGBTQ+ community and its many bars and restaurants. Another popular neighborhood is the Mission district, which is known for its diverse population and its many restaurants and shops. No matter what","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":105,"completion_tokens":100,"prompt_tokens_details":null}}


github-actions bot commented Apr 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@simon-mo simon-mo requested a review from KuntaiDu April 3, 2025 16:30
Collaborator

@KuntaiDu KuntaiDu left a comment


I like this PR! Some comments on naming stuff, but functionality LGTM!

Contributor

@ShangmingCai ShangmingCai left a comment


I like that this PR modularizes this part of the code to reduce duplication and adapt to all connectors, but the filename model_aware_kv_ops.py seems a bit confusing; maybe this code should live in a utils.py file instead. Otherwise, LGTM.

Signed-off-by: billishyahao <[email protected]>
Contributor

@ShangmingCai ShangmingCai left a comment


LGTM. But maybe a shorter name like "utils.py" would be better? That way we can put more utility functions and helpers in this file instead of creating so many new files in the future. I suggest this because I see a "utils.py" in many sub-directories of vllm.

@billishyahao
Contributor Author

> LGTM. But maybe a shorter name like "utils.py" would be better? That way we can put more utility functions and helpers in this file instead of creating so many new files in the future. I suggest this because I see a "utils.py" in many sub-directories of vllm.

Yes, that makes sense. I renamed it in the latest commit 07c73ea. Thanks!

Signed-off-by: billishyahao <[email protected]>
Contributor

@ShangmingCai ShangmingCai left a comment


@billishyahao LGTM now. You can ping @KuntaiDu to review it again.

Collaborator

@KuntaiDu KuntaiDu left a comment


LGTM!

@KuntaiDu KuntaiDu enabled auto-merge (squash) April 14, 2025 17:48
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 14, 2025
@DarkLight1337
Member

Can you merge from main to fix the CI failures?

@ShangmingCai
Contributor

@DarkLight1337 This could probably use a force-merge since it only changes files under the vllm/distributed/kv_transfer/kv_connector directory, and the disaggregated serving feature doesn't have CI coverage yet.

@vllm-bot vllm-bot merged commit 3ac98ed into vllm-project:main Apr 16, 2025
65 of 69 checks passed
lionelvillard pushed a commit to lionelvillard/vllm that referenced this pull request Apr 17, 2025
zhenwei-intel pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 20, 2025
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025