
@maobaolong (Contributor) commented on May 20, 2025

Motivation

Inspired by #17564, which makes multiple connectors co-exist in v1.

This PR aims to implement MultiConnectorV0 for v0.

How to test

Start a 1P 1D cluster. In the P instance, use LMCacheConnector as an offload connector; in both the P and D instances, use FSConnector as the P/D transfer connector. (A separate PR will introduce FSConnector, but any other connector can serve as the P/D transfer connector instead.)
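For readability, here is the prefill-side --kv-transfer-config from the command below, pretty-printed; the values are exactly those passed on the command line:

{
  "kv_connector": "MultiConnectorV0",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "connectors": [
      {
        "kv_connector": "FSConnector",
        "kv_role": "kv_producer",
        "kv_connector_extra_config": {
          "fs_storage_path": "fs_local_storage",
          "transfer": true
        }
      },
      {
        "kv_connector": "LMCacheConnector",
        "kv_role": "kv_both"
      }
    ]
  }
}

The decode side (second command below) uses a plain FSConnector with kv_role "kv_consumer", so no MultiConnectorV0 wrapper is needed there.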

  • Run a vLLM instance for the prefill role
CUDA_VISIBLE_DEVICES=0 \
VLLM_LOGGING_LEVEL=DEBUG \
LMCACHE_USE_EXPERIMENTAL=True LMCACHE_TRACK_USAGE=false LMCACHE_LOG_LEVEL=DEBUG \
LMCACHE_CONFIG_FILE=/disc/data1/baoloongmao/cpu/lmcache-cpu.yaml \
VLLM_MLA_DISABLE=1 VLLM_USE_V1=0 \
vllm serve /disc/data1/deepseek/DeepSeek-V2-Lite-Chat/ \
           --trust-remote-code \
           --served-model-name vllm_cpu_offload \
           --max-model-len 32768 \
           --max-seq-len-to-capture 10000 \
           --max-num-seqs 64 \
           --gpu-memory-utilization 0.9 \
           --host 0.0.0.0 \
           -tp 1 \
           --no-enable-prefix-caching \
           --max-num-batched-tokens 64000 \
           --kv-transfer-config '{"kv_connector":"MultiConnectorV0","kv_role":"kv_both","kv_connector_extra_config":{"connectors":[{"kv_connector":"FSConnector","kv_role":"kv_producer","kv_connector_extra_config":{"fs_storage_path":"fs_local_storage","transfer":true}},{"kv_connector":"LMCacheConnector","kv_role":"kv_both"}]}}'
  • Run a vLLM instance for the decode role
CUDA_VISIBLE_DEVICES=1 \
VLLM_LOGGING_LEVEL=DEBUG \
LMCACHE_USE_EXPERIMENTAL=True LMCACHE_TRACK_USAGE=false LMCACHE_LOG_LEVEL=DEBUG \
LMCACHE_CONFIG_FILE=/disc/data1/baoloongmao/cpu/lmcache-cpu.yaml \
VLLM_MLA_DISABLE=1 VLLM_USE_V1=0 \
vllm serve /disc/data1/deepseek/DeepSeek-V2-Lite-Chat/ \
           --trust-remote-code \
           --served-model-name vllm_cpu_offload \
           --max-model-len 32768 \
           --max-seq-len-to-capture 10000 \
           --max-num-seqs 64 \
           --gpu-memory-utilization 0.9 \
           --host 0.0.0.0 \
           -tp 1 \
           --no-enable-prefix-caching \
           --max-num-batched-tokens 64000 \
           --kv-transfer-config '{"kv_connector":"FSConnector","kv_role":"kv_consumer","kv_connector_extra_config":{"fs_storage_path":"fs_local_storage"}}' \
           --port 8001
  • Send requests with curl, then check the output logs and the files in the fs_local_storage folder
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "vllm_cpu_offload",
    "messages": [{"role": "user", "content": "Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?"}],
    "max_tokens": 1,
    "temperature": 0,
    "top_p": 0.95
    }'
{"id":"chatcmpl-94b0c6c0941343af9499f895789febe1","object":"chat.completion","created":1747727609,"model":"vllm_cpu_offload","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":" Hello","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":409,"total_tokens":410,"completion_tokens":1,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}

curl http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "vllm_cpu_offload",
    "messages": [{"role": "user", "content": "Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?"}],
    "max_tokens": 10,
    "temperature": 0,
    "top_p": 0.95
    }'
  • Logs in the P instance (see the dispatch sketch after these logs)
INFO 05-20 00:52:26 [logger.py:39] Received request chatcmpl-4c385ad864cb426dab9c3e31c82d073c: prompt: '<|begin▁of▁sentence|>User: Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?\n\nAssistant:', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-20 00:52:26 [engine.py:313] Added request chatcmpl-4c385ad864cb426dab9c3e31c82d073c.
[2025-05-20 00:52:26,864] LMCache DEBUG: Retrieved 0 out of 409 out of total 409 tokens (cache_engine.py:330:lmcache.experimental.cache_engine)
[2025-05-20 00:52:26,864] LMCache DEBUG: Injected token number: 0 (vllm_adapter.py:747:lmcache.integration.vllm.vllm_adapter)
[2025-05-20 00:52:26,864] LMCache DEBUG: Returning the original input! (vllm_adapter.py:789:lmcache.integration.vllm.vllm_adapter)
DEBUG 05-20 00:52:27 [fs_connector.py:94] [rank0]: KV send DONE.
INFO 05-20 00:52:27 [multi_connector.py:67] sent to connector FSConnector
[2025-05-20 00:52:27,062] LMCache DEBUG: Stored 409 out of total 409 tokens (cache_engine.py:257:lmcache.experimental.cache_engine)
[2025-05-20 00:52:27,062] LMCache DEBUG: Store skips 0 tokens and then stores 409 tokens (vllm_adapter.py:561:lmcache.integration.vllm.vllm_adapter)
INFO 05-20 00:52:27 [multi_connector.py:67] sent to connector LMCacheConnector

INFO 05-20 00:52:57 [logger.py:39] Received request chatcmpl-ae3020fdc9974caaaea4167112dcba5b: prompt: '<|begin▁of▁sentence|>User: Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?\n\nAssistant:', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-20 00:52:57 [engine.py:313] Added request chatcmpl-ae3020fdc9974caaaea4167112dcba5b.
[2025-05-20 00:52:57,507] LMCache DEBUG: Retrieved 409 out of 409 out of total 409 tokens (cache_engine.py:330:lmcache.experimental.cache_engine)
[2025-05-20 00:52:57,507] LMCache DEBUG: Injected token number: 408 (vllm_adapter.py:747:lmcache.integration.vllm.vllm_adapter)
[2025-05-20 00:52:57,517] LMCache DEBUG: Rebuilt the input! (vllm_adapter.py:786:lmcache.integration.vllm.vllm_adapter)
DEBUG 05-20 00:52:57 [fs_connector.py:94] [rank0]: KV send DONE.
INFO 05-20 00:52:57 [multi_connector.py:67] sent to connector FSConnector
[2025-05-20 00:52:57,558] LMCache DEBUG: Store skips 409 tokens and then stores 0 tokens (vllm_adapter.py:561:lmcache.integration.vllm.vllm_adapter)
INFO 05-20 00:52:57 [multi_connector.py:67] sent to connector LMCacheConnector

INFO 05-20 00:53:29 [logger.py:39] Received request chatcmpl-94b0c6c0941343af9499f895789febe1: prompt: '<|begin▁of▁sentence|>User: Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?\n\nAssistant:', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-20 00:53:29 [engine.py:313] Added request chatcmpl-94b0c6c0941343af9499f895789febe1.
[2025-05-20 00:53:29,635] LMCache DEBUG: Retrieved 409 out of 409 out of total 409 tokens (cache_engine.py:330:lmcache.experimental.cache_engine)
[2025-05-20 00:53:29,635] LMCache DEBUG: Injected token number: 408 (vllm_adapter.py:747:lmcache.integration.vllm.vllm_adapter)
[2025-05-20 00:53:29,638] LMCache DEBUG: Rebuilt the input! (vllm_adapter.py:786:lmcache.integration.vllm.vllm_adapter)
DEBUG 05-20 00:53:29 [fs_connector.py:94] [rank0]: KV send DONE.
INFO 05-20 00:53:29 [multi_connector.py:67] sent to connector FSConnector
[2025-05-20 00:53:29,671] LMCache DEBUG: Store skips 409 tokens and then stores 0 tokens (vllm_adapter.py:561:lmcache.integration.vllm.vllm_adapter)
INFO 05-20 00:53:29 [multi_connector.py:67] sent to connector LMCacheConnector
  • Logs in the D instance
INFO 05-20 00:52:46 [logger.py:39] Received request chatcmpl-471ab3691c8741768686f04e4a3123f1: prompt: '<|begin▁of▁sentence|>User: Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?\n\nAssistant:', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-20 00:52:46 [engine.py:313] Added request chatcmpl-471ab3691c8741768686f04e4a3123f1.
DEBUG 05-20 00:52:46 [fs_connector.py:170] [rank0]: Successfully received all KVs and hidden states, skip model forwarding.

INFO 05-20 00:53:09 [logger.py:39] Received request chatcmpl-d0deb766738a425aa0129134ef52955e: prompt: '<|begin▁of▁sentence|>User: Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?\n\nAssistant:', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-20 00:53:09 [engine.py:313] Added request chatcmpl-d0deb766738a425aa0129134ef52955e.
DEBUG 05-20 00:53:09 [fs_connector.py:170] [rank0]: Successfully received all KVs and hidden states, skip model forwarding.
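The P-instance logs above show the fan-out behavior: each KV send goes to every child connector in order (FSConnector first, then LMCacheConnector), while a receive can be satisfied by whichever child has the data. Below is a minimal Python sketch of that dispatch pattern. It is illustrative only, assuming a simplified connector interface: the class and method names, and the (hidden_states, bypass_model_exec, model_input) return shape, are stand-ins rather than the actual vLLM KVConnectorBase API.

import logging

logger = logging.getLogger(__name__)

class MultiConnectorV0Sketch:
    """Illustrative sketch of multi-connector dispatch; not the real vLLM API."""

    def __init__(self, connectors):
        # Child connectors in configured priority order, e.g.
        # [fs_connector, lmcache_connector] as in the prefill config above.
        self._connectors = connectors

    def send_kv_caches(self, model_input, kv_caches):
        # Fan out: every child connector stores/transfers the KV caches,
        # matching the "sent to connector FSConnector" and
        # "sent to connector LMCacheConnector" log lines above.
        for connector in self._connectors:
            connector.send_kv_caches(model_input, kv_caches)
            logger.info("sent to connector %s", type(connector).__name__)

    def recv_kv_caches(self, model_input):
        # First hit wins: return as soon as one child connector can
        # supply the KV caches and allow model execution to be bypassed.
        for connector in self._connectors:
            hidden_states, bypass_model_exec, out = connector.recv_kv_caches(
                model_input)
            if bypass_model_exec:
                return hidden_states, bypass_model_exec, out
        # No child had the data; fall back to normal model execution.
        return None, False, model_input

On the decode side only FSConnector is configured, so receives come straight from the shared fs_local_storage path, as the "Successfully received all KVs and hidden states, skip model forwarding." lines in the D-instance logs show.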

@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@maobaolong force-pushed the multiply_connectors_v0.gh branch from 58819a2 to 770acf7 on May 20, 2025 at 11:55
@maobaolong force-pushed the multiply_connectors_v0.gh branch from 770acf7 to f427ff5 on May 20, 2025 at 11:58
@maobaolong (Contributor, Author) commented

@mgoin @njhill Would you like to take a look at this PR? Thanks!

@maobaolong closed this on May 21, 2025
@maobaolong reopened this on May 21, 2025
@maobaolong (Contributor, Author) commented

Would be great if you could review this PR! @mgoin @njhill

@maobaolong closed this on Jun 2, 2025