
@maobaolong (Contributor) commented on May 20, 2025

Motivation

Inspired by #17564, which makes multiple connectors co-exist in v1.

This PR aims to implement MultiConnectorV0 for v0.

How to test

Start a 1P 1D cluster. In the P instance, use LMCacheConnector as an offload connector; in both the P and D instances, use FSConnector as the P/D transfer connector. (A separate PR will introduce FSConnector, but any other connector can serve as the P/D transfer connector instead.)
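For readability, here is the prefill-side --kv-transfer-config from the command below, pretty-printed; the values are exactly those passed on the command line:

{
  "kv_connector": "MultiConnectorV0",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "connectors": [
      {
        "kv_connector": "FSConnector",
        "kv_role": "kv_producer",
        "kv_connector_extra_config": {
          "fs_storage_path": "fs_local_storage",
          "transfer": true
        }
      },
      {
        "kv_connector": "LMCacheConnector",
        "kv_role": "kv_both"
      }
    ]
  }
}

The decode side (second command below) uses a plain FSConnector with kv_role "kv_consumer", so no MultiConnectorV0 wrapper is needed there.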

  • Run a vLLM instance for the prefill role
CUDA_VISIBLE_DEVICES=0 \
VLLM_LOGGING_LEVEL=DEBUG \
LMCACHE_USE_EXPERIMENTAL=True LMCACHE_TRACK_USAGE=false LMCACHE_LOG_LEVEL=DEBUG \
LMCACHE_CONFIG_FILE=/disc/data1/baoloongmao/cpu/lmcache-cpu.yaml \
VLLM_MLA_DISABLE=1 VLLM_USE_V1=0 \
vllm serve /disc/data1/deepseek/DeepSeek-V2-Lite-Chat/ \
           --trust-remote-code \
           --served-model-name vllm_cpu_offload \
           --max-model-len 32768 \
           --max-seq-len-to-capture 10000 \
           --max-num-seqs 64 \
           --gpu-memory-utilization 0.9 \
           --host 0.0.0.0 \
           -tp 1 \
           --no-enable-prefix-caching \
           --max-num-batched-tokens 64000 \
           --kv-transfer-config '{"kv_connector":"MultiConnectorV0","kv_role":"kv_both","kv_connector_extra_config":{"connectors":[{"kv_connector":"FSConnector","kv_role":"kv_producer","kv_connector_extra_config":{"fs_storage_path":"fs_local_storage","transfer":true}},{"kv_connector":"LMCacheConnector","kv_role":"kv_both"}]}}'
  • Run a vLLM instance for the decode role
CUDA_VISIBLE_DEVICES=1 \
VLLM_LOGGING_LEVEL=DEBUG \
LMCACHE_USE_EXPERIMENTAL=True LMCACHE_TRACK_USAGE=false LMCACHE_LOG_LEVEL=DEBUG \
LMCACHE_CONFIG_FILE=/disc/data1/baoloongmao/cpu/lmcache-cpu.yaml \
VLLM_MLA_DISABLE=1 VLLM_USE_V1=0 \
vllm serve /disc/data1/deepseek/DeepSeek-V2-Lite-Chat/ \
           --trust-remote-code \
           --served-model-name vllm_cpu_offload \
           --max-model-len 32768 \
           --max-seq-len-to-capture 10000 \
           --max-num-seqs 64 \
           --gpu-memory-utilization 0.9 \
           --host 0.0.0.0 \
           -tp 1 \
           --no-enable-prefix-caching \
           --max-num-batched-tokens 64000 \
           --kv-transfer-config '{"kv_connector":"FSConnector","kv_role":"kv_consumer","kv_connector_extra_config":{"fs_storage_path":"fs_local_storage"}}' \
           --port 8001
  • Send requests with curl, then check the output logs and the files in the fs_local_storage folder
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "vllm_cpu_offload",
    "messages": [{"role": "user", "content": "Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?"}],
    "max_tokens": 1,
    "temperature": 0,
    "top_p": 0.95
    }'
{"id":"chatcmpl-94b0c6c0941343af9499f895789febe1","object":"chat.completion","created":1747727609,"model":"vllm_cpu_offload","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":" Hello","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":409,"total_tokens":410,"completion_tokens":1,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}

curl http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "vllm_cpu_offload",
    "messages": [{"role": "user", "content": "Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?"}],
    "max_tokens": 10,
    "temperature": 0,
    "top_p": 0.95
    }'
  • Logs in the P instance (see the dispatch sketch after these logs)
INFO 05-20 00:52:26 [logger.py:39] Received request chatcmpl-4c385ad864cb426dab9c3e31c82d073c: prompt: '<|begin▁of▁sentence|>User: Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?\n\nAssistant:', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-20 00:52:26 [engine.py:313] Added request chatcmpl-4c385ad864cb426dab9c3e31c82d073c.
[2025-05-20 00:52:26,864] LMCache DEBUG: Retrieved 0 out of 409 out of total 409 tokens (cache_engine.py:330:lmcache.experimental.cache_engine)
[2025-05-20 00:52:26,864] LMCache DEBUG: Injected token number: 0 (vllm_adapter.py:747:lmcache.integration.vllm.vllm_adapter)
[2025-05-20 00:52:26,864] LMCache DEBUG: Returning the original input! (vllm_adapter.py:789:lmcache.integration.vllm.vllm_adapter)
DEBUG 05-20 00:52:27 [fs_connector.py:94] [rank0]: KV send DONE.
INFO 05-20 00:52:27 [multi_connector.py:67] sent to connector FSConnector
[2025-05-20 00:52:27,062] LMCache DEBUG: Stored 409 out of total 409 tokens (cache_engine.py:257:lmcache.experimental.cache_engine)
[2025-05-20 00:52:27,062] LMCache DEBUG: Store skips 0 tokens and then stores 409 tokens (vllm_adapter.py:561:lmcache.integration.vllm.vllm_adapter)
INFO 05-20 00:52:27 [multi_connector.py:67] sent to connector LMCacheConnector

INFO 05-20 00:52:57 [logger.py:39] Received request chatcmpl-ae3020fdc9974caaaea4167112dcba5b: prompt: '<|begin▁of▁sentence|>User: Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?\n\nAssistant:', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-20 00:52:57 [engine.py:313] Added request chatcmpl-ae3020fdc9974caaaea4167112dcba5b.
[2025-05-20 00:52:57,507] LMCache DEBUG: Retrieved 409 out of 409 out of total 409 tokens (cache_engine.py:330:lmcache.experimental.cache_engine)
[2025-05-20 00:52:57,507] LMCache DEBUG: Injected token number: 408 (vllm_adapter.py:747:lmcache.integration.vllm.vllm_adapter)
[2025-05-20 00:52:57,517] LMCache DEBUG: Rebuilt the input! (vllm_adapter.py:786:lmcache.integration.vllm.vllm_adapter)
DEBUG 05-20 00:52:57 [fs_connector.py:94] [rank0]: KV send DONE.
INFO 05-20 00:52:57 [multi_connector.py:67] sent to connector FSConnector
[2025-05-20 00:52:57,558] LMCache DEBUG: Store skips 409 tokens and then stores 0 tokens (vllm_adapter.py:561:lmcache.integration.vllm.vllm_adapter)
INFO 05-20 00:52:57 [multi_connector.py:67] sent to connector LMCacheConnector

INFO 05-20 00:53:29 [logger.py:39] Received request chatcmpl-94b0c6c0941343af9499f895789febe1: prompt: '<|begin▁of▁sentence|>User: Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?\n\nAssistant:', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-20 00:53:29 [engine.py:313] Added request chatcmpl-94b0c6c0941343af9499f895789febe1.
[2025-05-20 00:53:29,635] LMCache DEBUG: Retrieved 409 out of 409 out of total 409 tokens (cache_engine.py:330:lmcache.experimental.cache_engine)
[2025-05-20 00:53:29,635] LMCache DEBUG: Injected token number: 408 (vllm_adapter.py:747:lmcache.integration.vllm.vllm_adapter)
[2025-05-20 00:53:29,638] LMCache DEBUG: Rebuilt the input! (vllm_adapter.py:786:lmcache.integration.vllm.vllm_adapter)
DEBUG 05-20 00:53:29 [fs_connector.py:94] [rank0]: KV send DONE.
INFO 05-20 00:53:29 [multi_connector.py:67] sent to connector FSConnector
[2025-05-20 00:53:29,671] LMCache DEBUG: Store skips 409 tokens and then stores 0 tokens (vllm_adapter.py:561:lmcache.integration.vllm.vllm_adapter)
INFO 05-20 00:53:29 [multi_connector.py:67] sent to connector LMCacheConnector
  • Logs in the D instance
INFO 05-20 00:52:46 [logger.py:39] Received request chatcmpl-471ab3691c8741768686f04e4a3123f1: prompt: '<|begin▁of▁sentence|>User: Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?\n\nAssistant:', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-20 00:52:46 [engine.py:313] Added request chatcmpl-471ab3691c8741768686f04e4a3123f1.
DEBUG 05-20 00:52:46 [fs_connector.py:170] [rank0]: Successfully received all KVs and hidden states, skip model forwarding.

INFO 05-20 00:53:09 [logger.py:39] Received request chatcmpl-d0deb766738a425aa0129134ef52955e: prompt: '<|begin▁of▁sentence|>User: Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?Hello, how are you?\n\nAssistant:', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 05-20 00:53:09 [engine.py:313] Added request chatcmpl-d0deb766738a425aa0129134ef52955e.
DEBUG 05-20 00:53:09 [fs_connector.py:170] [rank0]: Successfully received all KVs and hidden states, skip model forwarding.
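The P-instance logs above show the fan-out behavior: each KV send goes to every child connector in order (FSConnector first, then LMCacheConnector), while a receive can be satisfied by whichever child has the data. Below is a minimal Python sketch of that dispatch pattern. It is illustrative only, assuming a simplified connector interface: the class and method names, and the (hidden_states, bypass_model_exec, model_input) return shape, are stand-ins rather than the actual vLLM KVConnectorBase API.

import logging

logger = logging.getLogger(__name__)

class MultiConnectorV0Sketch:
    """Illustrative sketch of multi-connector dispatch; not the real vLLM API."""

    def __init__(self, connectors):
        # Child connectors in configured priority order, e.g.
        # [fs_connector, lmcache_connector] as in the prefill config above.
        self._connectors = connectors

    def send_kv_caches(self, model_input, kv_caches):
        # Fan out: every child connector stores/transfers the KV caches,
        # matching the "sent to connector FSConnector" and
        # "sent to connector LMCacheConnector" log lines above.
        for connector in self._connectors:
            connector.send_kv_caches(model_input, kv_caches)
            logger.info("sent to connector %s", type(connector).__name__)

    def recv_kv_caches(self, model_input):
        # First hit wins: return as soon as one child connector can
        # supply the KV caches and allow model execution to be bypassed.
        for connector in self._connectors:
            hidden_states, bypass_model_exec, out = connector.recv_kv_caches(
                model_input)
            if bypass_model_exec:
                return hidden_states, bypass_model_exec, out
        # No child had the data; fall back to normal model execution.
        return None, False, model_input

On the decode side only FSConnector is configured, so receives come straight from the shared fs_local_storage path, as the "Successfully received all KVs and hidden states, skip model forwarding." lines in the D-instance logs show.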

@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@maobaolong force-pushed the multiply_connectors_v0.gh branch from 58819a2 to 770acf7 on May 20, 2025 at 11:55
@maobaolong force-pushed the multiply_connectors_v0.gh branch from 770acf7 to f427ff5 on May 20, 2025 at 11:58
@maobaolong (Contributor, Author) commented

@mgoin @njhill Would you like to take a look at this PR? Thanks!

@maobaolong closed this on May 21, 2025
@maobaolong reopened this on May 21, 2025
@maobaolong (Contributor, Author) commented

Would be great if you could review this PR! @mgoin @njhill

@maobaolong closed this on Jun 2, 2025