[P/D] Support CPU Transfer in NixlConnector by juncgu · Pull Request #18293 · vllm-project/vllm

juncgu · 2025-05-17T01:26:15Z

This PR adds TPU support in NixlConnector (#17751) for P/D disaggregated serving.
The high-level idea is to use a buffer in host memory as the kv transfer buffer. The kv transfer buffer is registered under nixl agent (as the type of "DRAM"). The computed KV cache (full blocks) at the prefill instance will be saved to the transfer buffer. One the decode side, the remote KV data will be read into the transfer buffer and then load into the device memory.

Currently, we supports the same P/D disaggregated serving scenarios as #17751:

support xPyD
support homogeneous TP > 1
support P->D request flow

We will follow up the updates in NixlConnector and support the incoming features mentioned in #17751.

How to config NixlConnector for TPU?

We need to set kv_buffer_device to cpu in kv_transfer_config.
For example:

# launch a prefill instance
VLLM_LOGGING_LEVEL=DEBUG \
VLLM_USE_V1=1 \
VLLM_NIXL_SIDE_CHANNEL_HOST=${PREFILL_HOST} \
VLLM_NIXL_SIDE_CHANNEL_PORT=${NIXL_SIDE_PORT} \
PJRT_DEVICE=TPU \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_ENABLE_V1_MULTIPROCESSING=0 vllm serve ${MODEL_NAME} \
    --host ${PREFILL_HOST} \
    --port 8100 \
    --tensor-parallel-size 8 \
    --enforce-eager \
    --gpu-memory-utilization 0.5 \
    --disable-log-requests \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"cpu"}'

A simple 1p1d disaggregation example can be found at tests/v1/kv_connector/nixl_integration/run_tpu_disagg_accuracy_test.sh.

Notes:

The current transfer buffer in host memory has the same num of blocks as the kv cache on device. This simplifies the impl. Given that the host memory capacity is often much large than the total HBM capacity within a node (e.g., TPU v6e spec).
By design, this impl. should be able to support any xPUs which are incompatible with the nixl library and need to use host memory as transfer buffer. Currently, only TPU is tested.
The extra time overhead from nixl-agent handshake and xla compilation may hit the time limit of execute_model in the multiproc_executor. Therefore, we'd suggest to relax the timeout value by setting VLLM_MULTIPROC_EXECUTE_MODEL_TIMEOUT_S env var.

github-actions · 2025-05-17T01:26:22Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

mergify · 2025-05-17T01:26:50Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @juncgu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

juncgu · 2025-05-17T04:04:58Z

@richardliaw, @robertgshaw2-redhat

mergify · 2025-05-17T07:24:58Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @juncgu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py

robertgshaw2-redhat · 2025-05-17T22:12:12Z

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py

+        # cpu kv buffer for xfer
+        # used when xPU memory can not be registered under nixl
+        self.host_xfer_buffers: dict[str, torch.Tensor] = {}
+        self.use_host_buffer = True if _NIXL_SUPPORTED_XPU_TYPE.support(self.kv_buffer_device) else False


Is there any reason why we cannot allows the GPU to use the host buffer?

Also, it is confusing that the tpu uses kv_buffer_device: "tpu" when it is using the host buffer.

I think we should allow the user to specify kv_buffer_device, then have:

_NIXL_SUPPORTED_XPU_TYPE = { "cuda": ["cuda", "cpu"], "tpu": ["cpu"] } if self.kv_buffer_device not in _NIXL_SUPPORTED_XPU_TYPE[current_platform.platform()]: raise if self.kv_buffer_device == "cuda": self.nixl_memory_type = "VRAM" else: assert self.kv_buffer_device == "cpu": self.nixl_memory_type = "DRAM"

cuda: cpu is not supported yet.

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py

yaochengji

Thanks for supporting such an important feature, left a few comments.

Also, could you add some tests? IMO, the minimum requirements are a unit test for nixl connector on tpu and an e2e accuracy test.

yaochengji · 2025-05-19T05:11:50Z

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py

+        if params.get("do_remote_decode"):
+            # NOTE: only need to save / send full computed blocks
+            block_ids = blocks.get_block_ids()[0]
+            all_full = request.num_tokens % self.block_size == 0


Are there multiple requests in the block_ids? If so, request.num_tokens % self.block_size == 0 cannot lead to all_full.

Oh, actually even for one request, e.g. num_tokens = 2 * block_size, but it can use 3 blocks when it's not aligned.

The block_ids is for a single request.

In my understanding, when e.g. prompt_len == 2 * block_size, 2 blocks will be allocated to the request at prefill stage, and the third block will be allocated when its first decode step gets scheduled.

Can you give more information about the case of when it's not aligned?

Let's say the page_size is 4 and the prompt_len is also 4. there're two pages:
0, 1, 2, 3 | 4, 5, 6, 7
is it possible for the prompt to use the indices of (2, 3, 4, 5)?

yaochengji · 2025-05-19T05:15:53Z

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py

        for req_id, (req, block_ids) in self._reqs_need_recv.items():
            assert req.kv_transfer_params is not None
+            _kv_transfer_params = copy.deepcopy(req.kv_transfer_params)
+            _kv_transfer_params["do_remote_prefill"] = True


May I ask why do we need to make it a deepcopy and set do_remote_prefill to True, but previously we don't need to?

We rely on the attributes of do_remote_prefill/decode in ReqMeta to determine the direction of data-copy (i.e., D2H, or H2D).
The two lines here are just for aligning with add_new_req. Otherwise, we need to re-assign the attribute after calling add_new_req.

# e.g., meta.add_new_req( request_id=req_id, local_block_ids=block_ids, kv_transfer_params=req.kv_transfer_params, ) meta.requests[req_id].do_remote_prefill = True

Let's avoid the deepcopys, it adds some unnecessary overhead. No harm in adding new args (with default vals) to add_new_req

mergify · 2025-05-19T16:31:31Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @juncgu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2025-05-21T05:43:22Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @juncgu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

vanbasten23 · 2025-05-22T02:53:57Z

vllm/v1/worker/tpu_model_runner.py

+    tpu_cache: torch.Tensor,
+    tpu_block_indices: torch.Tensor,
+) -> None:
+    torch.ops.xla.dynamo_set_buffer_donor_(tpu_cache, True)


I wonder why we do torch.ops.xla.dynamo_set_buffer_donor_(tpu_cache, True) here

vanbasten23 · 2025-05-22T04:49:51Z

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py

+            meta = self._recving_metadata[req_id]
+            # local decode only
+            if not meta.do_remote_prefill:
+                return


should here be continue or return?

Since only one request is handled by this function, "continue" or "return" will be the same.

juncgu · 2025-07-11T05:50:26Z

@juncgu ok that other PR has been merged now. The changes eliminate a couple of the "mixin" methods but some logic has to be added into the tpu_worker.py execute_model method. See #19555 and #20756.

Hi @njhill, please review the latest version.

njhill · 2025-07-11T11:36:07Z

Thanks @juncgu, looks good to me. Could you merge in latest main again, hopefully that will fix the CI failures which look unrelated.

Signed-off-by: Juncheng Gu <juncgu@gmail.com>

juncgu · 2025-07-11T23:38:55Z

Thanks @juncgu, looks good to me. Could you merge in latest main again, hopefully that will fix the CI failures which look unrelated.

Thanks, @njhill. It's clear now after merging the latest main.

mergify · 2025-07-14T09:40:53Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @juncgu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Juncheng Gu <juncgu@gmail.com>

mergify · 2025-07-16T02:25:09Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @juncgu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Juncheng Gu <juncgu@gmail.com>

mergify · 2025-07-21T16:15:18Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @juncgu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Juncheng Gu <juncgu@gmail.com>

njhill · 2025-07-24T16:58:24Z

Thanks again for all your hard work and patience with this @juncgu!

…m-project#268) Signed-off-by: Juncheng Gu <juncgu@gmail.com> Signed-off-by: Richard Liu <ricliu@google.com> Co-authored-by: Juncheng Gu <6314092+juncgu@users.noreply.github.com> Co-authored-by: Richard Liu <39319471+richardsliu@users.noreply.github.com> Co-authored-by: Richard Liu <ricliu@google.com>

juncgu requested review from WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat and ywang96 as code owners May 17, 2025 01:26

mergify bot added v1 tpu Related to Google TPUs labels May 17, 2025

mergify bot added needs-rebase and removed needs-rebase labels May 17, 2025

robertgshaw2-redhat changed the title ~~[P/D] Support TPU in NixlConnector~~ [P/D] Support CPU Transfer in NixlConnector May 17, 2025

robertgshaw2-redhat self-assigned this May 17, 2025

robertgshaw2-redhat reviewed May 17, 2025

View reviewed changes

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py Outdated Show resolved Hide resolved

robertgshaw2-redhat reviewed May 17, 2025

View reviewed changes

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py Outdated Show resolved Hide resolved

robertgshaw2-redhat reviewed May 17, 2025

View reviewed changes

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py Outdated Show resolved Hide resolved

yaochengji reviewed May 19, 2025

View reviewed changes

mergify bot added needs-rebase and removed needs-rebase labels May 19, 2025

mergify bot added the needs-rebase label May 21, 2025

vanbasten23 reviewed May 22, 2025

View reviewed changes

Merge branch 'main' into tpu_nixl_merge

1d9bc01

Signed-off-by: Juncheng Gu <juncgu@gmail.com>

mergify bot added the needs-rebase label Jul 14, 2025

Merge branch 'main' into tpu_nixl_merge

506f3dd

Signed-off-by: Juncheng Gu <juncgu@gmail.com>

mergify bot removed the needs-rebase label Jul 14, 2025

mergify bot added the needs-rebase label Jul 16, 2025

Merge branch 'main' into tpu_nixl_merge

bd19d33

Signed-off-by: Juncheng Gu <juncgu@gmail.com>

mergify bot removed the needs-rebase label Jul 16, 2025

mergify bot added the needs-rebase label Jul 21, 2025

juncgu added 2 commits July 21, 2025 17:23

Merge branch 'main' into tpu_nixl_merge

7f9ecca

Signed-off-by: Juncheng Gu <juncgu@gmail.com>

revert vllm-project#19555

e720e14

Signed-off-by: Juncheng Gu <juncgu@gmail.com>

mergify bot removed the needs-rebase label Jul 21, 2025

juncgu added 4 commits July 21, 2025 20:28

get every tpu_worker's output

b689740

Signed-off-by: Juncheng Gu <juncgu@gmail.com>

Merge branch 'main' into tpu_nixl_merge

94831f1

Signed-off-by: Juncheng Gu <juncgu@gmail.com>

Merge branch 'main' into tpu_nixl_merge

821d2b6

Signed-off-by: Juncheng Gu <juncgu@gmail.com>

Merge branch 'main' into tpu_nixl_merge

21e2642

Signed-off-by: Juncheng Gu <juncgu@gmail.com>

njhill merged commit 6066284 into vllm-project:main Jul 24, 2025
72 checks passed

zhenwei-intel mentioned this pull request Aug 7, 2025

[XPU][P/D] Add XPU support in NixlConnector #22436

Merged

This was referenced Sep 11, 2025

[Cuda2CPU][P/D] Add cuda2cpu support in NixlConnector #24619

Closed

[Cuda2CPU][P/D] Add cuda2cpu support in NixlConnector #24690

Merged

sdavidbd mentioned this pull request Jan 13, 2026

fix multiconnector for multi connector use push kv connector #31144

Open

5 tasks

NickLucche mentioned this pull request Feb 3, 2026

[Roadmap]: PD Disaggregation with NixlConnector Roadmap #33702

Open

44 tasks

Uh oh!

Conversation

juncgu commented May 17, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented May 17, 2025

Uh oh!

mergify bot commented May 17, 2025

Uh oh!

juncgu commented May 17, 2025

Uh oh!

mergify bot commented May 17, 2025

Uh oh!

Uh oh!

robertgshaw2-redhat May 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

yaochengji left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mergify bot commented May 19, 2025

Uh oh!

mergify bot commented May 21, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juncgu commented Jul 11, 2025

Uh oh!

njhill commented Jul 11, 2025

Uh oh!

juncgu commented Jul 11, 2025

Uh oh!

mergify bot commented Jul 14, 2025

Uh oh!

mergify bot commented Jul 16, 2025

Uh oh!

mergify bot commented Jul 21, 2025

Uh oh!

njhill commented Jul 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

juncgu commented May 17, 2025 •

edited by github-actions bot

Loading

robertgshaw2-redhat May 17, 2025 •

edited

Loading