[P/D] Support CPU Transfer in NixlConnector#18293
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
```python
# cpu kv buffer for xfer
# used when xPU memory can not be registered under nixl
self.host_xfer_buffers: dict[str, torch.Tensor] = {}
self.use_host_buffer = _NIXL_SUPPORTED_XPU_TYPE.support(self.kv_buffer_device)
```
Is there any reason why we cannot allow the GPU to use the host buffer?
Also, it is confusing that the tpu uses kv_buffer_device: "tpu" when it is using the host buffer.
I think we should allow the user to specify kv_buffer_device, then have:
```python
_NIXL_SUPPORTED_XPU_TYPE = {
    "cuda": ["cuda", "cpu"],
    "tpu": ["cpu"],
}

if self.kv_buffer_device not in _NIXL_SUPPORTED_XPU_TYPE[current_platform.platform()]:
    raise
if self.kv_buffer_device == "cuda":
    self.nixl_memory_type = "VRAM"
else:
    assert self.kv_buffer_device == "cpu"
    self.nixl_memory_type = "DRAM"
```
cuda: cpu is not supported yet.
yaochengji left a comment
Thanks for supporting such an important feature, left a few comments.
Also, could you add some tests? IMO, the minimum requirements are a unit test for nixl connector on tpu and an e2e accuracy test.
```python
if params.get("do_remote_decode"):
    # NOTE: only need to save / send full computed blocks
    block_ids = blocks.get_block_ids()[0]
    all_full = request.num_tokens % self.block_size == 0
```
Are there multiple requests in the block_ids? If so, `request.num_tokens % self.block_size == 0` cannot guarantee all_full.
Oh, actually even for one request, e.g. num_tokens = 2 * block_size, it can use 3 blocks when it's not aligned.
The block_ids are for a single request.
In my understanding, when e.g. prompt_len == 2 * block_size, 2 blocks will be allocated to the request at prefill stage, and the third block will be allocated when its first decode step gets scheduled.
Can you give more information about the case of when it's not aligned?
Let's say the page_size is 4 and the prompt_len is also 4. There are two pages:
0, 1, 2, 3 | 4, 5, 6, 7
is it possible for the prompt to use the indices of (2, 3, 4, 5)?
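As discussed above, allocation is per-request and block-aligned, so a request cannot span a block boundary like (2, 3, 4, 5); the number of blocks a request occupies is just a ceiling division. A toy sketch (not vLLM's actual allocator) of why `num_tokens % block_size == 0` implies all blocks are full for a single request:

```python
import math

BLOCK_SIZE = 4  # toy page size, matching the example above

def blocks_needed(num_tokens: int) -> int:
    """Blocks a single request occupies, assuming allocation is
    block-aligned (the request starts at offset 0 of its first block)."""
    return math.ceil(num_tokens / BLOCK_SIZE)

# num_tokens == 2 * BLOCK_SIZE -> exactly 2 fully-filled blocks,
# so num_tokens % BLOCK_SIZE == 0 means every allocated block is full
print(blocks_needed(8))  # 2
# one extra token spills into a third, partially-filled block
print(blocks_needed(9))  # 3
```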
```python
for req_id, (req, block_ids) in self._reqs_need_recv.items():
    assert req.kv_transfer_params is not None
    _kv_transfer_params = copy.deepcopy(req.kv_transfer_params)
    _kv_transfer_params["do_remote_prefill"] = True
```
May I ask why we need to make a deepcopy and set do_remote_prefill to True here, when previously we didn't need to?
We rely on the do_remote_prefill/decode attributes in ReqMeta to determine the direction of the data copy (i.e., D2H or H2D).
The two lines here are just for aligning with add_new_req; otherwise, we would need to re-assign the attribute after calling add_new_req.
```python
# e.g.,
meta.add_new_req(
    request_id=req_id,
    local_block_ids=block_ids,
    kv_transfer_params=req.kv_transfer_params,
)
meta.requests[req_id].do_remote_prefill = True
```
Let's avoid the deepcopies; they add unnecessary overhead. No harm in adding new args (with default vals) to add_new_req.
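The suggestion above could look something like the following. This is a hypothetical sketch, not the actual vLLM classes: the ReqMeta/ConnectorMetadata shapes here are simplified stand-ins, showing only how a defaulted `do_remote_prefill` argument removes the need to deepcopy `kv_transfer_params` just to flip one flag.

```python
from dataclasses import dataclass, field

@dataclass
class ReqMeta:
    # simplified stand-in for the real ReqMeta
    local_block_ids: list[int]
    kv_transfer_params: dict
    do_remote_prefill: bool = False

@dataclass
class ConnectorMetadata:
    requests: dict[str, ReqMeta] = field(default_factory=dict)

    def add_new_req(
        self,
        request_id: str,
        local_block_ids: list[int],
        kv_transfer_params: dict,
        do_remote_prefill: bool = False,  # new arg with a default value
    ) -> None:
        self.requests[request_id] = ReqMeta(
            local_block_ids=local_block_ids,
            kv_transfer_params=kv_transfer_params,  # no deepcopy needed
            do_remote_prefill=do_remote_prefill,
        )

meta = ConnectorMetadata()
# caller passes the flag directly instead of mutating a deep copy
meta.add_new_req("req-0", [0, 1], {"remote_host": "p0"}, do_remote_prefill=True)
```

Existing call sites keep working unchanged because the new argument defaults to False.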
```python
    tpu_cache: torch.Tensor,
    tpu_block_indices: torch.Tensor,
) -> None:
    torch.ops.xla.dynamo_set_buffer_donor_(tpu_cache, True)
```
I wonder why we do `torch.ops.xla.dynamo_set_buffer_donor_(tpu_cache, True)` here.
```python
meta = self._recving_metadata[req_id]
# local decode only
if not meta.do_remote_prefill:
    return
```
Should this be continue or return?
Since only one request is handled by this function, continue and return behave the same.
Thanks @juncgu, looks good to me. Could you merge in latest main again? Hopefully that will fix the CI failures, which look unrelated.
Signed-off-by: Juncheng Gu <juncgu@gmail.com>
Thanks again for all your hard work and patience with this @juncgu!
…m-project#268) Signed-off-by: Juncheng Gu <juncgu@gmail.com> Signed-off-by: Richard Liu <ricliu@google.com> Co-authored-by: Juncheng Gu <6314092+juncgu@users.noreply.github.com> Co-authored-by: Richard Liu <39319471+richardsliu@users.noreply.github.com> Co-authored-by: Richard Liu <ricliu@google.com>
This PR adds TPU support in NixlConnector (#17751) for P/D disaggregated serving. The high-level idea is to use a buffer in host memory as the KV transfer buffer. The KV transfer buffer is registered under the nixl agent (as the type "DRAM"). The computed KV cache (full blocks) at the prefill instance is saved to the transfer buffer. On the decode side, the remote KV data is read into the transfer buffer and then loaded into device memory.
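The staging flow described above can be sketched as follows. This is illustrative only: the helper names are hypothetical, not the actual NixlConnector API; the point is that full KV blocks are staged through a host buffer (the tensor registered with the nixl agent as "DRAM") in both directions.

```python
import torch

def save_blocks_to_host(device_cache: torch.Tensor,
                        host_buffer: torch.Tensor,
                        block_ids: list[int]) -> None:
    # Prefill side: D2H copy of the computed full blocks into the
    # nixl-registered host transfer buffer.
    host_buffer[block_ids] = device_cache[block_ids].to(host_buffer.device)

def load_blocks_from_host(device_cache: torch.Tensor,
                          host_buffer: torch.Tensor,
                          block_ids: list[int]) -> None:
    # Decode side: after nixl has read the remote KV data into the host
    # buffer, H2D copy it into the local device cache.
    device_cache[block_ids] = host_buffer[block_ids].to(device_cache.device)

# Toy demo on CPU: a cache of 4 blocks, each block shaped (2, 3)
device_cache = torch.arange(24, dtype=torch.float32).reshape(4, 2, 3)
host_buffer = torch.zeros_like(device_cache)
save_blocks_to_host(device_cache, host_buffer, [0, 1])
```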
Currently, we support the same P/D disaggregated serving scenarios as #17751:
We will follow the updates in NixlConnector and support the upcoming features mentioned in #17751.
How to configure NixlConnector for TPU?
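An invocation could look roughly like the sketch below. This is illustrative only: the model name and kv_role values are assumptions, not taken from this PR; the referenced test script shows the exact setup. The idea is that the connector is selected via `--kv-transfer-config`, with `kv_buffer_device` set to `"cpu"` so the host transfer buffer is used.

```shell
# Prefill instance (illustrative; check the test script for exact flags)
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --kv-transfer-config \
  '{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_buffer_device":"cpu"}'

# Decode instance
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --kv-transfer-config \
  '{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_buffer_device":"cpu"}'
```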
A simple 1p1d disaggregation example can be found at
tests/v1/kv_connector/nixl_integration/run_tpu_disagg_accuracy_test.sh.

Notes:
The extra time overhead from the nixl-agent handshake and XLA compilation may hit the time limit of execute_model in the multiproc_executor. Therefore, we suggest relaxing the timeout by setting the VLLM_MULTIPROC_EXECUTE_MODEL_TIMEOUT_S env var.