[Cuda2CPU][P/D] Add cuda2cpu support in NixlConnector #24690
NickLucche merged 5 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request adds support for CUDA to CPU KV cache offloading via the NixlConnector, which is a valuable feature for handling large models. The changes include updates to the NixlConnector to recognize 'cpu' as a valid buffer device for CUDA, and new integration tests to validate this functionality. My review identified a critical bug in the block copying logic within the CUDA platform code, which would cause incorrect behavior or errors. Additionally, I've pointed out some high-severity issues in the new test scripts, such as the use of eval which poses a security risk, and incorrect variable usage in log messages.
if [ -n "$model_args" ]; then
    FULL_CMD="$BASE_CMD $model_args"
else
    FULL_CMD="$BASE_CMD"
fi

eval "$FULL_CMD &"
The use of eval to execute the command is dangerous as it can lead to arbitrary code execution if variables in the command string are not properly sanitized. This occurs for both prefill (line 101) and decode (line 133) instance startup. While the risk is lower in a test script with controlled inputs, it's a bad practice. Please consider refactoring the command execution to use arrays, which is safer. This would involve changing get_model_args to output an array of arguments and then executing the command array directly.
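The array-based refactor suggested above can be sketched as follows. This is a hypothetical illustration, not the script's actual code: `get_model_args`, `BASE_CMD`, the model name, and the flags are placeholder values, and the real script would launch `vllm serve` in the background rather than `echo` the command. The point is that building the command as an array and executing it directly removes the need for `eval` entirely.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the array-based refactor (placeholder values).

# Emit extra arguments one per line instead of one space-joined string.
get_model_args() {
  printf '%s\n' --max-model-len 8192 --enforce-eager
}

# Placeholder base command; the real script would build "vllm serve ...".
BASE_CMD=(echo vllm serve my-model)

# Collect the extra args into an array and append them to the base command.
mapfile -t model_args < <(get_model_args)
FULL_CMD=("${BASE_CMD[@]}" "${model_args[@]}")

# Execute the array directly (the real script would append '&' to
# background it); no eval, so the command string is never re-parsed.
result=$("${FULL_CMD[@]}")
echo "$result"
```

Arguments containing spaces or shell metacharacters are passed through verbatim, since each array element becomes exactly one argv entry.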
if [ -n "$model_args" ]; then
    FULL_CMD="$BASE_CMD $model_args"
else
    FULL_CMD="$BASE_CMD"
fi

eval "$FULL_CMD &"
The use of eval to execute the command is dangerous as it can lead to arbitrary code execution if variables in the command string are not properly sanitized. This occurs for both prefill (line 69) and decode (line 87) instance startup. It's a best practice to avoid eval. Please consider refactoring the command execution to use arrays, which is safer. This would involve changing get_model_args to output an array of arguments and then executing the command array directly.
tests/v1/kv_connector/nixl_integration/run_cuda2cpu_edge_case_test.sh (resolved comment on outdated diff)
Force-pushed from f88c963 to ad91641.
This pull request has merge conflicts that must be resolved before it can be merged.
NickLucche left a comment:
Will review asap, sorry!
tests/v1/kv_connector/nixl_integration/run_cuda2cpu_accuracy_test.sh (resolved comment on outdated diff)
Thanks @chenxi-yang! It looks good at first glance, but would you mind reverting all of the formatting changes? They make it difficult to see which changes are real and which are unrelated to the purpose of the PR. Feel free to open a separate PR with formatting improvements!
Force-pushed from 90e2f0e to bfa16a4.
Cleaned up the formatting edits from auto-save!
Force-pushed from e2645ae to 01c955f.
NickLucche left a comment:
Looks clean, thanks @chenxi-yang! I just left a few minor comments.
Also, would you mind updating the docstring in KVTransferConfig.kv_buffer_device? We indeed support cpu now.
tests/v1/kv_connector/nixl_integration/run_cuda2cpu_accuracy_test.sh (resolved comment on outdated diff)
Sorry! I was swamped with other tasks. I am not opening a new one and have fixed the requested changes. Let me know what you think. Thank you!
copy_kv_blocks)
kv_transfer_group = get_kv_transfer_group()
kv_transfer_group.register_kv_caches(kv_caches)
kv_transfer_group.set_host_xfer_buffer_ops(copy_kv_blocks)
How about moving the `if self.kv_buffer_device != "cpu"` condition here? I think it would make the code more straightforward to read.
Suggestion:
if self.vllm_config.kv_transfer_config.kv_buffer_device == "cpu":
kv_transfer_group.set_host_xfer_buffer_ops(copy_kv_blocks)
Resolved in @NickLucche's comments below; please ignore.
if [[ "$KV_BUFFER_DEVICE" == "cuda" ]]; then
    KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
else
    KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\"}"
I added backends support in this PR: #25121. Could you also provide the option in run_accuracy_test?
Suggested code:
VLLM_NIXL_BACKEND=${VLLM_NIXL_BACKEND:-"[\"UCX\"]"}
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"${KV_BUFFER_DEVICE}\", \"kv_connector_extra_config\":{\"backends\":${VLLM_NIXL_BACKEND}}}'"
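The hand-escaped JSON in the quoted hunk and the suggestion above can also be built with jq, which handles quoting and escaping for you. This is a hypothetical sketch, assuming jq is available in the test environment; the variable names mirror the script's, and the backends value follows the suggestion above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: build the kv-transfer config JSON with jq instead of
# hand-escaping quotes inside a shell string (assumes jq is installed).
KV_BUFFER_DEVICE=${KV_BUFFER_DEVICE:-cpu}
VLLM_NIXL_BACKEND=${VLLM_NIXL_BACKEND:-'["UCX"]'}

# --arg passes a string; --argjson passes pre-parsed JSON (the backends list).
KV_CONFIG=$(jq -cn \
  --arg dev "$KV_BUFFER_DEVICE" \
  --argjson backends "$VLLM_NIXL_BACKEND" \
  '{kv_connector: "NixlConnector",
    kv_role: "kv_both",
    kv_buffer_device: $dev,
    kv_connector_extra_config: {backends: $backends}}')

echo "$KV_CONFIG"
```

Because jq serializes the values itself, a buffer device name or backend list containing quotes or backslashes cannot break the resulting JSON.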
I think we should keep the scope of the PR focused here; we can do that in a separate PR.
if [[ "$KV_BUFFER_DEVICE" == "cuda" ]]; then
    KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
else
    KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\"}"
Same ask for adding:
"kv_connector_extra_config":{"backends":${VLLM_NIXL_BACKEND}}
Will add through a different PR; please ignore my comments above.
@@ -672,6 +675,9 @@ def initialize_host_xfer_buffer(

def set_host_xfer_buffer_ops(self, copy_operation: CopyBlocksOp):
    """Assign copy (d2h, h2d) operations when host buffer is used."""
    # Set a no-op if the host buffer is not cpu.
    if self.kv_buffer_device != "cpu":
        return
OK to put the guard here; however, it is not very straightforward to me that the GPU model runner always calls set_host_xfer_buffer_ops even for a non-cpu buffer. Would it be OK to move the condition check to gpu_model_runner?
You can refer to @njhill's earlier comment, but more broadly this pair of functions makes sense when the selected buffer device is cpu, not when we're running on a particular platform.
And the GPU model runner arguably doesn't need to be aware of the selected buffer device, since it's a KV connector spec.
Resolved in @NickLucche's comments below; please ignore.
LGTM, thanks for contributing @chenxi-yang! Let's get the tests fixed (sync to main) and fix the DCO check.
Force-pushed from 3daceb3 to 732d7a9.
Signed-off-by: Chenxi Yang <cxyang@fb.com>
Force-pushed from 732d7a9 to fbba1a6.
Signed-off-by: Chenxi Yang <cxyang@fb.com> Co-authored-by: Chenxi Yang <cxyang@fb.com>
Signed-off-by: Chenxi Yang <cxyang@fb.com> Co-authored-by: Chenxi Yang <cxyang@fb.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Chenxi Yang <cxyang@fb.com> Co-authored-by: Chenxi Yang <cxyang@fb.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Accidentally closed #24619
Support PD disaggregation for CUDA:CPU based on #22436
Utilizing CPU memory as a buffer and performing point-to-point transmission via NIXL.
Use case: we want to support multi-turn conversations across multiple hosts. When a decode instance's KV cache is larger than GPU memory, or we need to pass another prefill's KV cache to the current prefill through decode, we may want to use CPU memory as the cache buffer.
Purpose
Currently, cuda:cpu is not supported (#18293 (comment)). This PR adds support for a host buffer on CUDA devices.
Test Plan
- bash tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh --kv_buffer_device cpu for the accuracy test.
- bash tests/v1/kv_connector/nixl_integration/run_edge_case_test.sh --kv_buffer_device cpu for the edge case test.

Test Result
All tests completed!