[NIXL][BUG FIX] Fix a bug for PD with host_buffer after merging 29665#30420
[NIXL][BUG FIX] Fix a bug for PD with host_buffer after merging 29665#30420njhill merged 6 commits intovllm-project:mainfrom
Conversation
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
Code Review
This pull request addresses a KeyError that occurs when using NIXL with a host buffer for prefill/decode operations. The issue stems from accessing remote_request_id in kv_transfer_params which may not always be present on the prefill side. The fix correctly uses dict.get() for safe access, preventing a crash. This change is sound and effectively resolves the bug. The removal of a blank line is a minor stylistic improvement. The changes are approved.
markmc
left a comment
There was a problem hiding this comment.
Thank you
This is only an issue with --kv_buffer_device cpu which I guess we don't test in CI
I don't think remote_request_id should be particularly special here - a prefill request should be able to just contain kv_transfer_params = {'do_remote_decode': True} IMO
| remote_block_ids=kv_transfer_params["remote_block_ids"], | ||
| remote_engine_id=kv_transfer_params["remote_engine_id"], | ||
| remote_request_id=kv_transfer_params["remote_request_id"], | ||
| remote_request_id=kv_transfer_params.get("remote_request_id", ""), |
There was a problem hiding this comment.
I don't think we should require remote_engine_id, remote_block_ids, remote_request_id, remote_host, or remote_port for a prefill request - it just happens that existing proxy servers set those fields, but I don't think there's any reason to require them
Let's use .get(..., None) for all of them - AFAICT all the other fields are None in a prefill request currently
There was a problem hiding this comment.
Ok, will do.
remote_request_id=kv_transfer_params.get("remote_request_id", None) => Failed pre-commit mypy, so I used remote_request_id=kv_transfer_params.get("remote_request_id", "") instead
There was a problem hiding this comment.
Done, @markmc , please help to review again.
There was a problem hiding this comment.
Sorry to be a pain, but I'm pretty sure the proxies are sending None for these values currently - if that's true (please double-check I'm not mistaken!) then we should update the ReqMeta type hints to allow these to be None
There was a problem hiding this comment.
Yes, proxies are sending None! I tested with current default("" or 0), it also works, but I can update the datatype in ReqMeta to make it as None as well
There was a problem hiding this comment.
@markmc
By adding Optional of None for ReqMeta, the mypy complains more now
Wondering if I can ease the pain by simply providing a default value("" or 0 or [])?
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1129: error: Argument 2 to "submit" of "Executor" has incompatible type "str | None"; expected "str" [arg-type]
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1130: error: Argument 3 to "submit" of "Executor" has incompatible type "int | None"; expected "int" [arg-type]
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1636: error: Argument 2 to "map" has incompatible type "list[int] | None"; expected "Iterable[int]" [arg-type]
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1771: error: Need type annotation for "block_ids_to_permute" (hint: "block_ids_to_permute: list[<type>] = ...") [var-annotated]
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1780: error: Argument 1 to "__iadd__" of "list" has incompatible type "list[int] | None"; expected "Iterable[Any]" [arg-type]
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1920: error: Argument 1 to "_logical_to_kernel_block_ids" of "NixlConnectorWorker" has incompatible type "list[int] | None"; expected "list[int]" [arg-type]
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1937: error: Argument 2 to "_background_nixl_handshake" of "NixlConnectorWorker" has incompatible type "str | None"; expected "str" [arg-type]
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1975: error: Argument "dst_engine_id" to "_read_blocks" of "NixlConnectorWorker" has incompatible type "str | None"; expected "str" [arg-type]
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1976: error: Argument "remote_request_id" to "_read_blocks" of "NixlConnectorWorker" has incompatible type "str | None"; expected "str" [arg-type]
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1977: error: Argument "local_block_ids" to "_read_blocks" of "NixlConnectorWorker" has incompatible type "list[int] | None"; expected "list[int]" [arg-type]
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1978: error: Argument "remote_block_ids" to "_read_blocks" of "NixlConnectorWorker" has incompatible type "list[int] | None"; expected "list[int]" [arg-type]
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
| remote_engine_id=kv_transfer_params.get("remote_engine_id"), # type: ignore | ||
| remote_request_id=kv_transfer_params.get("remote_request_id"), # type: ignore | ||
| remote_host=kv_transfer_params.get("remote_host"), # type: ignore | ||
| remote_port=kv_transfer_params.get("remote_port"), # type: ignore |
There was a problem hiding this comment.
@markmc , do you think this works?
I tried updating ReqMeta with optional, then the mypy check exploded. => Since we only want to align with toy_proxy, so I am hoping to use # type ignore , in that case, we can still use None and make mypy happy
On the decode side, in reqs_to_save, we do not expect or need any of these remote_ fields - they are for the prefill side only. Make this more clear by putting these fields in their own dataclass which is only present on requests in reqs_to_recv. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
The mypy errors are a sign that we can do better here - the current code basically disables type checking Because the value returned by the dict lookup is I've sent you xuechendi#15 which will enable proper type checking. Please do check it over and make sure it makes sense to you 👍 |
[NIXL] Refactor prefill-only ReqMeta into RemoteMeta
Oh, that's quite big change...., I was not expecting to refactoring. Looks good to me, thanks |
markmc
left a comment
There was a problem hiding this comment.
@NickLucche PTAL, since I've done a bunch of refactoring in the latest version
There was a problem hiding this comment.
Thanks @xuechendi @markmc! The RemoteMeta refactoring seems much cleaner!
| return compat_hash | ||
|
|
||
|
|
||
| @dataclass |
There was a problem hiding this comment.
nit: perf benefit
| @dataclass | |
| @dataclass(slots=True) |
can add it to other dataclasses here too!
Edit: can do this in a follow-on
…vllm-project#30420) Signed-off-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com>
…vllm-project#30420) Signed-off-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Joachim Studnia <joachim@mistral.ai>
…vllm-project#30420) Signed-off-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com>
…vllm-project#30420) Signed-off-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
…vllm-project#30420) Signed-off-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
…vllm-project#30420) Signed-off-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Purpose
After #29665 merged, the PR added
remote_request_id=kv_transfer_params["remote_request_id"],toadd_new_reqHowever, in prefill side, there is no
remote_request_idin kv_transfer_params=> This issue only occurs when running PD with host_buffer
Test Plan
without the PR, test will fail immediately
With this PR, there will still be an accuracy issue, which fix is proposed at #30419
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.