
[NIXL][BUG FIX] Fix a bug for PD with host_buffer after merging 29665 #30420

Merged
njhill merged 6 commits into vllm-project:main from xuechendi:dev/fix_pd_host_buffer on Dec 14, 2025

Contributor

@xuechendi xuechendi commented Dec 10, 2025

Purpose

#29665 added remote_request_id=kv_transfer_params["remote_request_id"] to the add_new_req call.

However, on the prefill side, kv_transfer_params contains no remote_request_id, so the lookup fails.

=> This issue only occurs when running PD with host_buffer.
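
The failure mode described above can be illustrated with a minimal sketch. It assumes the prefill request carries only the do_remote_decode flag (as discussed later in this thread); the helper names strict_lookup and safe_lookup are hypothetical and are not part of the connector.

```python
from typing import Any

# Hypothetical prefill-side parameters: only the decode flag is present.
prefill_params: dict[str, Any] = {"do_remote_decode": True}


def strict_lookup(params: dict[str, Any]) -> Any:
    # Subscript access raises KeyError when the key is missing.
    return params["remote_request_id"]


def safe_lookup(params: dict[str, Any]) -> Any:
    # dict.get returns None (or a supplied default) instead of raising.
    return params.get("remote_request_id")


try:
    strict_lookup(prefill_params)
    strict_failed = False
except KeyError:
    strict_failed = True
```

Here strict_lookup mirrors the subscript access that crashes, while safe_lookup mirrors the dict.get() style used in the first version of the fix.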

Test Plan

PREFILL_BLOCK_SIZE=16 DECODE_BLOCK_SIZE=16 bash tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh --kv_buffer_device cpu

Without this PR, the test fails immediately.
With this PR, there is still an accuracy issue; a fix for it is proposed in #30419.

Test Result



Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@mergify mergify bot added the kv-connector label Dec 10, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a KeyError that occurs when using NIXL with a host buffer for prefill/decode operations. The issue stems from accessing remote_request_id in kv_transfer_params which may not always be present on the prefill side. The fix correctly uses dict.get() for safe access, preventing a crash. This change is sound and effectively resolves the bug. The removal of a blank line is a minor stylistic improvement. The changes are approved.

Member

@markmc markmc left a comment

Thank you

This is only an issue with --kv_buffer_device cpu which I guess we don't test in CI

I don't think remote_request_id should be particularly special here - a prefill request should be able to just contain kv_transfer_params = {'do_remote_decode': True} IMO

  remote_block_ids=kv_transfer_params["remote_block_ids"],
  remote_engine_id=kv_transfer_params["remote_engine_id"],
- remote_request_id=kv_transfer_params["remote_request_id"],
+ remote_request_id=kv_transfer_params.get("remote_request_id", ""),

Member

I don't think we should require remote_engine_id, remote_block_ids, remote_request_id, remote_host, or remote_port for a prefill request - it just happens that existing proxy servers set those fields, but I don't think there's any reason to require them

Let's use .get(..., None) for all of them - AFAICT all the other fields are None in a prefill request currently

Contributor Author

OK, will do.
remote_request_id=kv_transfer_params.get("remote_request_id", None) failed the pre-commit mypy check, so I used remote_request_id=kv_transfer_params.get("remote_request_id", "") instead.

Contributor Author

Done. @markmc, please review again.

Member

Sorry to be a pain, but I'm pretty sure the proxies are sending None for these values currently - if that's true (please double-check I'm not mistaken!) then we should update the ReqMeta type hints to allow these to be None

Contributor Author

@xuechendi xuechendi Dec 11, 2025

Yes, the proxies are sending None! I tested with the current defaults ("" or 0) and that also works, but I can update the types in ReqMeta to allow None as well.

Contributor Author

@markmc
After making the ReqMeta fields Optional, mypy complains even more.
Could I ease the pain by simply providing a default value ("", 0, or [])?

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1129: error: Argument 2 to "submit" of "Executor" has incompatible type "str | None"; expected "str"  [arg-type]                                     
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1130: error: Argument 3 to "submit" of "Executor" has incompatible type "int | None"; expected "int"  [arg-type]                                     
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1636: error: Argument 2 to "map" has incompatible type "list[int] | None"; expected "Iterable[int]"  [arg-type]                                      
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1771: error: Need type annotation for "block_ids_to_permute" (hint: "block_ids_to_permute: list[<type>] = ...")  [var-annotated]                     
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1780: error: Argument 1 to "__iadd__" of "list" has incompatible type "list[int] | None"; expected "Iterable[Any]"  [arg-type]                       
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1920: error: Argument 1 to "_logical_to_kernel_block_ids" of "NixlConnectorWorker" has incompatible type "list[int] | None"; expected "list[int]"  [arg-type]
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1937: error: Argument 2 to "_background_nixl_handshake" of "NixlConnectorWorker" has incompatible type "str | None"; expected "str"  [arg-type]      
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1975: error: Argument "dst_engine_id" to "_read_blocks" of "NixlConnectorWorker" has incompatible type "str | None"; expected "str"  [arg-type]      
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1976: error: Argument "remote_request_id" to "_read_blocks" of "NixlConnectorWorker" has incompatible type "str | None"; expected "str"  [arg-type]  
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1977: error: Argument "local_block_ids" to "_read_blocks" of "NixlConnectorWorker" has incompatible type "list[int] | None"; expected "list[int]"  [arg-type]
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1978: error: Argument "remote_block_ids" to "_read_blocks" of "NixlConnectorWorker" has incompatible type "list[int] | None"; expected "list[int]"  [arg-type]

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi requested a review from markmc December 11, 2025 15:41
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
remote_engine_id=kv_transfer_params.get("remote_engine_id"), # type: ignore
remote_request_id=kv_transfer_params.get("remote_request_id"), # type: ignore
remote_host=kv_transfer_params.get("remote_host"), # type: ignore
remote_port=kv_transfer_params.get("remote_port"), # type: ignore

Contributor Author

@xuechendi xuechendi Dec 11, 2025

@markmc, do you think this works?
I tried making the ReqMeta fields Optional, but then the mypy check exploded. Since we only want to align with toy_proxy, I am hoping to use # type: ignore; that way we can still pass None and keep mypy happy.

On the decode side, in reqs_to_save, we do not expect or need
any of these remote_ fields - they are for the prefill side
only. Make this more clear by putting these fields in their
own dataclass which is only present on requests in reqs_to_recv.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
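
The idea in the commit message above, moving the remote_ fields into their own dataclass that is attached only to requests in reqs_to_recv, might look roughly like the sketch below. This is an illustration under stated assumptions, not the merged code: the field names are borrowed from the kv_transfer_params keys mentioned in this thread, and make_req_meta with its load_remote flag is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RemoteMeta:
    # Remote-side details; every field is required once the object exists.
    block_ids: list[int]
    engine_id: str
    request_id: str
    host: str
    port: int


@dataclass
class ReqMeta:
    local_block_ids: list[int]
    # None unless the request needs to read from a remote engine.
    remote: Optional[RemoteMeta] = None


def make_req_meta(
    local_block_ids: list[int],
    params: dict[str, Any],
    load_remote: bool,  # hypothetical flag: True for requests in reqs_to_recv
) -> ReqMeta:
    meta = ReqMeta(local_block_ids=local_block_ids)
    if load_remote:
        # The remote_ keys are only read when they are actually needed.
        meta.remote = RemoteMeta(
            block_ids=params["remote_block_ids"],
            engine_id=params["remote_engine_id"],
            request_id=params["remote_request_id"],
            host=params["remote_host"],
            port=params["remote_port"],
        )
    return meta


save_meta = make_req_meta([1, 2], {"do_remote_decode": True}, load_remote=False)
recv_meta = make_req_meta(
    [3, 4],
    {
        "remote_block_ids": [7, 8],
        "remote_engine_id": "engine-0",
        "remote_request_id": "req-0",
        "remote_host": "localhost",
        "remote_port": 5600,
    },
    load_remote=True,
)
```

With this shape, code that needs the remote fields has to check meta.remote first, so a type checker can flag places that would otherwise use None as a str or list[int].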

Member

markmc commented Dec 12, 2025

The mypy errors are a sign that we can do better here - the current code basically disables type checking

    def _add_new_req(
        ...
        kv_transfer_params: dict[str, Any],
    ):
        _req = ReqMeta(
            ...
            remote_engine_id=kv_transfer_params["remote_engine_id"],
        )

Because the value returned by the dict lookup is Any, mypy does no type checking - even though it could be None (and we know it definitely is None on the decode side runtime), mypy doesn't complain that we are assigning it to a str field. And what's worse, it's a really subtle way of disabling type checking ... it's not at all obvious to the reader, so someone reading this code might be very surprised to find these fields are None at runtime
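
A small self-contained sketch of the effect described above: a value read from a dict[str, Any] has type Any, so a static checker such as mypy accepts it for a field annotated as str even when the runtime value is None. StrictMeta and build are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class StrictMeta:
    # Annotated as a plain str, not Optional[str].
    remote_engine_id: str


def build(params: dict[str, Any]) -> StrictMeta:
    # The subscript result is typed Any, so the checker does not compare it
    # against the str annotation; None slips through unnoticed.
    return StrictMeta(remote_engine_id=params["remote_engine_id"])


# Assumed input shape: the key is present but its value is None.
meta = build({"do_remote_decode": True, "remote_engine_id": None})
```

Dataclasses do not validate annotations at runtime, so meta.remote_engine_id is simply None here, contradicting the declared type without any error.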

I've sent you xuechendi#15 which will enable proper type checking. Please do check it over and make sure it makes sense to you 👍

[NIXL] Refactor prefill-only ReqMeta into RemoteMeta
@mergify mergify bot added the v1 label Dec 12, 2025
Contributor Author

xuechendi commented Dec 12, 2025

Oh, that's quite a big change. I was not expecting a refactoring. Looks good to me, thanks.
I have merged the change now, @markmc.

Member

@markmc markmc left a comment

@NickLucche PTAL, since I've done a bunch of refactoring in the latest version

Member

@njhill njhill left a comment

Thanks @xuechendi @markmc! The RemoteMeta refactoring seems much cleaner!

return compat_hash


@dataclass

Member

@njhill njhill Dec 13, 2025

nit: perf benefit

Suggested change
- @dataclass
+ @dataclass(slots=True)

can add it to other dataclasses here too!

Edit: can do this in a follow-on

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 13, 2025
@njhill njhill enabled auto-merge (squash) December 13, 2025 17:50
@njhill njhill merged commit ae2e503 into vllm-project:main Dec 14, 2025
52 checks passed