[NIXL][BUG FIX] Fix both failing issue and accuracy issue with nixl + host_buffer on CUDA#30419
Conversation
Code Review
This pull request aims to fix a failing issue and an accuracy issue related to NIXL with a CPU host buffer. The change in nixl_connector.py to use .get() for remote_request_id is a good defensive measure against potential KeyError exceptions. However, I've identified a critical issue in the new logic added to scheduler.py. The new call to update_state_after_alloc only passes new_blocks, which for chunked prefills, results in an incomplete list of blocks being registered for transfer. This would lead to data corruption and accuracy problems, which is likely the very issue this PR is trying to solve. I have provided a suggestion to fix this by passing all of the request's blocks.
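The chunked-prefill failure mode described in this review can be illustrated with a minimal sketch. `ToyConnector`, the simplified `update_state_after_alloc` signature, and the block lists below are hypothetical stand-ins for illustration only, not the actual vLLM classes:

```python
# Toy model of the bug: each scheduling step passes only the newly
# allocated blocks, so later chunks overwrite earlier registrations.

class ToyConnector:
    def __init__(self):
        # blocks registered for KV transfer, per request id
        self.blocks_for_transfer: dict[str, list[int]] = {}

    def update_state_after_alloc(self, req_id: str, blocks: list[int]) -> None:
        # Overwrites whatever was registered before for this request.
        self.blocks_for_transfer[req_id] = list(blocks)


conn = ToyConnector()
all_blocks = [0, 1, 2, 3]

# Chunked prefill schedules the request twice; each call sees only the
# newly allocated blocks of that chunk.
conn.update_state_after_alloc("req-1", all_blocks[:2])  # first chunk
conn.update_state_after_alloc("req-1", all_blocks[2:])  # second chunk
assert conn.blocks_for_transfer["req-1"] == [2, 3]      # blocks 0 and 1 lost

# Passing the request's full block list keeps the transfer complete.
conn.update_state_after_alloc("req-1", all_blocks)
assert conn.blocks_for_transfer["req-1"] == [0, 1, 2, 3]
```

This is why the review suggests passing all of the request's blocks rather than only `new_blocks`.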
@KuntaiDu, could you take a look? I'm not sure whether LMCache also needs this.
Force-pushed 75538df to fea492b
@markmc @njhill I don't think this change is necessary; @xuechendi, you can look at the offloading connector for a reference. Now that I think of it, I think there's also another bug in the nixl connector, triggered when a chunked-prefill request gets preempted.
@orozery |
Force-pushed fea492b to fc1ff41
@NickLucche @markmc @orozery, I have now switched the fix to follow a similar approach to the one done in
Force-pushed fc1ff41 to 5943fdf
This pull request has merge conflicts that must be resolved before it can be merged.
if is_partial:
    continue
I was wondering if it's not better to save what we have, instead of waiting for the entire request to be available.
Then I see that self.copy_blocks is blocking so I guess it does not matter.
@NickLucche your thoughts?
I think there's interesting developments for copying chunks, but for now I would treat this PR as a bug fix and prioritize getting the feature in a working state.
We can leave this optimization for future work.
@orozery done. Changed back to copying immediately.
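As a rough illustration of the "copy immediately" behavior settled on above: instead of skipping partial prefills and copying once at the end, each chunk's blocks are copied to the host buffer as soon as they are produced. All names here (`copy_blocks`, `host_buffer`, the chunk lists) are simplified stand-ins, not the real vLLM code:

```python
# Stand-in for the blocking device-to-host copy discussed in this thread.
def copy_blocks(dst: dict, block_ids: list) -> None:
    for b in block_ids:
        dst[b] = f"kv-{b}"  # placeholder for the real KV-cache data


host_buffer: dict = {}

# Each scheduled chunk triggers a copy, whether the request is partial or not.
for chunk_blocks in ([0, 1], [2], [3, 4]):
    copy_blocks(host_buffer, chunk_blocks)

assert sorted(host_buffer) == [0, 1, 2, 3, 4]
```

Since the underlying copy is blocking anyway (as noted above), copying per chunk versus once at the end mainly changes when the cost is paid, not the total amount copied.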
# list of GPU block IDs per request
self._request_block_ids: dict[ReqId, list[int]] = {}
Instead of tracking the block IDs of all requests, let's just track the block IDs of requests that need saving.
You can discard this new variable and use self._reqs_need_save to track the block IDs.
agree we can re-use the existing container
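A minimal sketch of the consolidation suggested above — track block IDs only for requests that need saving, inside a single container. `ToyConnectorScheduler` and its method names are hypothetical; the real vLLM class and API differ:

```python
# One dict does both jobs: membership means "needs saving", and the value
# accumulates the GPU block IDs across chunked-prefill steps.

class ToyConnectorScheduler:
    def __init__(self):
        # req_id -> accumulated GPU block IDs, only for requests being saved
        self._reqs_need_save = {}

    def mark_for_save(self, req_id, block_ids):
        self._reqs_need_save.setdefault(req_id, []).extend(block_ids)

    def pop_save_meta(self, req_id):
        # Consumed once the save metadata is built (or the request aborts).
        return self._reqs_need_save.pop(req_id, None)


s = ToyConnectorScheduler()
s.mark_for_save("req-1", [0, 1])
s.mark_for_save("req-1", [2])            # a later chunk extends the same entry
assert s.pop_save_meta("req-1") == [0, 1, 2]
assert s.pop_save_meta("req-1") is None  # already consumed
```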
NickLucche left a comment:
Thanks for the work @xuechendi and the great review @orozery !
Left a few comments but things look good overall.
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Thanks! @NickLucche, I added the comment.
# Clear _reqs_need_save if a request is aborted as partial prefill.
self._reqs_need_save.pop(request.request_id, None)
I think you need to move this up a few lines, at least before if request.status != RequestStatus.FINISHED_LENGTH_CAPPED:
Oh, I see, thanks, updated
# Clear _reqs_need_save if a request is aborted as partial prefill.
self._reqs_need_save.pop(request.request_id, None)
I think this should work, but it seems more fragile to me.
I would go further and put this pop right at the beginning.
Then, you could also remove the entire is_partial check from build_connector_meta.
Since _reqs_need_save is originally used to buffer which requests should create req_meta for saving to the host, its life cycle runs from "scheduled" to "all request metadata are created".
If we go with your proposal, the life cycle becomes from "scheduled" to "request ends".
This changes the original design.
@NickLucche, do you think we should do that?
I assume the fix here is just to handle a corner case where the request was aborted?
I think @orozery is proposing to only clear the id on request finished, so either terminal block was processed or abort/error.
I think it makes sense but I don't necessarily think the current implementation is fragile.
Hence I don't have a strong opinion here, this could also be done in a separate PR, as long as we maximize clarity for these cases.
@NickLucche @orozery, let's do that in a separate PR. Since other queues (_reqs_to_recv, _reqs_need_send, etc.) are also cleared in build_connector_meta, I would prefer not to add refactoring scope to this PR (which is originally an accuracy fix). Once this one is merged, I can open a new PR and we can work out a better design there.
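A hedged sketch of the ordering question debated in this thread: popping before the status check guarantees an aborted partial prefill can never leave a stale entry behind. The function shape and status constants here are simplified stand-ins, not the actual vLLM request-finished API:

```python
# If the early return on request status ran before the pop, an aborted
# partial prefill would leak its entry in _reqs_need_save.

FINISHED_LENGTH_CAPPED = "length_capped"
ABORTED = "aborted"


def request_finished(status: str, req_id: str, reqs_need_save: dict) -> None:
    # Clearing first: every terminal path drops the bookkeeping entry.
    reqs_need_save.pop(req_id, None)
    if status != FINISHED_LENGTH_CAPPED:
        return  # aborted/errored: nothing more to do
    # ... build transfer metadata for normally finished requests ...


reqs_need_save = {"req-1": [0, 1]}
request_finished(ABORTED, "req-1", reqs_need_save)
assert "req-1" not in reqs_need_save
```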
Hi @xuechendi, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Force-pushed d0d2752 to a7d268d
… host_buffer on CUDA (vllm-project#30419) Signed-off-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Purpose
This PR builds upon #30420.
That one should be merged first.
Two issues were detected and resolved in this PR.
Test Plan
Qwen3-0.6B
Before: accuracy was ~0.3
Now: accuracy is 0.4109
Root Cause and proposed change
Previous:
When the scheduler calls schedule, only brand-new requests get self.connector.update_state_after_alloc. However, if one prefill request is chunked into smaller requests, self.connector.update_state_after_alloc only registers part of the block_ids (those from the first chunk of the prefill) into the nixl_metadata.
Solution:
Add another call to self.connector.update_state_after_alloc in the running-queue processing, so when the following chunks get scheduled, they continue to update block_ids into the nixl metadata.
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.