[NIXL][BUG FIX] Fix both failing issue and accuracy issue with nixl + host_buffer on CUDA#30419
Conversation
Code Review
This pull request aims to fix a failing issue and an accuracy issue related to NIXL with a CPU host buffer. The change in nixl_connector.py to use .get() for remote_request_id is a good defensive measure against potential KeyError exceptions. However, I've identified a critical issue in the new logic added to scheduler.py. The new call to update_state_after_alloc only passes new_blocks, which for chunked prefills, results in an incomplete list of blocks being registered for transfer. This would lead to data corruption and accuracy problems, which is likely the very issue this PR is trying to solve. I have provided a suggestion to fix this by passing all of the request's blocks.
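The chunked-prefill failure mode described in this review can be illustrated with a minimal sketch. `ToyConnector`, the simplified `update_state_after_alloc` signature, and the block lists below are hypothetical stand-ins for illustration only, not the actual vLLM classes:

```python
# Toy model of the bug: each scheduling step passes only the newly
# allocated blocks, so later chunks overwrite earlier registrations.

class ToyConnector:
    def __init__(self):
        # blocks registered for KV transfer, per request id
        self.blocks_for_transfer: dict[str, list[int]] = {}

    def update_state_after_alloc(self, req_id: str, blocks: list[int]) -> None:
        # Overwrites whatever was registered before for this request.
        self.blocks_for_transfer[req_id] = list(blocks)


conn = ToyConnector()
all_blocks = [0, 1, 2, 3]

# Chunked prefill schedules the request twice; each call sees only the
# newly allocated blocks of that chunk.
conn.update_state_after_alloc("req-1", all_blocks[:2])  # first chunk
conn.update_state_after_alloc("req-1", all_blocks[2:])  # second chunk
assert conn.blocks_for_transfer["req-1"] == [2, 3]      # blocks 0 and 1 lost

# Passing the request's full block list keeps the transfer complete.
conn.update_state_after_alloc("req-1", all_blocks)
assert conn.blocks_for_transfer["req-1"] == [0, 1, 2, 3]
```

This is why the review suggests passing all of the request's blocks rather than only `new_blocks`.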
@KuntaiDu, could you take a look? I'm not sure whether LMCache also needs this.
Force-pushed 75538df to fea492b
@markmc @njhill I don't think this change is necessary; @xuechendi, you can look at the offloading connector for a reference. Now that I think of it, I think there's also another bug in the nixl connector, triggered when a chunked-prefill request gets preempted.
@orozery |
Force-pushed fea492b to fc1ff41
@NickLucche @markmc @orozery, I have now switched the fix to follow a similar approach to the one done in
Force-pushed fc1ff41 to 5943fdf
This pull request has merge conflicts that must be resolved before it can be merged.
if is_partial:
    continue
I was wondering if it's not better to save what we have, instead of waiting for the entire request to be available.
Then I see that self.copy_blocks is blocking so I guess it does not matter.
@NickLucche your thoughts?
I think there's interesting developments for copying chunks, but for now I would treat this PR as a bug fix and prioritize getting the feature in a working state.
We can leave this optimization for future work.
@orozery done. Changed back to copying immediately.
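As a rough illustration of the "copy immediately" behavior settled on above: instead of skipping partial prefills and copying once at the end, each chunk's blocks are copied to the host buffer as soon as they are produced. All names here (`copy_blocks`, `host_buffer`, the chunk lists) are simplified stand-ins, not the real vLLM code:

```python
# Stand-in for the blocking device-to-host copy discussed in this thread.
def copy_blocks(dst: dict, block_ids: list) -> None:
    for b in block_ids:
        dst[b] = f"kv-{b}"  # placeholder for the real KV-cache data


host_buffer: dict = {}

# Each scheduled chunk triggers a copy, whether the request is partial or not.
for chunk_blocks in ([0, 1], [2], [3, 4]):
    copy_blocks(host_buffer, chunk_blocks)

assert sorted(host_buffer) == [0, 1, 2, 3, 4]
```

Since the underlying copy is blocking anyway (as noted above), copying per chunk versus once at the end mainly changes when the cost is paid, not the total amount copied.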
# list of GPU block IDs per request
self._request_block_ids: dict[ReqId, list[int]] = {}
Instead of tracking the block IDs of all requests, let's just track the block IDs of requests that need saving.
You can discard this new variable and use self._reqs_need_save to track the block IDs.
agree we can re-use the existing container
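A minimal sketch of the consolidation suggested above — track block IDs only for requests that need saving, inside a single container. `ToyConnectorScheduler` and its method names are hypothetical; the real vLLM class and API differ:

```python
# One dict does both jobs: membership means "needs saving", and the value
# accumulates the GPU block IDs across chunked-prefill steps.

class ToyConnectorScheduler:
    def __init__(self):
        # req_id -> accumulated GPU block IDs, only for requests being saved
        self._reqs_need_save = {}

    def mark_for_save(self, req_id, block_ids):
        self._reqs_need_save.setdefault(req_id, []).extend(block_ids)

    def pop_save_meta(self, req_id):
        # Consumed once the save metadata is built (or the request aborts).
        return self._reqs_need_save.pop(req_id, None)


s = ToyConnectorScheduler()
s.mark_for_save("req-1", [0, 1])
s.mark_for_save("req-1", [2])            # a later chunk extends the same entry
assert s.pop_save_meta("req-1") == [0, 1, 2]
assert s.pop_save_meta("req-1") is None  # already consumed
```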
NickLucche left a comment:
Thanks for the work @xuechendi and the great review @orozery !
Left a few comments but things look good overall.
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Thanks! @NickLucche, I added the comment.
# Clear _reqs_need_save if a request is aborted as partial prefill.
self._reqs_need_save.pop(request.request_id, None)
I think you need to move this up a few lines, at least before if request.status != RequestStatus.FINISHED_LENGTH_CAPPED:
Oh, I see, thanks, updated
# Clear _reqs_need_save if a request is aborted as partial prefill.
self._reqs_need_save.pop(request.request_id, None)
I think this should work, but it seems more fragile to me.
I would go further and put this pop right at the beginning.
Then, you could also remove the entire is_partial check from build_connector_meta.
Since _reqs_need_save is originally used to buffer which requests should create req_meta for saving to the host, its life cycle runs from "scheduled" to "all request metadata are created".
If we go with your proposal, the life cycle becomes from "scheduled" to "request ends".
This changes the original design.
@NickLucche, do you think we should do that?
I assume the fix here is just to handle a corner case where the request was aborted?
I think @orozery is proposing to only clear the id on request finished, so either terminal block was processed or abort/error.
I think it makes sense but I don't necessarily think the current implementation is fragile.
Hence I don't have a strong opinion here, this could also be done in a separate PR, as long as we maximize clarity for these cases.
@NickLucche @orozery, let's do that in a separate PR. Since other queues (_reqs_to_recv, _reqs_need_send, etc.) are also cleared in build_connector_meta, I would prefer not to add refactoring scope to this PR (which is originally an accuracy fix). Once this one is merged, I can open a new PR and we can work out a better design there.
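A hedged sketch of the ordering question debated in this thread: popping before the status check guarantees an aborted partial prefill can never leave a stale entry behind. The function shape and status constants here are simplified stand-ins, not the actual vLLM request-finished API:

```python
# If the early return on request status ran before the pop, an aborted
# partial prefill would leak its entry in _reqs_need_save.

FINISHED_LENGTH_CAPPED = "length_capped"
ABORTED = "aborted"


def request_finished(status: str, req_id: str, reqs_need_save: dict) -> None:
    # Clearing first: every terminal path drops the bookkeeping entry.
    reqs_need_save.pop(req_id, None)
    if status != FINISHED_LENGTH_CAPPED:
        return  # aborted/errored: nothing more to do
    # ... build transfer metadata for normally finished requests ...


reqs_need_save = {"req-1": [0, 1]}
request_finished(ABORTED, "req-1", reqs_need_save)
assert "req-1" not in reqs_need_save
```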
Hi @xuechendi, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Force-pushed d0d2752 to a7d268d
… host_buffer on CUDA (vllm-project#30419) Signed-off-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Purpose
This PR builds upon #30420.
That one should be merged first.
Two issues were detected and resolved in this PR.
Test Plan
Qwen3-0.6B
Before: accuracy was ~0.3
Now: accuracy is 0.4109
Root Cause and proposed change
Previous:
When the scheduler calls schedule, only brand-new requests get self.connector.update_state_after_alloc. However, if one prefill request is chunked into smaller requests, self.connector.update_state_after_alloc only registers part of the block_ids (those from the first chunk of the prefill) into the nixl_metadata.
Solution:
Add another call to self.connector.update_state_after_alloc in the running-queue processing, so when the following chunks get scheduled, they continue to update block_ids into the nixl metadata.
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.