[BugFix] Wait for compute before offloading KV to CPU #31341
njhill merged 2 commits into vllm-project:main
Conversation
Code Review
This pull request correctly addresses a race condition between model computation and KV cache offloading by introducing a stream synchronization for GPU-to-CPU transfers. It also refactors the offloading logic to delay store operations to the next engine step, which should help reduce contention on the DMA engine. The removal of unused stream priorities is a good cleanup. The changes are well-implemented and the accompanying test modifications accurately reflect the new asynchronous behavior.
cc @NickLucche
njhill left a comment
Thanks @orozery, mostly looks good to me.
> Additionally, we move the offloading to start at the beginning of the next engine step. This is to avoid contention with sample_tokens which also involves gpu->cpu copies.
Actually with async scheduling (which is now default), the sampled token gpu->cpu copy would still be happening at this point.
It may be better to leave the transfer where it is, or, possibly better, to have the offload stream also wait on the sampled-tokens stream (GPUModelRunner.async_output_copy_stream).
vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py
Force-pushed from 4572662 to 252e720
I've actually been testing with both async_scheduling on and off. cudaMemcpyAsync calls, unlike CUDA kernels, are handled FIFO (in their order of submission). My initial try was to start offloading right at the end of the step, but I think the best solution for now is what's currently implemented in this PR, i.e. moving the offloading to before the start of the next engine step.
Do you mean regardless of cuda streams? In which case the situation we will have here is that both the offloading stream and sampled token stream will be waiting for the execution stream to reach the same point. At which point they will commence their async copy operations. It's not clear to me though which one will "win" and start first - not obvious that it would be the one which started waiting first...
This is only applicable to PP, which is not compatible with async scheduling yet anyhow.
Exactly. Both wait on the same stream, yet offloading wins as it was queued first.
But it's not obvious that it's a queue: both of the streams are "woken up" at the same time, at which point they both proceed to call cudaMemcpyAsync. IIUC this is the point at which they are effectively queued, so it's still not clear to me that we'll be sure that the sampled-token stream will always win. I'm now not suggesting moving the hook, just that we should ideally have the offload stream also wait on the sampled-tokens stream.
Nice idea (waiting on the sample tokens stream).
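The FIFO behaviour being debated above can be modelled in a few lines of plain Python. This is a toy sketch, not vllm code: the ToyDmaEngine class and the tags are hypothetical stand-ins for the GPU's copy engine and the two pending GPU→CPU copies. The point it illustrates: if copy requests complete strictly in submission order, whichever stream submits its cudaMemcpyAsync first wins, regardless of which stream started waiting first.

```python
from collections import deque


class ToyDmaEngine:
    """Toy copy engine: completes requests strictly in submission
    (FIFO) order, mimicking how cudaMemcpyAsync calls are serviced."""

    def __init__(self):
        self._queue = deque()
        self.completed = []

    def memcpy_async(self, tag):
        # Submission order is the only thing that matters here.
        self._queue.append(tag)

    def drain(self):
        while self._queue:
            self.completed.append(self._queue.popleft())


engine = ToyDmaEngine()

# Both "streams" are released by the same compute-done event; the copy
# that is *submitted* first is the one that completes first.
compute_done = True
if compute_done:
    engine.memcpy_async("offload_kv")      # queued first
    engine.memcpy_async("sampled_tokens")  # queued second

engine.drain()
print(engine.completed)  # ['offload_kv', 'sampled_tokens']
```

Which copy gets submitted first after both streams stop waiting is exactly the part that is not guaranteed on real hardware, which is why explicitly waiting on the sampled-tokens stream is the safer option.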
Having the hook after sample tokens would not be any different to doing it at the start of the next step as you have here, right? Either way you would still need to wait on the output copy stream.
I take that back. |
We could add the copy stream as a field in …
I'm not so familiar with that. Also, the stream is constant throughout the process lifetime, so it feels more right to pass it once at connector bootstrap, like in …
Good point, maybe we can extend that method. I guess we could consider that as a follow-on.
@@ -92,7 +92,7 @@ def save_kv_layer(
     def wait_for_save(self):
         assert self.connector_worker is not None
         assert isinstance(self._connector_metadata, OffloadingConnectorMetadata)
-        self.connector_worker.start_store_kv(self._connector_metadata)
+        self.connector_worker.prepare_store_kv(self._connector_metadata)
Add a short comment here to explain why we are deferring?
This class is just a redirection class.
I've added a comment instead in OffloadingConnectorWorker.prepare_store_kv where I think it makes more sense.
Let me know what you think.
I guess I was thinking it would make more sense here, since the name of the method being called is now prepare_store_kv rather than "transfer" or "start", and the comment is about the hook itself, so it's clearer that this is in the load rather than the save hook.
But it's not that big a deal :)
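The deferral being discussed can be sketched as follows. This is a minimal stand-in with simplified signatures, not vllm's actual OffloadingConnectorWorker (which operates on GPU tensors and CUDA streams): wait_for_save only records jobs via prepare_store_kv, and the queued stores are submitted at the start of the next engine step, before new loads.

```python
class OffloadWorkerSketch:
    """Illustrative deferral pattern: stores queued now, submitted
    at the start of the next engine step."""

    def __init__(self):
        self._unsubmitted_store_jobs = []
        self.submitted = []

    def prepare_store_kv(self, metadata):
        # Called from wait_for_save at the end of step N: do NOT start
        # the GPU->CPU copy yet; just remember what to store, so the
        # copy does not contend with the sampled-token GPU->CPU copy
        # happening around the same time.
        self._unsubmitted_store_jobs.append(metadata)

    def start_kv_transfers(self, load_metadata=None):
        # Called at the start of step N+1: submit the deferred stores
        # first, then kick off any loads for the new step.
        while self._unsubmitted_store_jobs:
            self.submitted.append(("store", self._unsubmitted_store_jobs.pop(0)))
        if load_metadata is not None:
            self.submitted.append(("load", load_metadata))


worker = OffloadWorkerSketch()
worker.prepare_store_kv("blocks_step_0")  # step N: queue only
worker.start_kv_transfers("load_step_1")  # step N+1: store, then load
print(worker.submitted)  # [('store', 'blocks_step_0'), ('load', 'load_step_1')]
```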
Force-pushed from 252e720 to 369cb4b
Hi @orozery, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
This commit fixes the OffloadingConnector to wait on the default (compute) stream before offloading KV data from the GPU to the CPU. Additionally, we move the offloading to start at the beginning of the next engine step, to avoid contention with sample_tokens, which also involves GPU→CPU copies. Lastly, we remove the use of stream priorities as they don't affect DMA-based copies. Signed-off-by: Or Ozeri <oro@il.ibm.com>
Force-pushed from 369cb4b to f6640eb
…1341) Signed-off-by: Or Ozeri <oro@il.ibm.com>
This PR fixes the OffloadingConnector to wait on the default (compute) stream before offloading KV data from the GPU to the CPU.
Additionally, we move the offloading to start at the beginning of the next engine step. This is to avoid contention with sample_tokens, which also involves GPU→CPU copies.
Lastly, we remove the use of stream priorities as they don't affect DMA-based copies.
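The race that the first fix addresses can be shown with a toy happens-before model in plain Python. All names here are illustrative; in the real code the ordering is enforced on the GPU with the offload stream's wait_stream on the compute stream rather than with sequential function calls.

```python
# Toy model of the race: "compute" writes a KV block, "offload" copies
# it GPU->CPU. Without ordering, the copy can observe a stale value.
kv_block = {"value": None}
timeline = []


def compute_kernel():
    # Compute stream: attention kernel finishes writing the KV block.
    kv_block["value"] = "attn_output"
    timeline.append("compute_done")


def offload_copy():
    # Offload stream: GPU->CPU copy reads whatever is in the block now.
    timeline.append(("copied", kv_block["value"]))


# Buggy ordering: the copy is issued without waiting for compute and
# may run first, reading the block before it has been written.
offload_copy()
compute_kernel()
print(timeline[0])  # ('copied', None) -- stale read

# Fixed ordering: waiting on the compute stream establishes a
# happens-before edge, so the copy observes the final value.
timeline.clear()
kv_block["value"] = None
compute_kernel()  # guaranteed to complete before the copy starts
offload_copy()
print(timeline[-1])  # ('copied', 'attn_output')
```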
Note
Implements safer, staged KV offloading and stream handling.
- Introduce start_kv_transfers and prepare_store_kv; queue store jobs in _unsubmitted_store_jobs and submit them at the start of the next step, before starting loads
- Rename start_load_kv → start_kv_transfers and start_store_kv → prepare_store_kv
- Use the default torch.cuda.Stream() and, for GPU→CPU copies, wait_stream on the current compute stream to avoid contention
- Track unsubmitted stores (unsubmitted_stores_count) and adjust expected stored/loaded block assertions accordingly
Written by Cursor Bugbot for commit 4f5923a.
start_kv_transfersandprepare_store_kv; queue store jobs in_unsubmitted_store_jobsand submit them at the start of the next step before starting loadsstart_load_kv→start_kv_transfersandstart_store_kv→prepare_store_kvtorch.cuda.Stream()and, for GPU→CPU,wait_streamon the current compute stream to avoid contentionunsubmitted_stores_count) and adjust expected stored/loaded block assertions accordinglyWritten by Cursor Bugbot for commit 4f5923a. This will update automatically on new commits. Configure here.