CPU KV Offloading: Use more CUDA streams #29013
Conversation
Code Review
This pull request refactors the CPU KV offloading to use a unique CUDA stream per transfer, which is a good approach to prevent stream queue saturation and potential blocking during concurrent loads. The implementation correctly uses pooling for streams and events to improve efficiency. However, I've identified a critical resource leak issue where streams and events are not returned to the pool if an error occurs during the transfer. I've provided a code suggestion to fix this. Additionally, it appears the test file tests/v1/kv_offload/test_cpu_gpu.py has not been updated to reflect the changes in this PR (e.g., renaming transfer_events to transfers), which will cause tests to fail. Please ensure the tests are updated accordingly.
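A common way to address the resource leak the review describes is to return pooled resources in a `finally` block. Below is a minimal sketch of that pattern; `TransferManager`, `run_transfer`, and `failing_transfer` are illustrative stand-ins (plain `object()` instances replace real CUDA streams and events), not the actual vLLM code:

```python
class TransferManager:
    """Sketch of pool-safe resource handling with try/finally.
    The pool names mirror the PR, but plain object() instances
    stand in for torch.cuda.Stream and torch.Event."""

    def __init__(self):
        self._stream_pool = []
        self._event_pool = []

    def _acquire(self, pool, factory):
        # Reuse a pooled object if one is available, else create one.
        return pool.pop() if pool else factory()

    def run_transfer(self, do_transfer):
        stream = self._acquire(self._stream_pool, object)
        event = self._acquire(self._event_pool, object)
        try:
            return do_transfer(stream, event)
        finally:
            # Runs even if do_transfer raised, so the pooled
            # resources are never leaked on error.
            self._stream_pool.append(stream)
            self._event_pool.append(event)

m = TransferManager()

def failing_transfer(stream, event):
    raise RuntimeError("copy failed")  # hypothetical transfer error

try:
    m.run_transfer(failing_transfer)
except RuntimeError:
    pass  # the transfer failed, but the pools were still refilled

pool_sizes = (len(m._stream_pool), len(m._event_pool))
```

The key point is that the `finally` clause returns the stream and event to their pools on both the success and the error path.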
```python
stream = (
    self._stream_pool.pop()
    if self._stream_pool
    else torch.cuda.Stream(priority=self.priority)
)
```
Should we keep a bound on the max number of streams (start to share after this)? Or is the number of concurrent transfers already bounded somehow?
I don't think there's a reason to limit.
On the GPU side, there seems to be no limit on the number of streams.
Also, note that the streams do not run in parallel, as each transfer stream waits for the previous transfer to finish.
The only purpose of creating multiple streams is to overcome the 1024-ops-per-stream limit.
I tested this PR with 100,000 concurrent small requests loading from CPU and it handled them well.
It does not actually create 100,000 streams, since the GPU KV cache gets full and only around 1,500 requests are allocated at any point in time; so in practice my test used about 1,500 streams with no issues.
I am planning to limit the number of requests being asynchronously loaded, so that they count towards max_num_seqs (which equals 1024 for H100).
See this draft PR #29877.
So once we get that in, the number of streams will be bounded by, roughly, max_num_seqs.
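For reference, if a cap were ever desired, a bounded pool that starts sharing streams beyond the cap could look like the sketch below. This is a hypothetical design illustrating the reviewer's suggestion, not code from this PR; `object()` stands in for `torch.cuda.Stream`:

```python
import itertools

class BoundedStreamPool:
    """Hypothetical bounded pool: creates up to max_streams distinct
    streams, then hands out existing ones round-robin (sharing)."""

    def __init__(self, max_streams, factory):
        self._max = max_streams
        self._factory = factory
        self._streams = []
        self._next = itertools.count()

    def get(self):
        if len(self._streams) < self._max:
            # Below the cap: create a fresh stream.
            self._streams.append(self._factory())
            return self._streams[-1]
        # Cap reached: reuse an existing stream round-robin.
        return self._streams[next(self._next) % self._max]

pool = BoundedStreamPool(max_streams=2, factory=object)
streams = [pool.get() for _ in range(5)]  # only 2 distinct objects created
```

As discussed above, the PR intentionally leaves the pool unbounded, relying on the upcoming max_num_seqs limit to bound it indirectly.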
Also pinging @ApostaC again in case he wants to check.
Prior to this commit, the CPU offloading connector used 2 global CUDA streams, one for loads and one for stores. When there are a lot of concurrent loads, the stream queue (which has a capacity of about 1000 on A100) can get full. When this happens, submitting more operations to the stream becomes blocking, which is harmful to model execution. To overcome this, this commit changes the CPU KV offloading to use a unique CUDA stream per transfer.

Additionally, we add code that detects the kernel block size and fails if it differs from the user-set block size. Support for a general kernel block size will be added in a follow-up PR.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
```python
stream = (
    self._stream_pool.pop()
    if self._stream_pool
    else torch.cuda.Stream(priority=self.priority)
)
event = self._event_pool.pop() if self._event_pool else torch.Event()
```
Hi, there seems to be a problem.
In the swap-out case, this workflow does not guarantee that the swap operation occurs after the forward computation, potentially resulting in the incorrect KV cache being swapped out.
So maybe we could insert lines of code like these?

```python
current_forward_stream = torch.cuda.current_stream()
if (src_spec.medium() == "gpu") and (dst_spec.medium() == "cpu"):
    stream.wait_stream(current_forward_stream)
```
Even before this PR, the offloading connector used a separate CUDA stream for executing the transfers.
This stream was not synced with the model execution stream.
This is also the case for other connectors.
Looking at the connector API itself, specifically save_kv_layer and wait_for_save: my assumption was that when these functions are called, we can be sure the KV cache is safe to read (without syncing on the default stream).
@NickLucche @njhill can you confirm?
Digging more into the code, it looks like you're right @lixiaobai09.
This is indeed a bug, though not introduced by this PR; it has existed since the beginning of the OffloadingConnector.
I just opened #31341 to fix it.
@lixiaobai09 thank you so much for pointing this out!
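The ordering requirement behind this bug can be illustrated with a toy model of stream ordering. This is only a sketch mirroring the `wait_stream` semantics discussed above, using plain Python stand-ins rather than real CUDA streams; the names are hypothetical:

```python
log = []  # records ops and barriers in submission order

class ToyStream:
    """Toy stand-in for a CUDA stream: launch() appends an op to a
    shared log, and wait_stream() records a cross-stream barrier."""

    def __init__(self, name):
        self.name = name

    def launch(self, op):
        log.append((self.name, op))

    def wait_stream(self, other):
        # Ops launched on self after this point may not begin until
        # everything already launched on `other` has completed.
        log.append((self.name, f"barrier-on-{other.name}"))

forward = ToyStream("forward")
transfer = ToyStream("transfer")

forward.launch("write_kv_cache")   # model forward writes the KV cache
transfer.wait_stream(forward)      # swap-out must wait for that write
transfer.launch("copy_kv_to_cpu")  # only now is the KV cache safe to read
```

Without the barrier in the middle, the copy could race ahead of the forward pass and read stale KV cache contents, which is the bug described above.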
Prior to this PR, the CPU offloading connector used 2 global CUDA streams, one for loads and one for stores.
When there are a lot of concurrent loads, the stream queue (which has a capacity of about 1000 on A100) can get full.
When this happens, submitting more operations to the stream becomes blocking, which is harmful to model execution.
To overcome this, this PR changes the CPU KV offloading to use a unique CUDA stream per transfer.
I have verified this is a real issue by timing calls to cudaMemcpyAsync: the submission time increases from about 2µs to 180µs when the stream queue gets full.
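The saturation effect can be sketched with a toy bounded op queue. This is a rough analogy only: `queue.Queue` stands in for a CUDA stream's fixed-capacity op queue (the real mechanism lives inside the driver), and the class and method names are hypothetical:

```python
import queue

class ToyOpQueue:
    """Toy model of a CUDA stream: a bounded FIFO of pending ops.
    Once the queue is full, a submission can no longer complete
    immediately, analogous to cudaMemcpyAsync slowing down once
    the stream's op queue saturates."""

    def __init__(self, capacity):
        self._ops = queue.Queue(maxsize=capacity)

    def try_submit(self, op):
        # Non-blocking submit: returns False when the queue is full.
        try:
            self._ops.put_nowait(op)
            return True
        except queue.Full:
            return False

    def drain_one(self):
        # The GPU "executes" one op, freeing a queue slot.
        self._ops.get_nowait()()

stream = ToyOpQueue(capacity=2)
ok1 = stream.try_submit(lambda: None)
ok2 = stream.try_submit(lambda: None)
saturated = not stream.try_submit(lambda: None)  # queue is now full
stream.drain_one()
freed = stream.try_submit(lambda: None)          # a slot opened up again
```

With one stream per transfer, as in this PR, each queue stays short, so submissions return immediately instead of stalling behind a saturated shared queue.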