cuda: reset cuda context after reading memory size#23935
Conversation
How often will the router process query the available memory? If it's only once at the beginning, I'd suggest to do a pattern like
Intuitively, I'd have thought a backend has to be initialized before we can ask it about its available memory. Consequentially, we would move the release of the cuda context into |
|
The behaviour of initialisation just to read memory state seems to be unique to CUDA, so I would prefer to handle it inside of the CUDA backend, not outside. |
|
@ORippler as of right now fetching memory is part of the ggml backend device API (== CUDA device), not the ggml backend API (== CUDA stream). So the lifetime of the CUDA device context cannot be simply tied to the lifetime of a ggml backend unless we move that function. And I would not be in favor of this since the memory in my opinion belongs to the device. |
Still unintuitive to me: what good is a device if I don't have the constructs/context in place to dispatch work to it. But maybe I'm too biased by CUDA on this one 🤷♂️ |
a182b35 to
61e659a
Compare
|
I forgot that hip and musa were also initially included here, I don't think that is required, so I'll remove it. Edit: On second thought, that would require wrapping all counter calls in preprocessor checks. Not sure whether that would be better here. |
|
I excluded hip and musa, sorry about the noise. @JohannesGaessler Let me know if this looks better. |
| return nullptr; | ||
| } | ||
|
|
||
| ggml_cuda_set_device(0); |
There was a problem hiding this comment.
| ggml_cuda_set_device(0); | |
| ggml_cuda_set_device(0); // cudaMallocHost can create the implicit CUDA device context, make sure that this is consistently done on device 0. |
| if (ctx->active_count.load(std::memory_order_relaxed) == 0) { | ||
| cudaDeviceReset(); | ||
| } |
There was a problem hiding this comment.
I don't think an atomic integer is strictly speaking enough here. One thread could theoretically fetch the integer with value 0, then another thread could create a ggml CUDA backend, then the first thread could call cudaDeviceReset. Please use a mutex and simply apply it to the functions that create or destroy ggml buffers or ggml backends or that fetch memory; they are not performance-critical.
Also please apply CUDA_CHECK to the return value of cudaDeviceReset.
Overview
Alternative to #23604, to allow reading CUDA memory in the router process in #21231 without allocating permanent memory through an initialized CUDA context. Instead of using NVML, this checks before running
cudaMemGetInfowhether the context is already initialized. If not, it releases the context after the call.I tried ref-counting as well as suggested in #23604 (comment), but that is harder to get right and introduces more edge cases.
Requirements