Skip to content

cuda: reset cuda context after reading memory size#23935

Open
0cc4m wants to merge 3 commits into
masterfrom
0cc4m/cuda-get-memory-device-reset
Open

cuda: reset cuda context after reading memory size#23935
0cc4m wants to merge 3 commits into
masterfrom
0cc4m/cuda-get-memory-device-reset

Conversation

@0cc4m
Copy link
Copy Markdown
Contributor

@0cc4m 0cc4m commented May 31, 2026

Overview

Alternative to #23604, to allow reading CUDA memory in the router process in #21231 without allocating permanent memory through an initialized CUDA context. Instead of using NVML, this checks before running cudaMemGetInfo whether the context is already initialized. If not, it releases the context after the call.

I tried ref-counting as well as suggested in #23604 (comment), but that is harder to get right and introduces more edge cases.

Requirements

@0cc4m 0cc4m requested a review from a team as a code owner May 31, 2026 06:37
@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 31, 2026
@ORippler
Copy link
Copy Markdown
Collaborator

ORippler commented Jun 2, 2026

Alternative to #23604, to allow reading CUDA memory in the router process in #21231 without allocating permanent memory through an initialized CUDA context. Instead of using NVML, this checks before running cudaMemGetInfo whether the context is already initialized. If not, it releases the context after the call.

How often will the router process query the available memory? If it's only once at the beginning, I'd suggest to do a pattern like

ggml_backend_init -> ggml_backend_device_i.get_memory -> ggml_backend_i.free.

Intuitively, I'd have thought a backend has to be initialized before we can ask it about its available memory. Consequentially, we would move the release of the cuda context into ggml_backend_cuda_free

@0cc4m
Copy link
Copy Markdown
Contributor Author

0cc4m commented Jun 2, 2026

The behaviour of initialisation just to read memory state seems to be unique to CUDA, so I would prefer to handle it inside of the CUDA backend, not outside.

Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated
@JohannesGaessler
Copy link
Copy Markdown
Contributor

@ORippler as of right now fetching memory is part of the ggml backend device API (== CUDA device), not the ggml backend API (== CUDA stream). So the lifetime of the CUDA device context cannot be simply tied to the lifetime of a ggml backend unless we move that function. And I would not be in favor of this since the memory in my opinion belongs to the device.

@ORippler
Copy link
Copy Markdown
Collaborator

ORippler commented Jun 3, 2026

And I would not be in favor of this since the memory in my opinion belongs to the device.

Still unintuitive to me: what good is a device if I don't have the constructs/context in place to dispatch work to it. But maybe I'm too biased by CUDA on this one 🤷‍♂️

@0cc4m 0cc4m force-pushed the 0cc4m/cuda-get-memory-device-reset branch from a182b35 to 61e659a Compare June 4, 2026 11:44
@0cc4m 0cc4m requested a review from IMbackK as a code owner June 4, 2026 11:44
@0cc4m
Copy link
Copy Markdown
Contributor Author

0cc4m commented Jun 4, 2026

I forgot that hip and musa were also initially included here, I don't think that is required, so I'll remove it.

Edit: On second thought, that would require wrapping all counter calls in preprocessor checks. Not sure whether that would be better here.

@0cc4m 0cc4m removed the request for review from IMbackK June 4, 2026 11:54
@0cc4m
Copy link
Copy Markdown
Contributor Author

0cc4m commented Jun 4, 2026

I excluded hip and musa, sorry about the noise. @JohannesGaessler Let me know if this looks better.

return nullptr;
}

ggml_cuda_set_device(0);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ggml_cuda_set_device(0);
ggml_cuda_set_device(0); // cudaMallocHost can create the implicit CUDA device context, make sure that this is consistently done on device 0.

Comment on lines +5036 to +5038
if (ctx->active_count.load(std::memory_order_relaxed) == 0) {
cudaDeviceReset();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think an atomic integer is strictly speaking enough here. One thread could theoretically fetch the integer with value 0, then another thread could create a ggml CUDA backend, then the first thread could call cudaDeviceReset. Please use a mutex and simply apply it to the functions that create or destroy ggml buffers or ggml backends or that fetch memory; they are not performance-critical.

Also please apply CUDA_CHECK to the return value of cudaDeviceReset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants