cuda: reset cuda context after reading memory size by 0cc4m · Pull Request #23935 · ggml-org/llama.cpp

0cc4m · 2026-05-31T06:37:18Z

Overview

Alternative to #23604, to allow reading CUDA memory in the router process in #21231 without allocating permanent memory through an initialized CUDA context. Instead of using NVML, this checks before running cudaMemGetInfo whether the context is already initialized. If not, it releases the context after the call.

I also tried ref-counting, as suggested in #23604 (comment), but that is harder to get right and introduces more edge cases.
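To illustrate the check-then-release idea described in the overview, here is a minimal sketch. It is not the PR's code: `fake_cuda_device` and `get_memory_without_side_effects` are hypothetical names, and the struct only simulates the behaviour of `cudaMemGetInfo` (implicitly creating a context) and `cudaDeviceReset` (destroying it) so the example runs without a GPU. In real code the "is a context already active" check might be done with the driver API's `cuDevicePrimaryCtxGetState`, but that is an assumption, not something stated in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for one CUDA device (hypothetical, for illustration only).
struct fake_cuda_device {
    bool        context_active = false;
    std::size_t total_bytes    = 8ull * 1024 * 1024 * 1024;
    std::size_t context_bytes  = 300ull * 1024 * 1024; // made-up context overhead

    // Simulates cudaMemGetInfo: the first call implicitly creates the context.
    void mem_get_info(std::size_t * free_bytes, std::size_t * total) {
        context_active = true;
        *free_bytes = total_bytes - context_bytes;
        *total      = total_bytes;
    }

    // Simulates cudaDeviceReset: destroys the context and its allocations.
    void device_reset() { context_active = false; }
};

// Query memory, then release the context only if this call created it.
void get_memory_without_side_effects(fake_cuda_device & dev,
                                     std::size_t * free_bytes,
                                     std::size_t * total) {
    assert(free_bytes != nullptr && total != nullptr);

    const bool was_active = dev.context_active; // check BEFORE the query
    dev.mem_get_info(free_bytes, total);
    if (!was_active) {
        dev.device_reset(); // we created the context, so we tear it down again
    }
}
```

A process that never creates a backend (such as the router) ends up without a live context after the query, while a process that already has one keeps it.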

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES

ORippler · 2026-06-02T09:30:09Z

> Alternative to #23604, to allow reading CUDA memory in the router process in #21231 without allocating permanent memory through an initialized CUDA context. Instead of using NVML, this checks before running cudaMemGetInfo whether the context is already initialized. If not, it releases the context after the call.

How often will the router process query the available memory? If it's only once at the beginning, I'd suggest a pattern like

ggml_backend_init -> ggml_backend_device_i.get_memory -> ggml_backend_i.free.

Intuitively, I'd have thought a backend has to be initialized before we can ask it about its available memory. Consequently, we would move the release of the CUDA context into ggml_backend_cuda_free.

0cc4m · 2026-06-02T10:04:27Z

The behaviour of initialisation just to read memory state seems to be unique to CUDA, so I would prefer to handle it inside of the CUDA backend, not outside.

JohannesGaessler · 2026-06-02T18:24:49Z

@ORippler as of right now fetching memory is part of the ggml backend device API (== CUDA device), not the ggml backend API (== CUDA stream). So the lifetime of the CUDA device context cannot be simply tied to the lifetime of a ggml backend unless we move that function. And I would not be in favor of this since the memory in my opinion belongs to the device.

ORippler · 2026-06-03T19:37:12Z

> And I would not be in favor of this since the memory in my opinion belongs to the device.

Still unintuitive to me: what good is a device if I don't have the constructs/context in place to dispatch work to it? But maybe I'm too biased by CUDA on this one 🤷‍♂️

0cc4m · 2026-06-04T11:46:27Z

I forgot that hip and musa were also initially included here. I don't think that is required, so I'll remove them.

Edit: On second thought, that would require wrapping all counter calls in preprocessor checks. Not sure whether that would be better here.

0cc4m · 2026-06-04T11:54:55Z

I excluded hip and musa, sorry about the noise. @JohannesGaessler Let me know if this looks better.

JohannesGaessler · 2026-06-04T18:33:40Z

        return nullptr;
    }

+    ggml_cuda_set_device(0);


Suggested change

-    ggml_cuda_set_device(0);
+    ggml_cuda_set_device(0); // cudaMallocHost can create the implicit CUDA device context, make sure that this is consistently done on device 0.

JohannesGaessler · 2026-06-04T18:43:38Z

+    if (ctx->active_count.load(std::memory_order_relaxed) == 0) {
+        cudaDeviceReset();
+    }


I don't think an atomic integer is strictly speaking enough here. One thread could theoretically fetch the integer with value 0, then another thread could create a ggml CUDA backend, then the first thread could call cudaDeviceReset. Please use a mutex and simply apply it to the functions that create or destroy ggml buffers or ggml backends or that fetch memory; they are not performance-critical.

Also please apply CUDA_CHECK to the return value of cudaDeviceReset.
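As a rough sketch of what the mutex-based approach requested above could look like: the names below (`device_state`, `backend_create`, `backend_destroy`, `query_memory`) are hypothetical and not taken from the PR, and the CUDA calls are replaced by plain fields so the example runs without a GPU. The point is only that the count check and the reset happen inside the same critical section that backend creation uses, which closes the window described in the comment.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical per-device bookkeeping (illustrative names, not the PR's).
struct device_state {
    std::mutex mutex;
    int  active_count   = 0;     // live backends and buffers on this device
    bool context_active = false; // stand-in for "a CUDA context exists"
    int  reset_calls    = 0;     // counts simulated cudaDeviceReset calls
};

// Creating a backend takes the lock, so it cannot interleave with a reset.
void backend_create(device_state & st) {
    std::lock_guard<std::mutex> lock(st.mutex);
    st.context_active = true;
    st.active_count++;
}

void backend_destroy(device_state & st) {
    std::lock_guard<std::mutex> lock(st.mutex);
    assert(st.active_count > 0);
    st.active_count--;
}

// Check and reset under one lock: no other thread can create a backend
// between reading active_count and resetting the device.
void query_memory(device_state & st) {
    std::lock_guard<std::mutex> lock(st.mutex);
    st.context_active = true; // the real cudaMemGetInfo would create the context
    if (st.active_count == 0) {
        // Real code would call CUDA_CHECK(cudaDeviceReset()) here.
        st.context_active = false;
        st.reset_calls++;
    }
}
```

With a relaxed atomic load alone, the check and the reset are two separate steps. Holding the mutex across both is what makes them one indivisible operation relative to `backend_create`, and since none of these paths are performance-critical the extra locking cost should not matter.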

0cc4m requested a review from a team as a code owner May 31, 2026 06:37

github-actions Bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels May 31, 2026

JohannesGaessler requested changes Jun 2, 2026

Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated

0cc4m added 2 commits June 4, 2026 13:44

cuda: reset device in get_memory function if no backend is active

c1c3b9b

also count device and host buffers

61e659a

0cc4m force-pushed the 0cc4m/cuda-get-memory-device-reset branch from a182b35 to 61e659a Compare June 4, 2026 11:44

0cc4m requested a review from IMbackK as a code owner June 4, 2026 11:44

exclude hip and musa from counting and device reset

94b6291

0cc4m removed the request for review from IMbackK June 4, 2026 11:54

JohannesGaessler reviewed Jun 4, 2026

0cc4m wants to merge 3 commits into master from 0cc4m/cuda-get-memory-device-reset
