Conversation

---

Could we have the `src` shape remain the same, or is that not possible? As I understand it, this would disable CUDA graphs entirely when using backend sampling. At the moment, CUDA graphs are only used in the batch_size = 1 case.
Since LLMs have constant vocab sizes, this check should not trigger, as we already have constant input/output shapes for top-k. For batched inference in the server setting, AFAIK we execute a single cgraph with N-batched logit generation followed by N times batch-size-1 backend sampling. In that case, CUDA graphs are currently disabled due to the batched logit generation. Moreover, even in this case the input/output shapes for top-k are constant across cgraph invocations.

---

Yes, this patch is not relevant for normal

---

If it's not useful in llama.cpp, perhaps `test-backend-ops` can be modified? This adds 1280 bytes per graph node, and the check will run on every TG cycle.
ggml is intended to serve other consumers apart from llama.cpp as well, so I would say we should fix it. To reduce the time spent checking, we can forward the capability of re-using a

---

I agree it should be fixed, but I'm not sure this is the correct solution. I'm also not sure how to estimate the impact of the memory increase; AFAIK there are some linear-attention models which run tens of thousands of nodes, and for them this would mean reading tens of MBs of data for every token. Perhaps it won't be a big deal, I can measure and report back.
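As a back-of-envelope check of the concern above, the per-token read overhead is just the node count times the per-node record size. A minimal sketch, using only the figures quoted in the discussion (the helper name is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

// Estimate how many bytes of shape-record data the check would touch per
// token generation cycle, given the 1280-bytes-per-node figure mentioned
// above. The function name is hypothetical; this is just arithmetic.
static int64_t check_overhead_bytes(int64_t n_nodes, int64_t bytes_per_node) {
    return n_nodes * bytes_per_node;
}
```

For a graph with tens of thousands of nodes, this indeed lands in the tens-of-MB range per token, which is why measuring seems warranted.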

---

Yes, this patch is overkill. We really only need to check the shapes of the leaf nodes (i.e. source tensors that are not the result of any node in the graph). The shapes of the rest of the sources are already checked by the existing logic, as noted in #17004 (comment). A better fix can be implemented that would not add so much overhead.

---

@ggerganov I can make that change, unless you are already going to make it.

---

@am17an Thanks, please go ahead |
rel #17004 (comment)
Add checks for the `src` shapes to determine if a graph node has matching properties (in the context of CUDA graphs).

Not sure if this is the ideal solution. The failure case that is observed is a graph with a single `ggml_top_k()` node. For constant `k`, the output tensor of this node has the same shape, but the `src` tensor shape is different. With the logic on `master` we incorrectly determine that the node properties match, and the CUDA graph is being reused.
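The failure mode can be reduced to a toy example. The struct and function names below are hypothetical, a minimal sketch rather than ggml's actual check: for a top-k node with constant `k`, the output shape is always `(k)`, so a comparison that only looks at output shapes cannot detect that the `src` shape changed between cgraph invocations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical, reduced node properties for a 1-D top-k operation.
typedef struct {
    int64_t src_ne0; // number of input logits
    int64_t out_ne0; // k (constant across invocations)
} topk_props;

// master-style check: compares output shapes only, so it cannot see a
// changed src shape when k is constant (false positive for top-k).
static bool props_match_output_only(topk_props a, topk_props b) {
    return a.out_ne0 == b.out_ne0;
}

// fixed check: also compares the src shape, catching the mismatch.
static bool props_match_with_src(topk_props a, topk_props b) {
    return a.out_ne0 == b.out_ne0 && a.src_ne0 == b.src_ne0;
}
```

With constant `k = 10` but src sizes 1000 vs 2000, the output-only check reports a match while the src-aware check correctly rejects reuse, which is exactly the incorrect-reuse scenario described above.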