cuda : check src shapes for CUDA graphs #18561

Closed
ggerganov wants to merge 1 commit into master from gg/cuda-fix-graph-match

Conversation

@ggerganov
Member

rel #17004 (comment)

Add checks for the src shapes to determine if a graph node has matching properties (in the context of CUDA graphs)

Not sure if this is the ideal solution. The observed failure case is a graph with a single ggml_top_k() node: for constant k, the output tensor of this node has the same shape, but the src tensor shape differs. With the logic on master we incorrectly determine that the node properties match, and the CUDA graph is reused.
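A minimal Python sketch of the failure mode (hypothetical simplified model, not the actual ggml API; the shapes and k value are made up for illustration):

```python
# Hypothetical model of ggml_top_k's output shape: with ne[0] the
# innermost dimension, top-k replaces ne[0] with k and keeps the rest.
def top_k_out_shape(src_shape, k):
    ne = list(src_shape)
    ne[0] = k
    return tuple(ne)

k = 5  # constant k, as in the failing test case
out_a = top_k_out_shape((100, 1, 1, 1), k)  # first graph's src shape
out_b = top_k_out_shape((200, 1, 1, 1), k)  # second graph's src shape

# Different src shapes produce identical output shapes, so a check that
# compares only node (output) properties wrongly treats the graphs as
# matching and reuses the CUDA graph.
assert out_a == out_b == (5, 1, 1, 1)
```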

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jan 3, 2026
@am17an
Contributor

am17an commented Jan 3, 2026

Could we have the src shape remain the same, or is that not possible? As I understand it, this would disable CUDA graphs entirely when using backend sampling. CUDA graphs are currently only used in the batch_size = 1 case.

@ORippler
Collaborator

ORippler commented Jan 3, 2026

> Could we have the src shape remain the same or is that not possible? As I understand this would disable CUDA graphs entirely when using backend sampling.

Since LLMs have constant vocab sizes, this check should not trigger: we already have constant input/output shapes for top-k.

For batched inference in the server setting, AFAIK we execute a single cgraph with N-batched logit generation followed by N batch-size-1 backend-sampling runs. In that case, CUDA graphs are currently disabled due to the batched logit generation. Moreover, even in this case the input/output shapes for top-k are constant across cgraph invocations.

@ggerganov
Member Author

Yes, this patch is not relevant for normal llama.cpp inference. It fixes an issue with test-backend-ops, where we run many different graphs that each have a single top-k node with a different src shape. This is just one potential failure of the current CUDA graph logic, and although it does not directly affect llama.cpp (apart from the tests), it still needs to be fixed.

@am17an
Contributor

am17an commented Jan 3, 2026

If it's not useful in llama.cpp, perhaps test-backend-ops can be modified instead? This adds 1280 bytes per graph node, and the check will run on every token-generation (TG) cycle.

@ORippler
Collaborator

ORippler commented Jan 3, 2026

> If it's not useful in llama.cpp perhaps test-backend-ops can be modified? This adds 1280 bytes per graph node, and this check will run on every TG cycle

ggml is intended to serve consumers other than llama.cpp, so I would say we should fix it.

To reduce the time spent checking, we could forward the ggml_backend_sched's knowledge of whether a ggml_cgraph is being reused to the CUDA backend, and check the graph only when necessary. I think this was scoped as part of the GraphPlan API work, where the backend scheduling was intended to be refactored/reworked; that effort is currently on hold due to diego's hiatus.

@am17an
Contributor

am17an commented Jan 3, 2026

I agree it should be fixed, but I'm not sure this is the correct solution. I'm also not sure how to estimate the impact of the memory increase: AFAIK there are some linear-attention models whose graphs contain tens of thousands of nodes, and for them this would mean reading tens of MBs of data for every token. Perhaps it won't be a big deal; I can measure and report back.
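A back-of-the-envelope check of that estimate (the node count is an assumption for illustration, not a measured figure):

```python
# Assumed: a graph with 20k nodes (plausible for the large
# linear-attention models mentioned above), and the 1280 extra bytes
# per graph node that this patch adds.
nodes = 20_000
bytes_per_node = 1280
total_mib = nodes * bytes_per_node / (1024 ** 2)
print(f"{total_mib:.1f} MiB checked per token")  # ~24.4 MiB
```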

@ggerganov
Member Author

Yes, this patch is overkill. We really only need to check the shapes of the leaf nodes (i.e. source tensors that are not the result of any node in the graph). The shapes of the remaining sources are already checked by the existing logic, as noted in #17004 (comment). A better fix can be implemented that would not add so much overhead.
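A hedged Python sketch of that cheaper check (a simplified dict-based graph stands in for the real ggml structs; the helper names are made up):

```python
# Simplified model: a node is {"out": tensor, "srcs": [tensors]} and a
# tensor is {"shape": tuple}. Only leaf sources (tensors not produced
# by any node in the graph) need a shape comparison; intermediate
# sources are already covered by the existing per-node property checks.
def leaf_src_shapes(nodes):
    produced = {id(n["out"]) for n in nodes}
    return [src["shape"]
            for n in nodes
            for src in n["srcs"]
            if id(src) not in produced]

# Two single-top-k graphs: identical output shapes, different src shapes.
src_a, src_b = {"shape": (100, 1)}, {"shape": (200, 1)}
out_a, out_b = {"shape": (5, 1)}, {"shape": (5, 1)}
graph_a = [{"out": out_a, "srcs": [src_a]}]
graph_b = [{"out": out_b, "srcs": [src_b]}]

# The leaf shapes differ, so the CUDA graph must not be reused.
assert leaf_src_shapes(graph_a) != leaf_src_shapes(graph_b)
```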

@am17an
Contributor

am17an commented Jan 3, 2026

@ggerganov I can make that change, unless you are already going to make it

@ggerganov
Member Author

@am17an Thanks, please go ahead

@ggerganov ggerganov closed this Jan 9, 2026
@ggerganov ggerganov deleted the gg/cuda-fix-graph-match branch January 9, 2026 10:57