Conversation
Currently, CUDA graphs are eagerly enabled on the first call to `ggml_backend_cuda_graph_compute`. If the graph properties keep changing (4+ consecutive updates), the graph is permanently disabled. This is suboptimal because:

- The first call always incurs CUDA graph capture overhead even if the graph is unstable
- Once permanently disabled, CUDA graphs never re-enable even after the graph stabilizes (e.g., switching from prompt processing to decode)

The new approach delays CUDA graph activation until warmup completes: the same cgraph must be called at least twice with matching properties before CUDA graph capture begins. This avoids wasted capture overhead on volatile graphs and allows graphs to become eligible once they stabilize. This also fixes issues such as ggml-org#19708.
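The delayed-activation policy described above can be sketched as a tiny state machine. This is an illustrative sketch only: `GraphGate`, `should_capture`, and the key type are hypothetical names, not the PR's actual ggml-cuda identifiers.

```cpp
#include <cstdint>

// Hypothetical sketch of the warmup gate described above: CUDA graph
// capture is only allowed once the same cgraph key (imagined here as a
// hash of the graph's topology/properties) has been observed on two
// consecutive calls. Names are illustrative, not the actual ggml-cuda API.
struct GraphGate {
    uint64_t last_key    = 0; // key of the previously evaluated cgraph
    int      match_count = 0; // consecutive calls with an identical key

    // Call once per graph evaluation; returns true when capture should begin.
    bool should_capture(uint64_t key) {
        if (match_count > 0 && key == last_key) {
            match_count++;
        } else {
            last_key    = key; // new or changed graph: restart the warmup window
            match_count = 1;
        }
        return match_count >= 2;
    }
};
```

Under this policy the first call never pays capture overhead, a graph whose key keeps changing (e.g. prefill with a growing context) never triggers capture, and a graph becomes eligible again as soon as two consecutive calls match (e.g. after switching to decode).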
How often does a CUDA graph need to be evaluated to offset the overhead? If that number is low, it may make more sense to just enable them more generally, since we can now cache more than a single one.
Actually, I think the problem was that, due to the increasing context size, CUDA graphs were not reusable for the prefill phase, so it didn't make sense to use them.
Exactly. I think PR #19645 enabled it for the prefill phase, but it caused CUDA graphs to get disabled altogether after 4 consecutive updates.
Ah yes, thanks for flagging that - I missed it.
Regarding the CUDA Graph logic, we currently have the following state in the code:
The counter + disablement logic fires on both 2a) and 2b), whereas in 2a) the cudaLaunch API has performance parity with the cudaGraph API; see the following numbers (tg 200 on a B6000):
For 2a/2b, from this data I would recommend the following:
#19757 This is a WIP based on my insights above, which would enable cudaGraphs for PP on dense models that do not use
One of the reasons to take this approach in the PR was to fix issues like #19708. Basically, there could be cases where one
@gaugarg-nv, @ORippler, I am just curious whether it is feasible to "freeze" the CUDA graph. In some applications, e.g. diffusion, the same cgraph runs repeatedly without topology changes. In this case, even
Unfortunately, ggml does not allow for this yet. We wanted to tackle it as part of the graph-plan API, which was postponed indefinitely due to slaren's hiatus. llama-context, the LLM orchestration loop of llama.cpp, already avoids rebuilding the graph when the topology is consistent (since #14482), and would simply have to forward this information to the backends somehow. Quoting from #14482:
@ORippler, thanks for the #14482 link. Didn't know such a capability already exists.
Following up on this with a TLDR: given the limited perf-potential (talking about up to ~2% on Windows, based on the numbers below), [...].

For the interested, continue reading along. Let's start with the collected perf numbers:

Surprisingly, we don't see perf gains on Windows, despite there being a 2% gap between Windows and Linux in cudaLaunch-API mode. Why? Let's take an Nsight Systems report to figure out.
Ahh, we hit 2b) way more often than expected (my expectation was to hit only 2a).
```diff
  graph_key = ggml_cuda_graph_get_key(cgraph);
- use_cuda_graph = ggml_cuda_graph_set_enabled(cuda_ctx, graph_key);
+ ggml_cuda_graph_set_enabled(cuda_ctx, graph_key);
```
Strictly speaking, this function solely checks whether the GPU supports CUDA graphs, but we can change the name in a separate PR.
JohannesGaessler left a comment:
Your code comments contain EM dashes. Unless there is a good reason not to, please stick to ASCII characters.
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Fixed now.
am17an left a comment:
Nice and elegant solution!
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Improve CUDA graph capture

  Currently, CUDA graphs are eagerly enabled on the first call to ggml_backend_cuda_graph_compute. If the graph properties keep changing (4+ consecutive updates), the graph is permanently disabled. This is suboptimal because:

  - The first call always incurs CUDA graph capture overhead even if the graph is unstable
  - Once permanently disabled, CUDA graphs never re-enable even after the graph stabilizes (e.g., switching from prompt processing to decode)

  The new approach delays CUDA graph activation until warmup completes: the same cgraph must be called at least twice with matching properties before CUDA graph capture begins. This avoids wasted capture overhead on volatile graphs and allows graphs to become eligible once they stabilize. This also fixes issues such as ggml-org#19708

* Update ggml/src/ggml-cuda/ggml-cuda.cu

  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Remove EM dashes

* Update ggml/src/ggml-cuda/ggml-cuda.cu

  Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
@gaugarg-nv I'm seeing a performance regression for pp512 from this PR:
Thanks for reporting this. What's the command-line option used? Do you have a warm-up iteration? What is the number of runtime iterations? I guess that when you are testing pp512 with a micro-batch size of 512, the CUDA graph won't change across iterations, and the CUDA graph captured during warm-up will get reused. Can you try removing the warm-up loop or reducing the number of runtime iterations and see if that has any impact?
I tested the performance like this:
I think I know what's going on. If I use 10 rather than 1 repetition for the benchmark, there is basically no performance difference:
On the warmup run for pp512, CUDA graphs are not used because it's the first run. On the first benchmark run, the same ggml graph is run for a second time, so a CUDA graph is captured, which introduces overhead. With 10 benchmark runs the overhead amortizes, so there is no difference. So I would see this not as an issue with the code, but rather with how we are benchmarking it.
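The amortization argument above can be checked with toy numbers. This is a sketch with made-up values, not measured data; `avg_ms` is a hypothetical helper.

```cpp
// Toy model of the benchmarking effect described above: a one-time CUDA
// graph capture cost is paid on the first benchmark repetition, so the
// reported average shrinks toward the steady-state run time as the number
// of repetitions grows. All numbers are illustrative, not measurements.
double avg_ms(double run_ms, double capture_ms, int reps) {
    return run_ms + capture_ms / reps; // capture is paid exactly once
}
```

For example, with a hypothetical 10 ms steady-state run and a 5 ms capture cost, 1 repetition reports 15 ms per run while 10 repetitions report 10.5 ms, matching the observation that the gap effectively disappears at 10 repetitions.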
Right, I think one way to fix this is to increase the number of warmup iterations to 2 instead of the current 1. With that change, both implementations should show the same perf.

Perf improvement for Llama-8b-Q4_K_M on RTX 6000 Ada 300W:
Nsight profile Master:

Nsight profile PR:
