Overlap CUDA graph building and processing to minimize GPU idle time and improve tokens per second performance. #11867
base: master
Conversation
- … for-loop to cycle through them, optimizes function calls to pass specific graphs instead of the whole context
- … hard-coding 2 CUDA graphs and setting custom offsets
- … the vulkan backend. The first two graphs are small to minimize idle time, and then graphs have uniform size.
This also provides a small but measurable speedup on ROCm / CDNA:

Master: …
PR: …

This makes hipGraphs performance-positive for the first time. Unfortunately, it also leads to random crashes in hipGraphDestroy, which on first examination seem to be ROCR's fault rather than this PR's.
My understanding of the CUDA graph implementation is that a new graph is only created when there are incompatible changes to the graph; otherwise, only a small number of graph nodes are updated to reflect the new positions of the KV cache. Due to the KV padding, this should only happen every 32 or 256 tokens, with FA disabled or enabled respectively. Thus, in the worst case, this optimization could save 224us every 32 tokens, an average of 7us per token. Is this what you are observing?
That's not quite right, the …
Thanks for the clarification. I understand that even in the case where only a few nodes need to be updated, the call to …

I still think that this change adds a significant amount of complexity to code that is already too fragile and complex to reasonably maintain. I mentioned this in the initial PR where you added support for CUDA graphs, and I still think this is the way to go: this could be implemented via the "graph plan" API in ggml-backend. With the graph plan API, llama.cpp would take on the responsibility of creating and updating these plans, simplifying the logic in the CUDA backend significantly. It would also be possible to add higher-level logic in llama.cpp to handle the plans; for example, it could prepare the plan for the next graph while a graph is being evaluated, effectively achieving the same thing that is done in this PR. Other backends could take advantage of the same optimizations by implementing the graph plan interface. If that's something that you would be interested in implementing, I could help with that.

As it is, I do not think that the performance difference is large enough to justify the added complexity, and I am not willing to take the responsibility of maintaining this code. Other maintainers may still review and merge this PR if they disagree.
I created some logs and checked how often … However, even with … For example, in master, …

The approach presented here is an elegant way to do these checks in parallel with the GPU, instead of as a blocking operation up front, as is currently done. A full rewrite of this area might fix the aforementioned issues, but this proposed change is in my opinion ultimately much easier, less risky, and introduces very little additional complexity.

Regarding code complexity:
I encourage you and the other interested maintainers to read the change set commit by commit. It might look complex, but it is actually reasonable. The rest of the commits are much smaller.

Regarding performance:
The numbers presented here are the best case. I will try to get a low-tier system, retest, and report back.

Graph API:
Out of interest and for future reference, do you have any plans or resources you can share for the graph plan API?
@slaren Honestly, I think Flash Attention should be an optional feature in ggml since it doesn't introduce significant performance improvements, and the binary size has increased considerably, not to mention the compilation time, which, even though I only compile it for my GPU architecture, still takes 20 minutes on an i5-12400. It is not related to this PR, but it would be good to take it into account.
I have not been a Collaborator here very long, so my opinion should be taken with a grain of salt. My knowledge of ggml is also still very much restricted to my perspective of looking at single kernels in an effort to optimize them, without spending much time looking at the whole picture.

llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu, line 2554 at 63ac128

I also challenge the usefulness of the graph support in general: even on CUDA it seems to increase performance only very slightly, and only in situations where the t/s performance is already very high because small models are run on high-end hardware; and even then, the list of exceptions in the code where it is disabled for not being useful is long. On the other hand, I agree with @aendk that this PR hardly moves the needle at all in terms of the complexity of this feature's implementation, while giving a larger performance boost than the PR that introduced graph support in the first place. In my opinion, not merging this PR makes no sense if the intention is to keep the graph support as is. Either it should be merged, or the current usage of hip/cuGraph should be eliminated entirely, potentially being replaced by code that explicitly constructs the graph.
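As a rough, hypothetical illustration of that last alternative (not code from this PR or the repository), building a graph explicitly via the CUDA graph node API instead of stream capture could look something like the sketch below; `scale_kernel` and `run_explicit_graph` are made-up names and error checking is omitted.

```cpp
// Hypothetical sketch: building a CUDA graph explicitly via the node API
// instead of stream capture. Not code from this PR; error checking omitted.
#include <cuda_runtime.h>

__global__ void scale_kernel(float * x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] *= s;
    }
}

void run_explicit_graph(float * d_x, int n, cudaStream_t stream) {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // Describe a single kernel node; a real backend would add one node per op
    // and pass a dependency list instead of nullptr/0.
    float s = 2.0f;
    void * args[] = { &d_x, &s, &n };
    cudaKernelNodeParams p = {};
    p.func           = (void *) scale_kernel;
    p.gridDim        = dim3((n + 255) / 256);
    p.blockDim       = dim3(256);
    p.sharedMemBytes = 0;
    p.kernelParams   = args;

    cudaGraphNode_t node;
    cudaGraphAddKernelNode(&node, graph, nullptr, 0, &p);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
}
```

The trade-off is that the backend has to know the full node list and its dependencies up front, whereas stream capture reuses the existing kernel-launch path.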
I very much disagree that CUDA graphs are not useful. It's true that they mostly benefit small models and fast GPUs, but we use small models all the time, even if only as a draft model for speculative decoding. Here is an overview of the performance I get on my system:

No CUDA graphs: …
CUDA graphs: …
CUDA graphs + this PR: …
Small print: this is under Windows (WSL), with hardware GPU scheduling enabled, which has a notoriously high kernel launch overhead. The difference is not as significant on Linux or with GPU scheduling disabled.

I do not disagree that the amount of complexity that this change adds is not very significant compared to the overall complexity of the feature, but I am just not willing to continue going down this road, where we keep developing and adding complexity to a feature that should have been refactored immediately after it was added. At the time, I concluded that the performance difference was too big to ignore the PR, but that is not the case here.

About the graph plan API: this is meant to be a very simplified abstraction of features similar to CUDA graphs. Currently it looks like this:

llama.cpp/ggml/src/ggml-backend-impl.h, lines 100 to 107 at 63ac128
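For reference, the graph plan hooks in `struct ggml_backend_i` look roughly like the sketch below; this is reproduced from memory rather than from the exact lines at 63ac128, so comments and details may differ.

```cpp
// Rough reconstruction of the graph plan hooks in struct ggml_backend_i
// (ggml-backend-impl.h); the exact signatures/comments at 63ac128 may differ.
ggml_backend_graph_plan_t (*graph_plan_create) (ggml_backend_t backend, const struct ggml_cgraph * cgraph);
void                      (*graph_plan_free)   (ggml_backend_t backend, ggml_backend_graph_plan_t plan);
// update an existing plan with a new graph - cheaper than re-creating it when the topology is unchanged
void                      (*graph_plan_update) (ggml_backend_t backend, ggml_backend_graph_plan_t plan, const struct ggml_cgraph * cgraph);
// execute a previously created plan
enum ggml_status          (*graph_plan_compute)(ggml_backend_t backend, ggml_backend_graph_plan_t plan);
```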
As noted, this is not currently used, and can be changed if necessary. I would be willing to implement the minimum necessary changes to llama.cpp and ggml-backend to support this feature, but I would need your help implementing the interface on the CUDA backend.
Hi slaren, can you share your hardware setup and how you ran these tests? Additionally, how many runs did you do per configuration? This is the first time I've seen a perf decrease (gemma 2B) and I would love to get some more info. Just FYI, I plan on running on a low-end CPU before next week to see the worst case for CPU overhead, and thus the best improvement from this PR.
My hardware is an Intel 13900k and a 3090 Ti, running in WSL under Windows 11. The command used to run this test was …

The difference with gemma-2b is not caused by this PR; @JohannesGaessler has made some optimizations recently that are not present in this PR. After merging …
I have debugged the hip graph crash this PR triggers down to #11949 and can confirm that it is not the fault of this PR.
I guess that, in this case, the fact that this PR helps a lot for HIP systems on small models is a thing to consider then:

Master: …
PR: …

Don't worry about the HIP runtime bug; I have a simple one-liner to avoid it that I will add soon.
Hi all,
this PR takes the ideas applied to the vulkan backend (#9118 and #10499) and implements them for CUDA. This results in improved tokens per second performance.
Performance
I tested on two systems using an example query in `llama-cli` and the `phi3-mini-4k-instruct` model.

Prompt eval tokens per second improved between 2.5 and 7%.
Context print tokens per second improved between 2.8 and 3.57%.


Note that this is a PR to reduce CPU overhead, and that these numbers were generated using top-end CPUs.
On less powerful consumer CPUs, the performance increase should be more significant.
Explanation
Currently, before every forward pass, a CUDA graph is built on the CPU and then executed on the GPU. This results in a delay: the GPU needs to wait for the CPU to finish building the CUDA graph.
Our proposed change splits the CPU workload into smaller pieces, with the effect that after the first graph has been built, the CPU and GPU can work in parallel on different CUDA graphs.
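The sketch below illustrates this pipelining idea in isolation; it is not the PR's actual implementation, and `enqueue_chunk_kernels`, `build_chunk_graph`, and `run_forward_pass` are hypothetical names (error checking omitted).

```cpp
// Hypothetical sketch of the pipelining idea, not the PR's implementation.
// build_chunk_graph / enqueue_chunk_kernels / run_forward_pass are made-up names;
// error checking is omitted.
#include <cuda_runtime.h>
#include <vector>

// Assumed helper: records the kernels of one chunk of the compute graph into `stream`.
void enqueue_chunk_kernels(int chunk, cudaStream_t stream);

// Capture one chunk into a CUDA graph and instantiate it on the CPU.
static cudaGraphExec_t build_chunk_graph(int chunk, cudaStream_t capture_stream) {
    cudaStreamBeginCapture(capture_stream, cudaStreamCaptureModeThreadLocal);
    enqueue_chunk_kernels(chunk, capture_stream);
    cudaGraph_t graph;
    cudaStreamEndCapture(capture_stream, &graph);
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaGraphDestroy(graph);
    return exec;
}

// While the GPU executes chunk N-1, the CPU builds and instantiates chunk N.
void run_forward_pass(int n_chunks, cudaStream_t capture_stream, cudaStream_t exec_stream) {
    std::vector<cudaGraphExec_t> launched;
    cudaGraphExec_t ready = build_chunk_graph(0, capture_stream); // first chunk is kept small
    for (int chunk = 1; chunk <= n_chunks; ++chunk) {
        cudaGraphLaunch(ready, exec_stream);                      // GPU starts on the ready chunk ...
        launched.push_back(ready);
        if (chunk < n_chunks) {
            ready = build_chunk_graph(chunk, capture_stream);     // ... while the CPU builds the next one
        }
    }
    cudaStreamSynchronize(exec_stream);                           // wait for all chunks to finish
    for (cudaGraphExec_t e : launched) {
        cudaGraphExecDestroy(e);
    }
}
```

Because each chunk is captured on its own stream while the previously instantiated graph runs on the execution stream, the build and instantiation cost of chunk N is hidden behind the GPU work of chunk N-1.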
The before/after is shown in the below images from `nsight systems`. Top is master, bottom is this changeset. The time between the start of the forward pass (red/green timeline of the CUDA API) and GPU graph execution (orange) is measured. We highlighted the time taken (256us vs 56us) with a red circle. This seems small, but since it happens before each forward pass / token-generation step, it adds up quickly.
Note that the two screenshots use different time scales, so the widths of the items are misleading. Only the measured time and the pattern of the red/green CUDA API operations are relevant.
Performance impact of switching between graphs during forward passes
My code mirrors the changes in the vulkan backend. In our testing, each forward pass uses dozens of graphs. One could argue that the last few graph switches are likely unnecessary and hinder performance.
We investigated this. Switching between graphs is a non-issue for now, at about 2us per switch. However, we could discuss strategies to steadily increase the graph size to reduce the number of switches; a rough sketch of such a schedule follows.
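The sketch below is a hypothetical size schedule, not code from this PR; the constants (8-node first chunk, doubling, cap of 256) are illustrative assumptions.

```cpp
// Hypothetical sketch of a chunk-size schedule: keep the first chunks small so the GPU
// can start early, then grow them to reduce the number of graph switches (~2us each).
// The constants (8, 256, doubling) are illustrative, not taken from this PR.
#include <algorithm>
#include <vector>

std::vector<int> plan_chunk_sizes(int n_nodes) {
    std::vector<int> sizes;
    int chunk = 8;                            // small first chunk -> minimal initial GPU idle time
    while (n_nodes > 0) {
        int take = std::min(chunk, n_nodes);
        sizes.push_back(take);
        n_nodes -= take;
        chunk = std::min(chunk * 2, 256);     // steadily increase the chunk size, capped
    }
    return sizes;
}
```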

@mtavenrath @agray3