cuda : enable CUDA graphs for MMID 1 <= BS <= 4 #19645
Conversation
am17an left a comment:
Speedup on 4090:
| Model | Microbatch size | Test | t/s 8bc255a | t/s gg/cuda-graphs-enable-bs-gt1 | Speedup |
|---|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | 1 | pp512 | 283.83 | 284.09 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 2 | pp512 | 401.81 | 443.68 | 1.10 |
| gpt-oss 20B MXFP4 MoE | 3 | pp512 | 517.59 | 561.15 | 1.08 |
| gpt-oss 20B MXFP4 MoE | 4 | pp512 | 595.86 | 635.65 | 1.07 |
ggml/src/ggml-cuda/ggml-cuda.cu (outdated)

```cpp
// https://github.com/huggingface/transformers/blob/bda75b4011239d065de84aa3e744b67ebfa7b245/src/transformers/models/gemma3n/modeling_gemma3n.py#L1773,
// Generally, changes in batch size or context size can cause changes to the grid size of some kernels.
// [TAG_MUL_MAT_ID_CUDA_GRAPHS]
if (node->op == GGML_OP_MUL_MAT_ID && (!ggml_is_quantized(node->src[0]->type) || node->ne[2] > 4)) {
```
Perhaps we rename this `4` to `MMVQ_MMID_MAX_BATCH_SIZE` in case we end up optimizing for BS > 4.
@JohannesGaessler Can you confirm that adding the MMVQ_MMID_MAX_BATCH_SIZE constant is OK for now?
Adding a named constant here would be fine with me.
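The rename suggested above can be sketched as a small stand-alone predicate. This is a minimal illustration only: the enum, struct-free signature, and function name below are hypothetical stand-ins, not the actual ggml API.

```cpp
#include <cassert>

// Illustrative stand-ins for the real ggml op enum; names are not ggml's.
enum Op { OP_MUL_MAT_ID, OP_OTHER };

// The named constant suggested in the review, replacing the bare literal 4.
constexpr long MMVQ_MMID_MAX_BATCH_SIZE = 4;

// Returns true when a MUL_MAT_ID node would force-disable CUDA graphs:
// either src0 is not quantized, or the batch size (ne[2]) exceeds the
// maximum batch size handled by the MMVQ path.
static bool mmid_disables_cuda_graph(Op op, bool src0_quantized, long ne2) {
    return op == OP_MUL_MAT_ID && (!src0_quantized || ne2 > MMVQ_MMID_MAX_BATCH_SIZE);
}
```

With the constant in place, a future PR that optimizes BS > 4 only needs to bump one value instead of hunting for magic literals.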
Force-pushed from 15a6842 to 7d0be2c
```cpp
// under these conditions, the mul_mat_id operation will need to synchronize the stream, so we cannot use CUDA graphs
// TODO: figure out a way to enable for larger batch sizes, without hurting performance
// ref: https://github.com/ggml-org/llama.cpp/pull/18958
```
I see two ways for this:
- Refactor the logic expressed on the host side into a CUDA kernel
- Use `cudaLaunchHostFunc` and manage the lifetime of CPU objects inside `ggml_cuda_graph`
The fundamental problem here is that we are using cuBLAS for GEMM and are orchestrating the matrix multiplications per expert via CPU logic. It would in principle be possible to pad the matrix multiplications and use batched/strided GEMM but that will waste a lot of compute. I'm not sure whether it's possible to orchestrate cuBLAS GEMM from device code, launching CUDA kernels from within cudaLaunchHostFunc also seems dubious. For the custom ggml kernels I made it so that during the quantization kernel the data is re-arranged to be contiguous per expert and the kernel is skipping any output tiles that would be 100% padding. In principle this could be generalized to floating-point data, see #18864 .
That's correct.
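The host-side orchestration problem described above can be modeled with a simplified sketch (this is illustrative only, not ggml code): the set of per-expert GEMMs can only be planned after the routing ids are known on the host, which is what forces the device-to-host read-back and the stream synchronization that breaks graph capture.

```cpp
#include <cassert>
#include <map>
#include <vector>

// One planned GEMM launch for one expert; a stand-in for a cuBLAS call.
struct GemmCall {
    int expert;
    int n_rows;
};

// Stand-in for the host loop that would issue one cublasGemmEx per expert
// that received at least one row. The routing ids (row_to_expert) live on
// the device in the real backend, so building this plan requires a stream
// synchronization -- which is incompatible with CUDA graph capture.
static std::vector<GemmCall> plan_expert_gemms(const std::vector<int> & row_to_expert) {
    std::map<int, int> rows_per_expert;
    for (int e : row_to_expert) {
        rows_per_expert[e]++;  // host-side counting of rows routed to each expert
    }
    std::vector<GemmCall> calls;
    for (const auto & [expert, n_rows] : rows_per_expert) {
        calls.push_back({expert, n_rows});  // one GEMM launch per active expert
    }
    return calls;
}
```

Because the number of launches varies with the routing, the captured launch sequence would differ from graph to graph; padding everything to a fixed batched/strided GEMM avoids that but, as noted above, wastes compute on padding tiles.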
ggml/src/ggml-cuda/ggml-cuda.cu (outdated)

```cpp
// https://github.com/huggingface/transformers/blob/bda75b4011239d065de84aa3e744b67ebfa7b245/src/transformers/models/gemma3n/modeling_gemma3n.py#L1773,
// Generally, changes in batch size or context size can cause changes to the grid size of some kernels.
// [TAG_MUL_MAT_ID_CUDA_GRAPHS]
if (node->op == GGML_OP_MUL_MAT_ID && (!ggml_is_quantized(node->src[0]->type) || node->ne[2] > 4)) {
```
Btw, note here that we currently require quantized src0. This means that with BF16 MoE models for example, the CUDA graphs will not be enabled.
#18958 enables this for mmvf also, so bf16 should also work
Ah sorry, that is just for AMD.
Merging on green
…)" This reverts commit ad8207a.
Hello @ggerganov, this PR specifically seems to cause a new memory leak that @henk717 is facing when it is used with stable-diffusion.cpp. I know it sounds strange, but bear with me.

I can create a full issue too if you want, but since this is a very small commit I figured you might already know the fix. Also tagging the stable-diffusion.cpp stakeholders, although I don't think it's caused there since their code is unchanged - @leejet @stduhpf
Seems related to #19708. Does stable-diffusion.cpp re-create a new ggml-cgraph object with a different node key for every step of the pipeline? Also, does #19754 fix your issue?

Small clarification: the amount of memory increase depends on the image model. It can be as much as 4 GB per image on Qwen Image Edit. It is very noticeable when it happens. I monitor this in regular RAM, not VRAM, although that seemed to go up as well. While I run dual 3090s, only one of the 3090s is visible to the application.
…-org#19645)"" This reverts commit 131e3cb.
Sent you a test build with #19754 merged and this PR reinstated, let me know if it fixes the memory leak.
This did not resolve the issue.
… 4 (ggml-org#19645)""" This reverts commit edc04f3.
Maybe gg knows what's wrong.
Could you provide the most basic steps to repro?
Umm, sorry, I'd like to if possible, but unfortunately I don't know how to reproduce it easily here in this repo, as most of the necessary code is not inside the llama.cpp repo. Henk717 was using it from within KoboldCpp. The issue only happens when using image generation code taken from https://github.com/leejet/stable-diffusion.cpp; however, sd.cpp itself has yet to sync to the new ggml version (which needs this commit), so it won't be reproducible there yet. Additionally, they do not have CUDA graphs enabled there, which is needed to trigger this issue, so that will also need to be enabled. I could direct you to my own fork at https://github.com/LostRuins/koboldcpp; however, I must admit my code quality is not very good, my build process is jank, and it might be troublesome to work with, plus I cannot trigger this bug myself... @henk717, do you have any better idea how we can help gg troubleshoot this here? The circumstances of the leak caused by this PR are as summarized in #19645 (comment)
It happens during regular KoboldCpp usage on every image I generate. System memory usage is an extremely noticeable measurement, as it will rapidly increase on an affected build. For me it's consistent on Windows with driver version 581.29 on my local desktop PC with a 3090 GPU. On a 3090 Linux cloud rental I was unable to measure this issue (I assume it wasn't present, but I am less familiar with the Linux tools). Lostruins was unaffected on his 4090 laptop.

The tricky part is that it manifests inside of stable-diffusion.cpp because KoboldCpp uses the newer ggml from llama.cpp. At the moment the upstream sd.cpp project is still on a 3-week-old ggml version and will not be impacted yet.

One way to test it on KoboldCpp is by using https://huggingface.co/MaxedOut/ComfyUI-Starter-Packs/resolve/main/SDXL/checkpoints/sd_xl_base_1.0.safetensors?download=true as the image model (any model should produce it; this is a relatively fast one that I know for certain is easy to observe). This can be done in the Image Model tab by selecting it as an Image Gen Model. No further files are required for this model. The command-line equivalent would be --imgmodel

Once the UI launches, accept that you wish to be taken to sdui, or use the /sdui link for easy image generation. Type any prompt (in my tests I use the word Kobold) and you will see the system process's memory increase by several hundred megabytes with every image you generate.

If you prefer working with the upstream sd.cpp project, you may be able to link it to a newer ggml version to introduce the same issue. KoboldCpp can be compiled with cmake for the Windows CUDA build (Linux is make or koboldcpp.sh, but on Linux I haven't been able to reproduce the issue so far, possibly due to my lack of familiarity with the Linux memory management tools). @LostRuins, would it be difficult for you to provide a stable-diffusion.cpp build (not KoboldCpp) with the new ggml applied?
Maybe add this check back again and see if it resolves the issue:

```diff
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 7e6d33035..f841d8ac9 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -2893,6 +2893,10 @@ static bool ggml_cuda_graph_check_compability(ggml_cgraph * cgraph) {
 #endif
         }

+        if (node->op == GGML_OP_ADD && node->src[1] && node->src[1]->ne[1] > 1) {
+            use_cuda_graph = false;
+        }
+
         if (!use_cuda_graph) {
             break;
         }
```
Bullseye! @LostRuins made a test build for me with the fix applied, and on this test build I do not have the issue.
So I think we could be adding a large number of CUDA graphs which are never reclaimed. Namely, this code is being called several times. It's probably because each time graph_compute is being called with a new pointer during nrows > 1.

llama.cpp/ggml/src/ggml-cuda/common.cuh, lines 1334 to 1341 at 9051663

You can confirm this by using
It's happening on my Windows system, so if that instruction is for me, I'd need to know a Windows equivalent.
IDK on Windows; from a quick search it looks like drmemory is useful.
* cuda : enable CUDA graphs for MMID BS <= 4

* cont : add stream capture check

Co-authored-by: Oliver Simons <osimons@nvidia.com>

* cont : add MMVQ_MMID_MAX_BATCH_SIZE

---------

Co-authored-by: Oliver Simons <osimons@nvidia.com>
This is not going to be an easy task for me; the tool you suggested didn't work. I found another one which reduces the generation speeds immensely, so it takes a long time, and the end result was without any names, so I don't think any of us will be able to decipher it. Between @LostRuins integrating fixes and me testing, this would probably take significantly more time compared to queuing up some ideas. Normally when I am testing, he queues up multiple test builds so I can test them all in one go; this will likely be faster than the hours it will take for a single test, assuming we get more useful output to begin with, as I don't know if the tool I found can handle debug symbols (the exe I have isn't a debug-symbol build but one from our CI). My suggestion is to continue on the more speculative path, ideally in one go. Then he can queue it up for me in his timezone and I can test it in mine if that is needed.
Okay, you can try this patch to confirm:

```diff
diff --git a/ggml/src/ggml-cuda/common.cuh b/ggml/src/ggml-cuda/common.cuh
index 36d8a3aaa..d394fb475 100644
--- a/ggml/src/ggml-cuda/common.cuh
+++ b/ggml/src/ggml-cuda/common.cuh
@@ -1335,6 +1335,9 @@ struct ggml_backend_cuda_context {
         auto it = cuda_graphs.find(first_node_ptr);
         if (it == cuda_graphs.end()) {
             cuda_graphs[first_node_ptr] = std::make_unique<ggml_cuda_graph>();
+            if (cuda_graphs.size() % 100 == 0) {
+                GGML_LOG_INFO("cuda graphs current size: %zu\n", cuda_graphs.size());
+            }
             return cuda_graphs[first_node_ptr].get();
         }
         return it->second.get();
```

This should keep on increasing forever until OOM, and you should see the log of the size of the map.
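The growth mechanism being probed here can be modeled with a small stand-alone sketch. The struct names below are hypothetical stand-ins for the real `ggml_backend_cuda_context` members, assuming (as discussed above) that graphs are cached by the address of the first graph node.

```cpp
#include <memory>
#include <unordered_map>

// Stand-in for ggml_cuda_graph; the payload just represents owned resources.
struct FakeCudaGraph {
    char payload[64];
};

// Simplified model of the cache in common.cuh: CUDA graphs are keyed by the
// address of the first node of the ggml graph. If the caller allocates a
// fresh cgraph every pipeline step, every lookup sees a new key and the map
// grows without bound -- matching the leak observed with stable-diffusion.cpp.
struct GraphCache {
    std::unordered_map<const void *, std::unique_ptr<FakeCudaGraph>> cuda_graphs;

    FakeCudaGraph * get(const void * first_node_ptr) {
        auto it = cuda_graphs.find(first_node_ptr);
        if (it == cuda_graphs.end()) {
            // a previously unseen key allocates an entry that is never reclaimed
            cuda_graphs[first_node_ptr] = std::make_unique<FakeCudaGraph>();
            return cuda_graphs[first_node_ptr].get();
        }
        return it->second.get();
    }
};
```

With a stable key (llama.cpp reusing one cgraph) the map stays at size 1; with a new key per step (the stable-diffusion.cpp pattern) each image leaks one graph's worth of host and device resources.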
The patch was added: LostRuins@c2e06d8. Instead, the output looks like this: `ggml_backend_cuda_graph_compute: CUDA graph warmup complete`
Update: I noticed it was kind of limited on SDXL, so I went back to testing Qwen; a few images in, I saw the new debug output.
But what did it output (the values)?
The check was an equals check, so I only got it when it hit 100.
It's a modulo check, so you should get the print at 100, 200, 300, etc.
By the time it hit 100 I already had multiple gigabytes of additional RAM.
cont #19644
cont #18958
cont #19521
Enable CUDA graphs for ggml graphs with `GGML_OP_MUL_MAT_ID` at `1 < BS <= 4`. Improves the performance of parallel generation for up to 4 sequences. This is useful, for example, when running multiple local agents in parallel (#19564 (reply in thread)).

Also, simplify the CUDA graph exception logic. Unless I am missing something, we no longer need to restrict `GGML_OP_ADD` for `BS > 1`, because we perform an exhaustive node properties check to decide if the graph needs an update.

Next PRs:

- `BS > 4 && BS <= 8`, or more?

DGX Spark:
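The "exhaustive node properties check" that the PR description relies on could look roughly like the following. This is a hypothetical sketch, assuming the check compares each node's op, shape, and buffer address against what was recorded at capture time; the `NodeProps` struct and its field set are illustrative, not the actual ggml implementation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed per-node snapshot recorded when the CUDA graph was captured.
struct NodeProps {
    int op;                      // ggml op id
    std::array<int64_t, 4> ne;   // tensor dimensions
    const void * data;           // buffer address

    bool operator==(const NodeProps & o) const {
        return op == o.op && ne == o.ne && data == o.data;
    }
};

// Before replaying a captured graph, compare every node's current properties
// against the recorded ones; any mismatch (e.g. a batch-size change showing
// up in ne) triggers a graph update instead of a stale replay. This is what
// makes a blanket GGML_OP_ADD restriction for BS > 1 unnecessary.
static bool graph_needs_update(const NodeProps * recorded, const NodeProps * current, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (!(recorded[i] == current[i])) {
            return true;  // any property change invalidates the captured graph
        }
    }
    return false;
}
```

The trade-off is that the check runs on every compute call, but comparing a few scalars per node is far cheaper than re-capturing a graph.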