CUDA: Improve performance via fewer synchronizations between tokens #17795

ggerganov merged 13 commits into ggml-org:master
Conversation
|
How about using existing I believe the Besides, I believe the current way of synchronizing both For synchronizations at the beginning of this code section ( |
While there is some amount of inter-op between the two backends ( Ultimately, this is a matter of taste and what architectural trade-off seems best in this specific case. |
Thank you for the explanation. I was not aware of the distinction and it totally makes sense. |
|
I believe these changes make an implicit assumption that the async copies will enter a common execution queue/stream with all previous operations. This seems to be the case for CUDA, but I don't think it is a general enough assumption that can be made for all backends. For example, removing this backend synchronization here I think is incorrect, since semantically it means that copying of the input can start immediately, even if the backend is still doing work:

llama.cpp/ggml/src/ggml-backend.cpp Lines 1458 to 1464 in 09519f7

With CUDA, it is apparently not a problem because the copying does not start immediately - it gets queued (is this correct? who owns this queue? what guarantees that the copy will wait for all previous CUDA operations to finish?). But for some other backend, a copy can start immediately - as soon as we call

Another implicit assumption that I think is not very good in this proposal:

```c
ggml_backend_dummy_buffer_set_tensor_async(...) {
    start_thread_to_copy_data(...).detach();
}
```

Technically, this is an asynchronous call, so it's perfectly fine from an API standpoint. But there isn't a common queue to manage this thread. The thread belongs to the buffer, not to the backend. Calling |
JohannesGaessler
left a comment
The ultimate authority on what the ggml backend interface should or shouldn't be is Diego but my opinion is that ggml_backend_tensor_set should be blocking and that an asynchronous equivalent should be an explicit opt-in. Otherwise user code can accidentally introduce use-after-free bugs depending on how fast or slow the memcpy is vs. the user code.
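The use-after-free hazard described here can be sketched outside of ggml with a toy pair of setters (names and types hypothetical, not the ggml interface): with a blocking setter, the caller may free or reuse the source buffer as soon as the call returns; with the async variant, doing so before an explicit synchronize is exactly the bug class an opt-in API is meant to make visible.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Toy stand-ins for the two API styles under discussion (hypothetical names,
// not the ggml interface). With the blocking setter, `src` may be freed as
// soon as the call returns; with the async setter, freeing `src` before
// synchronize() would be a use-after-free.
struct ToyBackend {
    std::vector<char> device_mem;
    std::future<void> pending; // models the backend's queue of async work

    void set_tensor(const void * src, size_t n) { // blocking
        const char * p = static_cast<const char *>(src);
        device_mem.assign(p, p + n);
    }

    void set_tensor_async(const void * src, size_t n) { // explicit opt-in
        pending = std::async(std::launch::async, [this, src, n] {
            const char * p = static_cast<const char *>(src);
            device_mem.assign(p, p + n);
        });
    }

    void synchronize() {
        if (pending.valid()) pending.get();
    }
};

std::vector<char> demo_async_set() {
    ToyBackend backend;
    std::vector<char> host = {'a', 'b', 'c'};
    backend.set_tensor_async(host.data(), host.size());
    backend.synchronize(); // only after this may `host` be freed or overwritten
    return backend.device_mem;
}
```

In this sketch, destroying `host` between the async call and `synchronize()` is the race the blocking-by-default design avoids.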
|
Re Georgi's first point: I think the PR as-is is already incorrect for CUDA. The reason there is a CPU synchronization for the memcpy is that the source pointer is under host control and the host could free the memory immediately after the function returns, resulting in a use-after-free bug. With this PR this could in principle still happen. It would only be safe if there is a precondition for user code that stipulates that memory may only be freed after calling

Re Georgi's second point: If I understood Diego's idea of what a ggml "backend" is supposed to be correctly, it was pretty much equivalent to the concept of a CUDA stream. And especially when the ggml backend API has a function |
Yes, |
Force-pushed 5d26313 to 1233fdd
|
Hi @ggerganov @JohannesGaessler , I reworked the whole PR and it is ready for review again. I started it from scratch due to the incorrect assumptions I based the first approach on.
|
|
The current behavior of what is implicitly ordered is more or less undocumented and challenging to reverse-engineer (sounds like you had to do some of this yourself). Adding code paths that rely on backend-specific implicit ordering is IMO a step in the wrong direction. |
|
Back to draft for some double checking on scenarios like Multi-CPU or non-CUDA to CUDA with UMA. |
|
Hi, Or is there something more elegant I've overlooked? |
402502f to
459cb40
Compare
|
Hi @ggerganov, @JohannesGaessler, apart from the spurious and likely unrelated CI fails, this PR is ready for review. For easier reviewing, please take a look at the graphic to understand my changes. This graphic shows the behavior of

**Lower half (Backend to Backend)**

For the lower half, each backend can implement asynchronous copies to itself. The fallback is fully synchronous. Here, I enabled async copies from the CPU backend to the CUDA backend, next to the existing CUDA->CUDA behavior.

**Upper half (Host to Backend)**

For the upper half, we have several small (position ids, masks) parallel copies from the host code to the backend. My changes in ggml-backend.cpp modify these. Regarding my previous comment, I decided to go for option 2) for checking the backend type in ggml-cuda.cu. Let me know if this is strict enough.

**Performance on Windows**

Master (

PR:

Speedup on windows: |
|
On first look, this looks like a solid proposal - at least I don't see any major concerns atm. Can you verify the following use cases are working as expected:
I'll try to answer this - atm not 100% sure. I'll also try to adapt this to the Metal backend and see if everything is compatible there as well. |
ggml/src/ggml-backend.cpp (outdated):

```cpp
// To not change any APIs or change what ggml-base links to, we can only detect backends by string matching
auto backend_name = ggml_backend_name(current_backend);
if (strncmp(backend_name, "CUDA", 4) == 0) {
```
FWIW, I think Diego would reject any change that string compares the backend name. If we need to query a capability from the backend maybe add it to ggml_backend_dev_caps?
I fully agree that this is not optimal. Working from the feedback above, I aimed to propose no API changes. This was the only obvious way.
The only reason this is remotely ok for me is that only up to 4 chars are checked for each inference pass, and that this is likely irrelevant for performance, and whilst a bit ugly, passable for readability.
If we are open to relaxing the "no API change" rule, I'll gladly rework this.
ggml/src/ggml-backend.cpp (outdated):

```cpp
#define GGML_SCHED_MAX_COPIES 4
#endif

enum ggml_backend_sync_mode {
```
My biggest concern is that it is still unclear what the synchronization requirements are for commands like get/set_tensor(_async), and this makes it even more confusing.
You've posed this as "a backend naturally has some amount of implicit synchronization, and this exposes information about that to the split code to take advantage of". But within a single backend there could be multiple amounts of synchronization it could do, how does somebody implementing a backend know how much synchronization they need to do? The backend interface needs to be a contract that clearly defines what synchronization is required of the backend.
I agree that the current async landscape is a bit fuzzy, but I disagree that my additions make it substantially worse.
In other projects I've worked in before, the general gist was to use synchronized functions unless you knew exactly what to do, and needed to go async to squeeze out more performance. I feel that ggml is a bit similar in that regard.
For this reason, my change works on an opt-in basis. If you need more performance, you can relax the synchronization requirements for your backend. If not, keep it simple and retain default synchronous execution.
In the future, there could be new backends which might require a different type of synchronization. I believe that figuring out what level of synchronization relaxation is best for their backend, like any other performance tuning or optimization, is part of the responsibility of these developers.
If GGML_SPLIT_SYNC_MODE_WRITE_READ_BOUNDARY (only a single sync between writes and reads) is not suitable for the vulkan backend, let me know. I would be happy to learn more about this, and to make my proposal more applicable for vulkan and similar backends.
My understanding of the ggml_backend_i contract atm is that using backend.get_tensor_async and backend.set_tensor_async without synchronization with respect to backend.graph_compute is not safe. I.e. this sequence of calls is not guaranteed to be thread-safe:
```
# a - async get/set
# g - compute
aaagaaag...
```

Although some backends such as Metal and CUDA would actually not have a problem in this case due to the implicit synchronization between the data movement and compute commands (i.e. due to sharing the same stream/queue), it is not a general assumption that can be made for all backends.
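The implicit ordering that Metal and CUDA get from a shared stream/queue can be modeled with a minimal in-order work queue (a toy sketch under that assumption, not the real backend code): because commands execute strictly in submission order, async input copies queued before a graph compute are guaranteed to finish first, with no explicit synchronization in between.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>

// Minimal model of a single in-order stream: commands run strictly in
// submission order, like work queued on one CUDA stream or Metal queue.
struct InOrderStream {
    std::queue<std::function<void()>> q;
    void submit(std::function<void()> cmd) { q.push(std::move(cmd)); }
    void drain() { while (!q.empty()) { q.front()(); q.pop(); } } // like a stream sync
};

std::string run_aaag() {
    InOrderStream stream;
    std::string order;
    for (int i = 0; i < 3; i++) {
        stream.submit([&order] { order += 'a'; }); // a - async set_tensor
    }
    stream.submit([&order] { order += 'g'; });     // g - graph compute
    stream.drain(); // one sync at the very end is enough on such a backend
    return order;   // copies always complete before the compute runs
}
```

A backend whose copies and compute do not share one queue has no such guarantee, which is exactly why this cannot be assumed in general.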
Because this is the current contract, the implementation in ggml_backend_sched_compute_splits() implements the safe path, avoiding the problematic async get/set/compute:
```
# c - sync get/set (i.e. data copy)
# g - async compute
# s - explicit sync
sccccsgsccccsg...
```

Without making any changes to the existing contract, we can relax this logic and avoid some of the extra synchronizations for the copies (as discussed in the other #17795 (comment)):
```
# a - async get/set
# g - async compute
# s - explicit sync
saaaasgsaaaasg...
```

This change should be fully compatible with the existing assumptions, and any code that uses ggml-backend would not be impacted, since we make no changes to the interface.
Now, the proposal by @aendk attempts to enable the aaaagaaaag... path, which, if the above is correct, is currently not safe to use by any ggml application. The bottom line is that in order for this path to work, the backend has to announce in some way whether this sequence of operations is safe, depending on the backend implementation. Checking the name is obviously not ideal - likely we can extend the dev_caps to provide this information. Although I think this is rather a property/capability of the backend object itself, rather than of the device, so a small concern here.
But for now lets assume that the dev_caps exposes an appropriate flag or enum. What should it be? For example, it can be bool data_and_compute_are_sequential, meaning that if the flag is true, it is guaranteed that the data movement and the compute share a common stream/queue and hence the aaagaaag... path is fine and the user does not need to do explicit synchronizations.
So far, my understanding is that a single flag like the above is all the information that is currently missing regarding this functionality. I am not convinced yet that more information (e.g. an enum as @aendk proposes) is necessary.
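The single-flag idea can be sketched as follows (the struct and field are hypothetical, not part of today's ggml_backend_dev_caps): the scheduler would derive the copy pattern from a capability query instead of string-matching the backend name.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the capability flag discussed above; neither this
// struct nor the field exist in ggml today.
struct toy_dev_caps {
    bool data_and_compute_are_sequential; // copies and compute share one queue
};

// The scheduler could then derive the copy pattern from the flag instead of
// comparing backend names.
std::string choose_copy_pattern(const toy_dev_caps & caps) {
    return caps.data_and_compute_are_sequential
        ? "aaag"    // no explicit syncs needed around the input copies
        : "saaasg"; // keep syncs before and after the batch of async copies
}
```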
A slight addition regarding my path: between inference passes, there is always a synchronization point to retrieve the logits. When looking at two inference passes, my proposal keeps them separate: aaaagsaaaags.
there is always a synchronization point to retrieve the logits
Generally, that synchronization is optional. For example, when you process a large prompt of N tokens with micro-batch size of N/4, there will be 3 graph computes without synchronization:
```
# prompt processing      text-generation starts
# v                      v
aaaagaaaaagaaaaagaaaaagas
#                      ^^
#                      |\- this is the sync for getting the logits from the final token
#                      \-- this is the async copy DtoH for the logits
```
Following up on this, I hacked adding a sync before and after copies to the backend,
by adding a ggml_backend_synchronize(split_backend) right after it is defined, and running with GGML_SPLIT_SYNC_MODE_WRITE_READ_BOUNDARY.
This leads to the desired saaasg pattern:

Green cudaStreamSynchronize, followed by red copies, another synchronize and lastly the cudaGraphLaunch.
The performance loss on Windows is not significant, even though measurements are a bit noisy on the RTX PRO 6000 Max-Q (unlike the non-Max-Q card I used before). On Windows, we lose between ~1% and ~2% compared to earlier tests:
| Model | Test | t/s b7772 | t/s akieslinger/reduce-per-token-syncs | Speedup |
|:----------------------|:-------|------------:|-----------------------------------------:|----------:|
| gpt-oss 20B MXFP4 MoE | tg128 | 295.02 | 317.25 | 1.08 |
| gpt-oss 20B MXFP4 MoE | tg256 | 299.84 | 318.51 | 1.06 |
| gpt-oss 20B MXFP4 MoE | tg512 | 294.47 | 316.15 | 1.07 |
| llama 3B Q4_K_M | tg128 | 347.32 | 357.16 | 1.03 |
| llama 3B Q4_K_M | tg256 | 364.00 | 373.05 | 1.02 |
| llama 3B Q4_K_M | tg512 | 360.34 | 372.52 | 1.03 |
| qwen3 4B Q4_K_M | tg128 | 267.54 | 275.00 | 1.03 |
| qwen3 4B Q4_K_M | tg256 | 276.73 | 286.24 | 1.03 |
| qwen3 4B Q4_K_M | tg512 | 271.34 | 284.77 | 1.05 |
On Linux, I see the same 1-2% gain as before.
Note that this run still retains the proposed changes in ggml-cuda.cu. These are now synced by the added ggml_backend_synchronize(split_backend) call, as cpy_tensor_async() is called after this first sync, and because the host orchestration loop is implicitly synced with the CPU backend, as already detailed.
To sum up, the performance gain of this simplified approach is 1-2% less than before, but we still gain up to 7-8% on Windows.
|
I double-checked partial GPU offload and MoE offloading via nsight systems and spot-checked coherent output via

**Detailed Analysis for Partial GPU offload**

I ran

Lastly, I saw another interesting transition: CPU (

Ultimately, my code changes only touch the CPU->CUDA transitions; other transitions work the same, and are at times already optimal (CUDA->CUDA). Partial offloading is consequently working correctly with my changes.

**Detailed analysis for MoE offloading**

Setting

This means that, like in partial GPU offloading, only CPU to CUDA and CUDA to CPU are relevant. MoE offloading consequently did not show any new behavior which clashes with my changes.

I'll look into multi-GPU in a subsequent post. |
ggml/src/ggml-backend.cpp (outdated):

```diff
 if (input->flags & GGML_TENSOR_FLAG_INPUT) {
     // inputs from the user must be copied immediately to prevent the user overwriting the data before the copy is done
     if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
         ggml_backend_event_synchronize(sched->events[split_backend_id][sched->cur_copy]);
     } else {
-        ggml_backend_synchronize(split_backend);
+        ggml_backend_synchronize_if_required(split, split_backend);
     }
-    ggml_backend_tensor_copy(input, input_cpy);
+    ggml_backend_tensor_copy_async(input_backend, split_backend, input, input_cpy);
+    ggml_backend_synchronize_if_required(split, split_backend, last_input);
```
If I am following the logic correctly, I think there is a very simple alternative approach that would still eliminate most of the synchronizations, without any changes to the backend API/infra.
So on master, for multiple inputs N to the current split, we do the following:
```
# method a
# s - synchronize split backend
# c - sync copy (ggml_backend_tensor_copy) (each of these blocks the CPU thread)
sccc...[n times]...ccc
```

Let's say for starters, we want to remove just the inter-input syncs (keep it simple):
```
# method b
# s - synchronize split backend
# a - async copy (ggml_backend_cuda_cpy_tensor_async)
saaa...[n times]...aaas
```
saaa...[n times]...aaasNote that we keep the syncs at the start and at the end in order to keep this functionality the same as on master (currently, I am still not 100% sure in which cases these syncs can be safely avoided and am worried we might be missing some possible data race)
_[actually, this part is irrelevant - events are not needed]_
So to achieve this second method, we can use the existing ggml_backend_tensor_copy_async and in addition, record an event after all the copies:
```
# s - synchronize backend
# a - async copy
# r - record event
# w - synchronize backend on the event
saaa...[n times]...aaarw
```

This logic only works if the split backend has async copies and events supported. Otherwise, we fall back to method a.
I think this change would be:
- safe as we only "merge" sequential input copies
- not as optimal as the current proposal, but mostly there
- simpler as it avoids introducing new concepts to the backend logic
Does this make sense?
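The gain of method b over method a can be framed as a host-stall count per split with n input copies; a back-of-the-envelope sketch, under the toy assumption that every explicit sync and every blocking copy stalls the host exactly once:

```cpp
#include <cassert>

// Toy count of how often the CPU thread blocks per split with n input copies.
int host_stalls_method_a(int n) {
    return 1 + n; // s + n blocking copies (each ggml_backend_tensor_copy blocks)
}

int host_stalls_method_b(int n) {
    (void) n;     // the n async copies do not block the host
    return 2;     // only the leading and trailing syncs do
}
```

So for a split with 8 inputs, method a stalls the host 9 times and method b only twice, independent of n.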
I would argue that your proposal would actually be a bit less safe, more complex to verify and merge, and also less performant.
I agree that your proposal is a bit simpler, but I think 2 local helper functions + 1 enum is ok for the impact it enables.
Default behavior
- Not changing the default behavior is an explicit feature of my proposal. No changed behavior for unenrolled backends means less test overhead and overall lower risk for this PR.
- I only add two cases next to the default behavior with explicit syncs before and after each copy:
  - `GGML_SPLIT_SYNC_MODE_IMPLICIT`: `aaaaaaaaag` (where `g` = graph launch). This is the case CUDA supports.
  - `GGML_SPLIT_SYNC_MODE_WRITE_READ_BOUNDARY`: `aaaaaaaaasg`. This is the case vulkan likely supports. If vulkan does indeed support it and if it benefits from needing fewer syncs, we can enroll it using ~3 LoC.
- If you think that the pattern `saaaaaaaaasg` could be useful, I can add this case, too. We could name it something like `GGML_SPLIT_SYNC_MODE_COPY_BOUNDARY`.
- Generally, having these 3-4 modes allows us to evaluate relaxing synchronizations on a case-by-case basis, and enroll compatible backends over time. This splits the risk and the verification overhead. I think you proposed setting the "new default" for all backends to this case, and testing/validating that across all backends looks more complex, at least for me.
Perf
Regarding performance, on gpt-oss we gain 8-9% by avoiding 20 explicit syncs per inference pass in total. Each original sync here costs ~0.4%-0.45%.
Dropping only the syncs between copies from the host orchestration loop to the backend would leave 5 (3 for CPU split to CUDA split + 1 start + 1 end of HtoD copies) syncs open, so 2-2.5%.
That is roughly a quarter of the current performance gains.
**Pipeline Parallelism Analysis**

The short answer is that pipeline parallelism uses the event-based scheduling mechanism. Below, I briefly analyzed the differences I found compared to the single GPU setting. These findings could be a basis for future optimizations.

**Detailed Analysis**

With

**CPU->CUDA0**

Here, we see a single async copy from the CPU backend to the CUDA backend, followed by 8 host to device copies from the host orchestration loop. The behavior of the CPU backend to CUDA backend copy differs compared to previous analyses:

**CUDA0->CUDA1**

In #6017, slaren implemented async memory copies for this special case from CUDA to CUDA. The screenshot below shows this; using

There is also some synchronization between the async copies. On the CPU, these do not matter anymore, as a GPU works in parallel to this (blue row at the top), but it might be useful to analyze this further, as the actual switch from GPU0 to GPU1 looks a bit slow at around ~755us:

Here, GPU execution is also indicated at the top of the image in blue. First, the graph is executed on GPU0, then computation is continued one row below on GPU1. In between, both GPUs idle and all synchronization and copies take place. |
Force-pushed d941624 to 5034414
|
Another round of testing; comparing my second-to-last commit (43f6684) and the simplified version as the last commit of this branch. Windows: Both are run on the RTX PRO 6000 Max-Q. As with the hacky benchmark presented in my previous comment here, this PR still yields up to 7-8% improvement on Windows. |
ggerganov
left a comment
I think the current patch is good. However, note that I myself am still grasping the logic of the scheduler and it's possible that we are missing something, so I prefer to merge this when we are more confident and potentially get more feedback from other maintainers.
I am working on some updates for the Metal backend (#18919, #18966) that will allow me to exercise and understand this logic in my development environment. So unless we get strong feedback that this change works as expected and does not have side effects, I prefer to take the time to fully understand the backend scheduler logic before merging.
|
The changes as they are right now are I think correct for CUDA since the copy is queued on the |
Sorry, I was looking at the wrong function. In

```c
// asynchronous copy
// the copy is performed after all the currently queued operations in backend_src
// backend_dst will wait for the copy to complete before performing other operations
// automatic fallback to sync copy if async is not supported
GGML_API void ggml_backend_tensor_copy_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, struct ggml_tensor * src, struct ggml_tensor * dst);
```

So based on these requirements I think the logic in this PR is correct. |
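The ordering guarantees quoted from the header can be modeled with a toy single-threaded sketch (hypothetical types, not the ggml implementation): work already queued on the source runs before the copy, and work submitted to the destination after the call runs after it.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy model of the async-copy contract: a shared log records execution order.
struct ToyCtx {
    std::string log;
    std::vector<std::function<void()>> src_q, dst_q;
};

// Models the quoted guarantees: flush pending source work, then the copy.
void toy_copy_async(ToyCtx & ctx) {
    for (auto & op : ctx.src_q) op(); // copy runs after all queued src operations
    ctx.src_q.clear();
    ctx.log += "C";                   // the copy itself
}

std::string demo_contract() {
    ToyCtx ctx;
    ctx.src_q.push_back([&ctx] { ctx.log += "s"; }); // queued src op
    ctx.src_q.push_back([&ctx] { ctx.log += "s"; }); // queued src op
    toy_copy_async(ctx);                             // ordered after both
    ctx.dst_q.push_back([&ctx] { ctx.log += "d"; }); // dst op submitted later
    for (auto & op : ctx.dst_q) op();                // dst waited for the copy
    ctx.dst_q.clear();
    return ctx.log;
}
```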
Force-pushed 3ecffb2 to 81c9618
@jeffbolznv Do you have an opinion on the latest version of this PR? Are your earlier concerns still there? I've been doing some tests on my end, getting familiar with the scheduler and I think the logic is sound. Thinking of merging this unless there are some remaining concerns. I believe @JohannesGaessler is mostly OK too. Most likely we can merge after we resolve some of the ongoing CUDA issues (e.g. #19645 (comment), #19816) in order to avoid overlapping too many changes at the same time. In any case, happy to hear any feedback and opinions. |
|
From my end I think the changes in this PR are correct and would be fine to merge. If we wanted to be extra safe we could wait until we have end-to-end tests though I suspect that those would still be inconclusive for issues with synchronization. |
|
I'm ok with the current changes. |
…ml-org#17795)

* Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async()
* Adds function to relax sync requirements between input copies on supported backends (CUDA for now)
* Exchanges synchronous copy with async copy function.
* Adds macro guards to allow compilation in non-CUDA builds
* Reworked backend detection in ggml-backend.cpp to avoid linking conflicts
* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues
* Minor cleanup
* Makes opt-in to relax use of explicit syncs more general. Backends like vulkan which require a synchronization between HtoD copies and graph execution could also adopt this change now.
* Reintroduces stricter check for CPU->CUDA backend async copy via GGML_DEVICE_TYPE_CPU.
* Corrects initialization of ggml_backend_sync_mode in ggml_backend_sched_split initialization
* Simplifies synchronizations to adhere to `saaasg` pattern.
* Apply suggestion from @ggerganov (src->buffer to buf_src)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestion from @ggerganov (src->buffer to buf_src) v2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
See comment below
This PR suggests removing some superfluous synchronization calls between tokens to be faster on CUDA backends. I see between 1% and 2% perf gain depending on the model, GPU and settings.
Mechanism
The performance impact is best explained visually. Here are the "before" and "after" nsight systems traces. They are not to scale. The relevant part is the row with the green and red bubbles. Both images show the overhead between the GPU execution of two tokens. The generation of the n-th token ends on the left-hand side of the screenshot, in the green bubble titled `cudaStreamSynchronize`. The calculation of the next (n+1st) token starts on the right-hand side, at the green bar titled `cudaGraphLaunch`. In between, there is CPU orchestration overhead. This PR aims to shrink the time spent in the middle, between GPU token generation.

Original:

In the middle of the above image, we see red and green bubbles alternating. In this case, the green bubbles are synchronization steps, the red bubbles are asynchronous copy calls from host to device. If async operations are immediately followed by synchronization calls, they are executed synchronously. This is not efficient. Removing the green synchronization operations between asynchronous copy calls leads to asynchronous copies and reduced overhead between GPU token generation:
Performance
I benchmarked on an RTX Pro 6000 Blackwell using `./llama-bench -m $models -p 0 -n 128,256,512 -fa 1`.

My testing shows around 1% improvement, with `gpt-oss-20b` gaining up to 1.4%. `llama 3B Q4_K - Medium` shows very high variance, prompting me to run the tests again with `-r 100`. At `-r 100`, a clearer trend of improved performance for `gemma3n E2B Q8_0` is also visible.

Details with default `-r 5`
Baseline:
PR:
Speedup:
Details with `-r 100`
Baseline:
PR:
Speedup:
Implementation Concerns
See updated comment below.
@ggerganov @JohannesGaessler