Sched: Reintroduce less synchronizations between token, with fixed pipeline parallelism. #20793
Draft
aendk wants to merge 2 commits into ggml-org:master from
Conversation
…ml-org#17795)

* Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async()
* Adds function to relax sync requirements between input copies on supported backends (CUDA for now)
* Exchanges synchronous copy with async copy function.
* Adds macro guards to allow compilation in non-CUDA builds
* Reworked backend detection in ggml-backend.cpp to avoid linking conflicts
* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues
* Minor cleanup
* Makes opt-in to relax use of explicit syncs more general. Backends like vulkan which require a synchronization between HtoD copies and graph execution could also adopt this change now.
* Reintroduces stricter check for CPU->CUDA backend async copy via GGML_DEVICE_TYPE_CPU.
* Corrects initialization of ggml_backend_sync_mode in ggml_backend_sched_split initialization
* Simplifies synchronizations to adhere to `saaasg` pattern.
* Apply suggestion from @ggerganov (src->buffer to buf_src)

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Apply suggestion from @ggerganov (src->buffer to buf_src) v2

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…nput copying and graph computation.
Follow up to #20463 (comment).
#17795 improved performance in the single-GPU setting on CUDA, but it was rolled back due to a bug surfacing in multi-GPU pipeline-parallel settings.

For the single-GPU setting, it moved the scheduling from `sassassasg` to the more efficient `saaasg` pattern, where `s` = sync, `a` = async copy, `g` = graph execution. Previously, each asynchronous copy was enclosed in two synchronizations; removing the superfluous ones improved performance, especially on Windows. The change was to issue only a single synchronization between the memory copies and graph execution.
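As a plain-Python sketch of the pattern change (the `schedule` helper and `relaxed` flag are hypothetical illustrations, not the actual ggml-backend code):

```python
def schedule(n_copies, relaxed):
    """Build a per-token event string: 's' = sync, 'a' = async copy,
    'g' = graph execution. Toy model of the scheduler, not ggml code."""
    events = []
    if relaxed:
        # saaasg: one sync before the copies, one sync before graph execution
        events.append("s")
        events.extend("a" * n_copies)
        events.append("s")
    else:
        # sassas...sg: every async copy is enclosed in two synchronizations,
        # so adjacent copies produce redundant back-to-back syncs
        for _ in range(n_copies):
            events.extend(["s", "a", "s"])
    events.append("g")
    return "".join(events)

print(schedule(3, relaxed=False))  # sassassasg -- previous pattern
print(schedule(3, relaxed=True))   # saaasg     -- relaxed pattern
```

For three input copies, the relaxed pattern replaces four synchronizations with two, while still guaranteeing that all copies have completed before the graph runs.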
However, in multi-GPU settings we saw `llama-perplexity` regressions indicating incorrect scheduling (#20463). I found that the event-based pipeline-parallelism scheduling mechanism very likely relies implicitly on synchronous copies, as (i) in my testing `copy_from_host` worked as intended, and (ii) disabling it, and thereby reintroducing synchronous copies, fixed the bug: `llama-perplexity` output was then identical to master.

The proposed fix is therefore to enroll pipeline parallelism in the same synchronization between async copies and graph execution that the single-GPU case already has.
I think this is a good solution, as it keeps scheduling similar between the single-GPU and multi-GPU cases, and because it is simpler and safer than reworking the event-driven pipeline-parallelism logic.
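A minimal sketch of the proposed behavior (the toy `Backend` class and `run_split` helper below are hypothetical; the real change lives in the ggml-backend scheduler): each split's async input copies are followed by one explicit synchronization before that split's graph is launched, in the pipeline-parallel path just as in the single-GPU path, rather than relying on the event-based mechanism alone.

```python
class Backend:
    """Toy stand-in for a ggml backend that records operation order."""
    def __init__(self):
        self.log = []
    def copy_async(self, tensor):
        self.log.append(("copy", tensor))   # HtoD / DtoD async copy
    def synchronize(self):
        self.log.append(("sync",))
    def graph_compute(self):
        self.log.append(("graph",))

def run_split(backend, inputs):
    # Async-copy all inputs of the split, then a single sync so the
    # copies have landed before the split's graph executes.
    for t in inputs:
        backend.copy_async(t)
    backend.synchronize()
    backend.graph_compute()

# Two "devices" in a pipeline: both now follow copy* -> sync -> graph.
devices = [Backend(), Backend()]
for dev in devices:
    run_split(dev, ["inp0", "inp1"])
print(devices[0].log)
# [('copy', 'inp0'), ('copy', 'inp1'), ('sync',), ('graph',)]
```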
In my testing, this proposal has the same performance benefits as the initial PR, and it yields correct perplexity scores in both single- and multi-GPU settings.
As this bug surfaced in the community, with its more diverse hardware setups and usage scenarios, it would be awesome if you could test-drive this change with both `llama-bench` and `llama-perplexity`, using your usual models and launch options! @mxxm-t @slavap @Superbobo75 @thejacer
If you can, check out this branch and compare this against its master (`git checkout HEAD~2`). Let me know if you run into performance or accuracy issues!