
Sched: Reintroduce fewer synchronizations between tokens, with fixed pipeline parallelism. #20793

Draft
aendk wants to merge 2 commits into ggml-org:master from aendk:akieslinger/rework-reduce-per-token-syncs

Conversation

@aendk (Contributor) commented Mar 20, 2026

Follow-up to #20463 (comment).

#17795 improved performance in the single-GPU setting on CUDA, but it was rolled back due to a bug surfacing in multi-GPU pipeline-parallel settings.

For the single-GPU setting, it moved the scheduling from the sassassasg pattern to the more efficient saaasg pattern, where s = sync, a = async copy, g = graph execution.
Previously, each asynchronous copy was enclosed in two synchronizations. Removing the superfluous synchronizations improved performance, especially on Windows. The change was to perform only a single synchronization between the memory copies and graph execution.

However, in multi-GPU settings we saw llama-perplexity regressions indicating incorrect scheduling (#20463).

I found that the event-based pipeline-parallelism scheduling mechanism very likely relies implicitly on synchronous copies: (i) in my testing, copy_from_host worked as intended, and (ii) disabling it, and thereby reintroducing synchronous copies, fixed the bug; the perplexity reported by llama-perplexity was then identical to master.

The proposed fix here is therefore to enroll pipeline parallelism into the same synchronization between async copies and graph execution that the single-GPU case already uses.
I think this is a good solution: it keeps scheduling similar between single-GPU and multi-GPU runs, and it is simpler and safer than reworking the event-driven pipeline-parallelism logic.

In my testing, this proposal has the same performance benefits as the initial PR, and it yields correct perplexity scores in both single- and multi-GPU settings.
As the original bug surfaced in the community, with its more diverse hardware setups and usage scenarios, it would be awesome if you could test-drive this change with both llama-bench and llama-perplexity, using your usual models and launch options!
@mxxm-t @slavap @Superbobo75 @thejacer

If you can, check out this branch and compare it against its base on master (git checkout HEAD~2). Let me know if you run into performance or accuracy issues!

aendk and others added 2 commits March 19, 2026 14:53
…ml-org#17795)

* Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async()

* Adds function to relax sync requirements between input copies on supported backends (CUDA for now)

* Exchanges synchronous copy with async copy function.

* Adds macro guards to allow compilation in non-CUDA builds

* Reworked backend detection in ggml-backend.cpp to avoid linking conflicts

* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues

* Minor cleanup

* Makes opt-in to relax use of explicit syncs more general. Backends like vulkan which require a synchronization between HtoD copies and graph execution could also adopt this change now.

* Reintroduces stricter check for CPU->CUDA backend async copy via GGML_DEVICE_TYPE_CPU.

* Corrects initialization of ggml_backend_sync_mode in ggml_backend_sched_split initialization

* Simplifies synchronizations to adhere to `saaasg` pattern.

* Apply suggestion from @ggerganov (src->buffer to buf_src)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestion from @ggerganov (src->buffer to buf_src) v2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@github-actions bot added labels on Mar 20, 2026: "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning)
