Sched: Reintroduce less synchronizations between token, with fixed pipeline parallelism. #20793
Draft
aendk wants to merge 2 commits into ggml-org:master from
Conversation
…ml-org#17795)

* Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async()
* Adds function to relax sync requirements between input copies on supported backends (CUDA for now)
* Exchanges synchronous copy with async copy function.
* Adds macro guards to allow compilation in non-CUDA builds
* Reworked backend detection in ggml-backend.cpp to avoid linking conflicts
* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues
* Minor cleanup
* Makes opt-in to relax use of explicit syncs more general. Backends like vulkan which require a synchronization between HtoD copies and graph execution could also adopt this change now.
* Reintroduces stricter check for CPU->CUDA backend async copy via GGML_DEVICE_TYPE_CPU.
* Corrects initialization of ggml_backend_sync_mode in ggml_backend_sched_split initialization
* Simplifies synchronizations to adhere to `saaasg` pattern.
* Apply suggestion from @ggerganov (src->buffer to buf_src)

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Apply suggestion from @ggerganov (src->buffer to buf_src) v2

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…nput copying and graph computation.
Follow up to #20463 (comment).
#17795 improved performance in the single-GPU setting on CUDA, but it was rolled back due to a bug surfacing in multi-GPU pipeline-parallel settings.

For the single-GPU setting, it moved the scheduling from `sassassasg` to the more efficient `saaasg` pattern, where `s` = sync, `a` = async copy, `g` = graph execution. Previously, each asynchronous copy was enclosed in two synchronizations; removing the superfluous ones improved performance, especially on Windows. The change was to issue only a single synchronization between the memory copies and graph execution.
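As a plain-Python sketch of the pattern change (the `schedule` helper and `relaxed` flag are hypothetical illustrations, not the actual ggml-backend code):

```python
def schedule(n_copies, relaxed):
    """Build a per-token event string: 's' = sync, 'a' = async copy,
    'g' = graph execution. Toy model of the scheduler, not ggml code."""
    events = []
    if relaxed:
        # saaasg: one sync before the copies, one sync before graph execution
        events.append("s")
        events.extend("a" * n_copies)
        events.append("s")
    else:
        # sassas...sg: every async copy is enclosed in two synchronizations,
        # so adjacent copies produce redundant back-to-back syncs
        for _ in range(n_copies):
            events.extend(["s", "a", "s"])
    events.append("g")
    return "".join(events)

print(schedule(3, relaxed=False))  # sassassasg -- previous pattern
print(schedule(3, relaxed=True))   # saaasg     -- relaxed pattern
```

For three input copies, the relaxed pattern replaces four synchronizations with two, while still guaranteeing that all copies have completed before the graph runs.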
However, in multi-GPU settings we saw `llama-perplexity` regressions indicating incorrect scheduling (#20463). I found that the event-based pipeline-parallelism scheduling mechanism very likely relies implicitly on synchronous copies, as (i) in my testing `copy_from_host` worked as intended, and (ii) disabling it, and thereby reintroducing synchronous copies, fixed the bug: `llama-perplexity` output was then identical to master.

The proposed fix is therefore to enroll pipeline parallelism in the same synchronization between async copies and graph execution that the single-GPU case already has.
I think this is a good solution, as it keeps scheduling similar between the single-GPU and multi-GPU cases, and because it is simpler and safer than reworking the event-driven pipeline-parallelism logic.
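A minimal sketch of the proposed behavior (the toy `Backend` class and `run_split` helper below are hypothetical; the real change lives in the ggml-backend scheduler): each split's async input copies are followed by one explicit synchronization before that split's graph is launched, in the pipeline-parallel path just as in the single-GPU path, rather than relying on the event-based mechanism alone.

```python
class Backend:
    """Toy stand-in for a ggml backend that records operation order."""
    def __init__(self):
        self.log = []
    def copy_async(self, tensor):
        self.log.append(("copy", tensor))   # HtoD / DtoD async copy
    def synchronize(self):
        self.log.append(("sync",))
    def graph_compute(self):
        self.log.append(("graph",))

def run_split(backend, inputs):
    # Async-copy all inputs of the split, then a single sync so the
    # copies have landed before the split's graph executes.
    for t in inputs:
        backend.copy_async(t)
    backend.synchronize()
    backend.graph_compute()

# Two "devices" in a pipeline: both now follow copy* -> sync -> graph.
devices = [Backend(), Backend()]
for dev in devices:
    run_split(dev, ["inp0", "inp1"])
print(devices[0].log)
# [('copy', 'inp0'), ('copy', 'inp1'), ('sync',), ('graph',)]
```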
In my testing, this proposal has the same performance benefits as the initial PR, and it yields correct perplexity scores in both single- and multi-GPU settings.
As this bug surfaced in the community, with its more diverse hardware setups and usage scenarios, it would be awesome if you could test-drive this change with both `llama-bench` and `llama-perplexity`, using your usual models and launch options! @mxxm-t @slavap @Superbobo75 @thejacer
If you can, check out this branch and compare this against its master (`git checkout HEAD~2`). Let me know if you run into performance or accuracy issues!