Pull Request #456 Performance Summary

Scope: CUDA backend synchronization optimization

Overview

This PR reduces CPU-GPU synchronization overhead in the CUDA backend by introducing conditional synchronization logic. The changes enable asynchronous memory copies between CPU and CUDA without unnecessary synchronization calls, allowing better overlap between CPU and GPU operations during token generation.

Key Findings

Performance-Critical Area Impact: The modifications target the inference scheduling path in

Absolute Time Savings: The measured 1-1.4% improvement translates to approximately 20,000-50,000 ns saved per token in scheduling overhead. For a model generating 400 tokens/second, this represents an 8,000,000-20,000,000 ns reduction per second of inference time.

Tokens Per Second Impact: The core inference functions (

Impacted Functions:

Power Consumption: No power consumption analysis data is available for this comparison. The optimization reduces CPU idle time during synchronization, potentially lowering CPU power draw, but GPU power consumption remains unchanged, as the compute workload is unaffected.

Backend Compatibility: Non-CUDA backends (Metal, HIP, CPU) retain explicit synchronization, maintaining identical behavior and zero performance impact.
Performance Analysis Summary: PR #456

Overview

This PR implements CUDA-specific optimizations to reduce CPU-GPU synchronization overhead during token generation. The changes introduce conditional synchronization skipping for backends with implicit stream-based ordering (CUDA) and enable async CPU-to-CUDA memory copies. The modifications span 2 files with 58 lines changed, targeting the scheduler and CUDA backend.

Key Findings

Performance-Critical Area Impact: The changes directly affect the inference pipeline through

Inference Impact: The

Affected functions in inference path:

Power Consumption Analysis: The detected 0.242% power consumption increase in

Implementation Details: The PR introduces
Based on the analysis, no functions were identified with meaningful performance changes between the base and target versions. The code modifications did not result in measurable performance impact.
- ggml_backend_cuda_cpy_tensor_async()
- supported backends (CUDA for now)
- …fer type to just buffer type, to avoid linking issues
- vulkan, which requires a synchronization between HtoD copies and graph execution, could also adopt this change now.
- GGML_DEVICE_TYPE_CPU.
- ggml_backend_sched_split initialization
Based on the comprehensive analysis, here is the performance review report:

Performance Review Report

Overview

This review analyzes performance changes across 11 commits focused on implementing asynchronous tensor copy operations in the GGML backend scheduler. The changes modified 35 files, added 37 new files, and deleted 3 files, primarily targeting the backend synchronization infrastructure.

Performance Impact Summary

The changes introduce moderate performance improvements, with two functions showing measurable changes:

1.
2.

Code Changes Analysis

The commit history reveals a systematic implementation of asynchronous operations following the "saaasg" pattern (sync-async-async-async-sync-graph). Key changes include:

Performance-Critical Function Analysis

Justification

The performance changes align with the stated goal of implementing asynchronous operations for high-throughput inference workloads. The commit messages document a careful, iterative approach to introducing async capabilities while maintaining backward compatibility. The throughput gains in the scheduler function justify the modest latency increase, as the optimization targets production scenarios where batch-processing efficiency outweighs per-operation latency. The changes preserve correctness through graceful fallback to synchronous mode for unsupported backends.

Conclusion: The changes deliver meaningful throughput improvements for multi-backend inference scenarios with minimal latency overhead, successfully implementing async tensor copy operations as intended.
Mirrored from ggml-org/llama.cpp#17795
[DRAFT]
This PR proposes removing some superfluous synchronization calls between tokens to speed up CUDA backends. I see between 1% and 2% performance gain depending on the model, GPU, and settings.
Mechanism
The performance impact is best explained visually. Here are the "before" and "after" Nsight Systems traces. They are not to scale. The relevant part is the row with the green and red bubbles. Both images show the overhead between the GPU execution of two tokens. The generation of the n-th token ends on the left-hand side of the screenshot, in the green bubble titled `cudaStreamSynchronize`. The calculation of the next ((n+1)-th) token starts on the right-hand side, at the green bar titled `cudaGraphLaunch`. In between, there is CPU orchestration overhead. This PR aims to shrink the time spent in the middle, between GPU token generation.

Original:

In the middle of the above image, red and green bubbles alternate. Here, the green bubbles are synchronization steps and the red bubbles are asynchronous copy calls from host to device. If async operations are immediately followed by synchronization calls, they effectively execute synchronously. This is not efficient. Removing the green synchronization operations between asynchronous copy calls leads to truly asynchronous copies and reduced overhead between GPU token generation:
Performance
I benchmarked on an RTX Pro 6000 Blackwell using `./llama-bench -m $models -p 0 -n 128,256,512 -fa 1`. My testing shows around 1% improvement, with `gpt-oss-20b` gaining up to 1.4%. `llama 3B Q4_K - Medium` shows very high variance, prompting me to run the tests again with `-r 100`. At `-r 100`, a clearer trend of improved performance for `gemma3n E2B Q8_0` is also visible.

Details with default `-r 5`
Baseline:
PR:
Speedup:
Details with `-r 100`
Baseline:
PR:
Speedup:
Implementation Concerns
The approach here aims to minimize changes in the general backend and in other backends. However, the synchronization calls originate from the general backend, so some changes there are unavoidable, as is retaining a synchronization call after the last copy to ensure correctness across backends.
Additionally, AFAIK there is no documentation on the functional guarantees of a function like `ggml_copy_tensor`, and the current design proposal may violate existing assumptions, or practices around potentially breaking ABIs between ggml and llama.cpp. For this reason, this PR is a draft.

I also have not yet adapted the `ggml_backend_buffer_i` interface changes (added `set_tensor_async` + whitespace) for the other backends. Please advise on the best course of action here.
For example, we could make `set_tensor` in the CUDA backend async by default. This would avoid interface changes, but would make the behavior of similar functions differ between backends. @ggerganov @JohannesGaessler