UPSTREAM PR #16991: CUDA: add stream-based concurrency (#72)
Conversation
Access the complete analysis in the LOCI Dashboard.

**Performance Analysis Summary**

**Overview**

The analysis examined PR #72, which implements CUDA stream-based concurrency for Q, K, V branch parallelization in llama.cpp. The changes introduce concurrent CUDA streams to improve GPU utilization, with demonstrated 2-7% throughput gains in benchmarks.

**Performance Impact Assessment**

- Highest Performance Changes:
- Core Function Impact:
- Power Consumption Analysis:
- Flame Graph and CFG Analysis:
- Code Review Insights: The changes are well-architected with appropriate safeguards and demonstrate measurable performance improvements in the target use case (Q, K, V parallelization).

**Conclusion**

The observed performance variations (sub-nanosecond changes) fall within measurement precision limits and do not impact inference performance. The CUDA concurrency implementation represents a positive enhancement to GPU utilization without introducing performance regressions in critical paths. No actionable performance optimizations are required based on the current analysis.
Explore the complete analysis inside the Version Insights.

**Performance Analysis Summary**

**PR #72: CUDA Stream-Based Concurrency Implementation**

This PR introduces concurrent CUDA stream execution for Q, K, V branch parallelization in single-GPU configurations. The implementation adds 448 lines across 3 files, establishing fork-join pattern detection and multi-stream scheduling infrastructure. Performance analysis shows a 0.0% power consumption change across all binaries and sub-nanosecond timing variations in non-critical utility functions.

**Key Findings**

- Performance-Critical Areas Impact: The changes do not modify any functions within the core inference pipeline identified in the project summary. The observed variations occur in graph construction utilities.
- Tokens Per Second Impact: No impact on inference throughput.
- Power Consumption Analysis: All 16 binaries maintain stable power consumption.
- Code Implementation: The PR adds CUDA stream management infrastructure including concurrent event structures, a graph optimization engine targeting 3-branch fan-out patterns, and per-stream memory pools. The optimization is disabled by default (requires `GGML_CUDA_ENABLE_GRAPH_OPT=1`).
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Explore the complete analysis inside the Version Insights.

**Performance Analysis Summary - PR #72**

**Overview**

This PR introduces CUDA stream-based concurrency for single-GPU inference, targeting parallel execution of the Q, K, V attention branches. The implementation adds 469 lines across 3 CUDA backend files, implementing fork-join parallelism with event-based synchronization.

**Performance Metrics Analysis**

- Static Analysis Results:
- Measured Runtime Performance (from PR benchmarks):

**Key Findings**

- Code Implementation:
- Inference Impact:
- Power Consumption:
- Analysis Limitation:
Explore the complete analysis inside the Version Insights.

**Performance Analysis Summary: PR #72 - CUDA Stream-Based Concurrency**

**Overview**

This PR introduces concurrent CUDA stream execution for Q, K, V tensor processing in single-GPU configurations, gated behind the `GGML_CUDA_ENABLE_GRAPH_OPT` environment variable.

**Performance Metrics**

- Binary-Level Analysis:
- Function-Level Analysis:

**Key Findings**

- Code Implementation:
- Inference Impact:
- Expected Runtime Behavior (When Enabled):
- Power Consumption:
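The summaries above mention a graph optimization engine that detects a 3-branch fan-out (one tensor feeding the separate Q, K, and V projections) as the fork point for concurrent streams. The detection idea can be sketched on a toy graph; the node names and data structures below are illustrative only, not the PR's actual ggml code:

```cpp
// Illustrative sketch: find the fork point of a fork-join region by
// locating nodes whose output is consumed by exactly `width` other nodes.
#include <map>
#include <string>
#include <vector>

struct Node {
    std::string name;
    std::vector<std::string> inputs;  // names of producer nodes
};

std::vector<std::string> find_fanout_roots(const std::vector<Node>& graph,
                                           size_t width) {
    // Count how many nodes consume each producer's output.
    std::map<std::string, size_t> consumers;
    for (const Node& n : graph)
        for (const std::string& in : n.inputs)
            ++consumers[in];

    std::vector<std::string> roots;
    for (const Node& n : graph)
        if (consumers[n.name] == width)
            roots.push_back(n.name);
    return roots;
}

// Toy attention subgraph: one normalized input feeds Q, K, and V,
// which then join again (hypothetical node names).
std::vector<Node> example_graph() {
    return {
        {"attn_norm", {}},
        {"Qcur", {"attn_norm"}},
        {"Kcur", {"attn_norm"}},
        {"Vcur", {"attn_norm"}},
        {"attn_out", {"Qcur", "Kcur", "Vcur"}},
    };
}
```

Here `find_fanout_roots(example_graph(), 3)` reports `attn_norm` as the fork point; a real implementation would additionally check that the three branches rejoin and have no cross-dependencies before scheduling them on separate streams.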
Mirrored from ggml-org/llama.cpp#16991
Possibly supersedes #16786.
This PR adds support for running concurrent CUDA streams on single-GPU setups.
At the moment this only targets the Q, K, V branch. I feel this is the "correct" approach in case the Q, K, V tensors are of different types or are not in the same place in memory. The downside is that this approach doesn't come for free and there's some complexity involved; I'm not an expert on the ggml graph, and I feel it could be simplified.
Currently this is hidden behind an env variable flag. To run, you can use `GGML_CUDA_ENABLE_GRAPH_OPT=1`.

TG performance is in line with the previous PR (2-7% gain). We leave some performance on the table where we don't fuse operations within the parallel streams themselves (e.g. MUL_MAT + BIAS, RMS_NORM + MUL, etc.); I couldn't find a simple enough way to enable fusion there.
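For readers unfamiliar with the pattern, fork-join concurrency across CUDA streams is conventionally built from events: the side streams wait on an event recorded on the main stream, and the main stream then waits on an event recorded after each branch. A minimal sketch using the standard CUDA runtime API follows; the kernel names are placeholders and this is not the PR's actual code:

```cuda
// Illustrative fork-join across CUDA streams (not the PR's code).
#include <cuda_runtime.h>

void launch_qkv_concurrent(cudaStream_t main_stream,
                           cudaStream_t side[3],   // one stream per Q/K/V branch
                           cudaEvent_t  fork,
                           cudaEvent_t  join[3]) {
    // Fork: side streams must not start before prior work on main_stream.
    cudaEventRecord(fork, main_stream);
    for (int i = 0; i < 3; ++i) {
        cudaStreamWaitEvent(side[i], fork, 0);
        // Launch the branch's kernels on side[i] here, e.g. (placeholder):
        // qkv_branch_kernel<<<grid, block, 0, side[i]>>>(...);
        cudaEventRecord(join[i], side[i]);
    }
    // Join: main_stream resumes only after all three branches finish.
    for (int i = 0; i < 3; ++i) {
        cudaStreamWaitEvent(main_stream, join[i], 0);
    }
}
```

All synchronization here is device-side (no `cudaDeviceSynchronize`), so the host never blocks; the complexity the PR description alludes to comes from detecting safe fork-join regions in the ggml graph and managing per-stream memory pools, not from the event choreography itself.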
Before:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
After:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes