
UPSTREAM PR #16991: CUDA: add stream-based concurrency#72

Open
DajanaV wants to merge 17 commits into main from upstream-PR16991-branch_am17an-fused-qkv-stream

Conversation

@DajanaV DajanaV (Collaborator) commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16991

Possibly supersedes #16786.

This PR adds support for running concurrent CUDA streams on single-GPU setups.
At the moment this only targets the Q, K, V branch. I feel this is the "correct" approach in case the Q, K, V tensors are of different types or are not in the same place in memory. The downside is that this approach doesn't come for free and there's some complexity involved; I'm not an expert on the ggml graph, and I feel it could be simplified.

Currently this is gated behind an environment variable; to enable it, run with GGML_CUDA_ENABLE_GRAPH_OPT=1.

TG performance is in line with the previous PR (2-7% gain). We leave some performance on the table where we don't fuse operations within the parallel streams themselves (e.g. MUL_MAT + BIAS, RMS_NORM + MUL, etc.); I couldn't find a simple enough way to enable fusion there.

Before:

Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg32 | 172.10 ± 0.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg64 | 164.89 ± 0.07 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 162.47 ± 0.05 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg32 | 124.67 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg64 | 121.77 ± 0.21 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 121.21 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg32 | 210.46 ± 0.07 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg64 | 207.49 ± 0.03 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 205.36 ± 0.03 |

After:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg32 | 181.60 ± 0.11 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg64 | 173.92 ± 0.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 170.95 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg32 | 128.16 ± 0.05 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg64 | 125.28 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 124.18 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg32 | 214.24 ± 0.08 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg64 | 211.05 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 208.83 ± 0.03 |

@DajanaV DajanaV force-pushed the main branch 29 times, most recently from d7421a0 to 5950843 on November 8, 2025 13:11
@DajanaV DajanaV force-pushed the main branch 9 times, most recently from b1d9e01 to b3275bb on November 13, 2025 09:10
loci-review bot commented Nov 15, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

The analysis examined PR #72 implementing CUDA stream-based concurrency for Q, K, V branch parallelization in llama.cpp. The changes introduce concurrent CUDA streams to improve GPU utilization, with demonstrated 2-7% throughput gains in benchmarks.

Performance Impact Assessment

Highest Performance Changes:

  • Response Time: llm_graph_input_out_ids::can_reuse() showed +0.096% (+0.063 ns)
  • Throughput Time: std::make_unique<llm_graph_input_attn_no_cache>() showed +0.111% (+0.078 ns)

Core Function Impact:
The performance changes do not affect critical inference functions (llama_decode, llama_encode, llama_tokenize) that directly impact tokens per second. The modified functions are utility functions in the graph construction layer, not the primary inference pipeline.

Power Consumption Analysis:
All binaries maintain stable power consumption with changes below 0.001%. The core inference library (build.bin.libllama.so) shows negligible variation at 280,731 nanojoules, indicating no significant computational intensity changes.

Flame Graph and CFG Analysis:
The can_reuse() function shows identical assembly code between versions with a flat execution profile (single 65ns operation). The 0.063ns performance difference represents measurement noise rather than algorithmic changes, likely caused by binary layout shifts affecting instruction cache alignment.

Code Review Insights:
The implementation adds sophisticated CUDA stream management infrastructure:

  • New concurrent event structures for stream synchronization
  • Graph optimization engine targeting 3-branch fan-out patterns
  • Per-stream memory pool management
  • Dynamic stream switching during execution

The changes are well-architected with appropriate safeguards and demonstrate measurable performance improvements in the target use case (Q, K, V parallelization).

Conclusion

The observed performance variations (sub-nanosecond changes) fall within measurement precision limits and do not impact inference performance. The CUDA concurrency implementation represents a positive enhancement to GPU utilization without introducing performance regressions in critical paths. No actionable performance optimizations are required based on the current analysis.

loci-review bot commented Nov 25, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

PR #72: CUDA Stream-Based Concurrency Implementation

This PR introduces concurrent CUDA stream execution for Q, K, V branch parallelization in single-GPU configurations. The implementation adds 448 lines across 3 files, establishing fork-join pattern detection and multi-stream scheduling infrastructure. Performance analysis shows 0.0% power consumption change across all binaries and sub-nanosecond timing variations in non-critical utility functions.

Key Findings

Performance-Critical Areas Impact:

The changes do not modify any functions within the core inference pipeline identified in the project summary. The observed variations occur in graph construction utilities: llm_graph_input_out_ids::can_reuse() shows +0.063 ns response time change, and std::make_unique<llm_graph_input_attn_no_cache>() shows +0.078 ns throughput change. These functions execute during graph setup, not during token generation.

Tokens Per Second Impact:

No impact on inference throughput. The critical functions llama_decode, llama_encode, and llama_tokenize show zero measurable changes in response time or throughput. Since these functions remain unaffected, tokens per second performance is preserved. The reference model (smollm:135m on i7-1255U) demonstrates that 2 ms degradation in llama_decode reduces tokens per second by 7%, but this PR introduces no such degradation.

Power Consumption Analysis:

All 16 binaries maintain stable power consumption. The core inference library build.bin.libllama.so shows -0.001% change (228,743 nJ vs 228,744 nJ baseline), representing a 1.21 nJ reduction that falls within measurement precision limits. GGML backend libraries (libggml-base.so, libggml-cpu.so, libggml.so) and all CLI tools show 0.0% change, confirming no computational intensity modifications in the baseline execution path.

Code Implementation:

The PR adds CUDA stream management infrastructure including concurrent event structures, graph optimization engine targeting 3-branch fan-out patterns, and per-stream memory pools. The optimization is disabled by default (requires GGML_CUDA_GRAPH_OPT=1), ensuring existing code paths remain unchanged. Benchmark data from the PR description shows 2-7% throughput improvements for target GPU workloads when optimization is enabled.

am17an and others added 3 commits November 27, 2025 22:10
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
loci-review bot commented Nov 27, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #72

Overview

This PR introduces CUDA stream-based concurrency for single-GPU inference, targeting parallel execution of Q, K, V attention branches. The implementation adds 469 lines across 3 CUDA backend files, implementing fork-join parallelism with event-based synchronization.

Performance Metrics Analysis

Static Analysis Results:

  • No function-level performance changes detected between versions
  • Power consumption analysis shows 0.0% change across all 16 binaries
  • All binaries maintain identical computational profiles

Measured Runtime Performance (from PR benchmarks):

  • Qwen3MoE 30B: 172 → 182 tokens/s (10 tokens/s improvement)
  • LLaMA 8B: 125 → 128 tokens/s (3 tokens/s improvement)
  • GPT-OSS 20B: 210 → 214 tokens/s (4 tokens/s improvement)

Key Findings

Code Implementation:
The changes implement graph-level optimization that identifies fork-join patterns in attention computation. The ggml_backend_cuda_graph_optimize() function analyzes computation graphs to detect nodes with fan-out of 3 (Q, K, V branches), validates memory safety through overlap detection, and reorders graph nodes to enable concurrent stream execution. The evaluate_and_capture_cuda_graph() function manages stream assignment and CUDA event synchronization at fork and join points.

Inference Impact:
The static analysis tools show no detectable changes because the optimization operates at the CUDA runtime level rather than modifying function logic. The measured tokens/s improvements indicate the concurrent execution successfully reduces attention computation time. However, core inference functions (llama_decode, llama_encode, llama_tokenize) show no response time or throughput changes in the static analysis, as the optimization affects GPU kernel scheduling rather than CPU-side function execution paths.

Power Consumption:
All binaries including libllama.so (190,887 nJ), llama-run (192,052 nJ), and llama-tts (224,377 nJ) maintain identical power consumption profiles. The concurrent stream execution does not alter the total computational work, only its temporal distribution across parallel streams.

Analysis Limitation:
The discrepancy between static analysis (no changes) and runtime benchmarks (2-7% improvement) indicates the performance gains occur at the GPU execution level, which the binary analysis tools do not capture. The optimization modifies when and how GPU kernels execute without changing the compiled CPU-side code paths that the static analysis examines.

loci-review bot commented Nov 28, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #72 - CUDA Stream-Based Concurrency

Overview

This PR introduces concurrent CUDA stream execution for Q, K, V tensor processing in single-GPU configurations, gated behind the GGML_CUDA_GRAPH_OPT=1 environment variable. Static analysis of the compiled binaries reveals no measurable performance impact in the default disabled state.

Performance Metrics

Binary-Level Analysis:
All 16 analyzed binaries show 0% power consumption change:

  • build.bin.libllama.so: 190,887 nJ (baseline: 190,887 nJ)
  • build.bin.libggml-cpu.so: 115,243 nJ (baseline: 115,243 nJ)
  • build.bin.llama-run: 192,052 nJ (baseline: 192,052 nJ)
  • build.bin.llama-cvector-generator: 220,175 nJ (baseline: 220,175 nJ)
  • build.bin.llama-tts: 224,377 nJ (baseline: 224,377 nJ)
  • Remaining 11 binaries: identical power consumption

Function-Level Analysis:
No functions show Response Time or Throughput Time changes. The summary report returned no data for function-level comparisons, indicating identical compiled output between versions.

Key Findings

Code Implementation:
The PR adds 469 lines implementing graph optimization logic that identifies fork-join patterns (specifically 3-way fan-out for Q, K, V tensors) and assigns them to concurrent CUDA streams. Key additions include ggml_cuda_concurrent_event for synchronization management, ggml_backend_cuda_graph_optimize() for pattern detection, and per-stream memory pool isolation.

Inference Impact:
No impact on tokens per second in the analyzed configuration. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no Response Time or Throughput Time changes. Since the feature is disabled by default and requires explicit environment variable activation, the compiled binaries remain functionally identical to the baseline.

Expected Runtime Behavior (When Enabled):
PR benchmarks indicate throughput improvements of 2-6% for text generation workloads on RTX 4090 when GGML_CUDA_GRAPH_OPT=1 is set. The optimization targets CUDA backend graph execution, not CPU-based tokenization paths.

Power Consumption:
No change across all binaries. The optimization logic is compiled but remains dormant without runtime activation, resulting in zero energy impact in the default configuration.
