
UPSTREAM PR #16991: CUDA: add stream-based concurrency#72

Open
DajanaV wants to merge 17 commits into main from upstream-PR16991-branch_am17an-fused-qkv-stream

Conversation

@DajanaV DajanaV (Collaborator) commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16991

Possibly supersedes #16786.

This PR adds support for running concurrent CUDA streams on single-GPU setups.
At the moment this only targets the Q, K, V branch. I feel this is the "correct" approach in case the Q, K, V tensors are of different types or are not in the same place in memory. The downside is that this approach doesn't come for free and there's some complexity involved; I'm not an expert on the ggml graph, and I feel it could be simplified.

Currently this is gated behind an environment variable; to enable it, run with GGML_CUDA_ENABLE_GRAPH_OPT=1.

TG performance is in line with the previous PR (2-7% gain). We leave some performance on the table where we don't fuse operations within the parallel streams themselves (e.g. MUL_MAT + BIAS, RMS_NORM + MUL, etc.); I couldn't find a simple enough way to enable fusion there.

Before:

Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg32 | 172.10 ± 0.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg64 | 164.89 ± 0.07 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 162.47 ± 0.05 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg32 | 124.67 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg64 | 121.77 ± 0.21 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 121.21 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg32 | 210.46 ± 0.07 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg64 | 207.49 ± 0.03 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 205.36 ± 0.03 |

After:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg32 | 181.60 ± 0.11 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg64 | 173.92 ± 0.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 170.95 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg32 | 128.16 ± 0.05 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg64 | 125.28 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 124.18 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg32 | 214.24 ± 0.08 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg64 | 211.05 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 208.83 ± 0.03 |

@DajanaV DajanaV force-pushed the main branch 29 times, most recently from d7421a0 to 5950843 on November 8, 2025 13:11
@DajanaV DajanaV force-pushed the main branch 9 times, most recently from b1d9e01 to b3275bb on November 13, 2025 09:10
loci-review bot commented Nov 15, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

The analysis examined PR #72 implementing CUDA stream-based concurrency for Q, K, V branch parallelization in llama.cpp. The changes introduce concurrent CUDA streams to improve GPU utilization, with demonstrated 2-7% throughput gains in benchmarks.

Performance Impact Assessment

Highest Performance Changes:

  • Response Time: llm_graph_input_out_ids::can_reuse() showed +0.096% (+0.063 ns)
  • Throughput Time: std::make_unique<llm_graph_input_attn_no_cache>() showed +0.111% (+0.078 ns)

Core Function Impact:
The performance changes do not affect critical inference functions (llama_decode, llama_encode, llama_tokenize) that directly impact tokens per second. The modified functions are utility functions in the graph construction layer, not the primary inference pipeline.

Power Consumption Analysis:
All binaries maintain stable power consumption with changes below 0.001%. The core inference library (build.bin.libllama.so) shows negligible variation at 280,731 nanojoules, indicating no significant computational intensity changes.

Flame Graph and CFG Analysis:
The can_reuse() function shows identical assembly code between versions with a flat execution profile (single 65ns operation). The 0.063ns performance difference represents measurement noise rather than algorithmic changes, likely caused by binary layout shifts affecting instruction cache alignment.

Code Review Insights:
The implementation adds sophisticated CUDA stream management infrastructure:

  • New concurrent event structures for stream synchronization
  • Graph optimization engine targeting 3-branch fan-out patterns
  • Per-stream memory pool management
  • Dynamic stream switching during execution

The changes are well-architected with appropriate safeguards and demonstrate measurable performance improvements in the target use case (Q, K, V parallelization).

Conclusion

The observed performance variations (sub-nanosecond changes) fall within measurement precision limits and do not impact inference performance. The CUDA concurrency implementation represents a positive enhancement to GPU utilization without introducing performance regressions in critical paths. No actionable performance optimizations are required based on the current analysis.

loci-review bot commented Nov 25, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

PR #72: CUDA Stream-Based Concurrency Implementation

This PR introduces concurrent CUDA stream execution for Q, K, V branch parallelization in single-GPU configurations. The implementation adds 448 lines across 3 files, establishing fork-join pattern detection and multi-stream scheduling infrastructure. Performance analysis shows 0.0% power consumption change across all binaries and sub-nanosecond timing variations in non-critical utility functions.

Key Findings

Performance-Critical Areas Impact:

The changes do not modify any functions within the core inference pipeline identified in the project summary. The observed variations occur in graph construction utilities: llm_graph_input_out_ids::can_reuse() shows +0.063 ns response time change, and std::make_unique<llm_graph_input_attn_no_cache>() shows +0.078 ns throughput change. These functions execute during graph setup, not during token generation.

Tokens Per Second Impact:

No impact on inference throughput. The critical functions llama_decode, llama_encode, and llama_tokenize show zero measurable changes in response time or throughput. Since these functions remain unaffected, tokens per second performance is preserved. The reference model (smollm:135m on i7-1255U) demonstrates that 2 ms degradation in llama_decode reduces tokens per second by 7%, but this PR introduces no such degradation.

Power Consumption Analysis:

All 16 binaries maintain stable power consumption. The core inference library build.bin.libllama.so shows -0.001% change (228,743 nJ vs 228,744 nJ baseline), representing a 1.21 nJ reduction that falls within measurement precision limits. GGML backend libraries (libggml-base.so, libggml-cpu.so, libggml.so) and all CLI tools show 0.0% change, confirming no computational intensity modifications in the baseline execution path.

Code Implementation:

The PR adds CUDA stream management infrastructure including concurrent event structures, graph optimization engine targeting 3-branch fan-out patterns, and per-stream memory pools. The optimization is disabled by default (requires GGML_CUDA_GRAPH_OPT=1), ensuring existing code paths remain unchanged. Benchmark data from the PR description shows 2-7% throughput improvements for target GPU workloads when optimization is enabled.

am17an and others added 3 commits November 27, 2025 22:10
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
loci-review bot commented Nov 27, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #72

Overview

This PR introduces CUDA stream-based concurrency for single-GPU inference, targeting parallel execution of Q, K, V attention branches. The implementation adds 469 lines across 3 CUDA backend files, implementing fork-join parallelism with event-based synchronization.

Performance Metrics Analysis

Static Analysis Results:

  • No function-level performance changes detected between versions
  • Power consumption analysis shows 0.0% change across all 16 binaries
  • All binaries maintain identical computational profiles

Measured Runtime Performance (from PR benchmarks):

  • Qwen3MoE 30B: 172 → 182 tokens/s (10 tokens/s improvement)
  • LLaMA 8B: 125 → 128 tokens/s (3 tokens/s improvement)
  • GPT-OSS 20B: 210 → 214 tokens/s (4 tokens/s improvement)

Key Findings

Code Implementation:
The changes implement graph-level optimization that identifies fork-join patterns in attention computation. The ggml_backend_cuda_graph_optimize() function analyzes computation graphs to detect nodes with fan-out of 3 (Q, K, V branches), validates memory safety through overlap detection, and reorders graph nodes to enable concurrent stream execution. The evaluate_and_capture_cuda_graph() function manages stream assignment and CUDA event synchronization at fork and join points.

Inference Impact:
The static analysis tools show no detectable changes because the optimization operates at the CUDA runtime level rather than modifying function logic. The measured tokens/s improvements indicate the concurrent execution successfully reduces attention computation time. However, core inference functions (llama_decode, llama_encode, llama_tokenize) show no response time or throughput changes in the static analysis, as the optimization affects GPU kernel scheduling rather than CPU-side function execution paths.

Power Consumption:
All binaries including libllama.so (190,887 nJ), llama-run (192,052 nJ), and llama-tts (224,377 nJ) maintain identical power consumption profiles. The concurrent stream execution does not alter the total computational work, only its temporal distribution across parallel streams.

Analysis Limitation:
The discrepancy between static analysis (no changes) and runtime benchmarks (2-7% improvement) indicates the performance gains occur at the GPU execution level, which the binary analysis tools do not capture. The optimization modifies when and how GPU kernels execute without changing the compiled CPU-side code paths that the static analysis examines.

loci-review bot commented Nov 28, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #72 - CUDA Stream-Based Concurrency

Overview

This PR introduces concurrent CUDA stream execution for Q, K, V tensor processing in single-GPU configurations, gated behind the GGML_CUDA_GRAPH_OPT=1 environment variable. Static analysis of the compiled binaries reveals no measurable performance impact in the default disabled state.

Performance Metrics

Binary-Level Analysis:
All 16 analyzed binaries show 0% power consumption change:

  • build.bin.libllama.so: 190,887 nJ (baseline: 190,887 nJ)
  • build.bin.libggml-cpu.so: 115,243 nJ (baseline: 115,243 nJ)
  • build.bin.llama-run: 192,052 nJ (baseline: 192,052 nJ)
  • build.bin.llama-cvector-generator: 220,175 nJ (baseline: 220,175 nJ)
  • build.bin.llama-tts: 224,377 nJ (baseline: 224,377 nJ)
  • Remaining 11 binaries: identical power consumption

Function-Level Analysis:
No functions show Response Time or Throughput Time changes. The summary report returned no data for function-level comparisons, indicating identical compiled output between versions.

Key Findings

Code Implementation:
The PR adds 469 lines implementing graph optimization logic that identifies fork-join patterns (specifically 3-way fan-out for Q, K, V tensors) and assigns them to concurrent CUDA streams. Key additions include ggml_cuda_concurrent_event for synchronization management, ggml_backend_cuda_graph_optimize() for pattern detection, and per-stream memory pool isolation.

Inference Impact:
No impact on tokens per second in the analyzed configuration. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no Response Time or Throughput Time changes. Since the feature is disabled by default and requires explicit environment variable activation, the compiled binaries remain functionally identical to the baseline.

Expected Runtime Behavior (When Enabled):
PR benchmarks indicate throughput improvements of 2-6% for text generation workloads on RTX 4090 when GGML_CUDA_GRAPH_OPT=1 is set. The optimization targets CUDA backend graph execution, not CPU-based tokenization paths.

Power Consumption:
No change across all binaries. The optimization logic is compiled but remains dormant without runtime activation, resulting in zero energy impact in the default configuration.
