
UPSTREAM PR #18496: ggml-cuda: enable concurrent streams by default#766

Open
loci-dev wants to merge 4 commits into main from upstream-PR18496-branch_am17an-graph-opt-fix

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18496

This PR enables the concurrent streams introduced in #16991 by default. To disable them, a new env flag, `GGML_CUDA_DISABLE_GRAPH_OPT`, is introduced.

Other changes:

  • Fixed a bug where events were not cleared when concurrent events are not valid memory-wise
  • To reduce the surface area, the forking node must be of the form attn_norm, since that (QKV) is the only pattern I've tested extensively
  • Only works when CUDA graphs are enabled (so Ampere+ and single-GPU at the moment)

@loci-review

loci-review bot commented Dec 31, 2025

Explore the complete analysis inside the Version Insights

Here's the summary report for your project:

Performance Summary Report

Version Comparison:

  • Base Version: caf267f0-e5e9-11f0-81f2-dbb430499cb5
  • Target Version: 9f7d40a0-e60b-11f0-81f2-dbb430499cb5

Performance Analysis Results

Response Time Changes:
No modified functions were found with performance changes greater than 2%

Throughput Time Changes:
No modified functions were found with performance changes greater than 2%


Summary

The analysis shows that Pull Request #766 in the llama.cpp repository has minimal performance impact. No functions exhibited response time or throughput time changes exceeding the 2% threshold when comparing the target version against the base version.

This indicates that the changes introduced in this pull request are performance-neutral, meaning:

  • ✅ No significant performance regressions detected
  • ✅ No significant performance improvements detected
  • ✅ The code changes maintain stable performance characteristics

This is generally a positive outcome, suggesting that the modifications made in PR #766 do not negatively affect the performance of the application.

@loci-review

loci-review bot commented Dec 31, 2025

Explore the complete analysis inside the Version Insights

Perfect! I've generated the summary report for your project. Here's what the analysis shows:

Summary Report for llama.cpp PR #766

The performance analysis comparing the base version to the target version shows:

Key Findings:

  • No significant performance regressions detected
  • No significant performance improvements detected
  • All modified functions showed performance changes of less than 2% in both response time and throughput

Interpretation:
This indicates that Pull Request #766 is performance-neutral. The changes introduced are likely focused on:

  • Functionality enhancements
  • Bug fixes
  • Code quality improvements
  • Refactoring

rather than performance optimization. Importantly, they do not negatively impact the existing performance of the codebase.

Would you like more detailed information about specific aspects of this analysis?

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 5c1f0b4 to 03ffde7 on December 31, 2025 at 12:15
@loci-dev loci-dev force-pushed the main branch 13 times, most recently from ca06125 to 76fc6ba on January 2, 2026 at 00:37
@loci-dev loci-dev force-pushed the main branch 5 times, most recently from 1f52e52 to 59c4631 on January 2, 2026 at 22:08
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 8271a31 to 12cf436 on January 9, 2026 at 11:09
