
UPSTREAM PR #18496: ggml-cuda: enable concurrent streams by default#766

Open
loci-dev wants to merge 4 commits into main from upstream-PR18496-branch_am17an-graph-opt-fix

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18496

This PR enables the concurrent streams introduced in #16991 by default. To disable them, a new env flag, `GGML_CUDA_DISABLE_GRAPH_OPT`, is introduced.

Other changes:

  • Fixed a bug where events were not cleared when concurrent events are not valid memory-wise
  • To reduce the surface area, the forking node must be of the form attn_norm, since that (QKV) is the only pattern I've tested extensively
  • Only works when CUDA graphs are enabled (so Ampere+ and single-GPU at the moment)

@loci-review

loci-review bot commented Dec 31, 2025

Explore the complete analysis inside the Version Insights

Here's the summary report for your project:

Performance Summary Report

Version Comparison:

  • Base Version: caf267f0-e5e9-11f0-81f2-dbb430499cb5
  • Target Version: 9f7d40a0-e60b-11f0-81f2-dbb430499cb5

Performance Analysis Results

Response Time Changes:
No modified functions were found with performance changes greater than 2%

Throughput Time Changes:
No modified functions were found with performance changes greater than 2%


Summary

The analysis shows that Pull Request #766 in the llama.cpp repository has minimal performance impact. No functions exhibited response time or throughput time changes exceeding the 2% threshold when comparing the target version against the base version.

This indicates that the changes introduced in this pull request are performance-neutral, meaning:

  • ✅ No significant performance regressions detected
  • ✅ No significant performance improvements detected
  • ✅ The code changes maintain stable performance characteristics

This is generally a positive outcome, suggesting that the modifications made in PR #766 do not negatively affect the performance of the application.

@loci-review

loci-review bot commented Dec 31, 2025

Explore the complete analysis inside the Version Insights

Perfect! I've generated the summary report for your project. Here's what the analysis shows:

Summary Report for llama.cpp PR #766

The performance analysis comparing the base version to the target version shows:

Key Findings:

  • No significant performance regressions detected
  • No significant performance improvements detected
  • All modified functions showed performance changes of less than 2% in both response time and throughput

Interpretation:
This indicates that Pull Request #766 is performance-neutral. The changes introduced are likely focused on:

  • Functionality enhancements
  • Bug fixes
  • Code quality improvements
  • Refactoring

rather than performance optimization. Importantly, they do not negatively impact the existing performance of the codebase.

Would you like more detailed information about specific aspects of this analysis?

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 5c1f0b4 to 03ffde7 on December 31, 2025 at 12:15
@loci-dev loci-dev force-pushed the main branch 13 times, most recently from ca06125 to 76fc6ba on January 2, 2026 at 00:37
@loci-dev loci-dev force-pushed the main branch 5 times, most recently from 1f52e52 to 59c4631 on January 2, 2026 at 22:08
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 8271a31 to 12cf436 on January 9, 2026 at 11:09
