
UPSTREAM PR #18958: CUDA: use mmvq for mul-mat-id for small batch sizes #979

Open
loci-dev wants to merge 3 commits into main from upstream-PR18958-branch_am17an-mmid-vec



@loci-dev

Mirrored from ggml-org/llama.cpp#18958

Currently, for batch sizes > 1, we immediately move to mmq, which is suboptimal for small batch sizes. This PR uses mmvq for mul-mat-id at small batch sizes instead, bringing batched-bench performance in line (previously there was a dip at n_tokens = 2).
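The dispatch change described above can be pictured roughly as follows. This is only a sketch under stated assumptions: the enum, the function names, and the cutoff of 8 are hypothetical and are not taken from the PR's diff. The cutoff was picked only because the benchmark below reports results up to n=8. The sketch also assumes that a batch size of 1 already used the mmvq path before the change.

```cpp
#include <cstdint>

// Hypothetical labels for the two kernel families named in the PR text.
enum class MulMatIdKernel { MMVQ, MMQ };

// ASSUMPTION: illustrative cutoff only; the real threshold is not stated in this PR page.
constexpr int64_t kAssumedMmvqMaxBatch = 8;

// Behaviour as described for master: any batch size above 1 goes straight to mmq.
inline MulMatIdKernel pick_kernel_before(int64_t n_tokens) {
    return n_tokens > 1 ? MulMatIdKernel::MMQ : MulMatIdKernel::MMVQ;
}

// Behaviour as described for this PR: small batches stay on mmvq,
// larger batches still fall through to mmq.
inline MulMatIdKernel pick_kernel_after(int64_t n_tokens,
                                        int64_t mmvq_max_batch = kAssumedMmvqMaxBatch) {
    return n_tokens <= mmvq_max_batch ? MulMatIdKernel::MMVQ : MulMatIdKernel::MMQ;
}
```

With these assumptions, `pick_kernel_before(2)` selects mmq while `pick_kernel_after(2)` selects mmvq, which is the case the reported dip at n_tokens = 2 refers to.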

Micro-benchmark for test-backend-ops

| Backend | GGML op | Op parameters | TFLOPS master | TFLOPS mmid-vec | Speedup |
| --- | --- | --- | --- | --- | --- |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 | 4.61 | 4.62 | 1.00 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 | 2.34 | 6.13 | 2.62 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 | 4.27 | 6.83 | 1.60 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 | 5.49 | 5.49 | 1.00 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 | 3.37 | 6.37 | 1.89 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 | 6.57 | 7.23 | 1.10 |
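The Speedup column appears to be the mmid-vec throughput divided by the master throughput, rounded to two decimals. A minimal sketch of that arithmetic follows; the helper name `speedup` is hypothetical and is not part of test-backend-ops.

```cpp
#include <cmath>

// Ratio of branch throughput to master throughput, rounded to two decimal places.
// Both arguments are TFLOPS figures as reported in the table.
inline double speedup(double tflops_master, double tflops_branch) {
    return std::round(tflops_branch / tflops_master * 100.0) / 100.0;
}
```

For the n=4 row of the first configuration, `speedup(2.34, 6.13)` gives 2.62, matching the table.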

@loci-review

loci-review bot commented Jan 20, 2026

Explore the complete analysis inside the Version Insights

Based on the analysis, no functions were identified with meaningful performance changes between the base and target versions. The code modifications did not result in measurable performance impact.

@loci-dev loci-dev force-pushed the main branch 19 times, most recently from 5b137d4 to ab9ebfa Compare January 23, 2026 08:12
@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 4f9b49b to 30f9ba9 Compare January 23, 2026 17:12
@loci-dev loci-dev force-pushed the upstream-PR18958-branch_am17an-mmid-vec branch from d0db284 to 54bfaef Compare January 23, 2026 18:45
@loci-review

loci-review bot commented Jan 23, 2026

Based on the analysis, no functions were identified with meaningful performance changes between the base and target versions. The function_insights_topk tool returned empty results for both response time and throughput time metrics, indicating that the code changes in this version do not introduce measurable performance impacts to any functions in the analyzed binary.

This suggests that the modifications between versions are either:

  • Non-performance-affecting changes (documentation, comments, refactoring without algorithmic changes)
  • Changes to code paths not captured in the static analysis
  • Modifications that fall below the detection threshold for performance impact

Conclusion: No performance review is warranted for this version comparison, as no functions exhibit detectable changes in execution time metrics.

See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

@loci-dev loci-dev force-pushed the main branch 30 times, most recently from e11b5e5 to 82a6249 Compare January 29, 2026 17:18