UPSTREAM PR #18958: CUDA: use mmvq for mul-mat-id for small batch sizes by loci-dev · Pull Request #979 · auroralabs-loci/llama.cpp

loci-dev · 2026-01-20T15:43:16Z

Currently, for batch sizes > 1, we immediately move to mmq, which is suboptimal for small batch sizes. Using mmvq for small batches instead brings batched-bench performance in line (previously there was a dip at n_tokens = 2).
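As a rough illustration of the dispatch change described above, the sketch below contrasts the old and new kernel choice for quantized weights. It is not the code from this PR: the enum, the function names, and the cutoff `kSmallBatchMax` are hypothetical, and the value 8 is assumed only because the micro-benchmark in this PR stops at n=8.

```cpp
#include <cstdint>

// Hypothetical labels for the two kernel families named in this PR.
enum class MulMatIdKernel { MMVQ, MMQ };

// Assumed cutoff for "small" batches; not taken from the PR's diff.
constexpr std::int64_t kSmallBatchMax = 8;

// Behaviour described for master: anything above a single token goes to mmq.
constexpr MulMatIdKernel pick_kernel_master(std::int64_t n_tokens) {
    return n_tokens == 1 ? MulMatIdKernel::MMVQ : MulMatIdKernel::MMQ;
}

// Behaviour proposed here: stay on mmvq while the batch is still small.
constexpr MulMatIdKernel pick_kernel_mmid_vec(std::int64_t n_tokens) {
    return n_tokens <= kSmallBatchMax ? MulMatIdKernel::MMVQ : MulMatIdKernel::MMQ;
}

// The n_tokens = 2 case is where the two policies first disagree.
static_assert(pick_kernel_master(2) == MulMatIdKernel::MMQ, "master switches to mmq at 2 tokens");
static_assert(pick_kernel_mmid_vec(2) == MulMatIdKernel::MMVQ, "sketch keeps mmvq at 2 tokens");
```

Under these assumptions the two policies agree at n_tokens = 1 and above the cutoff, and differ only in the small-batch range where the dip was reported.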

Micro-benchmark for test-backend-ops

| Backend | GGML op | Op parameters | TFLOPS master | TFLOPS mmid-vec | Speedup |
| --- | --- | --- | --- | --- | --- |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 | 4.61 | 4.62 | 1.00 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 | 2.34 | 6.13 | 2.62 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 | 4.27 | 6.83 | 1.60 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 | 5.49 | 5.49 | 1.00 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 | 3.37 | 6.37 | 1.89 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 | 6.57 | 7.23 | 1.10 |
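
The Speedup column appears to be the ratio of the mmid-vec TFLOPS to the master TFLOPS, rounded to two decimals. A minimal sketch of that arithmetic follows; the `BenchRow` struct and `speedup` helper are hypothetical names introduced here for illustration, not part of test-backend-ops.

```cpp
#include <cmath>

// One benchmark row: batch size n and the two measured throughputs (TFLOPS).
struct BenchRow {
    int    n;
    double tflops_master;
    double tflops_mmid_vec;
};

// Speedup of the mmid-vec branch over master, rounded to two decimals.
inline double speedup(const BenchRow &row) {
    return std::round(row.tflops_mmid_vec / row.tflops_master * 100.0) / 100.0;
}
```

For example, the m=768, n=4 row gives 6.13 / 2.34 ≈ 2.62, matching the reported value. The gain is concentrated at n=4 and n=8, while n=1 is unchanged.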

loci-review · 2026-01-20T16:28:50Z

Explore the complete analysis inside the Version Insights

Based on the analysis, no functions were identified with meaningful performance changes between the base and target versions. The code modifications did not result in measurable performance impact.

loci-review · 2026-01-23T19:37:12Z

Based on the analysis, no functions were identified with meaningful performance changes between the base and target versions. The function_insights_topk tool returned empty results for both response time and throughput time metrics, indicating that the code changes in this version do not introduce measurable performance impacts to any functions in the analyzed binary.

This suggests that the modifications between versions are either:

- Non-performance-affecting changes (documentation, comments, refactoring without algorithmic changes)
- Changes to code paths not captured in the static analysis
- Modifications that fall below the detection threshold for performance impact

Conclusion: No performance review is warranted for this version comparison, as no functions exhibit detectable changes in execution time metrics.

See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

loci-dev temporarily deployed to PROD__AL_DEMO January 20, 2026 15:43 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 19 times, most recently from 5b137d4 to ab9ebfa on January 23, 2026 08:12

am17an added 2 commits January 23, 2026 12:05

CUDA: use mmvq for mul-mat-id for small batch sizes

6f403f6

add mmvq too

056bccf

loci-dev force-pushed the main branch 3 times, most recently from 4f9b49b to 30f9ba9 on January 23, 2026 17:12

Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs

54bfaef

loci-dev force-pushed the upstream-PR18958-branch_am17an-mmid-vec branch from d0db284 to 54bfaef on January 23, 2026 18:45

loci-dev temporarily deployed to PROD__AL_DEMO January 23, 2026 18:45 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 30 times, most recently from e11b5e5 to 82a6249 on January 29, 2026 17:18

loci-dev wants to merge 3 commits into main from upstream-PR18958-branch_am17an-mmid-vec
