UPSTREAM PR #18958: CUDA: use mmvq for mul-mat-id for small batch sizes by loci-dev · Pull Request #1101 · auroralabs-loci/llama.cpp

loci-dev · 2026-01-31T07:40:57Z

Note

Source pull request: ggml-org/llama.cpp#18958

Currently, for batch sizes > 1, MUL_MAT_ID immediately moves to mmq, which is suboptimal for small batch sizes. This change brings batched-bench performance in line (previously there was a dip at n_tokens = 2).
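The dispatch change described above can be illustrated with a small host-side sketch. All names here (`MmidKernel`, `choose_kernel_before`, `choose_kernel_after`, `kMmvqMaxBatch`) are hypothetical, and the cutoff of 8 tokens is an assumption chosen only because it is the largest `n` in the benchmark below; the PR text does not state the actual threshold or the other conditions (tensor type, device) that the real CUDA backend checks. The "before" function assumes that a single token already took the vector path, which is consistent with the unchanged n=1 rows (speedup 1.00).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical labels for the two CUDA kernel families discussed in the PR.
enum class MmidKernel { MMVQ, MMQ };

// Assumed upper bound on the number of tokens routed to the vector kernel.
// Not taken from the PR; 8 simply matches the largest n in the benchmark.
constexpr int64_t kMmvqMaxBatch = 8;

// Behaviour as described in the PR: any batch larger than one token goes to mmq.
MmidKernel choose_kernel_before(int64_t n_tokens) {
    assert(n_tokens >= 1);
    return n_tokens == 1 ? MmidKernel::MMVQ : MmidKernel::MMQ;
}

// Sketch of the new behaviour: small batches stay on the vector kernel.
// Real dispatch also depends on tensor type and hardware, omitted here.
MmidKernel choose_kernel_after(int64_t n_tokens) {
    assert(n_tokens >= 1);
    return n_tokens <= kMmvqMaxBatch ? MmidKernel::MMVQ : MmidKernel::MMQ;
}
```

With this sketch, n_tokens = 2 switches from mmq to the vector kernel, which corresponds to the batched-bench dip the PR mentions.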

Micro-benchmark results from test-backend-ops:

Backend	GGML op	Op parameters	TFLOPS (master)	TFLOPS (mmid-vec)	Speedup (mmid-vec / master)
CUDA0	MUL_MAT_ID	type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048	4.61	4.62	1.00
CUDA0	MUL_MAT_ID	type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048	2.34	6.13	2.62
CUDA0	MUL_MAT_ID	type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048	4.27	6.83	1.60
CUDA0	MUL_MAT_ID	type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048	5.49	5.49	1.00
CUDA0	MUL_MAT_ID	type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048	3.37	6.37	1.89
CUDA0	MUL_MAT_ID	type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048	6.57	7.23	1.10
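To make the op parameters in the table concrete, here is a simplified float reference for what MUL_MAT_ID computes: it is the expert-routed matrix multiplication used in mixture-of-experts layers, where each token multiplies its input by `n_used` of the `n_mats` expert matrices. The mapping of `m`, `n`, `k` to rows, tokens and the inner dimension is an assumption for illustration; the sketch ignores q4_0 quantization and does not model the benchmark's `b` parameter (it assumes one input vector per token, shared by its selected experts). The function name `mul_mat_id_ref` is hypothetical. One plausible reading of the speedups: with n = 4 and n_used = 8, each selected expert typically sees only one or a few token vectors, so the work per expert looks like a matrix-vector product, which a vector kernel handles better than a tiled matrix-matrix kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified reference (assumed layout, row-major and contiguous):
//   weights: n_mats expert matrices, each m rows x k columns
//   x:       n tokens, each a k-vector shared by all experts it selects
//   ids:     n tokens x n_used expert indices in [0, n_mats)
// Returns n tokens x n_used x m outputs.
std::vector<float> mul_mat_id_ref(const std::vector<float>& weights,
                                  const std::vector<float>& x,
                                  const std::vector<int32_t>& ids,
                                  int64_t n_mats, int64_t n_used,
                                  int64_t m, int64_t n, int64_t k) {
    assert(static_cast<int64_t>(weights.size()) == n_mats * m * k);
    assert(static_cast<int64_t>(x.size()) == n * k);
    assert(static_cast<int64_t>(ids.size()) == n * n_used);
    std::vector<float> out(static_cast<size_t>(n * n_used * m), 0.0f);
    for (int64_t t = 0; t < n; ++t) {
        const float* xt = x.data() + t * k;              // input of token t
        for (int64_t u = 0; u < n_used; ++u) {
            const int32_t e = ids[t * n_used + u];       // expert chosen for slot u
            assert(e >= 0 && e < n_mats);
            const float* w = weights.data() + e * m * k; // that expert's matrix
            float* o = out.data() + (t * n_used + u) * m;
            // Matrix-vector product: one expert matrix times one token vector.
            for (int64_t r = 0; r < m; ++r) {
                float acc = 0.0f;
                for (int64_t c = 0; c < k; ++c) acc += w[r * k + c] * xt[c];
                o[r] = acc;
            }
        }
    }
    return out;
}
```

The inner loops make the access pattern visible: for small `n`, every expert matrix is streamed from memory to be used with very few vectors.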

loci-review · 2026-01-31T08:33:49Z

No meaningful performance changes were detected across 113023 analyzed functions in the following binaries: build.bin.libmtmd.so, build.bin.libllama.so, build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so.


loci-dev temporarily deployed to PROD__AL_DEMO on January 31, 2026 07:41 with GitHub Actions.

loci-dev force-pushed the main branch from 5fea2ef to 8a7ef20 on January 31, 2026 08:12.

loci-dev force-pushed the main branch 27 times, most recently from b128b33 to d613f70, on February 1, 2026 12:16.

loci-dev force-pushed the main branch 30 times, most recently from 7ff3e7f to 99b11e9, on February 3, 2026 09:20.

loci-dev wants to merge 4 commits into main from loci/pr-18958-mmid-vec.