Conversation

@DajanaV (Collaborator) commented Nov 18, 2025

Mirrored from ggml-org/llama.cpp#17342

I came across a core-scaling issue while running llama-bench on a machine with a large core count and small batch sizes. During the investigation I found cache-line contention causing the scaling problem. This patch fixes the contention.
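To illustrate the failure mode, here is a minimal standalone sketch (hypothetical code, not taken from llama.cpp): when every worker pulls its next chunk from a single shared atomic counter, each fetch_add drags the counter's cache line between cores, which shows up in profiles as HITM events (loads that hit a modified line in another core's cache).

```cpp
// Hypothetical microbenchmark of the contended pattern; all names are made up.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int  n_threads = 8;
    const long n_chunks  = 1000000;

    std::atomic<long> next_chunk{0};  // shared counter: the contended cache line
    std::atomic<long> checksum{0};

    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            long sum = 0;
            // Dynamic chunk allocation: every iteration writes the shared line.
            for (long c = next_chunk.fetch_add(1); c < n_chunks;
                 c = next_chunk.fetch_add(1)) {
                sum += c;  // stand-in for the real per-chunk work
            }
            checksum.fetch_add(sum);
        });
    }
    for (auto & w : workers) w.join();
    std::printf("checksum: %ld\n", checksum.load());
    return 0;
}
```

A tool such as perf c2c can attribute HITM samples to the fetch_add line in a pattern like this; the patch removes that pattern from the sgemm hot loop.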

With this patch I've seen throughput improvements ranging from 2% to 44% while running the Qwen3 30B parameter model.

Results were obtained with the following command:
$ llama-bench -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_S.gguf -p 500 -n 0 -t n -b 16,32,64,96,128
where n is the number of threads.

Here are the results:

  1. Constant batch size = 16 with varying number of threads (TPS ratio, Patched/Baseline):

     Threads   TPS
     1         1.00
     2         1.00
     4         1.00
     8         1.00
     16        1.03
     32        1.09
     64        1.16
     96        1.20

  2. Constant number of threads = 96 with varying batch size (TPS ratio, Patched/Baseline):

     Batch size   TPS
     1            1.00
     2            1.44
     4            1.34
     8            1.27
     16           1.20
     32           1.16
     64           1.11
     96           1.07
     128          1.05
     512          1.02
     1024         1.02

==== Test Results ====

100% tests passed, 0 tests failed out of 35

Label Time Summary:
main = 46.75 sec*proc (35 tests)

I'd greatly appreciate any feedback to help get this patch accepted.
Thanks.

…-line contention (cache HITM)

This improves throughput in cases where threads have to wait for lack of work, causing the process
to spend many cycles in a spin loop. The dynamic chunk counter is replaced with static stride
partitioning, which eliminates the shared counter entirely.

* remove one barrier in sgemm()

* static stride partitioning
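For reference, here is a rough sketch of the static-stride idea (hypothetical code, assuming a tinyBLAS-style loop over output tiles; process_tile stands in for the real tile kernel and is not a function in sgemm.cpp):

```cpp
#include <cstdio>

// Stand-in for the tinyBLAS tile kernel.
static void process_tile(long tile) { std::printf("tile %ld\n", tile); }

// Thread ith handles tiles ith, ith + n_threads, ith + 2*n_threads, ...
// The partition is fixed up front, so the hot loop never writes shared
// state (no chunk counter), and no barrier is needed to hand out work.
static void run_tiles(long n_tiles, int ith, int n_threads) {
    for (long tile = ith; tile < n_tiles; tile += n_threads) {
        process_tile(tile);
    }
}

int main() {
    // Serial driver for illustration; in practice each worker thread
    // would call run_tiles with its own thread index.
    const long n_tiles   = 10;
    const int  n_threads = 4;
    for (int ith = 0; ith < n_threads; ++ith) {
        run_tiles(n_tiles, ith, n_threads);
    }
    return 0;
}
```

The trade-off is that a static partition can load-balance worse than work stealing when tile costs vary, but for uniform GEMM tiles the contention savings dominate.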
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of PR #248 reveals a 23.86% response time increase in the quantize_row_iq4_xs function (79 ns → 98 ns), the largest performance change identified. However, this regression is unrelated to the actual code modifications in the pull request.

Code Changes Analysis

The PR implements a targeted optimization in ggml/src/ggml-cpu/llamafile/sgemm.cpp within the tinyBLAS matrix multiplication class:

  • Removed thread barrier and dynamic job allocation
  • Replaced work-stealing with static work distribution using a simple for-loop
  • Eliminated atomic operations that caused cache-line contention

These changes address legitimate scaling issues in multi-threaded environments with small batch sizes, showing 2-44% throughput improvements in benchmarks.

Key Findings

Performance Impact:

  • The quantize_row_iq4_xs regression (19 ns absolute increase) stems from memory layout changes during binary compilation, specifically string literal relocation to different memory pages
  • No core inference functions affected: llama_decode, llama_encode, and llama_tokenize show no performance changes
  • Tokens per second impact: Zero, as no tokenization or inference functions experienced response time changes

Power Consumption:

  • 3.15% decrease in power consumption for build.bin.libggml-cpu.so (10,060 nJ → 9,743 nJ)
  • Net energy reduction despite quantization regression indicates overall efficiency gains

Technical Analysis:

  • Flame graph: shows shallow execution with 85.7% self-time in the quantization function, with PLT call overhead contributing the remaining 14.3%
  • CFG comparison: Identical control flow structure; only assembly differences are memory address relocations for assertion strings
  • Code review: Clean, well-targeted optimization addressing documented performance bottleneck

Actionable Recommendations:

  1. Review linker configuration to maintain string literal locality and prevent memory page fragmentation
  2. Consider assertion removal in performance-critical builds to eliminate the 14 ns PLT overhead

The core optimization is sound and beneficial; the quantization regression requires separate binary layout investigation.
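As a sketch of recommendation 2 (assuming the assertions involved are standard assert() calls; ggml's own GGML_ASSERT macro may be compiled in unconditionally and would need separate handling), a CMake Release build defines NDEBUG, which compiles assert() out:

$ cmake -B build -DCMAKE_BUILD_TYPE=Release
$ cmake --build build -j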

@DajanaV force-pushed the main branch 3 times, most recently from f333350 to 9c4623f on November 18, 2025 09:10
@loci-dev force-pushed the main branch 24 times, most recently from d2e6325 to 22143ca on November 24, 2025 14:09
@loci-dev force-pushed the main branch 24 times, most recently from a89c6ad to ad5ad9a on November 27, 2025 14:08