UPSTREAM PR #17342: Throughput improvement for small batch sizes
Mirrored from ggml-org/llama.cpp#17342
I came across a core-scaling issue while running llama-bench on a machine with a large core count and small batch sizes. During the investigation I found cache-line contention between threads that was causing the poor scaling; this patch removes that contention.
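For reference, the general shape of this kind of fix is to give hot per-thread state its own cache line so that threads stop invalidating each other's lines. Below is a minimal, self-contained sketch of the pattern; the names and counter layout are hypothetical illustrations, not the actual change in this patch:

```cpp
// Sketch of a false-sharing fix (hypothetical names; not the actual patch).
// Build: g++ -O2 -std=c++17 -pthread false_sharing_demo.cpp
#include <atomic>
#include <thread>
#include <vector>

constexpr int kThreads = 8;

// Contended layout: adjacent atomics share a 64-byte cache line, so every
// increment by one thread invalidates the line in all other cores' caches.
std::atomic<long> packed_counters[kThreads];

// Fixed layout: alignas(64) gives each counter its own cache line, so
// threads update independent lines and stop ping-ponging ownership.
struct alignas(64) padded_counter { std::atomic<long> v{0}; };
padded_counter padded_counters[kThreads];

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < kThreads; ++i) {
        workers.emplace_back([i] {
            for (long n = 0; n < 10000000; ++n) {
                // Swap in packed_counters[i] here to observe the slowdown.
                padded_counters[i].v.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto &w : workers) w.join();
}
```

The hard-coded 64 assumes the usual x86-64 cache-line size; C++17's std::hardware_destructive_interference_size can be used instead where available.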
With this patch I've seen throughput improvements ranging from 2% to 44% while running a Qwen3 30B-parameter model.
Results were obtained with the following command:
$ llama-bench -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_S.gguf -p 500 -n 0 -t n -b 16,32,64,96,128
where "n" is the number of threads.
Here are the results:
[results chart: per-configuration speedups of 1.03, 1.09, 1.16, 1.20, 1.44, 1.34, 1.27, 1.20, 1.16, 1.11, 1.07, 1.05, 1.02, 1.02]

===== Test Results =====
Test project llama.cpp/build-ci-debug
Start 1: test-tokenizer-0-bert-bge
1/35 Test #1: test-tokenizer-0-bert-bge ......... Passed 0.11 sec
Start 2: test-tokenizer-0-command-r
2/35 Test #2: test-tokenizer-0-command-r ........ Passed 1.37 sec
Start 3: test-tokenizer-0-deepseek-coder
3/35 Test #3: test-tokenizer-0-deepseek-coder ... Passed 0.24 sec
Start 4: test-tokenizer-0-deepseek-llm
4/35 Test #4: test-tokenizer-0-deepseek-llm ..... Passed 0.60 sec
Start 5: test-tokenizer-0-falcon
5/35 Test #5: test-tokenizer-0-falcon ........... Passed 0.35 sec
Start 6: test-tokenizer-0-gpt-2
6/35 Test #6: test-tokenizer-0-gpt-2 ............ Passed 0.27 sec
Start 7: test-tokenizer-0-llama-bpe
7/35 Test #7: test-tokenizer-0-llama-bpe ........ Passed 0.90 sec
Start 8: test-tokenizer-0-llama-spm
8/35 Test #8: test-tokenizer-0-llama-spm ........ Passed 0.10 sec
Start 9: test-tokenizer-0-mpt
9/35 Test #9: test-tokenizer-0-mpt .............. Passed 0.27 sec
Start 10: test-tokenizer-0-phi-3
10/35 Test #10: test-tokenizer-0-phi-3 ............ Passed 0.10 sec
Start 11: test-tokenizer-0-qwen2
11/35 Test #11: test-tokenizer-0-qwen2 ............ Passed 0.95 sec
Start 12: test-tokenizer-0-refact
12/35 Test #12: test-tokenizer-0-refact ........... Passed 0.27 sec
Start 13: test-tokenizer-0-starcoder
13/35 Test #13: test-tokenizer-0-starcoder ........ Passed 0.27 sec
Start 14: test-tokenizers-ggml-vocabs
14/35 Test #14: test-tokenizers-ggml-vocabs ....... Passed 6.92 sec
Start 15: test-sampling
15/35 Test #15: test-sampling ..................... Passed 3.66 sec
Start 16: test-grammar-parser
16/35 Test #16: test-grammar-parser ............... Passed 0.00 sec
Start 17: test-grammar-integration
17/35 Test #17: test-grammar-integration .......... Passed 0.02 sec
Start 18: test-llama-grammar
18/35 Test #18: test-llama-grammar ................ Passed 0.00 sec
Start 19: test-chat
19/35 Test #19: test-chat ......................... Passed 7.52 sec
Start 20: test-json-schema-to-grammar
20/35 Test #20: test-json-schema-to-grammar ....... Passed 1.50 sec
Start 21: test-tokenizer-1-llama-spm
21/35 Test #21: test-tokenizer-1-llama-spm ........ Passed 0.44 sec
Start 22: test-chat-parser
22/35 Test #22: test-chat-parser .................. Passed 0.01 sec
Start 23: test-chat-template
23/35 Test #23: test-chat-template ................ Passed 0.70 sec
Start 24: test-json-partial
24/35 Test #24: test-json-partial ................. Passed 0.01 sec
Start 25: test-log
25/35 Test #25: test-log .......................... Passed 0.02 sec
Start 26: test-regex-partial
26/35 Test #26: test-regex-partial ................ Passed 0.01 sec
Start 27: test-thread-safety
27/35 Test #27: test-thread-safety ................ Passed 1.03 sec
Start 28: test-arg-parser
28/35 Test #28: test-arg-parser ................... Passed 0.25 sec
Start 29: test-gguf
29/35 Test #29: test-gguf ......................... Passed 0.14 sec
Start 32: test-barrier
30/35 Test #32: test-barrier ...................... Passed 1.44 sec
Start 33: test-quantize-fns
31/35 Test #33: test-quantize-fns ................. Passed 16.97 sec
Start 34: test-quantize-perf
32/35 Test #34: test-quantize-perf ................ Passed 0.22 sec
Start 35: test-rope
33/35 Test #35: test-rope ......................... Passed 0.07 sec
Start 36: test-mtmd-c-api
34/35 Test #36: test-mtmd-c-api ................... Passed 0.00 sec
Start 37: test-alloc
35/35 Test #37: test-alloc ........................ Passed 0.00 sec
100% tests passed, 0 tests failed out of 35
Label Time Summary:
main = 46.75 sec*proc (35 tests)
I'd greatly appreciate any feedback that would help get this patch accepted.
Thanks.