Conversation

@DajanaV (Collaborator) commented Nov 18, 2025

Mirrored from ggml-org/llama.cpp#17342

I came across a core-scaling issue while running llama-bench on a machine with a large core count and small batch sizes. During the investigation I found cache-line contention causing the scaling problem. This patch fixes the contention.
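To illustrate the failure mode, here is a minimal standalone sketch (hypothetical code, not taken from llama.cpp): when every worker pulls its next chunk from a single shared atomic counter, each fetch_add drags the counter's cache line between cores, which shows up in profiles as HITM events (loads that hit a modified line in another core's cache).

```cpp
// Hypothetical microbenchmark of the contended pattern; all names are made up.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int  n_threads = 8;
    const long n_chunks  = 1000000;

    std::atomic<long> next_chunk{0};  // shared counter: the contended cache line
    std::atomic<long> checksum{0};

    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            long sum = 0;
            // Dynamic chunk allocation: every iteration writes the shared line.
            for (long c = next_chunk.fetch_add(1); c < n_chunks;
                 c = next_chunk.fetch_add(1)) {
                sum += c;  // stand-in for the real per-chunk work
            }
            checksum.fetch_add(sum);
        });
    }
    for (auto & w : workers) w.join();
    std::printf("checksum: %ld\n", checksum.load());
    return 0;
}
```

A tool such as perf c2c can attribute HITM samples to the fetch_add line in a pattern like this; the patch removes that pattern from the sgemm hot loop.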

With this patch I've seen throughput improvements ranging from 2% to 44% while running the Qwen3 30B parameter model.

Results were obtained with the following command:
$ llama-bench -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_S.gguf -p 500 -n 0 -t n -b 16,32,64,96,128
where n is the number of threads.

Here are the results:

  1. Constant batch size = 16 with varying number of threads (TPS ratio, Patched/Baseline):

     Threads   TPS
     1         1.00
     2         1.00
     4         1.00
     8         1.00
     16        1.03
     32        1.09
     64        1.16
     96        1.20

  2. Constant number of threads = 96 with varying batch size (TPS ratio, Patched/Baseline):

     Batch size   TPS
     1            1.00
     2            1.44
     4            1.34
     8            1.27
     16           1.20
     32           1.16
     64           1.11
     96           1.07
     128          1.05
     512          1.02
     1024         1.02

==== Test Results ====

100% tests passed, 0 tests failed out of 35

Label Time Summary:
main = 46.75 sec*proc (35 tests)

I'd greatly appreciate any feedback to help get this patch accepted.
Thanks.

…-line contention (cache HITM)

This improves throughput in cases where threads have to wait for lack of work, causing the process
to spend many cycles in a spin loop. The dynamic chunk counter is replaced with static stride
partitioning, which eliminates the shared counter entirely.

* remove one barrier in sgemm()

* static stride partitioning
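For reference, here is a rough sketch of the static-stride idea (hypothetical code, assuming a tinyBLAS-style loop over output tiles; process_tile stands in for the real tile kernel and is not a function in sgemm.cpp):

```cpp
#include <cstdio>

// Stand-in for the tinyBLAS tile kernel.
static void process_tile(long tile) { std::printf("tile %ld\n", tile); }

// Thread ith handles tiles ith, ith + n_threads, ith + 2*n_threads, ...
// The partition is fixed up front, so the hot loop never writes shared
// state (no chunk counter), and no barrier is needed to hand out work.
static void run_tiles(long n_tiles, int ith, int n_threads) {
    for (long tile = ith; tile < n_tiles; tile += n_threads) {
        process_tile(tile);
    }
}

int main() {
    // Serial driver for illustration; in practice each worker thread
    // would call run_tiles with its own thread index.
    const long n_tiles   = 10;
    const int  n_threads = 4;
    for (int ith = 0; ith < n_threads; ++ith) {
        run_tiles(n_tiles, ith, n_threads);
    }
    return 0;
}
```

The trade-off is that a static partition can load-balance worse than work stealing when tile costs vary, but for uniform GEMM tiles the contention savings dominate.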
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of PR #248 reveals a 23.86% response time increase in the quantize_row_iq4_xs function (79 ns → 98 ns), the largest performance change identified. However, this regression is unrelated to the actual code modifications in the pull request.

Code Changes Analysis

The PR implements a targeted optimization in ggml/src/ggml-cpu/llamafile/sgemm.cpp within the tinyBLAS matrix multiplication class:

  • Removed thread barrier and dynamic job allocation
  • Replaced work-stealing with static work distribution using a simple for-loop
  • Eliminated atomic operations that caused cache-line contention

These changes address legitimate scaling issues in multi-threaded environments with small batch sizes, showing 2-44% throughput improvements in benchmarks.

Key Findings

Performance Impact:

  • The quantize_row_iq4_xs regression (19 ns absolute increase) stems from memory layout changes during binary compilation, specifically string literal relocation to different memory pages
  • No core inference functions affected: llama_decode, llama_encode, and llama_tokenize show no performance changes
  • Tokens per second impact: Zero, as no tokenization or inference functions experienced response time changes

Power Consumption:

  • 3.15% decrease in power consumption for build.bin.libggml-cpu.so (10,060 nJ → 9,743 nJ)
  • Net energy reduction despite quantization regression indicates overall efficiency gains

Technical Analysis:

  • Flame graph: shows shallow execution with 85.7% self-time in the quantization function, with PLT call overhead contributing the remaining 14.3%
  • CFG comparison: Identical control flow structure; only assembly differences are memory address relocations for assertion strings
  • Code review: Clean, well-targeted optimization addressing documented performance bottleneck

Actionable Recommendations:

  1. Review linker configuration to maintain string literal locality and prevent memory page fragmentation
  2. Consider assertion removal in performance-critical builds to eliminate the 14 ns PLT overhead

The core optimization is sound and beneficial; the quantization regression requires separate binary layout investigation.
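As a sketch of recommendation 2 (assuming the assertions involved are standard assert() calls; ggml's own GGML_ASSERT macro may be compiled in unconditionally and would need separate handling), a CMake Release build defines NDEBUG, which compiles assert() out:

$ cmake -B build -DCMAKE_BUILD_TYPE=Release
$ cmake --build build -j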

@DajanaV force-pushed the main branch 3 times, most recently from f333350 to 9c4623f on November 18, 2025 09:10
@loci-dev force-pushed the main branch 24 times, most recently from d2e6325 to 22143ca on November 24, 2025 14:09
@loci-dev force-pushed the main branch 24 times, most recently from a89c6ad to ad5ad9a on November 27, 2025 14:08