UPSTREAM PR #16833: cpu: introduce chunking for repack matmuls and enable matmul-id chunking #11

DajanaV · 2025-10-29T03:26:26Z

Similar to #16829 and tested in tandem.

A very simple dynamic chunking mechanism for repack matmuls. Helps on platforms with significant performance difference between the CPU cores, and helps distribute the work better under load in general.
I tested on M4 Pro and a few Snapdragons but it should work on all platforms.

See the details below.
I included a trace with instrumented matmuls that shows how threads threads endup processing chunks.

## M4 Pro

Before (no other load)
| model                  |       size |     params | backend    | ngl | threads | fa | dev   |   test |            t/s |
| ---------------------- | ---------: | ---------: | ---------- | --: | ------: | -: | ----- | -----: |  ------------: |
| gpt-oss 20B MXFP4 MoE  |  11.27 GiB |    20.91 B | Metal      |   0 |       6 |  1 | none  |  pp256 |   75.67 ± 0.43 |
| gpt-oss 20B MXFP4 MoE  |  11.27 GiB |    20.91 B | Metal      |   0 |       6 |  1 | none  |   tg64 |   56.13 ± 0.26 |

| model                  |       size |     params | backend    | ngl | threads | fa | dev   |   test |            t/s |
| ---------------------- | ---------: | ---------: | ---------- | --: | ------: | -: | ----- | -----: | -------------: |
| llama 3B Q4_0          |   1.78 GiB |     3.21 B | Metal      |   0 |       2 |  1 | none  |  pp256 |  100.81 ± 2.22 |
| llama 3B Q4_0          |   1.78 GiB |     3.21 B | Metal      |   0 |       2 |  1 | none  |   tg64 |   37.27 ± 1.06 |
| llama 3B Q4_0          |   1.78 GiB |     3.21 B | Metal      |   0 |       4 |  1 | none  |  pp256 |  198.34 ± 0.21 |
| llama 3B Q4_0          |   1.78 GiB |     3.21 B | Metal      |   0 |       4 |  1 | none  |   tg64 |   67.88 ± 0.63 |
| llama 3B Q4_0          |   1.78 GiB |     3.21 B | Metal      |   0 |       6 |  1 | none  |  pp256 |  275.03 ± 8.60 |
| llama 3B Q4_0          |   1.78 GiB |     3.21 B | Metal      |   0 |       6 |  1 | none  |   tg64 |   92.09 ± 1.40 |

After (no other load)
| model                  |       size |     params | backend    | ngl | threads | fa | dev   |   test |            t/s |
| ---------------------- | ---------: | ---------: | ---------- | --: | ------: | -: | ----- | -----: |  ------------: |
| gpt-oss 20B MXFP4 MoE  |  11.27 GiB |    20.91 B | Metal      |   0 |       6 |  1 | none  |  pp256 |   76.57 ± 0.15 |
| gpt-oss 20B MXFP4 MoE  |  11.27 GiB |    20.91 B | Metal      |   0 |       6 |  1 | none  |   tg64 |   55.66 ± 0.46 |

| model                  |       size |     params | backend    | ngl | threads | fa | dev   |   test |            t/s |
| ---------------------  | ---------: | ---------: | ---------- | --: | ------: | -: | ----- | -----: | -------------: |
| llama 3B Q4_0          |   1.78 GiB |     3.21 B | Metal      |   0 |       2 |  1 | none  |  pp256 |  105.01 ± 0.33 |
| llama 3B Q4_0          |   1.78 GiB |     3.21 B | Metal      |   0 |       2 |  1 | none  |   tg64 |   38.63 ± 0.10 |
| llama 3B Q4_0          |   1.78 GiB |     3.21 B | Metal      |   0 |       4 |  1 | none  |  pp256 |  198.66 ± 0.19 |
| llama 3B Q4_0          |   1.78 GiB |     3.21 B | Metal      |   0 |       4 |  1 | none  |   tg64 |   67.40 ± 0.29 |
| llama 3B Q4_0          |   1.78 GiB |     3.21 B | Metal      |   0 |       6 |  1 | none  |  pp256 |  290.01 ± 1.31 |
| llama 3B Q4_0          |   1.78 GiB |     3.21 B | Metal      |   0 |       6 |  1 | none  |   tg64 |   89.92 ± 0.09 |


Chunking in action (no load)
thread-5: matmul ffn_up-0 nchunks 4 usec 7219
thread-3: matmul ffn_up-0 nchunks 4 usec 7221
thread-2: matmul ffn_up-0 nchunks 4 usec 7232
thread-1: matmul ffn_up-0 nchunks 4 usec 7247
thread-0: matmul ffn_up-0 nchunks 4 usec 7259
thread-4: matmul ffn_up-0 nchunks 4 usec 7260
thread-3: matmul ffn_out-0 nchunks 4 usec 7402
thread-1: matmul ffn_out-0 nchunks 4 usec 7423
thread-2: matmul ffn_out-0 nchunks 4 usec 7425
thread-4: matmul ffn_out-0 nchunks 4 usec 7402
thread-0: matmul ffn_out-0 nchunks 4 usec 7411
thread-5: matmul ffn_out-0 nchunks 4 usec 7402

Chunking in action (heavy other load)
thread-3: matmul ffn_up-6 nchunks 3 usec 8080
thread-1: matmul ffn_up-6 nchunks 5 usec 9055
thread-4: matmul ffn_up-6 nchunks 5 usec 9070
thread-5: matmul ffn_up-6 nchunks 5 usec 9428
thread-2: matmul ffn_up-6 nchunks 3 usec 9502
thread-0: matmul ffn_up-6 nchunks 3 usec 10552
thread-3: matmul ffn_out-6 nchunks 4 usec 8556
thread-0: matmul ffn_out-6 nchunks 3 usec 8612
thread-4: matmul ffn_out-6 nchunks 4 usec 8809
thread-1: matmul ffn_out-6 nchunks 5 usec 9275
thread-5: matmul ffn_out-6 nchunks 5 usec 9750
thread-2: matmul ffn_out-6 nchunks 3 usec 9963


## Snapdragon 8-Elite Gen5

## LLama3.2 1B Q4_0
  llama_model_loader: - type  f32:   34 tensors
  llama_model_loader: - type q4_0:  112 tensors
  llama_model_loader: - type q6_K:    1 tensors

Before
| model          |       size |     params | backend    | ngl | threads | fa | dev   |   test |            t/s |
| -------------- | ---------: | ---------: | ---------- | --: | ------: | -: | ----- | -----: | -------------: |
| llama 1B Q4_0  | 727.75 MiB |     1.24 B | CPU        |   0 |       2 |  1 | none  |  pp128 |  384.94 ± 9.15 |
| llama 1B Q4_0  | 727.75 MiB |     1.24 B | CPU        |   0 |       2 |  1 | none  |   tg64 |   65.17 ± 1.49 |
| llama 1B Q4_0  | 727.75 MiB |     1.24 B | CPU        |   0 |       4 |  1 | none  |  pp128 |  351.52 ± 0.28 |
| llama 1B Q4_0  | 727.75 MiB |     1.24 B | CPU        |   0 |       4 |  1 | none  |   tg64 |   71.00 ± 1.49 |
| llama 1B Q4_0  | 727.75 MiB |     1.24 B | CPU        |   0 |       6 |  1 | none  |  pp128 |  512.93 ± 1.78 |
| llama 1B Q4_0  | 727.75 MiB |     1.24 B | CPU        |   0 |       6 |  1 | none  |   tg64 |   77.26 ± 1.29 |


After
| model          |       size |     params | backend    | ngl | threads | fa | dev   |    test |            t/s |
| -------------- | ---------: | ---------: | ---------- | --: | ------: | -: | ----- | ------: |--------------: |
| llama 1B Q4_0  | 727.75 MiB |     1.24 B | CPU        |   0 |       2 |  1 | none  |   pp128 |  395.65 ± 7.81 |
| llama 1B Q4_0  | 727.75 MiB |     1.24 B | CPU        |   0 |       2 |  1 | none  |    tg64 |   64.40 ± 0.85 |
| llama 1B Q4_0  | 727.75 MiB |     1.24 B | CPU        |   0 |       4 |  1 | none  |   pp128 |  459.51 ± 1.04 |
| llama 1B Q4_0  | 727.75 MiB |     1.24 B | CPU        |   0 |       4 |  1 | none  |    tg64 |   73.62 ± 0.67 |
| llama 1B Q4_0  | 727.75 MiB |     1.24 B | CPU        |   0 |       6 |  1 | none  |   pp128 |  669.03 ± 1.41 |
| llama 1B Q4_0  | 727.75 MiB |     1.24 B | CPU        |   0 |       6 |  1 | none  |    tg64 |   79.75 ± 0.56 |


## Llama3.2 3B Q4_0
  llama_model_loader: - type  f32:   58 tensors
  llama_model_loader: - type q4_0:  196 tensors
  llama_model_loader: - type q6_K:    1 tensors

Before
| model           |       size |     params | backend    | ngl | threads | fa | dev   |  test |             t/s |
| --------------- | ---------: | ---------: | ---------- | --: | ------: | -: | ----- | ----: | --------------: |
| llama 3B Q4_0   |   1.78 GiB |     3.21 B | CPU        |   0 |       2 |  1 | none  | pp128 |   127.73 ± 2.43 |
| llama 3B Q4_0   |   1.78 GiB |     3.21 B | CPU        |   0 |       2 |  1 | none  |  tg64 |    27.91 ± 0.61 |
| llama 3B Q4_0   |   1.78 GiB |     3.21 B | CPU        |   0 |       4 |  1 | none  | pp128 |   122.97 ± 0.02 |
| llama 3B Q4_0   |   1.78 GiB |     3.21 B | CPU        |   0 |       4 |  1 | none  |  tg64 |    29.72 ± 1.09 |
| llama 3B Q4_0   |   1.78 GiB |     3.21 B | CPU        |   0 |       6 |  1 | none  | pp128 |  159.59 ± 14.06 |
| llama 3B Q4_0   |   1.78 GiB |     3.21 B | CPU        |   0 |       6 |  1 | none  |  tg64 |    30.33 ± 0.60 |


After
| model           |       size |     params | backend    | ngl | threads | fa | dev   |  test |             t/s |
| --------------- | ---------: | ---------: | ---------- | --: | ------: | -: | ----- | ----: | --------------: |
| llama 3B Q4_0   |   1.78 GiB |     3.21 B | CPU        |   0 |       2 |  1 | none  | pp128 |   128.16 ± 2.09 |
| llama 3B Q4_0   |   1.78 GiB |     3.21 B | CPU        |   0 |       2 |  1 | none  |  tg64 |    27.46 ± 0.47 |
| llama 3B Q4_0   |   1.78 GiB |     3.21 B | CPU        |   0 |       4 |  1 | none  | pp128 |   161.89 ± 0.30 |
| llama 3B Q4_0   |   1.78 GiB |     3.21 B | CPU        |   0 |       4 |  1 | none  |  tg64 |    30.07 ± 0.65 |
| llama 3B Q4_0   |   1.78 GiB |     3.21 B | CPU        |   0 |       6 |  1 | none  | pp128 |   227.02 ± 7.26 |
| llama 3B Q4_0   |   1.78 GiB |     3.21 B | CPU        |   0 |       6 |  1 | none  |  tg64 |    32.16 ± 0.58 |


## Llama3.2 1B chunking in action

thread-0: matmul ffn_up-11 nchunks 7 usec 143
thread-5: matmul ffn_up-11 nchunks 8 usec 147
thread-3: matmul ffn_up-11 nchunks 2 usec 150
thread-1: matmul ffn_up-11 nchunks 2 usec 150
thread-2: matmul ffn_up-11 nchunks 2 usec 152
thread-4: matmul ffn_up-11 nchunks 3 usec 158
thread-0: matmul ffn_out-11 nchunks 7 usec 124
thread-1: matmul ffn_out-11 nchunks 2 usec 125
thread-5: matmul ffn_out-11 nchunks 8 usec 128
thread-4: matmul ffn_out-11 nchunks 2 usec 129
thread-2: matmul ffn_out-11 nchunks 2 usec 139
thread-3: matmul ffn_out-11 nchunks 3 usec 150


## Galaxy S25+ (Snapdragon 8-Elite Gen4)

## LLama3.2 1B chunking in action

thread-2: matmul ffn_up-11 nchunks 6 usec 147
thread-3: matmul ffn_up-11 nchunks 3 usec 150
thread-0: matmul ffn_up-11 nchunks 6 usec 147
thread-5: matmul ffn_up-11 nchunks 3 usec 150
thread-1: matmul ffn_up-11 nchunks 3 usec 147
thread-4: matmul ffn_up-11 nchunks 3 usec 152
thread-4: matmul ffn_out-11 nchunks 3 usec 136
thread-2: matmul ffn_out-11 nchunks 6 usec 142
thread-5: matmul ffn_out-11 nchunks 3 usec 146
thread-1: matmul ffn_out-11 nchunks 3 usec 136
thread-0: matmul ffn_out-11 nchunks 6 usec 144
thread-3: matmul ffn_out-11 nchunks 3 usec 136

…ing on ARM64 Very similar implementation to the flash-attention chunking, with similar benefits.

loci-agentic-ai-dev · 2025-10-29T04:33:39Z

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: llama.cpp PR #11

Key Findings

Performance Degradations Identified

Response Time: _RegexTranslatorBase@plt shows 0.066% degradation (7.37 ns vs 7.37 ns base)
Throughput: stack constructor exhibits 0.072% degradation (92.29 ns vs 92.22 ns base)
Bottleneck: _Construct template demonstrates 0.131% degradation (19.67 ns vs 19.64 ns base)

Core Function Impact Assessment

Minimal Impact on Critical Components: The degradations occur in auxiliary functions rather than core llama.cpp performance-critical areas:

Matrix Operations: Unaffected by the observed degradations
Attention Mechanisms: No performance impact detected
Quantization/Dequantization: Core inference paths remain stable
Memory Management: KV cache and tensor operations show no degradation

The affected functions (_RegexTranslatorBase@plt, STL constructors) are part of grammar parsing infrastructure, representing minimal computational overhead during LLM inference.

Power Consumption Analysis

Negligible Energy Impact: Overall power consumption shows 0.001% improvement

Primary Binary: build.bin.libllama.so demonstrates slight efficiency gain
Supporting Libraries: No measurable power consumption changes in GGML components
System Efficiency: Stable energy profile despite minor function-level degradations

Flame Graph and CFG Analysis

PLT Overhead Confirmation:

Structural Analysis: Identical control flow graphs between versions confirm no code-level changes
Assembly Verification: Zero differences in instruction sequences for degraded functions
Root Cause: Performance degradation stems from dynamic linking overhead, not algorithmic changes
External Factors: Library loading order or symbol resolution timing variations likely responsible

GitHub Code Review Insights

Positive Optimization Changes:

ARM64 Enablement: Removes artificial chunking restrictions on ARM64 platforms
Dynamic Load Balancing: Implements 4x chunks per thread for better work distribution
Architecture Unification: Consolidates chunking logic across platforms
Performance Gains: Benchmark data shows 5-30% throughput improvements on target platforms

No Critical Risks Identified: Changes maintain backward compatibility and include appropriate fallback mechanisms.

Overall Assessment

Change Impact Evaluation

Net Positive Performance Impact: While minor degradations exist in auxiliary functions, the core matrix multiplication optimizations provide substantial benefits:

Inference Performance: Expected improvements in token generation throughput
Hardware Utilization: Better CPU core utilization on heterogeneous architectures
System Stability: Maintained through careful preservation of existing synchronization patterns

Maintainability Considerations

Well-Engineered Implementation:

Code Structure: Clean separation of chunking logic with clear fallback paths
Testing Coverage: Comprehensive validation across multiple ARM platforms
Documentation: Detailed performance benchmarks and trace analysis provided

Future Performance Considerations

Monitoring Recommendations:

Dynamic Linking Overhead: Investigate PLT performance variations in production environments
Chunking Effectiveness: Profile matrix multiplication performance across diverse hardware configurations
Memory Alignment: Validate chunk boundary calculations maintain optimal cache performance

Optimization Opportunities:

Static Linking: Consider eliminating PLT overhead for frequently-used grammar parsing components
Adaptive Chunking: Implement hardware-aware chunk sizing for optimal performance scaling

The changes represent a mature optimization that addresses real performance bottlenecks while maintaining system reliability. The minor degradations in auxiliary functions are overshadowed by significant improvements in core computational pathways, resulting in a net positive impact on llama.cpp performance and maintainability.

cpu: introduce chunking for repack matmuls and enable matmul-id chunk…

3492441

…ing on ARM64 Very similar implementation to the flash-attention chunking, with similar benefits.

DajanaV force-pushed the main branch 2 times, most recently from 1983956 to 326a60a Compare October 29, 2025 12:13

DajanaV added the dev-stale Stale dev environment — dashboard not accessible label Oct 30, 2025

DajanaV deleted the branch main October 30, 2025 15:25

DajanaV closed this Oct 30, 2025

DajanaV deleted the upstream-PR16833-branch_qualcomm-repack-matmul-chunking branch October 30, 2025 15:25

DajanaV mentioned this pull request Nov 18, 2025

UPSTREAM PR #17342: Throughput improvement for small batch sizes #248

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

UPSTREAM PR #16833: cpu: introduce chunking for repack matmuls and enable matmul-id chunking #11

UPSTREAM PR #16833: cpu: introduce chunking for repack matmuls and enable matmul-id chunking #11

Uh oh!

DajanaV commented Oct 29, 2025

Uh oh!

loci-agentic-ai-dev bot commented Oct 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

UPSTREAM PR #16833: cpu: introduce chunking for repack matmuls and enable matmul-id chunking #11

UPSTREAM PR #16833: cpu: introduce chunking for repack matmuls and enable matmul-id chunking #11

Uh oh!

Conversation

DajanaV commented Oct 29, 2025

Uh oh!

loci-agentic-ai-dev bot commented Oct 29, 2025

Performance Analysis Summary: llama.cpp PR #11

Key Findings

Performance Degradations Identified

Core Function Impact Assessment

Power Consumption Analysis

Flame Graph and CFG Analysis

GitHub Code Review Insights

Overall Assessment

Change Impact Evaluation

Maintainability Considerations

Future Performance Considerations

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants