
Conversation


dbsanfte commented Sep 15, 2025

This PR adds a new --numa mirror option which mirrors the model weights to each NUMA node on the system and uses a thread-local variable in the OpenMP thread pool to select, at runtime, the mirror copy local to the calling thread, eliminating cross-socket traffic for weight accesses.
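
A minimal sketch of the hot-path idea, assuming a per-node pointer table and the thread-local `ggml_current_numa_node` described further down; the struct layout and exact names are illustrative, not taken from the actual diff:

```cpp
// Illustrative sketch only: shows the thread-local mirror selection described
// above; the real names and data layout in the PR may differ.
constexpr int GGML_NUMA_MAX_NODES = 8; // assumed upper bound for this sketch

// Set once per OpenMP worker thread when it is pinned to a NUMA node.
static thread_local int ggml_current_numa_node = 0;

struct tensor_numa_mirrors {
    void * data[GGML_NUMA_MAX_NODES] = {}; // one copy of the weights per node
};

// Hot path: a single array index per access, no locks, no cross-socket reads.
static inline void * tensor_data(const tensor_numa_mirrors & t) {
    return t.data[ggml_current_numa_node];
}
```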

Build instructions:

apt-get update
apt-get install -y libnuma-dev libgomp1
cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS="-march=native" -DCMAKE_CXX_FLAGS="-march=native"  -DGGML_OPENMP=ON
cmake --build build --parallel

To test:

# No mirroring:
./build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-UD-Q4_K_XL.gguf

# NUMA mirroring of model weights to every node:
./build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-UD-Q4_K_XL.gguf --numa mirror

Test system is a two-socket Xeon 6238R Cascade Lake, with 768GB of DDR4-2933 (6 channels per socket).

Without --numa mirror:

developer@81ec6c6e6af6:/workspaces/llama-cpp-dbsanfte-dev$ ./build/bin/llama-bench -m ./.devcontainer/Qwen3-32B-Q6_K.gguf     
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           pp512 |         20.99 ± 0.01 |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           tg128 |          1.91 ± 0.00 |

build: c665d3c9 (6468)

With --numa mirror:

developer@81ec6c6e6af6:/workspaces/llama-cpp-dbsanfte-dev$ ./build/bin/llama-bench -m ./.devcontainer/Qwen3-32B-Q6_K.gguf --numa mirror
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           pp512 |         21.36 ± 0.11 |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           tg128 |          2.70 ± 0.00 |

build: c665d3c9 (6468)

Intel PCM during mirrored inference, showing both sockets using local memory:

[Intel PCM screenshot]

There's still a bit of cross-socket traffic (about 5%) because only the model weights are mirrored, not tensors created at inference time. I'll experiment with that; mirroring those aggressively may help too, or it may not. Right now anything created at inference time is simply placed on node 0.

- Achieved 5% inference speed improvement (14.6 -> 15.3 t/s)
- Clean explicit NUMA setup during model loading
- Ultra-minimal hot path with thread-local NUMA node access
- Working NUMA mirrors for all model weights
- Performance: text generation improved, prompt processing needs optimization

Performance Results (Qwen3-30B-A3B):
- Text Generation: 14.6 -> 15.3 t/s (+5% improvement)
- Prompt Processing: 176 -> 152 t/s (14% regression - needs investigation)

Technical Implementation:
- tensor_data(): O(1) NUMA-aware access via the thread-local ggml_current_numa_node
- tensor_set_data_with_numa_mirrors(): Explicit NUMA setup for model weights (see the sketch after this list)
- NUMA coordinator: Thread binding and memory locality
- Clean separation: model loading (explicit setup) vs. inference (fast access)
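
For reference, a rough sketch of what the explicit load-time setup could look like, continuing the illustrative types above (the libnuma calls are real; the function name follows the bullet above, but its signature and error handling are assumptions):

```cpp
#include <numa.h>    // numa_num_configured_nodes, numa_alloc_onnode
#include <cstddef>   // size_t
#include <cstring>   // std::memcpy

// Sketch: copy a weight tensor onto every NUMA node at load time and record
// the per-node pointers that tensor_data() above indexes into.
static bool tensor_set_data_with_numa_mirrors(tensor_numa_mirrors & t,
                                              const void * src, size_t nbytes) {
    const int n_nodes = numa_num_configured_nodes();
    for (int node = 0; node < n_nodes && node < GGML_NUMA_MAX_NODES; ++node) {
        void * copy = numa_alloc_onnode(nbytes, node); // page-aligned, bound to `node`
        if (copy == nullptr) {
            return false; // a real implementation would fall back to the non-mirrored path
        }
        std::memcpy(copy, src, nbytes);
        t.data[node] = copy;
    }
    return true;
}
```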
github-actions bot added the testing, Nvidia GPU, Vulkan, examples, python, devops, ggml, SYCL, Apple Metal, Ascend NPU, OpenCL, and IBM zDNN labels on Sep 15, 2025
dbsanfte marked this pull request as a draft on September 15, 2025 at 06:14

dbsanfte commented Sep 15, 2025

Physical core detection was very broken in arg.cpp / common.cpp: it assumed every second consecutive core was a hyperthread. That isn't true on Xeons, at least; my physical cores are the first 56, and the next 56 are their hyperthreads. This led to wildly inconsistent results at inference time. Now I use proper CPU topology detection to choose only physical cores. If you still want to use hyperthreads, I added a new option: --cpu-use-hyperthreading.
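
Something along these lines, based on the kernel's sysfs topology files, avoids the "every second core is a hyperthread" assumption (a sketch of the approach, not the PR's actual code):

```cpp
#include <fstream>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Sketch: return one logical CPU per physical core by deduplicating on the
// (physical_package_id, core_id) pair reported by the kernel, instead of
// assuming SMT siblings are interleaved with physical cores.
static std::vector<int> physical_cpu_list() {
    std::vector<int> cpus;
    std::set<std::pair<int, int>> seen; // (physical_package_id, core_id)
    for (int cpu = 0; ; ++cpu) {
        const std::string base = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/topology/";
        std::ifstream core_f(base + "core_id");
        std::ifstream pkg_f(base + "physical_package_id");
        if (!core_f || !pkg_f) {
            break; // no more CPUs (offline-CPU gaps are ignored in this sketch)
        }
        int core_id = -1, pkg_id = -1;
        core_f >> core_id;
        pkg_f >> pkg_id;
        if (seen.insert({pkg_id, core_id}).second) {
            cpus.push_back(cpu); // first SMT sibling seen for this physical core
        }
    }
    return cpus;
}
```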


Ph0rk0z commented Sep 15, 2025

I hate to say it, but that doesn't look like a big gain. Have you compared it against just using --numa distribute with interleave=all? For contrast, fastllm in NUMA mode gives 7 t/s on Qwen-235B on slower RAM than yours.

@usrlocalben

I see that commits are still incoming, but currently a clean checkout / build is giving many instances of:

llama.cpp/ggml/src/ggml-cuda/conv2d-dw.cu(124): error: class "ggml_tensor" has no member "data"
      const float * w_d = (const float *) kernel->data;

@dbsanfte

> I see that commits are still incoming, but currently a clean checkout / build is giving many instances of:
>
> llama.cpp/ggml/src/ggml-cuda/conv2d-dw.cu(124): error: class "ggml_tensor" has no member "data"
>       const float * w_d = (const float *) kernel->data;

Fixed in the latest commit. Sorry, I've been running CPU-only to simplify my testing and forgot to compile CUDA :)

I'll fix ROCm too.


dbsanfte commented Sep 15, 2025

> I hate to say it, but that doesn't look like a big gain. Have you compared it against just using --numa distribute with interleave=all? For contrast, fastllm in NUMA mode gives 7 t/s on Qwen-235B on slower RAM than yours.

This is CPU-only inference, no GPU.

dbsanfte@xeon:~/llama.cpp-vanilla$ numactl --interleave=all ./build/bin/llama-bench -m ~/models/Qwen3-32B-Q6_K.gguf --numa distribute
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           pp512 |         20.88 ± 0.09 |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           tg128 |          2.44 ± 0.00 |

build: 28c39da7 (6478)

52GB/s across the UPI link 💀

Only 50% local access on each node (as would be expected):
[Intel PCM screenshot]

Not as good as --numa mirror.

Only 8GB/s over UPI link, 97% average local access:
[Intel PCM screenshot]

dbsanfte@xeon:~/llama-cpp-dbsanfte-new-fork$ ./build/bin/llama-bench -m ~/models/Qwen3-32B-Q6_K.gguf --numa mirror
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           pp512 |         21.28 ± 0.12 |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           tg128 |          2.77 ± 0.00 |

build: 34a50172 (6488)

Both prompt processing and inference are faster.


usrlocalben commented Sep 15, 2025

I tried it out and took some basic measurements.

Model is anikifoss V3.1 quant (Q8 attn/shared, Q4/Q5 mixed MoE)

Platform is 2S EPYC 9115 NPS1 w/24x 64GB DDR5-4800 + RTX 6000 Pro (Blackwell)

worth noting: 2S EPYC NPS2/NPS4 would require 4x and 8x copies 😳

All attention/shexp on GPU, MoE on CPU.

AFAIK the best NUMA option available on the master branch is --numa distribute.
My understanding is that page/node assignment is stochastic, since it relies on mmap for NUMA "migration", so it's important to drop caches before loading so that the warm-up phase populates the page cache consistently with the threads that touch it.

numa-mirror

--numa mirror

prompt eval time =    6836.72 ms /    94 tokens (   72.73 ms per token,    13.75 tokens per second)
       eval time =   71504.51 ms /  1189 tokens (   60.14 ms per token,    16.63 tokens per second)
      total time =   78341.24 ms /  1283 tokens

--numa distribute (after drop_caches)

prompt eval time =    7068.73 ms /    94 tokens (   75.20 ms per token,    13.30 tokens per second)
       eval time =  127871.15 ms /  1417 tokens (   90.24 ms per token,    11.08 tokens per second)
      total time =  134939.88 ms /  1511 tokens

note: the distribute run on this branch still loaded the model with mirroring.

master

--numa distribute (after drop_caches)

prompt eval time =    8035.73 ms /    94 tokens (   85.49 ms per token,    11.70 tokens per second)
       eval time =   85014.67 ms /  1406 tokens (   60.47 ms per token,    16.54 tokens per second)
      total time =   93050.40 ms /  1500 tokens

For NUMA I'm only interested in TG perf since PP can be solved by offload/batch.

I'll ignore numa-mirror + distribute since whatever is happening there is probably unintended.

master + distribute: 60.47ms/t
numa-mirror + mirror: 60.14ms/t

An improvement of 0.5% in exchange for 2x RAM cost.
(And I only did one run each, so it might not even be real)

Worth mentioning: the loader here is much better than the master/distribute situation.
The need for drop_caches, plus a loader that is slower by a factor, is not a great experience if one changes models often.

My impression is that the NUMA implementation in this change has much more depth than before. Could it be used to implement expert parallelism? That should be much closer to the 1.5-2x improvement everyone likely desires, and without the RAM cost, even on NPS2/NPS4.

Invocation used:

--host 0.0.0.0 --port 9999 -np 1
-b 4096 -ub 4096
-fa on
-m anikifoss/HQ4_K/DeepSeek-V3.1-HQ4_K-00001-of-00010.gguf
--numa {mirror,distribute}
-ngl 99
-ot "blk\.([3-9])\.ffn_up_exps=CUDA0,blk\.([3-9])\.ffn_gate_exps=CUDA0"
-ot exps=CPU
-c 96000


Ph0rk0z commented Sep 15, 2025

> This is CPU-only inference, no GPU.

Yes, no GPU. In fastLLM, CUDA and NUMA don't work together; you can run CUDA + CPU, or CUDA for PP and NUMA for TG.

I'm curious what kind of figures you get on the other test for RAM bandwidth utilization. FastLLM takes all of my 230 GB/s. It's thorny to actually use day-to-day for other reasons, but that's the metric to shoot for.

It could be good to look at their code and see what strategy they use to get much more than a 10% improvement. You do have me wondering whether I'm bottlenecked by my QPI link, though, given that 52 GB/s figure.

> node assignment is stochastic as it relies on mmap

Heh, I don't use mmap for models because it's slower. You should run llama-sweep-bench for a better benchmark too; there's a port of it floating around for mainline.


usrlocalben commented Sep 15, 2025

> Heh, I don't use mmap for models because it's slower.

I may be wrong about the mmap part now that you mention it.

As for benching MoE-only, it doesn't seem like a sweep is necessary. Measuring TG at low context should be best, since attention is at a minimum and, on a fast GPU, is basically a small constant factor with e.g. ctx < 1000. Sweeps with attention on the GPU mostly just measure, well, attention; expert computation is constant as I understand it.

I don't think I'm bottlenecked anywhere, as ik_ + distribute gives another 10+% TG, and the last time I measured with likwid-bench, these values are still only ~50-60% of raw read capacity (even on the low-end 9115, in Performance mode likwid-bench gives ~212 GB/s per socket).


dbsanfte commented Sep 15, 2025

Mirroring takes care of the cross-socket traffic problem.

The kernels and compute infrastructure need looking at, but that's really outside the scope of this PR.

One thing at a time, gents.


dbsanfte commented Sep 15, 2025

I noticed something interesting while testing Qwen3 30B A3B: MoE models still have significant cross-socket traffic during inference:

[Intel PCM screenshot]

I don't see this with dense models like Qwen3 32B; those behave as shown above.

I'm going to see what's going on with that.

Edit: I think it's the repacking; the repacked tensors aren't getting NUMA mirrors. I'll see if I can fix this tomorrow.


Ph0rk0z commented Sep 15, 2025

> As for benching MoE-only, it doesn't seem like a sweep is necessary. Measuring TG at low context should be best since attn.

Not sure; TG falls as context builds, and some changes only show a benefit later on. Multiple runs also give you a better picture.

> Mirroring takes care of the cross-socket traffic problem.

As long as you have double the RAM for the weights. :P
