
Conversation


dbsanfte commented Sep 15, 2025

This PR adds a new --numa mirror option which mirrors the model weights to each NUMA node on the system and uses a thread-local variable in the OpenMP thread pool to select, at runtime, the mirror copy local to the calling thread, eliminating cross-socket traffic for weight accesses.
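
A minimal sketch of the hot-path idea, assuming a per-node pointer table and the thread-local `ggml_current_numa_node` described further down; the struct layout and exact names are illustrative, not taken from the actual diff:

```cpp
// Illustrative sketch only: shows the thread-local mirror selection described
// above; the real names and data layout in the PR may differ.
constexpr int GGML_NUMA_MAX_NODES = 8; // assumed upper bound for this sketch

// Set once per OpenMP worker thread when it is pinned to a NUMA node.
static thread_local int ggml_current_numa_node = 0;

struct tensor_numa_mirrors {
    void * data[GGML_NUMA_MAX_NODES] = {}; // one copy of the weights per node
};

// Hot path: a single array index per access, no locks, no cross-socket reads.
static inline void * tensor_data(const tensor_numa_mirrors & t) {
    return t.data[ggml_current_numa_node];
}
```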

Build instructions:

apt-get update
apt-get install -y libnuma-dev libgomp1
cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS="-march=native" -DCMAKE_CXX_FLAGS="-march=native"  -DGGML_OPENMP=ON
cmake --build build --parallel

To test:

# No mirroring:
./build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-UD-Q4_K_XL.gguf

# NUMA mirroring of model weights to every node:
./build/bin/llama-bench -m ~/models/Qwen3-30B-A3B-UD-Q4_K_XL.gguf --numa mirror

Test system is a two-socket Xeon 6238R Cascade Lake, with 768GB of DDR4-2933 (6 channels per socket).

Without --numa mirror:

developer@81ec6c6e6af6:/workspaces/llama-cpp-dbsanfte-dev$ ./build/bin/llama-bench -m ./.devcontainer/Qwen3-32B-Q6_K.gguf     
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           pp512 |         20.99 ± 0.01 |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           tg128 |          1.91 ± 0.00 |

build: c665d3c9 (6468)

With --numa mirror:

developer@81ec6c6e6af6:/workspaces/llama-cpp-dbsanfte-dev$ ./build/bin/llama-bench -m ./.devcontainer/Qwen3-32B-Q6_K.gguf --numa mirror
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           pp512 |         21.36 ± 0.11 |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           tg128 |          2.70 ± 0.00 |

build: c665d3c9 (6468)

Intel PCM during mirrored inference, showing both sockets using local memory:

[Intel PCM screenshot]

There's still a bit of cross-socket traffic (about 5%) because only the model weights are mirrored, not tensors created at inference time. I'll experiment with that; mirroring those aggressively may help too, or it may not. Right now anything created at inference time is simply placed on node 0.

- Achieved 5% inference speed improvement (14.6 -> 15.3 t/s)
- Clean explicit NUMA setup during model loading
- Ultra-minimal hot path with thread-local NUMA node access
- Working NUMA mirrors for all model weights
- Performance: text generation improved, prompt processing needs optimization

Performance Results (Qwen3-30B-A3B):
- Text Generation: 14.6 -> 15.3 t/s (+5% improvement)
- Prompt Processing: 176 -> 152 t/s (14% regression - needs investigation)

Technical Implementation:
- tensor_data(): O(1) NUMA-aware access via the thread-local ggml_current_numa_node
- tensor_set_data_with_numa_mirrors(): Explicit NUMA setup for model weights (see the sketch after this list)
- NUMA coordinator: Thread binding and memory locality
- Clean separation: model loading (explicit setup) vs. inference (fast access)
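
For reference, a rough sketch of what the explicit load-time setup could look like, continuing the illustrative types above (the libnuma calls are real; the function name follows the bullet above, but its signature and error handling are assumptions):

```cpp
#include <numa.h>    // numa_num_configured_nodes, numa_alloc_onnode
#include <cstddef>   // size_t
#include <cstring>   // std::memcpy

// Sketch: copy a weight tensor onto every NUMA node at load time and record
// the per-node pointers that tensor_data() above indexes into.
static bool tensor_set_data_with_numa_mirrors(tensor_numa_mirrors & t,
                                              const void * src, size_t nbytes) {
    const int n_nodes = numa_num_configured_nodes();
    for (int node = 0; node < n_nodes && node < GGML_NUMA_MAX_NODES; ++node) {
        void * copy = numa_alloc_onnode(nbytes, node); // page-aligned, bound to `node`
        if (copy == nullptr) {
            return false; // a real implementation would fall back to the non-mirrored path
        }
        std::memcpy(copy, src, nbytes);
        t.data[node] = copy;
    }
    return true;
}
```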
github-actions bot added the testing, Nvidia GPU, Vulkan, examples, python, devops, ggml, SYCL, Apple Metal, Ascend NPU, OpenCL, and IBM zDNN labels on Sep 15, 2025
dbsanfte marked this pull request as a draft on September 15, 2025 at 06:14

dbsanfte commented Sep 15, 2025

Physical core detection was very broken in arg.cpp / common.cpp: it assumed every second consecutive core was a hyperthread. That isn't true on Xeons, at least; my physical cores are the first 56, and the next 56 are their hyperthreads. This led to wildly inconsistent results at inference time. Now I use proper CPU topology detection to choose only physical cores. If you still want to use hyperthreads, I added a new option: --cpu-use-hyperthreading.
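
Something along these lines, based on the kernel's sysfs topology files, avoids the "every second core is a hyperthread" assumption (a sketch of the approach, not the PR's actual code):

```cpp
#include <fstream>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Sketch: return one logical CPU per physical core by deduplicating on the
// (physical_package_id, core_id) pair reported by the kernel, instead of
// assuming SMT siblings are interleaved with physical cores.
static std::vector<int> physical_cpu_list() {
    std::vector<int> cpus;
    std::set<std::pair<int, int>> seen; // (physical_package_id, core_id)
    for (int cpu = 0; ; ++cpu) {
        const std::string base = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/topology/";
        std::ifstream core_f(base + "core_id");
        std::ifstream pkg_f(base + "physical_package_id");
        if (!core_f || !pkg_f) {
            break; // no more CPUs (offline-CPU gaps are ignored in this sketch)
        }
        int core_id = -1, pkg_id = -1;
        core_f >> core_id;
        pkg_f >> pkg_id;
        if (seen.insert({pkg_id, core_id}).second) {
            cpus.push_back(cpu); // first SMT sibling seen for this physical core
        }
    }
    return cpus;
}
```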


Ph0rk0z commented Sep 15, 2025

I hate to say it, but that doesn't look like a big gain. Have you compared it against just using --numa distribute with interleave=all? For contrast, fastllm in NUMA mode gives 7 t/s on Qwen-235B on slower RAM than yours.

@usrlocalben

I see that commits are still incoming, but currently a clean checkout / build is giving many instances of:

llama.cpp/ggml/src/ggml-cuda/conv2d-dw.cu(124): error: class "ggml_tensor" has no member "data"
      const float * w_d = (const float *) kernel->data;

@dbsanfte

> I see that commits are still incoming, but currently a clean checkout / build is giving many instances of:
>
> llama.cpp/ggml/src/ggml-cuda/conv2d-dw.cu(124): error: class "ggml_tensor" has no member "data"
>       const float * w_d = (const float *) kernel->data;

Fixed in the latest commit. Sorry, I've been running CPU-only to simplify my testing and forgot to compile CUDA :)

I'll fix ROCm too.


dbsanfte commented Sep 15, 2025

> I hate to say it, but that doesn't look like a big gain. Have you compared it against just using --numa distribute with interleave=all? For contrast, fastllm in NUMA mode gives 7 t/s on Qwen-235B on slower RAM than yours.

This is CPU-only inference, no GPU.

dbsanfte@xeon:~/llama.cpp-vanilla$ numactl --interleave=all ./build/bin/llama-bench -m ~/models/Qwen3-32B-Q6_K.gguf --numa distribute
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           pp512 |         20.88 ± 0.09 |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           tg128 |          2.44 ± 0.00 |

build: 28c39da7 (6478)

52GB/s across the UPI link 💀

Only 50% local access on each node (as would be expected):
[Intel PCM screenshot]

Not as good as --numa mirror.

Only 8GB/s over UPI link, 97% average local access:
[Intel PCM screenshot]

dbsanfte@xeon:~/llama-cpp-dbsanfte-new-fork$ ./build/bin/llama-bench -m ~/models/Qwen3-32B-Q6_K.gguf --numa mirror
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           pp512 |         21.28 ± 0.12 |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | CPU        |      56 |           tg128 |          2.77 ± 0.00 |

build: 34a50172 (6488)

Both prompt processing and inference are faster.


usrlocalben commented Sep 15, 2025

I tried it out and took some basic measurements.

Model is anikifoss V3.1 quant (Q8 attn/shared, Q4/Q5 mixed MoE)

Platform is 2S EPYC 9115 NPS1 w/24x 64GB DDR5-4800 + RTX 6000 Pro (Blackwell)

worth noting: 2S EPYC NPS2/NPS4 would require 4x and 8x copies 😳

All attention/shexp on GPU, MoE on CPU.

AFAIK the best NUMA option available on the master branch is --numa distribute.
My understanding is that page/node assignment is stochastic, since it relies on mmap for NUMA "migration", so it's important to drop caches before loading so that the warm-up phase populates the page cache consistently with the threads that touch it.

numa-mirror

--numa mirror

prompt eval time =    6836.72 ms /    94 tokens (   72.73 ms per token,    13.75 tokens per second)
       eval time =   71504.51 ms /  1189 tokens (   60.14 ms per token,    16.63 tokens per second)
      total time =   78341.24 ms /  1283 tokens

--numa distribute (after drop_caches)

prompt eval time =    7068.73 ms /    94 tokens (   75.20 ms per token,    13.30 tokens per second)
       eval time =  127871.15 ms /  1417 tokens (   90.24 ms per token,    11.08 tokens per second)
      total time =  134939.88 ms /  1511 tokens

note: the distribute run on this branch still loaded the model with mirroring.

master

--numa distribute (after drop_caches)

prompt eval time =    8035.73 ms /    94 tokens (   85.49 ms per token,    11.70 tokens per second)
       eval time =   85014.67 ms /  1406 tokens (   60.47 ms per token,    16.54 tokens per second)
      total time =   93050.40 ms /  1500 tokens

For NUMA I'm only interested in TG perf since PP can be solved by offload/batch.

I'll ignore numa-mirror + distribute since whatever is happening there is probably unintended.

master + distribute: 60.47ms/t
numa-mirror + mirror: 60.14ms/t

An improvement of 0.5% in exchange for 2x RAM cost.
(And I only did one run each, so it might not even be real)

Worth mentioning: the loader here is much better than the master/distribute situation.
The need for drop_caches, plus a loader that is slower by a factor, is not a great experience if one changes models often.

My impression is that the NUMA implementation in this change has much more depth than before. Could it be used to implement expert parallelism? That should be much closer to the 1.5-2x improvement everyone likely desires, and without the RAM cost, even on NPS2/NPS4.

Invocation used:

--host 0.0.0.0 --port 9999 -np 1
-b 4096 -ub 4096
-fa on
-m anikifoss/HQ4_K/DeepSeek-V3.1-HQ4_K-00001-of-00010.gguf
--numa {mirror,distribute}
-ngl 99
-ot "blk\.([3-9])\.ffn_up_exps=CUDA0,blk\.([3-9])\.ffn_gate_exps=CUDA0"
-ot exps=CPU
-c 96000


Ph0rk0z commented Sep 15, 2025

> This is CPU-only inference, no GPU.

Yes, no GPU. In fastLLM, CUDA and NUMA don't work together; you can run CUDA + CPU, or CUDA for PP and NUMA for TG.

I'm curious what kind of figures you get on the other test for RAM bandwidth utilization. FastLLM takes all of my 230 GB/s. It's thorny to actually use day-to-day for other reasons, but that's the metric to shoot for.

It could be good to look at their code and see what strategy they use to get much more than a 10% improvement. You do have me wondering whether I'm bottlenecked by my QPI link, though, given that 52 GB/s figure.

> node assignment is stochastic as it relies on mmap

Heh, I don't use mmap for models because it's slower. You should run llama-sweep-bench for a better benchmark too; there's a port of it floating around for mainline.


usrlocalben commented Sep 15, 2025

> Heh, I don't use mmap for models because it's slower.

I may be wrong about the mmap part now that you mention it.

As for benching MoE-only, it doesn't seem like a sweep is necessary. Measuring TG at low context should be best, since attention is at a minimum and, on a fast GPU, is basically a small constant factor with e.g. ctx < 1000. Sweeps with attention on the GPU mostly just measure, well, attention; expert computation is constant as I understand it.

I don't think I'm bottlenecked anywhere, as ik_ + distribute gives another 10+% TG, and the last time I measured with likwid-bench, these values are still only ~50-60% of raw read capacity (even on the low-end 9115, in Performance mode likwid-bench gives ~212 GB/s per socket).


dbsanfte commented Sep 15, 2025

Mirroring takes care of the cross-socket traffic problem.

The kernels and compute infrastructure need looking at, but that's really outside the scope of this PR.

One thing at a time, gents.


dbsanfte commented Sep 15, 2025

I noticed something interesting while testing Qwen3 30B A3B: MoE models still have significant cross-socket traffic during inference:

[Intel PCM screenshot]

I don't see this with dense models like Qwen3 32B; those behave as shown above.

I'm going to see what's going on with that.

Edit: I think it's the repacking; the repacked tensors aren't getting NUMA mirrors. I'll see if I can fix this tomorrow.


Ph0rk0z commented Sep 15, 2025

> As for benching MoE-only, it doesn't seem like a sweep is necessary. Measuring TG at low context should be best since attn.

Not sure; TG falls as context builds, and some changes only show a benefit later on. Multiple runs also give you a better picture.

> Mirroring takes care of the cross-socket traffic problem.

As long as you have double the RAM for the weights. :P
