--numa mirror: mirror model weights to every NUMA node in the system #16000
base: master
Conversation
- Achieved 5% inference speed improvement (14.6 -> 15.3 t/s)
- Clean explicit NUMA setup during model loading
- Ultra-minimal hot path with thread-local NUMA node access
- Working NUMA mirrors for all model weights
- Performance: text generation improved, prompt processing needs optimization

Performance results (Qwen3-30B-A3B):
- Text generation: 14.6 -> 15.3 t/s (+5% improvement)
- Prompt processing: 176 -> 152 t/s (14% regression - needs investigation)

Technical implementation:
- tensor_data(): O(1) NUMA-aware access via thread-local ggml_current_numa_node
- tensor_set_data_with_numa_mirrors(): explicit NUMA setup for model weights
- NUMA coordinator: thread binding and memory locality
- Clean separation: model loading (explicit setup) vs inference (fast access)
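A minimal sketch of the hot-path idea summarized above, assuming a fixed upper bound on node count. tensor_data() and ggml_current_numa_node are names taken from the summary; tensor_mirrors and GGML_NUMA_MAX_NODES are illustrative placeholders, not the PR's actual structures:

```cpp
#include <cstddef>

constexpr int GGML_NUMA_MAX_NODES = 8;   // illustrative upper bound, not from the PR

// Written once per worker thread when the threadpool binds it to a core/node.
thread_local int ggml_current_numa_node = 0;

// Illustrative stand-in for a tensor carrying one weight copy per NUMA node;
// nodes without a mirror fall back to the primary copy on node 0.
struct tensor_mirrors {
    void * data[GGML_NUMA_MAX_NODES] = {};
};

// O(1) NUMA-aware access on the hot path: one thread-local read and one
// array index, no locks and no syscalls.
static inline void * tensor_data(const tensor_mirrors & t) {
    void * p = t.data[ggml_current_numa_node];
    return p != nullptr ? p : t.data[0];
}
```

The point of the design, as described in the summary, is that the per-thread node id is established once during thread binding, so the inference loop pays only an array lookup.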
Physical core detection was very broken in
I'd hate to say it, but that doesn't look like a big gain. Have you compared it with just using numa distribute and interleave=all? For contrast, fastllm in numa mode gives 7 t/s on qwen-235b on slower RAM than yours.
I see that commits are still incoming, but currently a clean checkout / build is giving many instances of:
Fixed in the latest commit. Sorry, I've been running CPU-only to simplify my testing and forgot to compile CUDA :) I'll fix ROCm too.
I tried it out and took some basic measurements. Model is anikifoss V3.1 quant (Q8 attn/shared, Q4/Q5 mixed MoE). Platform is 2S EPYC 9115 NPS1 w/ 24x 64GB DDR5-4800 + RTX 6000 Pro (Blackwell). Worth noting: 2S EPYC NPS2/NPS4 would require 4x and 8x copies 😳 All attention/shexp on GPU, MoE on CPU. AFAIK the best available NUMA on the master branch is

numa-mirror
(note: distribute still loaded the model w/ mirroring)
master
For NUMA I'm only interested in TG perf, since PP can be solved by offload/batch. I'll ignore numa-mirror + distribute, since whatever is happening there is probably unintended.

master + distribute: 60.47 ms/t

An improvement of 0.5% in exchange for 2x RAM cost. Worth mentioning is that the loader is much better than the master/distribute situation.

My impression is that the NUMA implementation in this change has much more depth than before. Could it be used to implement Expert-Parallelism? That should be much closer to the 1.5-2x improvement everyone likely desires, and without the RAM cost, even on NPS2/NPS4.

Invocation used:
Yes, no GPU. In fastLLM, CUDA and NUMA don't work together: you can run CUDA + CPU, or CUDA for PP and NUMA for TG. I'm curious what kind of figures you get on the other test for RAM bandwidth utilization; FastLLM takes all of my 230 GB/s. It's thorny for other reasons to actually use, but that's the metric to shoot for. It could be good to look at their code and see what strategy they use to get much more than a 10% improvement. You do have me wondering if I'm bottlenecked by my QPI link with that 52 GB/s figure, though.
Heh, I don't use mmap for models because it's slower. Should run llama-sweep-bench for a better benchie too; there's a port of it floating around for mainline.
I may be wrong about the mmap part now that you mention it. As for benching MoE-only, it doesn't seem like sweep is necessary. Measuring TG with low context should be best, since attention is at a minimum and, on a fast GPU, basically a small constant factor with e.g. ctx < 1000. Sweeps with attention on GPU just measure, well, attention; expert computation is constant as I understand it. I don't think I'm bottlenecked anywhere, as ik_ + distribute gives another 10+% TG, and last I measured my likwid-bench bandwidth, these values are still just ~50-60% of raw read capacity (even on the low-end 9115, in Performance mode likwid-bench gives ~212 GB/s per socket).
Mirroring takes care of the cross-socket traffic problem. The kernels and compute infrastructure need looking at, but that's outside the scope of this PR, really. One thing at a time, gents.
Make logging prettier
Not sure; TG falls as context builds, and some changes only show benefit later on. Multiple runs also give you a better picture.
As long as you have double the RAM for the weights. :P
This PR adds a new --numa mirror option, which mirrors model weights to each NUMA node on the system and uses a thread-local variable in the OMP threadpool to select the mirror copy local to the thread at runtime, eliminating cross-socket traffic.

Build instructions:
To test:
Test system is a two-socket Xeon 6238R Cascade Lake, with 768GB of DDR4-2933 (6 channels per socket).
Without --numa mirror:

With --numa mirror:

Intel PCM tool during mirror inference, showing both sockets using local memory:
There's still a bit of cross-socket traffic (~5%) because only model weights are mirrored, not tensors created at inference time. I'll play with that; maybe mirroring those aggressively will help too, or maybe not. Right now anything created at inference time just gets set to live on Node 0.
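For readers unfamiliar with the mechanism, here is a rough sketch of what mirroring a weight buffer to every node looks like with libnuma. The helper name mirror_to_all_nodes is illustrative, not the PR's tensor_set_data_with_numa_mirrors API; link with -lnuma.

```cpp
#include <numa.h>      // libnuma: numa_available, numa_alloc_onnode, numa_free
#include <cstring>
#include <vector>

// Allocate a node-local copy of a weight buffer on every configured NUMA
// node and populate it from the source buffer. Tensors created at inference
// time that skip this step simply keep their original (node 0) allocation.
std::vector<void *> mirror_to_all_nodes(const void * src, size_t size) {
    std::vector<void *> copies;
    if (numa_available() < 0) {
        return copies;                               // no NUMA support on this system
    }
    const int n_nodes = numa_num_configured_nodes();
    copies.resize(n_nodes, nullptr);
    for (int node = 0; node < n_nodes; ++node) {
        void * dst = numa_alloc_onnode(size, node);  // pages placed on this node
        if (dst != nullptr) {
            std::memcpy(dst, src, size);             // fill the mirror
        }
        copies[node] = dst;
    }
    return copies;                                   // release later with numa_free()
}
```

This is the memory cost the reviewers above are weighing: one full copy of the weights per node, in exchange for every thread reading from node-local pages.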