Closed
Feature Description
The current llama.cpp implementation does not take full advantage of NUMA architectures when running Mixture-of-Experts (MoE) models, potentially leaving significant performance gains untapped.
Proposed Solution
Implement NUMA-aware expert allocation through one or more of these approaches:
- Process-Level Binding - Integrate numactl-like functionality directly into llama.cpp
  - Allow specifying NUMA nodes per expert group via CLI/config
- Thread Affinity Control - Add pthread/OpenMP affinity binding for expert computation threads
  - Example: --numa-expert-map "0-7:0,8-15:1" (experts 0-7 on NUMA0, 8-15 on NUMA1)
- NUMA-Aware Memory Allocation - Leverage libnuma for expert weight allocations
  - Implement mmap strategy with MAP_FIXED_NOREPLACE for specific nodes
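To make the proposed mapping syntax concrete, here is a minimal sketch of how a value such as "0-7:0,8-15:1" could be turned into an expert-to-node table. The `--numa-expert-map` flag does not exist in llama.cpp today, and the helper name `parse_numa_expert_map` and the "-1 means unmapped" convention are assumptions made for illustration only.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of llama.cpp): parses a spec of the form
// "first-last:node,first-last:node,..." into a table indexed by expert id.
// Experts that are not mentioned keep the value -1 (no explicit node).
std::vector<int> parse_numa_expert_map(const std::string & spec, int n_experts) {
    assert(n_experts >= 0);
    std::vector<int> node_of_expert(n_experts, -1);

    std::stringstream ss(spec);
    std::string entry;
    while (std::getline(ss, entry, ',')) {
        const size_t colon = entry.find(':');
        if (colon == std::string::npos) {
            throw std::invalid_argument("missing ':' in entry \"" + entry + "\"");
        }
        const std::string range = entry.substr(0, colon);
        const int node = std::stoi(entry.substr(colon + 1));

        // A range is either a single expert id ("3") or "first-last" ("0-7").
        const size_t dash = range.find('-');
        const int first = std::stoi(range.substr(0, dash));
        const int last  = dash == std::string::npos ? first
                                                    : std::stoi(range.substr(dash + 1));

        if (node < 0 || first < 0 || last < first || last >= n_experts) {
            throw std::out_of_range("invalid entry \"" + entry + "\"");
        }
        for (int e = first; e <= last; ++e) {
            node_of_expert[e] = node;
        }
    }
    return node_of_expert;
}
```

With 16 experts, `parse_numa_expert_map("0-7:0,8-15:1", 16)` assigns experts 0-7 to node 0 and experts 8-15 to node 1, matching the example above. A flat table keeps the per-expert lookup trivial for whichever layer ends up doing the thread binding or memory placement.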
Performance Considerations
- Cross-NUMA communication cost vs. compute density tradeoff
- Automatic topology detection vs. manual mapping
- Support for hybrid CPU+accelerator configurations
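For the automatic-detection side of that tradeoff, a rough Linux-only sketch is shown below. It reads each node's CPU list from sysfs (`/sys/devices/system/node/node<N>/cpulist`) and restricts the calling thread to those CPUs. It uses `sched_setaffinity` rather than the pthread/OpenMP binding mentioned above, simply to keep the example small. The function names are hypothetical, and none of this reflects how llama.cpp currently handles NUMA.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#ifdef __linux__
#include <sched.h>
#endif

// Parse a Linux cpulist string such as "0-3,8-11" into individual CPU ids.
std::vector<int> parse_cpulist(const std::string & list) {
    std::vector<int> cpus;
    std::stringstream ss(list);
    std::string part;
    while (std::getline(ss, part, ',')) {
        if (part.empty()) {
            continue;
        }
        const size_t dash = part.find('-');
        const int first = std::stoi(part.substr(0, dash));
        const int last  = dash == std::string::npos ? first
                                                    : std::stoi(part.substr(dash + 1));
        for (int c = first; c <= last; ++c) {
            cpus.push_back(c);
        }
    }
    return cpus;
}

// CPUs that belong to a NUMA node according to sysfs. Returns an empty
// vector if the node does not exist or sysfs is unavailable.
std::vector<int> cpus_of_node(int node) {
    assert(node >= 0);
    std::ifstream f("/sys/devices/system/node/node" + std::to_string(node) + "/cpulist");
    std::string line;
    if (!f || !std::getline(f, line)) {
        return {};
    }
    return parse_cpulist(line);
}

// Restrict the calling thread to the given CPUs. Returns true on success,
// false if the list is empty, the call fails, or the platform is not Linux.
bool pin_current_thread(const std::vector<int> & cpus) {
#ifdef __linux__
    if (cpus.empty()) {
        return false;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c : cpus) {
        if (c >= 0 && c < CPU_SETSIZE) {
            CPU_SET(c, &set);
        }
    }
    // pid 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
#else
    (void) cpus;
    return false;
#endif
}
```

A worker that computes experts mapped to node 1 could call `pin_current_thread(cpus_of_node(1))` before touching its weights. A manual mapping would bypass `cpus_of_node` and supply the CPU list directly.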