
Feature Request: NUMA-aware MoE Expert Allocation for Improved Performance #11333

@l15y

Description

Feature Description

The current llama.cpp implementation doesn't optimally utilize the NUMA architecture when running Mixture-of-Experts (MoE) models, potentially leaving significant performance gains untapped.

Proposed Solution

Implement NUMA-aware expert allocation through one or more of these approaches:

  1. Process-Level Binding

    • Integrate numactl-like functionality directly into llama.cpp
    • Allow specifying NUMA nodes per expert group via CLI/config
  2. Thread Affinity Control

    • Add pthread/OpenMP affinity binding for expert computation threads
    • Example: --numa-expert-map "0-7:0,8-15:1" (experts 0-7 on NUMA node 0, experts 8-15 on NUMA node 1); see the parsing/affinity sketch after this list
  3. NUMA-Aware Memory Allocation

    • Leverage libnuma for expert weight allocations
    • Implement an mmap strategy with MAP_FIXED_NOREPLACE to place expert weights on specific nodes; see the allocation sketch after this list
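
A minimal sketch of approach 2, assuming libnuma is available and that the map format is the one shown above; `parse_expert_map` and `bind_thread_to_node` are hypothetical names, not existing llama.cpp functions:

```cpp
// Sketch only: parse an expert->NUMA-node map like "0-7:0,8-15:1" and pin the
// calling thread to the CPUs of a given node. parse_expert_map and
// bind_thread_to_node are illustrative names, not existing llama.cpp API.
#include <numa.h>          // link with -lnuma
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <map>
#include <sstream>
#include <string>

// "0-7:0,8-15:1"  ->  { {0,0}, ..., {7,0}, {8,1}, ..., {15,1} }
static std::map<int, int> parse_expert_map(const std::string & spec) {
    std::map<int, int> expert_to_node;
    std::stringstream ss(spec);
    std::string tok;
    while (std::getline(ss, tok, ',')) {
        int lo, hi, node;
        if (sscanf(tok.c_str(), "%d-%d:%d", &lo, &hi, &node) == 3) {
            for (int e = lo; e <= hi; ++e) expert_to_node[e] = node;
        }
    }
    return expert_to_node;
}

// Pin the current thread to every CPU belonging to NUMA node `node`.
static bool bind_thread_to_node(int node) {
    if (numa_available() < 0) return false;        // kernel/libnuma not usable
    struct bitmask * cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) != 0) {
        numa_free_cpumask(cpus);
        return false;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned i = 0; i < cpus->size; ++i) {
        if (numa_bitmask_isbitset(cpus, i)) CPU_SET(i, &set);
    }
    numa_free_cpumask(cpus);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```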
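And a sketch of approach 3, binding one expert's weight buffer to a chosen node. Both variants are standard libnuma/mbind usage rather than an existing llama.cpp code path, and the MAP_FIXED_NOREPLACE address-placement detail is left out for brevity; `alloc_expert_on_node` is an illustrative name:

```cpp
// Sketch only: place one expert's weight buffer on a specific NUMA node, either
// through libnuma's allocator or through anonymous mmap + mbind so the pages are
// bound before they are first touched. alloc_expert_on_node is illustrative.
#include <numa.h>       // numa_alloc_onnode, link with -lnuma
#include <numaif.h>     // mbind, MPOL_BIND
#include <sys/mman.h>
#include <cstddef>

static void * alloc_expert_on_node(size_t size, int node) {
    // Variant A: libnuma convenience allocator (free with numa_free(buf, size)).
    if (void * buf = numa_alloc_onnode(size, node)) {
        return buf;
    }

    // Variant B: reserve anonymous pages, then bind the range to `node` so the
    // later copy of the expert weights is served from that node's memory.
    void * p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return nullptr;
    }
    unsigned long nodemask = 1UL << node;                     // assumes node < 64
    if (mbind(p, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0) {
        munmap(p, size);
        return nullptr;
    }
    return p;
}
```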

Performance Considerations

  • Cross-NUMA communication cost vs. compute density tradeoff
  • Automatic topology detection vs. manual mapping (see the detection sketch below)
  • Support for hybrid CPU+accelerator configurations
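
For the automatic-detection side of that tradeoff, a possible default is to derive a contiguous expert-to-node map from the detected topology when no manual map is given; `n_expert` and `default_expert_map` are illustrative names, not llama.cpp symbols:

```cpp
// Sketch only: when no --numa-expert-map is given, detect the topology with
// libnuma and assign experts to nodes in contiguous blocks (matching the
// "0-7:0,8-15:1" style above). n_expert and default_expert_map are illustrative.
#include <numa.h>
#include <vector>

static std::vector<int> default_expert_map(int n_expert) {
    std::vector<int> node_of_expert(n_expert, 0);
    if (numa_available() < 0) {
        return node_of_expert;            // non-NUMA system: everything stays on node 0
    }
    const int n_nodes  = numa_num_configured_nodes();
    const int per_node = (n_expert + n_nodes - 1) / n_nodes;
    for (int e = 0; e < n_expert; ++e) {
        node_of_expert[e] = e / per_node; // contiguous block of experts per node
    }
    return node_of_expert;
}
```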
