
Feature Request: NUMA-aware MoE Expert Allocation for Improved Performance #11333

@l15y

Description

Feature Description

The current llama.cpp implementation doesn't optimally utilize the NUMA architecture when running Mixture-of-Experts (MoE) models, potentially leaving significant performance gains untapped.

Proposed Solution

Implement NUMA-aware expert allocation through one or more of these approaches:

  1. Process-Level Binding

    • Integrate numactl-like functionality directly into llama.cpp
    • Allow specifying NUMA nodes per expert group via CLI/config
  2. Thread Affinity Control

    • Add pthread/OpenMP affinity binding for expert computation threads
    • Example: --numa-expert-map "0-7:0,8-15:1" (experts 0-7 on NUMA node 0, experts 8-15 on NUMA node 1); see the parsing/affinity sketch after this list
  3. NUMA-Aware Memory Allocation

    • Leverage libnuma for expert weight allocations
    • Implement an mmap strategy with MAP_FIXED_NOREPLACE to place expert weights on specific nodes; see the allocation sketch after this list
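
A minimal sketch of approach 2, assuming libnuma is available and that the map format is the one shown above; `parse_expert_map` and `bind_thread_to_node` are hypothetical names, not existing llama.cpp functions:

```cpp
// Sketch only: parse an expert->NUMA-node map like "0-7:0,8-15:1" and pin the
// calling thread to the CPUs of a given node. parse_expert_map and
// bind_thread_to_node are illustrative names, not existing llama.cpp API.
#include <numa.h>          // link with -lnuma
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <map>
#include <sstream>
#include <string>

// "0-7:0,8-15:1"  ->  { {0,0}, ..., {7,0}, {8,1}, ..., {15,1} }
static std::map<int, int> parse_expert_map(const std::string & spec) {
    std::map<int, int> expert_to_node;
    std::stringstream ss(spec);
    std::string tok;
    while (std::getline(ss, tok, ',')) {
        int lo, hi, node;
        if (sscanf(tok.c_str(), "%d-%d:%d", &lo, &hi, &node) == 3) {
            for (int e = lo; e <= hi; ++e) expert_to_node[e] = node;
        }
    }
    return expert_to_node;
}

// Pin the current thread to every CPU belonging to NUMA node `node`.
static bool bind_thread_to_node(int node) {
    if (numa_available() < 0) return false;        // kernel/libnuma not usable
    struct bitmask * cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) != 0) {
        numa_free_cpumask(cpus);
        return false;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned i = 0; i < cpus->size; ++i) {
        if (numa_bitmask_isbitset(cpus, i)) CPU_SET(i, &set);
    }
    numa_free_cpumask(cpus);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```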
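And a sketch of approach 3, binding one expert's weight buffer to a chosen node. Both variants are standard libnuma/mbind usage rather than an existing llama.cpp code path, and the MAP_FIXED_NOREPLACE address-placement detail is left out for brevity; `alloc_expert_on_node` is an illustrative name:

```cpp
// Sketch only: place one expert's weight buffer on a specific NUMA node, either
// through libnuma's allocator or through anonymous mmap + mbind so the pages are
// bound before they are first touched. alloc_expert_on_node is illustrative.
#include <numa.h>       // numa_alloc_onnode, link with -lnuma
#include <numaif.h>     // mbind, MPOL_BIND
#include <sys/mman.h>
#include <cstddef>

static void * alloc_expert_on_node(size_t size, int node) {
    // Variant A: libnuma convenience allocator (free with numa_free(buf, size)).
    if (void * buf = numa_alloc_onnode(size, node)) {
        return buf;
    }

    // Variant B: reserve anonymous pages, then bind the range to `node` so the
    // later copy of the expert weights is served from that node's memory.
    void * p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return nullptr;
    }
    unsigned long nodemask = 1UL << node;                     // assumes node < 64
    if (mbind(p, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0) {
        munmap(p, size);
        return nullptr;
    }
    return p;
}
```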

Performance Considerations

  • Cross-NUMA communication cost vs. compute density tradeoff
  • Automatic topology detection vs. manual mapping (see the detection sketch below)
  • Support for hybrid CPU+accelerator configurations
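
For the automatic-detection side of that tradeoff, a possible default is to derive a contiguous expert-to-node map from the detected topology when no manual map is given; `n_expert` and `default_expert_map` are illustrative names, not llama.cpp symbols:

```cpp
// Sketch only: when no --numa-expert-map is given, detect the topology with
// libnuma and assign experts to nodes in contiguous blocks (matching the
// "0-7:0,8-15:1" style above). n_expert and default_expert_map are illustrative.
#include <numa.h>
#include <vector>

static std::vector<int> default_expert_map(int n_expert) {
    std::vector<int> node_of_expert(n_expert, 0);
    if (numa_available() < 0) {
        return node_of_expert;            // non-NUMA system: everything stays on node 0
    }
    const int n_nodes  = numa_num_configured_nodes();
    const int per_node = (n_expert + n_nodes - 1) / n_nodes;
    for (int e = 0; e < n_expert; ++e) {
        node_of_expert[e] = e / per_node; // contiguous block of experts per node
    }
    return node_of_expert;
}
```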
