Conversation

z-vishal commented Dec 2, 2025

This PR adds ZenDNN backend support for accelerated inference on AMD EPYC™ CPUs.

Background

ZenDNN is AMD's optimized deep learning library for EPYC processors, providing high-performance primitives for inference workloads. It uses the LowOHA (Low Overhead High-performance) MatMul operator for efficient matrix multiplication.

Changes

  • Backend implementation:

    • New ZenDNN backend in ggml/src/ggml-zendnn/
    • Implements GGML_OP_MUL_MAT acceleration using ZenDNN primitives
    • Supports FP32 and BF16 data types
    • Automatically converts tensor data types when required
  • Build system (see the example build commands after this list):

    • CMake integration with automatic download/build option: -DGGML_ZENDNN=ON
    • Custom installation path support: -DGGML_ZENDNN_PATH=/path/to/zendnn
    • Uses ZenDNN's CMake package for clean dependency management
  • Documentation:

    • Comprehensive backend documentation in docs/backend/ZenDNN.md
    • Build instructions added to docs/build.md
    • Covers hardware support, setup, performance tuning, and profiling
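
For reference, a typical build invocation with these flags might look like the following; the build directory, the Release config, and the ZenDNN install path are illustrative placeholders rather than requirements:

# Configure with the ZenDNN backend enabled (ZenDNN is downloaded and built automatically):
$ cmake -B build -DGGML_ZENDNN=ON
# Or point the build at an existing ZenDNN installation:
$ cmake -B build -DGGML_ZENDNN=ON -DGGML_ZENDNN_PATH=/path/to/zendnn
$ cmake --build build --config Release -j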

Hardware Support

  • AMD EPYC 9005 Series (Turin/Zen 5)
  • AMD EPYC 9004 Series (Genoa/Zen 4) - Recommended (best BF16 performance)
  • AMD EPYC 7003 Series (Milan/Zen 3)
  • AMD Ryzen AI MAX (Strix Halo)

Performance Notes

  • Best performance with export ZENDNNL_MATMUL_ALGO=2 (Blocked AOCL BLIS backend); see the run example after this list
  • Optimized for BF16 inference on Zen 4/5 processors
  • Automatic parallel dispatch using OpenMP
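
As a concrete run example (the binary path, model path, and thread count are placeholders), the environment variable is simply exported before launching any llama.cpp tool:

# Select the Blocked AOCL BLIS matmul path, then run as usual:
$ export ZENDNNL_MATMUL_ALGO=2
$ ./build/bin/llama-cli -m /path/to/model.gguf -t 96 -p "Hello"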

Testing

Tested on AMD EPYC systems with llama-server and llama-cli using various models (LLaMA, Mistral, Qwen).

Performance Results

Test Configuration

  • Hardware: AMD EPYC 9004 Series (Zen 4)
  • Threads: 96
  • Batch Size: 4096
  • Tool: llama-bench
  • llama.cpp version: 7134
  • ZenDNN version: 1.0.0
  • Environment: ZENDNNL_MATMUL_ALGO=2 (Blocked AOCL BLIS)

Benchmark Results

LLaMA 3.1 8B (BF16)

| Test | CPU t/s | ZenDNN t/s | Speedup |
|--------|---------|------------|---------|
| pp128 | 341.50 | 395.58 | 1.16x |
| pp256 | 382.52 | 561.94 | 1.47x |
| pp512 | 423.40 | 624.61 | 1.48x |
| pp1024 | 414.12 | 637.97 | 1.54x |
| pp2048 | 338.50 | 622.08 | 1.84x |
| pp4096 | 308.53 | 534.76 | 1.73x |
| tg128 | 7.28 | 10.53 | 1.45x |

LLaMA 3.1 8B (F32)

| Test | CPU t/s | ZenDNN t/s | Speedup |
|--------|---------|------------|---------|
| pp128 | 184.44 | 293.39 | 1.59x |
| pp256 | 189.69 | 384.71 | 2.03x |
| pp512 | 234.74 | 431.21 | 1.84x |
| pp1024 | 231.49 | 451.51 | 1.95x |
| pp2048 | 220.05 | 425.65 | 1.93x |
| pp4096 | 189.75 | 396.73 | 2.09x |
| tg128 | 2.69 | 7.34 | 2.73x |

Qwen2 7B (BF16)

| Test | CPU t/s | ZenDNN t/s | Speedup |
|--------|---------|------------|---------|
| pp128 | 339.58 | 381.26 | 1.12x |
| pp256 | 380.82 | 482.33 | 1.27x |
| pp512 | 434.41 | 639.02 | 1.47x |
| pp1024 | 432.35 | 703.14 | 1.63x |
| pp2048 | 382.49 | 694.71 | 1.82x |
| pp4096 | 316.63 | 640.01 | 2.02x |
| tg128 | 6.30 | 11.96 | 1.90x |

Qwen2 7B (F32)

| Test | CPU t/s | ZenDNN t/s | Speedup |
|--------|---------|------------|---------|
| pp128 | 201.64 | 309.29 | 1.53x |
| pp256 | 217.81 | 408.51 | 1.88x |
| pp512 | 250.92 | 451.24 | 1.80x |
| pp1024 | 251.71 | 461.91 | 1.84x |
| pp2048 | 228.00 | 454.05 | 1.99x |
| pp4096 | 207.30 | 445.56 | 2.15x |
| tg128 | 2.75 | 8.11 | 2.95x |

LLaMA 2 7B (BF16)

| Test | CPU t/s | ZenDNN t/s | Speedup |
|--------|---------|------------|---------|
| pp128 | 325.94 | 387.72 | 1.19x |
| pp256 | 364.62 | 547.76 | 1.50x |
| pp512 | 417.88 | 613.29 | 1.47x |
| pp1024 | 418.46 | 603.59 | 1.44x |
| pp2048 | 382.10 | 623.88 | 1.63x |
| pp4096 | 316.20 | 559.45 | 1.77x |
| tg128 | 7.05 | 11.59 | 1.64x |

LLaMA 2 7B (F32)

| Test | CPU t/s | ZenDNN t/s | Speedup |
|--------|---------|------------|---------|
| pp128 | 201.47 | 315.96 | 1.57x |
| pp256 | 217.71 | 397.12 | 1.82x |
| pp512 | 249.96 | 436.97 | 1.75x |
| pp1024 | 249.78 | 454.70 | 1.82x |
| pp2048 | 224.65 | 440.21 | 1.96x |
| pp4096 | 195.72 | 392.68 | 2.01x |
| tg128 | 3.70 | 8.15 | 2.20x |

LLaMA 2 13B (BF16)

| Test | CPU t/s | ZenDNN t/s | Speedup |
|--------|---------|------------|---------|
| pp128 | 185.20 | 202.39 | 1.09x |
| pp256 | 200.55 | 300.21 | 1.50x |
| pp512 | 227.04 | 370.78 | 1.63x |
| pp1024 | 221.33 | 358.21 | 1.62x |
| pp2048 | 170.63 | 377.57 | 2.21x |
| pp4096 | 177.55 | 302.23 | 1.70x |
| tg128 | 3.72 | 6.76 | 1.82x |

LLaMA 2 13B (F32)

| Test | CPU t/s | ZenDNN t/s | Speedup |
|--------|---------|------------|---------|
| pp128 | 107.74 | 174.92 | 1.62x |
| pp256 | 114.34 | 215.51 | 1.88x |
| pp512 | 129.28 | 246.26 | 1.90x |
| pp1024 | 127.64 | 232.02 | 1.82x |
| pp2048 | 113.25 | 253.00 | 2.23x |
| pp4096 | 105.44 | 220.49 | 2.09x |
| tg128 | 1.92 | 4.73 | 2.46x |

Mixtral 8x7B (BF16)

| Test | CPU t/s | ZenDNN t/s | Speedup |
|--------|---------|------------|---------|
| pp128 | 92.74 | 94.24 | 1.02x |
| pp256 | 136.77 | 143.61 | 1.05x |
| pp512 | 164.38 | 167.70 | 1.02x |
| pp1024 | 169.80 | 175.44 | 1.03x |
| pp2048 | 166.19 | 176.64 | 1.06x |
| pp4096 | 151.95 | 174.29 | 1.15x |
| tg128 | 3.73 | 3.43 | 0.92x |

Key Observations:

  • Best gains on F32 models: up to 2.95x speedup (Qwen2-7B token generation)
  • BF16: 1.5-2x faster with lower memory usage
  • Larger batch sizes (pp2048, pp4096) show the largest speedups
  • Smaller dense models (7B-13B) benefit more than the large MoE model (Mixtral 8x7B)
  • Token generation: 1.45x-2.95x faster on dense models (Mixtral 8x7B tg128 is the one regression, at 0.92x)

Related

AI usage disclosure: AI assistance was used for documentation writing, formatting and CMake syntax. All code logic, implementation decisions, backend integration, and testing were done manually. The core ZenDNN backend implementation, performance optimizations, and benchmark testing were human-authored and validated.

z-vishal requested a review from ggerganov as a code owner on December 2, 2025 at 12:44
github-actions bot added the documentation, examples, and ggml labels on Dec 2, 2025

Djip007 (Contributor) commented Dec 2, 2025

I was thinking of creating a backend with https://github.com/amd/blis (with FBGEMM), but ZenDNN works well too.

taronaeo self-requested a review on December 3, 2025 at 00:23

taronaeo (Collaborator) commented Dec 3, 2025

Can you also include the benchmark results from #17684 in this PR?

z-vishal (Author) commented Dec 3, 2025

@taronaeo Updated the PR description with benchmark results

z-vishal (Author) commented Dec 3, 2025

@Djip007 Thanks! AMD BLIS is actually what ZenDNN uses under the hood: the ZENDNNL_MATMUL_ALGO=2 setting activates the "Blocked AOCL BLIS" backend for optimal performance, so you're getting BLIS optimizations through ZenDNN.


return &ggml_backend_zendnn_device;

GGML_UNUSED(reg);
Member

It looks like both reg and index are used so these GGML_UNUSED are not needed.
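
For illustration only, a minimal sketch of the suggested cleanup, assuming the enclosing function is the registry's get_device hook; the function name and the asserts are assumptions, not the PR's actual code:

// Hypothetical sketch: if both 'reg' and 'index' are already referenced
// (for example in asserts), the trailing GGML_UNUSED markers can be dropped.
static ggml_backend_dev_t ggml_backend_zendnn_reg_get_device(ggml_backend_reg_t reg, size_t index) {
    GGML_ASSERT(reg != NULL);
    GGML_ASSERT(index == 0);
    return &ggml_backend_zendnn_device;
}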


ZenDNN provides optimized deep learning primitives for AMD EPYC™ CPUs. It accelerates matrix multiplication operations for inference workloads.

### Compilation
Member

I needed to install LIBXSMM to compile:

$ sudo apt-get install libxsmm-dev

Perhaps this should be mentioned somewhere if this is the case.


static bool ggml_zendnn_sgemm(ggml_backend_zendnn_context * ctx, int64_t m, int64_t n, int64_t k,
                              const void * A, int64_t lda, const void * B, int64_t ldb, void * C,
                              int64_t ldc, int Atype, int Btype, int Ctype) {
Member

Nit: The coding convention is to use snake case, so perhaps something like A_type for the parameters instead?
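
For reference, a hedged sketch of the same declaration with the suggested renaming (purely illustrative, mirroring the signature quoted above):

// Same declaration, with the type parameters renamed per the suggestion:
static bool ggml_zendnn_sgemm(ggml_backend_zendnn_context * ctx, int64_t m, int64_t n, int64_t k,
                              const void * A, int64_t lda, const void * B, int64_t ldb, void * C,
                              int64_t ldc, int A_type, int B_type, int C_type);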
