Skip to content

UPSTREAM PR #17690: ggml-zendnn : add ZenDNN backend for AMD CPUs#402

Open
loci-dev wants to merge 3 commits intomainfrom
upstream-PR17690-branch_z-vishal-ggml-zendnn
Open

UPSTREAM PR #17690: ggml-zendnn : add ZenDNN backend for AMD CPUs#402
loci-dev wants to merge 3 commits intomainfrom
upstream-PR17690-branch_z-vishal-ggml-zendnn

Conversation

@loci-dev
Copy link

@loci-dev loci-dev commented Dec 2, 2025

Mirrored from ggml-org/llama.cpp#17690

This PR adds ZenDNN backend support for accelerated inference on AMD EPYC™ CPUs.

Background

ZenDNN is AMD's optimized deep learning library for EPYC processors, providing high-performance primitives for inference workloads. It uses the LowOHA (Low Overhead High-performance) MatMul operator for efficient matrix multiplication.

Changes

  • Backend implementation:

    • New ZenDNN backend in ggml/src/ggml-zendnn/
    • Implements GGML_OP_MUL_MAT acceleration using ZenDNN primitives
    • Supports FP32 and BF16 data types
    • Auto-converts types
  • Build system:

    • CMake integration with automatic download/build option: -DGGML_ZENDNN=ON
    • Custom installation path support: -DGGML_ZENDNN_PATH=/path/to/zendnn
    • Uses ZenDNN's CMake package for clean dependency management
  • Documentation:

    • Comprehensive backend documentation in docs/backend/ZenDNN.md
    • Build instructions added to docs/build.md
    • Covers hardware support, setup, performance tuning, and profiling

Hardware Support

  • AMD EPYC 9005 Series (Turin/Zen 5)
  • AMD EPYC 9004 Series (Zen 4) - Recommended (best BF16 performance)
  • AMD EPYC 7003 Series (Milan/Zen 3)
  • AMD Ryzen AI MAX (Strix Halo)

Performance Notes

  • Best performance with export ZENDNNL_MATMUL_ALGO=2 (Blocked AOCL BLIS backend)
  • Optimized for BF16 inference on Zen 4/5 processors
  • Automatic parallel dispatch using OpenMP

Testing

Tested on AMD EPYC systems with llama-server and llama-cli using various models (LLaMA, Mistral, Qwen).

Related

AI usage disclosure: AI assistance was used for documentation writing, formatting and CMake syntax. All code logic, implementation decisions, backend integration, and testing were done manually. The core ZenDNN backend implementation, performance optimizations, and benchmark testing were human-authored and validated.

@loci-review
Copy link

loci-review bot commented Dec 2, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #402: ZenDNN Backend Integration

Overview

This PR adds ZenDNN backend support for AMD EPYC CPUs through 19,728 additions across 12 files. The changes introduce a new backend registration path without modifying core inference functions. Analysis shows startup latency increase with no impact on inference performance.

Key Findings

Performance-Critical Area: Backend Initialization

The function ggml_backend_load_all_from_path in ggml/src/ggml-backend-reg.cpp shows response time increase of 133,671 ns due to the added ggml_backend_load_best("zendnn", ...) call. This breaks down as: filesystem scan (30,000 ns), dynamic library loading (50,000 ns), symbol resolution (15,000 ns), and backend initialization (38,000 ns). The throughput time increased by 5 ns from conditional compilation logic. This is a one-time startup cost that does not affect runtime inference.

Inference Performance Impact

No changes detected in core inference functions: llama_decode, llama_encode, or llama_tokenize. These functions remain unmodified with identical response time and throughput metrics between versions. Tokens per second performance is unchanged as the ZenDNN integration only affects backend registration during initialization, not the inference execution path.

Power Consumption Analysis

Binary-level analysis shows minimal power consumption changes: libggml.so decreased by 6.6 nJ and llama-bench increased by 31 nJ. The net change is effectively zero, indicating no measurable power impact from the backend registration code itself.

Code Changes

The implementation adds ZenDNN to the backend registry through two paths: static registration via ggml_backend_registry() constructor when compiled with GGML_USE_ZENDNN, and dynamic loading via ggml_backend_load_all_from_path(). The backend is positioned second in loading priority after BLAS. No modifications to public API headers or core inference logic were made, ensuring backward compatibility.

@loci-dev loci-dev force-pushed the main branch 28 times, most recently from bdacbc7 to ca9e0d2 Compare December 5, 2025 00:36
@loci-dev loci-dev force-pushed the main branch 27 times, most recently from 048ad94 to 6c1fde6 Compare February 3, 2026 13:32
@loci-dev loci-dev force-pushed the main branch 3 times, most recently from ef7afbe to d4c3480 Compare February 14, 2026 02:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants