From fdaf196dd5e3843cf06b6f29c91e38ddd4c27759 Mon Sep 17 00:00:00 2001
From: "plotnikov.v10"
Date: Wed, 27 May 2026 22:05:53 +0300
Subject: [PATCH 1/5] docs zendnn added information about Q8 support

---
 docs/backend/ZenDNN.md | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/docs/backend/ZenDNN.md b/docs/backend/ZenDNN.md
index add5805331c..d8c7c90023a 100644
--- a/docs/backend/ZenDNN.md
+++ b/docs/backend/ZenDNN.md
@@ -72,10 +72,13 @@ The ZenDNN backend accelerates **matrix multiplication (MUL_MAT)** and **expert-
 |:----------------------:|:-------:|:---------------------------------------------:|
 | FP32 | Support | Full precision floating point |
 | BF16 | Support | BFloat16 (best performance on Zen 4/Zen 5) |
+| Q8_0 | Support | Quantized 8-bit weights accelerated through ZenDNN MulMat branch |
 
 *Notes:*
 
 - **BF16** provides best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).
+- **Q8_0** support is available for quantized model weights in supported ZenDNN Mul Mat branch.
+- Other quantization formats may fall back to the standard CPU backend unless explicitly supported by the ZenDNN backend.
 
 ## Linux
 
@@ -140,6 +143,15 @@ Download LLaMA 3.1 8B Instruct BF16 model:
 huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/
 ```
 
+You can also use a Q8_0 GGUF model:
+
+```sh
+# Download a Q8_0 GGUF model from Hugging Face
+huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF \
+  Llama-3.1-8B-Instruct-Q8_0.gguf \
+  --local-dir models/
+```
+
 #### 2. Start Server
 
 Run llama.cpp server with ZenDNN acceleration:
@@ -176,6 +188,17 @@ export ZENDNNL_MATMUL_ALGO=1 # Blocked AOCL DLP algo (recommended)
 
 For more details on available algorithms, see the [ZenDNN MatMul Algorithm Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/runtime_env.md#algorithm-details).
 
+### Q8_0 Performance Notes
+
+Q8_0 support is mainly beneficial for prompt processing / prefill workloads where large matrix multiplications dominate execution. Token generation performance may remain close to the standard CPU backend depending on the model, batch size, number of threads, and CPU topology.
+
+For best results with Q8_0 models:
+
+- Use a supported AMD Zen-based CPU.
+- Use `ZENDNNL_MATMUL_ALGO=1`.
+- Use sufficiently large prompt/batch workloads to expose matrix multiplication acceleration.
+- Enable profiling or logging to verify that ZenDNN MatMul kernels are being used.
+
 ### Profiling and Debugging
 
 For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/logging.md).
@@ -184,6 +207,7 @@ For detailed profiling and logging options, refer to the [ZenDNN Logging Documen
 
 - **Limited operation support**: Currently matrix multiplication (MUL_MAT) and expert-based matrix multiplication (MUL_MAT_ID) are accelerated via ZenDNN. Other operations fall back to the standard CPU backend. Future updates may expand supported operations.
 - **BF16 support**: BF16 operations require AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32.
+- **Q8_0 support scope**: Q8_0 acceleration is available for supported matrix multiplication paths. Other quantization formats may still fall back to the standard CPU backend.
 - **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance.
 
 ## Q&A
@@ -202,7 +226,7 @@ A: ZenDNN is optimized specifically for AMD processors. While it may work on oth
 
 **Q: Does ZenDNN support quantized models?**
 
-A: Currently, ZenDNN primarily supports FP32 and BF16 data types. Quantized model support is not available at this time.
+A: Yes. The ZenDNN backend supports Q8_0 quantized models for supported matrix multiplication operations. FP32 and BF16 are also supported. Other quantization formats may fall back to the standard CPU backend unless explicitly supported by the ZenDNN backend.
 
 **Q: Why is my inference not faster with ZenDNN?**
 

From 63b95f4d3ed882ebe92392688a6c57ec892b0738 Mon Sep 17 00:00:00 2001
From: "plotnikov.v10"
Date: Wed, 27 May 2026 22:11:50 +0300
Subject: [PATCH 2/5] docs zendnn rm unnecessary data

---
 docs/backend/ZenDNN.md | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/docs/backend/ZenDNN.md b/docs/backend/ZenDNN.md
index d8c7c90023a..0f45debcb93 100644
--- a/docs/backend/ZenDNN.md
+++ b/docs/backend/ZenDNN.md
@@ -192,13 +192,6 @@ For more details on available algorithms, see the [ZenDNN MatMul Algorithm Docum
 
 Q8_0 support is mainly beneficial for prompt processing / prefill workloads where large matrix multiplications dominate execution. Token generation performance may remain close to the standard CPU backend depending on the model, batch size, number of threads, and CPU topology.
 
-For best results with Q8_0 models:
-
-- Use a supported AMD Zen-based CPU.
-- Use `ZENDNNL_MATMUL_ALGO=1`.
-- Use sufficiently large prompt/batch workloads to expose matrix multiplication acceleration.
-- Enable profiling or logging to verify that ZenDNN MatMul kernels are being used.
-
 ### Profiling and Debugging
 
 For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/logging.md).

From 6fec66c5f53f25c0af05c65208216c1485601b82 Mon Sep 17 00:00:00 2001
From: "plotnikov.v10"
Date: Sun, 31 May 2026 00:56:57 +0300
Subject: [PATCH 3/5] docs update, links to ZenDNN docs provided

---
 docs/backend/ZenDNN.md | 4 ++--
 docs/build.md | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/backend/ZenDNN.md b/docs/backend/ZenDNN.md
index 0f45debcb93..f90fbfb96a0 100644
--- a/docs/backend/ZenDNN.md
+++ b/docs/backend/ZenDNN.md
@@ -72,13 +72,13 @@ The ZenDNN backend accelerates **matrix multiplication (MUL_MAT)** and **expert-
 |:----------------------:|:-------:|:---------------------------------------------:|
 | FP32 | Support | Full precision floating point |
 | BF16 | Support | BFloat16 (best performance on Zen 4/Zen 5) |
-| Q8_0 | Support | Quantized 8-bit weights accelerated through ZenDNN MulMat branch |
+| Q8_0 | Support | 8-bit quantized weights via [dynamic quantization](https://github.com/amd/ZenDNN/blob/main/docs/operator/lowoha_matmul_operator.md) |
 
 *Notes:*
 
 - **BF16** provides best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).
 - **Q8_0** support is available for quantized model weights in supported ZenDNN Mul Mat branch.
-- Other quantization formats may fall back to the standard CPU backend unless explicitly supported by the ZenDNN backend.
+- Other quantization formats fall back to the standard CPU backend unless explicitly supported by the ZenDNN backend.
 
 ## Linux
 
diff --git a/docs/build.md b/docs/build.md
index 7beafbf5f46..007b757d47d 100644
--- a/docs/build.md
+++ b/docs/build.md
@@ -22,6 +22,7 @@ The following sections describe how to build with different backends and options
 * [HIP](#hip)
 * [Vulkan](#vulkan)
 * [CANN](#cann)
+* [ZenDNN](#zendnn)
 * [Arm® KleidiAI™](#arm-kleidiai)
 * [OpenCL](#opencl)
 * [Android](#android-1)

From 738a887cc44290570f9c4d39d339816f726050a3 Mon Sep 17 00:00:00 2001
From: "plotnikov.v10"
Date: Sun, 31 May 2026 01:01:17 +0300
Subject: [PATCH 4/5] docs zenDNN update: clarified explanation

---
 docs/backend/ZenDNN.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/backend/ZenDNN.md b/docs/backend/ZenDNN.md
index f90fbfb96a0..ad4755b48d5 100644
--- a/docs/backend/ZenDNN.md
+++ b/docs/backend/ZenDNN.md
@@ -77,7 +77,7 @@ The ZenDNN backend accelerates **matrix multiplication (MUL_MAT)** and **expert-
 *Notes:*
 
 - **BF16** provides best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).
-- **Q8_0** support is available for quantized model weights in supported ZenDNN Mul Mat branch.
+- **Q8_0** is available for quantized model weights since ZenDNN supports dynamic quantization [LowOHA MatMul operator](https://github.com/amd/ZenDNN/blob/main/docs/operator/lowoha_matmul_operator.md).
 - Other quantization formats fall back to the standard CPU backend unless explicitly supported by the ZenDNN backend.
 
 ## Linux

From 3d57d0c33cfb2079facd5c19e91bb9efcf0eea2e Mon Sep 17 00:00:00 2001
From: "plotnikov.v10"
Date: Sun, 31 May 2026 01:14:49 +0300
Subject: [PATCH 5/5] docs zenDNN update: one more explanation clarified

---
 docs/backend/ZenDNN.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/backend/ZenDNN.md b/docs/backend/ZenDNN.md
index ad4755b48d5..b2f970d8c43 100644
--- a/docs/backend/ZenDNN.md
+++ b/docs/backend/ZenDNN.md
@@ -200,7 +200,7 @@ For detailed profiling and logging options, refer to the [ZenDNN Logging Documen
 
 - **Limited operation support**: Currently matrix multiplication (MUL_MAT) and expert-based matrix multiplication (MUL_MAT_ID) are accelerated via ZenDNN. Other operations fall back to the standard CPU backend. Future updates may expand supported operations.
 - **BF16 support**: BF16 operations require AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32.
-- **Q8_0 support scope**: Q8_0 acceleration is available for supported matrix multiplication paths. Other quantization formats may still fall back to the standard CPU backend.
+- **Q8_0 support scope**: Q8_0 acceleration is available for supported matrix multiplication paths. Other quantization formats still fall back to the standard CPU backend.
 - **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance.
 
 ## Q&A