ggml-zendnn : add Q8_0 quantization support #23414

Merged: CISC merged 3 commits into ggml-org:master from z-sachin:ggml-zendnn/add-q8_0-support on May 22, 2026

@z-sachin
Contributor

Overview

This PR adds Q8_0 quantization support to the ggml-zendnn backend.

The implementation enables ZenDNN execution paths for Q8_0 models and integrates the required handling for quantized weights and matmul operations.

Key changes:

  • Added Q8_0 support in ggml-zendnn backend
  • Enabled ZenDNN execution path for Q8_0 quantized matmul operations
  • Added handling for Q8_0 tensor layouts and conversions
  • Integrated backend execution support for Q8_0 models
  • Synced the backend with the latest ZenDNN
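
For readers unfamiliar with the format: Q8_0 in ggml stores weights in blocks of 32 signed 8-bit values that share one scale. The sketch below illustrates that scheme only; it is not the ggml-zendnn code from this PR. `BlockQ8`, `quantize_q8_sketch`, and `dequantize_q8_sketch` are hypothetical names, and the scale is kept as a `float` here for simplicity, whereas ggml stores it as fp16.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Q8_0 groups 32 weights per block (QK8_0 in ggml).
constexpr int kBlockSize = 32;

// Illustrative block layout (assumption: float scale instead of ggml's fp16).
struct BlockQ8 {
    float  d;               // per-block scale
    int8_t qs[kBlockSize];  // quantized values in [-127, 127]
};

// Quantize a float vector whose length is a multiple of the block size.
std::vector<BlockQ8> quantize_q8_sketch(const std::vector<float>& x) {
    assert(x.size() % kBlockSize == 0);
    std::vector<BlockQ8> out(x.size() / kBlockSize);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float* src = x.data() + b * kBlockSize;
        // The scale maps the largest magnitude in the block onto 127.
        float amax = 0.0f;
        for (int i = 0; i < kBlockSize; ++i) {
            amax = std::fmax(amax, std::fabs(src[i]));
        }
        const float d  = amax / 127.0f;
        const float id = (d != 0.0f) ? 1.0f / d : 0.0f;
        out[b].d = d;
        for (int i = 0; i < kBlockSize; ++i) {
            out[b].qs[i] = static_cast<int8_t>(std::lround(src[i] * id));
        }
    }
    return out;
}

// Reconstruct approximate floats: value = scale * quantized value.
std::vector<float> dequantize_q8_sketch(const std::vector<BlockQ8>& blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * kBlockSize);
    for (const BlockQ8& blk : blocks) {
        for (int i = 0; i < kBlockSize; ++i) {
            out.push_back(blk.d * static_cast<float>(blk.qs[i]));
        }
    }
    return out;
}
```

In this sketch the round-trip error of each element is bounded by roughly half of its block's scale, which is why the format is usually treated as close to lossless for inference.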

Benchmark Results

Benchmark configuration:

  • threads = 96
  • type_k = bf16
  • type_v = bf16

Llama-3.1-8B-Instruct Q8_0

| Prompt Size | GGML_CPU_Q8_0 (t/s) | ZenDNN_Q8_0 (t/s) | Gain |
|---|---|---|---|
| 256 | 472.28 | 730.87 | 54.75% |
| 512 | 450.86 | 832.48 | 84.64% |
| 768 | 446.81 | 864.52 | 93.49% |
| 1024 | 439.58 | 800.15 | 82.03% |
| 2048 | 405.07 | 778.34 | 92.15% |
| tg128 | 33.08 | 33.14 | 0.18% |

Mixtral-8x7B Q8_0

| Prompt Size | GGML_CPU_Q8_0 (t/s) | ZenDNN_Q8_0 (t/s) | Gain |
|---|---|---|---|
| 256 | 156.09 | 297.67 | 90.70% |
| 512 | 156.63 | 389.44 | 148.64% |
| 768 | 156.76 | 417.38 | 166.25% |
| 1024 | 154.70 | 438.73 | 183.60% |
| 2048 | 150.11 | 470.41 | 213.38% |
| tg128 | 20.95 | 20.92 | -0.14% |

gemma4 31B Q8_0

| Prompt Size | GGML_CPU_Q8_0 (t/s) | ZenDNN_Q8_0 (t/s) | Gain |
|---|---|---|---|
| 256 | 116.05 | 195.02 | 68.05% |
| 512 | 112.53 | 229.12 | 103.61% |
| 768 | 111.96 | 239.02 | 113.49% |
| 1024 | 110.93 | 238.03 | 114.58% |
| 2048 | 106.37 | 222.32 | 109.01% |
| tg128 | 8.50 | 8.47 | -0.35% |

gemma-4-26B-A4B-it Q8_0

| Prompt Size | GGML_CPU_Q8_0 (t/s) | ZenDNN_Q8_0 (t/s) | Gain |
|---|---|---|---|
| 256 | 570.87 | 597.84 | 4.72% |
| 512 | 581.80 | 666.18 | 14.50% |
| 768 | 588.67 | 683.91 | 16.18% |
| 1024 | 574.79 | 684.13 | 19.02% |
| 2048 | 562.26 | 642.08 | 14.20% |
| tg128 | 33.96 | 33.83 | -0.38% |

Observations

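  • Prompt processing is faster with ZenDNN at every tested prompt size, with gains ranging from 4.72% (gemma-4-26B-A4B-it, 256 tokens) to 213.38% (Mixtral-8x7B, 2048 tokens); for each model, the gains at 512 tokens and above exceed the gain at 256 tokens
  • Decoding (tg128) performance remains comparable to ggml-cpu, within ±0.4% on all four models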
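
The Gain column in the benchmark tables is consistent with the relative throughput change of ZenDNN over the ggml-cpu baseline, expressed in percent. A minimal sketch of that arithmetic follows; `gain_percent` is a hypothetical helper name used only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Relative throughput gain of a candidate over a baseline, in percent.
// Both arguments are tokens per second; the baseline must be positive.
double gain_percent(double baseline_tps, double candidate_tps) {
    assert(baseline_tps > 0.0);
    return (candidate_tps / baseline_tps - 1.0) * 100.0;
}
```

For example, the Llama-3.1-8B-Instruct row at prompt size 256 gives `gain_percent(472.28, 730.87)`, which rounds to the reported 54.75%.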

Additional information

Validated on:

  • Llama-3.1-8B-Instruct Q8_0
  • Mixtral-8x7B Q8_0
  • gemma4 31B Q8_0
  • gemma-4-26B-A4B-it Q8_0

Requirements

@CISC
Member

CISC commented May 20, 2026

cc/ @avinashcpandey @Jiten1parmar @z-vishal

@github-actions (bot) added the ggml and AMD ZenDNN labels on May 20, 2026
Four review comment threads on ggml/src/ggml-zendnn/ggml-zendnn.cpp (three marked Outdated)
@z-vishal
Contributor

z-vishal commented May 20, 2026

8-bit quantization support was long awaited in the ZenDNN backend, and the benchmark numbers look solid! Thanks @z-sachin.
Big thanks to the ZenDNN team for making this happen.
cc: @amukho @avinashcpandey @Jiten1parmar

@z-vishal
Contributor

@CISC the PR looks good from my side.
I approved the changes, so we can merge now :)

@z-vishal
Contributor

@ggml-org/maintainers Another approval required

@taronaeo
Member

Are we waiting for CI? Looks pretty jammed up.

@CISC
Member

CISC commented May 22, 2026

I'm preemptively clearing the queue for Release fix.

@z-vishal
Contributor

@CISC could you trigger the cancelled checks again if the release is done?

@CISC merged commit 99d4026 into ggml-org:master on May 22, 2026 (17 of 49 checks passed)
@CISC
Member

CISC commented May 22, 2026

> @CISC could you trigger the cancelled checks again if the release is done?

No need, the ones that finished are good enough.

Alex7MV pushed a commit to Alex7MV/claude_llama.cpp that referenced this pull request May 22, 2026
* ggml-zendnn : add Q8_0 quantization support

* ggml-zendnn : sync with latest ZenDNN

* ggml-zendnn : address review comments for Q8_0
Labels

AMD ZenDNN, ggml
