ggml-zendnn : add Q8_0 quantization support by z-sachin · Pull Request #23414 · ggml-org/llama.cpp

z-sachin · 2026-05-20T12:01:55Z

Overview

This PR adds Q8_0 quantization support to the ggml-zendnn backend.

The implementation enables ZenDNN execution paths for Q8_0 models. It also adds the handling needed for Q8_0 quantized weights and matrix multiplication (matmul) operations.
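
For reference, a Q8_0 matmul produces each output element as the dot product of one Q8_0 weight row with one f32 activation row. The sketch below is a plain scalar reference of that computation. It is meant as a correctness oracle when testing a backend and is not the ZenDNN kernel. It rests on several assumptions: `BlockQ8_0` and `matmul_q8_0_f32` are hypothetical names, the block scale is held as a float rather than ggml's fp16, and the output layout is assumed to follow ggml's MUL_MAT convention (n weight rows and m activation rows give an m x n result).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative Q8_0 block: 32 signed 8-bit values sharing one scale.
// ggml stores the scale as fp16; a float is used here to keep the sketch self-contained.
constexpr std::size_t kBlock = 32;

struct BlockQ8_0 {
    float d;                 // per-block scale
    std::int8_t qs[kBlock];  // quantized weights
};

// Scalar reference matmul (hypothetical helper, not a ggml or ZenDNN API).
// w: n weight rows of k values, stored as n * (k / 32) Q8_0 blocks, row by row.
// x: m activation rows of k floats, row-major.
// Returns y as m rows of n floats: y[i][j] = dot(weight row j, activation row i).
std::vector<float> matmul_q8_0_f32(const std::vector<BlockQ8_0>& w, std::size_t n,
                                   const std::vector<float>& x, std::size_t m,
                                   std::size_t k) {
    assert(k % kBlock == 0);
    const std::size_t blocks_per_row = k / kBlock;
    assert(w.size() == n * blocks_per_row && x.size() == m * k);

    std::vector<float> y(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t b = 0; b < blocks_per_row; ++b) {
                const BlockQ8_0& blk = w[j * blocks_per_row + b];
                const float* xs = &x[i * k + b * kBlock];
                float partial = 0.0f;
                for (std::size_t t = 0; t < kBlock; ++t) {
                    partial += static_cast<float>(blk.qs[t]) * xs[t];
                }
                acc += blk.d * partial;  // apply the block scale once per block
            }
            y[i * n + j] = acc;
        }
    }
    return y;
}
```

Applying the scale once per 32-element block, instead of once per element, keeps the inner loop a pure int8-by-float dot product. Optimized kernels exploit that structure.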

Key changes:

- Added Q8_0 support to the ggml-zendnn backend
- Enabled the ZenDNN execution path for Q8_0 quantized matmul operations
- Added handling for Q8_0 tensor layouts and conversions
- Integrated backend execution support for Q8_0 models
- Synced the backend with the latest ZenDNN
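
The Q8_0 layout that this conversion handling deals with can be sketched as follows. Values are grouped in blocks of 32, each block stores one scale `d = max|x| / 127`, and each value is stored as `round(x / d)` in an int8. This follows ggml's reference Q8_0 scheme as far as we know it. The helper names `quantize_q8_0` and `dequantize_q8_0` are hypothetical. The scale is kept as a float for simplicity, whereas ggml packs it as fp16.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kQK8_0 = 32;  // values per Q8_0 block

// Simplified Q8_0 block; ggml keeps `d` as fp16, a float is used here for clarity.
struct BlockQ8_0 {
    float d;
    std::int8_t qs[kQK8_0];
};

// Quantize n floats (n a multiple of 32) into Q8_0 blocks: d = max|x| / 127, q = round(x / d).
std::vector<BlockQ8_0> quantize_q8_0(const float* x, std::size_t n) {
    assert(n % kQK8_0 == 0);
    std::vector<BlockQ8_0> out(n / kQK8_0);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float* src = x + b * kQK8_0;
        float amax = 0.0f;
        for (std::size_t i = 0; i < kQK8_0; ++i) amax = std::max(amax, std::fabs(src[i]));
        const float d = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;  // an all-zero block stays zero
        out[b].d = d;
        for (std::size_t i = 0; i < kQK8_0; ++i) {
            out[b].qs[i] = static_cast<std::int8_t>(std::lround(src[i] * id));
        }
    }
    return out;
}

// Dequantize back to floats: each value is reconstructed as d * q.
std::vector<float> dequantize_q8_0(const std::vector<BlockQ8_0>& blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * kQK8_0);
    for (const BlockQ8_0& blk : blocks) {
        for (std::size_t i = 0; i < kQK8_0; ++i) out.push_back(blk.d * blk.qs[i]);
    }
    return out;
}
```

Because each block has its own scale, the round-trip error for any value is at most half of that block's `d`. An outlier therefore only costs precision within its own 32-value block.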

Benchmark Results

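Benchmark configuration:

- threads = 96
- type_k = bf16 (K cache type)
- type_v = bf16 (V cache type)

Throughput is reported in tokens per second (t/s). The numeric rows are prompt-processing runs at the given prompt length, and tg128 is text generation of 128 tokens. Gain is the relative throughput increase of ZenDNN_Q8_0 over GGML_CPU_Q8_0.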

Llama-3.1-8B-Instruct Q8_0

| Prompt size | GGML_CPU_Q8_0 (t/s) | ZenDNN_Q8_0 (t/s) | Gain |
|---|---:|---:|---:|
| 256 | 472.28 | 730.87 | 54.75% |
| 512 | 450.86 | 832.48 | 84.64% |
| 768 | 446.81 | 864.52 | 93.49% |
| 1024 | 439.58 | 800.15 | 82.03% |
| 2048 | 405.07 | 778.34 | 92.15% |
| tg128 | 33.08 | 33.14 | 0.18% |

Mixtral-8x7B Q8_0

| Prompt size | GGML_CPU_Q8_0 (t/s) | ZenDNN_Q8_0 (t/s) | Gain |
|---|---:|---:|---:|
| 256 | 156.09 | 297.67 | 90.70% |
| 512 | 156.63 | 389.44 | 148.64% |
| 768 | 156.76 | 417.38 | 166.25% |
| 1024 | 154.70 | 438.73 | 183.60% |
| 2048 | 150.11 | 470.41 | 213.38% |
| tg128 | 20.95 | 20.92 | -0.14% |

gemma4 31B Q8_0

| Prompt size | GGML_CPU_Q8_0 (t/s) | ZenDNN_Q8_0 (t/s) | Gain |
|---|---:|---:|---:|
| 256 | 116.05 | 195.02 | 68.05% |
| 512 | 112.53 | 229.12 | 103.61% |
| 768 | 111.96 | 239.02 | 113.49% |
| 1024 | 110.93 | 238.03 | 114.58% |
| 2048 | 106.37 | 222.32 | 109.01% |
| tg128 | 8.50 | 8.47 | -0.35% |

gemma-4-26B-A4B-it Q8_0

| Prompt size | GGML_CPU_Q8_0 (t/s) | ZenDNN_Q8_0 (t/s) | Gain |
|---|---:|---:|---:|
| 256 | 570.87 | 597.84 | 4.72% |
| 512 | 581.80 | 666.18 | 14.50% |
| 768 | 588.67 | 683.91 | 16.18% |
| 1024 | 574.79 | 684.13 | 19.02% |
| 2048 | 562.26 | 642.08 | 14.20% |
| tg128 | 33.96 | 33.83 | -0.38% |
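
The Gain column in these tables is consistent with the relative throughput increase `(ZenDNN / ggml-cpu - 1) * 100`, rounded to two decimals. The formula is inferred from the numbers rather than stated in the PR. The hypothetical helper below reproduces it and can be used to recheck rows.

```cpp
#include <cassert>
#include <cmath>

// Relative speedup in percent of a candidate throughput over a baseline (both in t/s):
// gain = (candidate / baseline - 1) * 100, rounded to two decimals as in the tables.
double gain_percent(double baseline_tps, double candidate_tps) {
    assert(baseline_tps > 0.0);
    const double gain = (candidate_tps / baseline_tps - 1.0) * 100.0;
    return std::round(gain * 100.0) / 100.0;
}
```

For example, the Llama-3.1-8B-Instruct 256-token row gives `gain_percent(472.28, 730.87)` = 54.75, matching the table.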

Observations

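- Prompt processing is faster with ZenDNN at every tested prompt size, by 4.72% to 213.38% depending on model and prompt length.
- For every model, gains at 512 tokens and above exceed the gain at 256 tokens. The largest gains are on Mixtral-8x7B (up to 213.38% at 2048 tokens), and the smallest are on gemma-4-26B-A4B-it.
- Decoding (tg128) throughput stays within 0.4% of ggml-cpu.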

Additional information

Validated on:

- Llama-3.1-8B-Instruct Q8_0
- Mixtral-8x7B Q8_0
- gemma4 31B Q8_0
- gemma-4-26B-A4B-it Q8_0

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes

CISC · 2026-05-20T12:11:47Z

cc @avinashcpandey @Jiten1parmar @z-vishal

z-vishal · 2026-05-20T21:09:40Z

8-bit quantization support was much awaited in the ZenDNN backend, and the benchmark numbers look solid! Thanks @z-sachin.
Big thanks to the ZenDNN team for making this happen.
cc: @amukho @avinashcpandey @Jiten1parmar

z-vishal · 2026-05-22T05:31:33Z

@CISC The PR looks good from my side, and I have approved the changes, so we can merge now :)

z-vishal · 2026-05-22T05:32:31Z

@ggml-org/maintainers Another approval is required.

taronaeo · 2026-05-22T07:43:37Z

Are we waiting for CI? Looks pretty jammed up.

CISC · 2026-05-22T07:50:26Z

I'm preemptively clearing the queue for the release fix.

z-vishal · 2026-05-22T11:10:00Z

@CISC could you trigger the cancelled checks again if the release is done?

CISC · 2026-05-22T11:17:18Z

> @CISC could you trigger the cancelled checks again if the release is done?

No need, the ones that finished are good enough.

* ggml-zendnn : add Q8_0 quantization support
* ggml-zendnn : sync with latest ZenDNN
* ggml-zendnn : address review comments for Q8_0



ggml-zendnn : add Q8_0 quantization support (8590426)

ggml-zendnn : sync with latest ZenDNN (c750e49)

The github-actions bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and AMD ZenDNN (issues related to the AMD ZenDNN backend) on May 20, 2026.

z-vishal reviewed on May 20, 2026, leaving four comment threads on ggml/src/ggml-zendnn/ggml-zendnn.cpp, three of which were later marked outdated.

ggml-zendnn : address review comments for Q8_0 (d1bd552)

z-vishal approved these changes on May 22, 2026.

CISC approved these changes on May 22, 2026.

taronaeo approved these changes on May 22, 2026.

CISC merged commit 99d4026 into ggml-org:master on May 22, 2026 (17 of 49 checks passed).

THEman6989 mentioned this pull request on May 22, 2026 in THEman6989/llama.cpp-gfx906-turbo-mtp#1, "Add install() for impl libraries and fix Apple/Android builds" (merged).

kashif pushed a commit to kashif/llama.cpp that referenced this pull request on May 23, 2026: ggml-zendnn : add Q8_0 quantization support (ggml-org#23414), 977fea5.


CISC merged 3 commits into ggml-org:master from z-sachin:ggml-zendnn/add-q8_0-support


4 participants
