ggml-zendnn : add Q8_0 quantization support #23414
Merged
Member
z-vishal
reviewed
May 20, 2026
Contributor
8-bit quantization support was much awaited in the ZenDNN backend, and the benchmark numbers look solid! Thanks @z-sachin
z-vishal
approved these changes
May 22, 2026
Contributor
@CISC the PR looks good
Contributor
@ggml-org/maintainers Another approval required
CISC
approved these changes
May 22, 2026
taronaeo
approved these changes
May 22, 2026
Member
Are we waiting for CI? Looks pretty jammed up.
Member
I'm preemptively clearing the queue for
Contributor
@CISC could you trigger the cancelled checks again if the release is done?
Member
No need, the ones that finished are good enough. |
Alex7MV
pushed a commit
to Alex7MV/claude_llama.cpp
that referenced
this pull request
May 22, 2026
* ggml-zendnn : add Q8_0 quantization support
* ggml-zendnn : sync with latest ZenDNN
* ggml-zendnn : address review comments for Q8_0
Overview
This PR adds Q8_0 quantization support in the ggml-zendnn backend.
The implementation enables ZenDNN execution paths for Q8_0 models and integrates the required handling for quantized weights and matmul operations.
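For readers unfamiliar with the format, the following is a minimal sketch of what Q8_0 weights look like, not the code added in this PR. It assumes the usual ggml layout of 32 signed 8-bit values per block sharing one scale. The names `BlockQ8_0`, `quantize_q8_0`, and `dequantize_q8_0` are illustrative, and the scale is kept as a 32-bit float for simplicity, whereas ggml stores it as fp16.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of weights per Q8_0 block (ggml uses 32).
constexpr int QK8_0 = 32;

// Illustrative block layout: one scale plus 32 int8 quants.
// Assumption: float scale here; ggml stores the scale as fp16.
struct BlockQ8_0 {
    float  d;
    int8_t qs[QK8_0];
};

// Quantize a float vector whose length is a multiple of QK8_0.
std::vector<BlockQ8_0> quantize_q8_0(const std::vector<float>& x) {
    std::vector<BlockQ8_0> out(x.size() / QK8_0);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float* src = x.data() + b * QK8_0;

        // The block scale maps the largest magnitude onto 127.
        float amax = 0.0f;
        for (int i = 0; i < QK8_0; ++i) {
            amax = std::max(amax, std::fabs(src[i]));
        }
        const float d  = amax / 127.0f;
        const float id = (d != 0.0f) ? 1.0f / d : 0.0f;

        out[b].d = d;
        for (int i = 0; i < QK8_0; ++i) {
            out[b].qs[i] = static_cast<int8_t>(std::lround(src[i] * id));
        }
    }
    return out;
}

// Reconstruct approximate floats: x[i] = d * qs[i].
std::vector<float> dequantize_q8_0(const std::vector<BlockQ8_0>& blocks) {
    std::vector<float> y(blocks.size() * QK8_0);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        for (int i = 0; i < QK8_0; ++i) {
            y[b * QK8_0 + i] = blocks[b].d * static_cast<float>(blocks[b].qs[i]);
        }
    }
    return y;
}
```

A round trip through `quantize_q8_0` and `dequantize_q8_0` loses at most about half a scale step per weight. How the backend actually hands these blocks to ZenDNN's matmul is specific to the PR and is not shown here.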
Key changes:
Benchmark Results
Benchmark configuration:
Llama-3.1-8B-Instruct Q8_0
Mixtral-8x7B Q8_0
gemma4 31B Q8_0
gemma-4-26B-A4B-it Q8_0
Observations
tg128) performance remains comparable to ggml-cpu
Additional information
Validated on:
Requirements