[SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention by arthw · Pull Request #23812 · ggml-org/llama.cpp

arthw · 2026-05-28T11:11:05Z

Support Q4_1, Q5_0, Q5_1 in Flash-attention
Unit test (UT) cases pass locally.
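For readers unfamiliar with these cache types, here is a minimal sketch of how a Q4_0 block and a Q4_1 block map back to floats. It assumes ggml's usual layout: 32 values per block, two 4-bit quants per byte, a per-block scale `d`, and for Q4_1 an additional per-block minimum `m`. The function names are hypothetical, and this is not the SYCL kernel code from this PR.

```python
QK = 32  # values per block (assumed from ggml's Q4_0 / Q4_1 layout)

def dequantize_q4_0(d, qs):
    """Q4_0 sketch: value = (q - 8) * d, with q in [0, 15]."""
    out = [0.0] * QK
    for j, byte in enumerate(qs):
        out[j] = ((byte & 0x0F) - 8) * d           # low nibble -> first half
        out[j + QK // 2] = ((byte >> 4) - 8) * d   # high nibble -> second half
    return out

def dequantize_q4_1(d, m, qs):
    """Q4_1 sketch: value = q * d + m, with q in [0, 15]."""
    out = [0.0] * QK
    for j, byte in enumerate(qs):
        out[j] = (byte & 0x0F) * d + m
        out[j + QK // 2] = (byte >> 4) * d + m
    return out

# Illustrative block: every byte holds low nibble 0 and high nibble 8.
qs = bytes([0x80] * (QK // 2))
q4_0_values = dequantize_q4_0(0.5, qs)
q4_1_values = dequantize_q4_1(0.5, 1.0, qs)
```

The difference is that Q4_0 centers the quants around zero with a fixed offset of 8, while Q4_1 stores an explicit minimum per block. Q5_0 and Q5_1 follow the same two patterns with a fifth bit per value.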

nilo85 · 2026-05-31T13:09:14Z

I got about a 20% token generation speed improvement at depth 10240 (tg128 @ d10240: 44.82 → 54.72 t/s) after applying this PR, using q4_1 instead of q4_0 for the KV cache with Qwen3.6 35B A3B 🥳

EDIT: my understanding is that the default q4_0 path suffers from an upscaling step to fp16 at compute time..?! =)

-ctk q4_0 -ctv q4_0:

[root@niklas-pc:/home/niklas/workspace/llama.cpp]# podman run --rm -it   --device /dev/dri/renderD129   -e ZES_ENABLE_SYSMAN=1   -v /root/.cache/huggingface/hub:/models:Z   localhost/llama.cpp:full-intel-extras   --bench   -t 6 -ngl 99 -fa 1 -ctk q4_0 -ctv q4_0   -p 512   -n 128   -d 0,10240   -b 2048 -ub 1024 -r 5   -m /models/models--unsloth--Qwen3.6-35B-A3B-GGUF/snapshots/a483e9e6cbd595906af30beda3187c2663a1118c/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
load_backend: loaded SYCL backend from /app/libggml-sycl.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
| model                          |       size |     params | backend    | ngl | threads | n_ubatch | type_k | type_v |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: | -----: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_0 |   q4_0 |   1 |           pp512 |        981.77 ± 5.32 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_0 |   q4_0 |   1 |           tg128 |         67.33 ± 0.15 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_0 |   q4_0 |   1 |  pp512 @ d10240 |        743.22 ± 2.45 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_0 |   q4_0 |   1 |  tg128 @ d10240 |         44.82 ± 0.04 |

build: 05ab40313 (9454)

-ctk q4_1 -ctv q4_1:

[root@niklas-pc:/home/niklas/workspace/llama.cpp]# podman run --rm -it   --device /dev/dri/renderD129   -e ZES_ENABLE_SYSMAN=1   -v /root/.cache/huggingface/hub:/models:Z   localhost/llama.cpp:full-intel-extras   --bench   -t 6 -ngl 99 -fa 1 -ctk q4_1 -ctv q4_1   -p 512   -n 128   -d 0,10240   -b 2048 -ub 1024 -r 5   -m /models/models--unsloth--Qwen3.6-35B-A3B-GGUF/snapshots/a483e9e6cbd595906af30beda3187c2663a1118c/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
load_backend: loaded SYCL backend from /app/libggml-sycl.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
| model                          |       size |     params | backend    | ngl | threads | n_ubatch | type_k | type_v |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: | -----: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_1 |   q4_1 |   1 |           pp512 |        982.63 ± 9.19 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_1 |   q4_1 |   1 |           tg128 |         69.32 ± 0.25 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_1 |   q4_1 |   1 |  pp512 @ d10240 |        739.69 ± 4.93 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_1 |   q4_1 |   1 |  tg128 @ d10240 |         54.72 ± 0.17 |

build: 05ab40313 (9454)

-ctk q5_0 -ctv q5_0:

[root@niklas-pc:/home/niklas/workspace/llama.cpp]# podman run --rm -it   --device /dev/dri/renderD129   -e ZES_ENABLE_SYSMAN=1   -v /root/.cache/huggingface/hub:/models:Z   localhost/llama.cpp:full-intel-extras   --bench   -t 6 -ngl 99 -fa 1 -ctk q5_0 -ctv q5_0   -p 512   -n 128   -d 0,10240   -b 2048 -ub 1024 -r 5   -m /models/models--unsloth--Qwen3.6-35B-A3B-GGUF/snapshots/a483e9e6cbd595906af30beda3187c2663a1118c/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
load_backend: loaded SYCL backend from /app/libggml-sycl.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
| model                          |       size |     params | backend    | ngl | threads | n_ubatch | type_k | type_v |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: | -----: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q5_0 |   q5_0 |   1 |           pp512 |        957.79 ± 6.78 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q5_0 |   q5_0 |   1 |           tg128 |         66.30 ± 0.18 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q5_0 |   q5_0 |   1 |  pp512 @ d10240 |        740.21 ± 3.96 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q5_0 |   q5_0 |   1 |  tg128 @ d10240 |         37.07 ± 0.14 |

build: 05ab40313 (9454)
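To make the comparison across the three runs easier to read, the following sketch computes the relative change in token generation speed against the q4_0 baseline. The throughput figures are copied from the tg128 rows of the tables above. The dictionary layout and the helper name `pct_change` are only illustrative.

```python
# Mean tokens/s from the tg128 rows above (the ± spread is ignored here).
tg = {
    "q4_0": {"tg128": 67.33, "tg128 @ d10240": 44.82},
    "q4_1": {"tg128": 69.32, "tg128 @ d10240": 54.72},
    "q5_0": {"tg128": 66.30, "tg128 @ d10240": 37.07},
}

def pct_change(new, base):
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0

changes = {}
for kv_type in ("q4_1", "q5_0"):
    for test, speed in tg[kv_type].items():
        changes[(kv_type, test)] = pct_change(speed, tg["q4_0"][test])
        print(f"{kv_type} vs q4_0, {test}: {changes[(kv_type, test)]:+.1f}%")
```

On these numbers, the q4_1 gain shows up mainly at depth 10240, while the difference at depth 0 is only a few percent. Prompt processing (pp512) is roughly flat across the three cache types.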

* origin/master: (36 commits)
  vendor : update cpp-httplib to 0.46.1 (ggml-org#23980)
  llama: limit max outputs of `llama_context` (ggml-org#23861)
  metal: template GLU kernels to support f16/f32 (ggml-org#23882)
  vulkan: don't hold the device mutex while compiling pipelines (ggml-org#23641)
  vulkan: reduce host memory lock contention (ggml-org#23376)
  vocab: add normalizer.lowercase support to WPM (ggml-org#23899)
  TP: quantized KV cache support (ggml-org#23792)
  security : disable private disclosures (ggml-org#23963)
  model: Add EXAONE 4.5 implementations (ggml-org#21733)
  vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (ggml-org#23056)
  vulkan: Removed unused functions (ggml-org#23175)
  common : support manually triggering the reasoning budget end sequence (ggml-org#23949)
  ci : add missing Linux label to cpu-x64-high-perf runner (ggml-org#23958)
  [SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (ggml-org#23812)
  [SYCL] Add more types in GET_ROWS OP (ggml-org#23710)
  sycl : Optimize Q3_K mul_mat by reorder (ggml-org#23725)
  ci: remove redundant or duplicate jobs (ggml-org#23927)
  server : handle If-None-Match weak ETags (ggml-org#23916)
  ci : limit trigger paths for the CPU workflow (ggml-org#23938)
  vocab : add tokenizer support for jina-embeddings-v2-base-zh (ggml-org#18756)
  ...

* support Q4_1, Q5_0, Q5_1
* update ut case

arthw added 2 commits May 28, 2026 19:05

support Q4_1, Q5_0, Q5_1

5ac528e

update ut case

fafea13

arthw requested a review from a team as a code owner May 28, 2026 11:11

github-actions bot added the documentation, ggml, and SYCL labels May 28, 2026

arthw added the merge ready label May 29, 2026

ggerganov merged commit a511424 into ggml-org:master Jun 1, 2026
28 checks passed

turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026

[SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (ggml-org#23812)

60d97c8

* support Q4_1, Q5_0, Q5_1
* update ut case

ggerganov merged 2 commits into ggml-org:master from arthw:enhance_flash-attention
