
[SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention#23812

Merged: ggerganov merged 2 commits into ggml-org:master from arthw:enhance_flash-attention on Jun 1, 2026

arthw (Contributor) commented May 28, 2026

Add support for Q4_1, Q5_0, and Q5_1 in the SYCL flash-attention implementation.
Unit tests (UT) pass locally.

arthw requested a review from a team as a code owner May 28, 2026 11:11
github-actions bot added the documentation, ggml, and SYCL labels May 28, 2026
arthw added the merge ready label May 29, 2026
nilo85 commented May 31, 2026

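With this PR applied, I got about a 20% token generation speed improvement using q4_1 vs q4_0 for the KV cache with Qwen3.6 35B A3B at a context depth of 10240 (tg128 @ d10240: 54.72 vs 44.82 t/s); at depth 0 the gain is only about 3% 🥳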

EDIT: my understanding is that this is because the default q4_0 path suffers from being upscaled to fp16 at compute time..?! =)

-ctk q4_0 -ctv q4_0:

[root@niklas-pc:/home/niklas/workspace/llama.cpp]# podman run --rm -it   --device /dev/dri/renderD129   -e ZES_ENABLE_SYSMAN=1   -v /root/.cache/huggingface/hub:/models:Z   localhost/llama.cpp:full-intel-extras   --bench   -t 6 -ngl 99 -fa 1 -ctk q4_0 -ctv q4_0   -p 512   -n 128   -d 0,10240   -b 2048 -ub 1024 -r 5   -m /models/models--unsloth--Qwen3.6-35B-A3B-GGUF/snapshots/a483e9e6cbd595906af30beda3187c2663a1118c/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
load_backend: loaded SYCL backend from /app/libggml-sycl.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
| model                          |       size |     params | backend    | ngl | threads | n_ubatch | type_k | type_v |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: | -----: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_0 |   q4_0 |   1 |           pp512 |        981.77 ± 5.32 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_0 |   q4_0 |   1 |           tg128 |         67.33 ± 0.15 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_0 |   q4_0 |   1 |  pp512 @ d10240 |        743.22 ± 2.45 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_0 |   q4_0 |   1 |  tg128 @ d10240 |         44.82 ± 0.04 |

build: 05ab40313 (9454)

-ctk q4_1 -ctv q4_1:

[root@niklas-pc:/home/niklas/workspace/llama.cpp]# podman run --rm -it   --device /dev/dri/renderD129   -e ZES_ENABLE_SYSMAN=1   -v /root/.cache/huggingface/hub:/models:Z   localhost/llama.cpp:full-intel-extras   --bench   -t 6 -ngl 99 -fa 1 -ctk q4_1 -ctv q4_1   -p 512   -n 128   -d 0,10240   -b 2048 -ub 1024 -r 5   -m /models/models--unsloth--Qwen3.6-35B-A3B-GGUF/snapshots/a483e9e6cbd595906af30beda3187c2663a1118c/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
load_backend: loaded SYCL backend from /app/libggml-sycl.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
| model                          |       size |     params | backend    | ngl | threads | n_ubatch | type_k | type_v |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: | -----: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_1 |   q4_1 |   1 |           pp512 |        982.63 ± 9.19 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_1 |   q4_1 |   1 |           tg128 |         69.32 ± 0.25 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_1 |   q4_1 |   1 |  pp512 @ d10240 |        739.69 ± 4.93 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q4_1 |   q4_1 |   1 |  tg128 @ d10240 |         54.72 ± 0.17 |

build: 05ab40313 (9454)

-ctk q5_0 -ctv q5_0:

[root@niklas-pc:/home/niklas/workspace/llama.cpp]# podman run --rm -it   --device /dev/dri/renderD129   -e ZES_ENABLE_SYSMAN=1   -v /root/.cache/huggingface/hub:/models:Z   localhost/llama.cpp:full-intel-extras   --bench   -t 6 -ngl 99 -fa 1 -ctk q5_0 -ctv q5_0   -p 512   -n 128   -d 0,10240   -b 2048 -ub 1024 -r 5   -m /models/models--unsloth--Qwen3.6-35B-A3B-GGUF/snapshots/a483e9e6cbd595906af30beda3187c2663a1118c/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
load_backend: loaded SYCL backend from /app/libggml-sycl.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
| model                          |       size |     params | backend    | ngl | threads | n_ubatch | type_k | type_v |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: | -----: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q5_0 |   q5_0 |   1 |           pp512 |        957.79 ± 6.78 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q5_0 |   q5_0 |   1 |           tg128 |         66.30 ± 0.18 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q5_0 |   q5_0 |   1 |  pp512 @ d10240 |        740.21 ± 3.96 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       6 |     1024 |   q5_0 |   q5_0 |   1 |  tg128 @ d10240 |         37.07 ± 0.14 |

build: 05ab40313 (9454)
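
To put the "about 20%" figure in context, the relative change of each cache type against the q4_0 baseline can be worked out directly from the three tables above. The following is a minimal sketch: the dictionary layout and the function names are illustrative and not part of llama.cpp, the throughput values are copied from the runs above, and the reported ± spreads are ignored.

```python
# Throughput in tokens/s, copied from the llama-bench tables above.
# The "± x" spreads are deliberately ignored in this sketch.
RESULTS = {
    "q4_0": {"pp512": 981.77, "tg128": 67.33,
             "pp512 @ d10240": 743.22, "tg128 @ d10240": 44.82},
    "q4_1": {"pp512": 982.63, "tg128": 69.32,
             "pp512 @ d10240": 739.69, "tg128 @ d10240": 54.72},
    "q5_0": {"pp512": 957.79, "tg128": 66.30,
             "pp512 @ d10240": 740.21, "tg128 @ d10240": 37.07},
}


def relative_change(new: float, base: float) -> float:
    """Return the change of `new` relative to `base`, in percent."""
    return (new - base) / base * 100.0


def compare(baseline: str = "q4_0") -> dict:
    """Map each non-baseline cache type to its per-test change in percent."""
    base = RESULTS[baseline]
    return {
        cache_type: {test: relative_change(tps, base[test])
                     for test, tps in tests.items()}
        for cache_type, tests in RESULTS.items()
        if cache_type != baseline
    }


if __name__ == "__main__":
    for cache_type, changes in compare().items():
        for test, pct in changes.items():
            print(f"{cache_type:5s} {test:15s} {pct:+6.1f}%")
```

On these numbers, the gain for q4_1 is concentrated in token generation at depth 10240 (roughly +22%), while at depth 0 it is about +3% and prompt processing is essentially unchanged. In the same runs q5_0 is roughly 17% slower than q4_0 for token generation at depth 10240.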

ggerganov merged commit a511424 into ggml-org:master on Jun 1, 2026
28 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jun 1, 2026
* origin/master: (36 commits)
vendor : update cpp-httplib to 0.46.1 (ggml-org#23980)
llama: limit max outputs of `llama_context` (ggml-org#23861)
metal: template GLU kernels to support f16/f32 (ggml-org#23882)
vulkan: don't hold the device mutex while compiling pipelines (ggml-org#23641)
vulkan: reduce host memory lock contention (ggml-org#23376)
vocab: add normalizer.lowercase support to WPM (ggml-org#23899)
TP: quantized KV cache support (ggml-org#23792)
security : disable private disclosures (ggml-org#23963)
model: Add EXAONE 4.5 implementations (ggml-org#21733)
vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (ggml-org#23056)
vulkan: Removed unused functions (ggml-org#23175)
common : support manually triggering the reasoning budget end sequence (ggml-org#23949)
ci : add missing Linux label to cpu-x64-high-perf runner (ggml-org#23958)
[SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (ggml-org#23812)
[SYCL] Add more types in GET_ROWS OP (ggml-org#23710)
sycl : Optimize Q3_K mul_mat by reorder (ggml-org#23725)
ci: remove redundant or duplicate jobs (ggml-org#23927)
server : handle If-None-Match weak ETags (ggml-org#23916)
ci : limit trigger paths for the CPU workflow (ggml-org#23938)
vocab : add tokenizer support for jina-embeddings-v2-base-zh (ggml-org#18756)
...
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
