ggml-webgpu: replace f32 with kv_type and q_type by Constannnnnt · Pull Request #23372 · ggml-org/llama.cpp

Constannnnnt · 2026-05-20T01:42:39Z

Overview

This PR addresses the discussion in #22808: instead of always using f32 for KV shared memory, it uses kv_type and q_type. Tested with the same prompts and images.
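To illustrate why the element type matters for KV shared memory, here is a rough sketch of the footprint arithmetic. The tile shape (32 rows, head dimension 128) and the helper name `kv_tile_bytes` are hypothetical and not taken from the ggml-webgpu shaders; only the element sizes (4 bytes for f32, 2 bytes for f16) are fixed by the formats.

```python
# Element sizes in bytes, fixed by the formats themselves.
ELEM_BYTES = {"f32": 4, "f16": 2}


def kv_tile_bytes(kv_rows: int, head_dim: int, elem_type: str) -> int:
    """Bytes needed to stage one K tile plus one V tile.

    Hypothetical helper for illustration; the real shader layout may differ.
    """
    per_tile = kv_rows * head_dim * ELEM_BYTES[elem_type]
    return 2 * per_tile  # one tile for K, one for V


# Assumed tile shape, chosen only to make the arithmetic concrete.
KV_ROWS, HEAD_DIM = 32, 128

f32_bytes = kv_tile_bytes(KV_ROWS, HEAD_DIM, "f32")
f16_bytes = kv_tile_bytes(KV_ROWS, HEAD_DIM, "f16")
print(f"f32 staging: {f32_bytes} bytes, f16 staging: {f16_bytes} bytes")
```

Under these assumptions, staging the tiles in an f16 KV type takes half the bytes of always staging in f32.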

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO.

ArberSephirotheca

Looks great to me!

Constannnnnt · 2026-05-20T21:54:08Z

Actually, I tested again using Qwen3.5-0.8B-Q8_0 with mmproj-f16 and the same prompt and image. The final cosine similarity dropped slightly: f32 (cos=0.99999412) vs. KV_TYPE (cos=0.99999248). The output was nearly identical, at least for the first 128 tokens.
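For readers unfamiliar with the metric, the sketch below shows how a cosine similarity between two output vectors can be computed, and how rounding values through f16 nudges it slightly below 1. The sample vector and the helper names are made up for illustration; this is not the test harness used in the PR, and the comment does not say which reference the reported values were measured against.

```python
import math
import struct


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def round_to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Hypothetical reference values standing in for real model outputs.
reference = [0.1, -0.25, 0.5, 1.0, 3.14159, -2.71828]
rounded = [round_to_f16(v) for v in reference]

cos = cosine_similarity(reference, rounded)
print(f"cos = {cos:.8f}")
```

A value very close to 1, as in the numbers reported above, means the two outputs point in almost the same direction.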

* origin/master: (138 commits)
  fix(flash-attn): replace f32 with kv_type and q_type (ggml-org#23372)
  tests : move save-load-state from examples to tests (ggml-org#23336)
  server: expose prompt token counts in /slots endpoint (ggml-org#23454)
  metal : optimize concat kernel and fix set kernel threads (ggml-org#23411)
  server : free draft/MTP resources on sleep to fix VRAM leak (ggml-org#23461)
  server: re-inject subcommand when router spawns children under unified binary (ggml-org#23442)
  app : add batched-bench, fit-params, quantize & perplexity (ggml-org#23459)
  mtp: use inp_out_ids for skipping logit computation (ggml-org#23433)
  vocab : add Carbon-3B (HybridDNATokenizer) support (ggml-org#23410)
  doc: fix spec mtp typo (ggml-org#23435)
  ui: Improve Git Hooks for UI development (ggml-org#23403)
  ggml : Check the right iface method before using the fallback 2d get (ggml-org#23306)
  llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models (ggml-org#23131)
  hexagon: ssm-conv fix for large prompts (ggml-org#23307)
  app : show version (ggml-org#23426)
  mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (ggml-org#23329)
  ui: Add max image size option (ggml-org#22849)
  Move to backend sampling for MTP draft path (ggml-org#23287)
  opencl: refactor backend initilization (ggml-org#23318)
  common/speculative : fix nullptr crash in get_devices_str (ggml-org#23386)
  ...

fix(flash-attn): replace f32 with kv_type and q_type

c6036f3

Constannnnnt requested a review from a team as a code owner May 20, 2026 01:42

Constannnnnt changed the title ~~fix(flash-attn): replace f32 with kv_type and q_type~~ ggml-webgpu: replace f32 with kv_type and q_type May 20, 2026

github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and WebGPU labels May 20, 2026

ArberSephirotheca approved these changes May 20, 2026


reeselevine approved these changes May 21, 2026


reeselevine requested review from CISC and ggerganov May 21, 2026 02:50

CISC approved these changes May 21, 2026


reeselevine merged commit 5306f4b into ggml-org:master May 21, 2026
73 of 78 checks passed

ProTekk pushed a commit to ProTekk/buun-llama-cpp that referenced this pull request May 21, 2026

fix(flash-attn): replace f32 with kv_type and q_type (ggml-org#23372)

8ffcd0d

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026

fix(flash-attn): replace f32 with kv_type and q_type (ggml-org#23372)

56a449b

srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request May 23, 2026

fix(flash-attn): replace f32 with kv_type and q_type (ggml-org#23372)

2b819fe

a-ghorbani mentioned this pull request May 25, 2026

chore(deps): upgrade llama.rn to 0.12.4 a-ghorbani/pocketpal-ai#743

Merged

7 tasks

Constannnnnt deleted the webgpu/flash-attn-f16 branch May 28, 2026 19:55

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026

fix(flash-attn): replace f32 with kv_type and q_type (ggml-org#23372)

cef3401

turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026

fix(flash-attn): replace f32 with kv_type and q_type (ggml-org#23372)

90b24b8

reeselevine merged 1 commit into ggml-org:master from noumena-labs:webgpu/flash-attn-f16
