ggml-webgpu: replace f32 with kv_type and q_type #23372

Merged
reeselevine merged 1 commit into ggml-org:master from noumena-labs:webgpu/flash-attn-f16 on May 21, 2026

@Constannnnnt
Contributor

Overview

This PR addresses the discussion in #22808: KV shared memory no longer always uses f32, and uses kv_type and q_type instead. Tested with the same prompts and images.

Requirements

@Constannnnnt Constannnnnt requested a review from a team as a code owner May 20, 2026 01:42
@Constannnnnt Constannnnnt changed the title from "fix(flash-attn): replace f32 with kv_type and q_type" to "ggml-webgpu: replace f32 with kv_type and q_type" May 20, 2026
@github-actions github-actions Bot added the ggml (changes relating to the ggml tensor library for machine learning) and WebGPU labels May 20, 2026
Contributor

@ArberSephirotheca ArberSephirotheca left a comment

Looks great to me!

@Constannnnnt
Contributor Author

Actually, I tested again using Qwen3.5-0.8B-Q8_0 with mmproj-f16 and the same prompt and image. The final cosine similarity dropped slightly, from cos=0.99999412 with f32 to cos=0.99999248 with KV_TYPE. The final results were quite similar, at least over the first 128 tokens.
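
For readers unfamiliar with the metric, here is a minimal Python sketch of how a cosine similarity between two output vectors can be computed, and how rounding values to half precision nudges it slightly below 1. It is not the script used for the numbers above: the helper names `cosine_similarity` and `round_to_f16` and the `reference` data are hypothetical, and it assumes the reduced-precision type is f16.

```python
import math
import struct


def cosine_similarity(a, b):
    """Return dot(a, b) / (|a| * |b|) for two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def round_to_f16(values):
    """Round each value to the nearest IEEE 754 half-precision float."""
    # The struct format code "e" is a 16-bit float, so a pack/unpack
    # round trip discards the precision that f16 cannot represent.
    return [struct.unpack("<e", struct.pack("<e", v))[0] for v in values]


# Hypothetical reference values for illustration only, not data from this PR.
reference = [0.1 * math.sin(i) + 0.01 * i for i in range(64)]
reduced = round_to_f16(reference)

print(f"cos = {cosine_similarity(reference, reduced):.8f}")
```

A value very close to 1 means the two vectors point in almost the same direction, which is why a small drop in the last few decimal places is consistent with nearly identical output.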

@reeselevine reeselevine requested review from CISC and ggerganov May 21, 2026 02:50
@reeselevine reeselevine merged commit 5306f4b into ggml-org:master May 21, 2026
73 of 78 checks passed
ProTekk pushed a commit to ProTekk/buun-llama-cpp that referenced this pull request May 21, 2026
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 21, 2026
* origin/master: (138 commits)
fix(flash-attn): replace f32 with kv_type and q_type (ggml-org#23372)
tests : move save-load-state from examples to tests (ggml-org#23336)
server: expose prompt token counts in /slots endpoint (ggml-org#23454)
metal : optimize concat kernel and fix set kernel threads (ggml-org#23411)
server : free draft/MTP resources on sleep to fix VRAM leak (ggml-org#23461)
server: re-inject subcommand when router spawns children under unified binary (ggml-org#23442)
app : add batched-bench, fit-params, quantize & perplexity (ggml-org#23459)
mtp: use inp_out_ids for skipping logit computation (ggml-org#23433)
vocab : add Carbon-3B (HybridDNATokenizer) support (ggml-org#23410)
doc: fix spec mtp typo (ggml-org#23435)
ui: Improve Git Hooks for UI development (ggml-org#23403)
ggml : Check the right iface method before using the fallback 2d get (ggml-org#23306)
llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models (ggml-org#23131)
hexagon: ssm-conv fix for large prompts (ggml-org#23307)
app : show version (ggml-org#23426)
mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (ggml-org#23329)
ui: Add max image size option (ggml-org#22849)
Move to backend sampling for MTP draft path (ggml-org#23287)
opencl: refactor backend initilization (ggml-org#23318)
common/speculative : fix nullptr crash in get_devices_str (ggml-org#23386)
...
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request May 23, 2026
@Constannnnnt Constannnnnt deleted the webgpu/flash-attn-f16 branch May 28, 2026 19:55
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
Labels

ggml (changes relating to the ggml tensor library for machine learning), WebGPU

4 participants