ggml webgpu: support for backend sampling #18880
reeselevine merged 23 commits into ggml-org:master
Conversation
Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32 precision for intermediate calculations to prevent f16 overflow. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * Follow Vulkan backend numerical stability pattern
Implements EXPM1 (exp(x) - 1) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support
Implements FLOOR (rounds down to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support
Implements CEIL (rounds up to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support
Implements ROUND (rounds to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support
Implements TRUNC (truncates towards zero) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support
… TRUNC, EXPM1, SOFTPLUS)
CISC
left a comment
Awesome!
The CI failure is just corrupt ccache, I have deleted the cache and will rerun it once CIs are done.
2026-01-16T21:14:19.3151243Z 35: Failing tests:
2026-01-16T21:14:19.3154644Z 35:   LOG(type=f16,ne=[10,5,4,3])
2026-01-16T21:14:19.3158114Z 35:   LOG(type=f16,ne=[7,1,5,3])
2026-01-16T21:14:19.3161950Z 35: Backend WebGPU: WebGPU: FAIL
Looks like it's numerical issues; let me try casting to f32 for the actual operation. It shouldn't affect performance much, if at all.
Could be a rounding issue on the result - for ggml-vulkan we've had to force RTNE to make a lot of these f16 tests pass.
Ah interesting, if this doesn't fix it I'll look into that. I don't think WebGPU supports the same features though.
Looks like the higher precision worked!
* ggml webgpu: add SOFTPLUS unary operator

  Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32 precision for intermediate calculations to prevent f16 overflow.
  * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
  * Register pipelines and device support
  * Follow Vulkan backend numerical stability pattern
* ggml webgpu: add EXPM1 unary operator

  Implements EXPM1 (exp(x) - 1) with f16/f32 support.
  * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
  * Register pipelines and device support
* ggml webgpu: add FLOOR unary operator

  Implements FLOOR (rounds down to nearest integer) with f16/f32 support.
  * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
  * Register pipelines and device support
* ggml webgpu: add CEIL unary operator

  Implements CEIL (rounds up to nearest integer) with f16/f32 support.
  * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
  * Register pipelines and device support
* ggml webgpu: add ROUND unary operator

  Implements ROUND (rounds to nearest integer) with f16/f32 support.
  * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
  * Register pipelines and device support
* ggml webgpu: add TRUNC unary operator

  Implements TRUNC (truncates towards zero) with f16/f32 support.
  * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
  * Register pipelines and device support
* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)
* Updates to webgpu get_memory
* Add argmax
* Add argmax, cumsum, sum, sum_rows
* Add necessary CPY/GET_ROWS operators
* Support for argsort using multi-pass strategy
* Update set_rows for i32 indices, move to pre-wgsl
* Port unary operators to pre-wgsl and support FILL
* Implement PAD
* Add support for top-k
* clean up, scope pipeline init mutex
* fix newline
* Add support for log
* Update LOG for better precision, and ops doc

---------

Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>
The refactor to backend-sampler tests in #18753 broke WebGPU CI: we hadn't implemented the necessary operations yet, and the refactor caused those tests to run automatically on whatever backend is available. I could have simply excluded the WebGPU backend from those tests, but instead I implemented all the operations needed to run the sampler tests via WebGPU.
So, this PR adds the following: