
ggml-webgpu: Enable NVIDIA self-hosted CI #22976

Merged: reeselevine merged 7 commits into ggml-org:master from reeselevine:enable-nvidia-ci on May 14, 2026
@reeselevine reeselevine commented May 12, 2026

Overview

Enables the self-hosted NVIDIA CI for the WebGPU backend. To pass the CI, the NMSE threshold had to be relaxed to avoid errors in many operations that write to f16 tensors. This includes operations like DIV, where even if the calculation is done in f32, casting to f16 causes slight drift, and SET_ROWS, where the operation is a straightforward cast. I found that the errors were usually between 2e-7 and 3e-7, just above the default 1e-7 threshold set by test-backend-ops.
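For reference, a minimal sketch of the metric that is compared against the threshold, assuming NMSE here means the sum of squared differences divided by the sum of squared reference values. The function name and signature are illustrative and are not copied from test-backend-ops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative NMSE: sum((ref - out)^2) / sum(ref^2).
// Assumption: `ref` is the reference (e.g. CPU) output and `out` is the
// output of the backend under test. The result is undefined if every
// reference value is zero, since the denominator is then zero.
static double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    assert(ref.size() == out.size());
    double err  = 0.0; // accumulated squared error
    double norm = 0.0; // accumulated squared reference magnitude
    for (std::size_t i = 0; i < ref.size(); i++) {
        const double r = static_cast<double>(ref[i]);
        const double d = r - static_cast<double>(out[i]);
        err  += d * d;
        norm += r * r;
    }
    return err / norm;
}
```

Because the error is squared, a relative per-element difference on the order of one f16 ulp (roughly 1e-3) lands in the 1e-7 to 1e-6 range, which is where the default threshold sits.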

Since the WebGPU backend ultimately lowers to Vulkan on this CI host, I investigated the difference in the SPIR-V code between the two, and found that while the instruction for the cast is the same (OpFConvert), the Vulkan backend adds Vulkan's "round-to-even" mode, which matches ggml-cpu's conversion from f32 to f16. However, WebGPU does not specify the rounding mode, leaving it implementation-defined, and Dawn currently does not expose rounding mode control to my knowledge (although interestingly, rounding mode is an example in a hypothetical extension for WGSL).
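To make the effect of the rounding mode concrete, here is a small sketch that rounds an f32 value to f16 precision in two ways: truncation (round toward zero) and round-to-nearest-even. Both helpers are hypothetical and only handle finite values in the normal f16 range; they are not taken from ggml, Dawn, or any driver, and this PR does not establish which mode a given WebGPU implementation actually uses.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// f32 has 23 explicit significand bits, f16 has 10, so a conversion must
// discard the low 13 bits. These helpers return the f16-precision result
// as an f32. Assumption: the input is finite and within the normal f16
// range; subnormals, overflow, inf and NaN are NOT handled.

// Round toward zero: simply drop the low 13 bits.
static float quantize_f16_rtz(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= ~uint32_t{0x1FFF};
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Round to nearest, ties to even: add just under half an f16 ulp, plus
// one more if the bit that will become the f16 LSB is set, then drop
// the low 13 bits.
static float quantize_f16_rte(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    const uint32_t lsb = (bits >> 13) & 1u;
    bits += 0x0FFFu + lsb;
    bits &= ~uint32_t{0x1FFF};
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}
```

For example, with `x = 1.0f + 0.75f / 1024.0f`, truncation gives `1.0f` while round-to-nearest-even gives `1.0f + 1.0f / 1024.0f`, a one-ulp difference at f16 precision.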

Ultimately, this means that the WebGPU backend may need slightly looser tolerances for floating-point operations. While that may mean some models on some devices are slightly off compared to other backends, that is already the case right now, so I think enabling this CI and making it an explicit decision for now is worth it. If Dawn or WebGPU ever adds support for rounding mode, we can revisit this.
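As a rough, self-contained sketch of what the per-backend tolerance amounts to: the helper below is hypothetical (it takes the backend name and default threshold as plain arguments rather than using ggml APIs), and only the "WebGPU" name and the 1e-6 floor come from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper, not part of ggml: choose the maximum allowed NMSE
// for a test. Tests that involve f16 tensors on the WebGPU backend get a
// floor of 1e-6; everything else keeps the test's default threshold.
static double pick_max_nmse(const std::string & backend_name,
                            bool                contains_f16,
                            double              default_max) {
    if (contains_f16 && backend_name == "WebGPU") {
        // std::max keeps any test-specific threshold that is already looser.
        return std::max(default_max, 1e-6);
    }
    return default_max;
}
```

Using a floor rather than a fixed value means tests that already override the default with a looser threshold are not accidentally tightened.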

The other related change in this PR is to clamp random values to the range [-10, 10] for EXP and EXPM1 f16 tensors. Another quirk of WebGPU is that some inf f32 values can be cast to the max f16 value (65504.0), due to the rules on discarding extra significand bits, and the existing range was exposing this.
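A quick host-side check of why [-10, 10] is a safe input range: exp(10) is about 22026, well below the f16 maximum, while inputs above roughly 11.09 (the natural log of 65504) overflow it. The helper name is illustrative, and the check only looks at the value range, not at how any GPU performs the cast.

```cpp
#include <cassert>
#include <cmath>

// Largest finite f16 value, as quoted in the PR description.
constexpr float F16_MAX = 65504.0f;

// Illustrative helper: true if exp(x), computed in f32, does not exceed
// the largest finite f16 value. It says nothing about rounding or about
// what a particular backend does when the value does not fit.
static bool exp_fits_in_f16(float x) {
    return std::exp(x) <= F16_MAX;
}
```

With this, `exp_fits_in_f16(10.0f)` holds and `exp_fits_in_f16(12.0f)` does not, so clamping to [-10, 10] keeps the test away from the overflow boundary where the cast behaviour differs.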

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: yes, to investigate various rounding methods in ggml

@github-actions (bot) added the devops, testing, ggml, and WebGPU labels May 12, 2026
@reeselevine reeselevine marked this pull request as ready for review May 14, 2026 03:17
@reeselevine reeselevine requested review from a team and ggerganov as code owners May 14, 2026 03:17
@reeselevine reeselevine requested a review from CISC May 14, 2026 03:18
Comment on lines +1131 to +1134
ggml_backend_reg_t reg = ggml_backend_dev_backend_reg(ggml_backend_get_device(backend));
if (contains_f16 && strcmp(ggml_backend_reg_name(reg), "WebGPU") == 0) {
    return std::max(max_nmse_err(), 1e-6);
}
Member

You may want to reference this PR in a comment next to the change; otherwise future maintainers will wonder why WebGPU has a special case.

Contributor Author

Added. Can I get a reapproval?

Added comment referencing pull request for clarification.
@reeselevine reeselevine merged commit 834a243 into ggml-org:master May 14, 2026
40 of 49 checks passed
xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 15, 2026
* Enabel nvidia ci for webgpu

* Address precision issues

* fix placement

* Relax more set_rows and div

* Try relaxing all f16

* formatting and naming

* Add comment explaining max_nmse_err logic

Added comment referencing pull request for clarification.