
Conversation

qjia7 (Contributor) commented Oct 30, 2025

This PR enables graph capture for WebGPU. It implements the CopyDeviceToCpu, CopyCpuToDevice, CopyFrom, and Zero functions using the new CopyTensors API.

The ORT side needs PR #26450 applied to make this work for WebGPU.

The following will be implemented in follow-up PRs to get the full performance gain from graph capture (the original one is #1720):

  1. Support UpdateAttentionMask, UpdatePositionIds, and Cast to keep the whole pipeline on the GPU.
  2. Optimize CopyFrom with offsets.

@ambroser53

Is this a result of internal discussion on the last comment from #1720 or some other approach?

Also, have you tested your WebGPU implementation on a UMA device (i.e. arm64/Apple Silicon)? If so, and you see the same issue I have, then we can discuss it. I have met with Arm reps on the KleidiAI team regarding the issue and may be able to give insights.

qjia7 (Contributor, Author) commented Nov 6, 2025

> Is this a result of internal discussion on the last comment from #1720 or some other approach?

After syncing with others, I think the preferred solution is solution two in #1720. ONNX Runtime now exposes the OrtEnv::CopyTensors interface, so we can use it to implement the CopyDeviceToCpu, CopyCpuToDevice, and CopyFrom functions in GenAI, as this PR does. The remaining operations fall back to the CPU; after the computation, the results are uploaded to the GPU again. We can improve those cases by integrating small ONNX models to implement them in separate PRs. I will mark this ready for review after resolving the incorrect results seen with this PR.

> Also, have you tested your WebGPU implementation on a UMA device (i.e. arm64/Apple Silicon)? If so, and you see the same issue I have, then we can discuss it. I have met with Arm reps on the KleidiAI team regarding the issue and may be able to give insights.

So what kind of issue did you encounter on Arm? And what are your insights?

@ambroser53

> So what kind of issue did you encounter on Arm? And what are your insights?

I have seen an issue whereby calls to CopyDeviceToCpu cause major slowdown, as well as visual stuttering and under-utilisation of the hardware. This is the case for my own solution (similar to your solution three in #1720) as well as for your implementation of solution one in that PR (with USE_WEBGPU=OFF).

Essentially, the execution of the model is very fast, whilst the first CopyDeviceToCpu call takes longer than the model itself. This happens whether it is transferring the logits (~262 KB for Llama-3B) or the decoded sequence (~16 KB for a 2000-token int64 sequence). I have a version that decodes entirely on the GPU with WebGPU kernels and calls CopyDeviceToCpu only once at the end of generation, and I still see the same behaviour.
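As a quick sanity check on the decoded-sequence size quoted above (a back-of-the-envelope sketch; only the token count and int64 width come from the comment):

```javascript
// Size of the decoded-sequence readback described above:
// 2000 tokens, each stored as an int64 (8 bytes).
const tokenCount = 2000;
const bytesPerToken = 8; // sizeof(int64)
const totalBytes = tokenCount * bytesPerToken;
console.log(totalBytes); // 16000 bytes, i.e. ~16 KB
```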

See the following trace to illustrate (using ORT GenAI's internal tracing system from #1524):
[Screenshot: trace captured 2025-11-07 at 11:38]
This trace was taken using the WebGPU EP with all on-device sampling/decoding and a call to GetSequence after the first token has been generated. Notice that the prefill takes less time than the CopyDeviceToCpu call, and that CopyDeviceToCpu calls recur intermittently during the generation of new tokens after the first call, though they are much smaller than the first one. Hardware utilisation is very low (near non-existent) whilst CopyDeviceToCpu is being called, yet visual stuttering still occurs whilst this takes place. I was hoping that github.com/microsoft/onnxruntime/pull/25941 would fix the visual stuttering, but it did not.

This only occurs on unified-memory systems, which in the market right now are mostly Arm (Qualcomm, Apple Silicon, etc.), but the same behaviour appeared when we tested on an AMD Zen 2 unified-memory APU. When discussing this with Arm reps who worked on the KleidiAI EP (which is entirely CPU-based), they mentioned that it might be due to the synchronisation between CPU and GPU, but admitted that if inference on Arm chips could be done on the GPU, it would likely show better performance and energy efficiency than their CPU-based KleidiAI approach.

That said, I have also seen slowdown due to this on some lower-end dedicated-GPU devices, such as the AMD Radeon RX 5700 XT.

Have you tested at all with any unified-memory devices? If you would like any more information on my findings, let me know. We're actively trying to resolve this issue.

qjia7 (Contributor, Author) commented Nov 8, 2025

> Essentially, the execution of the model is very fast, whilst the first CopyDeviceToCpu call takes longer than the model itself.

My understanding is that the CopyDeviceToCpu measurement might also include part of the model inference time. How do you ensure that all previous GPU work has completed before measuring the CopyDeviceToCpu time? If you have access to the raw Dawn/WebGPU API in your code, you could call https://www.w3.org/TR/webgpu/#dom-gpuqueue-onsubmittedworkdone before proceeding with CopyDeviceToCpu; this confirms that all prior GPU tasks have finished. Please share your findings. Thank you.
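As a sketch of that suggestion in JavaScript-flavoured WebGPU (timeReadback and readback are hypothetical names for illustration, not code from this PR or from ORT):

```javascript
// Sketch: time only the device-to-CPU copy, not leftover GPU work.
// `device` is a GPUDevice; `readback` is an async function performing
// the actual copy (e.g. a CopyDeviceToCpu wrapper). Hypothetical names.
async function timeReadback(device, readback) {
  // Resolves only once every previously submitted command buffer has
  // finished executing, so pending inference is excluded from the timing.
  await device.queue.onSubmittedWorkDone();
  const t0 = performance.now();
  const result = await readback();
  return { result, ms: performance.now() - t0 };
}
```

Without the onSubmittedWorkDone() wait, the first readback can silently absorb the cost of all GPU work queued before it, which would match the symptom described in the earlier comment.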

@qjia7 qjia7 marked this pull request as ready for review November 18, 2025 09:42
@qjia7 qjia7 changed the title [WIP] Enable graph capture for webgpu Enable graph capture for webgpu Nov 18, 2025
Copilot AI (Contributor) left a comment


Pull request overview

This PR enables graph capture support for WebGPU by implementing device-CPU memory operations using ONNX Runtime's new CopyTensors API. It also upgrades the ONNX Runtime dependency from version 1.22.0 to 1.23.0 across the entire codebase.

Key changes:

  • Implements CopyDeviceToCpu, CopyCpuToDevice, CopyFrom, and Zero methods for WebGPU using the new CopyTensors API
  • Adds graph capture detection for WebGPU in config processing and model initialization
  • Updates ONNX Runtime version to 1.23.0 across all build configurations and test requirements

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/webgpu/interface.cpp | Implements WebGPU memory operations with the CopyTensors API; manages CPU/device memory transfers with a fallback for offset copies |
| src/models/onnxruntime_api.h | Adds the CopyTensors method declaration and an OrtSyncStream wrapper for async operations |
| src/models/onnxruntime_inline.h | Implements the CopyTensors wrapper with validation |
| src/models/model.cpp | Adds a WebGPU graph-capture check to determine device memory usage for inputs |
| src/config.cpp | Implements graph-capture detection for the WebGPU provider based on the enableGraphCapture option |
| CMakeLists.txt | Suppresses MSVC warning C4819 for non-representable characters |
| test/python/*/ort/requirements.txt | Updates ONNX Runtime test dependencies to 1.23.0 |
| cmake/ortlib.cmake | Updates the default ONNX Runtime version to 1.23.0 for all execution providers |
| examples/slm_engine/build_scripts/build_deps.py | Updates ORT version references to 1.23.0 |
| .pipelines/nuget-publishing.yml | Updates pipeline ORT version parameters to 1.23.0 |

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

@qjia7 qjia7 merged commit 32d101d into main Nov 25, 2025
15 checks passed
@qjia7 qjia7 deleted the graph_capture branch November 25, 2025 05:40
kunal-vaishnavi pushed a commit that referenced this pull request Dec 5, 2025