[webgpu] Optimize string stream used in WebGPU EP by fs-eire · Pull Request #27223 · microsoft/onnxruntime

fs-eire · 2026-02-02T03:22:20Z

Description

Optimize the string stream used in WebGPU EP.

Motivation and Context

The current implementation uses an absl::OStringStream, which is faster than std::ostringstream. However, it is still too slow for generating the program cache key.

According to the profiling data, CalculateProgramCacheKey() is extremely time-consuming. It can take up to 1/3 of all CPU time inside WebGpuContext::Run().

A basic analysis shows that most of the time is spent in the std::basic_ostream operator<<() implementation, which is far slower than expected.

To optimize this, the PR uses a simplified implementation, FastOStringStream, which does not inherit from std::basic_ostream. Instead, the class implements only the operations required for generating cache keys and shader code, which removes as much unnecessary overhead as possible.
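The idea can be illustrated with a minimal sketch. This is not the PR's FastOStringStream: the class name `MiniStringStream`, its members, and the helper `BuildExampleKey` are hypothetical, and the sketch assumes C++17. It only shows the general technique of appending directly to a pre-reserved `std::string` and formatting integers with `std::to_chars`, without going through `std::basic_ostream`.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string>
#include <string_view>
#include <system_error>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for illustration only; not the class added by the PR.
class MiniStringStream {
 public:
  explicit MiniStringStream(std::size_t reserve_size = 0) { buffer_.reserve(reserve_size); }

  // Text is appended straight to the buffer: no locale, no sentry, no virtual calls.
  MiniStringStream& operator<<(std::string_view s) { buffer_.append(s); return *this; }
  MiniStringStream& operator<<(const char* s) { buffer_.append(s); return *this; }
  MiniStringStream& operator<<(char c) { buffer_.push_back(c); return *this; }

  // Integers (other than char and bool) are formatted with std::to_chars.
  template <typename T,
            std::enable_if_t<std::is_integral_v<T> && !std::is_same_v<T, char> &&
                                 !std::is_same_v<T, bool>,
                             int> = 0>
  MiniStringStream& operator<<(T value) {
    char tmp[24];  // enough for any 64-bit integer in base 10, including the sign
    const auto res = std::to_chars(tmp, tmp + sizeof(tmp), value);
    assert(res.ec == std::errc{});
    buffer_.append(tmp, static_cast<std::size_t>(res.ptr - tmp));
    return *this;
  }

  const std::string& str() const& { return buffer_; }
  std::string str() && { return std::move(buffer_); }  // move the result out, no copy

 private:
  std::string buffer_;
};

// Hypothetical usage: build a small key in one pass.
std::string BuildExampleKey() {
  MiniStringStream ss(64);
  ss << "MatMul" << ':' << 128 << 'x' << 256;
  return std::move(ss).str();
}
```

Because the class exposes only the overloads it needs, any type that is streamed through it must get an explicit overload, which is a trade-off against the generality of `std::ostream`.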

As a result, the number of CPU samples attributed to CalculateProgramCacheKey() in the same test dropped from 2555 to 176. Generation TPS of E2E model benchmark on Qwen3-0.6B increased from ~90 to ~130 on Windows11/13900k/RTX4070.

Copilot

Pull request overview

This PR refactors the WebGPU execution provider’s string-building infrastructure to use a custom lightweight FastOStringStream instead of Abseil’s OStringStream, aiming to reduce overhead when generating WGSL shader code and program cache keys.

Changes:

- Introduces FastOStringStream in string_utils.h and updates the SS/SS_GET/SS_APPEND macros to construct, append to, and extract strings using this new stream type.
- Migrates shader generation utilities (shader helper, shader variables, tensor kernels) and program cache-key construction to use the new OStringStream alias and pre-sized buffers instead of manual std::string management.
- Unifies enum-to-string streaming for WebGPU program metadata by adding OStringStream overloads and a helper macro to generate operator<< implementations for both std::ostream and OStringStream.
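The enum-streaming point can be sketched roughly as follows. Everything here is hypothetical and for illustration only: `MiniStringStream` stands in for the new stream type, and `DemoDependency`, `ToString`, and `DEMO_DEFINE_ENUM_STREAM_OP` are invented names, not the enums or the macro from program.cc. The sketch assumes C++17 and shows one macro generating `operator<<` for both `std::ostream` and a custom stream, so the two paths cannot drift apart.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical minimal stream; stands in for the custom stream type.
class MiniStringStream {
 public:
  MiniStringStream& operator<<(std::string_view s) { buffer_.append(s); return *this; }
  const std::string& str() const { return buffer_; }

 private:
  std::string buffer_;
};

// Hypothetical enum, loosely modeled on "program metadata" style flags.
enum class DemoDependency { None, Type, Rank, Shape };

std::string_view ToString(DemoDependency d) {
  switch (d) {
    case DemoDependency::None:  return "None";
    case DemoDependency::Type:  return "Type";
    case DemoDependency::Rank:  return "Rank";
    case DemoDependency::Shape: return "Shape";
  }
  assert(false && "unhandled DemoDependency value");
  return "Unknown";
}

// One macro emits both overloads from the same ToString() mapping.
#define DEMO_DEFINE_ENUM_STREAM_OP(EnumType)                        \
  std::ostream& operator<<(std::ostream& os, EnumType v) {          \
    return os << ToString(v);                                       \
  }                                                                 \
  MiniStringStream& operator<<(MiniStringStream& ss, EnumType v) {  \
    return ss << ToString(v);                                       \
  }

DEMO_DEFINE_ENUM_STREAM_OP(DemoDependency)

// Helpers that render the same value through each stream type.
std::string ViaStdStream(DemoDependency d) {
  std::ostringstream os;
  os << d;
  return os.str();
}

std::string ViaMiniStream(DemoDependency d) {
  MiniStringStream ss;
  ss << d;
  return ss.str();
}
```

Keeping the `std::ostream` overload alongside the custom one means logging and test code that still uses standard streams keeps working.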

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.


| File | Description |
| --- | --- |
| onnxruntime/core/providers/webgpu/tensor/split.cc | Switches split helper functions to accept `OStringStream&` so WGSL snippets are built with the new fast stream. |
| onnxruntime/core/providers/webgpu/tensor/resize_impl.cc | Updates resize coordinate and nearest-pixel helpers to emit WGSL into `OStringStream` instead of `std::ostream`. |
| onnxruntime/core/providers/webgpu/tensor/depth_to_space.cc | Changes permutation helper to use `OStringStream` for WGSL code generation. |
| onnxruntime/core/providers/webgpu/tensor/concat.cc | Updates concat WGSL helper functions to take `OStringStream&` for building shader snippets. |
| onnxruntime/core/providers/webgpu/string_utils.h | Replaces Abseil `OStringStream` with `FastOStringStream`, adds `std::to_chars`-based numeric streaming and centralizes `OStringStreamAppend` helpers. |
| onnxruntime/core/providers/webgpu/string_macros.h | Redefines `SS` to construct `OStringStream` with a reserve size and `SS_GET` to move out the final string from the stream. |
| onnxruntime/core/providers/webgpu/shader_variable.h | Updates internal `Impl` methods to take `OStringStream&`, aligning shader variable codegen with the new stream type. |
| onnxruntime/core/providers/webgpu/shader_variable.cc | Adapts shader variable/index helper implementations and `GetByOffsetImpl`/`SetByOffsetImpl` to use `SS`/`SS_GET` with `OStringStream`. |
| onnxruntime/core/providers/webgpu/shader_helper.h | Changes constant-writing and source-code generation APIs to work with `OStringStream` members and a non-const `GenerateSourceCode`. |
| onnxruntime/core/providers/webgpu/shader_helper.cc | Initializes `additional_implementation_ss_`/`body_ss_` with tuned reserve sizes and uses `SS_GET` to splice them into the final WGSL source. |
| onnxruntime/core/providers/webgpu/program_cache_key.cc | Builds program cache keys using `OStringStream` and `SS`/`SS_GET` instead of manual `std::string` accumulation. |
| onnxruntime/core/providers/webgpu/program.h | Includes `string_utils.h` and declares `OStringStream` streaming overloads for various program enums instead of some `std::ostream`-only overloads. |
| onnxruntime/core/providers/webgpu/program.cc | Introduces a `DEFINE_ENUM_STREAM_OP` macro to implement `operator<<` for both `std::ostream` and `OStringStream`, and ports `ProgramTensorMetadataDependency`’s printer to `OStringStream`. |
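The string_macros.h row describes a construct, append, and move-out pattern. A rough sketch of that pattern follows, assuming C++17. The macro names `DEMO_SS`, `DEMO_SS_APPEND`, and `DEMO_SS_GET`, the class `MiniStringStream`, and the function `EmitLoopHeader` are all hypothetical; the real `SS`/`SS_APPEND`/`SS_GET` definitions may differ in signature and behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical minimal stream; stands in for the custom stream type.
class MiniStringStream {
 public:
  explicit MiniStringStream(std::size_t reserve_size) { buffer_.reserve(reserve_size); }
  MiniStringStream& operator<<(std::string_view s) { buffer_.append(s); return *this; }
  // Hands the accumulated text to the caller without copying it.
  std::string Release() { return std::move(buffer_); }

 private:
  std::string buffer_;
};

// Appends every argument in order using a C++17 fold expression.
template <typename... Args>
void DemoAppend(MiniStringStream& ss, Args&&... args) {
  (ss << ... << std::forward<Args>(args));
}

// Hypothetical macros mirroring the construct / append / extract steps.
#define DEMO_SS(name, reserve_size) MiniStringStream name(reserve_size)
#define DEMO_SS_APPEND(ss, ...) DemoAppend(ss, __VA_ARGS__)
#define DEMO_SS_GET(name) (name).Release()

// Hypothetical helper that emits a WGSL-style loop header.
std::string EmitLoopHeader(std::string_view var, std::string_view bound) {
  assert(!var.empty());
  DEMO_SS(ss, 128);  // one up-front allocation sized for the expected output
  DEMO_SS_APPEND(ss, "for (var ", var, " = 0u; ", var, " < ", bound, "; ", var, "++) {\n");
  return DEMO_SS_GET(ss);  // moves the buffer out of the stream
}
```

Reserving up front avoids repeated reallocation while the snippet grows, and moving the buffer out avoids a final copy; the stream should not be reused after the move-out step.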



guschmue · 2026-02-02T17:44:50Z

I think I see an issue when using this in my local repo. Debugging ...

guschmue · 2026-02-02T18:07:56Z

The Python binding looks OK.
But genai main has issues with ort main + this PR:
RuntimeError: Specified device is not supported. Try CreateMemoryInfo_V2

guschmue · 2026-02-02T18:50:09Z

Disregard the above; all good.

xenova · 2026-02-04T15:31:51Z

> Generation TPS of E2E model benchmark on Qwen3-0.6B increased from ~90 to ~130 on Windows11/13900k/RTX4070.

Oh wow, that's a huge difference. Great work @fs-eire!

[webgpu] Optimize string stream used in WebGPU EP

d2c5941

fs-eire requested a review from Copilot February 2, 2026 03:25

Copilot started reviewing on behalf of fs-eire February 2, 2026 03:25

Copilot AI reviewed Feb 2, 2026

Comment thread onnxruntime/core/providers/webgpu/string_utils.h Outdated

fix pool

065bfae

fs-eire force-pushed the fs-eire/opt-program-cache-key branch from 81a8d8a to 065bfae Compare February 2, 2026 07:10

fs-eire added 2 commits February 1, 2026 23:13

resolve comments

cffb120

Merge remote-tracking branch 'origin/main' into fs-eire/opt-program-c…

59bbd51

…ache-key

guschmue approved these changes Feb 2, 2026

guschmue added the ep:WebGPU ort-web webgpu provider label Feb 3, 2026

fs-eire merged commit e21b948 into microsoft:main Feb 4, 2026
94 of 107 checks passed

BrewTestBot mentioned this pull request Apr 20, 2026

onnxruntime 1.25.0 Homebrew/homebrew-core#278543

Merged

fs-eire merged 4 commits into microsoft:main from fs-eire:fs-eire/opt-program-cache-key
