[webgpu] Optimize string stream used in WebGPU EP#27223

Merged
fs-eire merged 4 commits into microsoft:main from fs-eire:fs-eire/opt-program-cache-key
Feb 4, 2026
@fs-eire fs-eire commented Feb 2, 2026

Description

Optimize the string stream used in WebGPU EP.

Motivation and Context

The current implementation uses an absl::OStringStream, which is faster than std::ostringstream. However, it is still slow when used to generate the program cache key.

From the profiling data, CalculateProgramCacheKey() is extremely time consuming. It can consume up to 1/3 of all CPU time inside WebGpuContext::Run():

image

A basic analysis shows that most of the time is spent in the std::basic_ostream operator<<() implementation, which is far slower than expected.

To optimize this, the PR introduces a simplified implementation, FastOStringStream, which does not inherit from std::basic_ostream. The class implements only the operator<< overloads required for generating cache keys and shader code, which keeps unnecessary overhead to a minimum.
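To illustrate the idea, here is a minimal sketch of a string stream that does not derive from std::basic_ostream. It is not the PR's code: the class name SketchStringStream, its Take() method, and BuildExampleKey() are hypothetical names chosen for this example, and the real FastOStringStream may differ in interface and detail.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <system_error>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for the idea behind FastOStringStream (not the ORT code).
// It owns a std::string and appends to it directly: no locale, no format
// flags, no sentry objects, no virtual dispatch.
class SketchStringStream {
 public:
  explicit SketchStringStream(std::size_t reserve_size = 0) { buffer_.reserve(reserve_size); }

  // String literals and std::string both convert to std::string_view.
  SketchStringStream& operator<<(std::string_view s) {
    buffer_.append(s.data(), s.size());
    return *this;
  }

  SketchStringStream& operator<<(char c) {
    buffer_.push_back(c);
    return *this;
  }

  // Integers are formatted with std::to_chars into a small stack buffer.
  template <typename T,
            std::enable_if_t<std::is_integral_v<T> && !std::is_same_v<T, char> &&
                                 !std::is_same_v<T, bool>,
                             int> = 0>
  SketchStringStream& operator<<(T value) {
    char tmp[24];  // enough for any 64-bit integer in base 10, including the sign
    auto [ptr, ec] = std::to_chars(tmp, tmp + sizeof(tmp), value);
    assert(ec == std::errc());
    buffer_.append(tmp, static_cast<std::size_t>(ptr - tmp));
    return *this;
  }

  // Moves the accumulated string out; the stream is left in a valid but unspecified state.
  std::string Take() { return std::move(buffer_); }

 private:
  std::string buffer_;
};

// Hypothetical usage: build a small key-like string.
inline std::string BuildExampleKey() {
  SketchStringStream ss(64);
  ss << "MatMul" << '[' << 128 << 'x' << std::int64_t{-4} << ']';
  return ss.Take();
}
```

In this sketch every operator<< is a plain, non-virtual member that appends straight into the buffer, so the compiler can inline the whole chain of insertions.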

image

As a result, the number of CPU samples attributed to CalculateProgramCacheKey() in the same test dropped from 2555 to 176. Generation TPS of E2E model benchmark on Qwen3-0.6B increased from ~90 to ~130 on Windows11/13900k/RTX4070.

Copilot AI left a comment

Pull request overview

This PR refactors the WebGPU execution provider’s string-building infrastructure to use a custom lightweight FastOStringStream instead of Abseil’s OStringStream, aiming to reduce overhead when generating WGSL shader code and program cache keys.

Changes:

  • Introduces FastOStringStream in string_utils.h and updates the SS/SS_GET/SS_APPEND macros to construct, append to, and extract strings using this new stream type.
  • Migrates shader generation utilities (shader helper, shader variables, tensor kernels) and program cache-key construction to use the new OStringStream alias and pre-sized buffers instead of manual std::string management.
  • Unifies enum-to-string streaming for WebGPU program metadata by adding OStringStream overloads and a helper macro to generate operator<< implementations for both std::ostream and OStringStream.
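The construct/append/extract pattern described in the first two bullets can be sketched roughly as follows. The macro names SKETCH_SS, SKETCH_SS_APPEND and SKETCH_SS_GET, the SketchStream class, and the reserve size are all assumptions made for this example. The actual SS/SS_GET/SS_APPEND macros in string_macros.h may take different arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical minimal stream, standing in for the PR's OStringStream alias.
class SketchStream {
 public:
  explicit SketchStream(std::size_t reserve_size) { buffer_.reserve(reserve_size); }
  SketchStream& operator<<(std::string_view s) {
    buffer_.append(s);
    return *this;
  }
  // Hands the buffer to the caller without copying it.
  std::string Release() { return std::move(buffer_); }

 private:
  std::string buffer_;
};

// Appends every argument in order using a C++17 fold expression.
template <typename... Args>
void SketchAppend(SketchStream& ss, const Args&... args) {
  (ss << ... << args);
}

// Hypothetical macros mirroring the construct / append / extract steps.
#define SKETCH_SS(name, reserve_size) SketchStream name(reserve_size)
#define SKETCH_SS_APPEND(name, ...) SketchAppend(name, __VA_ARGS__)
#define SKETCH_SS_GET(name) (name).Release()

// Hypothetical usage: emit one line of shader-like text.
inline std::string BuildSnippet() {
  SKETCH_SS(ss, 128);  // pre-size the buffer to avoid repeated reallocation
  SKETCH_SS_APPEND(ss, "let x = ", "a", " + ", "b", ";");
  ss << "\n";
  std::string out = SKETCH_SS_GET(ss);  // moved out, not copied
  assert(!out.empty());
  return out;
}
```

Reserving up front and moving the result out means a typical snippet costs at most one allocation and no final copy, which is presumably the motivation for the tuned reserve sizes mentioned in the review.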

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| onnxruntime/core/providers/webgpu/tensor/split.cc | Switches split helper functions to accept OStringStream& so WGSL snippets are built with the new fast stream. |
| onnxruntime/core/providers/webgpu/tensor/resize_impl.cc | Updates resize coordinate and nearest-pixel helpers to emit WGSL into OStringStream instead of std::ostream. |
| onnxruntime/core/providers/webgpu/tensor/depth_to_space.cc | Changes permutation helper to use OStringStream for WGSL code generation. |
| onnxruntime/core/providers/webgpu/tensor/concat.cc | Updates concat WGSL helper functions to take OStringStream& for building shader snippets. |
| onnxruntime/core/providers/webgpu/string_utils.h | Replaces Abseil OStringStream with FastOStringStream, adds std::to_chars-based numeric streaming and centralizes OStringStreamAppend helpers. |
| onnxruntime/core/providers/webgpu/string_macros.h | Redefines SS to construct OStringStream with a reserve size and SS_GET to move out the final string from the stream. |
| onnxruntime/core/providers/webgpu/shader_variable.h | Updates internal Impl methods to take OStringStream&, aligning shader variable codegen with the new stream type. |
| onnxruntime/core/providers/webgpu/shader_variable.cc | Adapts shader variable/index helper implementations and GetByOffsetImpl/SetByOffsetImpl to use SS/SS_GET with OStringStream. |
| onnxruntime/core/providers/webgpu/shader_helper.h | Changes constant-writing and source-code generation APIs to work with OStringStream members and a non-const GenerateSourceCode. |
| onnxruntime/core/providers/webgpu/shader_helper.cc | Initializes additional_implementation_ss_/body_ss_ with tuned reserve sizes and uses SS_GET to splice them into the final WGSL source. |
| onnxruntime/core/providers/webgpu/program_cache_key.cc | Builds program cache keys using OStringStream and SS/SS_GET instead of manual std::string accumulation. |
| onnxruntime/core/providers/webgpu/program.h | Includes string_utils.h and declares OStringStream streaming overloads for various program enums instead of some std::ostream-only overloads. |
| onnxruntime/core/providers/webgpu/program.cc | Introduces a DEFINE_ENUM_STREAM_OP macro to implement operator<< for both std::ostream and OStringStream, and ports ProgramTensorMetadataDependency’s printer to OStringStream. |
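As a rough illustration of the program.cc change, a single macro can generate operator<< for an enum on both std::ostream and a custom stream type. Everything named below (SketchStream, SketchDependency, ToName, SKETCH_DEFINE_ENUM_STREAM_OP, FormatBoth) is hypothetical and written for this example. The real DEFINE_ENUM_STREAM_OP macro may have a different parameter list and body.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical custom stream that is unrelated to std::ostream.
struct SketchStream {
  std::string buffer;
  SketchStream& operator<<(std::string_view s) {
    buffer.append(s);
    return *this;
  }
};

// Hypothetical enum, for illustration only.
enum class SketchDependency { None, Type, Rank, Shape };

constexpr std::string_view ToName(SketchDependency d) {
  switch (d) {
    case SketchDependency::None: return "None";
    case SketchDependency::Type: return "Type";
    case SketchDependency::Rank: return "Rank";
    case SketchDependency::Shape: return "Shape";
  }
  return "Unknown";
}

// One macro invocation emits both overloads, so the two stream types
// always print an enum value the same way.
#define SKETCH_DEFINE_ENUM_STREAM_OP(EnumType, ToNameFn)                \
  inline std::ostream& operator<<(std::ostream& os, EnumType v) {       \
    return os << ToNameFn(v);                                           \
  }                                                                     \
  inline SketchStream& operator<<(SketchStream& ss, EnumType v) {       \
    return ss << ToNameFn(v);                                           \
  }

SKETCH_DEFINE_ENUM_STREAM_OP(SketchDependency, ToName)

// Hypothetical usage: stream the same value through both stream types.
inline std::string FormatBoth(SketchDependency d) {
  std::ostringstream std_stream;
  std_stream << d;
  SketchStream fast;
  fast << d;
  assert(std_stream.str() == fast.buffer);  // both overloads share ToName
  return std_stream.str() + "|" + fast.buffer;
}
```

Keeping the std::ostream overload alongside the custom one lets logging and test code that still uses standard streams print the same enums without a separate implementation.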

Comment thread onnxruntime/core/providers/webgpu/string_utils.h Outdated
@fs-eire fs-eire force-pushed the fs-eire/opt-program-cache-key branch from 81a8d8a to 065bfae on February 2, 2026 07:10
guschmue commented Feb 2, 2026

I think I see some issue when using in my local repo. Debugging ...

guschmue commented Feb 2, 2026

python binding looks ok.
But genai main has issues with ort main + this PR:
RuntimeError: Specified device is not supported. Try CreateMemoryInfo_V2

guschmue commented Feb 2, 2026

void - all good.

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Feb 3, 2026
@fs-eire fs-eire merged commit e21b948 into microsoft:main Feb 4, 2026
94 of 107 checks passed
xenova commented Feb 4, 2026

> Generation TPS of E2E model benchmark on Qwen3-0.6B increased from ~90 to ~130 on Windows11/13900k/RTX4070.

Oh wow, that's a huge difference. Great work @fs-eire!
