[webgpu] Optimize string stream used in WebGPU EP #27223
fs-eire merged 4 commits into microsoft:main
Pull request overview
This PR refactors the WebGPU execution provider’s string-building infrastructure to use a custom lightweight FastOStringStream instead of Abseil’s OStringStream, aiming to reduce overhead when generating WGSL shader code and program cache keys.
Changes:
- Introduces `FastOStringStream` in `string_utils.h` and updates the `SS`/`SS_GET`/`SS_APPEND` macros to construct, append to, and extract strings using this new stream type.
- Migrates shader generation utilities (shader helper, shader variables, tensor kernels) and program cache-key construction to use the new `OStringStream` alias and pre-sized buffers instead of manual `std::string` management.
- Unifies enum-to-string streaming for WebGPU program metadata by adding `OStringStream` overloads and a helper macro to generate `operator<<` implementations for both `std::ostream` and `OStringStream`.
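The third change, one macro producing `operator<<` for two stream types, could look roughly like the sketch below. Everything here is a stand-in invented for illustration: `EnumDemoStream`, `DemoDependency`, the name table, and `DEMO_DEFINE_ENUM_STREAM_OP` are assumptions, not the PR's `DEFINE_ENUM_STREAM_OP` or its enums.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical stand-in for a lightweight stream that is NOT a std::ostream.
struct EnumDemoStream {
  std::string text;
  EnumDemoStream& operator<<(std::string_view piece) {
    text.append(piece);
    return *this;
  }
};

// Hypothetical enum and name table; the real program enums may differ.
enum class DemoDependency { None, Type, Rank };
constexpr std::string_view kDemoDependencyNames[] = {"None", "Type", "Rank"};

// One macro expands to both overloads so the two printers cannot drift apart.
#define DEMO_DEFINE_ENUM_STREAM_OP(EnumType, names)                  \
  std::ostream& operator<<(std::ostream& os, EnumType value) {       \
    return os << names[static_cast<std::size_t>(value)];             \
  }                                                                  \
  EnumDemoStream& operator<<(EnumDemoStream& os, EnumType value) {   \
    return os << names[static_cast<std::size_t>(value)];             \
  }

DEMO_DEFINE_ENUM_STREAM_OP(DemoDependency, kDemoDependencyNames)

// Small helpers that exercise each overload.
std::string PrintViaStd(DemoDependency d) {
  std::ostringstream oss;
  oss << d;
  return oss.str();
}

std::string PrintViaDemo(DemoDependency d) {
  EnumDemoStream s;
  s << d;
  return s.text;
}
```

Both helpers return the same text for a given enumerator, which is the point of generating the two overloads from a single definition.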
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| onnxruntime/core/providers/webgpu/tensor/split.cc | Switches split helper functions to accept OStringStream& so WGSL snippets are built with the new fast stream. |
| onnxruntime/core/providers/webgpu/tensor/resize_impl.cc | Updates resize coordinate and nearest-pixel helpers to emit WGSL into OStringStream instead of std::ostream. |
| onnxruntime/core/providers/webgpu/tensor/depth_to_space.cc | Changes permutation helper to use OStringStream for WGSL code generation. |
| onnxruntime/core/providers/webgpu/tensor/concat.cc | Updates concat WGSL helper functions to take OStringStream& for building shader snippets. |
| onnxruntime/core/providers/webgpu/string_utils.h | Replaces Abseil OStringStream with FastOStringStream, adds std::to_chars-based numeric streaming and centralizes OStringStreamAppend helpers. |
| onnxruntime/core/providers/webgpu/string_macros.h | Redefines SS to construct OStringStream with a reserve size and SS_GET to move out the final string from the stream. |
| onnxruntime/core/providers/webgpu/shader_variable.h | Updates internal Impl methods to take OStringStream&, aligning shader variable codegen with the new stream type. |
| onnxruntime/core/providers/webgpu/shader_variable.cc | Adapts shader variable/index helper implementations and GetByOffsetImpl/SetByOffsetImpl to use SS/SS_GET with OStringStream. |
| onnxruntime/core/providers/webgpu/shader_helper.h | Changes constant-writing and source-code generation APIs to work with OStringStream members and a non-const GenerateSourceCode. |
| onnxruntime/core/providers/webgpu/shader_helper.cc | Initializes additional_implementation_ss_/body_ss_ with tuned reserve sizes and uses SS_GET to splice them into the final WGSL source. |
| onnxruntime/core/providers/webgpu/program_cache_key.cc | Builds program cache keys using OStringStream and SS/SS_GET instead of manual std::string accumulation. |
| onnxruntime/core/providers/webgpu/program.h | Includes string_utils.h and declares OStringStream streaming overloads for various program enums instead of some std::ostream-only overloads. |
| onnxruntime/core/providers/webgpu/program.cc | Introduces a DEFINE_ENUM_STREAM_OP macro to implement operator<< for both std::ostream and OStringStream, and ports ProgramTensorMetadataDependency’s printer to OStringStream. |
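The `string_macros.h` row describes `SS` as constructing a stream with a reserve size and `SS_GET` as moving the final string out. A minimal sketch of that pattern follows, assuming a two-argument construction macro; the names `DEMO_SS`, `DEMO_SS_GET`, `MacroDemoStream`, and `EmitLoad` are hypothetical, and the actual macro signatures in the PR are not shown in this page.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in stream: reserves capacity once, then only appends.
struct MacroDemoStream {
  explicit MacroDemoStream(std::size_t reserve_size) { text.reserve(reserve_size); }
  MacroDemoStream& operator<<(std::string_view piece) {
    text.append(piece);
    return *this;
  }
  std::string text;
};

// Assumed shapes for illustration only.
#define DEMO_SS(name, reserve_size) MacroDemoStream name(reserve_size)
#define DEMO_SS_GET(name) std::move((name).text)  // moves the buffer out, no copy

// Builds a one-line WGSL-like snippet (the snippet format is made up).
std::string EmitLoad(std::string_view var, std::string_view index) {
  DEMO_SS(ss, 32);
  ss << "let v = " << var << "[" << index << "];";
  return DEMO_SS_GET(ss);
}
```

Reserving up front avoids repeated reallocation while the snippet grows, and moving the buffer out avoids a final copy; the stream should not be reused after `DEMO_SS_GET`.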
I think I see an issue when using this in my local repo. Debugging ...
python binding looks ok.
void - all good.
Oh wow, that's a huge difference. Great work @fs-eire! |
Description
Optimize the string stream used in WebGPU EP.
Motivation and Context
The current implementation uses `absl::OStringStream`, which is faster than `std::ostringstream`. However, it is still slow when generating the program cache key.
From the profiling data, `CalculateProgramCacheKey()` is extremely time-consuming. It can consume up to 1/3 of all CPU time inside `WebGpuContext::Run()`.
Basic analysis shows that most of the time is spent in the `std::basic_ostream` `operator<<()` implementation, which is much slower than expected.
To optimize this, the PR uses a simplified implementation, `FastOStringStream`, which does not inherit from `std::basic_ostream`. Instead, the class implements only the minimum needed for generating cache keys and shader code, to reduce unnecessary overhead as much as possible.
As a result, the CPU sample count for `CalculateProgramCacheKey()` in the same test dropped from 2555 to 176. Generation TPS of the E2E model benchmark on Qwen3-0.6B increased from ~90 to ~130 on Windows 11/13900K/RTX 4070.
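To make the idea concrete, here is a minimal sketch of a stream that does not derive from `std::basic_ostream` and formats integers with `std::to_chars`. It is not the PR's `FastOStringStream`: the class name `SketchStringStream`, the function `BuildDemoKey`, and the key format are assumptions made up for this example.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string>
#include <string_view>
#include <system_error>
#include <type_traits>
#include <utility>

// Hypothetical sketch: a plain class wrapping std::string, with no locale,
// no virtual calls, and no stream state.
class SketchStringStream {
 public:
  explicit SketchStringStream(std::size_t reserve_size = 0) { buffer_.reserve(reserve_size); }

  SketchStringStream& operator<<(std::string_view text) {
    buffer_.append(text);
    return *this;
  }

  SketchStringStream& operator<<(char c) {
    buffer_.push_back(c);
    return *this;
  }

  // Integers (excluding char and bool) go through std::to_chars.
  template <typename T,
            std::enable_if_t<std::is_integral_v<T> && !std::is_same_v<T, char> &&
                                 !std::is_same_v<T, bool>,
                             int> = 0>
  SketchStringStream& operator<<(T value) {
    char digits[24];  // enough for a 64-bit value plus sign
    auto result = std::to_chars(digits, digits + sizeof(digits), value);
    assert(result.ec == std::errc());
    buffer_.append(digits, static_cast<std::size_t>(result.ptr - digits));
    return *this;
  }

  // Rvalue overload hands the buffer over without copying.
  std::string str() && { return std::move(buffer_); }
  const std::string& str() const& { return buffer_; }

 private:
  std::string buffer_;
};

// Builds a made-up cache-key-like string; the real key layout is not shown here.
std::string BuildDemoKey(std::string_view name, std::size_t rank, int components) {
  SketchStringStream ss(64);
  ss << name << '[' << rank << ';' << components << ']';
  return std::move(ss).str();
}
```

With these definitions, `BuildDemoKey("Concat", 4, -2)` yields `Concat[4;-2]`. The sketch only covers strings, `char`, and integers; a fuller version would need further overloads (for example floating point) as the calling code requires them.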