
Conversation

@qwu16
Contributor

@qwu16 qwu16 commented Jul 30, 2025

Description

Cache opSupportLimits in the WebNN backend instead of querying it from the lower layer each time, to improve performance. Also update the trace event in data transfer.

Motivation and Context

In the current implementation, every time the ensureTensor API is called to check an input/output tensor, the MLContext.opSupportLimits API is called to query the supported op capabilities from Chromium, and this call becomes a hotspot. Calling the API once when the session is created and caching the result avoids the frequent lower-layer API calls.
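
A minimal TypeScript sketch of the approach described above; the map name `opSupportLimitsBySessionId` and the simplified types are illustrative assumptions, not the exact code merged in this PR:

```typescript
// Sketch only; types and field names are simplified assumptions.
type MLOpSupportLimits = Record<string, unknown>;

interface MLContext {
  opSupportLimits(): MLOpSupportLimits;
}

class WebNNBackend {
  private mlContextBySessionId = new Map<number, MLContext>();
  // Cache the opSupportLimits result per session so later tensor checks
  // never have to call MLContext.opSupportLimits() again.
  private opSupportLimitsBySessionId = new Map<number, MLOpSupportLimits>();

  public registerMLContext(sessionId: number, mlContext: MLContext): void {
    this.mlContextBySessionId.set(sessionId, mlContext);
    // Query once at session creation; the cached object is reused on every
    // subsequent input/output tensor check.
    this.opSupportLimitsBySessionId.set(sessionId, mlContext.opSupportLimits());
  }

  public getOpSupportLimits(sessionId: number): MLOpSupportLimits | undefined {
    return this.opSupportLimitsBySessionId.get(sessionId);
  }
}
```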

@qwu16
Contributor Author

qwu16 commented Jul 30, 2025

@Honry PTAL, thanks!

Contributor

@Honry Honry left a comment


LGTM, thanks!

@qwu16
Contributor Author

qwu16 commented Jul 30, 2025

@fdwr PTAL, thanks~

@fdwr
Contributor

fdwr commented Jul 31, 2025

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline,Windows GPU WebGPU CI Pipeline,Windows OpenVINO CI Pipeline

@fdwr
Contributor

fdwr commented Jul 31, 2025

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@fdwr
Contributor

fdwr commented Jul 31, 2025

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI

@fdwr
Contributor

fdwr commented Jul 31, 2025

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@fdwr
Contributor

fdwr commented Jul 31, 2025

/azp run Test Linux CUDA x64 Release,Test Linux TensorRT x64 Release,web_Debug / build_onnxruntime_web,web_Release / build_onnxruntime_web

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@azure-pipelines

No pipelines are associated with this pull request.

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@fdwr
Contributor

fdwr commented Jul 31, 2025

In the current implementation, every time the ensureTensor API is called to check an input/output tensor, the MLContext.opSupportLimits API is called to query the supported op capabilities from Chromium, and this call becomes a hotspot.

Are there any hotspots in the Chromium implementation that warrant improvement (reducing the need for this cache)? It appears the code after still goes through an MLOpSupportLimits, but just avoids the call to this.mlContextBySessionId.get(sessionId).opSupportLimits().

Contributor

@fdwr fdwr left a comment


👍

@qwu16
Contributor Author

qwu16 commented Jul 31, 2025

In the current implementation, every time the ensureTensor API is called to check an input/output tensor, the MLContext.opSupportLimits API is called to query the supported op capabilities from Chromium, and this call becomes a hotspot.

Are there any hotspots in the Chromium implementation that warrant improvement (reducing the need for this cache)? It appears the code after still goes through an MLOpSupportLimits, but just avoids the call to this.mlContextBySessionId.get(sessionId).opSupportLimits().

In the current Chromium implementation, the queried properties for the different backends are already cached in MLContext, so calling MLContext.opSupportLimits does not incur extra IPC between the renderer and GPU processes. However, in https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/modules/ml/ml_context.cc;l=241, hundreds of op limit properties are set before the dictionary is returned, and that takes some time. This PR caches the MLOpSupportLimits object and avoids setting those hundreds of op limit properties on every call to MLContext::opSupportLimits.
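
For illustration, a before/after sketch of the hot-path check; the `input.dataTypes` shape follows the WebNN MLOpSupportLimits dictionary, and the function names and simplified types are hypothetical:

```typescript
// Illustrative only; not the exact ensureTensor code in this PR.
interface TensorLimits {
  dataTypes: string[];
}

interface OpSupportLimits {
  input: TensorLimits;
  output: TensorLimits;
}

// Before: every check re-runs opSupportLimits(), which populates hundreds of
// per-op properties in Blink before returning the dictionary.
function isDataTypeSupportedBefore(
  context: { opSupportLimits(): OpSupportLimits },
  dataType: string,
): boolean {
  return context.opSupportLimits().input.dataTypes.includes(dataType);
}

// After: the dictionary is built once per session and cached, so the check is
// only a property access plus an array lookup.
function isDataTypeSupportedAfter(cachedLimits: OpSupportLimits, dataType: string): boolean {
  return cachedLimits.input.dataTypes.includes(dataType);
}
```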

@fdwr fdwr merged commit a7bc727 into microsoft:main Jul 31, 2025
98 of 101 checks passed
@qwu16 qwu16 deleted the oplimit branch August 1, 2025 00:09
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025 (microsoft#25589)
