
Add enable_profiling in runoptions #26846

Merged
xiaofeihan1 merged 21 commits into main from xiaofeihan/runoptions_profiling
Jan 23, 2026

@xiaofeihan1 xiaofeihan1 commented Dec 22, 2025

Description

Support run-level profiling

This PR adds support for profiling individual Run executions, similar to session-level profiling. Developers can enable run-level profiling by setting enable_profiling and profile_file_prefix in RunOptions. Once the run completes, a JSON profiling file will be saved using profile_file_prefix + timestamp.
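As a rough illustration of the `profile_file_prefix` + timestamp naming, the sketch below builds a file name with millisecond resolution. The function name `MakeProfileFileName`, the `_` separator, the exact timestamp layout, and the use of UTC are assumptions made for this example; the PR description does not spell out the real format.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <ctime>
#include <string>

// Hypothetical helper: prefix + "_" + UTC timestamp with milliseconds + ".json".
std::string MakeProfileFileName(const std::string& prefix,
                                std::chrono::system_clock::time_point now) {
  using namespace std::chrono;
  const auto since_epoch = now.time_since_epoch();
  const auto secs = duration_cast<seconds>(since_epoch);
  const auto millis = duration_cast<milliseconds>(since_epoch - secs).count();
  assert(millis >= 0 && millis < 1000);  // this sketch assumes times at or after the epoch

  const std::time_t t = static_cast<std::time_t>(secs.count());
  const std::tm* utc = std::gmtime(&t);  // UTC keeps the example deterministic
  assert(utc != nullptr);

  char date[32];
  std::strftime(date, sizeof(date), "%Y-%m-%d_%H-%M-%S", utc);
  char ms[8];
  std::snprintf(ms, sizeof(ms), ".%03d", static_cast<int>(millis));
  return prefix + "_" + date + ms + ".json";
}
```

Two runs that finish within the same second still get different names as long as their millisecond parts differ.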

Key Changes

  1. Introduced a local variable run_profiler in InferenceSession::Run, which is destroyed after the run completes. Using a dedicated profiler per run ensures that profiling data is isolated and prevents interleaving or corruption across runs.
  2. To maintain accurate execution time when both session-level and run-level profiling are enabled, overloaded Start and EndTimeAndRecordEvent functions have been added. These allow the caller to provide timestamps instead of relying on std::chrono::high_resolution_clock::now(), avoiding potential timing inaccuracies.
  3. Added a TLS variable tls_run_profiler_ to support run-level profiling with WebGPU Execution Provider (EP). This ensures that when multiple threads enable run-level profiling, each thread logs only to its own WebGPU profiler, keeping thread-specific data isolated.
  4. Use HH:MM:SS.mm instead of HH:MM:SS in the JSON filename to prevent conflicts when profiling multiple consecutive runs.
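Key change 2 can be pictured with a toy profiler: the clock is read once per kernel and the same pair of timestamps is handed to every enabled profiler, so the session-level and run-level traces report identical durations. `ToyProfiler`, `Event`, and `RecordKernel` are hypothetical stand-ins, not the ONNX Runtime classes; only the method name `EndTimeAndRecordEvent` comes from the PR, and its signature here is a guess.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;
using TimePoint = Clock::time_point;

struct Event {
  std::string name;
  std::int64_t dur_us;
};

// Toy profiler that accepts caller-provided timestamps instead of reading the clock itself.
class ToyProfiler {
 public:
  void EndTimeAndRecordEvent(std::string name, TimePoint start, TimePoint end) {
    assert(end >= start);
    const auto dur = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    events_.push_back(Event{std::move(name), static_cast<std::int64_t>(dur.count())});
  }
  const std::vector<Event>& events() const { return events_; }

 private:
  std::vector<Event> events_;
};

// Records one kernel into every enabled profiler using a single pair of clock readings.
void RecordKernel(const std::string& name, TimePoint start, TimePoint end,
                  ToyProfiler* session_profiler, ToyProfiler* run_profiler) {
  if (session_profiler != nullptr) session_profiler->EndTimeAndRecordEvent(name, start, end);
  if (run_profiler != nullptr) run_profiler->EndTimeAndRecordEvent(name, start, end);
}
```

If each profiler called `now()` on its own, the second reading would include the time spent recording into the first profiler, which is the kind of inaccuracy the overloads are meant to avoid.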

Motivation and Context

Previously, profiling was available only at the session level. Developers sometimes want to profile a specific run, which this PR makes possible.

Some details

When profiling is enabled via RunOptions, it should ideally collect two types of events:

  1. Profiler events
    Used to calculate the CPU execution time of each operator.
  2. Execution Provider (EP) profiler events
    Used to measure GPU kernel execution time.

Unlike session-level profiling, run-level profiling must collect events correctly when multiple threads run concurrently.

For 1, this is straightforward (see sequential_executor.cc). We use a thread-local storage (TLS) variable, RunLevelState (defined in profiler.h), to maintain run-level profiling state for each thread.

For 2, each Execution Provider (EP) has its own profiler implementation, and each EP must ensure correct behavior under run-level profiling. This PR ensures that the WebGPU profiler works correctly with run-level profiling.
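The thread-local approach can be sketched as follows. This is a simplified model, not the PR's code: `RunProfiler`, `tls_run_profiler`, `LogKernel`, `Run`, and `RunConcurrently` are hypothetical names, and the real implementation keeps more state than a single pointer.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

struct RunProfiler {
  std::vector<std::string> events;
};

// Hypothetical per-thread slot: each thread sees its own pointer.
thread_local RunProfiler* tls_run_profiler = nullptr;

// Called from "kernel" code; logs only to the profiler of the calling thread.
void LogKernel(const std::string& name) {
  if (tls_run_profiler != nullptr) tls_run_profiler->events.push_back(name);
}

// Simulates one Run call with its own local profiler, destroyed when the run ends.
std::vector<std::string> Run(bool enable_profiling, const std::string& tag) {
  assert(tls_run_profiler == nullptr);  // this sketch does not support nested runs
  RunProfiler local;
  tls_run_profiler = enable_profiling ? &local : nullptr;
  LogKernel(tag + ":MatMul");
  LogKernel(tag + ":Add");
  tls_run_profiler = nullptr;
  return local.events;
}

// Three concurrent runs on the same "session": t1 and t3 profile, t2 does not.
std::vector<std::vector<std::string>> RunConcurrently() {
  std::vector<std::vector<std::string>> results(3);
  std::thread t1([&] { results[0] = Run(true, "t1"); });
  std::thread t2([&] { results[1] = Run(false, "t2"); });
  std::thread t3([&] { results[2] = Run(true, "t3"); });
  t1.join();
  t2.join();
  t3.join();
  return results;
}
```

Each thread writes only to the profiler it installed, so the traces for t1 and t3 never contain each other's events and t2 produces none.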

Test Cases

| Scenario | Example | Expected Result |
|---------|---------|-----------------|
| Concurrent runs on the same session with different run-level profiling settings | t1: `sess1.Run({ enable_profiling: true })`<br>t2: `sess1.Run({ enable_profiling: false })`<br>t3: `sess1.Run({ enable_profiling: true })` | Two trace JSON files are generated: one for `t1` and one for `t3`. |
| Run-level profiling enabled together with session-level profiling | `sess1 = OrtSession({ enable_profiling: true })`<br>`sess1.Run({ enable_profiling: true })` | Two trace JSON files are generated: one corresponding to session-level profiling and one corresponding to run-level profiling. |

@xiaofeihan1 xiaofeihan1 changed the title Add enable_profiling in runoptions [Local variable]Add enable_profiling in runoptions Dec 23, 2025
@xiaofeihan1 xiaofeihan1 requested a review from Copilot January 7, 2026 06:42

Copilot AI left a comment

Pull request overview

This PR adds per-run profiling capability to ONNX Runtime by introducing enable_profiling and profile_file_prefix options to RunOptions. This allows users to enable profiling for individual inference runs independent of session-level profiling, providing more granular control over performance analysis.

Key changes:

  • Added enable_profiling and profile_file_prefix fields to RunOptions structure
  • Modified execution providers to accept an enable_profiling parameter in GetProfiler() method
  • Enhanced timestamp formatting to include milliseconds for more precise profiling file naming

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| `include/onnxruntime/core/framework/run_options.h` | Added enable_profiling flag and profile_file_prefix configuration |
| `onnxruntime/python/onnxruntime_pybind_state.cc` | Exposed new profiling options to Python API |
| `onnxruntime/core/session/inference_session.cc` | Implemented run-level profiler creation, initialization, and lifecycle management |
| `include/onnxruntime/core/framework/execution_provider.h` | Updated GetProfiler signature to accept enable_profiling parameter |
| `onnxruntime/core/providers/cuda/cuda_execution_provider.h/cc` | Updated GetProfiler implementation for CUDA provider |
| `onnxruntime/core/providers/vitisai/vitisai_execution_provider.h/cc` | Updated GetProfiler implementation for VitisAI provider |
| `onnxruntime/core/providers/webgpu/webgpu_execution_provider.h/cc` | Implemented session vs run profiler separation using thread_local storage |
| `onnxruntime/core/providers/webgpu/webgpu_context.h/cc` | Added profiler registration/unregistration and multi-profiler event collection |
| `onnxruntime/core/providers/webgpu/webgpu_profiler.cc` | Updated to register/unregister with context and handle event collection |
| `onnxruntime/core/common/profiler.h/cc` | Added overloaded Start and EndTimeAndRecordEvent methods accepting explicit timestamps |
| `onnxruntime/core/framework/utils.h/cc` | Propagated run_profiler parameter through execution graph functions |
| `onnxruntime/core/framework/sequential_executor.h/cc` | Added run_profiler support in SessionScope and KernelScope for dual profiling |

@xiaofeihan1 xiaofeihan1 force-pushed the xiaofeihan/runoptions_profiling branch 3 times, most recently from 828938d to 0022eb0 Compare January 8, 2026 06:53
@xiaofeihan1 xiaofeihan1 changed the title [Local variable]Add enable_profiling in runoptions Add enable_profiling in runoptions Jan 8, 2026
@xiaofeihan1 xiaofeihan1 force-pushed the xiaofeihan/runoptions_profiling branch from 1fa65ff to 978b59a Compare January 8, 2026 13:45
keep same data

impl

disable profiling for graph capture stage
@xiaofeihan1 xiaofeihan1 force-pushed the xiaofeihan/runoptions_profiling branch from 978b59a to c48efdb Compare January 8, 2026 13:48

@yuslepukhin yuslepukhin left a comment

🕐

@xiaofeihan1 xiaofeihan1 force-pushed the xiaofeihan/runoptions_profiling branch from aa5c138 to 5f0a9aa Compare January 14, 2026 15:25
yuslepukhin previously approved these changes Jan 22, 2026

@yuslepukhin yuslepukhin left a comment

:shipit:


@yuslepukhin yuslepukhin left a comment

:shipit:

@xiaofeihan1 xiaofeihan1 merged commit 879ec03 into main Jan 23, 2026
97 of 101 checks passed
@xiaofeihan1 xiaofeihan1 deleted the xiaofeihan/runoptions_profiling branch January 23, 2026 05:15
quic-muchhsu pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Feb 27, 2026
* Bump version to 1.25.0 (microsoft#27048)

Increase version number to 1.25.0.

* [webgpu] Optimize generic 4D Transpose using OIHW2OHWI Program (microsoft#26942)

### Description
This PR migrates the `OIHW2OHWI` Program from `Im2ColMatMul` to the
`Transpose` operator. By centralizing this logic, we leverage the
specialized shader to optimize generic 4D transpositions (specifically
the {0, 2, 3, 1} permutation pattern) while reducing code duplication.

While this shader is capable of supporting 2D/3D transpositions, those
optimizations are reserved for follow-up PRs.
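For reference, the {0, 2, 3, 1} permutation mentioned above moves the second axis to the end (OIHW to OHWI, or equivalently NCHW to NHWC). The CPU sketch below only shows the index mapping; it is not the WebGPU shader, and the function name `Transpose0231` is made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Transposes a dense row-major 4D tensor with permutation {0, 2, 3, 1}.
// Input dims are {N, C, H, W}; output dims are {N, H, W, C}.
std::vector<float> Transpose0231(const std::vector<float>& in,
                                 const std::array<std::int64_t, 4>& dims) {
  const std::int64_t N = dims[0], C = dims[1], H = dims[2], W = dims[3];
  assert(static_cast<std::int64_t>(in.size()) == N * C * H * W);
  std::vector<float> out(in.size());
  for (std::int64_t n = 0; n < N; ++n) {
    for (std::int64_t c = 0; c < C; ++c) {
      for (std::int64_t h = 0; h < H; ++h) {
        for (std::int64_t w = 0; w < W; ++w) {
          const std::int64_t src = ((n * C + c) * H + h) * W + w;
          const std::int64_t dst = ((n * H + h) * W + w) * C + c;
          out[static_cast<std::size_t>(dst)] = in[static_cast<std::size_t>(src)];
        }
      }
    }
  }
  return out;
}
```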

### Motivation and Context
See above.

* Fix failing mainline build on Arm64 linux (microsoft#27101)

### Description
`sconv.h` was renamed to `sconv_nchwc_kernel_neon.h` in microsoft#26688 but the
reference to the old name was still in a new file added at around the
same time in microsoft#26838.
The CI doesn't include building for this configuration yet - it will be
added after the 1.24 release.



### Motivation and Context
Fixes failing mainline build on Arm64 linux when
`--enable_arm_neon_nchwc` is supplied.


### Testing
This now passes on Arm64 linux
`./build.sh --config Release --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync --skip_tests
--enable_pybind --build_wheel --enable_arm_neon_nchwc`

* Add dedicated API to support extracting compatibility string from model metadata (microsoft#27015)

### Description
This change proposes a new helper ORT API for callers that need to
extract the model compatibility string from a precompiled model.

### Motivation and Context
See microsoft#25749 for more background on the model compatibility concept and
infrastructure; microsoft#25841 provides a related helper API for an application
to call to do a validation check using the compatibility info string.
However, there is no direct way to get to the model metadata without
creating a session (which some callers may prefer to avoid) or by taking
a dependency on a separate library to parse the model's protobuf (which
again callers may prefer to avoid).

This change proposes a separate helper API which can be used to retrieve
the compatibility info string, thereby avoiding session creation or an
external dependency. This does incur some amount of redundant work in
that the model protobuf will be parsed again during session creation-
but for some callers, this tradeoff may be acceptable.

---------

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: adrastogi <8368026+adrastogi@users.noreply.github.com>

* Move model compatibility checks ahead of session initialization (microsoft#27037)

### Description
The current infrastructure for validating compatibility of a precompiled
model does the check after session initialization occurs, which turns
out to be quite costly. The check should ideally happen beforehand, to
short-circuit those expensive operations.


### Motivation and Context
This change will make it more tractable for applications to rely on the
existing session machinery to check compatibility of any of their
models.

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>

* [test] refactor common test target settings (microsoft#27013)

### Description
- factor duplicated test target settings into helper functions
- reuse helpers for onnxruntime_test_all and onnxruntime_provider_test
- keep target-specific settings intact


### Motivation and Context

There is some duplicated code in onnxruntime_unittests. Originally there
was only one unit test target, `onnxruntime_test_all`; it was later split
into two, `onnxruntime_test_all` and `onnxruntime_provider_test`, and some
of the lines that set up build flags were simply copied. This creates a
risk of inconsistency in the future.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix OrtApi static_assert violation, add instructions for updating additional API structs. (microsoft#27100)

### Description

Fix OrtApi 1.24 API size static_assert violation triggered by addition
of new APIs in
microsoft@f481b17.

Add version update instructions for updating additional API structs.

### Motivation and Context

Fix build on main.

Add info about other API structs to version update instructions.

* Linux device discovery for TRT-RTX Ep (microsoft#26210)

### Description

This change adds PCIe bus_id to the properties detected
during Linux device discovery.

This property is used to enable device discovery on Linux for the
TRT-RTX execution provider.

### Motivation and Context
I want to use device discovery for TRT-EP on Linux as well.


These changes have already been tested with the newly added inference
samples
microsoft/onnxruntime-inference-examples#529.

@gedoensmax for visibility

* Add absl cuda warnings patch (microsoft#27096)

Some PRs that use core/common/inlined_containers.h can cause failures in
the CUDA CI pipeline.

```
E:\_work\_temp\build\RelWithDebInfo\vcpkg_installed\x64-windows-static-md\include\absl/hash/internal/hash.h(481): error #68-D: integer conversion resulted in a change of sign [E:\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
          sizeof(T) == -1,
                       ^
  Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

E:\_work\_temp\build\RelWithDebInfo\vcpkg_installed\x64-windows-static-md\include\absl/hash/hash.h(337): error #549-D: variable "s" is used before its value is set [E:\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
        return s;
               ^
E:\_work\_temp\build\RelWithDebInfo\vcpkg_installed\x64-windows-static-md\include\absl/container/internal/raw_hash_set.h(468): error #69-D: integer conversion resulted in truncation [E:\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
          static_cast<uint16_t>(reinterpret_cast<uintptr_t>(&seed));
                      ^
  3 errors detected in the compilation of "E:/_work/onnxruntime/onnxruntime/onnxruntime/contrib_ops/cuda/sparse/block_mask.cu".
```

This change adds a patch to Abseil to mitigate those failures.


This solution has been verified to be effective in PR
microsoft#27087.

* [webgpu] Support Identity (microsoft#27067)

### Description



### Motivation and Context

* Add enable_profiling in runoptions (microsoft#26846)

### Description
Support run-level profiling

This PR adds support for profiling individual Run executions, similar to
session-level profiling. Developers can enable run-level profiling by
setting `enable_profiling` and `profile_file_prefix` in RunOptions. Once
the run completes, a JSON profiling file will be saved using
profile_file_prefix + timestamp.

<img width="514" height="281" alt="png (2)"
src="https://github.com/user-attachments/assets/8a997068-71d9-49ed-8a5c-00e0fa8853af"
/>


### Key Changes
1. Introduced a local variable `run_profiler` in
`InferenceSession::Run`, which is destroyed after the run completes.
Using a dedicated profiler per run ensures that profiling data is
isolated and prevents interleaving or corruption across runs.
2. To maintain accurate execution time when both session-level and
run-level profiling are enabled, overloaded `Start` and
`EndTimeAndRecordEvent` functions have been added. These allow the
caller to provide timestamps instead of relying on
`std::chrono::high_resolution_clock::now()`, avoiding potential timing
inaccuracies.
3. Added a TLS variable `tls_run_profiler_` to support run-level
profiling with WebGPU Execution Provider (EP). This ensures that when
multiple threads enable run-level profiling, each thread logs only to
its own WebGPU profiler, keeping thread-specific data isolated.
4. Use `HH:MM:SS.mm` instead of `HH:MM:SS` in the JSON filename to
prevent conflicts when profiling multiple consecutive runs.

### Motivation and Context
Previously, profiling was available only at the session level. Developers
sometimes want to profile a specific run, which this PR makes possible.


### Some details

When profiling is enabled via RunOptions, it should ideally collect two
types of events:
1. Profiler events
Used to calculate the CPU execution time of each operator.
2. Execution Provider (EP) profiler events
Used to measure GPU kernel execution time. 

Unlike session-level profiling, run-level profiling must collect events
correctly when multiple threads run concurrently.

For 1, this is straightforward (see sequential_executor.cc). We use a
thread-local storage (TLS) variable, RunLevelState (defined in
profiler.h), to maintain run-level profiling state for each thread.

For 2, each Execution Provider (EP) has its own profiler implementation,
and each EP must ensure correct behavior under run-level profiling. This
PR ensures that the WebGPU profiler works correctly with run-level
profiling.

### Test Cases

| Scenario | Example | Expected Result |
|---------|---------|-----------------|
| Concurrent runs on the same session with different run-level profiling settings | t1: `sess1.Run({ enable_profiling: true })`<br>t2: `sess1.Run({ enable_profiling: false })`<br>t3: `sess1.Run({ enable_profiling: true })` | Two trace JSON files are generated: one for `t1` and one for `t3`. |
| Run-level profiling enabled together with session-level profiling | `sess1 = OrtSession({ enable_profiling: true })`<br>`sess1.Run({ enable_profiling: true })` | Two trace JSON files are generated: one corresponding to session-level profiling and one corresponding to run-level profiling. |

* Fix GQA Parity (microsoft#27108)

Fix [microsoft#27079](microsoft#27079) -
Qwen3 model quality regression on CUDA backend.
### Root Cause Analysis
The parity issue was caused by **buffer pointer misconfiguration** in
the GQA (Group Query Attention) QKV preprocessing pipeline. The original
implementation used multiple separate kernels for:
1. Unpacking packed QKV tensor
2. Applying RoPE (Rotary Position Embedding) to Q and K 
3. Appending K/V to cache
This multi-kernel approach created opportunities for misconfiguration:
- Buffers were allocated but not properly used
- Pointers could reference memory that was not yet allocated or
initialized
- Buffer sharing logic was fragmented across different code paths
### Solution
Consolidate QKV preprocessing into a **single fused kernel**
(`UnpackRoPEAppend`) that performs all operations in one pass:
1. **Unified kernel design**: A single kernel handles unpacking, RoPE
application, and cache append operations
2. **Simplified buffer management**: The new `PrepareQKV` function
clearly manages buffer allocation and ensures proper initialization
3. **Explicit past-to-present cache copy**: When
`past_present_share_buffer` is false, explicitly copy past KV cache to
present buffer before appending new tokens
4. **Zero-initialization for non-shared buffers**: Clear present KV
buffers when not sharing with past to ensure deterministic output
### Changes Summary
| File | Changes |
|------|---------|
| `group_query_attention_qkv.cuh` | New fused `UnpackRoPEAppend` kernel with shared memory optimization for non-interleaved RoPE |
| `group_query_attention_impl.cu` | New `PrepareQKV` helper function that orchestrates buffer setup and kernel launch |
| `group_query_attention.cc` | Simplified operator logic by delegating QKV prep to unified helper |
| `test_gqa.py` | Enhanced test coverage for various QKV configurations |
### Key Improvements
- **Reduced kernel launches**: From 4-5 separate kernel calls to a
single fused kernel
- **Better memory safety**: All buffer pointers are validated in a
single location
- **Improved RoPE handling**: Uses shared memory for efficient
non-interleaved RoPE computation
- **Deterministic output**: Explicit buffer initialization ensures
consistent results across runs
- **Compatible with quantized KV cache**: The new preprocessing kernel
design supports future quantization work
### Testing
- All existing GQA unit tests pass
- Verified Qwen3 model no longer produces gibberish output
- Tested both fp16/bf16 and various head configurations

* [QNN EP] Fix error messages being logged as VERBOSE instead of ERROR (microsoft#24931)

## Problem

QNN error messages were being logged at VERBOSE level instead of ERROR
level, making them invisible unless verbose logging was enabled. Users
would only see unhelpful generic error messages like:

```
Failed to finalize QNN graph. Error code: 1002 at location qnn_model.cc:167 FinalizeGraphs
```

But the actual detailed error messages from QNN were hidden in verbose
logs:

```
tcm_migration.cc:2088:ERROR:Operator named q::*InputSlicePad (0x1654900000002) not sufficiently tiled to fit in TCM. Requires 12441600 bytes
graph_prepare.cc:2808:ERROR:Graph prepare TCM Migration action failed
graph_prepare.cc:2868:ERROR:Graph prepare failed during optimization with err: 17, Fatal Optimize
```

## Root Cause

The `QnnLogging` callback function in `qnn_backend_manager.cc` was
ignoring the `level` parameter from QNN and hardcoding all messages as
`kVERBOSE` severity:

```cpp
void QnnLogging(const char* format, QnnLog_Level_t level, uint64_t timestamp, va_list argument_parameter) {
  ORT_UNUSED_PARAMETER(level);  // ❌ Ignoring the actual log level
  // ...
  const auto severity = ::onnxruntime::logging::Severity::kVERBOSE;  // ❌ Hardcoded as VERBOSE
```

## Solution

Modified the `QnnLogging` function to properly map QNN log levels to
appropriate ORT severity levels:

- `QNN_LOG_LEVEL_ERROR` → `logging::Severity::kERROR` ✅ **Key fix**
- `QNN_LOG_LEVEL_WARN` → `logging::Severity::kWARNING`
- `QNN_LOG_LEVEL_INFO` → `logging::Severity::kINFO`
- `QNN_LOG_LEVEL_VERBOSE/DEBUG` → `logging::Severity::kVERBOSE`
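The mapping listed above amounts to a small switch. The sketch below uses stand-in enums so it compiles on its own: `QnnLogLevel`, `Severity`, and `MapQnnLogLevelToSeverity` are hypothetical names that mirror, but are not, the QNN SDK and ONNX Runtime types, and the fallback for unknown values is an assumption.

```cpp
#include <cassert>

// Stand-in for the QNN log level enum (not the real QNN SDK type).
enum class QnnLogLevel { kError, kWarn, kInfo, kVerbose, kDebug };

// Stand-in for the ORT logging severity enum (not the real ORT type).
enum class Severity { kVERBOSE, kINFO, kWARNING, kERROR };

// Maps a backend log level to a logger severity instead of hardcoding VERBOSE.
Severity MapQnnLogLevelToSeverity(QnnLogLevel level) {
  switch (level) {
    case QnnLogLevel::kError:
      return Severity::kERROR;
    case QnnLogLevel::kWarn:
      return Severity::kWARNING;
    case QnnLogLevel::kInfo:
      return Severity::kINFO;
    case QnnLogLevel::kVerbose:
    case QnnLogLevel::kDebug:
      return Severity::kVERBOSE;
  }
  return Severity::kVERBOSE;  // assumed fallback for values outside the enum
}
```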

## Changes Made

1. **Modified `QnnLogging` function**: Removed hardcoded `kVERBOSE` and
added proper level mapping
2. **Added `MapQNNLogLevelToOrtSeverity` function**: For potential
future reuse
3. **Minimal and surgical changes**: Only 37 lines added, 2 removed

## Impact

QNN error messages will now appear as ERROR-level logs in normal logging
output, making debugging much easier for users without requiring verbose
logging to be enabled.

Fixes microsoft#24876.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: vraspar <51386888+vraspar@users.noreply.github.com>
Co-authored-by: yuslepukhin <11303988+yuslepukhin@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>

* [webgpu] Use LazyRelease for prepack allocator (microsoft#27077)

BUG microsoft#27068

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>

* [webgpu] fix broadcast for SkipLayerNorm (microsoft#27107)

### Description

Fix the bug discovered by microsoft#27014:

```
SkipLayerNormTest.SkipLayerNormBatch2_Skip_Broadcast_No_Batch_Size
SkipLayerNormTest.SkipLayerNormBatch2_Skip_Broadcast_Batch_Size_1
```

* [webgpu] Support int64 for range (microsoft#26673)

### Description  
- Add new registerInt64Ops option to WebGpuExecutionProviderConfig
- Int64 support now enabled when enable_graph_capture OR
  register_int64_ops is true
- Refactor Range kernel registration to support conditional int64
  registration
- Update kernel registry caching to handle all 4 combinations of flags
- Rename parameters from enable_graph_capture to enable_int64 for
  clarity
- Add config parsing in webgpu_provider_factory.cc for registerInt64Ops
  option

### Motivation
Needed to update position ids with an ONNX model in genai.

Continuous decoding mode: `position_ids[i] = i + total_length - new_kv_length`

We can use an ONNX model that includes a Range op to update the
position ids:
- Inputs: start (total_length - new_kv_length), limit (total_length),
  delta (1)
- Output: position_ids (1D tensor of size new_kv_length)
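The position-id update described above can be checked on the CPU with a plain int64 range. `Range` and `PositionIds` here are illustrative helper names, not the WebGPU kernel, and the sketch assumes a positive `delta`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Half-open integer range [start, limit) with a positive step, like ONNX Range for int64.
std::vector<std::int64_t> Range(std::int64_t start, std::int64_t limit, std::int64_t delta) {
  assert(delta > 0);  // this sketch only handles ascending ranges
  std::vector<std::int64_t> out;
  for (std::int64_t v = start; v < limit; v += delta) out.push_back(v);
  return out;
}

// position_ids[i] = i + total_length - new_kv_length, for i in [0, new_kv_length).
std::vector<std::int64_t> PositionIds(std::int64_t total_length, std::int64_t new_kv_length) {
  return Range(total_length - new_kv_length, total_length, 1);
}
```

For example, with `total_length = 10` and `new_kv_length = 3`, the result holds the three positions 7, 8, 9.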

* Remove x86 from nuget (microsoft#27124)

Related issue: microsoft#26985

* perftest: support plugin eps for compile_ep_context (microsoft#27121)

* Extend compile_ep_context to also support plugin eps
* Adds compile_only option to skip execution, can be used when compiling
for virtual devices

compile_ep_context (physical device)
<img width="1259" height="510" alt="image"
src="https://github.com/user-attachments/assets/14650c17-0c8a-4002-a7ce-e8e4c815a516"
/>

compile_ep_context + compile_only (virtual device)
<img width="1262" height="173" alt="image"
src="https://github.com/user-attachments/assets/2f0844cc-5e83-4b2d-bf0a-0d815d9bad29"
/>

* [CPU] Fix arithmetic overflow and legacy TODO in Det operator (microsoft#27070)

### Description
This PR fixes the legacy `TODO: fix the warnings` in the `Det` operator.
The arithmetic overflow warning (C26451) is addressed by using `int64_t`
for tensor dimension and batch size calculations, ensuring safe pointer
arithmetic.

### Motivation and Context
- Removes unused warning suppression pragma.
- Prevents potential overflow when handling large batches of matrices.
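To see why 64-bit arithmetic matters here, consider the element offset of one matrix inside a batched `[batch, n, n]` tensor. The sketch below is an illustration only; `MatrixOffset` and `FitsInInt32` are hypothetical helpers, and the actual `Det` code is not shown in this PR description.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Element offset of matrix `batch_index` in a contiguous [batch, n, n] tensor,
// computed entirely in 64-bit so the product cannot wrap for large batches.
std::int64_t MatrixOffset(std::int64_t batch_index, std::int64_t n) {
  assert(batch_index >= 0 && n >= 0);
  return batch_index * n * n;
}

// True when the value could also have been held in a 32-bit int.
bool FitsInInt32(std::int64_t v) {
  return v >= std::numeric_limits<std::int32_t>::min() &&
         v <= std::numeric_limits<std::int32_t>::max();
}
```

With `batch_index = 70000` and `n = 200` the offset is 2,800,000,000 elements, which no longer fits in a 32-bit `int`.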

* Engine compatibility validity API implementation (microsoft#26774)

Added support for engine validation check for EP Context models.

### Motivation and Context
We wanted to support the GetModelCompatibilityForEpDevices() API so
that end users have an API for the engine validation check on EP
context models. This change adds that support and the necessary
function implementations.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: Rohanjames1997 <rohanjms@amazon.com>
Co-authored-by: adrastogi <aditya.rastogi@microsoft.com>
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: adrastogi <8368026+adrastogi@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Stephan Seitz <stephan.seitz@fau.de>
Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: vraspar <51386888+vraspar@users.noreply.github.com>
Co-authored-by: yuslepukhin <11303988+yuslepukhin@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: Theodore Cooper <63190431+the0cp@users.noreply.github.com>
Co-authored-by: umangb-09 <umangb@nvidia.com>
Co-authored-by: ortqnnepci <ortqnnepci@qti.qualcomm.com>
adrianlizarraga added a commit that referenced this pull request Mar 31, 2026
### Description

Run-level profiling (introduced in PR #26846) does not currently capture
profiling events for operators inside subgraphs. This PR fixes that by
threading the `run_profiler` pointer through `OpKernelContextInternal`
to subgraph execution, following the same pattern as `terminate_flag`.

### Root Cause

`utils::ExecuteSubgraph()` had no `run_profiler` parameter and always
passed `nullptr` to `ExecuteGraphImpl`, so nested operators (inside If,
Loop, Scan, BeamSearch, GreedySearch) were never profiled at the run
level.

### Fix

1. **`OpKernelContextInternal`** — Added `run_profiler_` member and
`GetRunProfiler()` accessor.
2. **`SessionScope` / `ExecuteKernel()`** — Pass the run profiler into
`OpKernelContextInternal`.
3. **`ExecuteSubgraph()`** — Added `profiling::Profiler* run_profiler =
nullptr` parameter, forwarded to `ExecuteGraphImpl()`.
4. **Control flow ops** (`if.cc`, `loop.cc`, `scan_utils.cc`) — Pass
`context_.GetRunProfiler()` to `ExecuteSubgraph()`.
5. **Contrib transformer ops** (`beam_search_impl_gpt.h`,
`beam_search_impl_t5.h`, `beam_search_impl_whisper.h`,
`greedy_search_impl_gpt.h`) — All 8 `ExecuteSubgraph()` call sites
updated to pass `this->context_.GetRunProfiler()`.

Plugin EP control flow kernels (`PluginEpIfKernelImpl`, etc.) delegate
to the same internal kernels, so the fix propagates automatically.
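The shape of the fix can be shown with a toy executor that forwards an optional profiler pointer into nested graphs. `Graph`, `Profiler`, and `ExecuteGraph` are simplified stand-ins invented for this sketch (it relies on C++17 support for `std::vector` of an incomplete type); the real change threads the pointer through `OpKernelContextInternal` instead.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Profiler {
  std::vector<std::string> node_names;  // names of nodes that were profiled
};

struct Graph {
  std::vector<std::string> nodes;  // names of the ops in this graph
  std::vector<Graph> subgraphs;    // nested bodies, e.g. the branches of an If
};

// Runs every node and forwards the same run profiler to nested graphs.
// The bug described above corresponds to passing nullptr in the recursive call.
void ExecuteGraph(const Graph& graph, Profiler* run_profiler = nullptr) {
  for (const auto& name : graph.nodes) {
    if (run_profiler != nullptr) run_profiler->node_names.push_back(name);
  }
  for (const auto& sub : graph.subgraphs) {
    ExecuteGraph(sub, run_profiler);
  }
}
```

Because the parameter defaults to `nullptr`, existing call sites that do not profile keep compiling unchanged, which matches the default used for `ExecuteSubgraph()` in the fix.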

### Tests

- **`CheckRunProfilerWithSubgraph`** (`inference_session_test.cc`) —
Runs `if_mul.onnx`, enables run profiling, asserts `mul_0` (inside If's
then-branch) appears in the profile JSON.
- **`CheckRunProfilerWithBeamSearch`** (`beam_search_test.cc`) — Runs
`tiny_gpt2_beamsearch.onnx`, enables run profiling, asserts decoder
subgraph Node entries (beyond the top-level BeamSearch op) appear in the
profile JSON.

### Files Changed (12 files)

| File | Change |
|------|--------|
| `core/framework/op_kernel_context_internal.h` | Added `run_profiler_` member, `GetRunProfiler()`, constructor param |
| `core/framework/sequential_executor.cc` | `SessionScope::GetRunProfiler()`, pass to `OpKernelContextInternal` |
| `core/framework/utils.h` / `utils.cc` | `run_profiler` param on `ExecuteSubgraph()` |
| `core/providers/cpu/controlflow/if.cc` | Forward `GetRunProfiler()` |
| `core/providers/cpu/controlflow/loop.cc` | Forward `GetRunProfiler()` |
| `core/providers/cpu/controlflow/scan_utils.cc` | Forward `GetRunProfiler()` |
| `contrib_ops/cpu/transformers/beam_search_impl_gpt.h` | 2 call sites |
| `contrib_ops/cpu/transformers/beam_search_impl_t5.h` | 2 call sites |
| `contrib_ops/cpu/transformers/beam_search_impl_whisper.h` | 2 call sites |
| `contrib_ops/cpu/transformers/greedy_search_impl_gpt.h` | 2 call sites |
| `test/framework/inference_session_test.cc` | `CheckRunProfilerWithSubgraph` test |
| `test/contrib_ops/beam_search_test.cc` | `CheckRunProfilerWithBeamSearch` test |
adrianlizarraga added a commit that referenced this pull request Apr 1, 2026
### Description
#### TLDR

This PR ports the existing C++
[EpProfiler](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/framework/execution_provider.h#L359)
interfaces used by provider-bridge EPs to the binary-stable C APIs for
plugin EPs. It introduces C/C++ APIs for creating/querying profiling
events, a container for appending EP events, and callback hooks
(`StartEvent`/`StopEvent`) that give EPs access to ORT event metadata in
real-time.

#### Changes to the original C++ API

The original `EpProfiler` C++ interface was adapted for the C API with
the following intentional changes:

1. **`StartProfiling`** now receives an offset indicating the elapsed
time since profiling started, as opposed to receiving an
absolute/epoch-dependent profiling start time. This prevents EPs from
having to do epoch conversions. Credit to @edgchen1 for the idea.
2. **`StartEvent`/`StopEvent` receive an absolute, epoch-based
correlation ID (`ort_event_correlation_id`)** instead of a relative ORT
event ID. The `PluginEpProfiler` bridge layer automatically converts the
C++ `relative_ort_event_id` (microseconds since profiling start) to an
absolute `ort_event_correlation_id` by adding the epoch-based profiling
start time. This means plugin EPs can use the correlation ID directly
with profiling utilities like CUPTI or ROCTracer without computing the
conversion themselves.
3. **`StopEvent` now receives the completed ORT event as a parameter.**
This allows EPs to optionally inspect ORT event metadata (e.g.,
`op_name`, `event_name`) at the time the event ends, facilitating
annotation of correlated EP events.
4. **`EndProfiling` only allows EPs to *append* events (via
   `OrtProfilingEventsContainer`), not read or modify the full events
   array.** This is motivated by the following:
   - It prevents any one EP from modifying events generated by ORT or
     another EP.
   - Certain EPs (VitisAI and WebGPU) already only append events without
     reading the entire events array.
   - The CUDA EP reads the entire events array solely to merge/sort its
     own EP events next to correlated ORT events and add
     `parent_name`/`op_name` metadata. However:
     - Merging/sorting is mostly unnecessary since trace viewers that
       load these files do their own event sorting.
     - This merging/sorting step was previously required to augment CUDA
       EP events with metadata from the correlated ORT event. However,
       that can now be obtained more simply via the new `StopEvent`
       parameter that provides the EP with the full correlated ORT
       event.
     - The [merge algorithm used by CUDA
       EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397)
       **incorrectly** assumes ORT events are sorted by non-decreasing
       *start* time, but they are actually sorted by [non-decreasing
       *end*
       time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91)
       (also see
       #13706 (comment)).
       Fixing this would require sorting the entire events array before
       asking a provider-bridge EP to merge its events into the global
       events array. It is not clear that this is worth the runtime
       cost.
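
The start-time versus end-time ordering issue can be illustrated with a small standalone sketch. It assumes events are appended when they complete, so an enclosing event lands after the events nested inside it. The `Event` struct and the helper names are hypothetical stand-ins, not ORT types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a profiling record; not the ORT event type.
struct Event {
  std::string name;
  int64_t start_us;
  int64_t dur_us;
  int64_t end_us() const {
    assert(dur_us >= 0);
    return start_us + dur_us;
  }
};

// Events are appended when they end, so the enclosing "model_run" event
// is recorded after the two node events it contains.
std::vector<Event> RecordInCompletionOrder() {
  return {
      {"node_A_kernel_time", 10, 5},  // ends at 15
      {"node_B_kernel_time", 20, 5},  // ends at 25
      {"model_run", 0, 30},           // starts first, ends last (30)
  };
}

bool SortedByStart(const std::vector<Event>& events) {
  return std::is_sorted(events.begin(), events.end(),
                        [](const Event& a, const Event& b) {
                          return a.start_us < b.start_us;
                        });
}

bool SortedByEnd(const std::vector<Event>& events) {
  return std::is_sorted(events.begin(), events.end(),
                        [](const Event& a, const Event& b) {
                          return a.end_us() < b.end_us();
                        });
}

// What a start-time-based merge would need first: an O(n log n) sort of
// the whole array. Stable, so ties keep their completion order.
void SortByStart(std::vector<Event>& events) {
  std::stable_sort(events.begin(), events.end(),
                   [](const Event& a, const Event& b) {
                     return a.start_us < b.start_us;
                   });
}
```

For the three events above, `SortedByEnd` holds while `SortedByStart` does not, which is the situation a merge that walks the array assuming start-time order would mishandle.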

#### Naming conventions for ORT event IDs

- **C++ `EpProfiler` interface** (existing): Uses
`relative_ort_event_id` — a timestamp offset in microseconds relative to
profiling start.
- **C API `OrtEpProfilerImpl`** (new in this PR): Uses
`ort_event_correlation_id` — an absolute, epoch-based timestamp in
microseconds computed from `std::chrono::high_resolution_clock`
(platform-defined epoch). Unique across concurrent profiling sessions
within the same process.
- **Conversion**: The `PluginEpProfiler` bridge class (in
`ep_event_profiling.cc`) performs `ort_event_correlation_id =
relative_ort_event_id + profiling_start_time_epoch_us_`, mirroring the
pattern in `GPUTracerManager::PushCorrelation`.
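
As a rough sketch of the conversion described above, the helper below adds an epoch-based start time in microseconds to a relative event ID. The class name `CorrelationIdConverter` and its methods are hypothetical and only illustrate the arithmetic; the actual bridge code lives in `ep_event_profiling.cc`.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper; illustrates the arithmetic only, not the ORT class.
class CorrelationIdConverter {
 public:
  using Clock = std::chrono::high_resolution_clock;

  explicit CorrelationIdConverter(Clock::time_point profiling_start)
      : profiling_start_time_epoch_us_(ToEpochMicros(profiling_start)) {}

  // relative_ort_event_id: microseconds since profiling start.
  uint64_t ToCorrelationId(uint64_t relative_ort_event_id) const {
    return relative_ort_event_id + profiling_start_time_epoch_us_;
  }

  // Inverse mapping, useful when matching EP events back to ORT events.
  uint64_t ToRelativeId(uint64_t ort_event_correlation_id) const {
    assert(ort_event_correlation_id >= profiling_start_time_epoch_us_);
    return ort_event_correlation_id - profiling_start_time_epoch_us_;
  }

  // The epoch is whatever the clock defines; it is not necessarily Unix time.
  static uint64_t ToEpochMicros(Clock::time_point tp) {
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::microseconds>(
            tp.time_since_epoch())
            .count());
  }

 private:
  uint64_t profiling_start_time_epoch_us_;
};
```

With a start time of 1,000,000 µs since the clock's epoch, a relative ID of 250 maps to 1,000,250, and `ToRelativeId` recovers 250.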

### New C APIs

| API | Description |
|-----|-------------|
| `CreateProfilingEvent` | Create a profiling event with category, process/thread IDs, name, timestamp, duration, and key-value args |
| `ReleaseProfilingEvent` | Release a profiling event |
| `ProfilingEvent_GetCategory` | Get event category (`SESSION`, `NODE`, `KERNEL`, `API`) |
| `ProfilingEvent_GetName` | Get event name |
| `ProfilingEvent_GetTimestampUs` | Get event start timestamp (µs) |
| `ProfilingEvent_GetDurationUs` | Get event duration (µs) |
| `ProfilingEvent_GetArgValue` | Get an event argument value by key |
| `ProfilingEventsContainer_AddEvents` | Append an array of EP events to the output container |
| `OrtEp::CreateProfiler` | Returns an instance of the EP's profiler implementation |
| `OrtEpProfilerImpl::StartProfiling` | Called by ORT to start a profiling session. Receives elapsed time offset (ns) since ORT profiling started |
| `OrtEpProfilerImpl::StartEvent` | Called by ORT to notify that an ORT event has started. Receives an absolute `ort_event_correlation_id` |
| `OrtEpProfilerImpl::StopEvent` | Called by ORT to notify that an ORT event has ended. Receives the same `ort_event_correlation_id` and ORT event metadata |
| `OrtEpProfilerImpl::EndProfiling` | Called by ORT to end the profiling session and collect EP events into the output container |
| `OrtEpProfilerImpl::Release` | Release the profiler instance |
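
One way an EP might use the elapsed-time offset passed to `StartProfiling` is sketched below: sample the EP's own clock when the call arrives, subtract the offset to find where ORT profiling began on that clock, and express later EP timestamps relative to that point. `EpTimeline` and its methods are hypothetical, and the use of `std::chrono::steady_clock` on the EP side is an assumption, not something the API requires.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical EP-side helper; not part of the ORT C API.
class EpTimeline {
 public:
  using Clock = std::chrono::steady_clock;

  // ort_elapsed: time since ORT profiling started (the offset, in ns).
  // ep_now: the EP's own clock reading taken when StartProfiling is called.
  void StartProfiling(std::chrono::nanoseconds ort_elapsed,
                      Clock::time_point ep_now) {
    assert(ort_elapsed.count() >= 0);
    ort_profiling_start_ =
        ep_now - std::chrono::duration_cast<Clock::duration>(ort_elapsed);
  }

  // Maps an EP clock reading onto ORT's profiling timeline, in microseconds.
  int64_t ToOrtRelativeUs(Clock::time_point ep_event_time) const {
    return std::chrono::duration_cast<std::chrono::microseconds>(
               ep_event_time - ort_profiling_start_)
        .count();
  }

 private:
  Clock::time_point ort_profiling_start_{};
};
```

No epoch conversion is involved: if the offset is 2 ms, an EP event observed 500 µs after `StartProfiling` lands at 2500 µs on the ORT timeline.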

### New C++ wrapper classes

| Class | Description |
|-------|-------------|
| `Ort::ConstProfilingEvent` | Non-owning const wrapper for reading fields from an `OrtProfilingEvent` (e.g., in `StopEvent`) |
| `Ort::ProfilingEvent` | Owning wrapper that creates and manages an `OrtProfilingEvent` (e.g., for `EndProfiling`) |
| `Ort::UnownedProfilingEventsContainer` | Non-owning wrapper for adding events to an `OrtProfilingEventsContainer` during `EndProfiling` |
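
The append-only design and the `StopEvent` metadata hand-off can be modeled with plain standard-library types. The sketch below does not use the real `Ort::` wrappers: `Event`, `AppendOnlyEvents`, and `SketchEpProfiler` are hypothetical stand-ins, and storing `op_name` in the event args is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a profiling event.
struct Event {
  std::string category;
  std::string name;
  int64_t ts_us;
  int64_t dur_us;
  std::unordered_map<std::string, std::string> args;
};

// Exposes only an append operation over the global events array, so an EP
// cannot read or modify events produced by ORT or by other EPs.
class AppendOnlyEvents {
 public:
  explicit AppendOnlyEvents(std::vector<Event>& sink) : sink_(sink) {}
  void AddEvents(std::vector<Event> events) {
    for (auto& e : events) sink_.push_back(std::move(e));
  }

 private:
  std::vector<Event>& sink_;
};

class SketchEpProfiler {
 public:
  // Remember metadata of the finished ORT event, keyed by correlation ID.
  void StopEvent(uint64_t correlation_id, const Event& ort_event) {
    auto it = ort_event.args.find("op_name");
    if (it != ort_event.args.end()) op_name_by_id_[correlation_id] = it->second;
  }

  // Called by the EP's own tracing code when a device-side event completes.
  void RecordEpEvent(uint64_t correlation_id, std::string name,
                     int64_t ts_us, int64_t dur_us) {
    pending_.push_back(
        {correlation_id, Event{"KERNEL", std::move(name), ts_us, dur_us, {}}});
  }

  // Annotate EP events with the correlated op name, then append them.
  void EndProfiling(AppendOnlyEvents& out) {
    std::vector<Event> annotated;
    for (auto& p : pending_) {
      auto it = op_name_by_id_.find(p.first);
      if (it != op_name_by_id_.end()) p.second.args["op_name"] = it->second;
      annotated.push_back(std::move(p.second));
    }
    pending_.clear();
    out.AddEvents(std::move(annotated));
  }

 private:
  std::unordered_map<uint64_t, std::string> op_name_by_id_;
  std::vector<std::pair<uint64_t, Event>> pending_;
};
```

Because the annotation happens from metadata captured in `StopEvent`, the profiler never needs to scan the global events array, which is the property the append-only container relies on.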

### Example EP profiling implementation
This PR updates an example plugin EP to use the new profiling APIs:
- Plugin EP code:
[test/autoep/library/example_plugin_ep_kernel_registry](https://github.com/microsoft/onnxruntime/tree/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry)
- `OrtEpProfilerImpl` implementation:
[ep_profiling.h](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.h)
/
[ep_profiling.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.cc)
- `OrtEp::CreateProfiler()` implementation:
[ep.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep.cc)

### Existing bugs found
Not fixed in this PR.

- The [merge algorithm used by CUDA
EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397)
**incorrectly** assumes ORT events are sorted by non-decreasing *start*
time, but they are actually sorted by [non-decreasing *end*
time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91)
(also see
#13706 (comment)).
- Run profilers do not handle subgraphs (e.g., the subgraph of a
control-flow operator). This has been the case since run profilers were
[introduced](#26846).

### Motivation and Context
Allows plugin EPs to generate profiling events, further closing the
functionality gap between provider-bridge EPs and plugin EPs.

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>