
ORT 1.24.3 release cherry pick round 4#27558

Merged
tianleiwu merged 14 commits into rel-1.24.3 from tlwu/rel-1.24.3_cherrypick_round4
Mar 5, 2026

Conversation

@tianleiwu
Contributor

This cherry-picks the following commits for the release:

| Commit ID | PR Number | Commit Title |
| --- | --- | --- |
| d5387d8 | #27192 | Avoid repetitive creation of fp4/fp8 native-custom-op domains for NvTensorRtRtx EP |
| 0b9906a | #27454 | Suppress spurious Array Out of Bounds warnings produced by GCC 14.2 compiler on Linux builds |
| 4a80b0b | #27471 | Fix double-free in TRT EP custom op domain Release functions |
| c7c939f | #27499 | Fix -Warray-bounds build error in MLAS on clang 17+ |
| f99dcca | #27514 | [Build] Fix pybind11 vcpkg configuration |
| ef04b10 | #27518 | [CXX Lora] Prevent heap OOB from maliciously crafted Lora Adapters. |
| 0b2b6d0 | #27288 | [NvTensorRTRTX EP]: Add missing override specifiers to suppress warnings |
| c1d8f5c | #27522 | Add "library_path" metadata entry to OrtEpDevice instances for plugin and provider bridge EPs |
| fdead1c | #27537 | Account for ORT_NO_EXCEPTIONS builds in Lora test |
| 3d1365e | #27521 | increase kMaxValueLength to 8192 |
| df8f4a7 | #27535 | Add OrtEnv.DisableDllImportResolver to prevent fatal error on resolver conflict |
| bdd672a | #27356 | Add/Update telemetry events |
| 2da1a30 | #27543 | Fix RoiAlign heap out-of-bounds read via unchecked batch_indices |
| 5c3f544 | #27466 | DQ→MatMulNBits fusion transformer for NvTensorRtRtx ep |

vishalpandya1990 and others added 14 commits March 4, 2026 16:04
…ensorRtRtx EP (#27192)

### Description
- Avoid repetitive creation of FP4/FP8 native custom-ops in create
method for custom-op domains (leaving plugin-based custom-op handling as
is, as it was before native-custom-ops addition in
[PR-26555](#26555)).
- Avoid deleting the custom-op domains at destructor time, since those
are created with static scope, so avoid potential double-delete.



### Motivation and Context
- Repeatedly checking for and creating the custom-op domains is
redundant, so this change removes the repetition.
- Explicitly deleting static objects in the destructor can lead to a
double-delete, so this change avoids it.
…ompiler on Linux builds (#27454)

### Description
Suppress spurious Array Out of Bounds warnings produced by GCC 14.2
compiler on Linux builds

### Motivation and Context
Linux build fails when compiled with GCC 14.2 due to spurious Array Out
of Bounds warnings (Warnings Treated as Errors)
### Description

Apply the same double-free fix from NvTensorRtRtx EP ([PR
#27192](#27192)) to the TRT
EP.

`CreateTensorRTCustomOpDomainList()` owns domains/ops via static
`unique_ptr`s, but `ReleaseTensorRTCustomOpDomain()` was manually
`delete`-ing the same objects through raw pointers — double-free at
program exit.

- `ReleaseTensorRTCustomOpDomain()` → no-op (static `unique_ptr`s own
the lifetime)
- `ReleaseTensorRTCustomOpDomainList()` → `clear()` the reference vector
only
- Added ownership comments to static members matching NvTensorRtRtx EP
style
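
The ownership rule described above can be sketched in a few lines. This is a minimal illustration, not the TRT EP code: `CustomOpDomain`, `CreateDomainList`, `ReleaseDomain`, and `ReleaseDomainList` are hypothetical stand-ins for the real ORT types and functions.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for the real custom-op domain type.
struct CustomOpDomain {
  const char* name;
};

// Hands out non-owning pointers. The function-local static unique_ptr owns
// the object and destroys it exactly once at program exit.
inline void CreateDomainList(std::vector<CustomOpDomain*>& out) {
  static std::unique_ptr<CustomOpDomain> domain =
      std::make_unique<CustomOpDomain>(CustomOpDomain{"example.domain"});
  out.clear();
  out.push_back(domain.get());
}

// Intentionally a no-op: a manual delete here would free memory that the
// static unique_ptr frees again at exit (the double-free being fixed).
inline void ReleaseDomain(CustomOpDomain* /*domain*/) {}

// Drops the references only; ownership stays with the static.
inline void ReleaseDomainList(std::vector<CustomOpDomain*>& list) {
  list.clear();
}
```

Calling `CreateDomainList` twice returns the same pointer both times, and releasing one caller's list leaves the other caller's pointer valid.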

### Motivation and Context

A review thread on PR #27192 identified that the TRT EP has the same
bug pattern that was fixed in the NvTensorRtRtx EP. The TRT EP code was
the original source this pattern was borrowed from. @tianleiwu noted
that a follow-up PR was needed.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
### Description

The `-Warray-bounds` suppression pragma in
`sqnbitgemm_kernel_avx2_int8_blklen32.h` was gated on
`defined(HAS_ARRAY_BOUNDS)`, which is set in `onnxruntime_config.h`.
MLAS never includes that header, so the guard was dead code and the
pragma never fired.

Changed the guard to `#ifdef __clang__`:

```cpp
// Before: HAS_ARRAY_BOUNDS never defined in MLAS TU
#if defined(__clang__) && defined(HAS_ARRAY_BOUNDS)

// After
#ifdef __clang__
```

Note: `__has_warning("-Warray-bounds")` was considered but the C
preprocessor does not short-circuit `&&`, so GCC fails to parse it even
behind `defined(__clang__)`.
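
For illustration only, one way to keep a `__has_warning` test away from GCC is to nest it inside a Clang-only `#if`, because directives in a skipped group are not evaluated. The PR itself uses the plain `#ifdef __clang__` guard; the sketch below is an assumed alternative, and `SumAccumulators` and `EXAMPLE_ARRAY_BOUNDS_SUPPRESSED` are hypothetical names loosely modeled on the kernel's `acc[NCols4]` pattern.

```cpp
#include <cstddef>

// The inner #if is only evaluated when the outer one is true, so compilers
// without __has_warning never try to parse it.
#if defined(__clang__)
#if __has_warning("-Warray-bounds")
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Warray-bounds"
#define EXAMPLE_ARRAY_BOUNDS_SUPPRESSED 1
#endif
#endif

// Hypothetical reduction of the kernel: elements 4..7 are read only when
// NCols4 == 8, so the branch is discarded for NCols4 == 4.
template <std::size_t NCols4>
float SumAccumulators() {
  float acc[NCols4];
  for (std::size_t i = 0; i < NCols4; ++i) acc[i] = 1.0f;
  float sum = acc[0] + acc[1] + acc[2] + acc[3];
  if constexpr (NCols4 == 8) {
    sum += acc[4] + acc[5] + acc[6] + acc[7];
  }
  return sum;
}

// Restore the diagnostic state only if it was changed above.
#ifdef EXAMPLE_ARRAY_BOUNDS_SUPPRESSED
#pragma clang diagnostic pop
#undef EXAMPLE_ARRAY_BOUNDS_SUPPRESSED
#endif
```

The sketch requires C++17 for `if constexpr`.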

### Motivation and Context

Build fails on Intel Mac with Apple Clang 17.0.0
(`-Werror,-Warray-bounds`). Clang raises a false-positive array-bounds
warning on `acc[4..7]` inside an `if constexpr (NCols4 == 8)` branch
that is dead when `NCols4 == 4`.

<!-- START COPILOT ORIGINAL PROMPT -->



<details>

<summary>Original prompt</summary>

> 
> ----
> 
> *This section details on the original issue you should resolve*
> 
> <issue_title>[Build] error: array index 4 is past the end of the array
(that has type '__m256[4]') [-Werror,-Warray-bounds]</issue_title>
> <issue_description>### Describe the issue
> 
> Unable to build from main branch
(0768f42 as of time writing this issue)
on Intel Mac
> 
> ```
> /usr/bin/c++ --version
> Apple clang version 17.0.0 (clang-1700.0.13.5)
> Target: x86_64-apple-darwin24.5.0
> Thread model: posix
> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
> ```
> 
> 
> ### Urgency
> 
> _No response_
> 
> ### Target platform
> 
> MacOS
> 
> ### Build script
> 
> ./build.sh --config RelWithDebInfo --build_shared_lib --parallel
--cmake_extra_defines CMAKE_OSX_ARCHITECTURES=x86_64
> 
> ### Error / output
> 
> [ 18%] Building CXX object
CMakeFiles/onnxruntime_mlas.dir/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2.cpp.o
> In file included from
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2.cpp:26:
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:49:
error: array index 4 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
>       |                                                 ^   ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
>  1531 |             __m256 acc[NCols4];
>       |             ^
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:57:
error: array index 5 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
>       |                                                         ^   ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
>  1531 |             __m256 acc[NCols4];
>       |             ^
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:65:
error: array index 6 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
> | ^ ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
>  1531 |             __m256 acc[NCols4];
>       |             ^
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:73:
error: array index 7 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
> | ^ ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
>  1531 |             __m256 acc[NCols4];
>       |             ^
> 4 errors generated.
> 
> ### Visual Studio Version
> 
> _No response_
> 
> ### GCC / Compiler Version
> 
> Apple clang version 17.0.0 (clang-1700.0.13.5)</issue_description>
> 
> <agent_instructions>Please investigate the build error. If code need
fix, create a pull requests. Otherwise, suggest ways to avoid the build
errors.</agent_instructions>
> 
> ## Comments on the Issue (you are @copilot in this section)
> 
> <comments>
> </comments>
> 


</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes #27497


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Building with `--use_vcpkg` but without
`--use_vcpkg_ms_internal_asset_cache` fails with an error like:
```
C:\code\onnxruntime\cmake\./vcpkg-ports\pybind11: info: installing overlay port from here
Downloading https://github.com/pybind/pybind11/archive/v3.0.2.tar.gz -> pybind-pybind11-v3.0.2.tar.gz
pybind-pybind11-v3.0.2.tar.gz.33772.part: error: download from https://github.com/pybind/pybind11/archive/v3.0.2.tar.gz had an unexpected hash
note: Expected: 786b1bf534ac67a8d5669f8babf67bb13e48b3a3da1b6344e43ae10a84b80bbc8fea5f12a65fd18739c341fefef5622c5dc096db964dff33cc62ea4259b2e2c1
note: Actual  : 19bee2c76320e25202ee078b5680ff8a7acfb33494dec29dad984ab04de8bcb01340d9fec37c8cc5ac9015dfc367e60312dcd8506e66ce8f0af4c49db562ddef
CMake Error at scripts/cmake/vcpkg_download_distfile.cmake:136 (message):
  Download failed, halting portfile.
```

The root cause is that I uploaded a zip file to the cache server. Without
`--use_vcpkg_ms_internal_asset_cache`, vcpkg tries to download the tar.gz
file from GitHub, whose SHA differs from that of the zip file.

This PR configures the portfile to download the zip file, which avoids
the issue.
…27518)

### Description
Detect and test mismatch between raw data size and declared data type
and shape of the lora adapter parameter.

### Motivation and Context
Disallow maliciously crafted lora adapters leading to heap OOB.
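
The kind of check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ORT implementation: `RawDataSizeMatches` is a hypothetical helper, and the caller is assumed to already know the element size for the declared data type.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Returns true only if raw_size equals element_size * product(shape).
// Negative dimensions and products that would overflow 64 bits are rejected,
// so a crafted shape cannot wrap around to a small expected size.
inline bool RawDataSizeMatches(std::size_t raw_size, std::size_t element_size,
                               const std::vector<std::int64_t>& shape) {
  std::uint64_t expected = element_size;
  for (std::int64_t dim : shape) {
    if (dim < 0) return false;
    const auto d = static_cast<std::uint64_t>(dim);
    if (d != 0 && expected > std::numeric_limits<std::uint64_t>::max() / d) {
      return false;  // multiplication would overflow
    }
    expected *= d;
  }
  return expected == static_cast<std::uint64_t>(raw_size);
}
```

For example, a float tensor of shape `{2, 3}` must carry exactly 24 bytes of raw data; any other size is treated as a mismatch.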

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ngs (#27288)

### Description

Compiling with Clang toolchains on Linux currently fails because of this
warning (among others), since ONNX Runtime builds with `-Werror` by
default. This PR addresses `-Winconsistent-missing-override` in the
NvTensorRTRTX EP.

```
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.h:309:7: error: 'GetDeviceId' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  309 |   int GetDeviceId() const { return device_id_; }
      |       ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/framework/execution_provider.h:183:15: note: overridden virtual function is here
  183 |   virtual int GetDeviceId() const { return default_device_.Id(); }
      |               ^
In file included from /home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc:18:
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.h:310:10: error: 'Sync' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  310 |   Status Sync() const;
      |          ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/framework/execution_provider.h:231:26: note: overridden virtual function is here
  231 |   virtual common::Status Sync() const { return Status::OK(); }
      |                          ^
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc:63:39: error: 'CreateProvider' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
   63 |   std::unique_ptr<IExecutionProvider> CreateProvider(const OrtSessionOptions& session_options,
      |                                       ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/providers/providers.h:29:47: note: overridden virtual function is here
   29 |   virtual std::unique_ptr<IExecutionProvider> CreateProvider(const OrtSessionOptions& session_options,
      |                                               ^
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc:112:46: error: 'CreateExecutionProviderFactory' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  112 |   std::shared_ptr<IExecutionProviderFactory> CreateExecutionProviderFactory(const void* param) {
      |                                              ^
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/shared_library/provider_host_api.h:19:54: note: overridden virtual function is here
   19 |   virtual std::shared_ptr<IExecutionProviderFactory> CreateExecutionProviderFactory(const void* /*provider_options*/) { return nullptr;
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.h:309:7: error: 'GetDeviceId' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  309 |   int GetDeviceId() const { return device_id_; }
      |       ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/framework/execution_provider.h:183:15: note: overridden virtual function is here
  183 |   virtual int GetDeviceId() const { return default_device_.Id(); }
```

### Motivation and Context

Fixing clang warnings enables builds with clang on Linux since `-Werror`
enforces warning-free builds.
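
As a minimal sketch of the fix pattern, the example below marks an overriding member function with `override`. `ProviderBase` and `ExampleProvider` are hypothetical classes, not the ORT types named in the log above.

```cpp
// Hypothetical base class with a virtual accessor.
struct ProviderBase {
  virtual ~ProviderBase() = default;
  virtual int GetDeviceId() const { return 0; }
};

// Every overriding member carries `override`, so the compiler verifies the
// signature against the base class and the class marks overrides consistently.
struct ExampleProvider : ProviderBase {
  explicit ExampleProvider(int device_id) : device_id_(device_id) {}
  int GetDeviceId() const override { return device_id_; }

 private:
  int device_id_;
};
```

A call through a `ProviderBase` reference dispatches to `ExampleProvider::GetDeviceId`, and a signature mismatch would become a compile error instead of a silent non-override.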
… and provider bridge EPs (#27522)

### Description
Set "library_path" metadata entry in OrtEpDevice instances for plugin
and provider bridge EPs.

### Motivation and Context
Make the "library_path" entry available for both plugin and provider
bridge EPs. GenAI requires it to load the custom ops library.

#27496
### Description
Non-required builds fail because the Lora tests use `ASSERT_THROW` while
RTTI is disabled.

Followup: #27518
### Motivation and Context
This change is needed because a 32B model such as Qwen2.5-coder-32B on
the TRTRTX EP produces a very long config string in GenAI:


https://github.com/microsoft/onnxruntime-genai/blob/3c47932e9d7afa0d44db0b3918e479bbdd4c5353/src/models/model.cpp#L516

Example

```
AddConfigEntry: ep.nvtensorrtrtxexecutionprovider.nv_profile_min_shapes (length=4364) = input_ids:1x1,attention_mask:1x1,past_key_values.0.key:1x8x0x128,past_key_values.0.value:1x8x0x128,past_key_values.1.key:1x8x0x128,past_key_values.1.value:1x8x0x128,past_key_values.2.key:1x8x0x128,past_key_values.2.value:1x8x0x128,past_key_values.3.key:1x8x0x128,past_key_values.3.value:1x8x0x128,past_key_values.4.key:1x8x0x128,past_key_values.4.value:1x8x0x128,past_key_values.5.key:1x8x0x128,past_key_values.5.value:1x8x0x128,past_key_values.6.key:1x8x0x128,past_key_values.6.value:1x8x0x128,past_key_values.7.key:1x8x0x128,past_key_values.7.value:1x8x0x128,past_key_values.8.key:1x8x0x128,past_key_values.8.value:1x8x0x128,past_key_values.9.key:1x8x0x128,past_key_values.9.value:1x8x0x128,past_key_values.10.key:1x8x0x128,past_key_values.10.value:1x8x0x128,past_key_values.11.key:1x8x0x128,past_key_values.11.value:1x8x0x128,past_key_values.12.key:1x8x0x128,past_key_values.12.value:1x8x0x128,past_key_values.13.key:1x8x0x128,past_key_values.13.value:1x8x0x128,past_key_values.14.key:1x8x0x128,past_key_values.14.value:1x8x0x128,past_key_values.15.key:1x8x0x128,past_key_values.15.value:1x8x0x128,past_key_values.16.key:1x8x0x128,past_key_values.16.value:1x8x0x128,past_key_values.17.key:1x8x0x128,past_key_values.17.value:1x8x0x128,past_key_values.18.key:1x8x0x128,past_key_values.18.value:1x8x0x128,past_key_values.19.key:1x8x0x128,past_key_values.19.value:1x8x0x128,past_key_values.20.key:1x8x0x128,past_key_values.20.value:1x8x0x128,past_key_values.21.key:1x8x0x128,past_key_values.21.value:1x8x0x128,past_key_values.22.key:1x8x0x128,past_key_values.22.value:1x8x0x128,past_key_values.23.key:1x8x0x128,past_key_values.23.value:1x8x0x128,past_key_values.24.key:1x8x0x128,past_key_values.24.value:1x8x0x128,past_key_values.25.key:1x8x0x128,past_key_values.25.value:1x8x0x128,past_key_values.26.key:1x8x0x128,past_key_values.26.value:1x8x0x128,past_key_values.27.key:1x8x0x128,past_key_values.27.value:1x8x0
x128,past_key_values.28.key:1x8x0x128,past_key_values.28.value:1x8x0x128,past_key_values.29.key:1x8x0x128,past_key_values.29.value:1x8x0x128,past_key_values.30.key:1x8x0x128,past_key_values.30.value:1x8x0x128,past_key_values.31.key:1x8x0x128,past_key_values.31.value:1x8x0x128,past_key_values.32.key:1x8x0x128,past_key_values.32.value:1x8x0x128,past_key_values.33.key:1x8x0x128,past_key_values.33.value:1x8x0x128,past_key_values.34.key:1x8x0x128,past_key_values.34.value:1x8x0x128,past_key_values.35.key:1x8x0x128,past_key_values.35.value:1x8x0x128,past_key_values.36.key:1x8x0x128,past_key_values.36.value:1x8x0x128,past_key_values.37.key:1x8x0x128,past_key_values.37.value:1x8x0x128,past_key_values.38.key:1x8x0x128,past_key_values.38.value:1x8x0x128,past_key_values.39.key:1x8x0x128,past_key_values.39.value:1x8x0x128,past_key_values.40.key:1x8x0x128,past_key_values.40.value:1x8x0x128,past_key_values.41.key:1x8x0x128,past_key_values.41.value:1x8x0x128,past_key_values.42.key:1x8x0x128,past_key_values.42.value:1x8x0x128,past_key_values.43.key:1x8x0x128,past_key_values.43.value:1x8x0x128,past_key_values.44.key:1x8x0x128,past_key_values.44.value:1x8x0x128,past_key_values.45.key:1x8x0x128,past_key_values.45.value:1x8x0x128,past_key_values.46.key:1x8x0x128,past_key_values.46.value:1x8x0x128,past_key_values.47.key:1x8x0x128,past_key_values.47.value:1x8x0x128,past_key_values.48.key:1x8x0x128,past_key_values.48.value:1x8x0x128,past_key_values.49.key:1x8x0x128,past_key_values.49.value:1x8x0x128,past_key_values.50.key:1x8x0x128,past_key_values.50.value:1x8x0x128,past_key_values.51.key:1x8x0x128,past_key_values.51.value:1x8x0x128,past_key_values.52.key:1x8x0x128,past_key_values.52.value:1x8x0x128,past_key_values.53.key:1x8x0x128,past_key_values.53.value:1x8x0x128,past_key_values.54.key:1x8x0x128,past_key_values.54.value:1x8x0x128,past_key_values.55.key:1x8x0x128,past_key_values.55.value:1x8x0x128,past_key_values.56.key:1x8x0x128,past_key_values.56.value:1x8x0x128,past_key_values.57.key:
1x8x0x128,past_key_values.57.value:1x8x0x128,past_key_values.58.key:1x8x0x128,past_key_values.58.value:1x8x0x128,past_key_values.59.key:1x8x0x128,past_key_values.59.value:1x8x0x128,past_key_values.60.key:1x8x0x128,past_key_values.60.value:1x8x0x128,past_key_values.61.key:1x8x0x128,past_key_values.61.value:1x8x0x128,past_key_values.62.key:1x8x0x128,past_key_values.62.value:1x8x0x128,past_key_values.63.key:1x8x0x128,past_key_values.63.value:1x8x0x128
Traceback (most recent call last):
  File "Convert to NVIDIA TRT for RTX_32B\test_config.py", line 2, in <module>
    model = og.Model("Convert to NVIDIA TRT for RTX_32B\\model")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Config value is longer than maximum length: 4096
```

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
…r conflict (#27535)

## Description

`NativeLibrary.SetDllImportResolver()` can only be called once per
assembly. If a host application registers its own `DllImportResolver`
for the ONNX Runtime assembly before any ORT API is used, ORT's internal
call to `SetDllImportResolver` in the `NativeMethods` static constructor
throws an `InvalidOperationException`, which surfaces as a fatal
`TypeInitializationException` — making ONNX Runtime completely unusable.

This PR adds two complementary safeguards:

1. **`try/catch(InvalidOperationException)`** around the
`SetDllImportResolver` call in `NativeMethods..cctor`, so that if a
resolver is already registered, ORT logs a diagnostic trace and
continues normally.

2. **`OrtEnv.DisableDllImportResolver`** — a public static `bool`
property that allows callers to explicitly opt out of ORT's resolver
registration before any ORT type is accessed. This is useful when the
host application needs full control over native library resolution.

### Usage

```csharp
// Option 1: Opt out before any ORT usage
OrtEnv.DisableDllImportResolver = true;
NativeLibrary.SetDllImportResolver(typeof(OrtEnv).Assembly, MyCustomResolver);
var env = OrtEnv.Instance();

// Option 2: Do nothing — if a resolver is already registered,
// ORT catches the conflict and continues using the existing resolver.
```

## Motivation and Context

When a library client has already called `SetDllImportResolver()` for
the ORT assembly (e.g., to handle platform-specific library loading),
ORT's attempt to register its own resolver causes a fatal, unrecoverable
error. This change makes ORT resilient to this scenario and gives
clients explicit control.

## Changes

### `csharp/src/Microsoft.ML.OnnxRuntime/OrtEnv.shared.cs`
- Added `public static bool DisableDllImportResolver` property with XML
documentation and usage example.

### `csharp/src/Microsoft.ML.OnnxRuntime/NativeMethods.shared.cs`
- Wrapped `NativeLibrary.SetDllImportResolver` in `if
(!OrtEnv.DisableDllImportResolver)` guard.
- Added `try/catch(InvalidOperationException)` with a `Trace.WriteLine`
diagnostic message.

### `csharp/test/Microsoft.ML.OnnxRuntime.Tests.Common/OrtEnvTests.cs`
- Added `OrtEnvExternalDllImportResolverTest` test class with two tests
using `AssemblyLoadContext` for process-level isolation of static
constructor behavior:
- **`TestExternalResolverRegisteredFirst`** — Registers an external
resolver FIRST, then initializes ORT. Verifies the `try/catch` prevents
a fatal error and ORT remains fully functional (`GetVersionString()`
succeeds).
- **`TestDisableDllImportResolverWorks`** — Sets
`DisableDllImportResolver = true`, initializes ORT, then registers an
external resolver. Verifies no `InvalidOperationException` is thrown,
proving ORT correctly skipped its own registration.
### Description
- `ModelLoadStart/End`: `InferenceSession::LoadWithLoader`,
`InferenceSession::LoadOrtModelWithLoader`
- `SessionCreationEnd`: `InferenceSession::Initialize`
- `RegisterEpLibraryWithLibPath`, `RegisterEpLibraryStart/End`:
`Environment::RegisterExecutionProviderLibrary`

Update: the `RuntimePerf` event is now triggered more frequently, with
exponential backoff. It is also triggered from `~InferenceSession()` to
log the tail data.

### Motivation and Context
To better measure health

---------

Co-authored-by: Darshak Bhatti <dabhatti@micorsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
)

# Fix RoiAlign heap out-of-bounds read via unchecked batch_indices

## Description

Add value-range validation for `batch_indices` in the RoiAlign operator
to prevent out-of-bounds heap reads from maliciously crafted ONNX
models.

`CheckROIAlignValidInput()` previously validated tensor shapes but never
checked that the **values** in `batch_indices` fall within `[0,
batch_size)`. An attacker could supply `batch_indices` containing values
exceeding the batch dimension of the input tensor `X`, causing the
kernel to read arbitrary heap memory at:

- **CPU:** `roialign.cc:212` — `roi_batch_ind` used as unchecked index
into `bottom_data`
- **CUDA:** `roialign_impl.cu:109` — `batch_indices_ptr[n]` used as
unchecked index into `bottom_data` on GPU

## Impact

- **Vulnerability type:** Heap out-of-bounds read
- **Impact:** Arbitrary heap memory read, potential information
disclosure, program crash
- **Trigger:** Construct `batch_indices` with values ≥ `batch_size` or <
0
- **Affected providers:** CPU and CUDA (both call
`CheckROIAlignValidInput()`)

## Changes

### `onnxruntime/core/providers/cpu/object_detection/roialign.cc`
- Added per-element bounds check in `CheckROIAlignValidInput()`: each
`batch_indices[i]` must satisfy `0 <= value < X.shape[0]`
- Returns `INVALID_ARGUMENT` with a descriptive error message on
violation
- Guarded by `batch_indices_ptr->Location().device.Type() ==
OrtDevice::CPU` so it only runs when the tensor data is host-accessible
(CPU EP and CropAndResize). For the CUDA EP, `batch_indices` lives in
GPU memory and cannot be safely dereferenced on the host.
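
The per-element check can be sketched as below. This is a simplified illustration rather than the code in `roialign.cc`: `ValidateBatchIndices` is a hypothetical helper that works on a host-side vector and reports failure as a message string instead of an ORT `Status`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns an empty string when every index lies in [0, batch_size);
// otherwise returns a message naming the first offending element.
inline std::string ValidateBatchIndices(
    const std::vector<std::int64_t>& batch_indices, std::int64_t batch_size) {
  for (std::size_t i = 0; i < batch_indices.size(); ++i) {
    const std::int64_t value = batch_indices[i];
    if (value < 0 || value >= batch_size) {
      return "batch_indices[" + std::to_string(i) + "] = " +
             std::to_string(value) + " is outside [0, " +
             std::to_string(batch_size) + ")";
    }
  }
  return {};
}
```

With `batch_size = 1`, the inputs `{1}` and `{-1}` are both rejected, which mirrors the two negative test cases listed for this change.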

### `onnxruntime/test/providers/cpu/object_detection/roialign_test.cc`
- Added `BatchIndicesOutOfRange` test: `batch_indices={1}` with
`batch_size=1` (exercises `>= batch_size` path)
- Added `BatchIndicesNegative` test: `batch_indices={-1}` (exercises `<
0` path)

## Known Limitation

The CUDA execution path is **not** protected by this bounds check
because `batch_indices` is a GPU tensor and cannot be read on the host.
Adding a device-side bounds check would require passing `batch_size`
into the CUDA kernel — this is tracked as a follow-up.

Note: Using `.InputMemoryType(OrtMemTypeCPUInput, 2)` was considered but
rejected because it would force a GPU→CPU transfer of `batch_indices`,
breaking CUDA graph capture for models like Masked R-CNN where
`batch_indices` is produced by upstream GPU ops.

## Validation

- Full `RoiAlignTest.*` suite passes (12/12 tests) on CPU build
- Full `RoiAlignTest.*` suite passes (12/12 tests) on CUDA build
- No regressions in existing positive or negative tests
## Summary

Generalize the WebNN-specific DequantizeLinear → MatMulNBits graph
fusion transformer so it can be
reused by other execution providers (e.g. NvTensorRTRTX), and add
defensive shape/size validation
to prevent crashes on malformed tensors.

### Fusion patterns

**Pattern 1:** `DequantizeLinear → Reshape → Transpose → [Cast] →
MatMul/Gemm` → **MatMulNBits**

**Pattern 2:** `DequantizeLinear (axis=0) → MatMul/Gemm` →
**MatMulNBits**

---------

Co-authored-by: praneshgo <227579474+praneshgo@users.noreply.github.com>
@tianleiwu tianleiwu enabled auto-merge (squash) March 5, 2026 04:27
@tianleiwu tianleiwu merged commit 3a728b7 into rel-1.24.3 Mar 5, 2026
94 of 128 checks passed
@tianleiwu tianleiwu deleted the tlwu/rel-1.24.3_cherrypick_round4 branch March 5, 2026 04:48