ORT 1.24.3 release cherry pick round 4 #27558
Merged
tianleiwu merged 14 commits into rel-1.24.3 on Mar 5, 2026
…ensorRtRtx EP (#27192)
### Description
- Avoid repeatedly creating the FP4/FP8 native custom ops in the create method for custom-op domains (plugin-based custom-op handling is left as it was before native custom ops were added in [PR-26555](#26555)).
- Avoid deleting the custom-op domains in the destructor: they are created with static scope, so deleting them there risks a double delete.
### Motivation and Context
- Repeated checks and creation of the custom-op domains are redundant, so this cleans them up.
- Explicitly deleting static objects in a destructor can lead to a double delete, so this avoids it.
…ompiler on Linux builds (#27454)
### Description
Suppress spurious array-out-of-bounds warnings produced by the GCC 14.2 compiler on Linux builds.
### Motivation and Context
The Linux build fails when compiled with GCC 14.2 because of spurious array-out-of-bounds warnings (warnings are treated as errors).
### Description
Apply the same double-free fix from NvTensorRtRtx EP ([PR #27192](#27192)) to the TRT EP. `CreateTensorRTCustomOpDomainList()` owns domains/ops via static `unique_ptr`s, but `ReleaseTensorRTCustomOpDomain()` was manually `delete`-ing the same objects through raw pointers, causing a double-free at program exit.
- `ReleaseTensorRTCustomOpDomain()` → no-op (static `unique_ptr`s own the lifetime)
- `ReleaseTensorRTCustomOpDomainList()` → `clear()` the reference vector only
- Added ownership comments to static members matching NvTensorRtRtx EP style
### Motivation and Context
PR #27192 review ([thread](#27192 (comment))) identified that the TRT EP has the identical bug pattern that was fixed in NvTensorRtRtx EP. The TRT EP code was the original source this pattern was borrowed from. @tianleiwu noted a follow-up PR was needed.
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
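The ownership rule described above can be illustrated with a minimal sketch. It is not the TRT EP code: `CustomOpDomain`, `CreateDomainList`, `ReleaseDomainList`, and the domain name `"example.domain"` are hypothetical stand-ins, chosen only to show why the release function must not `delete` what a static `unique_ptr` already owns.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a custom-op domain; not the real ORT type.
struct CustomOpDomain {
  std::string name;
};

// The function-local static unique_ptr is the sole owner of the domain.
// It is created once and destroyed automatically at program exit.
// Callers only ever receive non-owning raw pointers.
inline void CreateDomainList(std::vector<CustomOpDomain*>& out) {
  static std::unique_ptr<CustomOpDomain> domain =
      std::make_unique<CustomOpDomain>(CustomOpDomain{"example.domain"});
  out.push_back(domain.get());
}

// Releasing the list only drops the non-owning references. Calling
// `delete` on the elements here would free memory that the static
// unique_ptr frees again at exit, which is the double-free pattern.
inline void ReleaseDomainList(std::vector<CustomOpDomain*>& list) {
  list.clear();
}
```

Repeated calls to `CreateDomainList` hand out the same pointer, and `ReleaseDomainList` leaves the underlying object alive for any other holder.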
### Description
The `-Warray-bounds` suppression pragma in
`sqnbitgemm_kernel_avx2_int8_blklen32.h` was gated on
`defined(HAS_ARRAY_BOUNDS)`, which is set in `onnxruntime_config.h`.
MLAS never includes that header, so the guard was dead code and the
pragma never fired.
Changed the guard to `#ifdef __clang__`:
```cpp
// Before: HAS_ARRAY_BOUNDS never defined in MLAS TU
#if defined(__clang__) && defined(HAS_ARRAY_BOUNDS)
// After
#ifdef __clang__
```
Note: `__has_warning("-Warray-bounds")` was considered but the C
preprocessor does not short-circuit `&&`, so GCC fails to parse it even
behind `defined(__clang__)`.
### Motivation and Context
Build fails on Intel Mac with Apple Clang 17.0.0
(`-Werror,-Warray-bounds`). Clang raises a false-positive array-bounds
warning on `acc[4..7]` inside an `if constexpr (NCols4 == 8)` branch
that is dead when `NCols4 == 4`.
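As a rough illustration of the guard pattern, the sketch below wraps an `if constexpr` branch in a Clang-only diagnostic push/pop. `SumAccumulators` is a hypothetical function, not the MLAS kernel, and this simplified template may not reproduce the warning itself; it only shows where the pragmas sit and that non-Clang compilers never see them.

```cpp
#include <cstddef>

// Hypothetical reduction over NCols accumulators (assumes NCols >= 4).
// Indices 4..7 are only touched when NCols == 8.
template <std::size_t NCols>
float SumAccumulators(const float (&acc)[NCols]) {
  float total = acc[0] + acc[1] + acc[2] + acc[3];
#if defined(__clang__)
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Warray-bounds"
#endif
  if constexpr (NCols == 8) {
    // Dead branch when NCols == 4; some Clang versions may still
    // diagnose the indices below in real kernels.
    total += acc[4] + acc[5] + acc[6] + acc[7];
  }
#if defined(__clang__)
#pragma clang diagnostic pop
#endif
  return total;
}
```

Keeping the push/pop tight around the branch limits the suppression to the code known to be a false positive.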
<!-- START COPILOT ORIGINAL PROMPT -->
<details>
<summary>Original prompt</summary>
>
> ----
>
> *This section details on the original issue you should resolve*
>
> <issue_title>[Build] error: array index 4 is past the end of the array
(that has type '__m256[4]') [-Werror,-Warray-bounds]</issue_title>
> <issue_description>### Describe the issue
>
> Unable to build from main branch
(0768f42 as of time writing this issue)
on Intel Mac
>
> ```
> /usr/bin/c++ --version
> Apple clang version 17.0.0 (clang-1700.0.13.5)
> Target: x86_64-apple-darwin24.5.0
> Thread model: posix
> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
> ```
>
>
> ### Urgency
>
> _No response_
>
> ### Target platform
>
> MacOS
>
> ### Build script
>
> ./build.sh --config RelWithDebInfo --build_shared_lib --parallel
--cmake_extra_defines CMAKE_OSX_ARCHITECTURES=x86_64
>
> ### Error / output
>
> [ 18%] Building CXX object
CMakeFiles/onnxruntime_mlas.dir/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2.cpp.o
> In file included from
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2.cpp:26:
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:49:
error: array index 4 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
> | ^ ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
> 1531 | __m256 acc[NCols4];
> | ^
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:57:
error: array index 5 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
> | ^ ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
> 1531 | __m256 acc[NCols4];
> | ^
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:65:
error: array index 6 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
> | ^ ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
> 1531 | __m256 acc[NCols4];
> | ^
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:73:
error: array index 7 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
> | ^ ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
> 1531 | __m256 acc[NCols4];
> | ^
> 4 errors generated.
>
> ### Visual Studio Version
>
> _No response_
>
> ### GCC / Compiler Version
>
> Apple clang version 17.0.0 (clang-1700.0.13.5)</issue_description>
>
> <agent_instructions>Please investigate the build error. If code need
fix, create a pull requests. Otherwise, suggest ways to avoid the build
errors.</agent_instructions>
>
> ## Comments on the Issue (you are @copilot in this section)
>
> <comments>
> </comments>
>
</details>
<!-- START COPILOT CODING AGENT SUFFIX -->
- Fixes #27497
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Building with `--use_vcpkg` but without `--use_vcpkg_ms_internal_asset_cache` fails with an error like:
```
C:\code\onnxruntime\cmake\./vcpkg-ports\pybind11: info: installing overlay port from here
Downloading https://github.com/pybind/pybind11/archive/v3.0.2.tar.gz -> pybind-pybind11-v3.0.2.tar.gz
pybind-pybind11-v3.0.2.tar.gz.33772.part: error: download from https://github.com/pybind/pybind11/archive/v3.0.2.tar.gz had an unexpected hash
note: Expected: 786b1bf534ac67a8d5669f8babf67bb13e48b3a3da1b6344e43ae10a84b80bbc8fea5f12a65fd18739c341fefef5622c5dc096db964dff33cc62ea4259b2e2c1
note: Actual : 19bee2c76320e25202ee078b5680ff8a7acfb33494dec29dad984ab04de8bcb01340d9fec37c8cc5ac9015dfc367e60312dcd8506e66ce8f0af4c49db562ddef
CMake Error at scripts/cmake/vcpkg_download_distfile.cmake:136 (message):
  Download failed, halting portfile.
```
The root cause is that I uploaded a zip file to the cache server. Without `--use_vcpkg_ms_internal_asset_cache`, vcpkg tries to download the tar.gz file from GitHub, and its SHA differs from that of the zip file. In this PR, I configure the portfile to download the zip file to avoid the issue.
…27518)
### Description
Detect, and add tests for, a mismatch between the raw data size and the declared data type and shape of a LoRA adapter parameter.
### Motivation and Context
Reject maliciously crafted LoRA adapters that would otherwise lead to a heap out-of-bounds access.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
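One way such a size check could look is sketched below. This is not the ORT implementation: `ExpectedByteSize` and `RawDataMatchesShape` are hypothetical helpers, and the element size is passed in directly rather than derived from an ONNX data type. The idea is to compute the byte count implied by the shape, guarding against negative dimensions and multiplication overflow, and to compare it with the actual buffer size before reading any data.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Hypothetical helper: byte size implied by `shape` and `element_size`.
// Returns nullopt for a negative dimension or if the product overflows.
inline std::optional<std::uint64_t> ExpectedByteSize(
    const std::vector<std::int64_t>& shape, std::uint64_t element_size) {
  std::uint64_t total = element_size;
  for (std::int64_t dim : shape) {
    if (dim < 0) return std::nullopt;
    const auto d = static_cast<std::uint64_t>(dim);
    if (d != 0 && total > std::numeric_limits<std::uint64_t>::max() / d) {
      return std::nullopt;  // multiplication would overflow
    }
    total *= d;
  }
  return total;
}

// Hypothetical check: true only when the raw buffer size matches exactly.
inline bool RawDataMatchesShape(std::uint64_t raw_size,
                                const std::vector<std::int64_t>& shape,
                                std::uint64_t element_size) {
  const auto expected = ExpectedByteSize(shape, element_size);
  return expected.has_value() && *expected == raw_size;
}
```

For example, a `{2, 3}` tensor of 4-byte elements must carry exactly 24 bytes; any other raw size is rejected before the data is touched.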
…ngs (#27288)
### Description
Compilation with Clang toolchains on Linux currently fails due to this warning (among others), since ONNX Runtime compiles with `-Werror` by default. This PR addresses `-Winconsistent-missing-override` in the TRT NV EP.
```
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.h:309:7: error: 'GetDeviceId' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  309 |   int GetDeviceId() const { return device_id_; }
      |       ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/framework/execution_provider.h:183:15: note: overridden virtual function is here
  183 |   virtual int GetDeviceId() const { return default_device_.Id(); }
      |               ^
In file included from /home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc:18:
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.h:310:10: error: 'Sync' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  310 |   Status Sync() const;
      |          ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/framework/execution_provider.h:231:26: note: overridden virtual function is here
  231 |   virtual common::Status Sync() const { return Status::OK(); }
      |                          ^
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc:63:39: error: 'CreateProvider' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
   63 |   std::unique_ptr<IExecutionProvider> CreateProvider(const OrtSessionOptions& session_options,
      |                                       ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/providers/providers.h:29:47: note: overridden virtual function is here
   29 |   virtual std::unique_ptr<IExecutionProvider> CreateProvider(const OrtSessionOptions& session_options,
      |                                               ^
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc:112:46: error: 'CreateExecutionProviderFactory' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  112 |   std::shared_ptr<IExecutionProviderFactory> CreateExecutionProviderFactory(const void* param) {
      |                                              ^
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/shared_library/provider_host_api.h:19:54: note: overridden virtual function is here
   19 |   virtual std::shared_ptr<IExecutionProviderFactory> CreateExecutionProviderFactory(const void* /*provider_options*/) { return nullptr;
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.h:309:7: error: 'GetDeviceId' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  309 |   int GetDeviceId() const { return device_id_; }
      |       ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/framework/execution_provider.h:183:15: note: overridden virtual function is here
  183 |   virtual int GetDeviceId() const { return default_device_.Id(); }
```
### Motivation and Context
Fixing these Clang warnings enables builds with Clang on Linux, since `-Werror` enforces warning-free builds.
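The shape of the fix is small enough to show in a self-contained sketch. `ProviderBase` and `ExampleProvider` are hypothetical classes, not the ORT types from the log; the point is only that every overriding member is marked `override`, which is what `-Winconsistent-missing-override` asks for once any sibling override in the class uses the keyword.

```cpp
// Hypothetical base class standing in for an execution provider interface.
struct ProviderBase {
  virtual ~ProviderBase() = default;
  virtual int GetDeviceId() const { return 0; }
  virtual bool Sync() const { return true; }
};

// Every overriding member is marked `override`. The compiler then also
// rejects the declaration if the base signature ever changes.
struct ExampleProvider : ProviderBase {
  explicit ExampleProvider(int device_id) : device_id_(device_id) {}
  int GetDeviceId() const override { return device_id_; }
  bool Sync() const override { return true; }

 private:
  int device_id_;
};
```

Calling `GetDeviceId()` through a `ProviderBase` reference dispatches to the derived implementation as before; the keyword changes diagnostics, not behavior.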
… and provider bridge EPs (#27522)
### Description
Set the "library_path" metadata entry in `OrtEpDevice` instances for plugin and provider bridge EPs.
### Motivation and Context
Make the library path available everywhere. Required by GenAI to load the custom ops library. #27496
### Description
Non-required builds fail because the LoRA tests use `ASSERT_THROW` while RTTI is disabled. Follow-up to #27518.
### Motivation and Context
This change is needed because a 32B model such as Qwen2.5-coder-32B on the TRTRTX EP causes GenAI to pass a long config string: https://github.com/microsoft/onnxruntime-genai/blob/3c47932e9d7afa0d44db0b3918e479bbdd4c5353/src/models/model.cpp#L516
Example:
```
AddConfigEntry: ep.nvtensorrtrtxexecutionprovider.nv_profile_min_shapes (length=4364) = input_ids:1x1,attention_mask:1x1,past_key_values.0.key:1x8x0x128,past_key_values.0.value:1x8x0x128,past_key_values.1.key:1x8x0x128,past_key_values.1.value:1x8x0x128,past_key_values.2.key:1x8x0x128,past_key_values.2.value:1x8x0x128,past_key_values.3.key:1x8x0x128,past_key_values.3.value:1x8x0x128,past_key_values.4.key:1x8x0x128,past_key_values.4.value:1x8x0x128,past_key_values.5.key:1x8x0x128,past_key_values.5.value:1x8x0x128,past_key_values.6.key:1x8x0x128,past_key_values.6.value:1x8x0x128,past_key_values.7.key:1x8x0x128,past_key_values.7.value:1x8x0x128,past_key_values.8.key:1x8x0x128,past_key_values.8.value:1x8x0x128,past_key_values.9.key:1x8x0x128,past_key_values.9.value:1x8x0x128,past_key_values.10.key:1x8x0x128,past_key_values.10.value:1x8x0x128,past_key_values.11.key:1x8x0x128,past_key_values.11.value:1x8x0x128,past_key_values.12.key:1x8x0x128,past_key_values.12.value:1x8x0x128,past_key_values.13.key:1x8x0x128,past_key_values.13.value:1x8x0x128,past_key_values.14.key:1x8x0x128,past_key_values.14.value:1x8x0x128,past_key_values.15.key:1x8x0x128,past_key_values.15.value:1x8x0x128,past_key_values.16.key:1x8x0x128,past_key_values.16.value:1x8x0x128,past_key_values.17.key:1x8x0x128,past_key_values.17.value:1x8x0x128,past_key_values.18.key:1x8x0x128,past_key_values.18.value:1x8x0x128,past_key_values.19.key:1x8x0x128,past_key_values.19.value:1x8x0x128,past_key_values.20.key:1x8x0x128,past_key_values.20.value:1x8x0x128,past_key_values.21.key:1x8x0x128,past_key_values.21.value:1x8x0x128,past_key_values.22.key:1x8x0x128,past_key_values.22.value:1x8x0x128,past_key_values.23.key:1x8x0x128,past_key_values.23.value:1x8x0x128,past_key_values.24.key:1x8x0x128,past_key_values.24.value:1x8x0x128,past_key_values.25.key:1x8x0x128,past_key_values.25.value:1x8x0x128,past_key_values.26.key:1x8x0x128,past_key_values.26.value:1x8x0x128,past_key_values.27.key:1x8x0x128,past_key_values.27.value:1x8x0x128,past_key_values.28.key:1x8x0x128,past_key_values.28.value:1x8x0x128,past_key_values.29.key:1x8x0x128,past_key_values.29.value:1x8x0x128,past_key_values.30.key:1x8x0x128,past_key_values.30.value:1x8x0x128,past_key_values.31.key:1x8x0x128,past_key_values.31.value:1x8x0x128,past_key_values.32.key:1x8x0x128,past_key_values.32.value:1x8x0x128,past_key_values.33.key:1x8x0x128,past_key_values.33.value:1x8x0x128,past_key_values.34.key:1x8x0x128,past_key_values.34.value:1x8x0x128,past_key_values.35.key:1x8x0x128,past_key_values.35.value:1x8x0x128,past_key_values.36.key:1x8x0x128,past_key_values.36.value:1x8x0x128,past_key_values.37.key:1x8x0x128,past_key_values.37.value:1x8x0x128,past_key_values.38.key:1x8x0x128,past_key_values.38.value:1x8x0x128,past_key_values.39.key:1x8x0x128,past_key_values.39.value:1x8x0x128,past_key_values.40.key:1x8x0x128,past_key_values.40.value:1x8x0x128,past_key_values.41.key:1x8x0x128,past_key_values.41.value:1x8x0x128,past_key_values.42.key:1x8x0x128,past_key_values.42.value:1x8x0x128,past_key_values.43.key:1x8x0x128,past_key_values.43.value:1x8x0x128,past_key_values.44.key:1x8x0x128,past_key_values.44.value:1x8x0x128,past_key_values.45.key:1x8x0x128,past_key_values.45.value:1x8x0x128,past_key_values.46.key:1x8x0x128,past_key_values.46.value:1x8x0x128,past_key_values.47.key:1x8x0x128,past_key_values.47.value:1x8x0x128,past_key_values.48.key:1x8x0x128,past_key_values.48.value:1x8x0x128,past_key_values.49.key:1x8x0x128,past_key_values.49.value:1x8x0x128,past_key_values.50.key:1x8x0x128,past_key_values.50.value:1x8x0x128,past_key_values.51.key:1x8x0x128,past_key_values.51.value:1x8x0x128,past_key_values.52.key:1x8x0x128,past_key_values.52.value:1x8x0x128,past_key_values.53.key:1x8x0x128,past_key_values.53.value:1x8x0x128,past_key_values.54.key:1x8x0x128,past_key_values.54.value:1x8x0x128,past_key_values.55.key:1x8x0x128,past_key_values.55.value:1x8x0x128,past_key_values.56.key:1x8x0x128,past_key_values.56.value:1x8x0x128,past_key_values.57.key:1x8x0x128,past_key_values.57.value:1x8x0x128,past_key_values.58.key:1x8x0x128,past_key_values.58.value:1x8x0x128,past_key_values.59.key:1x8x0x128,past_key_values.59.value:1x8x0x128,past_key_values.60.key:1x8x0x128,past_key_values.60.value:1x8x0x128,past_key_values.61.key:1x8x0x128,past_key_values.61.value:1x8x0x128,past_key_values.62.key:1x8x0x128,past_key_values.62.value:1x8x0x128,past_key_values.63.key:1x8x0x128,past_key_values.63.value:1x8x0x128
Traceback (most recent call last):
  File "Convert to NVIDIA TRT for RTX_32B\test_config.py", line 2, in <module>
    model = og.Model("Convert to NVIDIA TRT for RTX_32B\\model")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Config value is longer than maximum length: 4096
```
---------
Co-authored-by: hualxie <hualxie@microsoft.com>
…r conflict (#27535)
## Description
`NativeLibrary.SetDllImportResolver()` can only be called once per assembly. If a host application registers its own `DllImportResolver` for the ONNX Runtime assembly before any ORT API is used, ORT's internal call to `SetDllImportResolver` in the `NativeMethods` static constructor throws an `InvalidOperationException`, which surfaces as a fatal `TypeInitializationException` and makes ONNX Runtime completely unusable.
This PR adds two complementary safeguards:
1. **`try/catch(InvalidOperationException)`** around the `SetDllImportResolver` call in `NativeMethods..cctor`, so that if a resolver is already registered, ORT logs a diagnostic trace and continues normally.
2. **`OrtEnv.DisableDllImportResolver`**: a public static `bool` property that allows callers to explicitly opt out of ORT's resolver registration before any ORT type is accessed. This is useful when the host application needs full control over native library resolution.
### Usage
```csharp
// Option 1: Opt out before any ORT usage
OrtEnv.DisableDllImportResolver = true;
NativeLibrary.SetDllImportResolver(typeof(OrtEnv).Assembly, MyCustomResolver);
var env = OrtEnv.Instance();

// Option 2: Do nothing. If a resolver is already registered,
// ORT catches the conflict and continues using the existing resolver.
```
## Motivation and Context
When a library client has already called `SetDllImportResolver()` for the ORT assembly (e.g., to handle platform-specific library loading), ORT's attempt to register its own resolver causes a fatal, unrecoverable error. This change makes ORT resilient to this scenario and gives clients explicit control.
## Changes
### `csharp/src/Microsoft.ML.OnnxRuntime/OrtEnv.shared.cs`
- Added `public static bool DisableDllImportResolver` property with XML documentation and usage example.
### `csharp/src/Microsoft.ML.OnnxRuntime/NativeMethods.shared.cs`
- Wrapped `NativeLibrary.SetDllImportResolver` in `if (!OrtEnv.DisableDllImportResolver)` guard.
- Added `try/catch(InvalidOperationException)` with a `Trace.WriteLine` diagnostic message.
### `csharp/test/Microsoft.ML.OnnxRuntime.Tests.Common/OrtEnvTests.cs`
- Added `OrtEnvExternalDllImportResolverTest` test class with two tests using `AssemblyLoadContext` for process-level isolation of static constructor behavior:
  - **`TestExternalResolverRegisteredFirst`**: Registers an external resolver FIRST, then initializes ORT. Verifies the `try/catch` prevents a fatal error and ORT remains fully functional (`GetVersionString()` succeeds).
  - **`TestDisableDllImportResolverWorks`**: Sets `DisableDllImportResolver = true`, initializes ORT, then registers an external resolver. Verifies no `InvalidOperationException` is thrown, proving ORT correctly skipped its own registration.
### Description
- ModelLoadStart/End - `InferenceSession::LoadWithLoader`, `InferenceSession::LoadOrtModelWithLoader`
- SessionCreationEnd - `InferenceSession::Initialize`
- RegisterEpLibraryWithLibPath, RegisterEpLibraryStart/End - `Environment::RegisterExecutionProviderLibrary`

Update: The RuntimePerf event is now triggered more frequently, with exponential backoff. It is also now triggered from `~InferenceSession()` to log data in the tail.
### Motivation and Context
To better measure health.
---------
Co-authored-by: Darshak Bhatti <dabhatti@micorsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
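The description does not say how the backoff is parameterized, so the following is only a sketch of the general technique under assumed rules: the trigger counts runs (not wall-clock time), the interval doubles after each emission, and it is capped at a maximum. `BackoffTrigger` and its parameters are hypothetical and are not taken from the ORT telemetry code.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: decides on which run a periodic event is emitted.
// The gap between emissions starts at `initial` runs and doubles after
// each emission until it reaches `max_interval`.
class BackoffTrigger {
 public:
  BackoffTrigger(std::uint64_t initial, std::uint64_t max_interval)
      : interval_(initial), max_interval_(max_interval), next_(initial) {}

  // Call once per run; returns true when an event should be emitted.
  bool OnRun() {
    ++count_;
    if (count_ < next_) return false;
    interval_ = std::min(interval_ * 2, max_interval_);
    next_ = count_ + interval_;
    return true;
  }

 private:
  std::uint64_t interval_;
  std::uint64_t max_interval_;
  std::uint64_t next_;
  std::uint64_t count_ = 0;
};
```

With `BackoffTrigger(1, 4)` the event fires on runs 1, 3, 7, and 11: early activity is sampled densely and steady-state activity at a bounded rate. A final emission from a destructor, as the description mentions, would cover the runs after the last trigger.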
)
# Fix RoiAlign heap out-of-bounds read via unchecked batch_indices
## Description
Add value-range validation for `batch_indices` in the RoiAlign operator to prevent out-of-bounds heap reads from maliciously crafted ONNX models.
`CheckROIAlignValidInput()` previously validated tensor shapes but never checked that the **values** in `batch_indices` fall within `[0, batch_size)`. An attacker could supply `batch_indices` containing values exceeding the batch dimension of the input tensor `X`, causing the kernel to read arbitrary heap memory at:
- **CPU:** `roialign.cc:212`: `roi_batch_ind` used as unchecked index into `bottom_data`
- **CUDA:** `roialign_impl.cu:109`: `batch_indices_ptr[n]` used as unchecked index into `bottom_data` on GPU
## Impact
- **Vulnerability type:** Heap out-of-bounds read
- **Impact:** Arbitrary heap memory read, potential information disclosure, program crash
- **Trigger:** Construct `batch_indices` with values ≥ `batch_size` or < 0
- **Affected providers:** CPU and CUDA (both call `CheckROIAlignValidInput()`)
## Changes
### `onnxruntime/core/providers/cpu/object_detection/roialign.cc`
- Added per-element bounds check in `CheckROIAlignValidInput()`: each `batch_indices[i]` must satisfy `0 <= value < X.shape[0]`
- Returns `INVALID_ARGUMENT` with a descriptive error message on violation
- Guarded by `batch_indices_ptr->Location().device.Type() == OrtDevice::CPU` so it only runs when the tensor data is host-accessible (CPU EP and CropAndResize). For the CUDA EP, `batch_indices` lives in GPU memory and cannot be safely dereferenced on the host.
### `onnxruntime/test/providers/cpu/object_detection/roialign_test.cc`
- Added `BatchIndicesOutOfRange` test: `batch_indices={1}` with `batch_size=1` (exercises `>= batch_size` path)
- Added `BatchIndicesNegative` test: `batch_indices={-1}` (exercises `< 0` path)
## Known Limitation
The CUDA execution path is **not** protected by this bounds check because `batch_indices` is a GPU tensor and cannot be read on the host. Adding a device-side bounds check would require passing `batch_size` into the CUDA kernel; this is tracked as a follow-up.
Note: Using `.InputMemoryType(OrtMemTypeCPUInput, 2)` was considered but rejected because it would force a GPU→CPU transfer of `batch_indices`, breaking CUDA graph capture for models like Masked R-CNN where `batch_indices` is produced by upstream GPU ops.
## Validation
- Full `RoiAlignTest.*` suite passes (12/12 tests) on CPU build
- Full `RoiAlignTest.*` suite passes (12/12 tests) on CUDA build
- No regressions in existing positive or negative tests
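The per-element check can be sketched in isolation as follows. This is a simplified stand-in rather than the code in `roialign.cc`: `ValidateBatchIndices` is a hypothetical function that takes a host-side vector and returns an error string (empty on success) instead of an ORT `Status`, and it assumes the indices are already readable on the host.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical validator: every index must satisfy 0 <= value < batch_size.
// Returns an empty string on success, otherwise a descriptive message.
inline std::string ValidateBatchIndices(
    const std::vector<std::int64_t>& batch_indices, std::int64_t batch_size) {
  for (std::size_t i = 0; i < batch_indices.size(); ++i) {
    const std::int64_t value = batch_indices[i];
    if (value < 0 || value >= batch_size) {
      return "batch_indices[" + std::to_string(i) + "] = " +
             std::to_string(value) + " is outside [0, " +
             std::to_string(batch_size) + ")";
    }
  }
  return std::string();
}
```

The two new tests map directly onto this check: `{1}` with `batch_size = 1` fails the upper bound, and `{-1}` fails the lower bound.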
## Summary
Generalize the WebNN-specific DequantizeLinear → MatMulNBits graph fusion transformer so it can be reused by other execution providers (e.g. NvTensorRTRTX), and add defensive shape/size validation to prevent crashes on malformed tensors.
### Fusion patterns
**Pattern 1:** `DequantizeLinear → Reshape → Transpose → [Cast] → MatMul/Gemm` → **MatMulNBits**
**Pattern 2:** `DequantizeLinear (axis=0) → MatMul/Gemm` → **MatMulNBits**
---------
Co-authored-by: praneshgo <227579474+praneshgo@users.noreply.github.com>
edgchen1
approved these changes
Mar 5, 2026
kunal-vaishnavi
approved these changes
Mar 5, 2026
This cherry-picks the following commits for the release: