ORT 1.24.3 release cherry pick round 1#27476
Merged
tianleiwu merged 21 commits into rel-1.24.3 (Feb 27, 2026)
Conversation
… is provided (#27318)

Fixes #27317. When running inside a Blob URL Web Worker with `wasmBinary` provided and `numThreads=1`, `isSameOrigin(scriptSrc)` can fail because blob: URLs have opaque origins. This causes a fallback to a dynamic `import('./ort-wasm-simd-threaded.mjs')`, which doesn't exist in that context. Since `wasmBinary` is already provided and no worker spawning is needed (single-threaded), the embedded Emscripten module can be used directly; no URL resolution or same-origin check is needed.

**Change:** One line in `wasm-utils-import.ts` line 275:

```typescript
// Before:
useEmbeddedModule = isSameOrigin(scriptSrc);

// After:
useEmbeddedModule = isSameOrigin(scriptSrc) || (isWasmOverridden && !isMultiThreaded);
```

This extends the existing pattern from the `!scriptSrc` case (line 268) to also apply when `scriptSrc` is available but fails the same-origin check. The condition (`wasmBinary` provided + single-threaded) guarantees no file resolution or worker spawning is needed.
#27413) Fix Python refcount bug in map input conversion that caused a shutdown segfault in `onnxruntime_test_python_mlops.py` (see #27392).

## Summary

This PR fixes a Python reference-count ownership bug in the map conversion path in `onnxruntime/python/onnxruntime_pybind_mlvalue.cc`. In Python 3.14 test runs, the bug could manifest as a segmentation fault after tests completed (typically at interpreter shutdown), even when test assertions passed.

## Root Cause

In `CreateMapMLValue_LoopIntoMap`, error paths decremented `item` unconditionally.

- In the single-map flow, `item` is a **borrowed reference** (must not be decref'd there).
- In the iterator/vector-map flow, `item` is an **owned reference** (must be decref'd).

The unconditional decref in the borrowed-reference flow caused refcount corruption and eventually a crash.

## Fix

Add explicit ownership handling for `item`:

- `CreateMapMLValue_LoopIntoMap(..., bool owns_item_ref, ...)`
- Pass `owns_item_ref = false` from the single-map path (`CreateMapMLValue_Map`)
- Pass `owns_item_ref = true` from the vector-map path (`CreateMapMLValue_VectorMap`)
- Only `Py_XDECREF(item)` on error when `owns_item_ref` is true

This preserves existing behavior while correcting reference ownership.

## Validation

```bash
cd onnxruntime/test/python
python onnxruntime_test_python_mlops.py
```

Result:

- `OK`
- Exit code `0` (no shutdown segfault)

## Notes

- Although this became reproducible in Python 3.14, the underlying refcount bug is version-agnostic C-extension undefined behavior.
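The ownership rule behind the fix can be illustrated with a small Python sketch (a toy refcount model with assumed names, not the actual C-API code):

```python
class Obj:
    """Toy stand-in for a PyObject with a manual refcount."""
    def __init__(self):
        self.refcount = 1  # the caller holds one reference

def convert_item(item, owns_item_ref, fail=False):
    """Mirrors the fixed error path: release `item` (Py_XDECREF) only
    when this call owns the reference (vector-map flow); a borrowed
    reference (single-map flow) must be left alone."""
    if fail:
        if owns_item_ref:
            item.refcount -= 1  # owned reference: release on error
        return False  # error signalled to caller
    return True
```

Decref'ing the borrowed reference here (the pre-fix behavior) would drop the caller's refcount to 0 while the caller still uses the object, which is exactly the corruption that surfaced at interpreter shutdown.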
…27421)

### Description

Fix the dynamic QGEMM pack B size. The byte assignment was accidentally removed in a previous commit, which caused test failures with the following error:

```
C++ exception with description "Dynamic QGEMM requires non-null PackedB pointer." thrown in the test body.
```

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
### Description

This PR addresses two issues related to the newly added `DllImportResolver` for `.NET` native library loading:

1. **Fix IL3000 Warning during Native AOT / Single-File Publish**

   When publishing projects that reference `Microsoft.ML.OnnxRuntime` as a single file or using Native AOT in .NET 9, the compiler reports an `IL3000` warning/error because `DllImportResolver` accesses `Assembly.Location`. In these deployment models, `Assembly.Location` always returns an empty string or throws. Since `DllImportResolver` already correctly handles the empty-string failure and falls back to `AppContext.BaseDirectory` (which is fully supported), this PR adds the `[UnconditionalSuppressMessage]` attribute to suppress the build warning statically.

2. **Fix `TypeInitializationException` in `NativeMethods` Static Constructor**

   Users reported a `System.TypeInitializationException: The type initializer for 'Microsoft.ML.OnnxRuntime.NativeMethods' threw an exception.` when initializing the ONNX Runtime environment. This occurs because the `DllImportResolver` (registered in the static constructor) is invoked on the first P/Invoke (`OrtGetApiBase`). If any API within the resolver throws an unhandled exception (for instance, `AppContext.BaseDirectory` throwing `AppDomainUnloadedException` in sandboxed AppDomains, or `Environment.GetEnvironmentVariable` throwing `SecurityException`), the exception bubbles up and crashes the application with a type initialization failure. This PR wraps the `DllImportResolver` logic in a `try-catch` block (specifically handling `AppContext.BaseDirectory` edge cases) so that any resolution failure safely swallows the error and falls back to `IntPtr.Zero`, allowing the default .NET Platform Invoke mechanism to take over and throw a standard `DllNotFoundException` instead of a fatal type initialization crash.
A unit test (`TestDllImportResolverDoesNotThrow`) has been added to `OrtEnvTests.cs` to verify that `DllImportResolver` successfully swallows internal exceptions without crashing the initialization process.

### Motivation and Context

These changes ensure that .NET developers can safely compile Native AOT/single-file applications without build errors, and prevent hard application crashes in environments with restricted permissions.
### Description

Fix an incorrect scales element count when pre-packing scales while processing the B input in the `Prepack()` method of the MatMulNBits operator.

### Motivation and Context

Fix a potential crash due to the incorrect element count.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
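For context, the element count being fixed follows from the block-wise quantization layout. A minimal sketch of the arithmetic (an illustration under the assumption of one scale per `block_size` elements along K, per column; the function name is ours, not ORT's):

```python
def scales_element_count(n, k, block_size):
    """Number of scale values for a block-wise quantized B of shape [K, N]:
    one scale per block of `block_size` elements along K, for each of the
    N columns. K that is not a multiple of block_size gets a partial block."""
    k_blocks = (k + block_size - 1) // block_size  # ceil(K / block_size)
    return n * k_blocks
```

Undercounting here means the prepack copy reads or writes past the scales buffer, which matches the "potential crash" described above.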
…27430)

### Description

This PR fixes the validation of external data paths when ONNX models are loaded from bytes (in-memory). Previously, when a model was loaded from bytes without an explicit external data folder path being set, paths using `..` sequences were not properly validated, potentially allowing access to arbitrary files on the filesystem.

### Motivation and Context

Addresses a security concern.
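The class of check being added can be sketched as follows (a generic path-containment illustration, not ORT's actual implementation; the function name is ours):

```python
import os

def is_safe_external_path(base_dir, relative_path):
    """Reject external-data paths that escape base_dir via '..' segments:
    resolve the candidate against the base directory and require that the
    result still lies under it."""
    base = os.path.abspath(base_dir)
    target = os.path.abspath(os.path.join(base, relative_path))
    return os.path.commonpath([base, target]) == base
```

Without resolving `..` before the containment check, a path like `../../etc/passwd` would pass a naive prefix test on the unresolved string.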
This pull request enables Python 3.14 testing in the CI pipelines and
upgrades several key dependencies to support the new Python version.
Previously, Python 3.14 CI was not enabled because some dependent packages
did not support Python 3.14 at that time. They now do, so this change
upgrades them.
Key Python dependencies have been updated to versions that support
Python 3.14. The conditional version checks (based on `python_version`)
have been removed in favor of these updated versions across all
environments:
- **pybind**: Upgraded to `3.0.2`.
- **numpy**: Upgraded to `2.4.2`.
- **onnxscript**: Upgraded to `0.6.2`.
- **onnx-ir**: Upgraded to `0.1.16`.
- **onnx**: Standardized on `1.20.1`.
- **torch**: Upgraded to `2.10.0`.
- **triton**: Upgraded to `3.5.0`.
These updates affect multiple `requirements.txt` files across Linux and
Windows Docker images and build stages.
- Use `dynamo=False` for ONNX export in failing Python tests, since
PyTorch 2.10 changed the default to `dynamo=True`, which broke a few
test cases.
The conditional logic that previously skipped Python 3.14 tests has been
removed from the Azure Pipelines configuration.
- **Python 3.14 Tests Enabled**: Removed `condition: and(succeeded(),
ne('${{ parameters.PYTHON_VERSION }}', '3.14'))` from
`py-win-webgpu-stage.yml`.
- **Test Execution Flow**: Updated `py-win-cpu.yml` to remove the
restriction that prevented `onnxruntime` tests and
`onnx_backend_test_series.py` from running on Python 3.14.
#27392
### Description

This PR fixes:

* An out-of-bounds write in CUDA Resize for LINEAR mode when running trilinear paths (3D/5D)
* A race condition in the reduction kernel

### Root cause

1. The temporary dims-mapping buffer for LINEAR mode was sized using only H+W, while the trilinear coordinate mapping kernel writes D+H+W entries.
2. A shared-memory race in the block-level reduction loop inside `reduction_functions.cu`. The condition allowed threads outside the active lower half to update shared memory in the same stride phase, creating overlapping read/write hazards.

My colleague @korbinian-mechlem-snkeos noticed these warnings from compute-sanitizer:

```
========= Invalid __global__ write of size 4 bytes
=========     at void onnxruntime::cuda::_ResizeTrilinearCoordinateMapping<float, onnxruntime::cuda::TransformCoordinate_HALF_PIXEL>(long long, long long, long long, long long, long long, long long, float, float, float, float, float, float, float, float, float, unsigned long long, bool, const T2 &, onnxruntime::cuda::LinearMappingInfo *)+0x400
=========     by thread (17,0,0) in block (2,0,0)
=========     Address 0xb28fff7cc is out of bounds
=========     and is 205 bytes after the nearest allocation at 0xb28fff400 of size 768 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
```

and:

```
========= Warning: Race reported between Read access at void onnxruntime::cuda::detail::reduce_matrix_columns_kernel<float, float, float, onnxruntime::cuda::Identity, onnxruntime::cuda::Identity, (bool)0>(int, int, const T1 *, T2 *, T3 *, int *)+0xe80
=========     and Write access at void onnxruntime::cuda::detail::reduce_matrix_columns_kernel<float, float, float, onnxruntime::cuda::Identity, onnxruntime::cuda::Identity, (bool)0>(int, int, const T1 *, T2 *, T3 *, int *)+0xea0 [337920 hazards]
```

### Motivation and Context

Update the LINEAR buffer size calculation to:

* use H+W for bilinear (2D/4D)
* use D+H+W for trilinear (3D/5D)

Prevents invalid global writes and intermittent CUDA memory errors in trilinear resize workloads. @johannes-rehm-snkeos
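The corrected sizing rule can be sketched as (an illustration of the element count only; the function name and shape handling are assumptions, not ORT's actual code):

```python
def linear_mapping_buffer_elems(output_dims, is_trilinear):
    """Element count for the per-axis coordinate-mapping scratch buffer.
    Bilinear needs one entry per output H and W position (H+W); trilinear
    also maps D, so it needs D+H+W. The bug was sizing the buffer as H+W
    while the trilinear kernel wrote D+H+W entries."""
    if is_trilinear:
        d, h, w = output_dims[-3:]
        return d + h + w
    h, w = output_dims[-2:]
    return h + w
```

With the old H+W sizing, a 5D resize with output D=4, H=8, W=16 allocates 24 entries but the kernel writes 28, producing exactly the kind of out-of-bounds write compute-sanitizer reported.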
…ead/Write (#27444)

### Description

This pull request improves the robustness and correctness of the CPU implementation of the Gather operator in ONNX Runtime. The key changes focus on preventing integer overflow issues in parallel processing and output shape calculations, as well as enhancing test coverage to verify these safeguards.

Enhancements to overflow handling and parallel processing:

* Changed the lambda function in `GatherCopyData` to use `ptrdiff_t` instead of `int64_t` for the index, and explicitly cast the `batch` and `i` variables, ensuring safer arithmetic for large tensor sizes.
* Updated the parallel loop in `GatherCopyData` to iterate using `ptrdiff_t` indices, preventing potential overflow when processing large tensors.

Testing improvements:

* Added a new unit test `Gather_overflow_check` in `gather_op_test.cc` to verify that the Gather operator correctly handles very large output shapes without overflowing, specifically testing dimensions that exceed the 32-bit integer limit.
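To see why 32-bit indexing is unsafe here, consider Gather's output element count (a sketch of the standard shape rule; the function name is ours): the output is the input shape with the gathered axis replaced by the indices shape, so modest per-axis sizes multiply into counts past `INT32_MAX`.

```python
INT32_MAX = 2**31 - 1

def gather_output_elems(input_shape, indices_shape, axis):
    """Total output elements for Gather: input dims with the gathered
    axis replaced by the full indices shape."""
    out = list(input_shape[:axis]) + list(indices_shape) + list(input_shape[axis + 1:])
    n = 1
    for d in out:
        n *= d
    return n
```

A 32-bit loop counter over such an output wraps around, which is why the flat index arithmetic was widened.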
### Description

Fixes issue 1 described in #27317.
### Description

Remove the "Unknown CPU vendor" warning for WebAssembly. CPU info is not supposed to be exposed in a browser environment, so having no CPU info at runtime is expected. Disable the confusing warning message for WebAssembly.

### Motivation and Context

Fixes #27336
Fixes #23828

### Description

This PR fixes a division-by-zero crash in the GatherND operator when batch dimensions mismatch between the input and indices tensors.

Changes made:

- Added validation in `gather_nd.cc` to check that `num_batches` is not zero before division
- Added validation that `num_slices` is divisible by `num_batches`
- Added a unit test to verify the fix

### Motivation and Context

When `batch_dims` is set but the actual batch dimensions of the input tensor and indices tensor don't align correctly, the code performs a division that can result in division by zero, causing a crash. For example, with:

- Input shape: `[2, 2, 2]`
- Indices shape: `[2, 1]`
- `batch_dims=1`

The calculation `num_slices / num_batches` would crash if `num_batches` is 0, or produce unexpected results if they don't divide evenly. This fix returns a clear error message instead of crashing.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
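The two validations can be sketched like this (an illustration that returns an error message instead of a status object, mirroring the fail-fast behavior; the function name and messages are assumptions):

```python
def check_gathernd_batches(num_slices, num_batches):
    """Validate before dividing, as the fix does: a zero batch count or a
    non-divisible slice count means the batch dims don't align. Returns an
    error message on failure, None when the division is safe."""
    if num_batches == 0:
        return "num_batches must not be zero"
    if num_slices % num_batches != 0:
        return "num_slices must be divisible by num_batches"
    return None  # safe to compute num_slices // num_batches
```

Returning a clear error turns what was a process crash into an ordinary failed `Status` the caller can surface.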
This PR addresses several issues in the QMoE CPU implementation and improves MLAS documentation.

## Changes

### 1. QMoE CPU Operator Fixes

- **Corrected Bias Handling**: Renamed `fc2_bias_handled_by_q4_gemm` to `fc2_bias_added_by_mlas` and updated the logic to consistently track whether FC2 bias has been applied. This ensures that bias is not double-counted or missed when using `DirectQ4Gemm`.
- **SwiGLU Attribute Update**: Switched from `swiglu_interleaved` to `swiglu_fusion` in both the C++ operator and the Python test infrastructure to align with the latest QMoE implementation standards.

### 2. MLAS Documentation

- **Clarified Buffer Shapes**: Added explicit documentation to `MlasQ4GemmPackB` to specify that the input `FpData` buffer expects a shape of `[K, N]`. This helps prevent layout-related errors in future integrations.

### 3. Test Updates

- **PyTorch Parity Fixes**: Refactored `onnxruntime/test/python/transformers/test_qmoe_cpu.py` to use `swiglu_fusion` and improved the test structure for better parity checks with PyTorch.

## Verification

- Verified by running `test_qmoe_cpu.py` to ensure all QMoE parity tests pass on CPU.
## Summary

This change improves QMoE CPU performance by moving more work to prepack time and enabling the DirectQ4 GEMM fast path where appropriate, while preserving an env-var switch for performance/accuracy A/B testing.

This PR introduces:

- Prepack and cache infrastructure for QMoE expert weights.
- DirectQ4 packed-B cache built during prepack (instead of a mutable runtime cache in `Compute()`).
- Fast-path support for block-wise cases (including block size 32 where supported by the MLAS Q4 type).
- Runtime toggle via `ORT_USE_MLAS_Q4_GEMM_MOE`.
- Default fast-path policy refined to avoid known accuracy-loss scenarios unless explicitly overridden by the env var.
- Test and benchmark refinements for QMoE CPU validation.

## Key Implementation Changes

### 1. Prepack-time cache build

- Moves DirectQ4 packed-B cache construction to the prepack stage.
- Removes mutable runtime cache maintenance from `Compute()`.
- Reduces per-inference overhead and avoids mutable shared cache complexity.

### 2. Fast path vs fallback

- Keeps two execution modes:
  - DirectQ4 GEMM fast path (`MlasQ4GemmPackB` + `DirectQ4Gemm` cache usage).
  - Fallback path (`DequantizePrePacked` + `MlasGemm`).
- Allows controlled fallback for accuracy-sensitive configurations.

### 3. Environment variable behavior

- `ORT_USE_MLAS_Q4_GEMM_MOE=1`: force the fast path when supported.
- `ORT_USE_MLAS_Q4_GEMM_MOE=0`: force the fallback path.
- Unset: use the default policy that enables the fast path unless a known accuracy-loss pattern is detected.

### 4. Test updates

- QMoE CPU tests were refined to validate env-var on/off behavior and no-env behavior.
- Coverage includes parity checks for symmetric/asymmetric and row-wise/block-wise settings.

## Benchmark Results (1000 inferences, `benchmark_qmoe.py`)

Note: PyTorch latency fluctuates across runs and is excluded from the conclusions below.
### ORT results comparison

| Config | Baseline ORT Time (ms) | Baseline ORT tok/s | New ORT Time (env=0) (ms) | New ORT tok/s (env=0) | New ORT Time (env=1) (ms) | New ORT tok/s (env=1) |
|---|---:|---:|---:|---:|---:|---:|
| Medium-4bit | 748.594 | 1.3 | 237.219 | 4.2 | 178.943 | 5.6 |
| Medium-8bit | 209.277 | 4.8 | 212.074 | 4.7 | 203.882 | 4.9 |

### ORT speedup vs baseline

| Config | env=0 speedup vs baseline (time) | env=1 speedup vs baseline (time) |
|---|---:|---:|
| Medium-4bit | 3.16x faster | 4.18x faster |
| Medium-8bit | 0.99x (about flat) | 1.03x faster |

## Accuracy Notes

- `env=1` (forced fast path) provides the best 4-bit performance but may show a non-zero max diff in known cases.
- `env=0` (fallback) maintains parity behavior with zero observed max diff in the reported benchmark table.
- The default no-env policy is designed to avoid known accuracy-loss cases while still enabling the fast path where safe.
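The env-var policy described above can be sketched as a small decision function (an illustrative model only; the function name, parameters, and the accuracy-loss predicate are assumptions, not ORT internals — only the variable name `ORT_USE_MLAS_Q4_GEMM_MOE` comes from the PR):

```python
import os

def use_direct_q4_fast_path(supported, known_accuracy_loss, env=None):
    """Pick the QMoE execution mode: an explicit env-var setting wins
    when the fast path is supported; otherwise the default policy enables
    the fast path unless a known accuracy-loss pattern is detected."""
    if env is None:
        env = os.environ.get("ORT_USE_MLAS_Q4_GEMM_MOE", "")
    if not supported:
        return False  # hardware/config cannot run DirectQ4 at all
    if env == "1":
        return True   # forced fast path
    if env == "0":
        return False  # forced fallback (DequantizePrePacked + MlasGemm)
    return not known_accuracy_loss  # default policy
```

Passing `env` explicitly makes the policy testable without mutating the process environment.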
…1D (#27459)

### Description

The `SkipLayerNormFusion` optimizer now skips fusion when the `LayerNormalization` gamma or beta inputs are not 1D tensors (e.g. shape `[1, 1, hidden_size]`). The `SkipLayerNormalization` kernel strictly requires 1D gamma/beta, so fusing without this check caused a hard runtime error.

- **`skip_layer_norm_fusion.cc`**: After matching the Add+LayerNorm pattern, check that gamma (and beta, if present) have exactly 1 dimension before proceeding with fusion. If shape info is unavailable (dynamic), fusion is allowed and runtime validation takes over.
- **`graph_transform_test_layernorm.cc`**: Added a `SkipLayerNormFusion_3DGamma_NoFusion` test, which builds a graph with `Add + LayerNormalization` where gamma/beta are `[1, 1, 4]` and asserts that no `SkipLayerNormalization` node is created.

### Motivation and Context

Models with residual connections followed by `LayerNormalization` where the scale/bias tensors carry extra batch/sequence dimensions (e.g. exported as `[1, 1, hidden_size]` rather than `[hidden_size]`) would trigger fusion and then fail at runtime:

```
Non-zero status code returned while running SkipLayerNormalization node. Status Message: gamma is expected to have 1 dimension, got 3
```

The error only appeared with 3D inputs and disappeared at the `ORT_ENABLE_BASIC` optimization level (which disables the fusion), confirming the optimizer as the source of the regression.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
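The new fusion guard boils down to a simple rank check, which can be sketched as (an illustration; the function name and shape encoding are ours — `None` stands in for an unknown/dynamic shape):

```python
def can_fuse_skip_layer_norm(gamma_shape, beta_shape=None):
    """Guard added by the fix: gamma (and beta, if present) must be
    exactly 1-D for SkipLayerNormalization. Unknown (None) shapes are
    allowed through, deferring to runtime validation."""
    for shape in (gamma_shape, beta_shape):
        if shape is not None and len(shape) != 1:
            return False  # e.g. [1, 1, hidden_size]: skip fusion
    return True
```

A `[1, 1, 4]` gamma is numerically equivalent to `[4]` under broadcasting, but the kernel rejects it by rank, so the optimizer must not produce that node.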
### Description

- Add an ARM64X build to the packaging pipeline. An additional zip archive artifact with the ARM64X binaries will be produced.
- Add a basic C++ sample program.
- Add binary archive tests using the C++ sample program to the package test pipeline.

### Motivation and Context

Address a request for ARM64X binaries. Add testing of binary archives to the package test pipeline.
### Description

Allow a new memory info name for WebGPU.

### Motivation and Context

This allows 1.24.3 to work with a future (1.25.x) WebGPU plugin DLL.
apsonawane
approved these changes
Feb 27, 2026
This cherry-picks the following commits for the release: