ORT 1.24.3 release cherry pick round 1#27476

Merged
tianleiwu merged 21 commits into rel-1.24.3 from tlwu/rel-1.24.3
Feb 27, 2026
Conversation

@tianleiwu
Contributor

This cherry-picks the following commits for the release:

Commit ID PR Number Commit Title
decd177 #27090 Fix GatherND division by zero when batch dimensions mismatch
55f8234 #27360 Fix QMoE CPU Operator
df9146f #27403 [MLAS] Adding DynamicQGemm function pointers and ukernel interface
0f93853 #27318 [js/web] Use embedded WASM module in Blob URL workers when wasmBinary is provided
b2a6e69 #27364 QMoE CPU Performance Update (Up to 4x on 4-bit)
f501e1d #27413 Fix refcount bug in map input conversion that caused shutdown segfault
b32b205 #27421 Fix error where bytes is not assigned for dynamic qgemm pack b size
426b006 #27397 Fix DllImportResolver
0982844 #27412 MatmulNBits prepacking scales fix
9afb0d2 #27430 Fix validation for external data paths for models loaded from bytes
71d2cd0 #27401 Enable Python 3.14 CI and Upgrade Dependencies
79e0676 #27419 fix: out of bounds access for resize operation
82eb99c #27459 Fix SkipLayerNorm fusion incorrectly applied when gamma/beta are not 1D
355278a #27444 Fix GatherCopyData Integer Truncation Leading to Heap Out-of-Bounds Read/Write
cf96123 #27411 [web] fix usage of wasmBinary together with a blob URL for .mjs
1131a86 #27399 [web] remove the unhelpful "Unknown CPU vendor" warning.
ffbbc4f #27316 Build Windows ARM64X binaries as part of packaging pipeline

tianleiwu and others added 21 commits February 26, 2026 14:11
…27403)

### Description
* Adding function pointer overrides to KleidiAI DynamicQGemm
* Making use of ukernel interface for DynamicQGemm to select between SME
and SME2 variants

### Motivation and Context
Fixes #26377
… is provided (#27318)

Fixes #27317

When running inside a Blob URL Web Worker with `wasmBinary` provided and
`numThreads=1`, `isSameOrigin(scriptSrc)` can fail because blob: URLs
have opaque origins. This causes a fallback to dynamic
`import('./ort-wasm-simd-threaded.mjs')` which doesn't exist in that
context.

Since `wasmBinary` is already provided and no worker spawning is needed
(single-threaded), the embedded Emscripten module can be used directly —
no URL resolution or same-origin check is needed.

**Change:** One line in `wasm-utils-import.ts` line 275:
```typescript
// Before:
useEmbeddedModule = isSameOrigin(scriptSrc);

// After:
useEmbeddedModule = isSameOrigin(scriptSrc) || (isWasmOverridden && !isMultiThreaded);
```

This extends the existing pattern from the `!scriptSrc` case (line 268)
to also apply when `scriptSrc` is available but fails same-origin
checks. The condition (`wasmBinary` provided + single-threaded)
guarantees no file resolution or worker spawning is needed.
#27413)

Fix a Python refcount bug in map input conversion that caused a shutdown
segfault in `onnxruntime_test_python_mlops.py` (see
#27392).

## Summary
This PR fixes a Python reference-count ownership bug in the map
conversion path in `onnxruntime/python/onnxruntime_pybind_mlvalue.cc`.

In Python 3.14 test runs, the bug could manifest as a segmentation fault
after tests completed (typically at interpreter shutdown), even when
test assertions passed.

## Root Cause
In `CreateMapMLValue_LoopIntoMap`, error paths decremented `item`
unconditionally.

- In single-map flow, `item` is a **borrowed reference** (must not be
decref'd there).
- In iterator/vector-map flow, `item` is an **owned reference** (must be
decref'd).

The unconditional decref in borrowed-reference flow caused refcount
corruption and eventually a crash.

## Fix
Add explicit ownership handling for `item`:

- `CreateMapMLValue_LoopIntoMap(..., bool owns_item_ref, ...)`
- Pass `owns_item_ref = false` from single-map path
(`CreateMapMLValue_Map`)
- Pass `owns_item_ref = true` from vector-map path
(`CreateMapMLValue_VectorMap`)
- Only `Py_XDECREF(item)` on error when `owns_item_ref` is true

This preserves existing behavior while correcting reference ownership.
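
The ownership rule in the fix can be illustrated with a small Python analogue (the class and function names below are invented for illustration; the real fix is C++ code against the Python C API):

```python
class Ref:
    """Toy stand-in for a Python C-API object reference."""
    def __init__(self):
        self.refcount = 1

    def decref(self):
        self.refcount -= 1


def convert_item(item, owns_item_ref, fail=False):
    """Mirror of the fixed error path: release the reference only
    when we own it. Decref'ing a borrowed reference corrupts the
    caller's count -- the original bug."""
    try:
        if fail:
            raise ValueError("conversion failed")
        return "ok"
    except ValueError:
        if owns_item_ref:   # owned (iterator/vector-map flow): release
            item.decref()
        return None         # borrowed (single-map flow): caller owns it


borrowed = Ref()
convert_item(borrowed, owns_item_ref=False, fail=True)
print(borrowed.refcount)  # 1 -- caller's reference stays intact

owned = Ref()
convert_item(owned, owns_item_ref=True, fail=True)
print(owned.refcount)     # 0 -- our reference was released
```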

## Validation
```bash
cd onnxruntime/test/python
python onnxruntime_test_python_mlops.py
```

Result:
- `OK`
- Exit code `0` (no shutdown segfault)

## Notes
- Although this became reproducible in Python 3.14, the underlying
refcount bug is version-agnostic C-extension undefined behavior.
…27421)

### Description
Fix for the dynamic qgemm pack B size. The byte assignment was
accidentally removed in a previous commit, which caused test failures
with the following error message:

> C++ exception with description "Dynamic QGEMM requires non-null PackedB
> pointer." thrown in the test body.

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
### Description
This PR addresses two issues related to the newly added
`DllImportResolver` for `.NET` native library loading:

1. **Fix IL3000 Warning during Native AOT / Single-File Publish**
When publishing projects that reference `Microsoft.ML.OnnxRuntime` as a
single file or using Native AOT in .NET 9, the compiler reports an
`IL3000` warning/error because `DllImportResolver` accesses
`Assembly.Location`. In these deployment models, `Assembly.Location`
always returns an empty string or throws.
Since `DllImportResolver` already correctly handles the empty string
failure and falls back to `AppContext.BaseDirectory` (which is fully
supported), this PR adds the `[UnconditionalSuppressMessage]` attribute
to suppress the build warning statically.

2. **Fix `TypeInitializationException` in `NativeMethods` Static
Constructor**
Users reported a `System.TypeInitializationException: The type
initializer for 'Microsoft.ML.OnnxRuntime.NativeMethods' threw an
exception.` when initializing the ONNX Runtime environment.
This occurs because the `DllImportResolver` (registered in the static
constructor) is invoked on the first P/Invoke (`OrtGetApiBase`). If any
API within the resolver throws an unhandled exception (for instance,
`AppContext.BaseDirectory` throwing `AppDomainUnloadedException` in
sandboxed AppDomains or `Environment.GetEnvironmentVariable` throwing
`SecurityException`), the exception bubbles up and crashes the
application with a type initialization failure.
This PR wraps the `DllImportResolver` logic in a `try-catch` block
(specifically handling `AppContext.BaseDirectory` edge cases) so that
any resolution failure safely swallows the error and falls back to
`IntPtr.Zero`, allowing the default .NET Platform Invoke mechanism to
take over and throw a standard `DllNotFoundException` instead of a fatal
type initialization crash.
A unit test (`TestDllImportResolverDoesNotThrow`) has been added to
`OrtEnvTests.cs` to verify that `DllImportResolver` successfully
swallows internal exceptions without crashing the initialization
process.
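
The exception-swallowing pattern described above can be sketched in Python (a language analogue of the C# fix; `safe_resolver` and the `0` sentinel standing in for `IntPtr.Zero` are illustrative, not the actual code):

```python
def safe_resolver(resolve):
    """Wrap a library resolver so that any exception inside it is
    swallowed and a 'not resolved' sentinel is returned, letting the
    default loader take over instead of crashing initialization."""
    def wrapped(library_name):
        try:
            return resolve(library_name)
        except Exception:
            return 0  # sentinel: fall back to default resolution
    return wrapped


@safe_resolver
def resolver(name):
    # Simulates AppContext.BaseDirectory or environment access
    # throwing in a restricted environment.
    raise PermissionError("restricted environment")


print(resolver("onnxruntime"))  # 0 -- no crash, default loader used
```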

### Motivation and Context
These changes ensure that .NET developers can safely compile Native
AOT/Single-File applications without build errors and prevent hard
application crashes in environments with restricted permissions.
### Description
Fix an incorrect scales element count when pre-packing scales while
processing the B input in the `Prepack()` method of the MatmulNBits
operator.

### Motivation and Context
Fix a potential crash due to the incorrect element count.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…27430)

### Description
This PR fixes the validation of external data paths when ONNX models are
loaded from bytes (in-memory). Previously, when a model was loaded from
bytes without an explicit external data folder path being set, paths
using ".." sequences were not properly validated, potentially allowing
access to arbitrary files on the filesystem.
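
A minimal sketch of the kind of containment check involved, assuming validation against the model's base directory (illustrative Python, not the actual ONNX Runtime C++ validation):

```python
import os

def is_within_base(base_dir, external_path):
    """Reject external-data paths that escape the model's base
    directory via '..' sequences."""
    base = os.path.abspath(base_dir)
    target = os.path.abspath(os.path.join(base, external_path))
    # The resolved target must still live under the base directory.
    return os.path.commonpath([base, target]) == base

print(is_within_base("/models", "weights.bin"))       # True
print(is_within_base("/models", "../../etc/passwd"))  # False
```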


### Motivation and Context
Addresses a security concern.
This pull request enables Python 3.14 testing in the CI pipelines and
upgrades several key dependencies to support the new Python version.

Previously, Python 3.14 CI was not enabled because some dependent
packages did not yet support Python 3.14. Those packages have since
added support, so it is now time to upgrade them.
Key Python dependencies have been updated to versions that support
Python 3.14. The conditional version checks (based on `python_version`)
have been removed in favor of these updated versions across all
environments:
- **pybind**: Upgraded to `3.0.2`.
- **numpy**: Upgraded to `2.4.2`.
- **onnxscript**: Upgraded to `0.6.2`.
- **onnx-ir**: Upgraded to `0.1.16`.
- **onnx**: Standardized on `1.20.1`.
- **torch**: Upgraded to `2.10.0`.
- **triton**: Upgraded to `3.5.0`.
These updates affect multiple `requirements.txt` files across Linux and
Windows Docker images and build stages.
- Use `dynamo=False` for ONNX export in failing Python tests, since
PyTorch 2.10 changed the default to `dynamo=True`, which broke a few
test cases.
The conditional logic that previously skipped Python 3.14 tests has been
removed from the Azure Pipelines configuration.
- **Python 3.14 Tests Enabled**: Removed `condition: and(succeeded(),
ne('${{ parameters.PYTHON_VERSION }}', '3.14'))` from
`py-win-webgpu-stage.yml`.
- **Test Execution Flow**: Updated `py-win-cpu.yml` to remove the
restriction that prevented `onnxruntime` tests and
`onnx_backend_test_series.py` from running on Python 3.14.
#27392
### Description
This PR fixes:
* An out-of-bounds write in CUDA Resize for LINEAR mode when running
trilinear paths (3D/5D)
* A race condition for the reduction kernel

### Root cause
1. The temporary dims-mapping buffer for LINEAR mode was sized using
only H+W, while the trilinear coordinate mapping kernel writes D+H+W
entries.
2. A shared-memory race in the block-level reduction loop inside
`reduction_functions.cu`. The condition allowed threads outside the
active lower half to update shared memory in the same stride phase,
creating overlapping read/write hazards.

My colleague @korbinian-mechlem-snkeos noticed these warnings from
compute-sanitizer:
> ========= Invalid __global__ write of size 4 bytes
> =========     at void onnxruntime::cuda::_ResizeTrilinearCoordinateMapping<float, onnxruntime::cuda::TransformCoordinate_HALF_PIXEL>(long long, long long, long long, long long, long long, long long, float, float, float, float, float, float, float, float, float, unsigned long long, bool, const T2 &, onnxruntime::cuda::LinearMappingInfo *)+0x400
> =========     by thread (17,0,0) in block (2,0,0)
> =========     Address 0xb28fff7cc is out of bounds
> =========     and is 205 bytes after the nearest allocation at 0xb28fff400 of size 768 bytes
> ========= Saved host backtrace up to driver entry point at kernel launch time

AND

> ========= Warning: Race reported between Read access at void onnxruntime::cuda::detail::reduce_matrix_columns_kernel<float, float, float, onnxruntime::cuda::Identity, onnxruntime::cuda::Identity, (bool)0>(int, int, const T1 *, T2 *, T3 *, int *)+0xe80
> =========     and Write access at void onnxruntime::cuda::detail::reduce_matrix_columns_kernel<float, float, float, onnxruntime::cuda::Identity, onnxruntime::cuda::Identity, (bool)0>(int, int, const T1 *, T2 *, T3 *, int *)+0xea0 [337920 hazards]

### Motivation and Context
Update LINEAR buffer size calculation to:
* use H+W for bilinear (2D/4D)
* use D+H+W for trilinear (3D/5D)

Prevents invalid global writes and intermittent CUDA memory errors in
trilinear resize workloads.
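
The sizing rule can be sketched as follows (an illustrative Python helper, not the CUDA implementation; `linear_mapping_entries` is an invented name):

```python
def linear_mapping_entries(shape):
    """Number of coordinate-mapping entries the LINEAR resize path
    needs: H+W for bilinear (2D/4D), D+H+W for trilinear (3D/5D).
    Sizing a 5D input with only H+W under-allocates by D entries --
    the out-of-bounds write described above."""
    if len(shape) in (2, 4):        # bilinear: [..., H, W]
        h, w = shape[-2], shape[-1]
        return h + w
    if len(shape) in (3, 5):        # trilinear: [..., D, H, W]
        d, h, w = shape[-3], shape[-2], shape[-1]
        return d + h + w
    raise ValueError("unsupported rank for LINEAR resize")

print(linear_mapping_entries((1, 1, 8, 16, 16)))  # 40 = 8+16+16
print(linear_mapping_entries((1, 1, 16, 16)))     # 32 = 16+16
```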

@johannes-rehm-snkeos
…ead/Write (#27444)

### Description
This pull request improves the robustness and correctness of the CPU
implementation of the Gather operator in ONNX Runtime. The key changes
focus on preventing integer overflow issues in parallel processing and
output shape calculations, as well as enhancing test coverage to verify
these safeguards.

Enhancements to overflow handling and parallel processing:

* Changed the lambda function in `GatherCopyData` to use `ptrdiff_t`
instead of `int64_t` for the index, and explicitly cast the `batch` and
`i` variables, ensuring safer arithmetic for large tensor sizes.
* Updated the parallel loop in `GatherCopyData` to iterate using
`ptrdiff_t` indices, preventing potential overflow when processing large
tensors.
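
The kind of truncation being guarded against can be demonstrated with a small Python sketch that emulates 32-bit versus 64-bit index arithmetic (illustrative only; the actual fix is the `ptrdiff_t` change in the C++ code):

```python
def byte_offset(index, slice_bytes, index_bits=64):
    """Per-slice offset arithmetic as done in a gather copy loop.
    With a 32-bit index type the product wraps once the output
    exceeds 2^31 bytes (the heap out-of-bounds root cause); 64-bit
    indices keep it exact."""
    off = index * slice_bytes
    if index_bits == 32:
        off &= 0xFFFFFFFF
        if off >= 0x80000000:       # reinterpret as signed int32
            off -= 0x100000000
    return off

# 600000 slices of 4096 bytes is ~2.3 GiB, beyond INT32_MAX.
print(byte_offset(600_000, 4096))                 # 2457600000 (exact)
print(byte_offset(600_000, 4096, index_bits=32))  # -1837367296 (wrapped)
```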

Testing improvements:

* Added a new unit test `Gather_overflow_check` in `gather_op_test.cc`
to verify that the Gather operator correctly handles very large output
shapes without overflowing, specifically testing dimensions that exceed
the 32-bit integer limit.



### Description

Fixes issue 1 described in #27317.


### Description

Remove the "Unknown CPU vendor" warning for WebAssembly.

CPU information is not supposed to be exposed in a browser environment,
so having no CPU info at runtime is expected. This disables the
confusing warning message for WebAssembly.

### Motivation and Context

Fixes #27336.
Fixes #23828

Added validation to check:
- num_batches is not zero
- num_slices is divisible by num_batches

Before this fix, mismatched batch dimensions caused a crash due to
division by zero.

### Description
This PR fixes a division by zero crash in the GatherND operator when
batch dimensions mismatch between input and indices tensors.

Changes made:

- Added validation in `gather_nd.cc` to check that `num_batches` is not
zero before division
- Added validation that `num_slices` is divisible by `num_batches`
- Added a unit test to verify the fix


Description
Fixes #23828

When batch_dims is set but the actual batch dimensions of the input
tensor and indices tensor don't align correctly, the code performs a
division that can result in division by zero, causing a crash.

For example, with:

Input shape: [2, 2, 2]
Indices shape: [2, 1]
batch_dims=1
The calculation num_slices / num_batches would crash if num_batches is
0, or produce unexpected results if they don't divide evenly.

This fix returns a clear error message instead of crashing.
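
A rough sketch of the validation in Python (the batch/slice computation here is simplified and illustrative; the real check lives in `gather_nd.cc`):

```python
def validate_gather_nd(input_shape, indices_shape, batch_dims):
    """Guard mirroring the fix: num_batches must be non-zero and
    must divide num_slices evenly before slicing proceeds."""
    num_batches = 1
    for d in input_shape[:batch_dims]:   # batch dims of the input
        num_batches *= d
    num_slices = 1
    for d in indices_shape[:-1]:         # all indices dims but the last
        num_slices *= d
    if num_batches == 0:
        return "error: num_batches is zero"
    if num_slices % num_batches != 0:
        return "error: num_slices not divisible by num_batches"
    return "ok"

print(validate_gather_nd((2, 2, 2), (2, 1), 1))  # ok
print(validate_gather_nd((3, 2, 2), (2, 1), 1))  # mismatch -> error
```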

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR addresses several issues in the QMoE CPU implementation and
improves the MLAS documentation.

## Changes

### 1. QMoE CPU Operator Fixes
- **Corrected Bias Handling**: Renamed `fc2_bias_handled_by_q4_gemm` to
`fc2_bias_added_by_mlas` and updated the logic to consistently track
whether FC2 bias has been applied. This ensures that bias is not
double-counted or missed when using `DirectQ4Gemm`.
- **SwiGLU Attribute Update**: Switched from `swiglu_interleaved` to
`swiglu_fusion` in both the C++ operator and the Python test
infrastructure to align with the latest QMoE implementation standards.

### 2. MLAS Documentation
- **Clarified Buffer Shapes**: Added explicit documentation to
`MlasQ4GemmPackB` to specify that the input `FpData` buffer expects a
shape of `[K, N]`. This helps prevent layout-related errors in future
integrations.

### 3. Test Updates
- **PyTorch Parity Fixes**: Refactored
`onnxruntime/test/python/transformers/test_qmoe_cpu.py` to use
`swiglu_fusion` and improved the test structure for better parity checks
with PyTorch.

## Verification
- Verified by running `test_qmoe_cpu.py` to ensure all QMoE parity tests
pass on CPU.
## Summary
This change improves QMoE CPU performance by moving more work to prepack
time and enabling the DirectQ4 GEMM fast path where appropriate, while
preserving an env-var switch for performance/accuracy A/B testing.

This PR introduces:
- Prepack and cache infrastructure for QMoE expert weights.
- DirectQ4 packed-B cache built during prepack (instead of mutable
runtime cache in `Compute()`).
- Fast-path support for block-wise cases (including block size 32 where
supported by MLAS Q4 type).
- Runtime toggle via `ORT_USE_MLAS_Q4_GEMM_MOE`.
- Default fast-path policy refined to avoid known accuracy-loss
scenarios unless explicitly overridden by env var.
- Test and benchmark refinements for QMoE CPU validation.

## Key Implementation Changes

### 1. Prepack-time cache build
- Moves DirectQ4 packed-B cache construction to prepack stage.
- Removes mutable runtime cache maintenance from `Compute()`.
- Reduces per-inference overhead and avoids mutable shared cache
complexity.

### 2. Fast path vs fallback
- Keeps two execution modes:
  - DirectQ4 GEMM fast path (`MlasQ4GemmPackB` + `DirectQ4Gemm` cache
    usage).
  - Fallback path (`DequantizePrePacked` + `MlasGemm`).
- Allows controlled fallback for accuracy-sensitive configurations.

### 3. Environment variable behavior
- `ORT_USE_MLAS_Q4_GEMM_MOE=1`: force fast path when supported.
- `ORT_USE_MLAS_Q4_GEMM_MOE=0`: force fallback path.
- Unset: use default policy that enables fast path unless a known
accuracy-loss pattern is detected.
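
The three-way policy can be sketched as follows (illustrative Python; `use_fast_q4_path` and its parameters are invented names — only the env var `ORT_USE_MLAS_Q4_GEMM_MOE` comes from the PR):

```python
import os

def use_fast_q4_path(supported, known_accuracy_loss):
    """Path selection matching the described policy: the env var
    forces a mode; when unset, the default heuristic applies."""
    flag = os.environ.get("ORT_USE_MLAS_Q4_GEMM_MOE")
    if flag == "1":
        return supported            # force fast path when supported
    if flag == "0":
        return False                # force fallback
    # Default: fast path unless a known accuracy-loss pattern exists.
    return supported and not known_accuracy_loss

os.environ.pop("ORT_USE_MLAS_Q4_GEMM_MOE", None)
print(use_fast_q4_path(True, known_accuracy_loss=True))   # False
os.environ["ORT_USE_MLAS_Q4_GEMM_MOE"] = "1"
print(use_fast_q4_path(True, known_accuracy_loss=True))   # True
```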

### 4. Test updates
- QMoE CPU tests were refined to validate env-var on/off behavior and
no-env behavior.
- Coverage includes parity checks for symmetric/asymmetric,
row-wise/block-wise settings.

## Benchmark Results (1000 inferences, `benchmark_qmoe.py`)

Note: PyTorch latency fluctuates across runs and is excluded from
conclusions below.

### ORT results comparison
| Config | Baseline ORT Time (ms) | Baseline ORT tok/s | New ORT Time (env=0) (ms) | New ORT tok/s (env=0) | New ORT Time (env=1) (ms) | New ORT tok/s (env=1) |
|---|---:|---:|---:|---:|---:|---:|
| Medium-4bit | 748.594 | 1.3 | 237.219 | 4.2 | 178.943 | 5.6 |
| Medium-8bit | 209.277 | 4.8 | 212.074 | 4.7 | 203.882 | 4.9 |

### ORT speedup vs baseline
| Config | env=0 speedup vs baseline (time) | env=1 speedup vs baseline (time) |
|---|---:|---:|
| Medium-4bit | 3.16x faster | 4.18x faster |
| Medium-8bit | 0.99x (about flat) | 1.03x faster |

## Accuracy Notes
- `env=1` (forced fast path) provides the best 4-bit performance but may
show non-zero max diff in known cases.
- `env=0` (fallback) maintains parity behavior with zero observed max
diff in the reported benchmark table.
- Default no-env policy is designed to avoid known accuracy-loss cases
while still enabling fast path where safe.
…1D (#27459)

### Description

The `SkipLayerNormFusion` optimizer now skips fusion when the
`LayerNormalization` gamma or beta inputs are not 1D tensors (e.g. shape
`[1, 1, hidden_size]`). The `SkipLayerNormalization` kernel strictly
requires 1D gamma/beta, so fusing without this check caused a hard
runtime error.

- **`skip_layer_norm_fusion.cc`**: After matching the Add+LayerNorm
pattern, check that gamma (and beta if present) have exactly 1 dimension
before proceeding with fusion. If shape info is unavailable (dynamic),
fusion is allowed and runtime validation takes over.
- **`graph_transform_test_layernorm.cc`**: Added
`SkipLayerNormFusion_3DGamma_NoFusion` test — builds a graph with `Add +
LayerNormalization` where gamma/beta are `[1, 1, 4]` and asserts no
`SkipLayerNormalization` node is created.
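
The shape guard can be sketched in Python (illustrative only; the actual check is in `skip_layer_norm_fusion.cc`):

```python
def fusion_allowed(gamma_shape, beta_shape=None):
    """Pre-fusion guard matching the described check: gamma (and
    beta, if present) must be exactly 1-D. Unknown shapes (None)
    allow fusion and defer to runtime validation."""
    for shape in (gamma_shape, beta_shape):
        if shape is None:        # absent or dynamic: runtime decides
            continue
        if len(shape) != 1:
            return False
    return True

print(fusion_allowed((4,), (4,)))            # True  -- 1-D, fuse
print(fusion_allowed((1, 1, 4), (1, 1, 4)))  # False -- 3-D, skip fusion
print(fusion_allowed(None))                  # True  -- dynamic shape
```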

### Motivation and Context

Models with residual connections followed by `LayerNormalization` where
the scale/bias tensors carry extra batch/sequence dimensions (e.g.
exported as `[1, 1, hidden_size]` rather than `[hidden_size]`) would
trigger fusion and then fail at runtime:

```
Non-zero status code returned while running SkipLayerNormalization node.
Status Message: gamma is expected to have 1 dimension, got 3
```

The error only appeared with 3D inputs and disappeared at
`ORT_ENABLE_BASIC` optimization level (which disables the fusion),
confirming the optimizer as the source of the regression.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>

- Add ARM64X build to packaging pipeline. An additional zip archive
artifact with the ARM64X binaries will be produced.
- Add basic C++ sample program.
- Add binary archive tests using the C++ sample program to package test
pipeline.

Address request for ARM64X binaries.
Add testing of binary archives to package test pipeline.
### Description

Allows a new memory info name for WebGPU.

### Motivation and Context

This allows 1.24.3 to work with a future (1.25.x) WebGPU plugin DLL.
@hariharans29 (Member) left a comment:


LGTM for my fix

@tianleiwu tianleiwu merged commit ee26608 into rel-1.24.3 Feb 27, 2026
201 of 205 checks passed
@tianleiwu tianleiwu deleted the tlwu/rel-1.24.3 branch February 27, 2026 23:39