ORT 1.24.3 release cherry pick round 1#27476

Merged
tianleiwu merged 21 commits into rel-1.24.3 from tlwu/rel-1.24.3
Feb 27, 2026
Conversation

@tianleiwu
Contributor

This cherry-picks the following commits for the release:

Commit ID PR Number Commit Title
decd177 #27090 Fix GatherND division by zero when batch dimensions mismatch
55f8234 #27360 Fix QMoE CPU Operator
df9146f #27403 [MLAS] Adding DynamicQGemm function pointers and ukernel interface
0f93853 #27318 [js/web] Use embedded WASM module in Blob URL workers when wasmBinary is provided
b2a6e69 #27364 QMoE CPU Performance Update (Up to 4x on 4-bit)
f501e1d #27413 Fix refcount bug in map input conversion that caused shutdown segfault
b32b205 #27421 Fix error where bytes is not assigned for dynamic qgemm pack b size
426b006 #27397 Fix DllImportResolver
0982844 #27412 MatmulNBits prepacking scales fix
9afb0d2 #27430 Fix validation for external data paths for models loaded from bytes
71d2cd0 #27401 Enable Python 3.14 CI and Upgrade Dependencies
79e0676 #27419 fix: out of bounds access for resize operation
82eb99c #27459 Fix SkipLayerNorm fusion incorrectly applied when gamma/beta are not 1D
355278a #27444 Fix GatherCopyData Integer Truncation Leading to Heap Out-of-Bounds Read/Write
cf96123 #27411 [web] fix usage of wasmBinary together with a blob URL for .mjs
1131a86 #27399 [web] remove the unhelpful "Unknown CPU vendor" warning.
ffbbc4f #27316 Build Windows ARM64X binaries as part of packaging pipeline

tianleiwu and others added 21 commits February 26, 2026 14:11
…27403)

### Description
* Adding function pointer overrides to KleidiAI DynamicQGemm
* Making use of ukernel interface for DynamicQGemm to select between SME
and SME2 variants

### Motivation and Context
Fixes #26377
… is provided (#27318)

Fixes #27317

When running inside a Blob URL Web Worker with `wasmBinary` provided and
`numThreads=1`, `isSameOrigin(scriptSrc)` can fail because blob: URLs
have opaque origins. This causes a fallback to dynamic
`import('./ort-wasm-simd-threaded.mjs')` which doesn't exist in that
context.

Since `wasmBinary` is already provided and no worker spawning is needed
(single-threaded), the embedded Emscripten module can be used directly —
no URL resolution or same-origin check is needed.

**Change:** One line in `wasm-utils-import.ts` line 275:
```typescript
// Before:
useEmbeddedModule = isSameOrigin(scriptSrc);

// After:
useEmbeddedModule = isSameOrigin(scriptSrc) || (isWasmOverridden && !isMultiThreaded);
```

This extends the existing pattern from the `!scriptSrc` case (line 268)
to also apply when `scriptSrc` is available but fails same-origin
checks. The condition (`wasmBinary` provided + single-threaded)
guarantees no file resolution or worker spawning is needed.
#27413)

Fix a Python refcount bug in map input conversion that caused a shutdown
segfault in `onnxruntime_test_python_mlops.py` (see
#27392).

## Summary
This PR fixes a Python reference-count ownership bug in the map
conversion path in `onnxruntime/python/onnxruntime_pybind_mlvalue.cc`.

In Python 3.14 test runs, the bug could manifest as a segmentation fault
after tests completed (typically at interpreter shutdown), even when
test assertions passed.

## Root Cause
In `CreateMapMLValue_LoopIntoMap`, error paths decremented `item`
unconditionally.

- In single-map flow, `item` is a **borrowed reference** (must not be
decref'd there).
- In iterator/vector-map flow, `item` is an **owned reference** (must be
decref'd).

The unconditional decref in borrowed-reference flow caused refcount
corruption and eventually a crash.

## Fix
Add explicit ownership handling for `item`:

- `CreateMapMLValue_LoopIntoMap(..., bool owns_item_ref, ...)`
- Pass `owns_item_ref = false` from single-map path
(`CreateMapMLValue_Map`)
- Pass `owns_item_ref = true` from vector-map path
(`CreateMapMLValue_VectorMap`)
- Only `Py_XDECREF(item)` on error when `owns_item_ref` is true

This preserves existing behavior while correcting reference ownership.
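
The ownership rule in the fix can be illustrated with a small Python analogue (the class and function names below are invented for illustration; the real fix is C++ code against the Python C API):

```python
class Ref:
    """Toy stand-in for a Python C-API object reference."""
    def __init__(self):
        self.refcount = 1

    def decref(self):
        self.refcount -= 1


def convert_item(item, owns_item_ref, fail=False):
    """Mirror of the fixed error path: release the reference only
    when we own it. Decref'ing a borrowed reference corrupts the
    caller's count -- the original bug."""
    try:
        if fail:
            raise ValueError("conversion failed")
        return "ok"
    except ValueError:
        if owns_item_ref:   # owned (iterator/vector-map flow): release
            item.decref()
        return None         # borrowed (single-map flow): caller owns it


borrowed = Ref()
convert_item(borrowed, owns_item_ref=False, fail=True)
print(borrowed.refcount)  # 1 -- caller's reference stays intact

owned = Ref()
convert_item(owned, owns_item_ref=True, fail=True)
print(owned.refcount)     # 0 -- our reference was released
```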

## Validation
```bash
cd onnxruntime/test/python
python onnxruntime_test_python_mlops.py
```

Result:
- `OK`
- Exit code `0` (no shutdown segfault)

## Notes
- Although this became reproducible in Python 3.14, the underlying
refcount bug is version-agnostic C-extension undefined behavior.
…27421)

### Description
Fix for the dynamic qgemm pack B size. The byte assignment was
accidentally removed in a previous commit, which caused test failures
with the following error message:

> C++ exception with description "Dynamic QGEMM requires non-null PackedB
> pointer." thrown in the test body.

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
### Description
This PR addresses two issues related to the newly added
`DllImportResolver` for `.NET` native library loading:

1. **Fix IL3000 Warning during Native AOT / Single-File Publish**
When publishing projects that reference `Microsoft.ML.OnnxRuntime` as a
single file or using Native AOT in .NET 9, the compiler reports an
`IL3000` warning/error because `DllImportResolver` accesses
`Assembly.Location`. In these deployment models, `Assembly.Location`
always returns an empty string or throws.
Since `DllImportResolver` already correctly handles the empty string
failure and falls back to `AppContext.BaseDirectory` (which is fully
supported), this PR adds the `[UnconditionalSuppressMessage]` attribute
to suppress the build warning statically.

2. **Fix `TypeInitializationException` in `NativeMethods` Static
Constructor**
Users reported a `System.TypeInitializationException: The type
initializer for 'Microsoft.ML.OnnxRuntime.NativeMethods' threw an
exception.` when initializing the ONNX Runtime environment.
This occurs because the `DllImportResolver` (registered in the static
constructor) is invoked on the first P/Invoke (`OrtGetApiBase`). If any
API within the resolver throws an unhandled exception (for instance,
`AppContext.BaseDirectory` throwing `AppDomainUnloadedException` in
sandboxed AppDomains or `Environment.GetEnvironmentVariable` throwing
`SecurityException`), the exception bubbles up and crashes the
application with a type initialization failure.
This PR wraps the `DllImportResolver` logic in a `try-catch` block
(specifically handling `AppContext.BaseDirectory` edge cases) so that
any resolution failure safely swallows the error and falls back to
`IntPtr.Zero`, allowing the default .NET Platform Invoke mechanism to
take over and throw a standard `DllNotFoundException` instead of a fatal
type initialization crash.
A unit test (`TestDllImportResolverDoesNotThrow`) has been added to
`OrtEnvTests.cs` to verify that `DllImportResolver` successfully
swallows internal exceptions without crashing the initialization
process.
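
The exception-swallowing pattern described above can be sketched in Python (a language analogue of the C# fix; `safe_resolver` and the `0` sentinel standing in for `IntPtr.Zero` are illustrative, not the actual code):

```python
def safe_resolver(resolve):
    """Wrap a library resolver so that any exception inside it is
    swallowed and a 'not resolved' sentinel is returned, letting the
    default loader take over instead of crashing initialization."""
    def wrapped(library_name):
        try:
            return resolve(library_name)
        except Exception:
            return 0  # sentinel: fall back to default resolution
    return wrapped


@safe_resolver
def resolver(name):
    # Simulates AppContext.BaseDirectory or environment access
    # throwing in a restricted environment.
    raise PermissionError("restricted environment")


print(resolver("onnxruntime"))  # 0 -- no crash, default loader used
```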

### Motivation and Context
These changes ensure that .NET developers can safely compile Native
AOT/Single-File applications without build errors and prevent hard
application crashes in environments with restricted permissions.
### Description
Fix an incorrect scales element count when pre-packing scales while
processing the B input in the `Prepack()` method of the MatmulNBits
operator.

### Motivation and Context
Fix a potential crash due to the incorrect element count.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…27430)

### Description
This PR fixes the validation of external data paths when ONNX models are
loaded from bytes (in-memory). Previously, when a model was loaded from
bytes without an explicit external data folder path being set, paths
using ".." sequences were not properly validated, potentially allowing
access to arbitrary files on the filesystem.
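
A minimal sketch of the kind of containment check involved, assuming validation against the model's base directory (illustrative Python, not the actual ONNX Runtime C++ validation):

```python
import os

def is_within_base(base_dir, external_path):
    """Reject external-data paths that escape the model's base
    directory via '..' sequences."""
    base = os.path.abspath(base_dir)
    target = os.path.abspath(os.path.join(base, external_path))
    # The resolved target must still live under the base directory.
    return os.path.commonpath([base, target]) == base

print(is_within_base("/models", "weights.bin"))       # True
print(is_within_base("/models", "../../etc/passwd"))  # False
```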


### Motivation and Context
Addresses a security concern.
This pull request enables Python 3.14 testing in the CI pipelines and
upgrades several key dependencies to support the new Python version.

Previously, Python 3.14 CI was not enabled because some dependent
packages did not yet support Python 3.14. Those packages have since
added support, so it is now time to upgrade them.
Key Python dependencies have been updated to versions that support
Python 3.14. The conditional version checks (based on `python_version`)
have been removed in favor of these updated versions across all
environments:
- **pybind**: Upgraded to `3.0.2`.
- **numpy**: Upgraded to `2.4.2`.
- **onnxscript**: Upgraded to `0.6.2`.
- **onnx-ir**: Upgraded to `0.1.16`.
- **onnx**: Standardized on `1.20.1`.
- **torch**: Upgraded to `2.10.0`.
- **triton**: Upgraded to `3.5.0`.
These updates affect multiple `requirements.txt` files across Linux and
Windows Docker images and build stages.
- Use `dynamo=False` for ONNX export in failing Python tests, since
PyTorch 2.10 changed the default to `dynamo=True`, which broke a few
test cases.
The conditional logic that previously skipped Python 3.14 tests has been
removed from the Azure Pipelines configuration.
- **Python 3.14 Tests Enabled**: Removed `condition: and(succeeded(),
ne('${{ parameters.PYTHON_VERSION }}', '3.14'))` from
`py-win-webgpu-stage.yml`.
- **Test Execution Flow**: Updated `py-win-cpu.yml` to remove the
restriction that prevented `onnxruntime` tests and
`onnx_backend_test_series.py` from running on Python 3.14.
#27392
### Description
This PR fixes:
* An out-of-bounds write in CUDA Resize for LINEAR mode when running
trilinear paths (3D/5D)
* A race condition for the reduction kernel

### Root cause
1. The temporary dims-mapping buffer for LINEAR mode was sized using
only H+W, while the trilinear coordinate mapping kernel writes D+H+W
entries.
2. A shared-memory race in the block-level reduction loop inside
`reduction_functions.cu`. The condition allowed threads outside the
active lower half to update shared memory in the same stride phase,
creating overlapping read/write hazards.

My colleague @korbinian-mechlem-snkeos noticed these warnings from
compute-sanitizer:
> ========= Invalid __global__ write of size 4 bytes
> =========     at void onnxruntime::cuda::_ResizeTrilinearCoordinateMapping<float, onnxruntime::cuda::TransformCoordinate_HALF_PIXEL>(long long, long long, long long, long long, long long, long long, float, float, float, float, float, float, float, float, float, unsigned long long, bool, const T2 &, onnxruntime::cuda::LinearMappingInfo *)+0x400
> =========     by thread (17,0,0) in block (2,0,0)
> =========     Address 0xb28fff7cc is out of bounds
> =========     and is 205 bytes after the nearest allocation at 0xb28fff400 of size 768 bytes
> ========= Saved host backtrace up to driver entry point at kernel launch time

AND

> ========= Warning: Race reported between Read access at void onnxruntime::cuda::detail::reduce_matrix_columns_kernel<float, float, float, onnxruntime::cuda::Identity, onnxruntime::cuda::Identity, (bool)0>(int, int, const T1 *, T2 *, T3 *, int *)+0xe80
> =========     and Write access at void onnxruntime::cuda::detail::reduce_matrix_columns_kernel<float, float, float, onnxruntime::cuda::Identity, onnxruntime::cuda::Identity, (bool)0>(int, int, const T1 *, T2 *, T3 *, int *)+0xea0 [337920 hazards]

### Motivation and Context
Update LINEAR buffer size calculation to:
* use H+W for bilinear (2D/4D)
* use D+H+W for trilinear (3D/5D)

Prevents invalid global writes and intermittent CUDA memory errors in
trilinear resize workloads.
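
The sizing rule can be sketched as follows (an illustrative Python helper, not the CUDA implementation; `linear_mapping_entries` is an invented name):

```python
def linear_mapping_entries(shape):
    """Number of coordinate-mapping entries the LINEAR resize path
    needs: H+W for bilinear (2D/4D), D+H+W for trilinear (3D/5D).
    Sizing a 5D input with only H+W under-allocates by D entries --
    the out-of-bounds write described above."""
    if len(shape) in (2, 4):        # bilinear: [..., H, W]
        h, w = shape[-2], shape[-1]
        return h + w
    if len(shape) in (3, 5):        # trilinear: [..., D, H, W]
        d, h, w = shape[-3], shape[-2], shape[-1]
        return d + h + w
    raise ValueError("unsupported rank for LINEAR resize")

print(linear_mapping_entries((1, 1, 8, 16, 16)))  # 40 = 8+16+16
print(linear_mapping_entries((1, 1, 16, 16)))     # 32 = 16+16
```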

@johannes-rehm-snkeos
…ead/Write (#27444)

### Description
This pull request improves the robustness and correctness of the CPU
implementation of the Gather operator in ONNX Runtime. The key changes
focus on preventing integer overflow issues in parallel processing and
output shape calculations, as well as enhancing test coverage to verify
these safeguards.

Enhancements to overflow handling and parallel processing:

* Changed the lambda function in `GatherCopyData` to use `ptrdiff_t`
instead of `int64_t` for the index, and explicitly cast the `batch` and
`i` variables, ensuring safer arithmetic for large tensor sizes.
* Updated the parallel loop in `GatherCopyData` to iterate using
`ptrdiff_t` indices, preventing potential overflow when processing large
tensors.
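
The kind of truncation being guarded against can be demonstrated with a small Python sketch that emulates 32-bit versus 64-bit index arithmetic (illustrative only; the actual fix is the `ptrdiff_t` change in the C++ code):

```python
def byte_offset(index, slice_bytes, index_bits=64):
    """Per-slice offset arithmetic as done in a gather copy loop.
    With a 32-bit index type the product wraps once the output
    exceeds 2^31 bytes (the heap out-of-bounds root cause); 64-bit
    indices keep it exact."""
    off = index * slice_bytes
    if index_bits == 32:
        off &= 0xFFFFFFFF
        if off >= 0x80000000:       # reinterpret as signed int32
            off -= 0x100000000
    return off

# 600000 slices of 4096 bytes is ~2.3 GiB, beyond INT32_MAX.
print(byte_offset(600_000, 4096))                 # 2457600000 (exact)
print(byte_offset(600_000, 4096, index_bits=32))  # -1837367296 (wrapped)
```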

Testing improvements:

* Added a new unit test `Gather_overflow_check` in `gather_op_test.cc`
to verify that the Gather operator correctly handles very large output
shapes without overflowing, specifically testing dimensions that exceed
the 32-bit integer limit.



### Description

Fixes issue 1 described in #27317.


### Description

Remove the "Unknown CPU vendor" warning for WebAssembly.

CPU information is not supposed to be exposed in a browser environment,
so having no CPU info at runtime is expected. This disables the
confusing warning message for WebAssembly.

### Motivation and Context

Fixes #27336.
Fixes #23828

Added validation to check:
- num_batches is not zero
- num_slices is divisible by num_batches

Before this fix, mismatched batch dimensions caused a crash due to
division by zero.

### Description
This PR fixes a division by zero crash in the GatherND operator when
batch dimensions mismatch between input and indices tensors.

Changes made:

- Added validation in `gather_nd.cc` to check that `num_batches` is not
zero before division
- Added validation that `num_slices` is divisible by `num_batches`
- Added a unit test to verify the fix


Description
Fixes #23828

When batch_dims is set but the actual batch dimensions of the input
tensor and indices tensor don't align correctly, the code performs a
division that can result in division by zero, causing a crash.

For example, with:

Input shape: [2, 2, 2]
Indices shape: [2, 1]
batch_dims=1
The calculation num_slices / num_batches would crash if num_batches is
0, or produce unexpected results if they don't divide evenly.

This fix returns a clear error message instead of crashing.
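
A rough sketch of the validation in Python (the batch/slice computation here is simplified and illustrative; the real check lives in `gather_nd.cc`):

```python
def validate_gather_nd(input_shape, indices_shape, batch_dims):
    """Guard mirroring the fix: num_batches must be non-zero and
    must divide num_slices evenly before slicing proceeds."""
    num_batches = 1
    for d in input_shape[:batch_dims]:   # batch dims of the input
        num_batches *= d
    num_slices = 1
    for d in indices_shape[:-1]:         # all indices dims but the last
        num_slices *= d
    if num_batches == 0:
        return "error: num_batches is zero"
    if num_slices % num_batches != 0:
        return "error: num_slices not divisible by num_batches"
    return "ok"

print(validate_gather_nd((2, 2, 2), (2, 1), 1))  # ok
print(validate_gather_nd((3, 2, 2), (2, 1), 1))  # mismatch -> error
```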

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR addresses several issues in the QMoE CPU implementation and
improves the MLAS documentation.

## Changes

### 1. QMoE CPU Operator Fixes
- **Corrected Bias Handling**: Renamed `fc2_bias_handled_by_q4_gemm` to
`fc2_bias_added_by_mlas` and updated the logic to consistently track
whether FC2 bias has been applied. This ensures that bias is not
double-counted or missed when using `DirectQ4Gemm`.
- **SwiGLU Attribute Update**: Switched from `swiglu_interleaved` to
`swiglu_fusion` in both the C++ operator and the Python test
infrastructure to align with the latest QMoE implementation standards.

### 2. MLAS Documentation
- **Clarified Buffer Shapes**: Added explicit documentation to
`MlasQ4GemmPackB` to specify that the input `FpData` buffer expects a
shape of `[K, N]`. This helps prevent layout-related errors in future
integrations.

### 3. Test Updates
- **PyTorch Parity Fixes**: Refactored
`onnxruntime/test/python/transformers/test_qmoe_cpu.py` to use
`swiglu_fusion` and improved the test structure for better parity checks
with PyTorch.

## Verification
- Verified by running `test_qmoe_cpu.py` to ensure all QMoE parity tests
pass on CPU.
## Summary
This change improves QMoE CPU performance by moving more work to prepack
time and enabling the DirectQ4 GEMM fast path where appropriate, while
preserving an env-var switch for performance/accuracy A/B testing.

This PR introduces:
- Prepack and cache infrastructure for QMoE expert weights.
- DirectQ4 packed-B cache built during prepack (instead of mutable
runtime cache in `Compute()`).
- Fast-path support for block-wise cases (including block size 32 where
supported by MLAS Q4 type).
- Runtime toggle via `ORT_USE_MLAS_Q4_GEMM_MOE`.
- Default fast-path policy refined to avoid known accuracy-loss
scenarios unless explicitly overridden by env var.
- Test and benchmark refinements for QMoE CPU validation.

## Key Implementation Changes

### 1. Prepack-time cache build
- Moves DirectQ4 packed-B cache construction to prepack stage.
- Removes mutable runtime cache maintenance from `Compute()`.
- Reduces per-inference overhead and avoids mutable shared cache
complexity.

### 2. Fast path vs fallback
- Keeps two execution modes:
  - DirectQ4 GEMM fast path (`MlasQ4GemmPackB` + `DirectQ4Gemm` cache
    usage).
  - Fallback path (`DequantizePrePacked` + `MlasGemm`).
- Allows controlled fallback for accuracy-sensitive configurations.

### 3. Environment variable behavior
- `ORT_USE_MLAS_Q4_GEMM_MOE=1`: force fast path when supported.
- `ORT_USE_MLAS_Q4_GEMM_MOE=0`: force fallback path.
- Unset: use default policy that enables fast path unless a known
accuracy-loss pattern is detected.
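
The three-way policy can be sketched as follows (illustrative Python; `use_fast_q4_path` and its parameters are invented names — only the env var `ORT_USE_MLAS_Q4_GEMM_MOE` comes from the PR):

```python
import os

def use_fast_q4_path(supported, known_accuracy_loss):
    """Path selection matching the described policy: the env var
    forces a mode; when unset, the default heuristic applies."""
    flag = os.environ.get("ORT_USE_MLAS_Q4_GEMM_MOE")
    if flag == "1":
        return supported            # force fast path when supported
    if flag == "0":
        return False                # force fallback
    # Default: fast path unless a known accuracy-loss pattern exists.
    return supported and not known_accuracy_loss

os.environ.pop("ORT_USE_MLAS_Q4_GEMM_MOE", None)
print(use_fast_q4_path(True, known_accuracy_loss=True))   # False
os.environ["ORT_USE_MLAS_Q4_GEMM_MOE"] = "1"
print(use_fast_q4_path(True, known_accuracy_loss=True))   # True
```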

### 4. Test updates
- QMoE CPU tests were refined to validate env-var on/off behavior and
no-env behavior.
- Coverage includes parity checks for symmetric/asymmetric,
row-wise/block-wise settings.

## Benchmark Results (1000 inferences, `benchmark_qmoe.py`)

Note: PyTorch latency fluctuates across runs and is excluded from
conclusions below.

### ORT results comparison
| Config | Baseline ORT Time (ms) | Baseline ORT tok/s | New ORT Time (env=0) (ms) | New ORT tok/s (env=0) | New ORT Time (env=1) (ms) | New ORT tok/s (env=1) |
|---|---:|---:|---:|---:|---:|---:|
| Medium-4bit | 748.594 | 1.3 | 237.219 | 4.2 | 178.943 | 5.6 |
| Medium-8bit | 209.277 | 4.8 | 212.074 | 4.7 | 203.882 | 4.9 |

### ORT speedup vs baseline
| Config | env=0 speedup vs baseline (time) | env=1 speedup vs baseline (time) |
|---|---:|---:|
| Medium-4bit | 3.16x faster | 4.18x faster |
| Medium-8bit | 0.99x (about flat) | 1.03x faster |

## Accuracy Notes
- `env=1` (forced fast path) provides the best 4-bit performance but may
show non-zero max diff in known cases.
- `env=0` (fallback) maintains parity behavior with zero observed max
diff in the reported benchmark table.
- Default no-env policy is designed to avoid known accuracy-loss cases
while still enabling fast path where safe.
…1D (#27459)

### Description

The `SkipLayerNormFusion` optimizer now skips fusion when the
`LayerNormalization` gamma or beta inputs are not 1D tensors (e.g. shape
`[1, 1, hidden_size]`). The `SkipLayerNormalization` kernel strictly
requires 1D gamma/beta, so fusing without this check caused a hard
runtime error.

- **`skip_layer_norm_fusion.cc`**: After matching the Add+LayerNorm
pattern, check that gamma (and beta if present) have exactly 1 dimension
before proceeding with fusion. If shape info is unavailable (dynamic),
fusion is allowed and runtime validation takes over.
- **`graph_transform_test_layernorm.cc`**: Added
`SkipLayerNormFusion_3DGamma_NoFusion` test — builds a graph with `Add +
LayerNormalization` where gamma/beta are `[1, 1, 4]` and asserts no
`SkipLayerNormalization` node is created.
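
The shape guard can be sketched in Python (illustrative only; the actual check is in `skip_layer_norm_fusion.cc`):

```python
def fusion_allowed(gamma_shape, beta_shape=None):
    """Pre-fusion guard matching the described check: gamma (and
    beta, if present) must be exactly 1-D. Unknown shapes (None)
    allow fusion and defer to runtime validation."""
    for shape in (gamma_shape, beta_shape):
        if shape is None:        # absent or dynamic: runtime decides
            continue
        if len(shape) != 1:
            return False
    return True

print(fusion_allowed((4,), (4,)))            # True  -- 1-D, fuse
print(fusion_allowed((1, 1, 4), (1, 1, 4)))  # False -- 3-D, skip fusion
print(fusion_allowed(None))                  # True  -- dynamic shape
```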

### Motivation and Context

Models with residual connections followed by `LayerNormalization` where
the scale/bias tensors carry extra batch/sequence dimensions (e.g.
exported as `[1, 1, hidden_size]` rather than `[hidden_size]`) would
trigger fusion and then fail at runtime:

```
Non-zero status code returned while running SkipLayerNormalization node.
Status Message: gamma is expected to have 1 dimension, got 3
```

The error only appeared with 3D inputs and disappeared at
`ORT_ENABLE_BASIC` optimization level (which disables the fusion),
confirming the optimizer as the source of the regression.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>

- Add ARM64X build to packaging pipeline. An additional zip archive
artifact with the ARM64X binaries will be produced.
- Add basic C++ sample program.
- Add binary archive tests using the C++ sample program to package test
pipeline.

Address request for ARM64X binaries.
Add testing of binary archives to package test pipeline.
### Description

Allows a new memory info name for WebGPU.

### Motivation and Context

This allows 1.24.3 to work with a future (1.25.x) WebGPU plugin DLL.
@hariharans29 (Member) left a comment:


LGTM for my fix

@tianleiwu tianleiwu merged commit ee26608 into rel-1.24.3 Feb 27, 2026
201 of 205 checks passed
@tianleiwu tianleiwu deleted the tlwu/rel-1.24.3 branch February 27, 2026 23:39