ORT 1.24.0 release cherry pick round 4 #27202

Merged: tianleiwu merged 4 commits into rel-1.24.0 from tlwu/rel-1.24.0_cherry_pick_round_4 on Jan 29, 2026

Conversation

tianleiwu (Contributor) commented on Jan 29, 2026

| Commit | Commit Title | Author |
|---|---|---|
| 6861526 | [MLAS] Fix Data Race in MlasLutGemm by Serializing LUT Generation (#27179) | tianleiwu |
| 592bcb4 | remove coloredlogs (#27135) | tianleiwu |
| 0f153de | Add API GetTensorElementTypeAndShapeDataReference (#27175) | adrianlizarraga |
| 1caa3e6 | [MLAS] Fix Flaky LuT GEMM Tests by Replacing Gather with Shuffle (#27174) | tianleiwu |

tianleiwu and others added 4 commits January 28, 2026 23:38
## Problem Description
The `MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256` test
was exhibiting flaky behavior (failure rate ~2-20%) with numerical
mismatches.
Investigation revealed a **race condition** in the
[GenerateLUT](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L326)
step within
[MlasLutGemm](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/inc/mlas_qnbit.h#L328).

When the batch size `M > 1`,
[MlasLutGemm](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/inc/mlas_qnbit.h#L328)
attempted to parallelize the LUT generation over the batch dimension
using `MlasTrySimpleParallel`. However, the underlying
[GenerateLUT](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L326)
implementation (specifically shared usage of `lut_scales`/`lut_biases`
or internal buffers) is not thread-safe for concurrent execution on the
same destination buffers or related state. This led to corruption of the
Look-Up Tables or scales, causing random output errors.

## Solution
This PR modifies
[onnxruntime/core/mlas/lib/qlutgemm.cpp](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/qlutgemm.cpp)
to **serialize the
[GenerateLUT](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L326)
loop**.
Instead of using `MlasTrySimpleParallel`, we now use a simple `for` loop
to process each row of the batch sequentially.
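The change can be illustrated with a toy model (illustrative names only; `LutWorkspace`, `generate_lut_row`, and `generate_luts_serial` are not the actual MLAS symbols): a LUT-generation step that writes through shared, non-thread-safe workspace state is only safe when the batch rows are processed one at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of the fix: a GenerateLUT-like step writes through a shared
// workspace that is not thread-safe, so the batch loop must run
// sequentially rather than via a parallel dispatch.
struct LutWorkspace {
    std::vector<float> scratch;  // shared state: racy if rows run concurrently
};

// Build a per-row "LUT" (here just prefix sums, as a stand-in) through the
// shared scratch buffer.
void generate_lut_row(const float* row, size_t k, LutWorkspace& ws, float* lut_out) {
    ws.scratch.assign(k, 0.0f);
    float acc = 0.0f;
    for (size_t i = 0; i < k; ++i) {
        acc += row[i];
        ws.scratch[i] = acc;
    }
    for (size_t i = 0; i < k; ++i) {
        lut_out[i] = ws.scratch[i];
    }
}

// The fix: a plain for loop over the batch dimension, so each row's LUT
// generation finishes before the next row touches the shared workspace.
void generate_luts_serial(const float* a, size_t m, size_t k, float* luts) {
    LutWorkspace ws;
    for (size_t row = 0; row < m; ++row) {
        generate_lut_row(a + row * k, k, ws, luts + row * k);
    }
}
```

The tradeoff mirrors the one described above: the per-row step is cheap relative to the GEMM itself, so giving up batch-level parallelism here costs little while removing the concurrent writes entirely.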

**Performance Impact:**
The
[GenerateLUT](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L326)
step is computationally lightweight compared to the subsequent
[TMACComputeGemm](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L505)
matrix multiplication. Serializing this setup step has negligible impact
on overall inference latency (micro-benchmarks showed no measurable
regression), but effectively eliminates the race condition.

## Verification
* **Reproduction:** The issue was reliably reproduced by running
`MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256` in a loop
(failing ~1 in 5 times).
* **Verification:** After applying the fix, the same test passed **50/50
iterations** consistently.
* **Regression Testing:** Standard `MatMulNBitsLutGemm` tests (including
`BlkLen64` and `M=1` cases) continue to pass.
---

Adds a C/C++ API named `GetTensorElementTypeAndShapeDataReference` that
returns an OrtValue tensor's shape and element type without allocating a
new buffer for the shape data.
This new API can be used instead of `OrtApi::GetTypeInfo()` or
`OrtApi::GetTensorTypeAndShape` to reduce the number of heap
allocations and thus improve inference latency for plugin EP kernels
that frequently retrieve tensor shapes during inference (e.g., the WebGPU
plugin EP).
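The design idea can be sketched with a toy model (the types and signatures below are illustrative, NOT the actual ORT API): because the tensor already owns its shape metadata, a "data reference" accessor can hand back a pointer into that storage, whereas the older accessors materialize a fresh copy per query.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy tensor: owns its element type and shape, like an OrtValue does.
struct ToyTensor {
    int element_type;            // stand-in for an element-type enum
    std::vector<int64_t> shape;  // owned shape storage
};

// Allocating style (analogous to GetTensorTypeAndShape): every call
// heap-allocates a fresh copy of the dims.
std::vector<int64_t> get_shape_copy(const ToyTensor& t) {
    return t.shape;
}

// Reference style (analogous in spirit to the new API): no allocation;
// the caller gets the element type plus a view into the tensor's own
// shape data, valid while the tensor is alive and its shape unchanged.
void get_shape_ref(const ToyTensor& t, int* type, const int64_t** dims, size_t* rank) {
    *type = t.element_type;
    *dims = t.shape.data();
    *rank = t.shape.size();
}
```

The lifetime caveat is the usual one for reference-returning accessors: the returned pointer must not outlive the tensor it points into.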

---

## Problem Description
The `MatMulNBitsLutGemm` test suite, specifically
`Float32_2Bits_Symmetric_256x256_BlkLen64`, was observing intermittent
failures (flakiness).
The failure manifested as numerical mismatches exceeding the tolerance,
suggesting non-deterministic behavior in the kernel execution.

## Root Cause Analysis
The issue was traced to the usage of `_mm256_i32gather_ps` in
`sqnbitgemm_lut_kernel_avx2.cpp`.
While the gather indices were technically calculating addresses within
the bounds of the allocated buffer, gather instructions on certain AVX2
hardware implementations can exhibit non-deterministic behavior or
subtle performance/prefetching artifacts when operating on specific
stride patterns (in this case, gathering with a stride of 4 floats).

## Solution
This PR replaces the `_mm256_i32gather_ps` instruction with a sequence
of **contiguous loads (`_mm256_loadu_ps`) followed by deterministic
shuffles**.

### How it works:
1. **Contiguous Load**: We load four contiguous vectors of 8 floats each
using `_mm256_loadu_ps`. This is always memory-safe and deterministic.
2. **Deterministic Shuffle**: We apply a verified sequence of `unpack`
and `permutevar8x32` instructions to rearrange these 32 linearly loaded
elements into the exact same stride-4 layout that the gather instruction
produced.

### Benefits:
* **Stability**: Eliminates the hardware-dependent non-determinism of
gather.
* **Safety**: Usage of `loadu` guarantees we only touch memory within
the explicit range of the 32 elements we intend to load.
* **Correctness**: The shuffle logic was verified against the reference
gather behavior using a C++ reproduction script to ensure bit-exact
layout equivalence.

### Performance

Micro-benchmark on MatMulNBitsLutGemm (256x256, BlkLen=64):
* Original (Gather): ~55.55 us
* Fixed (Load+Shuffle): ~57.79 us
* Delta: +2.24 us (~4% slower)

The slight performance regression is expected because replacing a single
hardware gather instruction with a sequence of loadu, unpack, and
permute instructions adds instruction count overhead. However, this is a
necessary tradeoff to ensure deterministic behavior and memory safety
across all AVX2 implementations.

## Verification
* **Tests**: All 9 tests in `MatMulNBitsLutGemm` passed successfully
(including the previously flaky `BlkLen64` case).
tianleiwu enabled auto-merge (squash) January 29, 2026 16:55
tianleiwu merged commit 526e527 into rel-1.24.0 on Jan 29, 2026
99 of 104 checks passed
tianleiwu deleted the tlwu/rel-1.24.0_cherry_pick_round_4 branch January 29, 2026 16:58