[MLAS] Fix Data Race in MlasLutGemm by Serializing LUT Generation by tianleiwu · Pull Request #27179 · microsoft/onnxruntime

tianleiwu · 2026-01-28T01:50:18Z

Problem Description

The MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256 test was exhibiting flaky behavior (failure rate ~2-20%) with numerical mismatches.
Investigation revealed a race condition in the GenerateLUT step within MlasLutGemm.

When the batch size M > 1, MlasLutGemm attempted to parallelize the LUT generation over the batch dimension using MlasTrySimpleParallel. However, the underlying GenerateLUT implementation (specifically shared usage of lut_scales/lut_biases or internal buffers) is not thread-safe for concurrent execution on the same destination buffers or related state. This led to corruption of the Look-Up Tables or scales, causing random output errors.
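To make the failure mode concrete, here is a minimal sketch of how shared intermediate state can corrupt a row's result. It is not the MLAS code: `SharedState`, `ComputeScale`, `Normalize`, and the two-phase structure are hypothetical stand-ins, and the unlucky interleaving of two workers is replayed on a single thread so the outcome is deterministic.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical shared slot, standing in for state that two workers both touch.
struct SharedState {
    float scale = 1.0f;
};

// Phase 1 (hypothetical): store the row's max-abs value as its scale.
void ComputeScale(const std::vector<float>& row, SharedState& state) {
    float max_abs = 0.0f;
    for (float v : row) {
        max_abs = std::max(max_abs, std::fabs(v));
    }
    state.scale = max_abs > 0.0f ? max_abs : 1.0f;
}

// Phase 2 (hypothetical): read the scale back and normalize the row.
std::vector<float> Normalize(const std::vector<float>& row, const SharedState& state) {
    std::vector<float> out;
    out.reserve(row.size());
    for (float v : row) {
        out.push_back(v / state.scale);
    }
    return out;
}

// Replays one bad interleaving of two workers that share a single slot.
std::vector<float> Row0UnderBadInterleaving(const std::vector<float>& row0,
                                            const std::vector<float>& row1) {
    SharedState shared;
    ComputeScale(row0, shared);      // worker 0, phase 1
    ComputeScale(row1, shared);      // worker 1, phase 1 overwrites the slot
    return Normalize(row0, shared);  // worker 0, phase 2 reads row 1's scale
}

// Processing row 0 on its own gives the intended result.
std::vector<float> Row0Sequential(const std::vector<float>& row0) {
    SharedState state;
    ComputeScale(row0, state);
    return Normalize(row0, state);
}
```

With `row0 = {1, -2, 4}` and `row1 = {8, 2, -4}`, the sequential path normalizes row 0 by 4, while the replayed interleaving normalizes it by 8. In a real multithreaded run, whether this happens depends on scheduling, which would be consistent with an intermittent failure rate.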

Solution

This PR modifies onnxruntime/core/mlas/lib/qlutgemm.cpp to serialize the GenerateLUT loop.
Instead of using MlasTrySimpleParallel, we now use a simple for loop to process each row of the batch sequentially.
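The shape of the change can be sketched as follows. This is an illustration under stated assumptions, not the code in qlutgemm.cpp: `GenerateLutForRow` is a hypothetical stand-in for the per-row kernel, and the buffer layout (one `K`-wide LUT row and one scale per batch row) is assumed for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for the per-row LUT generation kernel.
// Writes K table entries and one scale for the given activation row.
void GenerateLutForRow(const float* a_row, std::size_t K,
                       float* lut_row, float* scale_out) {
    float max_abs = 0.0f;
    for (std::size_t k = 0; k < K; ++k) {
        max_abs = std::max(max_abs, std::fabs(a_row[k]));
    }
    const float scale = max_abs > 0.0f ? max_abs : 1.0f;
    *scale_out = scale;
    for (std::size_t k = 0; k < K; ++k) {
        lut_row[k] = a_row[k] / scale;
    }
}

// Serialized setup step: one batch row at a time on the calling thread.
// Before the fix, each iteration of this loop was conceptually a work item
// handed to a parallel dispatcher instead.
void GenerateLutSerial(const float* A, std::size_t M, std::size_t K,
                       float* lut, float* scales) {
    for (std::size_t m = 0; m < M; ++m) {
        GenerateLutForRow(A + m * K, K, lut + m * K, scales + m);
    }
}
```

Because only one row is in flight at any time, any state the kernel shares between rows is never touched concurrently. The cost is that the setup step no longer scales with thread count.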

Performance Impact:
The GenerateLUT step is computationally lightweight compared to the subsequent TMACComputeGemm matrix multiplication. Serializing this setup step has negligible impact on overall inference latency (micro-benchmarks showed no measurable regression), but effectively eliminates the race condition.

Verification

Reproduction: The issue was reliably reproduced by running MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256 in a loop (failing ~1 in 5 times).
Verification: After applying the fix, the same test passed 50/50 iterations consistently.
Regression Testing: Standard MatMulNBitsLutGemm tests (including BlkLen64 and M=1 cases) continue to pass.
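The repeat-until-failure procedure above can be sketched as a small harness. `RunRepeatedly` and `RepeatResult` are hypothetical names introduced for illustration; the PR does not say how the loop was actually driven. Google Test binaries also accept the `--gtest_filter` and `--gtest_repeat` flags, which achieve the same thing from the command line.

```cpp
#include <cstddef>
#include <functional>

// Outcome of running one test body many times.
struct RepeatResult {
    std::size_t runs;
    std::size_t failures;
};

// Runs `test_once` for `iterations` rounds and counts how often it
// returns false. A flaky test shows up as 0 < failures < runs.
RepeatResult RunRepeatedly(std::size_t iterations,
                           const std::function<bool()>& test_once) {
    RepeatResult result{0, 0};
    for (std::size_t i = 0; i < iterations; ++i) {
        ++result.runs;
        if (!test_once()) {
            ++result.failures;
        }
    }
    return result;
}
```

A test body that fails on every fifth call would report 10 failures in 50 runs, while a stable one reports 0 in 50, matching the pass criterion used above.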

onnxruntime/core/mlas/lib/qlutgemm.cpp

vraspar

I ran a 2-bit Llama model and did not see any noticeable performance difference.


| Commit | Commit Title | Author |
| :--- | :--- | :--- |
| `6861526` | [MLAS] Fix Data Race in MlasLutGemm by Serializing LUT Generation (#27179) | tianleiwu |
| `592bcb4` | remove coloredlogs (#27135) | tianleiwu |
| `0f153de` | Add API GetTensorElementTypeAndShapeDataReference (#27175) | adrianlizarraga |
| `1caa3e6` | [MLAS] Fix Flaky LuT GEMM Tests by Replacing Gather with Shuffle (#27174) | tianleiwu |

Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>


Fix data race

38dfc91

tianleiwu requested a review from vraspar January 28, 2026 01:50

tianleiwu added the release:1.24.0 label Jan 28, 2026

hariharans29 reviewed Jan 28, 2026


onnxruntime/core/mlas/lib/qlutgemm.cpp (resolved)

vraspar approved these changes Jan 28, 2026


tianleiwu enabled auto-merge (squash) January 28, 2026 19:30

tianleiwu merged commit 6861526 into main Jan 28, 2026
90 of 92 checks passed

tianleiwu deleted the tlwu/fix_lut_gemm_data_race branch January 28, 2026 20:08

tianleiwu mentioned this pull request Jan 29, 2026

ORT 1.24.0 release cherry pick round 4 #27202

Merged

tianleiwu removed the release:1.24.0 label Feb 3, 2026

tianleiwu merged 1 commit into main from tlwu/fix_lut_gemm_data_race

3 participants
