[MLAS] Fix rotary interleaved NEON kernel#26390
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull Request Overview
This PR fixes a critical bug in the NEON fp16 rotary embedding kernel for the interleaved mode. The kernel was incorrectly indexing the sin/cos lookup tables with the full input dimension index instead of half the dimension, causing out-of-bounds reads. The fix aligns the NEON implementation with the AVX2 kernel and the fallback implementation by dividing indices by 2 when accessing the sin/cos tables in interleaved mode.
Key changes:
- Corrected sin/cos table indexing in the interleaved NEON fp16 kernel from
itoi/2 - Added comprehensive unit tests for NEON fp16 RoPE operations covering various dimensions and both interleaved/non-interleaved modes
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
onnxruntime/test/mlas/unittest/test_rope_neon_fp16.cpp |
New test file validating NEON fp16 RoPE kernel against fallback implementation |
onnxruntime/core/mlas/lib/rotary_embedding_kernel_neon_fp16.cpp |
Fixed all sin/cos table access points in interleaved mode to use i/2 indexing |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
b88d691 to
6384fbf
Compare
6384fbf to
b9f4c6d
Compare
kunal-vaishnavi
approved these changes
Jan 30, 2026
hariharans29
approved these changes
Jan 30, 2026
tianleiwu
added a commit
that referenced
this pull request
Feb 3, 2026
The logic of interleaved NEON kernel is not correct from code review:
1. **Test Code Logic:**
The test code `test_rope.h` allocates the `sin` and `cos` tables based
on the `interleaved` flag:
```c++
size_t table_len = interleaved ? rotary_emb_dim / 2 : rotary_emb_dim;
std::vector<float> sin_data(table_len);
std::vector<float> cos_data(table_len);
```
For the `interleaved = true` case, the test creates `sin` and `cos`
tables of length `rotary_emb_dim / 2`.
2. **AVX2 (fp32) Kernel Logic (`interleaved = true`):**
This kernel loads the `sin`/`cos` data using an index of `i / 2`:
```c++
float32x8_t sin_val = _mm256_loadu_ps(sin_data + i / 2);
float32x8_t cos_val = _mm256_loadu_ps(cos_data + i / 2);
```
This logic expects a `sin`/`cos` table of length `rotary_emb_dim / 2`.
**Conclusion: The AVX2 (fp32) kernel is consistent with the test code.**
3. **NEON (fp16) Kernel Logic (`interleaved = true`):**
This kernel loads the `sin`/`cos` data using an index of `i`:
```c++
// Enters loop with sin_val = MlasLoadFloat16x8(sin + i);
//...
// Inside loop, for next iteration:
sin_val = MlasLoadFloat16x8(sin + i + 16);
```
This logic expects a `sin`/`cos` table of length `rotary_emb_dim`.
**Conclusion: The NEON (fp16) kernel is NOT consistent with the test
code.**
### Regression Test
```
cmake --build build/Linux/Release --config Release --target onnxruntime_mlas_test && ./build/Linux/Release/onnxruntime_mlas_test --gtest_filter=NeonFp16RoPE*
```
Before applying the fix, the test failed:
```
[ FAILED ] NeonFp16RoPE.ShortExecute (13 ms)
onnxruntime/onnxruntime/test/mlas/unittest/test_rope_neon_fp16.cpp:66: Failure
Value of: CloseEnough(output_impl[i].ToFloat(), output_ref[i].ToFloat())
Actual: false
Expected: true
Expected bits: 19491 (16.546875) Actual bits: 56596 (-325) @[16], rotary_emb_dim=24, interleaved=true
```
After applying the fix, test passed.
### Summary
The `RopeKernel_Avx2_fp32_Impl<true>` kernel correctly aligns with the
test code (and the fallback implementation) by expecting a `sin`/`cos`
table of length `rotary_emb_dim / 2`.
The `RopeKernel_Fp16_Impl<true>` (NEON) kernel incorrectly expects a
table of length `rotary_emb_dim`. When run against the provided test,
the NEON kernel will read past the end of the `sin_data` and `cos_data`
vectors.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
tianleiwu
added a commit
that referenced
this pull request
Feb 3, 2026
Cherry-pick round 1 for 1.24.1 release: #27157: [Fix: Replace pkg_resources with importlib.metadata in machine_info.py](40469f0) #27124: [Remove x86 from nuget (#27124)](5c98f5c) #26390: [[MLAS] Fix rotary interleaved NEON kernel](536c6c9) #27215: [Fix Conv LHS packing padding/uninitialized ptrs V2](62a3890) #27221: [Fix WebGPU MoE swiglu_limit (default to infinity)](98b6ce9) #26994: [Fix for https://github.com/microsoft/onnxruntime/issues/25145](https://github.com/microsoft/onnxruntime/commit/bce7b4faca24ae2ae279ab8fa2de637a46e7f45b) --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com> Co-authored-by: Joshua Lochner <admin@xenova.com> Co-authored-by: umangb-09 <umangb@nvidia.com>
milpuz01
pushed a commit
to milpuz01/onnxruntime
that referenced
this pull request
Feb 4, 2026
The logic of interleaved NEON kernel is not correct from code review:
1. **Test Code Logic:**
The test code `test_rope.h` allocates the `sin` and `cos` tables based
on the `interleaved` flag:
```c++
size_t table_len = interleaved ? rotary_emb_dim / 2 : rotary_emb_dim;
std::vector<float> sin_data(table_len);
std::vector<float> cos_data(table_len);
```
For the `interleaved = true` case, the test creates `sin` and `cos`
tables of length `rotary_emb_dim / 2`.
2. **AVX2 (fp32) Kernel Logic (`interleaved = true`):**
This kernel loads the `sin`/`cos` data using an index of `i / 2`:
```c++
float32x8_t sin_val = _mm256_loadu_ps(sin_data + i / 2);
float32x8_t cos_val = _mm256_loadu_ps(cos_data + i / 2);
```
This logic expects a `sin`/`cos` table of length `rotary_emb_dim / 2`.
**Conclusion: The AVX2 (fp32) kernel is consistent with the test code.**
3. **NEON (fp16) Kernel Logic (`interleaved = true`):**
This kernel loads the `sin`/`cos` data using an index of `i`:
```c++
// Enters loop with sin_val = MlasLoadFloat16x8(sin + i);
//...
// Inside loop, for next iteration:
sin_val = MlasLoadFloat16x8(sin + i + 16);
```
This logic expects a `sin`/`cos` table of length `rotary_emb_dim`.
**Conclusion: The NEON (fp16) kernel is NOT consistent with the test
code.**
### Regression Test
```
cmake --build build/Linux/Release --config Release --target onnxruntime_mlas_test && ./build/Linux/Release/onnxruntime_mlas_test --gtest_filter=NeonFp16RoPE*
```
Before applying the fix, the test failed:
```
[ FAILED ] NeonFp16RoPE.ShortExecute (13 ms)
onnxruntime/onnxruntime/test/mlas/unittest/test_rope_neon_fp16.cpp:66: Failure
Value of: CloseEnough(output_impl[i].ToFloat(), output_ref[i].ToFloat())
Actual: false
Expected: true
Expected bits: 19491 (16.546875) Actual bits: 56596 (-325) @[16], rotary_emb_dim=24, interleaved=true
```
After applying the fix, test passed.
### Summary
The `RopeKernel_Avx2_fp32_Impl<true>` kernel correctly aligns with the
test code (and the fallback implementation) by expecting a `sin`/`cos`
table of length `rotary_emb_dim / 2`.
The `RopeKernel_Fp16_Impl<true>` (NEON) kernel incorrectly expects a
table of length `rotary_emb_dim`. When run against the provided test,
the NEON kernel will read past the end of the `sin_data` and `cos_data`
vectors.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The logic of interleaved NEON kernel is not correct from code review:
Test Code Logic:
The test code
test_rope.hallocates thesinandcostables based on theinterleavedflag:For the
interleaved = truecase, the test createssinandcostables of lengthrotary_emb_dim / 2.AVX2 (fp32) Kernel Logic (
interleaved = true):This kernel loads the
sin/cosdata using an index ofi / 2:This logic expects a
sin/costable of lengthrotary_emb_dim / 2.Conclusion: The AVX2 (fp32) kernel is consistent with the test code.
NEON (fp16) Kernel Logic (
interleaved = true):This kernel loads the
sin/cosdata using an index ofi:This logic expects a
sin/costable of lengthrotary_emb_dim.Conclusion: The NEON (fp16) kernel is NOT consistent with the test code.
Regression Test
Before applying the fix, the test failed:
After applying the fix, test passed.
Summary
The
RopeKernel_Avx2_fp32_Impl<true>kernel correctly aligns with the test code (and the fallback implementation) by expecting asin/costable of lengthrotary_emb_dim / 2.The
RopeKernel_Fp16_Impl<true>(NEON) kernel incorrectly expects a table of lengthrotary_emb_dim. When run against the provided test, the NEON kernel will read past the end of thesin_dataandcos_datavectors.