Add validation of position_ids in RotaryEmbedding operators by tianleiwu · Pull Request #27597 · microsoft/onnxruntime

tianleiwu · 2026-03-09T20:22:29Z

Description

Fix out-of-bounds read in the RotaryEmbedding operator when user-provided position_ids values exceed the cos/sin cache bounds (max_sequence_length).

Problem

When position_ids contains values that are negative or >= max_sequence_length, the kernel computes cache_offset = position_id * half_rotary_embedding_dim and reads out-of-bounds from cos_cache / sin_cache. This can cause undefined behavior (incorrect results, crashes, or memory corruption).

Fix

CPU (rotary_embedding.cc):

Added upfront validation of all position_ids values before the parallel computation loop. Returns an INVALID_ARGUMENT error if any value is out of range [0, max_sequence_length).
Validation is only applied when position_ids_format != 0 (i.e., when position_ids are explicitly provided). When position_ids is not provided (format 0), the cache is shaped (B, S, H/2) and the index b * S + s is always in-bounds by construction.
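The upfront check described above can be sketched as a standalone function. This is a minimal illustration rather than the code from the PR: `ValidatePositionIds` and its string return value are hypothetical stand-ins for the kernel's `Status`-based error path, and the sketch assumes position_ids are stored as 64-bit integers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not an ONNX Runtime API). Returns an empty string when
// every position id lies in [0, max_sequence_length); otherwise returns a
// message describing the first offending value. The comparison is done on the
// full int64_t value, before any narrowing.
std::string ValidatePositionIds(const std::vector<int64_t>& position_ids,
                                int64_t max_sequence_length) {
  for (std::size_t i = 0; i < position_ids.size(); ++i) {
    const int64_t pos = position_ids[i];
    if (pos < 0 || pos >= max_sequence_length) {
      return "position_ids value " + std::to_string(pos) + " at index " +
             std::to_string(i) + " is out of range [0, " +
             std::to_string(max_sequence_length) + ")";
    }
  }
  return std::string();
}
```

Because the scan runs once before any parallel work starts, a failure can be reported without partially written output.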

CUDA (rotary_embedding_impl.cu):

Plumbed the previously-unused max_sequence_length parameter through to the kernel.
Added a bounds check inside the position_ids_format != 0 branch. Out-of-bounds position IDs cause the kernel to pass through the input unchanged (errors cannot be propagated from GPU kernels).
The bounds check is scoped to the position_ids_format != 0 branch only. When format is 0 (no position_ids), the cache is (B*S, H/2) and b_s_index = b * S + s is deterministically valid — applying the check unconditionally would incorrectly reject all batches beyond the first since max_sequence_length == sequence_length in that case.
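The scoping argument above can be modeled on the host with a small function. This is a sketch under assumptions, not the CUDA kernel: `ResolveCacheRow` is a hypothetical name, position_ids is assumed to be laid out as (B, S), and a return value of -1 stands for "pass the input through unchanged".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of the per-element row selection.
// Returns the cos/sin cache row to read, or -1 to signal pass-through.
int64_t ResolveCacheRow(int position_ids_format, int b, int s, int sequence_length,
                        int max_sequence_length, const std::vector<int64_t>& position_ids) {
  const int64_t b_s_index = static_cast<int64_t>(b) * sequence_length + s;
  if (position_ids_format == 0) {
    // No position_ids: the cache has B*S rows, so b_s_index is valid by construction.
    return b_s_index;
  }
  // Explicit position_ids: the value is user-controlled, so it is checked
  // against the cache bound before it is used as a row index.
  const int64_t raw_pos = position_ids[static_cast<std::size_t>(b_s_index)];
  if (raw_pos < 0 || raw_pos >= max_sequence_length) {
    return -1;
  }
  return raw_pos;
}
```

With B = 2, S = 4 and format 0, `max_sequence_length` equals 4 while the rows for the second batch are 4 through 7. A check of the form `row >= max_sequence_length` applied unconditionally would reject all of them, which is why the check lives only in the explicit-position_ids branch.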

Tests

Added three CPU test cases for the ONNX domain RotaryEmbedding op:

RotaryEmbedding_PositionIds_ExceedsMaxSeqLen — position_id far exceeding cache size
RotaryEmbedding_PositionIds_Negative — negative position_id
RotaryEmbedding_PositionIds_OOB_InBatch — OOB position_id in a multi-batch, multi-sequence scenario

Motivation and Context

Security hardening — prevent out-of-bounds memory access from untrusted model inputs.

Copilot

Pull request overview

This PR hardens the ONNX-domain RotaryEmbedding operator against out-of-bounds reads when user-provided position_ids contain invalid indices relative to the cos/sin cache (max_sequence_length), addressing a potential correctness and security issue.

Changes:

CPU: Validate position_ids values upfront (when explicitly provided) and return INVALID_ARGUMENT on out-of-range values.
CUDA: Plumb max_sequence_length into the kernel and add a device-side bounds check (pass-through on OOB since kernels can’t surface errors).
Tests: Add CPU unit tests that assert invalid position_ids are rejected with an appropriate error substring.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`onnxruntime/core/providers/cpu/llm/rotary_embedding.cc`	Adds upfront `position_ids` range validation to prevent OOB cache access on CPU.
`onnxruntime/core/providers/cuda/llm/rotary_embedding_impl.cu`	Passes `max_sequence_length` to the CUDA kernel and guards cache indexing for explicit `position_ids`.
`onnxruntime/test/providers/cpu/llm/rotary_embedding_op_test.cc`	Adds negative / exceeds-max / in-batch OOB test cases for CPU failure behavior.


titaiwangms · 2026-03-09T20:42:29Z

@tianleiwu I remember ONNX RotaryEmbedding is copied from Contrib op. Did you also fix on that side?

titaiwangms · 2026-03-09T21:06:23Z

4. Consolidated Findings

4.1 Must-Fix Issues

#	Finding	Severity	Source
F1	int64→int truncation before bounds check	🟠 High	Code Review, Critical Review

Both CPU and CUDA cast position_ids[i] (int64_t) to int before the bounds check:

int pos = static_cast<int>(position_ids[i]);         // truncation happens here
if (pos < 0 || pos >= max_sequence_length) { ... }   // check is on truncated value

A value like 4294967296 + N (where N < max_seq_len) truncates to N, passes the check, and silently produces wrong results. While not exploitable for memory corruption (the same truncation occurs in both validation and computation), it defeats the purpose of validation.
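The truncation described here can be reproduced in isolation. The two functions below are hypothetical and only contrast the order of operations. The sketch assumes a 32-bit `int` and modular narrowing, which C++20 guarantees and mainstream compilers already implemented before that.

```cpp
#include <cassert>
#include <cstdint>

// Flawed ordering: narrow first, then check the narrowed value.
bool PassesCheckAfterTruncation(int64_t position_id, int max_sequence_length) {
  const int pos = static_cast<int>(position_id);  // high 32 bits are discarded here
  return pos >= 0 && pos < max_sequence_length;
}

// Ordering the review asks for: check the full 64-bit value, narrow afterwards.
bool PassesCheckBeforeTruncation(int64_t position_id, int max_sequence_length) {
  return position_id >= 0 && position_id < static_cast<int64_t>(max_sequence_length);
}
```

For `position_id = 4294967296 + 5` and `max_sequence_length = 8`, the first function accepts the value because it sees 5, while the second rejects it.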

One-line fix (CPU):

int64_t pos64 = position_ids[i];
if (pos64 < 0 || pos64 >= static_cast<int64_t>(max_sequence_length)) {
    return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                           "position_ids value ", pos64, " at index ", i,
                           " is out of range [0, ", max_sequence_length, ")");
}

One-line fix (CUDA):

int64_t raw_pos = position_ids[b_s_index];
if (raw_pos < 0 || raw_pos >= static_cast<int64_t>(max_sequence_length)) {
    output_data[i] = input_data[i];
    return;
}
b_s_index = static_cast<int>(raw_pos);

4.2 Tracked Separately (Separate PR Recommended)

#	Finding	Severity	Source
F2	Identical OOB vulnerability in contrib_ops CPU/CUDA	🔴 Critical (but out of scope)	Architect, Critical Review

The contrib_ops implementations at contrib_ops/cpu/bert/rotary_embedding.cc (L80-82) and contrib_ops/cuda/bert/rotary_embedding_impl.cu (L71-81) have the identical vulnerability. However, because the implementations have diverged significantly (different input order, different position_ids semantics, 3 vs 2 formats, extra parameters), the fix requires different validation logic:

Format 0 (offset): Validate position_ids[0] + sequence_length - 1 < max_sequence_length
Format 1 (explicit): Same as ONNX fix
Format 2 (past_seqlens): Validate past_sequence_lengths[b] + sequence_length - 1 < max_sequence_length
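The three rules above could be combined into one host-side check along the following lines. This is a sketch, not the contrib_ops code: both function names are hypothetical, `values` stands for position_ids (formats 0 and 1) or past_sequence_lengths (format 2), and the non-negativity test for formats 0 and 2 is an added assumption that the list above does not state.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: true when start, start + 1, ..., start + sequence_length - 1
// are all valid cache rows. Written as start <= max - len, which is equivalent to
// start + len - 1 < max but cannot overflow for non-negative inputs.
bool OffsetRangeFits(int64_t start, int64_t sequence_length, int64_t max_sequence_length) {
  return start >= 0 && start <= max_sequence_length - sequence_length;
}

// Hypothetical dispatcher over the three position_ids formats.
bool ContribPositionsInRange(int format, const std::vector<int64_t>& values,
                             int64_t sequence_length, int64_t max_sequence_length) {
  switch (format) {
    case 0:  // single offset shared by the whole batch
      return !values.empty() &&
             OffsetRangeFits(values[0], sequence_length, max_sequence_length);
    case 1:  // explicit id per token
      for (int64_t pos : values) {
        if (pos < 0 || pos >= max_sequence_length) return false;
      }
      return true;
    case 2:  // one past length per batch entry
      for (int64_t past : values) {
        if (!OffsetRangeFits(past, sequence_length, max_sequence_length)) return false;
      }
      return true;
    default:  // unknown format: reject rather than guess
      return false;
  }
}
```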

Recommendation: File a tracked security issue and address in a follow-up PR. The scope of this PR is well-defined for the ONNX domain.

4.3 Should-Fix (In This PR)

#	Finding	Severity	Source
F3	Add boundary test: `position_id = max_sequence_length` (exact boundary, should fail)	🟡 Medium	Code Review
F4	Add test for large int64 value (e.g., `INT_MAX + 1`) to validate the truncation fix	🟡 Medium	Code Review, Critical Review
F5	Improve CPU validation comment to explain WHY (security boundary, not just WHAT)	🟢 Low	Readability Review

4.4 Nice-to-Have (Non-Blocking)

#	Finding	Severity	Source
F6	Extract test helper to reduce ~60 lines of boilerplate across 3 tests	🟢 Low	Readability Review
F7	Rename `b_s_index` after reassignment for semantic clarity (it becomes a cache row index, not a batch-seq index)	🟢 Low	Readability Review
F8	Add TODO/comment about CPU vs CUDA behavioral inconsistency (error vs pass-through)	🟢 Low	Code Review, Readability Review
F9	CUDA-specific OOB test (gated on `HasCudaEnvironment`)	🟢 Low	Code Review

5. What's Correct and Well-Done

All four reviewers agreed these aspects are positive:

Architecturally sound approach: CPU validates upfront (fail-fast), CUDA checks in-kernel (only viable option). Clean separation of concerns.
Excellent error messages: CPU error includes value, index, and valid range — immediately actionable for debugging.
Minimal, surgical diff: +118/-5 lines. No existing behavior changes for valid inputs.
Correct max_sequence_length plumbing: CUDA fix properly uncomments the previously-unused parameter and threads it through to the kernel.
No race conditions: Validation is sequential before parallel work. Parallel loop reads position_ids read-only.
CUDA thread consistency: All threads in a block compute the same b_s_index, so either all pass through or all compute — no partial application within a head.
Performance impact: Negligible. CPU validation loop is O(B×S) vs compute O(B×S×N×H). CUDA adds ~2 instructions with no warp divergence on the happy path.
Test naming and organization: Follows existing patterns, descriptive names, good inline comments.

6. Verdict: Approve with Changes

Required for merge:

F1: Fix int64→int truncation — Compare as int64_t before casting. One-line change in each file (CPU + CUDA). This is a correctness gap in the security validation.

Strongly recommended for this PR:

F3/F4: Add boundary value test and large-int64 test to validate the truncation fix.
F5: Improve the CPU validation comment.

Separate follow-up:

F2: File tracked issue for contrib_ops OOB fix (requires different validation logic for 3 position_ids formats).

Nice-to-have (author's discretion):

F6: Test helper to reduce boilerplate
F7: Variable rename for clarity
F8: Cross-reference comment about CPU/CUDA inconsistency
F9: CUDA OOB test

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

onnxruntime/contrib_ops/cuda/bert/rotary_embedding_impl.cu:91

The new bounds checks cover position_ids formats 0 and 1, but format 2 (past_sequence_length + s) can still produce negative or >= max_sequence_length position_id values, leading to out-of-bounds reads from cos_cache/sin_cache. Add the same range validation for format 2 (and handle negative past_sequence_lengths[b]), falling back to pass-through on OOB like the other formats.

    position_id = static_cast<int>(pos);
  } else if (position_ids_format == 2) {
    // format 2: past_sequence_length + s
    // used for Decoding (past_sequence_length = seqlens_k[b]) or First Prompt (past=0 if nullptr)
    int past = (past_sequence_lengths == nullptr) ? 0 : past_sequence_lengths[b];
    position_id = past + s;
  }


Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

onnxruntime/contrib_ops/cuda/bert/rotary_embedding_impl.cu:93

In the CUDA contrib rotary embedding kernel, the new bounds checks cover formats 0 and 1, but the format-2 path (position_id = past_sequence_lengths[b] + s) still computes cache_offset without validating that the resulting position_id is within [0, max_sequence_length). If position_ids_format=2 is ever used with a large/negative past_sequence_lengths value, this can still read out of bounds from cos_cache/sin_cache. Please add an equivalent bounds check for format 2 (e.g., validate past in range and that past + sequence_length <= max_sequence_length) and apply the same pass-through behavior on failure.

  } else if (position_ids_format == 2) {
    // format 2: past_sequence_length + s
    // used for Decoding (past_sequence_length = seqlens_k[b]) or First Prompt (past=0 if nullptr)
    int past = (past_sequence_lengths == nullptr) ? 0 : past_sequence_lengths[b];
    position_id = past + s;
  }



Add shader-side bounds checks to the WebGPU RotaryEmbedding and FusedQKRotaryEmbedding GPU shaders to prevent out-of-bounds reads from cos_cache/sin_cache when position_ids values exceed the cache dimensions.

For RotaryEmbeddingProgram:
- Check raw_pos < 0 to catch negative position_ids (i32 from truncated int64 avoids u32 wraparound bypass)
- Check position_id >= cos_cache_shape[0] after u32 conversion and sequence offset addition
- On OOB, pass through input unchanged (matches CUDA kernel behavior)

For FusedQKRotaryEmbeddingProgram:
- Check position_id >= cos_cache_shape[0] before accessing cos/sin cache
- On OOB, pass through both Q and K inputs unchanged

This complements the CPU and CUDA fixes from PR #27597 (commit 056bab3) which missed the WebGPU execution provider.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
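The reason for testing the sign before the unsigned conversion can be shown with fixed-width C++ integers. This is a hypothetical model of the index arithmetic, not the WGSL shader from the commit, and the function and parameter names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Checking only after the unsigned conversion: -1 becomes 4294967295, and
// adding a sequence offset of 1 wraps around to 0, which looks like a valid row.
bool InBoundsUnsignedOnly(int32_t raw_pos, uint32_t sequence_offset, uint32_t cache_rows) {
  const uint32_t position_id = static_cast<uint32_t>(raw_pos) + sequence_offset;
  return position_id < cache_rows;
}

// Checking the sign first closes that gap; the unsigned comparison then only
// ever sees values that started out non-negative.
bool InBoundsSignedThenUnsigned(int32_t raw_pos, uint32_t sequence_offset, uint32_t cache_rows) {
  if (raw_pos < 0) {
    return false;
  }
  const uint32_t position_id = static_cast<uint32_t>(raw_pos) + sequence_offset;
  return position_id < cache_rows;
}
```

With `raw_pos = -1`, `sequence_offset = 1` and 16 cache rows, the first function reports the access as in bounds and the second does not.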

Add host-side validation of position_ids values before shader dispatch in all three WebGPU RotaryEmbedding implementations. This prevents out-of-bounds reads from cos_cache/sin_cache when position_ids values exceed the cache dimensions.

Changes:

1. contrib_ops/webgpu/bert/rotary_embedding.cc:
- Add InputMemoryType(OrtMemTypeCPUInput, 1) to keep position_ids on CPU for validation
- Add bounds checking in ComputeInternal() before shader dispatch:
  format 0 (scalar): base_pos in [0, max_seq_len - seq_len]
  format 1 (2D array): each value in [0, max_sequence_length)
- Returns INVALID_ARGUMENT error on violation
- Shader-side bounds checks remain as defense-in-depth

2. core/providers/webgpu/llm/rotary_embedding.cc:
- Add InputMemoryType(OrtMemTypeCPUInput, 3) for optional position_ids input
- Add bounds checking in the position_ids != nullptr branch
- Returns INVALID_ARGUMENT error on violation

3. js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts:
- Add value validation in validateInputs() using getBigInt64Array()
- Validates both format 0 (scalar offset) and format 1 (2D array)
- Throws Error with descriptive message on violation

All three implementations follow the same validation pattern as the CPU contrib fix (PR #27597), returning errors rather than silently passing through.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]


…ls (#28214)

This PR adds position_ids bounds checking to WebGPU and JS RotaryEmbedding implementations, completing the security fix started in PR #27597 (commit 056bab3) which covered CPU and CUDA.

## Problem

The `com.microsoft::RotaryEmbedding` kernel uses position_ids as row indices into cos_cache/sin_cache without bounds validation. While PR #27597 fixed CPU and CUDA paths, WebGPU and JS implementations were still missing bounds checks, which could produce silently wrong results (WebGPU hardware clamps OOB reads).

## Changes

- **contrib_ops/webgpu/bert/rotary_embedding.cc**: Host-side validation (ORT_MAKE_STATUS) + shader-side defense-in-depth (pass-through on OOB)
- **core/providers/webgpu/llm/rotary_embedding.cc**: Host-side validation with format-0 awareness
- **js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts**: TypeScript validation using getBigInt64Array
- **7 new C++ OOB test cases** across contrib and ONNX domains targeting WebGPU EP

## Security

Addresses the same vulnerability as #27597 (OOB read via position_ids, CVSS 7.5-9.1) for WebGPU/JS execution providers.

## Testing

- 7 new unit tests (3 contrib + 4 ONNX domain) with GTEST_SKIP when WebGPU EP unavailable
- JS/TS error tests not feasible with current JSONC test format (documented)
- Build environment lacks C++20/emsdk for full compilation verification; validated structurally

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix rotary OOB

9d8b722

tianleiwu requested review from Copilot and titaiwangms March 9, 2026 20:23

Copilot started reviewing on behalf of tianleiwu March 9, 2026 20:24

Copilot AI reviewed Mar 9, 2026


Comment thread onnxruntime/core/providers/cpu/llm/rotary_embedding.cc

Comment thread onnxruntime/core/providers/cuda/llm/rotary_embedding_impl.cu

Comment thread onnxruntime/core/providers/cuda/llm/rotary_embedding_impl.cu Outdated

tianleiwu added 2 commits March 9, 2026 21:54

address feedback

c4bc32a

fix contrib op

133a73b

tianleiwu requested a review from Copilot March 10, 2026 05:01

Copilot started reviewing on behalf of tianleiwu March 10, 2026 05:02

Copilot AI reviewed Mar 10, 2026


Comment thread onnxruntime/contrib_ops/cpu/bert/rotary_embedding.cc Outdated

Comment thread onnxruntime/contrib_ops/cuda/bert/rotary_embedding_impl.cu Outdated

review feedbacks

2f95d8b

tianleiwu requested a review from Copilot March 10, 2026 06:21

Copilot started reviewing on behalf of tianleiwu March 10, 2026 06:22

Copilot AI reviewed Mar 10, 2026


Comment thread onnxruntime/test/contrib_ops/rotary_embedding_op_test.cc

tianleiwu marked this pull request as draft March 10, 2026 07:55

tianleiwu changed the title ~~Fix out-of-bounds read in the RotaryEmbedding operator~~ Add validation of position_ids in RotaryEmbedding operators Mar 10, 2026

update

9b649e0

tianleiwu marked this pull request as ready for review March 10, 2026 23:28

add CUDA_KERNEL_ASSERT

4f21ca3

titaiwangms approved these changes Mar 11, 2026


tianleiwu enabled auto-merge (squash) March 11, 2026 17:34

tianleiwu merged commit 056bab3 into main Mar 12, 2026
91 of 93 checks passed

tianleiwu deleted the tlwu/20260309/fix_rotary_oob branch March 12, 2026 04:17

BrewTestBot mentioned this pull request Apr 20, 2026

onnxruntime 1.25.0 Homebrew/homebrew-core#278543

Merged

titaiwangms mentioned this pull request Apr 23, 2026

Add position_ids bounds validation to WebGPU/JS RotaryEmbedding kernels #28214

Merged

dependabot Bot mentioned this pull request Apr 28, 2026

Bump Microsoft.ML.OnnxRuntime from 1.24.4 to 1.25.1 chamber-19/Foundry#93

Closed

apsonawane mentioned this pull request Apr 29, 2026

Validate seqlens_k against cos_cache bounds in GroupQueryAttention to… #28277

Open

tianleiwu merged 6 commits into main from tlwu/20260309/fix_rotary_oob


4 participants
