Skip to content

Validate seqlens_k against cos_cache bounds in GroupQueryAttention to…#28277

Open
apsonawane wants to merge 8 commits intomainfrom
asonawane/embedlookup
Open

Validate seqlens_k against cos_cache bounds in GroupQueryAttention to…#28277
apsonawane wants to merge 8 commits intomainfrom
asonawane/embedlookup

Conversation

@apsonawane
Copy link
Copy Markdown
Contributor

Description

Validate seqlens_k values against cos_cache.shape[0] in GroupQueryAttention::Compute() when do_rotary is enabled, to prevent out-of-bounds reads in the rotary embedding lookup.

Root Cause

CheckRotaryCaches() validates cos_cache.shape[0] >= total_sequence_length, but runtime position IDs are derived from seqlens_k (a separate, per-batch input). An attacker can set total_sequence_length small enough to pass the guard while setting seqlens_k[b] far beyond cos_cache.shape[0], causing position_id = seqlens_k[b] to index out of bounds into the cos/sin cache. The resulting heap bytes are used as rotation values and propagate into the inference output.

Fix

Add an explicit bounds check in Compute() that rejects any seqlens_k[b] >= cos_cache.shape[0] before position IDs are computed. This is defense-in-depth alongside the existing RunRotaryEmbedding position_ids validation added in #27597.

Security

  • Impact: Heap OOB read (CWE-125) — adjacent heap memory leaks into inference output via cos/sin rotation values.
  • Attack vector: Any GQA-based LLM serving endpoint (Llama, Phi, Mistral) that accepts seqlens_k as an inference input. No model modification required.

Testing

Verified that crafted inputs with seqlens_k exceeding cos_cache dimensions now return INVALID_ARGUMENT instead of silently producing results containing leaked heap data.

@vraspar vraspar requested a review from Copilot April 29, 2026 19:03
Comment thread onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the CPU GroupQueryAttention implementation by validating runtime seqlens_k values against the rotary embedding cache length when do_rotary is enabled, preventing out-of-bounds reads from the cos/sin caches that could leak heap data into inference outputs.

Changes:

  • Add a per-batch bounds check in GroupQueryAttention::Compute() rejecting seqlens_k[b] >= cos_cache.shape[0] under do_rotary_.
  • Return INVALID_ARGUMENT with a descriptive error when the bound is violated.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc Outdated
Comment thread onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc Outdated
Comment thread onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc Outdated
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc Outdated
Comment thread onnxruntime/test/contrib_ops/group_query_attention_op_test.cc Outdated
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/test/contrib_ops/group_query_attention_op_test.cc Outdated
Copy link
Copy Markdown
Contributor

@vraspar vraspar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice security fix -- the gap between CheckRotaryCaches (validates against total_sequence_length) and the actual position ID derivation (from seqlens_k) is a real and subtle bug. Good catch. A few non-blocking nits below.


Nit 1: CUDA EP coverage — This check only protects the CPU EP. The CUDA GQA kernel also uses do_rotary and derives position IDs similarly. Consider moving this validation into CheckInputs() in group_query_attention_helper.h (around line 282, after CheckRotaryCaches) so all EPs get the protection in one place. That said, seqlens_k may be on GPU in the CUDA path making a host-side loop infeasible, so the current placement is pragmatic.

Nit 2: Negative seqlens_k values — The condition seqlens_k_data[b] >= rotary_cache_max_seq will let negative values through, which would also produce OOB position IDs. Adding seqlens_k_data[b] < 0 to the check would catch that here with a clear error message. Likely already caught downstream by the RunRotaryEmbedding validation from #27597, so not urgent.

Nit 3: Multi-batch test — The two test cases cover the key scenarios well. A possible addition: a batch_size=2 test where seqlens_k = {3, 10} (one valid, one OOB) to verify the loop iterates correctly and the error message includes the right batch index (seqlens_k[1] = 10).

Overall this is clean, well-scoped, and well-documented. The error messages are descriptive and the tests directly exercise the vulnerability path.

@apsonawane apsonawane enabled auto-merge (squash) May 4, 2026 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants