Add RotaryEmbedding fusion for Qwen3 on-the-fly RoPE patterns#27590
Conversation
Below is an AI analysis (some of it may not be correct): Modeling Code Alignment Check (vs
| Category | Issue | Severity | Action |
|---|---|---|---|
| max_seq_len = 2048 | Causes OOB memory access for sequences > 2048 | Critical | Increase to 131072 or make configurable |
| inv_freq tracing fallback | Uses wrong name when Cast/Expand nodes present | Medium (latent) | Track leaf input during traversal |
| Test coverage gap | Test model skips Expand/Cast in inv_freq path | Low | Add variant test |
| Redundant Concat + truncation | Dead computation in cache generation | Low | Simplify for clarity |
| Missing return (pre-existing) | Fallthrough after mismatched cache paths | Low | Not introduced in this PR |
The fusion logic is structurally sound and the modeling code alignment is verified correct. The critical blocker is the max_seq_len = 2048 limit which will cause runtime crashes for real Qwen3 inference beyond 2048 tokens.
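For context, the cache that the fusion materializes can be sketched as follows. This is a minimal NumPy sketch, not the PR's actual code; the helper name and the base frequency 10000.0 are assumptions for illustration.

```python
import numpy as np

def build_cos_sin_cache(inv_freq: np.ndarray, max_seq_len: int):
    # freqs[p, i] = p * inv_freq[i]; cos/sin of this gives the cache rows
    positions = np.arange(max_seq_len, dtype=np.float32)
    freqs = np.outer(positions, inv_freq.astype(np.float32))
    return np.cos(freqs), np.sin(freqs)

# head_dim = 128 -> 64 inverse frequencies; at max_seq_len = 131072 the two
# float32 caches take 131072 * 64 * 4 bytes * 2 = 64 MiB, matching the
# "~64 MB" estimate discussed above (base frequency 10000.0 assumed)
head_dim = 128
inv_freq = 1.0 / (10000.0 ** (np.arange(0, head_dim, 2, dtype=np.float32) / head_dim))
cos_cache, sin_cache = build_cos_sin_cache(inv_freq, max_seq_len=4096)
```

Any sequence position beyond `max_seq_len` would index past the end of these caches, which is why the hardcoded 2048 was flagged as a critical out-of-bounds risk.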
|
Thanks for the thorough analysis! Addressed all items in 3a2cd4b:
- Critical (#1) — `max_seq_len = 2048` → 131072: Increased to 131072 to cover most LLM contexts (Qwen3 default is 32768; many models go up to 128k). Memory cost for head_dim=128 is ~64 MB — very modest.
- Medium (#2) — inv_freq tracing bug: Fixed by tracking the leaf input name during traversal through Cast/Expand/Where/Unsqueeze nodes, so the correct initializer name is used even when
- Low (#3) — Test coverage gap: Added
- Low (#4) — Redundant Concat + truncation: Simplified to use
- Code scanning (#5): Removed the unnecessary premature assignment of

All 16 tests pass (14 existing + 2 new Qwen3 RoPE tests). Lintrunner clean.
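The leaf-tracking fix for item #2 can be illustrated with a small sketch. The names here are hypothetical; the real traversal lives in the fusion's inv_freq tracing logic and may differ in detail.

```python
from types import SimpleNamespace

# Ops that only reshape/retype the tensor; the inv_freq name passes through
PASS_THROUGH = {"Cast", "Expand", "Unsqueeze"}

def trace_leaf_input(start_name: str, output_to_node: dict) -> str:
    """Walk producer nodes upward, remembering the name actually consumed,
    so the initializer lookup uses the leaf name rather than an
    intermediate Cast/Expand output name."""
    name = start_name
    node = output_to_node.get(name)
    while node is not None:
        if node.op_type in PASS_THROUGH:
            name = node.input[0]
        elif node.op_type == "Where":
            name = node.input[1]  # Where(cond, x, y): data flows through x
        else:
            break
        node = output_to_node.get(name)
    return name

# Toy graph: Cast(Expand(Where(cond, inv_freq, fallback))) feeding the MatMul
where_n = SimpleNamespace(op_type="Where", input=["cond", "inv_freq", "fallback"])
expand_n = SimpleNamespace(op_type="Expand", input=["where_out", "shape"])
cast_n = SimpleNamespace(op_type="Cast", input=["expand_out"])
graph = {"where_out": where_n, "expand_out": expand_n, "cast_out": cast_n}
```

Without the fix, the fallback would return the starting name (here `cast_out`), which names no initializer; the traversal above resolves to `inv_freq`.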
Pull request overview
This PR extends the ONNX Runtime transformer optimizer (FusionRotaryEmbeddings) to handle Qwen3's on-the-fly rotary position embedding (RoPE) computation pattern, where cos/sin are computed from inv_freq at runtime via MatMul rather than being looked up from a pre-computed cache.
Changes:
- Add new path patterns (`sin_path_5`/`cos_path_5`, `rotate_half_x2_path_2_3`/`_2_4`, `rotate_half_x1_path_2_3`/`_2_4`) for Qwen3's Cast-tolerant RoPE pattern
- Add `create_cos_sin_cache_from_on_the_fly_rope()` helper that extracts `inv_freq`, computes cos/sin caches at optimization time, and adds them as model initializers
- Add `create_qwen3_decoder_layer()` test generator and new tests for both on-the-fly RoPE and the Cast+Expand+Where variant
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `fusion_rotary_attention.py` | Core changes: new path patterns for Qwen3 RoPE, on-the-fly cache generation helper, updated `fuse()` to handle on-the-fly vs. cache-based paths separately |
| `optimizer.py` | Registers "qwen3" model type with `Gpt2OnnxModel` and `opt_level=0` |
| `fusion_options.py` | Sets Qwen3-specific defaults: disables EmbedLayerNorm, uses NoMask attention |
| `fusion_skiplayernorm.py` | Removes early return when symbolic shape inference fails, allowing Skip LN fusion with `skip_index=1` default |
| `qwen3_model_generator.py` | New test graph generator for Qwen3 decoder layer with optional on-the-fly RoPE and Expand/Where inv_freq path |
| `test_attention_fusion.py` | Adds 2 new tests for on-the-fly RoPE fusion and its Cast+Expand+Where variant |
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
tianleiwu left a comment:
The core logic is correct and well-tested. The fusion_skiplayernorm.py change is safe given the pre-existing checks. The on-the-fly RoPE fusion is a valuable addition for Qwen3 model optimization.
Minor Suggestions
- `fusion_skiplayernorm.py` behavioral change (when `shape_infer_helper` is `None` because shape inference failed, previously the fusion was skipped entirely with an early return; now it proceeds with `skip_index=1` as a safe default) affects all model types. Consider adding more specific comments about the safety guarantee.
- Hardcoded `max_seq_len=131072` — acceptable for now but may need to be configurable in the future.
- No negative test cases — consider adding tests for malformed graphs or unsupported variants to ensure graceful fallback.
- No numerical validation in tests — the tests only count fused nodes but don't verify that the generated cos/sin caches match expected values.
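The numerical-validation suggestion could look roughly like this. A hedged sketch only: the helper name, tolerance, and checked positions are assumptions, not the test that was actually added.

```python
import numpy as np

def check_cache_values(cos_cache, sin_cache, inv_freq, positions=(0, 1, 100, 1000)):
    # Spot-check cache rows against the reference formula: angle = p * inv_freq
    for p in positions:
        angles = np.float32(p) * inv_freq.astype(np.float32)
        assert np.allclose(cos_cache[p], np.cos(angles), atol=1e-5)
        assert np.allclose(sin_cache[p], np.sin(angles), atol=1e-5)

# Build a small reference cache and validate it (base frequency 10000.0 assumed)
inv_freq = 1.0 / (10000.0 ** (np.arange(0, 128, 2, dtype=np.float32) / 128))
freqs = np.outer(np.arange(2048, dtype=np.float32), inv_freq)
check_cache_values(np.cos(freqs), np.sin(freqs), inv_freq)
```

In a real test, `cos_cache`/`sin_cache` would be read back from the initializers the fusion added to the model, rather than recomputed.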
@Rishi-Dave, some CI pipelines failed, please take a look. For example, there is a test error in transformers/test_attention_fusion.py::TestFusion::test_gpt2_attention_no_past_fusion. You can reproduce it locally (you need to build and reinstall the wheel first):
Force-pushed 09507b2 to 190c66e
Thanks for flagging! The CI failure in
The fallback was incorrectly fusing nodes in the GPT-2 no-past graph where shape inference fails. Fix: Rebased onto latest
Force-pushed 0ed8785 to bfc0ffc
Thanks for the suggestions! Addressed in 78dc48f:
- Negative test (#3): Added
- Numerical validation (#4): Added

All 19 tests pass (17 existing + 2 new). Lintrunner clean.
Extend FusionRotaryEmbeddings to handle Qwen3's on-the-fly rotary position embedding computation, where cos/sin values are computed from inv_freq at runtime instead of being looked up from a pre-computed cache.

Changes:
- Add Cast-tolerant rotate_half path patterns for TorchScript exports that insert Cast nodes between Unsqueeze and Div
- Add sin_path_5/cos_path_5 patterns matching the on-the-fly computation: MatMul → Transpose → Concat → Cos/Sin → Mul(scaling) → Unsqueeze → Mul, with optional Cast variant
- Add create_cos_sin_cache_from_on_the_fly_rope() helper that extracts inv_freq weights, computes cos/sin caches as initializers, and traces position_ids from the graph
- Handle per-layer vs shared node removal correctly (only remove per-layer Unsqueeze/outer Mul; shared MatMul/Cos/Sin nodes are pruned automatically)
- Update qwen3_model_generator.py with full RoPE computation graph
- Add test_qwen3_rotary_embedding_fusion verifying 2 RotaryEmbedding nodes are fused

Verified on real Qwen3-Embedding-0.6B: 56 RotaryEmbedding fused (28 layers × 2), reducing 7416 → 4661 nodes (37% reduction).
…cache
- Increase max_seq_len from 2048 to 131072 to prevent OOB memory access for sequences beyond 2048 tokens (Qwen3 default is 32768)
- Fix inv_freq tracing to track the leaf input name through Cast/Expand/Where/Unsqueeze nodes, preventing a wrong fallback name when intermediate nodes are present
- Simplify cache computation: use freqs directly instead of redundant Concat(freqs, freqs) followed by truncation to the first half
- Remove unnecessary premature variable assignments flagged by code scanning (position_ids_from_sin/cos_path)
- Add test_qwen3_rotary_embedding_fusion_with_expand covering the Cast → Expand → Where traversal path in inv_freq tracing
- Fix inv_freq tracing through Where nodes: follow input[1] (true branch / data path) instead of input[0] (condition). Where has 3 inputs [condition, x, y] and inv_freq flows through x.
- Use numpy_helper.from_array for cache tensor serialization instead of flatten().tolist(), avoiding an intermediate Python list for ~8M float values
- Remove unused sin_path parameter from create_cos_sin_cache_from_on_the_fly_rope (only cos_path is used)
- Remove unused max_seq_len parameter from _on_the_fly_rope_nodes
- Early-return from create_cos_sin_cache_from_on_the_fly_rope when cos/sin cache initializers already exist, avoiding redundant 131072 × head_dim/2 cos/sin computation on every layer's fusion
- Guard per-layer node removal with a single-consumer check to prevent removing shared nodes that still have other consumers
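The single-consumer guard might be sketched like this. A hypothetical helper for illustration; the PR's actual check likely uses OnnxModel graph utilities instead.

```python
from types import SimpleNamespace

def has_single_consumer(output_name: str, nodes) -> bool:
    # Count graph nodes that read this output. Shared nodes (e.g. the
    # Cos/Sin chain feeding every layer) have more than one consumer and
    # must not be removed while fusing a single layer.
    return sum(output_name in n.input for n in nodes) == 1

# Toy graph: one Cos output shared by two layers, one per-layer Unsqueeze input
mul_layer0 = SimpleNamespace(op_type="Mul", input=["cos_out", "q0"])
mul_layer1 = SimpleNamespace(op_type="Mul", input=["cos_out", "q1"])
unsqueeze0 = SimpleNamespace(op_type="Unsqueeze", input=["u0_in"])
nodes = [mul_layer0, mul_layer1, unsqueeze0]
```

Nodes whose outputs still feed other layers are left in place; once the last consumer is fused, dead-node pruning removes them automatically.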
Remove the shape-inference-failed fallback in FusionSkipLayerNormalization that was causing test_gpt2_attention_no_past_fusion to fail — the fallback allowed SkipLayerNorm fusion without shape validation, which incorrectly fused nodes in the GPT-2 no-past graph. Update Qwen3 test assertions to expect 4 SimplifiedLayerNormalization (no SkipSLN fusion when shape inference is unavailable for the synthetic test graph).
Address reviewer feedback: add test for graceful fallback when inv_freq is a dynamic graph input (not an extractable initializer), and add numerical validation that verifies cos/sin cache values match the expected mathematical computation at multiple positions.
Force-pushed 78dc48f to 41f3a17
@tianleiwu — Rebased onto latest upstream/main and force-pushed. All feedback has been addressed:

Your suggestions:
CI fix:
Test results (post-rebase):

Ready for re-review when you have a chance.
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
Description
Extend `FusionRotaryEmbeddings` to handle Qwen3's on-the-fly rotary position embedding computation, where cos/sin values are computed from `inv_freq` at runtime instead of being looked up from a pre-computed cache.

This is a follow-up to #27556 (Qwen3 basic model type support). Depends on #27556.
Part of #25083.
Motivation and Context
Qwen3 models (ranked 4th on MTEB) compute RoPE differently from existing supported models (Phi, LLaMA, etc.). Instead of pre-computing cos/sin caches and looking them up via `Gather(cache, position_ids)`, Qwen3 computes them on-the-fly:

Additionally, TorchScript exports of Qwen3 insert `Cast` nodes in the `rotate_half` pattern (from `torch.floor_divide` tracing), which the existing path patterns don't account for.

Changes
`fusion_rotary_attention.py`:
- `rotate_half` path patterns (`rotate_half_x2_path_2_3`, `_2_4`, `rotate_half_x1_path_2_3`, `_2_4`) that allow 1-2 Cast nodes between Unsqueeze and Div in the dynamic Slice index computation
- `sin_path_5`/`cos_path_5` patterns matching the on-the-fly computation: MatMul → Transpose → Concat → Cos/Sin → Mul(scaling) → Unsqueeze → Mul, with an optional Cast variant (the optimizer's earlier Cast fusion pass may remove the Cast)
- `create_cos_sin_cache_from_on_the_fly_rope()` helper that extracts `inv_freq` weights, computes cos/sin caches as model initializers, and traces `position_ids` from the graph

`qwen3_model_generator.py`:
- `include_rope=True` parameter to `create_qwen3_decoder_layer()`
- `inv_freq` initializer, `position_ids` input, MatMul/Transpose/Concat/Cos/Sin/Mul nodes, and `rotate_half` pattern with dynamic Slice indices (including Cast nodes from floor division)

`test_attention_fusion.py`:
- `test_qwen3_rotary_embedding_fusion` verifying 2 RotaryEmbedding nodes are fused along with 3 SimplifiedLayerNormalization and 1 SkipSimplifiedLayerNormalization

Verification
- `test_attention_fusion.py` tests pass (14 existing + 1 new)
- `lintrunner -a` clean on all modified files
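The on-the-fly pattern described under Motivation and Context can be sketched in NumPy. Shapes and the base frequency 10000.0 are assumptions; the exported graph expresses the same math with MatMul/Transpose/Concat/Cos/Sin nodes.

```python
import numpy as np

head_dim, seq_len = 128, 16
inv_freq = 1.0 / (10000.0 ** (np.arange(0, head_dim, 2, dtype=np.float32) / head_dim))

# MatMul: inv_freq [1, head_dim/2, 1] @ position_ids [1, 1, seq_len]
position_ids = np.arange(seq_len, dtype=np.float32)[None, None, :]
freqs = inv_freq[None, :, None] @ position_ids   # [1, head_dim/2, seq_len]
freqs = np.transpose(freqs, (0, 2, 1))           # Transpose -> [1, seq_len, head_dim/2]
emb = np.concatenate([freqs, freqs], axis=-1)    # Concat    -> [1, seq_len, head_dim]
cos, sin = np.cos(emb), np.sin(emb)              # Cos / Sin, applied to q/k at runtime
```

Because `inv_freq` is a constant initializer, the fusion can evaluate this subgraph once at optimization time and replace it with cos/sin cache lookups, which is exactly what `create_cos_sin_cache_from_on_the_fly_rope()` does.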