
Add RotaryEmbedding fusion for Qwen3 on-the-fly RoPE patterns#27590

Merged
tianleiwu merged 6 commits into microsoft:main from Rishi-Dave:rishidave/feat/qwen3-rotary-embedding-fusion
Mar 16, 2026

Conversation

@Rishi-Dave
Contributor

Description

Extend FusionRotaryEmbeddings to handle Qwen3's on-the-fly rotary position embedding computation, where cos/sin values are computed from inv_freq at runtime instead of being looked up from a pre-computed cache.

This is a follow-up to #27556 (Qwen3 basic model type support). Depends on #27556.

Part of #25083.

Motivation and Context

Qwen3 models (ranked 4th on MTEB) compute RoPE differently from existing supported models (Phi, LLaMA, etc.). Instead of pre-computing cos/sin caches and looking them up via Gather(cache, position_ids), Qwen3 computes them on-the-fly:

freqs = inv_freq_expanded @ position_ids_expanded   # MatMul
emb = torch.cat((freqs, freqs), dim=-1)             # Concat
cos = emb.cos() * attention_scaling                  # Cos, Mul
sin = emb.sin() * attention_scaling                  # Sin, Mul

Additionally, TorchScript exports of Qwen3 insert Cast nodes in the rotate_half pattern (from torch.floor_divide tracing), which the existing path patterns don't account for.

Changes

fusion_rotary_attention.py:

  • Add Cast-tolerant rotate_half path patterns (rotate_half_x2_path_2_3, _2_4, rotate_half_x1_path_2_3, _2_4) that allow 1-2 Cast nodes between Unsqueeze and Div in the dynamic Slice index computation
  • Add sin_path_5 / cos_path_5 patterns matching the on-the-fly computation: MatMul → Transpose → Concat → Cos/Sin → Mul(scaling) → Unsqueeze → Mul, with optional Cast variant (the optimizer's earlier Cast fusion pass may remove the Cast)
  • Add create_cos_sin_cache_from_on_the_fly_rope() helper that extracts inv_freq weights, computes cos/sin caches as model initializers, and traces position_ids from the graph
  • Handle per-layer vs shared node removal correctly (only remove per-layer Unsqueeze/outer Mul nodes; shared MatMul/Cos/Sin nodes are pruned automatically by the optimizer)
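As a rough illustration of what the new helper computes, here is a NumPy sketch under assumed shapes (the function name and signature are simplified for illustration, not the actual implementation):

```python
import numpy as np

def build_cos_sin_cache(inv_freq, max_seq_len, scaling=1.0):
    """Sketch of the cache the fusion precomputes from inv_freq.

    inv_freq: shape (head_dim // 2,). Returns cos/sin caches of shape
    (max_seq_len, head_dim // 2), the layout RotaryEmbedding indexes by
    position_id at runtime.
    """
    positions = np.arange(max_seq_len, dtype=np.float32)
    freqs = np.outer(positions, inv_freq)  # freqs[pos, i] = pos * inv_freq[i]
    return np.cos(freqs) * scaling, np.sin(freqs) * scaling

# Toy config: head_dim = 8, so inv_freq has 4 entries
inv_freq = 1.0 / (10000.0 ** (np.arange(0, 8, 2) / 8))
cos_cache, sin_cache = build_cos_sin_cache(inv_freq, max_seq_len=16)
```

Position 0 yields cos = 1 and sin = 0 for every frequency, which makes a handy sanity check on the first cache row.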

qwen3_model_generator.py:

  • Add include_rope=True parameter to create_qwen3_decoder_layer()
  • Generate full on-the-fly RoPE computation graph: inv_freq initializer, position_ids input, MatMul/Transpose/Concat/Cos/Sin/Mul nodes, and rotate_half pattern with dynamic Slice indices (including Cast nodes from floor division)
  • Apply RoPE to both Q and K paths

test_attention_fusion.py:

  • Add test_qwen3_rotary_embedding_fusion verifying 2 RotaryEmbedding nodes are fused along with 3 SimplifiedLayerNormalization and 1 SkipSimplifiedLayerNormalization

Verification

  • Unit tests: All 15 test_attention_fusion.py tests pass (14 existing + 1 new)
  • Real model: Verified on Qwen3-Embedding-0.6B (28 layers): 56 RotaryEmbedding nodes fused (28 layers × 2 per layer for Q and K), reducing total node count from 7416 → 4661 (37% reduction)
  • No regressions: All changes are additive alternative path patterns — existing models that use dynamic Slice indices or cache-based RoPE never hit the new paths
  • Lint: lintrunner -a clean on all modified files

@tianleiwu
Contributor

tianleiwu commented Mar 9, 2026

Below is an AI analysis (some of it might not be correct):


Modeling Code Alignment Check (vs transformers/models/qwen3)

Qwen3RotaryEmbedding.forward — Verified Correct

The HuggingFace Qwen3RotaryEmbedding.forward() computes:

inv_freq_expanded = self.inv_freq[None, :, None].float().expand(B, -1, 1)  # (B, head_dim/2, 1)
position_ids_expanded = position_ids[:, None, :].float()                    # (B, 1, S)
freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)         # (B, S, head_dim/2)
emb = torch.cat((freqs, freqs), dim=-1)                                     # (B, S, head_dim)
cos = emb.cos() * self.attention_scaling
sin = emb.sin() * self.attention_scaling

The PR's create_cos_sin_cache_from_on_the_fly_rope() pre-computes this as a cache:

freqs[pos, i] = pos * inv_freq[i]   # equivalent to the MatMul for sequential positions
emb = concat([freqs, freqs])         # matches the Concat(freqs, freqs)
cos_cache = cos(emb) * scaling       # matches cos * attention_scaling

Math is correct — the MatMul operation inv_freq @ position_ids is equivalent to the element-wise pos * inv_freq[i] for sequential position IDs, and the RotaryEmbedding op correctly indexes cos_cache[position_ids[b,s], :] at runtime.
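That equivalence is easy to verify numerically; the toy NumPy check below mirrors the shapes above (small head_dim and sequence length chosen for illustration):

```python
import numpy as np

head_dim, seq_len = 8, 5
inv_freq = 1.0 / (10000.0 ** (np.arange(0, head_dim, 2) / head_dim))  # (head_dim/2,)
position_ids = np.arange(seq_len, dtype=np.float64)

# HuggingFace-style MatMul path: (1, head_dim/2, 1) @ (1, 1, S), then transpose
inv_freq_expanded = inv_freq[None, :, None]
position_ids_expanded = position_ids[None, None, :]
freqs_matmul = (inv_freq_expanded @ position_ids_expanded).transpose(0, 2, 1)[0]

# Cache-style element-wise path: freqs[pos, i] = pos * inv_freq[i]
freqs_cache = np.outer(position_ids, inv_freq)

assert np.allclose(freqs_matmul, freqs_cache)
```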

rotate_half — Verified Correct

The HuggingFace code:

x1 = x[..., : x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2 :]
return torch.cat((-x2, x1), dim=-1)

TorchScript traces x.shape[-1] // 2 using Shape -> Gather -> floor_divide, where floor_divide exports as Div -> Cast -> Cast (int truncation). The new rotate_half_x2_path_2_3 (2 Cast) and _2_4 (1 Cast) patterns correctly handle this.
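For reference, the traced helper is a simple split-negate-swap; a NumPy sketch:

```python
import numpy as np

def rotate_half(x):
    # Mirrors the HuggingFace helper: split the last dim in half, negate the
    # second half, and swap the halves. x.shape[-1] // 2 is the floor division
    # that TorchScript traces as Div -> Cast -> Cast.
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate((-x2, x1), axis=-1)

rotate_half(np.arange(4.0))  # array([-2., -3., 0., 1.])
```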

apply_rotary_pos_emb — Verified Correct

cos = cos.unsqueeze(unsqueeze_dim)  # -> Unsqueeze node
sin = sin.unsqueeze(unsqueeze_dim)  # -> Unsqueeze node
q_embed = (q * cos) + (rotate_half(q) * sin)  # -> Mul, Mul, Add

The sin/cos_path_5 patterns match: [Mul, Unsqueeze, (Cast?), Mul(scaling), Cos/Sin, Concat, Transpose, MatMul], which correctly captures the unsqueeze and optional Cast (from .to(dtype=x.dtype)).

Qwen3 Architecture — Verified Correct

  • QK-Norm: Qwen3 applies RMSNorm to Q and K after projection and reshape, before RoPE. The test model generator correctly mirrors: QProj -> Reshape -> q_norm(RMSNorm) -> Transpose -> RoPE.
  • Shared RoPE: In the real model, position_embeddings = self.rotary_emb(hidden_states, position_ids) is called once and the result is passed to all layers. The fusion correctly handles this by only removing per-layer Unsqueeze/Mul nodes and letting shared nodes be auto-pruned.
  • GQA: num_kv_heads < num_heads is correctly modeled in the test generator.

Critical Issues

1. max_seq_len = 2048 will cause out-of-bounds memory access (Must Fix)

File: fusion_rotary_attention.py line 1338

max_seq_len = 2048  # Default max sequence length for cache

Qwen3Config sets max_position_embeddings = 32768 by default. The RotaryEmbedding CUDA/CPU kernels compute cache_offset = position_id * half_rotary_emb_dim with no bounds check (verified in rotary_embedding.cc L87-L90 and rotary_embedding_impl.cu L76-L87). If a position_ids value >= 2048 is passed at inference time, this causes:

  • CPU: Out-of-bounds read / segfault
  • CUDA: Undefined behavior / garbage values

The only kernel guard checks sequence_length > max_sequence_length, NOT individual position_ids values.

The max_position_embeddings config value is NOT encoded in the ONNX graph — there is no node, initializer, or metadata from which to derive it. The original on-the-fly computation had no such limit since it computed cos/sin dynamically.

Recommendations (pick one):

  1. Increase the default significantly: Use max_seq_len = 131072 (covers most LLM use cases). Memory cost for head_dim=128: 131072 * 64 * 4 bytes * 2 caches = ~64 MB — very modest.
  2. Make configurable via FusionOptions: Add a max_rotary_seq_len field that users can set. Use 131072 as default.
  3. Derive from model hints: If the optimizer is invoked with model config (not currently available in the fusion), extract from there.

Option 1 is the quickest safe fix.
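The ~64 MB figure in option 1 can be checked with quick arithmetic (constants taken from the recommendation above):

```python
# Back-of-envelope cache memory cost: float32 caches of shape
# (max_seq_len, head_dim // 2), one for cos and one for sin.
max_seq_len = 131072
half_head_dim = 64                       # head_dim = 128
bytes_per_cache = max_seq_len * half_head_dim * 4
total_mib = bytes_per_cache * 2 / (1024 ** 2)   # cos + sin
# total_mib == 64.0
```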

2. inv_freq tracing bug when Cast/Expand/Where nodes are present (Latent Bug)

File: fusion_rotary_attention.py lines 1298-1305

inv_freq_node = self.model.get_parent(matmul_node, 0, output_name_to_node=None)
while inv_freq_node is not None and inv_freq_node.op_type in ("Cast", "Expand", "Where"):
    inv_freq_node = self.model.get_parent(inv_freq_node, 0, output_name_to_node=None)

inv_freq_name = inv_freq_node.output[0] if inv_freq_node is not None else matmul_node.input[0]

When the loop traverses past Cast/Expand/Where nodes and reaches an initializer input (where get_parent() returns None because the input is not produced by any node), the fallback uses matmul_node.input[0] — which is the MatMul's direct input (e.g., a Cast output name), NOT the underlying initializer name. This causes get_initializer() to return None, and the Constant node search also fails.

Why it works today: For current Qwen3 exports, inv_freq is a float32 registered buffer. self.inv_freq[None, :, None] is constant-folded during torch.onnx.export, and .float() is a no-op on float32. So the exported graph has inv_freq (initializer, shape (1, head_dim/2, 1)) feeding directly into MatMul with no intermediate Cast/Expand/Where nodes. The loop never enters, and matmul_node.input[0] IS the initializer name.

When it would break: If a future model or export setting produces Cast/Expand/Where nodes between inv_freq and MatMul (e.g., different dtype inv_freq, explicit batch expansion with dynamic axes), the fusion would silently fail and fall back to no fusion.

Fix: Track the leaf input name during traversal:

inv_freq_input_name = matmul_node.input[0]
inv_freq_node = self.model.get_parent(matmul_node, 0, output_name_to_node=None)
while inv_freq_node is not None and inv_freq_node.op_type in ("Cast", "Expand", "Where", "Unsqueeze"):
    inv_freq_input_name = inv_freq_node.input[0]
    inv_freq_node = self.model.get_parent(inv_freq_node, 0, output_name_to_node=None)

inv_freq_name = inv_freq_node.output[0] if inv_freq_node is not None else inv_freq_input_name

Also add "Unsqueeze" to the skip set for robustness (handles exports where inv_freq[None, :, None] isn't constant-folded).


Medium Issues

3. Test model diverges from real export pattern

The test generator (qwen3_model_generator.py) creates inv_freq directly as shape [1, head_dim/2, 1] and feeds it straight to MatMul with no Expand/Cast/Where nodes. This is a pragmatic simplification, but it means the test does not exercise the Cast/Expand/Where traversal code path in create_cos_sin_cache_from_on_the_fly_rope().

Recommendation: Consider adding a variant test that includes an Expand node between inv_freq and MatMul, to cover the traversal loop. This would have caught Issue #2 above.

4. Redundant Concat + truncation in cache computation

emb = np.concatenate([freqs, freqs], axis=-1)  # (max_seq_len, head_dim)
...
cos_cache_data = cos_cache_data[:, : (head_size // 2)]  # take first half

Since emb[:, :head_size//2] == freqs, the Concat followed by truncation is mathematically equivalent to just using freqs directly:

cos_cache_data = np.cos(freqs) * scaling_value
sin_cache_data = np.sin(freqs) * scaling_value

Not a correctness issue, but removing the dead computation improves clarity.
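A one-line check confirms the truncation undoes the Concat (sketch with arbitrary toy shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
freqs = rng.random((4, 3), dtype=np.float32)   # (seq_len, head_dim // 2)
emb = np.concatenate([freqs, freqs], axis=-1)  # (seq_len, head_dim)
# Taking the first half of the concatenated tensor recovers freqs exactly,
# so cos(emb)[:, :half] == cos(freqs) and the Concat is dead computation.
assert np.array_equal(emb[:, : freqs.shape[-1]], freqs)
```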


Minor / Pre-existing Issues

5. Missing return after "failed to match common cache paths" (Pre-existing)

File: fusion_rotary_attention.py lines 1743-1744

In the non-on-the-fly path, the final else branch logs a debug message but doesn't return:

else:
    logger.debug("fuse_rotary_embeddings: failed to match common cache paths")
# falls through to create_rotary_embeddings_from_nodes

This means if sin/cos paths match different path variants (e.g., sin matched path_1 but cos matched path_3), fusion proceeds with potentially mismatched cache information. Pre-existing issue on main, not introduced by this PR. Unlikely to trigger in practice.

6. Hardcoded cache names "cos_cache" / "sin_cache"

The names "cos_cache" and "sin_cache" are shared across all fusion paths (both function-based and on-the-fly). The code guards with if self.model.get_initializer(cos_cache_name) is None to avoid duplicates. This works because Qwen3 models won't mix function-based and on-the-fly RoPE paths. No action needed, just noting potential for confusion if architectures mix patterns.


Positive Observations

7. Elegant auto-pruning strategy for shared nodes

The per-layer vs. shared node removal logic is well-designed. Removing only Unsqueeze/outer Mul/Cast (per-layer) and relying on prune_graph to clean up shared MatMul/Cos/Sin/Concat/Transpose/Mul(scaling) nodes when all consumers are gone is robust and avoids double-removal bugs in multi-layer models.

8. Cast-tolerant rotate_half patterns

The rotate_half_x2_path_2_3/_2_4 and _x1_path_2_3/_2_4 variants correctly handle 1-2 Cast nodes from TorchScript floor_divide tracing. These are additive patterns that don't affect existing model support.

9. SkipSimplifiedLayerNormalization fallback

The change in fusion_skiplayernorm.py to continue with skip_index=1 when symbolic shape inference fails is well-justified since both Add inputs are already verified as non-initializer dynamic tensors.

10. Correct attention_scaling extraction

The scaling value extraction correctly checks both input[0] and input[1] of the Mul(scaling) node, handling either operand ordering. Defaults to 1.0 if not found, which is correct for the common case (Qwen3Config default rope_type returns attention_factor = 1.0).
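A hypothetical sketch of that either-operand check (the node and constant-lookup structures here are invented for illustration; the real code operates on ONNX protos):

```python
from types import SimpleNamespace

def extract_scaling(mul_node, constants, default=1.0):
    """Find the scalar operand of a Mul(scaling) node.

    The scalar may appear as either input, so check both; fall back to the
    default (Qwen3's default rope_type yields attention_factor = 1.0).
    """
    for name in mul_node.input[:2]:
        if name in constants:
            return float(constants[name])
    return default

node = SimpleNamespace(input=["emb_cos_out", "scaling_const"])
extract_scaling(node, {"scaling_const": 0.83})            # -> 0.83
extract_scaling(SimpleNamespace(input=["a", "b"]), {})    # -> 1.0
```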


Summary

| Category | Issue | Severity | Action |
| --- | --- | --- | --- |
| max_seq_len = 2048 | Causes OOB memory access for sequences > 2048 | Critical | Increase to 131072 or make configurable |
| inv_freq tracing fallback | Uses wrong name when Cast/Expand nodes present | Medium (latent) | Track leaf input during traversal |
| Test coverage gap | Test model skips Expand/Cast in inv_freq path | Low | Add variant test |
| Redundant Concat + truncation | Dead computation in cache generation | Low | Simplify for clarity |
| Missing return (pre-existing) | Fallthrough after mismatched cache paths | Low | Not introduced in this PR |

The fusion logic is structurally sound and the modeling code alignment is verified correct. The critical blocker is the max_seq_len = 2048 limit which will cause runtime crashes for real Qwen3 inference beyond 2048 tokens.

@Rishi-Dave
Contributor Author

Thanks for the thorough analysis! Addressed all items in 3a2cd4b:

Critical (#1) — max_seq_len = 2048 → 131072: Increased to 131072 to cover most LLM contexts (Qwen3 default is 32768, many models go up to 128k). Memory cost for head_dim=128 is ~64 MB — very modest.

Medium (#2) — inv_freq tracing bug: Fixed by tracking the leaf input name during traversal through Cast/Expand/Where/Unsqueeze nodes, so the correct initializer name is used even when get_parent() returns None at the end of the chain.

Low (#3) — Test coverage gap: Added test_qwen3_rotary_embedding_fusion_with_expand that inserts Cast → Expand → Where nodes between inv_freq and MatMul, exercising the traversal loop.

Low (#4) — Redundant Concat + truncation: Simplified to use freqs directly (np.cos(freqs) * scaling) instead of Concat(freqs, freqs) followed by truncation to first half.

Code scanning (#5): Removed the unnecessary premature assignment of position_ids_from_sin_path and position_ids_from_cos_path — they're now only assigned in the else branch where they're used.

All 16 tests pass (14 existing + 2 new Qwen3 RoPE tests). Lintrunner clean.


Copilot AI left a comment


Pull request overview

This PR extends the ONNX Runtime transformer optimizer (FusionRotaryEmbeddings) to handle Qwen3's on-the-fly rotary position embedding (RoPE) computation pattern, where cos/sin are computed from inv_freq at runtime via MatMul rather than being looked up from a pre-computed cache.

Changes:

  • Add new path patterns (sin_path_5/cos_path_5, rotate_half_x2_path_2_3/_2_4, rotate_half_x1_path_2_3/_2_4) for Qwen3's Cast-tolerant RoPE pattern
  • Add create_cos_sin_cache_from_on_the_fly_rope() helper that extracts inv_freq, computes cos/sin caches at optimization time, and adds them as model initializers
  • Add create_qwen3_decoder_layer() test generator and new tests for both on-the-fly RoPE and the Cast+Expand+Where variant

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| fusion_rotary_attention.py | Core changes: new path patterns for Qwen3 RoPE, on-the-fly cache generation helper, updated fuse() to handle on-the-fly vs. cache-based paths separately |
| optimizer.py | Registers "qwen3" model type with Gpt2OnnxModel and opt_level=0 |
| fusion_options.py | Sets Qwen3-specific defaults: disables EmbedLayerNorm, uses NoMask attention |
| fusion_skiplayernorm.py | Removes early return when symbolic shape inference fails, allowing Skip LN fusion with skip_index=1 default |
| qwen3_model_generator.py | New test graph generator for Qwen3 decoder layer with optional on-the-fly RoPE and Expand/Where inv_freq path |
| test_attention_fusion.py | Adds 2 new tests for on-the-fly RoPE fusion and its Cast+Expand+Where variant |



Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.



@tianleiwu previously approved these changes Mar 11, 2026
Contributor

@tianleiwu left a comment


The core logic is correct and well-tested. The fusion_skiplayernorm.py change is safe given the pre-existing checks. The on-the-fly RoPE fusion is a valuable addition for Qwen3 model optimization.

Minor Suggestions

  1. fusion_skiplayernorm.py behavioral change: when shape_infer_helper is None (shape inference failed), the fusion was previously skipped entirely with an early return; it now proceeds with skip_index=1 as a safe default. This affects all model types, so consider adding more specific comments documenting the safety guarantee.
  2. Hardcoded max_seq_len=131072 — acceptable for now but may need to be configurable in the future.
  3. No negative test cases — consider adding tests for malformed graphs or unsupported variants to ensure graceful fallback.
  4. No numerical validation in tests — the tests only count fused nodes but don't verify that the generated cos/sin caches match expected values.

@tianleiwu
Contributor

@Rishi-Dave, there are some CI pipelines failed, please take a look.

For example, there is test error in transformers/test_attention_fusion.py::TestFusion::test_gpt2_attention_no_past_fusion

You can reproduce this like (need build and reinstall wheel first):

pip install -r tools/ci_build/requirements/transformers-test/requirements.txt
cd onnxruntime/test/python/transformers
python test_attention_fusion.py

@Rishi-Dave force-pushed the rishidave/feat/qwen3-rotary-embedding-fusion branch from 09507b2 to 190c66e on March 12, 2026 at 16:44
@Rishi-Dave
Contributor Author

Thanks for flagging! The CI failure in test_gpt2_attention_no_past_fusion was caused by an interaction between two changes:

  1. The SkipLayerNorm shape-inference fallback I included in the base commit (which allowed fusion to proceed without shape validation)
  2. The GPT-2 no-past test from Fix GPT-2 no-past attention fusion for transformers >= 4.27 #27449 (now merged in main)

The fallback was incorrectly fusing nodes in the GPT-2 no-past graph where shape inference fails.

Fix: Rebased onto latest upstream/main and reverted the SkipLayerNorm fallback — it's not needed for Qwen3 support. Updated the Qwen3 test assertions to expect 4 SimplifiedLayerNormalization (no SkipSimplifiedLayerNormalization when shape inference is unavailable for the synthetic test graph). All 17 test_attention_fusion.py tests + all 13 test_skip_layer_norm_fusion.py tests pass locally.

@Rishi-Dave force-pushed the rishidave/feat/qwen3-rotary-embedding-fusion branch 2 times, most recently from 0ed8785 to bfc0ffc on March 14, 2026 at 17:40
@Rishi-Dave
Contributor Author

Thanks for the suggestions! Addressed in 78dc48f:

Negative test (#3): Added test_qwen3_rotary_embedding_fusion_negative_dynamic_inv_freq — generates a graph where inv_freq is a dynamic graph input (not an initializer), so the fusion cannot extract the values at optimization time. Verifies graceful fallback: 0 RotaryEmbedding nodes, no crash. This exercises the path where create_cos_sin_cache_from_on_the_fly_rope returns None, None, None.

Numerical validation (#4): Added test_qwen3_rotary_embedding_fusion_cache_numerical_validation — after fusion, extracts the cos_cache and sin_cache initializers via numpy_helper.to_array(), verifies shape is (max_seq_len, head_dim//2), then spot-checks values at positions [0, 1, 7, 100, 1000] against the expected computation cos(pos * inv_freq) * scaling / sin(pos * inv_freq) * scaling.

All 19 tests pass (17 existing + 2 new). Lintrunner clean.
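The spot check described above amounts to the following NumPy sketch (the tolerance and helper name are hypothetical; the real test reads the fused initializers via numpy_helper.to_array):

```python
import numpy as np

def check_cache_values(cos_cache, sin_cache, inv_freq, scaling=1.0,
                       positions=(0, 1, 7, 100, 1000)):
    """Verify cache rows against the closed-form RoPE values."""
    for pos in positions:
        assert np.allclose(cos_cache[pos], np.cos(pos * inv_freq) * scaling, atol=1e-5)
        assert np.allclose(sin_cache[pos], np.sin(pos * inv_freq) * scaling, atol=1e-5)

# Build reference caches the same way the fusion does, then spot-check them
inv_freq = 1.0 / (10000.0 ** (np.arange(0, 64, 2) / 64))
freqs = np.outer(np.arange(2048, dtype=np.float64), inv_freq)
check_cache_values(np.cos(freqs), np.sin(freqs), inv_freq)
```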

Extend FusionRotaryEmbeddings to handle Qwen3's on-the-fly rotary
position embedding computation, where cos/sin values are computed
from inv_freq at runtime instead of being looked up from a
pre-computed cache.

Changes:
- Add Cast-tolerant rotate_half path patterns for TorchScript exports
  that insert Cast nodes between Unsqueeze and Div
- Add sin_path_5/cos_path_5 patterns matching the on-the-fly
  computation: MatMul → Transpose → Concat → Cos/Sin → Mul(scaling)
  → Unsqueeze → Mul, with optional Cast variant
- Add create_cos_sin_cache_from_on_the_fly_rope() helper that
  extracts inv_freq weights, computes cos/sin caches as initializers,
  and traces position_ids from the graph
- Handle per-layer vs shared node removal correctly (only remove
  per-layer Unsqueeze/outer Mul; shared MatMul/Cos/Sin nodes are
  pruned automatically)
- Update qwen3_model_generator.py with full RoPE computation graph
- Add test_qwen3_rotary_embedding_fusion verifying 2 RotaryEmbedding
  nodes are fused

Verified on real Qwen3-Embedding-0.6B: 56 RotaryEmbedding fused
(28 layers × 2), reducing 7416 → 4661 nodes (37% reduction).
…cache

- Increase max_seq_len from 2048 to 131072 to prevent OOB memory access
  for sequences beyond 2048 tokens (Qwen3 default is 32768)
- Fix inv_freq tracing to track leaf input name through Cast/Expand/Where
  /Unsqueeze nodes, preventing wrong fallback name when intermediate
  nodes are present
- Simplify cache computation: use freqs directly instead of redundant
  Concat(freqs, freqs) followed by truncation to first half
- Remove unnecessary premature variable assignments flagged by code
  scanning (position_ids_from_sin/cos_path)
- Add test_qwen3_rotary_embedding_fusion_with_expand covering the
  Cast → Expand → Where traversal path in inv_freq tracing
- Fix inv_freq tracing through Where nodes: follow input[1] (true
  branch / data path) instead of input[0] (condition). Where has
  3 inputs [condition, x, y] and inv_freq flows through x.
- Use numpy_helper.from_array for cache tensor serialization instead
  of flatten().tolist(), avoiding intermediate Python list for ~8M
  float values
- Remove unused sin_path parameter from
  create_cos_sin_cache_from_on_the_fly_rope (only cos_path is used)
- Remove unused max_seq_len parameter from _on_the_fly_rope_nodes
- Early-return from create_cos_sin_cache_from_on_the_fly_rope when
  cos/sin cache initializers already exist, avoiding redundant
  131072 × head_dim/2 cos/sin computation on every layer's fusion
- Guard per-layer node removal with single-consumer check to prevent
  removing shared nodes that still have other consumers
Remove the shape-inference-failed fallback in FusionSkipLayerNormalization
that was causing test_gpt2_attention_no_past_fusion to fail — the
fallback allowed SkipLayerNorm fusion without shape validation, which
incorrectly fused nodes in the GPT-2 no-past graph. Update Qwen3 test
assertions to expect 4 SimplifiedLayerNormalization (no SkipSLN fusion
when shape inference is unavailable for the synthetic test graph).
Address reviewer feedback: add test for graceful fallback when inv_freq
is a dynamic graph input (not an extractable initializer), and add
numerical validation that verifies cos/sin cache values match the
expected mathematical computation at multiple positions.
@Rishi-Dave force-pushed the rishidave/feat/qwen3-rotary-embedding-fusion branch from 78dc48f to 41f3a17 on March 15, 2026 at 19:22
@Rishi-Dave
Contributor Author

@tianleiwu — Rebased onto latest upstream/main and force-pushed. All feedback has been addressed:

Your suggestions:

  1. SkipLayerNorm comments — Reverted entirely (not needed for Qwen3)
  2. max_seq_len=131072 — Acknowledged; acceptable for now per your note
  3. Negative test — Added test_qwen3_rotary_embedding_fusion_negative_dynamic_inv_freq (dynamic inv_freq input → graceful fallback, 0 RotaryEmbedding, no crash)
  4. Numerical validation — Added test_qwen3_rotary_embedding_fusion_cache_numerical_validation (spot-checks cos/sin cache values at positions [0, 1, 7, 100, 1000] against cos(pos * inv_freq) * scaling)

CI fix:

  • Reverted the SkipLayerNorm shape-inference fallback that was causing test_gpt2_attention_no_past_fusion to fail
  • Updated Qwen3 test assertions accordingly (4 SimplifiedLayerNormalization instead of SkipSimplifiedLayerNormalization)

Test results (post-rebase):

  • 19/19 test_attention_fusion.py pass
  • 13/13 test_skip_layer_norm_fusion.py pass
  • Lintrunner clean

Ready for re-review when you have a chance.

@tianleiwu
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).


Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.



@tianleiwu enabled auto-merge (squash) March 16, 2026 07:52
@tianleiwu merged commit 227c3d5 into microsoft:main Mar 16, 2026
93 checks passed