Fix BART attention fusion for SDPA pattern from transformers >= 4.49 (#27458)
Merged
tianleiwu merged 2 commits into microsoft:main on Feb 27, 2026
Conversation
HuggingFace Transformers >= 4.49 replaced BartAttention with BartSdpaAttention, changing the ONNX graph topology in several ways that broke FusionBartAttention pattern matching. This adds SDPA-aware match paths so that BART attention fusion succeeds on modern exports.
Force-pushed from 982b168 to fe4dfce
Contributor
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
tianleiwu reviewed Feb 27, 2026
tianleiwu reviewed Feb 27, 2026
tianleiwu previously approved these changes Feb 27, 2026
…lback
- Document why mask presence is derived from the QK pattern match result rather than re-walking the graph (line 352 feedback).
- Add logger.debug when num_heads/hidden_size falls back to user-specified values, logging both detected and fallback values (line 410 feedback).
Contributor
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline
tianleiwu approved these changes Feb 27, 2026

Azure Pipelines successfully started running 4 pipeline(s).
Summary
Fix `FusionBartAttention` so that BART attention fusion succeeds on models exported with HuggingFace Transformers >= 4.49.

Motivation
Fixes #23864
HuggingFace Transformers >= 4.49 replaced `BartAttention` with `BartSdpaAttention` (commit `2c47618`), changing the ONNX export graph topology in several ways that broke `FusionBartAttention` pattern matching. Running `optimize_model(..., model_type="bart")` on these newer exports produces zero fused Attention nodes; a minimal reproduction is sketched below.
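For context, a reproduction sketch using the public optimizer entry point (the model path and the `num_heads`/`hidden_size` values are illustrative, not taken from the PR):

```python
# Reproduction sketch: optimize a BART ONNX model exported with
# transformers >= 4.49. Path and head/size values are placeholders.
from onnxruntime.transformers.optimizer import optimize_model

optimizer = optimize_model(
    "bart_decoder.onnx",   # exported from a BartSdpaAttention-based model
    model_type="bart",
    num_heads=16,
    hidden_size=1024,
)
# Before this fix, SDPA-based exports report 0 fused Attention nodes here.
print(optimizer.get_fused_operator_statistics())
```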
Changes

fusion_bart_attention.py

The SDPA refactor introduces four structural changes to the attention subgraph. Each required a new match path:
QKV output path — LayerNormalization anchor fallback
For SDPA models, symbolic shape inference often fails, which prevents SkipLayerNormalization fusion. When the anchor node is a plain `LayerNormalization` instead of `SkipLayerNormalization`, there is an extra residual `Add` between the LayerNorm and the attention output projection. Added a fallback match `["Add", "Add", "MatMul", "Reshape", "Transpose", "MatMul"]` with input indices `[0, None, 0, 0, 0, 0]`; see the sketch below.
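For illustration, a sketch of how this fallback could look inside the fusion, assuming the standard `OnnxModel.match_parent_path` helper; the variable names (`qkv_nodes`, `normalize_node`) are illustrative, not necessarily those in the PR:

```python
# Sketch: when SkipLayerNormalization fusion did not happen, anchor on the
# plain LayerNormalization and absorb the extra residual Add.
if qkv_nodes is None and normalize_node.op_type == "LayerNormalization":
    qkv_nodes = self.model.match_parent_path(
        normalize_node,
        ["Add", "Add", "MatMul", "Reshape", "Transpose", "MatMul"],
        [0, None, 0, 0, 0, 0],  # None: the residual Add input index varies
    )
```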
QK path — NaN guard (Where + IsNaN)

SDPA wraps the Softmax output in a NaN guard: `Where(IsNaN(softmax), 0.0, softmax)`. The `Where` node's input[2] is the Softmax output. Added two new QK paths (sketched below):

- `["Where", "Softmax", "MatMul"]` with `[0, 2, 0]`
- `["Where", "Softmax", "Add", "MatMul"]` with `[0, 2, 0, 0]`
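A sketch of matching both variants with `OnnxModel.match_parent_paths`, which tries candidate paths in order; `matmul_qkv` (the QK·V MatMul) and the `output_name_to_node` lookup dict are illustrative names, assumed to be available in the fusion's `fuse` method:

```python
# Sketch: Where's input[2] carries the raw Softmax output, hence the 2
# in the second step of each path.
_, qk_nodes, _ = self.model.match_parent_paths(
    matmul_qkv,
    [
        (["Where", "Softmax", "MatMul"], [0, 2, 0]),            # no mask
        (["Where", "Softmax", "Add", "MatMul"], [0, 2, 0, 0]),  # mask Add before Softmax
    ],
    output_name_to_node,
)
```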
Q and K scaling paths

Instead of a single combined scale on the QK MatMul output, SDPA applies a separate `Mul(1/sqrt(head_dim))` to Q and K before the QK MatMul. Added (see the sketch after this list):

- `["Mul", "Transpose", "Reshape", "Add", "MatMul"]` with `[0, 0, 0, 0, None]`
- `["Mul", "Reshape", "Transpose", "Reshape", "Transpose", "Reshape", "Add", "MatMul"]` with `[1, 0, 0, 0, 0, 0, 0, None]` (K^T uses a `Reshape → Transpose(0,2,1) → Reshape` chain)
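The corresponding match calls might look like this sketch; `matmul_qk` is an illustrative name for the QK MatMul, and the trailing `None` leaves the input-projection MatMul's input index unconstrained:

```python
# Sketch: Q is scaled by a Mul just before the QK MatMul (input 0).
q_nodes = self.model.match_parent_path(
    matmul_qk,
    ["Mul", "Transpose", "Reshape", "Add", "MatMul"],
    [0, 0, 0, 0, None],
)
# Sketch: K^T arrives through a Reshape -> Transpose(0, 2, 1) -> Reshape
# chain on input 1 of the QK MatMul.
k_nodes = self.model.match_parent_path(
    matmul_qk,
    ["Mul", "Reshape", "Transpose", "Reshape", "Transpose", "Reshape", "Add", "MatMul"],
    [1, 0, 0, 0, 0, 0, 0, None],
)
```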
num_heads fallback for dynamic shapes

SDPA models use `-1` in reshape shape tensors for dynamic dimensions, causing `get_num_heads_and_hidden_size` to return negative values. Added a fallback to the user-specified `num_heads`/`hidden_size` when the detected values are invalid (sketched below).
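A sketch of the fallback, including the `logger.debug` call added in the review follow-up commit; the exact validity checks and the `reshape_q` name are illustrative:

```python
# Sketch: reshape shape tensors carry -1 for dynamic dims in SDPA exports,
# so the detected values can come back negative.
num_heads, hidden_size = self.get_num_heads_and_hidden_size(reshape_q)
if num_heads <= 0 or hidden_size <= 0:
    logger.debug(
        "Detected num_heads=%d hidden_size=%d are invalid; falling back to "
        "user-specified num_heads=%d hidden_size=%d",
        num_heads, hidden_size, self.num_heads, self.hidden_size,
    )
    num_heads, hidden_size = self.num_heads, self.hidden_size
```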
bart_model_generator.py (new)

Synthetic BART SDPA attention graph generator that builds a minimal but complete attention subgraph matching the SDPA topology. Tests both `with_mask=True` (decoder self-attention) and `with_mask=False` (encoder attention) variants.

test_attention_fusion.py

Added `test_bart_attention_sdpa_fusion`, which verifies:

- the `num_heads` attribute
- the `unidirectional` attribute (1 for decoder self-attention with mask, 0 for encoder)

A rough outline of such a check is sketched below.
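A hypothetical outline of the verification; the generator API in `bart_model_generator.py` is not shown in this PR description, so the model path, head count, and hidden size here are placeholders:

```python
# Hypothetical test sketch: fuse a synthetic SDPA attention graph and check
# the resulting Attention node's attributes.
from onnxruntime.transformers.optimizer import optimize_model

def check_sdpa_fusion(model_path: str, with_mask: bool) -> None:
    optimizer = optimize_model(model_path, model_type="bart", num_heads=4, hidden_size=64)
    attention = [n for n in optimizer.model.graph.node if n.op_type == "Attention"]
    assert len(attention) == 1, "expected exactly one fused Attention node"
    attrs = {a.name: a for a in attention[0].attribute}
    assert attrs["num_heads"].i == 4
    # 1 for decoder self-attention (masked), 0 for encoder attention.
    assert attrs["unidirectional"].i == (1 if with_mask else 0)
```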
Test Plan

- `python -m pytest test_attention_fusion.py -v` — all 10 tests pass
- `lintrunner` on all 3 changed files — no issues
- Real model (`hf-internal-testing/tiny-random-bart`): 2 Attention nodes fused, graph reduced from 120 → 34 nodes