Revert "Skip materialised causal attn_bias on FSDPA for non-GDN hybri…#1482
Merged
iboiko-habana merged 3 commits on May 27, 2026
…d models (vllm-project#1413)" This reverts commit 808dbfa. Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Removes the “non-GDN hybrid” topology detection and the FSDPA-native causal short-circuit paths that skipped materializing attn_bias for certain hybrid models.
Changes:
- Deleted `is_non_gdn_hybrid` topology detection in both runner/metadata init paths.
- Removed the early-return short-circuit in `set_attn_bias`/`_set_attn_bias` that avoided building a large causal `attn_bias` tensor.
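To make the effect of the revert concrete, here is a minimal sketch of the gating condition that the removed short-circuit used, based on the `if` statement quoted in the first review comment of this PR. `RunnerFlags` and `hybrid_short_circuit_fires` are hypothetical names introduced for illustration only; the real code reads these attributes from the model runner.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunnerFlags:
    """Hypothetical stand-in for the runner attributes read by the removed check."""
    prefill_use_fusedsdpa: bool = True
    is_causal: bool = True
    is_pooling_model: bool = False
    sliding_window: Optional[int] = None
    model_has_chunked_attention: bool = False
    alibi_slopes: Optional[list] = None
    is_non_gdn_hybrid: bool = False


def hybrid_short_circuit_fires(flags: RunnerFlags, reverted: bool) -> bool:
    """Return True if the non-GDN hybrid early return would skip building attn_bias.

    With the revert applied the check no longer exists, so it never fires and
    the code falls through to the materialised-bias path.
    """
    if reverted:
        return False
    return bool(
        flags.prefill_use_fusedsdpa
        and flags.is_causal
        and not flags.is_pooling_model
        and not flags.sliding_window
        and not flags.model_has_chunked_attention
        and flags.alibi_slopes is None
        and flags.is_non_gdn_hybrid
    )


hybrid = RunnerFlags(is_non_gdn_hybrid=True)
print(hybrid_short_circuit_fires(hybrid, reverted=False))
print(hybrid_short_circuit_fires(hybrid, reverted=True))
```

The sketch only models this one early return; the earlier guards in the same function (for example the `is_prompt` check) are unaffected by the revert.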
Comment on lines 3891 to 3896
            or not attn_metadata.is_prompt):
        return attn_metadata

    # Extended FSDPA-native causal short-circuit for non-GDN hybrid models
    # (e.g. Granite-4 Mamba2+Transformer). FusedSDPA can encode a purely
    # causal mask natively via is_causal=True + valid_seq_lengths, including
    # chunked prefill where block_list is non-None. Skipping the
    # materialised [bs, 1, q_len, total_kv_len] attn_bias avoids a large
    # add_bf16 on the attention critical path (significant at long
    # context). Conservative scope: only non-GDN hybrid models; GDN /
    # pure-transformer / other topologies keep the materialised bias path
    # until validated.
    if (self.prefill_use_fusedsdpa and self.is_causal and not self.is_pooling_model
            and not getattr(self, 'sliding_window', None)
            and not getattr(self, 'model_has_chunked_attention', False)
            and getattr(self, 'alibi_slopes', None) is None and self.is_non_gdn_hybrid):
        return attn_metadata

    if attn_metadata.attn_bias is not None:
        return attn_metadata
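The removed comment argues that a materialised `[bs, 1, q_len, total_kv_len]` bias is expensive at long context. A rough back-of-the-envelope sketch of that tensor's size, assuming a dense bf16 tensor (2 bytes per element) and a single sequence with `total_kv_len == q_len`; the helper name `attn_bias_bytes` is made up for this illustration:

```python
def attn_bias_bytes(bs: int, q_len: int, total_kv_len: int, bytes_per_elem: int = 2) -> int:
    """Size in bytes of a dense [bs, 1, q_len, total_kv_len] bias (bf16 = 2 bytes/elem)."""
    return bs * 1 * q_len * total_kv_len * bytes_per_elem


for q_len in (2048, 8192, 32768):
    size = attn_bias_bytes(bs=1, q_len=q_len, total_kv_len=q_len)
    print(f"q_len={q_len:>6}: {size / 2**20:.0f} MiB")
# q_len=  2048: 8 MiB
# q_len=  8192: 128 MiB
# q_len= 32768: 2048 MiB
```

The size grows quadratically with sequence length, which is why the short-circuit targeted long-context prefill. These figures cover only the bias tensor itself, not the cost of the `add_bf16` that consumes it.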
Comment on lines 6780 to 6785
            or not attn_metadata.is_prompt):
        return attn_metadata

    # Extended FSDPA-native causal short-circuit for non-GDN hybrid models
    # (e.g. Granite-4 Mamba2+Transformer). FusedSDPA handles a purely
    # causal mask natively (is_causal=True + valid_seq_lengths). Skip
    # materialising a [bs, 1, q_len, total_kv_len] attn_bias even during
    # chunked prefill (block_list is non-None) for these topologies; this
    # removes a sizable add_bf16 from the attention critical path during
    # long-context chunked prefill. interleaved_sliding_window and
    # chunked-attention bias paths (window_attn_bias / chunked_attn_bias)
    # are populated later in process_metadata and used by hpu_attn
    # instead. Conservative scope: only non-GDN hybrid models; all other
    # topologies retain the original behaviour.
    if (self.prefill_use_fusedsdpa and not self.interleaved_sliding_window and self.is_non_gdn_hybrid):
        return attn_metadata

    if attn_metadata.attn_bias is not None:
        return attn_metadata
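The removed comments rest on the idea that a purely causal mask can be applied either as a dense additive bias or natively by the kernel. The following toy sketch in pure Python illustrates that equivalence for a chunked-prefill shape where `kv_len >= q_len`. It does not model FusedSDPA or `valid_seq_lengths`; all function names are hypothetical, and it shows only why the two formulations give the same attention weights in exact arithmetic, not whether the kernel path is correct for every topology.

```python
import math

NEG_INF = float("-inf")


def causal_bias(q_len: int, kv_len: int) -> list:
    """Dense additive mask: query i sits at absolute position (kv_len - q_len + i)
    and may attend to keys 0..that position; later keys get -inf."""
    offset = kv_len - q_len
    return [[0.0 if j <= offset + i else NEG_INF for j in range(kv_len)]
            for i in range(q_len)]


def softmax(row: list) -> list:
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def weights_with_bias(scores: list, bias: list) -> list:
    """Materialised path: add the bias to the scores, then softmax each row."""
    return [softmax([s + b for s, b in zip(srow, brow)])
            for srow, brow in zip(scores, bias)]


def weights_native_causal(scores: list, q_len: int, kv_len: int) -> list:
    """Native path: never build the bias, softmax only over the visible keys."""
    offset = kv_len - q_len
    out = []
    for i, srow in enumerate(scores):
        visible = offset + i + 1
        out.append(softmax(srow[:visible]) + [0.0] * (kv_len - visible))
    return out


q_len, kv_len = 3, 5  # e.g. a 3-token chunk after 2 tokens of cached context
scores = [[math.sin(3 * i + j) for j in range(kv_len)] for i in range(q_len)]
a = weights_with_bias(scores, causal_bias(q_len, kv_len))
b = weights_native_causal(scores, q_len, kv_len)
print(all(math.isclose(x, y, abs_tol=1e-12) for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

Since this PR reverts the short-circuit, the materialised path remains the one used for these models.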
🚧 CI Blocked: The main CI workflow was not started for the following reason:
✅ CI Passed: All checks passed successfully against the following vllm commit:
iboiko-habana approved these changes May 27, 2026