Fixing condition for materialised causal attn_bias #1433

Merged: ksmusz merged 1 commit into main from dev/ksmusz/fix-materialised-causal-attn-bias on May 11, 2026

Conversation

@ksmusz ksmusz commented May 11, 2026

The PR fixes the condition introduced in #1413

Signed-off-by: Krzysztof Smusz <ksmusz@habana.ai>

Copilot AI left a comment

Pull request overview

This PR adjusts the conditions under which prompt-phase attn_bias is materialized, aiming to avoid building a large causal bias tensor in cases where FusedSDPA can apply causal masking natively (via is_causal=True + valid_seq_lengths).

Changes:

  • Added/expanded early-return conditions in set_attn_bias and _set_attn_bias to skip materializing attn_bias in additional FusedSDPA scenarios.
  • Updated the surrounding comments to describe the intended short-circuit behavior (though some statements no longer match the broadened gating).
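The overview above rests on the idea that a purely causal mask does not need to exist as a tensor. The following is a minimal pure-Python sketch of that equivalence, not the plugin's or FusedSDPA's implementation: one function adds an explicit causal bias before the softmax, the other only visits the keys a causal kernel would read. The function names and the `valid_len` parameter (standing in for a single `valid_seq_lengths` entry) are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def softmax(row):
    """Numerically stable softmax that tolerates -inf entries."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attn_with_materialised_bias(scores, valid_len):
    """Build an explicit [q_len, kv_len] causal bias, add it, then softmax."""
    n = len(scores)
    bias = [
        [0.0 if (k <= q and k < valid_len) else NEG_INF for k in range(n)]
        for q in range(n)
    ]
    return [
        softmax([s + b for s, b in zip(scores[q], bias[q])]) for q in range(n)
    ]


def attn_with_native_causal(scores, valid_len):
    """Never build a bias: softmax only over the keys each query may see."""
    n = len(scores)
    out = []
    for q in range(n):
        visible = min(q + 1, valid_len)  # causal limit, capped by valid length
        probs = softmax(scores[q][:visible])
        out.append(probs + [0.0] * (n - visible))
    return out


# Hypothetical 4x4 score matrix; the last position is treated as padding.
scores = [[0.1 * (q + 1) * (k + 1) for k in range(4)] for q in range(4)]
with_bias = attn_with_materialised_bias(scores, valid_len=3)
native = attn_with_native_causal(scores, valid_len=3)
```

Both paths produce the same attention weights; the second simply avoids allocating and adding the bias, which is the saving the PR is after.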

Comment on lines +3838 to +3846
# Extended FSDPA-native causal short-circuit for non-GDN hybrid models
# (e.g. Granite-4 Mamba2+Transformer). FusedSDPA can encode a purely
# causal mask natively via is_causal=True + valid_seq_lengths, including
# chunked prefill where block_list is non-None. Skipping the
# materialised [bs, 1, q_len, total_kv_len] attn_bias avoids a large
# add_bf16 on the attention critical path (significant at long
# context). Conservative scope: only non-GDN hybrid models; GDN /
# pure-transformer / other topologies keep the materialised bias path
# until validated.

Reply from the PR author (Collaborator):

Yes, the branch fires for more cases, as that's how it worked before #1413.
The comment describes how its extended usage applies to non-GDN hybrid models.
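To give a sense of scale for the `[bs, 1, q_len, total_kv_len]` bias mentioned in the code comment, here is a back-of-the-envelope size calculation. The shape values are illustrative assumptions, not measurements from this PR; the only fixed fact used is that a bf16 element occupies 2 bytes.

```python
def attn_bias_bytes(bs, q_len, total_kv_len, bytes_per_elem=2):
    """Bytes needed for a dense [bs, 1, q_len, total_kv_len] tensor.

    bytes_per_elem defaults to 2, the size of one bf16 value.
    """
    return bs * 1 * q_len * total_kv_len * bytes_per_elem


# Hypothetical long-context prompt bucket (values chosen for illustration).
size = attn_bias_bytes(bs=32, q_len=2048, total_kv_len=32768)
print(f"{size / 2**30:.1f} GiB")  # prints "4.0 GiB"
```

Under these assumed shapes the bias alone is 4 GiB, and it has to be read and added to the attention scores on every layer that uses it.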

Comment on lines +6696 to +6706
# Extended FSDPA-native causal short-circuit for non-GDN hybrid models
# (e.g. Granite-4 Mamba2+Transformer). FusedSDPA handles a purely
# causal mask natively (is_causal=True + valid_seq_lengths). Skip
# materialising a [bs, 1, q_len, total_kv_len] attn_bias even during
# chunked prefill (block_list is non-None) for these topologies; this
# removes a sizable add_bf16 from the attention critical path during
# long-context chunked prefill. interleaved_sliding_window and
# chunked-attention bias paths (window_attn_bias / chunked_attn_bias)
# are populated later in process_metadata and used by hpu_attn
# instead. Conservative scope: only non-GDN hybrid models; all other
# topologies retain the original behaviour.

Reply from the PR author (Collaborator):

Yes, the branch fires for more cases, as that's how it worked before #1413.
The comment describes how its extended usage applies to non-GDN hybrid models.
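For the chunked-prefill case discussed in this thread, the logical mask that a materialised bias would encode can be sketched as follows. This is an illustration of the mask's shape only, under the assumption that query `i` of the current chunk sits at absolute position `ctx_len + i`; the function name and parameters are hypothetical and do not reflect how the plugin or FusedSDPA computes it.

```python
def chunked_prefill_causal_mask(ctx_len, q_len):
    """Boolean [q_len, ctx_len + q_len] mask; True means the key is visible.

    Assumes query i of the chunk sits at absolute position ctx_len + i,
    so it may attend to every cached key plus chunk keys up to itself.
    """
    total_kv_len = ctx_len + q_len
    return [
        [k <= ctx_len + i for k in range(total_kv_len)] for i in range(q_len)
    ]


# Three tokens already in the KV cache, two new query tokens in this chunk.
mask = chunked_prefill_causal_mask(ctx_len=3, q_len=2)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
# xxxx.
# xxxxx
```

The mask is fully determined by `ctx_len` and `q_len`, which is why a kernel that knows the sequence lengths can apply it without receiving a dense tensor.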

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
54f548e9e58087f0155e4e164e416ad7efdfde6d

@ksmusz ksmusz merged commit dfd3d1f into main May 11, 2026
6 checks passed
@ksmusz ksmusz deleted the dev/ksmusz/fix-materialised-causal-attn-bias branch May 11, 2026 16:01
iboiko-habana pushed a commit that referenced this pull request May 18, 2026
#1433 fixed a Qwen3.5 accuracy regression that was only detected when the prompt bucket batch size is large. Adding VLLM_PROMPT_BS_BUCKET_MAX=32 to the CI test covers that case.
Also tighten the passing threshold to better catch future regressions.

Signed-off-by: Seunghyuk Park <separk@habana.ai>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Libin Tang <libin.tang@intel.com>

3 participants