Fixing condition for materialised causal attn_bias #1433
Signed-off-by: Krzysztof Smusz <ksmusz@habana.ai>
Pull request overview
This PR adjusts the conditions under which prompt-phase attn_bias is materialized, aiming to avoid building a large causal bias tensor in cases where FusedSDPA can apply causal masking natively (via is_causal=True + valid_seq_lengths).
Changes:
- Added/expanded early-return conditions in set_attn_bias and _set_attn_bias to skip materializing attn_bias in additional FusedSDPA scenarios.
- Updated the surrounding comments to describe the intended short-circuit behavior (though some statements no longer match the broadened gating).
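To illustrate why a materialised bias is redundant for a purely causal mask, here is a minimal pure-Python sketch. It is not the plugin's code and does not call FusedSDPA; it assumes a single padded sequence where q_len equals kv_len, and the names materialised_causal_bias, attend_with_bias and attend_native are hypothetical. It compares attention weights computed with an additive 0/-inf bias against weights computed directly from the causal rule plus a valid length.

```python
import math

NEG_INF = float("-inf")


def softmax(row):
    """Numerically stable softmax; -inf entries come out as exactly 0.0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def materialised_causal_bias(seq_len, valid_len):
    """Additive bias: 0.0 where key j is visible to query i, -inf elsewhere.

    Assumption for this sketch: key j is visible when j <= i (causal)
    and j < valid_len (not padding).
    """
    return [
        [0.0 if (j <= i and j < valid_len) else NEG_INF for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attend_with_bias(scores, bias):
    """Path A: add the full bias matrix to the scores, then softmax."""
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]


def attend_native(scores, valid_len):
    """Path B: apply the causal rule and valid length directly, no bias matrix."""
    out = []
    for i, row in enumerate(scores):
        allowed = [j for j in range(len(row)) if j <= i and j < valid_len]
        probs = softmax([row[j] for j in allowed])
        full = [0.0] * len(row)
        for j, p in zip(allowed, probs):
            full[j] = p
        out.append(full)
    return out


SEQ_LEN, VALID_LEN = 6, 4  # illustrative sizes: 4 real tokens, 2 padding slots
scores = [[((i * 3 + j * 5) % 7) / 7.0 for j in range(SEQ_LEN)] for i in range(SEQ_LEN)]

with_bias = attend_with_bias(scores, materialised_causal_bias(SEQ_LEN, VALID_LEN))
native = attend_native(scores, VALID_LEN)

same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(with_bias, native)
    for a, b in zip(row_a, row_b)
)
print("paths agree:", same)
```

Both paths give the same weights, so the bias matrix carries no information beyond the causal rule and the valid length. That is the property the short-circuit relies on; masks that are not purely causal (for example sliding-window or chunked-attention masks) do not fit this sketch.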
# Extended FSDPA-native causal short-circuit for non-GDN hybrid models
# (e.g. Granite-4 Mamba2+Transformer). FusedSDPA can encode a purely
# causal mask natively via is_causal=True + valid_seq_lengths, including
# chunked prefill where block_list is non-None. Skipping the
# materialised [bs, 1, q_len, total_kv_len] attn_bias avoids a large
# add_bf16 on the attention critical path (significant at long
# context). Conservative scope: only non-GDN hybrid models; GDN /
# pure-transformer / other topologies keep the materialised bias path
# until validated.
Yes, the branch fires for more cases, as that's how it worked before #1413.
The comment mentions how its extended usage applies to non-GDN hybrid models.
# Extended FSDPA-native causal short-circuit for non-GDN hybrid models
# (e.g. Granite-4 Mamba2+Transformer). FusedSDPA handles a purely
# causal mask natively (is_causal=True + valid_seq_lengths). Skip
# materialising a [bs, 1, q_len, total_kv_len] attn_bias even during
# chunked prefill (block_list is non-None) for these topologies; this
# removes a sizable add_bf16 from the attention critical path during
# long-context chunked prefill. interleaved_sliding_window and
# chunked-attention bias paths (window_attn_bias / chunked_attn_bias)
# are populated later in process_metadata and used by hpu_attn
# instead. Conservative scope: only non-GDN hybrid models; all other
# topologies retain the original behaviour.
Yes, the branch fires for more cases, as that's how it worked before #1413.
The comment mentions how its extended usage applies to non-GDN hybrid models.
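For a sense of scale, the [bs, 1, q_len, total_kv_len] shape quoted in the code comments above can be turned into a rough byte count. This is a back-of-the-envelope sketch only: the helper name attn_bias_bytes and the example sizes are assumptions chosen for illustration, not measurements from this PR. The one fixed fact is that a bf16 element occupies 2 bytes.

```python
BF16_BYTES = 2  # bfloat16 is a 16-bit format


def attn_bias_bytes(bs, q_len, total_kv_len, bytes_per_elem=BF16_BYTES):
    """Size in bytes of a dense [bs, 1, q_len, total_kv_len] bias tensor."""
    return bs * 1 * q_len * total_kv_len * bytes_per_elem


# Hypothetical chunked-prefill shape: 32 sequences, a 2048-token chunk,
# and 32768 total key/value positions per sequence.
size = attn_bias_bytes(bs=32, q_len=2048, total_kv_len=32768)
print(f"{size / 1024**3:.1f} GiB")
```

Under these assumed sizes the dense bias alone is 4 GiB, and it grows linearly with batch size, chunk length and context length.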
✅ CI Passed
All checks passed successfully against the following vllm commit:
#1433 fixed a Qwen3.5 accuracy regression that was only detected when the prompt bucket batch size is large. Adding VLLM_PROMPT_BS_BUCKET_MAX=32 to the CI test covers that case. Also tighten the passing threshold to better catch future regressions.
Signed-off-by: Seunghyuk Park <separk@habana.ai>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Libin Tang <libin.tang@intel.com>
The PR fixes the condition introduced in #1413.