Use Boolean attention mask #1032
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Pull request overview
This pull request converts attention masks from float tensors with -inf masking to boolean tensors. The changes affect multiple attention bias computation methods including basic attention, sliding window attention, chunked attention, and block mapping.
Changes:
- Removed float dtype conversion and -inf masking, replacing with boolean mask inversion
- Changed tensor creation from torch.full with -inf to torch.ones with boolean dtype
- Replaced torch.where and torch.log operations with direct boolean operations
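To make the two representations concrete, here is a minimal sketch (not code from this PR) that builds the same causal mask as a Boolean tensor and as a bf16 additive bias. The 4x4 shape and the variable names are illustrative assumptions.

```python
import torch

seq_len = 4  # illustrative size, not taken from the PR
idx = torch.arange(seq_len)

# Boolean form: True marks a key position the query may attend to.
allowed = idx.unsqueeze(0) <= idx.unsqueeze(1)

# Float form: 0.0 where attention is allowed, -inf where it is masked.
additive = torch.zeros(seq_len, seq_len, dtype=torch.bfloat16)
additive = additive.masked_fill(~allowed, float("-inf"))

# Storage cost of each representation in bytes.
bool_bytes = allowed.numel() * allowed.element_size()
bf16_bytes = additive.numel() * additive.element_size()
print(bool_bytes, bf16_bytes)
```

A Boolean element takes one byte and a bf16 element takes two, so the Boolean mask needs half the storage for the same shape.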
      mask = torch.arange(0, self.block_size, device=device, dtype=torch.int32).unsqueeze(0)
    - mask = mask >= block_usage.unsqueeze(-1)
    - attn_bias = (torch.zeros_like(mask, dtype=dtype).masked_fill_(mask, -math.inf))
    + attn_bias = mask < block_usage.unsqueeze(-1)
A Boolean attention bias cannot be added directly to attention scores in decode operations. In `pipelined_pa` (ops.py, lines 82-83), `block_bias` is added to the attention weights. Casting a Boolean tensor to float yields 1.0 for True and 0.0 for False, whereas additive masking needs 0.0 for allowed positions and -inf for masked positions.
    - attn_bias = mask < block_usage.unsqueeze(-1)
    + valid_positions = mask < block_usage.unsqueeze(-1)
    + # Convert boolean mask to additive attention bias: 0.0 for allowed, -inf for masked
    + attn_bias = torch.zeros_like(valid_positions, dtype=dtype)
    + attn_bias.masked_fill_(~valid_positions, float("-inf"))
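The concern raised in this comment can be reproduced in isolation. The sketch below is not the `pipelined_pa` code; it assumes a single block of four slots with two in use and uniform scores, then compares adding the cast Boolean mask against adding a proper 0.0 / -inf bias.

```python
import torch

block_size = 4                    # illustrative value
block_usage = torch.tensor([2])   # assumption: two valid slots in the block
positions = torch.arange(block_size).unsqueeze(0)
valid = positions < block_usage.unsqueeze(-1)  # True for the first two slots

scores = torch.zeros(1, block_size)  # uniform scores keep the effect easy to see

# Problematic path: the Boolean mask is cast to float and added as-is.
wrong = torch.softmax(scores + valid.to(scores.dtype), dim=-1)

# Intended path: convert to an additive bias (0.0 / -inf) before adding.
bias = torch.zeros_like(scores).masked_fill(~valid, float("-inf"))
right = torch.softmax(scores + bias, dim=-1)

print(wrong)
print(right)
```

In the first result the two unused slots still receive probability mass, while in the second they receive none.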
    - mask = torch.concat((past_mask, causal_mask), dim=-1)
    - attn_bias = torch.where(mask, torch.tensor(0.0, dtype=dtype, device=device),
    -                         torch.tensor(float('-inf'), dtype=dtype, device=device))
    + attn_bias = torch.concat((past_mask, causal_mask), dim=-1)
The boolean mask for sliding window attention will be incorrectly converted when added to attention scores. The concatenated past_mask and causal_mask are boolean tensors that will be converted to float (True→1.0, False→0.0) when used in operations that expect numeric bias values (0.0 for allowed, -inf for masked). This applies to the code path using the naive or fsdpa implementation that adds the bias to attention scores.
    - mask = torch.triu(mask, diagonal=shift - window_size + 1)
    - attn_bias = torch.log(mask)
    + attn_bias = torch.triu(mask, diagonal=shift - window_size + 1)
The boolean mask result from torch.triu will be incorrectly used as attention bias. When this boolean tensor is used in attention operations that add the bias to attention scores, True will become 1.0 and False will become 0.0, which is incorrect. The old code correctly used torch.log to convert the mask to 0.0 for allowed positions (where mask=1) and -inf for masked positions (where mask=0).
    + # Convert boolean mask to numeric attention bias: 0.0 for allowed positions, -inf for masked.
    + if attn_bias.dtype == torch.bool:
    +     zero = torch.zeros(1, dtype=dtype, device=device)
    +     neg_inf = torch.full((1,), float("-inf"), dtype=dtype, device=device)
    +     attn_bias = torch.where(attn_bias, zero, neg_inf)
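The relationship between the old `torch.log` bias and a Boolean sliding-window mask can be checked with a small sketch. It assumes `shift = 0` (no past context), a window of three tokens, and a `torch.tril` step before the `torch.triu` call; only the `torch.triu` line appears in the diff, so the rest is an illustrative reconstruction.

```python
import torch

seq_len, window_size, shift = 6, 3, 0  # illustrative values

# Old style: float band of ones, then log maps 1.0 -> 0.0 and 0.0 -> -inf.
ones = torch.ones(seq_len, seq_len)
band = torch.triu(torch.tril(ones, diagonal=shift),
                  diagonal=shift - window_size + 1)
old_bias = torch.log(band)

# Boolean style: the same band written as index comparisons.
q_idx = torch.arange(seq_len).unsqueeze(1)
k_idx = torch.arange(seq_len).unsqueeze(0)
bool_mask = (k_idx <= q_idx + shift) & (k_idx > q_idx + shift - window_size)

# Conversion that a consumer expecting an additive bias would apply.
zero = torch.zeros(1)
neg_inf = torch.full((1,), float("-inf"))
new_bias = torch.where(bool_mask, zero, neg_inf)

print(torch.equal(old_bias, new_bias))
```

Both forms describe the same band; they differ only in whether the consumer receives it as a mask or as an additive term.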
    - mask = torch.concat((past_mask, causal_mask), dim=-1)
    - attn_bias = torch.where(mask, torch.tensor(0.0, dtype=dtype, device=device),
    -                         torch.tensor(float('-inf'), dtype=dtype, device=device))
    + attn_bias = torch.concat((past_mask, causal_mask), dim=-1)
The boolean mask for chunked attention will be incorrectly converted when added to attention scores. The concatenated past_mask and causal_mask are boolean tensors that will be converted to float (True→1.0, False→0.0) when used in operations that expect numeric bias values (0.0 for allowed, -inf for masked).
      same_chunk = same_chunk.unsqueeze(0).unsqueeze(0)
    - mask = torch.where(same_chunk, mask, torch.tensor(0.0, dtype=dtype, device=device))
    - attn_bias = torch.log(mask)
    + attn_bias = same_chunk & mask
The Boolean result of the AND operation (`same_chunk & mask`) will be used incorrectly as an attention bias. When this tensor is added to attention scores, True becomes 1.0 and False becomes 0.0, which is not a valid bias. The old code used torch.where to zero the mask outside the current chunk and then applied torch.log, which maps 1.0 to 0.0 (allowed) and 0.0 to -inf (masked).
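For reference, the old `torch.where` plus `torch.log` path and the new `same_chunk & mask` path select the same positions. The sketch below assumes a sequence of six tokens, a chunk size of three, and a plain causal mask with no past context; these values and names are illustrative, not taken from the PR.

```python
import torch

seq_len, chunk_size = 6, 3  # illustrative values
pos = torch.arange(seq_len)

causal = pos.unsqueeze(0) <= pos.unsqueeze(1)  # key index <= query index
same_chunk = (pos.unsqueeze(0) // chunk_size) == (pos.unsqueeze(1) // chunk_size)

# Old style: zero the float mask outside the chunk, then take the log.
float_mask = causal.to(torch.float32)
float_mask = torch.where(same_chunk, float_mask, torch.tensor(0.0))
old_bias = torch.log(float_mask)  # 0.0 where allowed, -inf where masked

# New style: a single Boolean AND.
bool_mask = same_chunk & causal

print(torch.equal(old_bias == 0, bool_mask))
```

The Boolean mask carries the same information, but any code path that adds it to the scores must first convert it to 0.0 / -inf.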
Note that this PR depends on:
- the **Boolean** attention mask introduced by vllm-project#1032 to get valid `m` and `linv` from the FusedSDPA kernel,
- the default query/ctx bucketing config modified in vllm-project#1086

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Port of PR vllm-project#1032 from aice branch to main.

Converts attention masks from float (bf16 with -inf values) to boolean format. This reduces memory usage (bool vs bf16) and is required for the FusedSDPA slicing feature to get valid m and linv outputs from the kernel.

Key changes:
- _naive_prompt_attention: handle bool attn_bias via masked_fill
- _fsdpa_prompt_attention: remove causal+attn_bias workaround (now supported)
- _make_attn_bias: output ~attn_mask (bool) instead of float masked with -inf
- _set_attn_bias: output ~mask (bool)
- _set_attn_bias_for_sliding_window: use bool masks throughout
- _set_attn_bias_for_chunked_attention: use bool masks throughout
- _set_block_mapping: use mask < block_usage (bool) instead of float

Ref: GAUDISW-245533

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: yangulei <24203353+yangulei@users.noreply.github.com>
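The "handle bool attn_bias via masked_fill" item can be pictured with a minimal sketch. The function `naive_attention` below is hypothetical and is not the repository's `_naive_prompt_attention`; it only shows one way a naive path might accept either a Boolean mask (True = attend) or a float additive bias and produce the same result.

```python
import math
import torch

def naive_attention(q, k, v, attn_bias=None):
    """Hypothetical naive attention accepting a bool mask or a float bias."""
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    if attn_bias is not None:
        if attn_bias.dtype == torch.bool:
            # Boolean mask: fill masked (False) positions with -inf.
            scores = scores.masked_fill(~attn_bias, float("-inf"))
        else:
            # Float bias: already 0.0 / -inf, so it can be added directly.
            scores = scores + attn_bias
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(1, 2, 4, 8) for _ in range(3))

idx = torch.arange(4)
bool_mask = idx.unsqueeze(0) <= idx.unsqueeze(1)  # causal, True = attend
float_bias = torch.zeros(4, 4).masked_fill(~bool_mask, float("-inf"))

out_bool = naive_attention(q, k, v, bool_mask)
out_float = naive_attention(q, k, v, float_bias)
print(torch.allclose(out_bool, out_float))
```

Branching on the dtype keeps callers that still pass a float bias working while allowing the Boolean form.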
Port of PR vllm-project#1034 from aice branch to main.

Splits FusedSDPA kernel into smaller chunks for long sequences to:
- Fit chunks into SRAM for better performance
- Improve TPC/MME pipelining
- Reduce attention-mask usage for padded regions

New env vars:
- VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD: KV length threshold for slicing
- VLLM_HPU_FSDPA_SLICE_CHUNK_SIZE: chunk size (rounded to 1024)
- VLLM_HPU_FSDPA_SLICE_WITH_GRAPH_BREAKS: graph break control

Only active with linear bucketing strategy and boolean attention masks.

Depends on: Boolean attention mask (port of vllm-project#1032)

Ref: GAUDISW-245533

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: yangulei <24203353+yangulei@users.noreply.github.com>
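As a rough illustration of why per-chunk statistics matter for slicing, the sketch below merges attention computed over separate KV chunks. It assumes that `m` denotes the per-row maximum of the scores and `linv` the reciprocal of the softmax denominator; the source does not define these names, and the functions here are hypothetical stand-ins rather than the FusedSDPA kernel. Masking and score scaling are omitted for brevity.

```python
import torch

def chunk_attention(q, k, v):
    """Attention over one KV chunk; also returns row max and inverse row sum."""
    scores = q @ k.transpose(-2, -1)
    m = scores.max(dim=-1, keepdim=True).values
    p = torch.exp(scores - m)
    linv = 1.0 / p.sum(dim=-1, keepdim=True)
    return (p @ v) * linv, m, linv

def sliced_attention(q, k, v, chunk_size):
    """Run attention chunk by chunk along the KV length and merge the results."""
    parts = [
        chunk_attention(q, k[..., s:s + chunk_size, :], v[..., s:s + chunk_size, :])
        for s in range(0, k.shape[-2], chunk_size)
    ]
    m_all = torch.stack([m for _, m, _ in parts]).max(dim=0).values
    # Each chunk's share of the global softmax denominator.
    weights = [torch.exp(m - m_all) / linv for _, m, linv in parts]
    total = torch.stack(weights).sum(dim=0)
    merged = torch.stack([w * o for w, (o, _, _) in zip(weights, parts)]).sum(dim=0)
    return merged / total

torch.manual_seed(0)
q = torch.randn(2, 3, 8)
k = torch.randn(2, 10, 8)
v = torch.randn(2, 10, 8)

reference = torch.softmax(q @ k.transpose(-2, -1), dim=-1) @ v
sliced = sliced_attention(q, k, v, chunk_size=4)
print(torch.allclose(sliced, reference, atol=1e-5))
```

The merge only works if every chunk reports a usable maximum and denominator, which is consistent with the stated requirement for valid `m` and `linv` outputs.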
The boolean attention mask change (port of vllm-project#1032) creates bool block_bias in _set_block_mapping (True=valid, False=masked), but pipelined_pa() blindly cast it to float (True→1.0, False→0.0) and added it to attention scores. This broke masking: valid positions got +1.0 noise and masked positions got no penalty (should be -inf).

Fix: detect bool dtype and convert to proper additive bias (0.0/-inf) before use in both the block_softmax kernel path and the manual fallback.

Co-authored-by: yangulei <24203353+yangulei@users.noreply.github.com>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
This reverts commit 9271c08.