Use Boolean attention mask by yangulei · Pull Request #1032 · vllm-project/vllm-gaudi

yangulei · 2026-02-25T05:53:21Z

No description provided.

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

Wei-Lin-Intel

LGTM

Copilot

Pull request overview

This pull request converts attention masks from float tensors with -inf masking to boolean tensors. The changes affect multiple attention bias computation methods including basic attention, sliding window attention, chunked attention, and block mapping.

Changes:

Removed float dtype conversion and -inf masking, replacing with boolean mask inversion
Changed tensor creation from torch.full with -inf to torch.ones with boolean dtype
Replaced torch.where and torch.log operations with direct boolean operations

Copilot · 2026-02-25T05:57:57Z

        mask = torch.arange(0, self.block_size, device=device, dtype=torch.int32).unsqueeze(0)
-        mask = mask >= block_usage.unsqueeze(-1)
-        attn_bias = (torch.zeros_like(mask, dtype=dtype).masked_fill_(mask, -math.inf))
+        attn_bias = mask < block_usage.unsqueeze(-1)


The boolean attention bias cannot be correctly added to attention scores in decode operations. In pipelined_pa (ops.py line 82-83), block_bias is added to attention weights. Converting a boolean tensor to float produces 1.0 for True and 0.0 for False, which is incorrect for attention masking that requires 0.0 for allowed positions and -inf for masked positions.

Suggested change

-        attn_bias = mask < block_usage.unsqueeze(-1)
+        valid_positions = mask < block_usage.unsqueeze(-1)
+        # Convert boolean mask to additive attention bias: 0.0 for allowed, -inf for masked
+        attn_bias = torch.zeros_like(valid_positions, dtype=dtype)
+        attn_bias.masked_fill_(~valid_positions, float("-inf"))
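
To make the concern in this comment concrete, here is a minimal sketch in plain Python, with lists standing in for tensors. The scores, the `valid` pattern, and the `softmax` helper are hypothetical illustrations, not code from the repository. It compares adding a bool-cast bias (1.0/0.0) with adding a proper additive bias (0.0/-inf).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores for one block with 4 slots, 2 of them in use.
scores = [0.5, 1.5, 2.0, 3.0]
valid = [True, True, False, False]

# Bool cast to float and added: True -> 1.0, False -> 0.0.
cast_bias = [float(v) for v in valid]
weights_cast = softmax([s + b for s, b in zip(scores, cast_bias)])

# Additive bias: 0.0 for allowed slots, -inf for masked slots.
additive_bias = [0.0 if v else -math.inf for v in valid]
weights_additive = softmax([s + b for s, b in zip(scores, additive_bias)])
```

With the cast bias the two unused slots still receive attention weight, while the additive bias drives their weight to exactly zero.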

Copilot · 2026-02-25T05:57:58Z

-            mask = torch.concat((past_mask, causal_mask), dim=-1)
-            attn_bias = torch.where(mask, torch.tensor(0.0, dtype=dtype, device=device),
-                                    torch.tensor(float('-inf'), dtype=dtype, device=device))
+            attn_bias = torch.concat((past_mask, causal_mask), dim=-1)


The boolean mask for sliding window attention will be incorrectly converted when added to attention scores. The concatenated past_mask and causal_mask are boolean tensors that will be converted to float (True→1.0, False→0.0) when used in operations that expect numeric bias values (0.0 for allowed, -inf for masked). This applies to the code path using the naive or fsdpa implementation that adds the bias to attention scores.

Copilot · 2026-02-25T05:57:58Z

-            mask = torch.triu(mask, diagonal=shift - window_size + 1)
-            attn_bias = torch.log(mask)
+            attn_bias = torch.triu(mask, diagonal=shift - window_size + 1)



The boolean mask result from torch.triu will be incorrectly used as attention bias. When this boolean tensor is used in attention operations that add the bias to attention scores, True will become 1.0 and False will become 0.0, which is incorrect. The old code correctly used torch.log to convert the mask to 0.0 for allowed positions (where mask=1) and -inf for masked positions (where mask=0).

Suggested change

+        # Convert boolean mask to numeric attention bias: 0.0 for allowed positions, -inf for masked.
+        if attn_bias.dtype == torch.bool:
+            zero = torch.zeros(1, dtype=dtype, device=device)
+            neg_inf = torch.full((1,), float("-inf"), dtype=dtype, device=device)
+            attn_bias = torch.where(attn_bias, zero, neg_inf)
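
The `torch.triu(mask, diagonal=shift - window_size + 1)` call in this hunk keeps only entries whose column index minus row index is at least `shift - window_size + 1`. The sketch below imitates that in plain Python. It assumes (this is not shown in the diff) that `mask` was already lower-triangular with `diagonal=shift`, and the sizes are hypothetical.

```python
def tril(mask, diagonal=0):
    """Keep entries with (col - row) <= diagonal, like torch.tril on a 2-D mask."""
    return [[v and (j - i <= diagonal) for j, v in enumerate(row)]
            for i, row in enumerate(mask)]

def triu(mask, diagonal=0):
    """Keep entries with (col - row) >= diagonal, like torch.triu on a 2-D mask."""
    return [[v and (j - i >= diagonal) for j, v in enumerate(row)]
            for i, row in enumerate(mask)]

# Hypothetical sizes: 3 query rows, 5 key columns, window of 2 keys.
q_len, kv_len, window_size = 3, 5, 2
shift = kv_len - q_len  # assumed offset of the first query row

ones = [[True] * kv_len for _ in range(q_len)]
causal = tril(ones, diagonal=shift)                       # assumed starting mask
band = triu(causal, diagonal=shift - window_size + 1)     # the call from the diff
```

Under these assumptions every query row keeps exactly `window_size` keys, which is the band that the boolean mask encodes.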

Copilot · 2026-02-25T05:57:58Z

-            mask = torch.concat((past_mask, causal_mask), dim=-1)
-            attn_bias = torch.where(mask, torch.tensor(0.0, dtype=dtype, device=device),
-                                    torch.tensor(float('-inf'), dtype=dtype, device=device))
+            attn_bias = torch.concat((past_mask, causal_mask), dim=-1)


The boolean mask for chunked attention will be incorrectly converted when added to attention scores. The concatenated past_mask and causal_mask are boolean tensors that will be converted to float (True→1.0, False→0.0) when used in operations that expect numeric bias values (0.0 for allowed, -inf for masked).

Copilot · 2026-02-25T05:57:58Z

            same_chunk = same_chunk.unsqueeze(0).unsqueeze(0)
-            mask = torch.where(same_chunk, mask, torch.tensor(0.0, dtype=dtype, device=device))
-            attn_bias = torch.log(mask)
+            attn_bias = same_chunk & mask


The boolean result of the AND operation (same_chunk & mask) will be used incorrectly as an attention bias. When this boolean tensor is added to attention scores, True becomes 1.0 and False becomes 0.0, which is incorrect. The old code used torch.where to zero out positions outside the query's chunk and then applied torch.log, which maps 1 to 0.0 (allowed) and 0 to -inf (masked).
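
As an illustration of what `same_chunk & mask` expresses, here is a small plain-Python sketch. How `same_chunk` is computed is not visible in this hunk, so the sketch assumes a query and a key are in the same chunk when their positions share the same integer quotient by a hypothetical `chunk_size`, and it assumes `mask` is a plain causal mask.

```python
# Hypothetical sizes for illustration only.
seq_len, chunk_size = 6, 3

# Assumed definition: query i and key j belong to the same chunk.
same_chunk = [[(i // chunk_size) == (j // chunk_size) for j in range(seq_len)]
              for i in range(seq_len)]

# Assumed causal mask: key j is visible to query i when j <= i.
causal = [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Element-wise AND, the list equivalent of `same_chunk & mask`.
chunked = [[a and b for a, b in zip(row_a, row_b)]
           for row_a, row_b in zip(same_chunk, causal)]
```

In this toy case query 4 sees only keys 3 and 4: keys 0 to 2 fall in the previous chunk and key 5 lies in the future.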

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

taotod

LGTM

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

Note that this PR depends on:
- the **Boolean** attention mask introduced by #1032 to get valid `m` and `linv` from the FusedSDPA kernel,
- the default query/ctx bucketing config modified in #1086

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

Port of PR vllm-project#1032 from aice branch to main.

Converts attention masks from float (bf16 with -inf values) to boolean format. This reduces memory usage (bool vs bf16) and is required for the FusedSDPA slicing feature to get valid m and linv outputs from the kernel.

Key changes:
- _naive_prompt_attention: handle bool attn_bias via masked_fill
- _fsdpa_prompt_attention: remove causal+attn_bias workaround (now supported)
- _make_attn_bias: output ~attn_mask (bool) instead of float masked with -inf
- _set_attn_bias: output ~mask (bool)
- _set_attn_bias_for_sliding_window: use bool masks throughout
- _set_attn_bias_for_chunked_attention: use bool masks throughout
- _set_block_mapping: use mask < block_usage (bool) instead of float

Ref: GAUDISW-245533

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: yangulei <24203353+yangulei@users.noreply.github.com>
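
The memory claim in this commit message can be checked with simple arithmetic: a boolean tensor element takes 1 byte and a bf16 element takes 2 bytes. The sketch below uses a hypothetical mask shape of batch × query length × KV length; the actual shapes used in the repository are not stated here.

```python
# Bytes per element for the two mask formats.
BYTES_PER_ELEMENT = {"bool": 1, "bf16": 2}

def mask_bytes(batch, q_len, kv_len, dtype):
    """Size in bytes of a dense (batch, q_len, kv_len) mask of the given dtype."""
    return batch * q_len * kv_len * BYTES_PER_ELEMENT[dtype]

# Hypothetical shape for illustration only.
batch, q_len, kv_len = 1, 8192, 8192

bool_mib = mask_bytes(batch, q_len, kv_len, "bool") / 2**20
bf16_mib = mask_bytes(batch, q_len, kv_len, "bf16") / 2**20
```

For this shape the boolean mask occupies 64 MiB against 128 MiB for bf16, so the mask footprint halves.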

Port of PR vllm-project#1034 from aice branch to main.

Splits FusedSDPA kernel into smaller chunks for long sequences to:
- Fit chunks into SRAM for better performance
- Improve TPC/MME pipelining
- Reduce attention-mask usage for padded regions

New env vars:
- VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD: KV length threshold for slicing
- VLLM_HPU_FSDPA_SLICE_CHUNK_SIZE: chunk size (rounded to 1024)
- VLLM_HPU_FSDPA_SLICE_WITH_GRAPH_BREAKS: graph break control

Only active with linear bucketing strategy and boolean attention masks.

Depends on: Boolean attention mask (port of vllm-project#1032)

Ref: GAUDISW-245533

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: yangulei <24203353+yangulei@users.noreply.github.com>
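
The commit message does not spell out what `m` and `linv` contain. The sketch below assumes they are the per-row score maximum and the inverse of the softmax denominator, which is enough to merge attention results computed over separate KV chunks. It is a plain-Python illustration for a single query row with scalar values; `attend` and `merge` are hypothetical helpers, not the FusedSDPA API.

```python
import math

def attend(scores, values):
    """Softmax-weighted sum over one KV chunk; returns (out, m, linv)."""
    m = max(scores)                          # row maximum of the chunk
    exps = [math.exp(s - m) for s in scores]
    l = sum(exps)                            # softmax denominator
    out = sum(e * v for e, v in zip(exps, values)) / l
    return out, m, 1.0 / l

def merge(chunks):
    """Combine per-chunk (out, m, linv) triples into the full attention output."""
    m_all = max(m for _, m, _ in chunks)
    # exp(m - m_all) / linv rescales each chunk's denominator to the global maximum.
    weights = [math.exp(m - m_all) / linv for _, m, linv in chunks]
    total = sum(weights)
    return sum(w * out for w, (out, _, _) in zip(weights, chunks)) / total

# Hypothetical scores and values for one query row and six keys.
scores = [0.1, 2.0, -1.0, 0.7, 1.3, 0.2]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chunk = 2

full, _, _ = attend(scores, values)
parts = [attend(scores[i:i + chunk], values[i:i + chunk])
         for i in range(0, len(scores), chunk)]
sliced = merge(parts)
```

Under these assumptions `sliced` matches `full` up to floating-point rounding, which is why each chunk needs to return a usable maximum and denominator alongside its output.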

The boolean attention mask change (port of vllm-project#1032) creates bool block_bias in _set_block_mapping (True=valid, False=masked), but pipelined_pa() blindly cast it to float (True→1.0, False→0.0) and added it to attention scores. This broke masking: valid positions got +1.0 noise and masked positions got no penalty (should be -inf).

Fix: detect bool dtype and convert to proper additive bias (0.0/-inf) before use in both the block_softmax kernel path and the manual fallback.

Co-authored-by: yangulei <24203353+yangulei@users.noreply.github.com>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
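
The fix described here amounts to a guard that converts a boolean bias to an additive one and passes an already-numeric bias through unchanged. A rough plain-Python sketch follows; `to_additive_bias` is a hypothetical helper, and the `isinstance` check stands in for a tensor dtype check such as `block_bias.dtype == torch.bool`.

```python
import math

def to_additive_bias(block_bias):
    """Return an additive bias: 0.0 where attention is allowed, -inf where masked.

    Boolean input is interpreted as True=valid, False=masked.
    Numeric input is assumed to be additive already and is returned as a copy.
    """
    if all(isinstance(v, bool) for v in block_bias):
        return [0.0 if v else -math.inf for v in block_bias]
    return list(block_bias)

bool_bias = [True, True, False]
float_bias = [0.0, 0.0, -math.inf]

converted = to_additive_bias(bool_bias)
passthrough = to_additive_bias(float_bias)
```

Both call sites then add the same kind of value to the scores, regardless of which mask format the caller produced.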

This reverts commit 9271c08.

use boolean mask

536c911

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

Copilot AI review requested due to automatic review settings February 25, 2026 05:53

yangulei requested review from Wei-Lin-Intel, czhu15, mgawarkiewicz-intel, piotrbocian, taotod and wpyszka as code owners February 25, 2026 05:53

Copilot started reviewing on behalf of yangulei February 25, 2026 05:53

Wei-Lin-Intel approved these changes Feb 25, 2026

Copilot AI reviewed Feb 25, 2026

fix boolean mask for _naive_prompt_attention

1e835a5

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

github-actions Bot mentioned this pull request Feb 25, 2026

🚦 Team Review Dashboard #701

Open

taotod approved these changes Feb 25, 2026

taotod merged commit 2140bba into vllm-project:aice Feb 25, 2026
1 check passed

yangulei mentioned this pull request Feb 26, 2026

Enable slicing for the BF16 FusedSDPA #1034

Merged

czhu15 pushed a commit that referenced this pull request Feb 27, 2026

Use Boolean attention mask (#1032)

9271c08

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

tvoas pushed a commit to tvoas/vllm-gaudi that referenced this pull request Mar 11, 2026

Use Boolean attention mask (vllm-project#1032)

1362917

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

afierka-intel mentioned this pull request Mar 12, 2026

Use Boolean attention mask and enable FusedSDPA slicing for long sequences #1149

Closed

yangulei added a commit to yangulei/vllm-gaudi that referenced this pull request Apr 2, 2026

Revert "Use Boolean attention mask (vllm-project#1032)"

e3f89b2

This reverts commit 9271c08.

yangulei mentioned this pull request Apr 2, 2026

Use finite numbers for the attention mask #1290

Closed

taotod merged 2 commits into vllm-project:aice from yangulei:bool_mask

4 participants
