
Use Boolean attention mask #1032

Merged
taotod merged 2 commits into vllm-project:aice from yangulei:bool_mask on Feb 25, 2026

Conversation

@yangulei
Collaborator

No description provided.

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Collaborator

@Wei-Lin-Intel Wei-Lin-Intel left a comment


LGTM

Contributor

Copilot AI left a comment


Pull request overview

This pull request converts attention masks from float tensors with -inf masking to boolean tensors. The changes affect multiple attention bias computation methods including basic attention, sliding window attention, chunked attention, and block mapping.

Changes:

  • Removed float dtype conversion and -inf masking, replacing with boolean mask inversion
  • Changed tensor creation from torch.full with -inf to torch.ones with boolean dtype
  • Replaced torch.where and torch.log operations with direct boolean operations


Comment thread vllm_gaudi/v1/worker/hpu_model_runner.py
  mask = torch.arange(0, self.block_size, device=device, dtype=torch.int32).unsqueeze(0)
- mask = mask >= block_usage.unsqueeze(-1)
- attn_bias = (torch.zeros_like(mask, dtype=dtype).masked_fill_(mask, -math.inf))
+ attn_bias = mask < block_usage.unsqueeze(-1)

Copilot AI Feb 25, 2026


The boolean attention bias cannot be correctly added to attention scores in decode operations. In pipelined_pa (ops.py, lines 82-83), block_bias is added to the attention weights. Converting a boolean tensor to float produces 1.0 for True and 0.0 for False, which is incorrect for attention masking: additive masking requires 0.0 for allowed positions and -inf for masked positions.

Suggested change
- attn_bias = mask < block_usage.unsqueeze(-1)
+ valid_positions = mask < block_usage.unsqueeze(-1)
+ # Convert boolean mask to additive attention bias: 0.0 for allowed, -inf for masked
+ attn_bias = torch.zeros_like(valid_positions, dtype=dtype)
+ attn_bias.masked_fill_(~valid_positions, float("-inf"))
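
To make the two representations in this thread concrete, here is a minimal sketch in plain Python (lists stand in for tensors so it runs without torch). The helper names `block_bias_bool` and `to_additive_bias` are hypothetical and do not come from the PR; the sketch assumes `block_usage` holds the number of occupied slots per block.

```python
import math

def block_bias_bool(block_usage, block_size):
    """Boolean form: True marks a valid slot (slot index < used slots in that block)."""
    return [[slot < used for slot in range(block_size)] for used in block_usage]

def to_additive_bias(valid_rows):
    """Additive form: True -> 0.0 (keep), False -> -inf (mask out)."""
    return [[0.0 if valid else -math.inf for valid in row] for row in valid_rows]

# Two blocks of four slots: the first is full, the second holds two tokens.
valid = block_bias_bool(block_usage=[4, 2], block_size=4)
bias = to_additive_bias(valid)

print(valid[1])  # [True, True, False, False]
print(bias[1])   # [0.0, 0.0, -inf, -inf]
```

Which form is correct depends on the consumer: a kernel that accepts boolean masks can take `valid` directly, while any code path that adds the bias to the scores needs the additive form.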

- mask = torch.concat((past_mask, causal_mask), dim=-1)
- attn_bias = torch.where(mask, torch.tensor(0.0, dtype=dtype, device=device),
-                         torch.tensor(float('-inf'), dtype=dtype, device=device))
+ attn_bias = torch.concat((past_mask, causal_mask), dim=-1)

Copilot AI Feb 25, 2026


The boolean mask for sliding window attention will be incorrectly converted when added to attention scores. The concatenated past_mask and causal_mask are boolean tensors that will be converted to float (True→1.0, False→0.0) when used in operations that expect numeric bias values (0.0 for allowed, -inf for masked). This applies to the code path using the naive or fsdpa implementation that adds the bias to attention scores.

- mask = torch.triu(mask, diagonal=shift - window_size + 1)
- attn_bias = torch.log(mask)
+ attn_bias = torch.triu(mask, diagonal=shift - window_size + 1)


Copilot AI Feb 25, 2026


The boolean mask result from torch.triu will be incorrectly used as attention bias. When this boolean tensor is used in attention operations that add the bias to attention scores, True will become 1.0 and False will become 0.0, which is incorrect. The old code correctly used torch.log to convert the mask to 0.0 for allowed positions (where mask=1) and -inf for masked positions (where mask=0).

Suggested change
+ # Convert boolean mask to numeric attention bias: 0.0 for allowed positions, -inf for masked.
+ if attn_bias.dtype == torch.bool:
+     zero = torch.zeros(1, dtype=dtype, device=device)
+     neg_inf = torch.full((1,), float("-inf"), dtype=dtype, device=device)
+     attn_bias = torch.where(attn_bias, zero, neg_inf)
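
The band that `torch.triu(..., diagonal=shift - window_size + 1)` carves out can be sketched without torch. This is an illustration only. It assumes the mask was already lower-triangular with diagonal `shift` before the `triu` call, and that `shift = kv_len - q_len`; neither is visible in the hunk above. The function name `sliding_window_mask` is hypothetical.

```python
def sliding_window_mask(q_len, kv_len, window_size):
    """Boolean band mask: True where a query row may attend to a key column.

    Keeps entries with shift - window_size + 1 <= col - row <= shift,
    i.e. the causal triangle trimmed to the last `window_size` keys.
    """
    shift = kv_len - q_len  # assumption: the last query aligns with the last key
    return [
        [shift - window_size + 1 <= col - row <= shift for col in range(kv_len)]
        for row in range(q_len)
    ]

mask = sliding_window_mask(q_len=4, kv_len=4, window_size=2)
for row in mask:
    print("".join("x" if keep else "." for keep in row))
# x...
# xx..
# .xx.
# ..xx
```

Each row keeps at most `window_size` keys, and the result is already boolean, so no `torch.log` step is needed to build it.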

- mask = torch.concat((past_mask, causal_mask), dim=-1)
- attn_bias = torch.where(mask, torch.tensor(0.0, dtype=dtype, device=device),
-                         torch.tensor(float('-inf'), dtype=dtype, device=device))
+ attn_bias = torch.concat((past_mask, causal_mask), dim=-1)

Copilot AI Feb 25, 2026


The boolean mask for chunked attention will be incorrectly converted when added to attention scores. The concatenated past_mask and causal_mask are boolean tensors that will be converted to float (True→1.0, False→0.0) when used in operations that expect numeric bias values (0.0 for allowed, -inf for masked).

  same_chunk = same_chunk.unsqueeze(0).unsqueeze(0)
- mask = torch.where(same_chunk, mask, torch.tensor(0.0, dtype=dtype, device=device))
- attn_bias = torch.log(mask)
+ attn_bias = same_chunk & mask

Copilot AI Feb 25, 2026


The boolean mask result from the AND operation (same_chunk & mask) will be incorrectly used as attention bias. When this boolean tensor is used in attention operations that add the bias to attention scores, True will become 1.0 and False will become 0.0, which is incorrect. The old code used torch.where to zero out positions outside the current chunk, then applied torch.log to obtain the proper bias values (0.0 where the mask was 1, -inf where it was 0).
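
The `same_chunk & mask` combination can be illustrated in plain Python. This is a sketch under the assumption that `mask` is a causal mask and that chunks are fixed-size, contiguous ranges of positions; `chunked_causal_mask` is a hypothetical name, not a function from the PR.

```python
def chunked_causal_mask(seq_len, chunk_size):
    """True where query q may attend to key k: same chunk AND not in the future."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q
            same_chunk = (q // chunk_size) == (k // chunk_size)
            row.append(same_chunk and causal)
        mask.append(row)
    return mask

mask = chunked_causal_mask(seq_len=4, chunk_size=2)
for row in mask:
    print("".join("x" if keep else "." for keep in row))
# x...
# xx..
# ..x.
# ..xx
```

Row 2 starts a new chunk, so it cannot see keys 0 and 1 even though they lie in the past.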

Comment thread vllm_gaudi/v1/worker/hpu_model_runner.py
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Collaborator

@taotod taotod left a comment


LGTM

@taotod taotod merged commit 2140bba into vllm-project:aice Feb 25, 2026
1 check passed
czhu15 pushed a commit that referenced this pull request Feb 27, 2026
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
czhu15 pushed a commit that referenced this pull request Mar 5, 2026
Note that this PR depends on:
- the **Boolean** attention mask introduced by #1032 to get valid `m` and `linv` from the FusedSDPA kernel,
- the default query/ctx bucketing config modified in #1086

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
tvoas pushed a commit to tvoas/vllm-gaudi that referenced this pull request Mar 11, 2026
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
tvoas pushed a commit to tvoas/vllm-gaudi that referenced this pull request Mar 11, 2026
Note that this PR depends on:
- the **Boolean** attention mask introduced by vllm-project#1032 to get valid `m` and `linv` from the FusedSDPA kernel,
- the default query/ctx bucketing config modified in vllm-project#1086

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
afierka-intel added a commit to afierka-intel/vllm-gaudi that referenced this pull request Mar 12, 2026
Port of PR vllm-project#1032 from aice branch to main.

Converts attention masks from float (bf16 with -inf values) to boolean
format. This reduces memory usage (bool vs bf16) and is required for
the FusedSDPA slicing feature to get valid m and linv outputs from
the kernel.
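
As a rough worked example of the memory claim: a boolean element occupies 1 byte and a bf16 element 2 bytes, so the mask footprint halves. The shape below (one sequence, 8192 queries by 8192 keys) is an arbitrary assumption chosen for illustration, not a figure from the PR.

```python
BYTES_PER_ELEMENT = {"bool": 1, "bf16": 2}  # element sizes of torch.bool and torch.bfloat16

def mask_bytes(batch, q_len, kv_len, dtype):
    """Size in bytes of a dense (batch, q_len, kv_len) attention mask."""
    return batch * q_len * kv_len * BYTES_PER_ELEMENT[dtype]

for dtype in ("bf16", "bool"):
    mib = mask_bytes(batch=1, q_len=8192, kv_len=8192, dtype=dtype) / 2**20
    print(f"{dtype}: {mib:.0f} MiB")
# bf16: 128 MiB
# bool: 64 MiB
```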

Key changes:
- _naive_prompt_attention: handle bool attn_bias via masked_fill
- _fsdpa_prompt_attention: remove causal+attn_bias workaround (now supported)
- _make_attn_bias: output ~attn_mask (bool) instead of float masked with -inf
- _set_attn_bias: output ~mask (bool)
- _set_attn_bias_for_sliding_window: use bool masks throughout
- _set_attn_bias_for_chunked_attention: use bool masks throughout
- _set_block_mapping: use mask < block_usage (bool) instead of float

Ref: GAUDISW-245533

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
afierka-intel added a commit to afierka-intel/vllm-gaudi that referenced this pull request Mar 12, 2026
Port of PR vllm-project#1034 from aice branch to main.

Splits FusedSDPA kernel into smaller chunks for long sequences to:
- Fit chunks into SRAM for better performance
- Improve TPC/MME pipelining
- Reduce attention-mask usage for padded regions

New env vars:
- VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD: KV length threshold for slicing
- VLLM_HPU_FSDPA_SLICE_CHUNK_SIZE: chunk size (rounded to 1024)
- VLLM_HPU_FSDPA_SLICE_WITH_GRAPH_BREAKS: graph break control

Only active with linear bucketing strategy and boolean attention masks.
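
One way to see why per-chunk `m` and `linv` matter for slicing is the standard trick for merging softmax results computed over separate key ranges. The sketch below is plain Python for a single query with scalar values. It assumes `m` is the maximum score in a chunk and `linv` is the reciprocal of the chunk's sum of `exp(score - m)`; the PR does not define these outputs, so treat both meanings, and the function names, as assumptions.

```python
import math

def attend_chunk(scores, values):
    """Softmax attention over one chunk; also returns the chunk max m and inverse sum linv."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    out = sum(e * v for e, v in zip(exps, values)) / total
    return out, m, 1.0 / total

def merge_chunks(partials):
    """Combine per-chunk (out, m, linv) triples into the full-sequence attention output."""
    m_all = max(m for _, m, _ in partials)
    # exp(m - m_all) / linv recovers each chunk's sum of exp(score - m_all).
    weights = [math.exp(m - m_all) / linv for _, m, linv in partials]
    return sum(w * out for w, (out, _, _) in zip(weights, partials)) / sum(weights)

scores = [0.5, 2.0, -1.0, 1.5]
values = [1.0, 2.0, 3.0, 4.0]

full, _, _ = attend_chunk(scores, values)
merged = merge_chunks([
    attend_chunk(scores[:2], values[:2]),
    attend_chunk(scores[2:], values[2:]),
])
print(math.isclose(full, merged))  # True
```

The merge is only as good as the statistics fed into it, which is consistent with the stated requirement that the kernel return valid `m` and `linv`. A fully masked chunk is not handled in this sketch.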

Depends on: Boolean attention mask (port of vllm-project#1032)
Ref: GAUDISW-245533

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
afierka-intel added a commit to afierka-intel/vllm-gaudi that referenced this pull request Mar 17, 2026
Port of PR vllm-project#1032 from aice branch to main.

Converts attention masks from float (bf16 with -inf values) to boolean
format. This reduces memory usage (bool vs bf16) and is required for
the FusedSDPA slicing feature to get valid m and linv outputs from
the kernel.

Key changes:
- _naive_prompt_attention: handle bool attn_bias via masked_fill
- _fsdpa_prompt_attention: remove causal+attn_bias workaround (now supported)
- _make_attn_bias: output ~attn_mask (bool) instead of float masked with -inf
- _set_attn_bias: output ~mask (bool)
- _set_attn_bias_for_sliding_window: use bool masks throughout
- _set_attn_bias_for_chunked_attention: use bool masks throughout
- _set_block_mapping: use mask < block_usage (bool) instead of float

Ref: GAUDISW-245533

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: yangulei <24203353+yangulei@users.noreply.github.com>
afierka-intel added a commit to afierka-intel/vllm-gaudi that referenced this pull request Mar 17, 2026
Port of PR vllm-project#1034 from aice branch to main.

Splits FusedSDPA kernel into smaller chunks for long sequences to:
- Fit chunks into SRAM for better performance
- Improve TPC/MME pipelining
- Reduce attention-mask usage for padded regions

New env vars:
- VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD: KV length threshold for slicing
- VLLM_HPU_FSDPA_SLICE_CHUNK_SIZE: chunk size (rounded to 1024)
- VLLM_HPU_FSDPA_SLICE_WITH_GRAPH_BREAKS: graph break control

Only active with linear bucketing strategy and boolean attention masks.

Depends on: Boolean attention mask (port of vllm-project#1032)
Ref: GAUDISW-245533

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: yangulei <24203353+yangulei@users.noreply.github.com>
afierka-intel added a commit to afierka-intel/vllm-gaudi that referenced this pull request Mar 17, 2026
The boolean attention mask change (port of vllm-project#1032) creates bool block_bias
in _set_block_mapping (True=valid, False=masked), but pipelined_pa()
blindly cast it to float (True→1.0, False→0.0) and added it to attention
scores. This broke masking: valid positions got +1.0 noise and masked
positions got no penalty (should be -inf).

Fix: detect bool dtype and convert to proper additive bias (0.0/-inf)
before use in both the block_softmax kernel path and the manual fallback.
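
The failure mode and the fix can be reproduced in a few lines of plain Python. This is an illustration, not the code from `pipelined_pa()`; `as_additive_bias` is a hypothetical helper standing in for the dtype check described above.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def as_additive_bias(bias):
    """Convert a boolean validity mask to 0.0/-inf; pass float biases through unchanged."""
    if all(isinstance(b, bool) for b in bias):
        return [0.0 if b else -math.inf for b in bias]
    return list(bias)

scores = [1.0, 2.0, 3.0]
block_bias = [True, True, False]  # the last slot is padding and must get zero weight

# Bug: casting the mask to float adds +1.0 to valid slots and nothing to masked ones.
buggy = softmax([s + float(b) for s, b in zip(scores, block_bias)])
# Fix: convert to an additive bias first, so masked slots receive -inf.
fixed = softmax([s + b for s, b in zip(scores, as_additive_bias(block_bias))])

print(f"{buggy[2]:.3f}")  # 0.422
print(f"{fixed[2]:.3f}")  # 0.000
```

With the cast, the padded slot ends up with the largest share of attention; after conversion it contributes nothing.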

Co-authored-by: yangulei <24203353+yangulei@users.noreply.github.com>

Signed-off-by: Artur Fierka <artur.fierka@intel.com>
yangulei added a commit to yangulei/vllm-gaudi that referenced this pull request Apr 2, 2026
4 participants