Support boolean attention mask in Attention(23) CUDA - MHA case #27428
justinchuby merged 7 commits into main from …
Conversation
Pull request overview
Adds support for boolean attn_mask to the ONNX Attention (opset 23) CUDA implementation (MHA path) by converting boolean masks into an additive attention bias on the GPU, and enables the corresponding CUDA test coverage.
Changes:
- Added a CUDA kernel + launcher to convert `bool` attention masks into additive bias (`true` → 0, `false` → `mask_filter_value`); see the reference sketch below.
- Updated CUDA `Attention<T>::ComputeInternal` to accept boolean masks and run the conversion into a scratch buffer.
- Enabled CUDA execution for existing boolean-mask attention tests.
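As a reference for that mapping, a minimal CPU-side sketch (the function name and flat buffer layout here are illustrative, not the PR's actual code):

```cpp
#include <cstddef>
#include <vector>

// Reference semantics of the conversion: true keeps a position (bias 0),
// false masks it out with a large negative value (mask_filter_value).
template <typename T>
std::vector<T> BoolMaskToBias(const std::vector<bool>& mask, float mask_filter_value) {
  std::vector<T> bias(mask.size());
  for (std::size_t i = 0; i < mask.size(); ++i) {
    bias[i] = mask[i] ? T(0.0f) : T(mask_filter_value);
  }
  return bias;
}
```

After this conversion the mask participates in attention as an ordinary additive bias, so the downstream softmax needs no bool-specific handling.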
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/llm/attention_op_test.cc | Enables CUDA runs for bool-mask test cases (including an all-false degenerate mask). |
| onnxruntime/core/providers/cuda/llm/attention_mask_impl.h | Declares the new bool-mask-to-bias conversion launcher. |
| onnxruntime/core/providers/cuda/llm/attention_mask_impl.cu | Implements the CUDA kernel and explicit instantiations for float/half/bfloat16. |
| onnxruntime/core/providers/cuda/llm/attention.cc | Uses the new conversion path for boolean masks and aligns mask_filter_value with CPU helper. |
🔍 Independent Review by a Team of Copilots

Review team: 2 Architects (claude-opus-4.6), 1 Code Reviewer (gemini-3-pro-preview), 1 Critical Reviewer (claude-sonnet-4.6) — coordinated by a Copilot Project Lead

Verdict: ✅ Conditional Approve

Excellent PR — clean design, great documentation, well-scoped change. One blocking issue to fix, two non-blocking suggestions.

🔴 BLOCKING: Grid Size Overflow (H1)

File: onnxruntime/core/providers/cuda/llm/attention_mask_impl.cu — the launch derives the block count directly from the element count, which can exceed the gridDim.x limit (2^31 − 1 blocks) for very large masks.
Suggested fix — grid-stride loop (common CUDA best practice):

```cuda
__global__ void ConvertBoolMaskToAttentionBiasKernel(...) {
  for (int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       idx < num_elements;
       idx += static_cast<int64_t>(gridDim.x) * blockDim.x) {
    attention_bias[idx] = attn_mask[idx] ? T(0.0f) : T(mask_filter_value);
  }
}
```

And cap the grid (see the launcher sketch after the suggestions below).

🟡 Non-blocking suggestions

M1 — Unused include (…)

M2 — Test coverage gaps (…)
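A minimal sketch of the capped launch that pairs with the grid-stride kernel above (the launcher signature, block size, and cap value are assumptions, not the PR's actual code):

```cuda
template <typename T>
void LaunchConvertBoolMaskToAttentionBias(cudaStream_t stream,
                                          const bool* attn_mask,
                                          T* attention_bias,
                                          int64_t num_elements,
                                          float mask_filter_value) {
  constexpr int kThreads = 256;  // assumed block size
  int64_t blocks = (num_elements + kThreads - 1) / kThreads;
  // Cap the grid: the grid-stride loop in the kernel covers any remainder,
  // so the launch can never exceed gridDim.x even for huge masks.
  constexpr int64_t kMaxBlocks = 65535;  // conservative; gridDim.x allows up to 2^31 - 1
  if (blocks > kMaxBlocks) blocks = kMaxBlocks;
  ConvertBoolMaskToAttentionBiasKernel<T><<<static_cast<int>(blocks), kThreads, 0, stream>>>(
      attn_mask, attention_bias, num_elements, mask_filter_value);
}
```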
✅ Confirmed Safe

We investigated several potential concerns and confirmed they are not issues:
👏 Highlights
Review performed by a team of Copilots — 4 specialist agents with different models and perspectives, coordinated by a Copilot Project Lead. Findings represent cross-validated consensus.
A 4D mask [1, 32, 128K, 128K] = 32 × 128K × 128K ≈ 550 billion elements. At 1 byte per bool, that's ~550 GB — it wouldn't even fit in GPU memory. The max gridDim.x is 2^31 − 1 blocks.

But practically, even a 4D mask [8, 32, 8192, 8192] (very generous) = ~17 billion elements = ~17 GB of bool, which is already well beyond what any real model uses.

That said, the fix is trivially correct and costs nothing at runtime, so it's a reasonable defensive improvement. It's up to you — both "accept and fix" (takes 2 minutes) and leaving it as-is are defensible.
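To make those thresholds concrete, a quick back-of-the-envelope check (plain C++; the 256-thread block size is an assumed launch parameter):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t kMaxGridX = 2147483647;  // max gridDim.x: 2^31 - 1 blocks
  const int64_t kThreads = 256;          // assumed threads per block

  // Overflow requires num_elements / kThreads > kMaxGridX:
  const int64_t overflow_at = kMaxGridX * kThreads;  // ~5.5e11 elements
  const int64_t mask_128k = 32LL * 131072 * 131072;  // [1,32,128K,128K] = 2^39
  const int64_t mask_8k = 8LL * 32 * 8192 * 8192;    // [8,32,8192,8192]

  std::printf("overflow threshold: %lld elements\n", (long long)overflow_at);
  std::printf("[1,32,128K,128K]:   %lld elements (~%.0f GB of bool)\n",
              (long long)mask_128k, mask_128k / 1e9);
  std::printf("[8,32,8192,8192]:   %lld elements (~%.0f GB of bool)\n",
              (long long)mask_8k, mask_8k / 1e9);
  return 0;
}
```

Only the first shape reaches the overflow threshold, and it would need ~550 GB just to store the mask — the basis of the "not practically reachable" argument.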
Should have realized that 😅 AI team failed.
Replace and reland #27129
Comparison between this PR's approach and handling the mask inline in softmax
Tradeoffs
This pull request enhances the ONNX Runtime CUDA Attention operator to support boolean attention masks (bool masks) in the Multi-Head Attention (MHA) path, converting them to additive attention bias on the GPU. It also improves test coverage to ensure correctness and parity with the CPU implementation. The main changes include implementing a CUDA kernel for mask conversion, updating the operator logic to handle bool masks, clarifying broadcasting rules, and adding comprehensive unit tests.
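In equation form, the conversion makes the boolean mask equivalent to a standard additive bias in scaled dot-product attention (the bias rule matching the true → 0, false → mask_filter_value mapping above):

$$
\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}+B\right)V,
\qquad
B_{ij}=\begin{cases}
0 & \text{if } \mathrm{mask}_{ij}=\text{true}\\
\text{mask\_filter\_value} & \text{if } \mathrm{mask}_{ij}=\text{false}
\end{cases}
$$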
CUDA Attention Operator Improvements:
- Added a CUDA kernel and launcher (`LaunchConvertBoolMaskToAttentionBias`) that converts boolean attention masks to additive bias (True → 0.0, False → `mask_filter_value`) for the MHA path, ensuring efficient GPU execution. [1] [2]
- Updated `attention.cc` to use this kernel, correctly handle bool masks in the MHA path, and clarify the broadcasting logic and mask shape interpretation for both GQA and MHA. [1] [2] [3] [4] [5]

Testing and Documentation Enhancements:
Test Coverage and Reliability:
These changes make the CUDA Attention operator more robust and feature-complete, aligning its behavior with the CPU implementation and ONNX specifications.