
[Triton] Flash Attention Triton Windows build support #2433

Merged
micmelesse merged 2 commits into main from micmelesse/windows-rocm-support on Mar 26, 2026

Conversation

Contributor

@micmelesse commented Mar 23, 2026

Motivation

Users were using the Triton backend of Flash Attention on Windows, and the recent migration to AITER caused issues. This PR is a continuation of #2428 by @0xDELUXA. It cherry-picks the author's commits with their permission. I take responsibility for getting the PR merged.

Test Plan

Test Result

Submission Checklist

@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label      Tests
ci:sglang  SGLang integration tests
ci:atom    ATOM benchmark (DeepSeek-R1 + GPT-OSS)
ci:vllm    vLLM benchmark
ci:all     All of the above

Add labels via the sidebar or gh pr edit 2433 --add-label <label>

@micmelesse force-pushed the micmelesse/windows-rocm-support branch from 567b3ee to 317b097 on March 23, 2026 17:51
@micmelesse force-pushed the micmelesse/windows-rocm-support branch from 33cc6a1 to f8f9b47 on March 23, 2026 18:55
@micmelesse marked this pull request as ready for review on March 23, 2026 18:55
@micmelesse requested review from a team and Copilot on March 23, 2026 18:55
Contributor

Copilot AI left a comment

Pull request overview

Adds Windows support for using aiter’s Triton FlashAttention backend by avoiding CK/HIP JIT build paths on Windows, making top-level imports more tolerant when the ROCm/HIP JIT runtime isn’t present, and adding a Windows CI import smoke test.

Changes:

  • Disable CK and prebuilt-kernel compilation on Windows in setup.py, and adjust packaging layout accordingly.
  • Make aiter top-level imports resilient to a missing ROCm/HIP JIT runtime by catching ImportError and continuing, so Triton can still be used (see the sketch after this list).
  • Add a Windows GitHub Actions smoke test to verify Triton FlashAttention modules are importable.
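
The import guard described in the second bullet is, in spirit, the standard pattern below. A minimal sketch with illustrative module names, not the PR's actual diff:

```python
# aiter/__init__.py, in spirit (illustrative names, not the actual diff)
import warnings

try:
    # These imports need the ROCm/HIP JIT runtime at import time.
    from . import jit_ops  # illustrative name
except ImportError as exc:
    # On platforms without the JIT runtime (e.g. Windows), continue so
    # the pure-Triton backends remain importable and usable.
    warnings.warn(
        f"ROCm/HIP JIT runtime unavailable: {exc}; "
        "falling back to Triton-only functionality."
    )
```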

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File: Description
setup.py: Adds Windows detection and changes CK/prebuild behavior plus packaging of hsa/.
aiter/__init__.py: Wraps the ROCm/HIP JIT and ops imports in a try/except to allow import when the runtime is unavailable.
.github/workflows/flash_attention_integration.yaml: Triggers on aiter/__init__.py changes and adds a Windows import-only CI job.
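
The setup.py side of this is presumably a conventional platform gate; a hedged sketch, where the helper and variable names are assumptions rather than the PR's actual code:

```python
# setup.py-style platform gating (sketch; names are assumptions,
# not the PR's actual code)
import platform

IS_WINDOWS = platform.system() == "Windows"

def build_ck_extensions():
    """Hypothetical stand-in for the real CK/HIP extension build."""
    return []

if IS_WINDOWS:
    # No CK or prebuilt HIP kernels on Windows: ship only the
    # Python/Triton sources plus the hsa/ data files.
    ext_modules = []
else:
    ext_modules = build_ck_extensions()
```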


  • Comment thread on setup.py
  • Comment thread on .github/workflows/flash_attention_integration.yaml (outdated)
  • Comment thread on setup.py (outdated)
  • Comment thread on .github/workflows/flash_attention_integration.yaml
@micmelesse force-pushed the micmelesse/windows-rocm-support branch 4 times, most recently from 676aaf6 to c746b26 on March 24, 2026 15:15

astrelsky commented Mar 24, 2026

The first problem I noticed is that flydsl is not optional, making it impossible to install the requirements for aiter. Making it optional would need to be done in both requirements.txt and pyproject.toml so that the instructions in the README don't fail.

[screenshot]
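
For context, the conventional way to make a dependency platform-conditional is a PEP 508 environment marker, which works in requirements.txt and pyproject.toml alike. A sketch of the idea, not what the repo currently does:

```python
# Sketch only: a PEP 508 marker skips flydsl on Windows installs.
# The same marker syntax works in the other dependency files:
#   requirements.txt:  flydsl; sys_platform != "win32"
#   pyproject.toml:    dependencies = ['flydsl; sys_platform != "win32"']
from setuptools import setup

setup(
    name="aiter",
    install_requires=[
        'flydsl; sys_platform != "win32"',  # not installed on Windows
    ],
)
```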

The second thing I noticed is the following warning and a failure when trying to run one of the op tests.
[screenshot]

The first part of that warning is wholly untrue; it is supported. If it weren't, we wouldn't have torchaudio or torchvision. I couldn't find what changes aiter made to cpp_extension.py, since it appears to be very out of date relative to the one in PyTorch. It should probably be updated with whatever changes aiter needs: https://github.com/pytorch/pytorch/blob/main/torch/utils/cpp_extension.py

Contributor Author

micmelesse commented Mar 24, 2026

I have posted an update on Dao-AILab/flash-attention#2385 (comment). I think the latest version fixes these issues.


astrelsky commented Mar 24, 2026

> I have posted an update on Dao-AILab/flash-attention#2385 (comment). I think the latest version fixes these issues.

The issues I mentioned above are in aiter. I'm not sure whether you were replying directly to my previous comment or not (sorry, I'm falling asleep and not thinking clearly).

@micmelesse force-pushed the micmelesse/windows-rocm-support branch 2 times, most recently from e6b0f62 to 74d4cd4 on March 25, 2026 16:21
0xDELUXA and others added 2 commits March 26, 2026 05:57:

  • Initial FA-2 Triton Windows build support
  • Continue Work (lint; minimize diff; windows smoke test; address copilot; fixes; remove is_windows error; improve windows message)
@micmelesse force-pushed the micmelesse/windows-rocm-support branch from 74d4cd4 to eb538a7 on March 26, 2026 10:02
Contributor Author

Ok, merging this based on feedback on Dao-AILab/flash-attention#2385.

[screenshot]

@brunomazzottiamd self-requested a review on March 26, 2026 16:10
Contributor

@brunomazzottiamd left a comment

Approving the PR since the upstream requesters are happy with the proposed changes. I'd ask for a thumbs-up from the AITER CI team (FYI: @gyohuangxin).

@micmelesse merged commit b4b7516 into main on Mar 26, 2026
58 of 59 checks passed
@micmelesse deleted the micmelesse/windows-rocm-support branch on March 26, 2026 16:22
@0xDELUXA
Contributor

Huge thanks to @micmelesse for moving this forward!

azaidy added a commit that referenced this pull request Mar 27, 2026
* add fused_qknorm hip kernel (#2442)

* base impl

* improve big M

* optimize v2

* fallback split norm for bigger M

* fix rebase failed

* code format

* fix include error

* support outplace works for torch compile

* add output args

---------

Co-authored-by: Guanbao Yu <gyu@amd.com>

* feat: add fast gelu (#2220)

* add fast gelu

* fix ut

* refactor

* add more log

* change to empty to avoid perf downgrade

* make black lint happy

* make ruff happy

* fix import error

* Fix build CK pipeline (#2399)

* Fix renamed CK pipeline

* Fix CK version

* Rename eight_warps to eight_waves for consistency

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* [OPUS] gfx1250 support for opus wmma scale and moe_sorting kernel (#2449)

* [OPUS] Add gfx1250 support for opus moe_sorting kernel

Add gfx1250 (MI450) to the GPU architecture maps and fix the opus
moe_sorting kernel to use the correct wait counter instruction on
gfx12 ISA (s_wait_dscnt instead of s_waitcnt_lgkmcnt).

* [OPUS] Add WMMA scale instruction support for gfx1250

Add scaled WMMA dispatch (BX32 int / BX16 long) to struct wmma for:
- wmma_scale[16]_f32_16x16x128_f8f6f4 (fp8 fmt=0, fp4 fmt=4)
- wmma_scale[16]_f32_32x16x128_f4 (dedicated fp4)

Key changes in opus.hpp:
- Add fmt_a/fmt_b format codes and per-lane E8M0 scale operator()
  overloads with compile-time scale_sel (OPSEL) for lane group selection
- Fix vtype for packed fp4: use reg_bytes_a/b to compute correct
  hardware register size (elem * bits / 8 bytes, not elem * sizeof)
- Add zero-pad helper for fp4-via-f8f6f4 (i32x8 -> i32x16)
- Add scaled overloads to wmma_adaptor_swap_ab

Device tests (op_tests/opus/device/test_wmma_scale.cu):
- Raw warp-level: 6 variants (fp8/fp4 x BX32/BX16 + 32x16 fp4)
- Tiled MMA: 1x1, 2x2, 4x1 wave configurations via make_tiled_mma
- Per-lane random scale: E8M0 exponents in [122..133], bitwise exact
- All tests PASS on gfx1250, SKIP on gfx942/gfx950

* [OPUS] Fix black/ruff lint: rename ambiguous var `l` to `lane`

* reduce wasted get_module overhead for module with custom module name (#2455)

* fix moe gemm tuned config (#2463)

* fix

* add e=256 k=8 tuned config

* revert t=1024,2048 back to ck

* update

* update

* Skip `test_metadata_redirect.py` on archs other than `gfx942` (#2456)

This test was designed to run only on `gfx942`. We want to run all
Triton tests on `gfx950` as well - this is the reason behind the
proposed changes.
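
Arch gating of this kind is conventionally a pytest skip keyed on the device's gcnArchName; a sketch under that assumption (the helper below is hypothetical, not the repo's actual utility):

```python
import pytest
import torch

def _gfx_arch() -> str:
    # Hypothetical helper: returns e.g. "gfx942" under ROCm PyTorch.
    return torch.cuda.get_device_properties(0).gcnArchName.split(":")[0]

@pytest.mark.skipif(
    _gfx_arch() != "gfx942",
    reason="test_metadata_redirect was designed to run only on gfx942",
)
def test_metadata_redirect():
    ...
```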

* [Triton] Fix bench_mha (#2317)

* fix bench_mha

* fix bench_mha parsing logic of bench_attn_models

* fix bench_mla_decode parsing logic of bench_attn_models

---------

Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>

* test_mla_persistent.py split kv reference fix max_seq_q != 1 error (#2363)

* CI: fix dubious ownership for sglang checkout (#2477)

* CI: mark sglang checkout paths as safe directories

Add safe.directory entries before dependency installation for both host and container checkout paths to avoid git dubious ownership failures on /sglang-checkout.

* CI: align sglang safe.directory fix with upstream

Move safe.directory configuration to run right after container startup and only mark /sglang-checkout, matching upstream sglang handling for dubious ownership.

* CI: use pip editable install and safe.directory in runtime CI (#2474)

* Fix CKTile blockscale GEMM to read strides from tensor metadata (#2466)

The CKTile blockscale GEMM wrapper hardcoded leading-dimension strides
as stride_A = K, stride_B = K, stride_C = N, assuming fully contiguous
row-major layout. This produced silently wrong results when input
tensors had non-standard strides.

In vLLM on ROCm, _maybe_pad_fp8_weight pads FP8 weight tensors for
alignment and then creates a narrowed view, producing tensors whose
logical shape is [N, K] but whose physical stride is [K+pad, 1].
The hardcoded stride_B = K caused the kernel to read from wrong memory
offsets, leading to garbage output.

Fix: read leading-dimension strides from the PyTorch tensor metadata
(XQ.stride(0), WQ.stride(0), etc.) instead of assuming dense layout.
Add TORCH_CHECK assertions to verify inner-dimension contiguity
(stride == 1), which is required by the CKTile kernel.

The old CK backend (gemm_a8w8_blockscale_common.cuh) already reads
strides from tensor metadata, which is why it was unaffected.

Made-with: Cursor
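
The gist of the fix, expressed on the Python side for clarity (the actual change lives in the C++ wrapper; tensor names follow the commit message):

```python
import torch

def leading_strides(XQ: torch.Tensor, WQ: torch.Tensor, Y: torch.Tensor):
    # The CKTile kernel requires the inner dimension to be contiguous,
    # mirroring the TORCH_CHECK assertions described above.
    assert XQ.stride(-1) == 1 and WQ.stride(-1) == 1 and Y.stride(-1) == 1
    # Read leading-dimension strides from tensor metadata rather than
    # assuming a dense layout: a weight narrowed out of a padded buffer
    # has logical shape [N, K] but physical stride [K + pad, 1], so the
    # old hardcoded stride_B = K read from the wrong offsets.
    return XQ.stride(0), WQ.stride(0), Y.stride(0)
```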

* Add LSE output support for MLA decode qseqlen=1 persistent kernel (gf… (#2440)

* Add LSE output support for MLA decode qseqlen=1 persistent kernel (gfx950)

* Add qseqlen fold for MLA on gfx950: use qh64 kernel instead of qh16

* Add qseqlen fold for MLA on gfx950: use qh64 kernel instead of qh16

---------

Co-authored-by: root <root@gbt350-odcdh2-a11-1.png-odc.dcgpu>

* tuned qwen3.5 gemm (#2485)

Signed-off-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>

* [Triton] Flash Attention Triton Windows build support (#2433)

* Initial FA-2 Triton Windows build support

* Continue Work

lint

minimize diff

windows smoke test

address copilot

fixes

remove is_windows error

improve windows message

---------

Co-authored-by: 0xDELUXA <djernovevo@gmail.com>

* fix(gemm): add EVEN_MN heuristic to restore vectorized store in gemm (#2482)

* fix(gemm): add EVEN_MN heuristic to restore vectorized store in gemm_a16w16

The do_not_specialize=["M","N"] change (commit 3170a51) prevents kernel
recompilation when M/N change, but removes tt.divisibility=16 from M/N,
causing AxisInfo to lose contiguity through RemOp and CmpOp, which degrades
buffer_store from vectorized dwordx2 (vec=4) to scalar short (vec=1).

Introduce EVEN_MN constexpr heuristic that checks M%BLOCK_SIZE_M==0 and
N%BLOCK_SIZE_N==0 at compile time. When true, skip the modulo wrap on
offs_am/offs_bn and use unmasked tl.store, restoring contiguity for
vectorized memory operations without sacrificing the recompilation benefit.

Made-with: Cursor

* fix(gemm): add EVEN_MN heuristic to batched_gemm_a8w8 kernel

Same fix as gemm_a16w16: the do_not_specialize=["M","N"] removes
tt.divisibility=16, breaking vectorized store/load. Add EVEN_MN
constexpr heuristic to conditionally skip modulo and mask when
M%BLOCK_SIZE_M==0 and N%BLOCK_SIZE_N==0, restoring contiguity.

Made-with: Cursor

---------

Co-authored-by: jianlian <jianlian@amd.com>
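
In Triton terms, the EVEN_MN heuristic amounts to the store path below; a sketch of the pattern, not the repo's actual kernel:

```python
import triton
import triton.language as tl

@triton.jit
def _store_tile(c_ptr, acc, M, N, stride_cm, stride_cn,
                BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr,
                EVEN_MN: tl.constexpr):
    # Device-side helper for the epilogue of a GEMM kernel.
    offs_m = tl.program_id(0) * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
    offs_n = tl.program_id(1) * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
    ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    if EVEN_MN:
        # Compile-time guarantee that tiles never run off the edge:
        # no modulo wrap and no mask, so the compiler keeps contiguity
        # information and can emit vectorized buffer stores.
        tl.store(ptrs, acc)
    else:
        mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
        tl.store(ptrs, acc, mask=mask)
```

On the launch side, EVEN_MN would be computed as (M % BLOCK_SIZE_M == 0) and (N % BLOCK_SIZE_N == 0), so the kernel specializes on divisibility rather than on the raw M/N values, preserving the recompilation benefit of do_not_specialize.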

* rm gemm_common bind (#2425)

* rm gemm_common and quant type bind

* retune failed shape in a8w8_bpreshuffle_tuned_gemm.csv

* recover enum

* assert when found duplicated tuned shape (#2376)

* assert when found duplicated tuned shape

* rm duplicated tuned shape and update tuned file in model_configs

* fix lint

---------
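
The duplicate-shape assertion could be as simple as the following; a sketch assuming the tuned configs live in CSVs keyed by shape columns (the column names are hypothetical):

```python
import csv
from collections import Counter

def assert_no_duplicate_shapes(path: str) -> None:
    # Hypothetical column names; the real tuned CSVs may differ.
    with open(path, newline="") as f:
        shapes = [(r["M"], r["N"], r["K"]) for r in csv.DictReader(f)]
    dups = [s for s, n in Counter(shapes).items() if n > 1]
    assert not dups, f"duplicated tuned shapes in {path}: {dups}"
```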

Signed-off-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: ChenYou <youchen@amd.com>
Co-authored-by: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Elton <zhimding@amd.com>
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>
Co-authored-by: Michael Melesse <micmelesse@gmail.com>
Co-authored-by: minmengdie <memin@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: fangche123 <Fang.Che@amd.com>
Co-authored-by: root <root@gbt350-odcdh2-a11-1.png-odc.dcgpu>
Co-authored-by: Pleaplusone <ygan@amd.com>
Co-authored-by: 0xDELUXA <djernovevo@gmail.com>
Co-authored-by: jianhao <Jianhao.Liang@amd.com>
Co-authored-by: jianlian <jianlian@amd.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
