[ROCm Windows] fix build failed #2519

Merged
tridao merged 2 commits into Dao-AILab:main from Apophis3158:main
May 6, 2026

@Apophis3158
Contributor

Apophis3158 commented Apr 29, 2026

I encountered some issues while trying to build fa2 on ROCm Windows:

CK backend:

In the link phase, the command line exceeds the Windows maximum length of 32,767 characters: pypa/distutils#406

Fix: write each object file path on its own line in a response (.rsp) file and pass that file to the linker.

The build uses hipcc.exe from the ROCm SDK and its LLVM clang.exe, but no linker is specified, so it defaults to MSVC link.exe, which caused some issues (even with the MSVC environment activated).

Fix: switch to LLVM's lld-link.exe.

After these two changes:

[...]
DEBUG [2669/2669] H:\ROCm\.venv\Lib\site-packages\_rocm_sdk_devel\bin\hipcc.exe  -std=c++20 -Xcompiler -D__HIP_PLATFORM_AMD__=1 -Xcompiler -DUSE_ROCM=1 -Xcompiler -DHIPBLAS_V2 -Xcompiler -fms-runtime-lib=dll -IH:\ROCm\flash-attention\csrc\composable_kernel\include -IH:\ROCm\flash-attention\csrc\composable_kernel\library\include -IH:\ROCm\flash-attention\csrc\composable_kernel\example\ck_tile\01_fmha -IH:\ROCm\.venv\Lib\site-packages\torch\include -IH:\ROCm\.venv\Lib\site-packages\torch\include\torch\csrc\api\include -IH:\ROCm\.venv\Lib\site-packages\torch\include\THH -IH:\ROCm\.venv\Lib\site-packages\_rocm_sdk_devel\include -IH:\ROCm\.venv\include -IC:\Users\Administrator\AppData\Roaming\uv\python\cpython-3.13-windows-x86_64-none\include -IC:\Users\Administrator\AppData\Roaming\uv\python\cpython-3.13-windows-x86_64-none\Include -c H:\ROCm\flash-attention\csrc\flash_attn_ck\mha_varlen_fwd.hip -o H:\ROCm\flash-attention\build\temp.win-amd64-cpython-313\Release\csrc\flash_attn_ck\mha_varlen_fwd.obj -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -fms-extensions -Wno-ignored-attributes -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fms-runtime-lib=dll --offload-arch=gfx1201 -O3 -std=c++20 -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-shift-count-overflow -Wno-deprecated-declarations -Wno-pass-failed -Wno-inconsistent-dllimport -Wno-cuda-compat -fbracket-depth=1024 -DCK_TILE_FMHA_FWD_FAST_EXP2=1 -fgpu-flush-denormals-to-zero -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_USE_XDL -DUSE_PROF_API=1 -D__HIP_PLATFORM_HCC__=1 -D_CRT_SECURE_NO_WARNINGS -DCK_TILE_FLOAT_TO_BFLOAT16_DEFAULT=3 -mllvm --lsr-drop-solution=1 -fno-offload-uniform-block -mllvm -enable-post-misched=0 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -DHIPIFY_V2 -DTORCH_API_INCLUDE_EXTENSION_H 
-DTORCH_EXTENSION_NAME=flash_attn_2_cuda -fno-gpu-rdc
DEBUG Link RSP: build\lib.win-amd64-cpython-313\flash_attn_2_cuda.cp313-win_amd64.pyd.rsp (2669 objects, cmd reduced from 592201 to 929 chars)
DEBUG H:\ROCm\.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe @build\lib.win-amd64-cpython-313\flash_attn_2_cuda.cp313-win_amd64.pyd.rsp /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:H:\ROCm\.venv\Lib\site-packages\torch\lib /LIBPATH:H:\ROCm\.venv\Lib\site-packages\_rocm_sdk_devel\lib /LIBPATH:H:\ROCm\.venv\Lib\site-packages\_rocm_sdk_devel\hip\lib /LIBPATH:H:\ROCm\.venv\libs /LIBPATH:C:\Users\Administrator\AppData\Roaming\uv\python\cpython-3.13-windows-x86_64-none\libs /LIBPATH:C:\Users\Administrator\AppData\Roaming\uv\python\cpython-3.13-windows-x86_64-none /LIBPATH:H:\ROCm\.venv\PCbuild\amd64 c10.lib torch.lib torch_cpu.lib torch_python.lib amdhip64.lib c10_hip.lib torch_hip.lib /OUT:build\lib.win-amd64-cpython-313\flash_attn_2_cuda.cp313-win_amd64.pyd /IMPLIB:H:\ROCm\flash-attention\build\temp.win-amd64-cpython-313\Release\build\flash_attn_2_cuda.cp313-win_amd64.lib
DEBUG installing to build\bdist.win-amd64\wheel
DEBUG running install
[...]
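
The response-file fix described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `build_link_command`, its parameters, and the UTF-8 encoding of the .rsp file are assumptions made for the example. It keeps the normal command when it is short enough and otherwise moves the object paths into a file that the linker reads via `@file`.

```python
import subprocess
from pathlib import Path

# Windows caps a process command line at 32,767 characters.
WINDOWS_CMD_LIMIT = 32767


def build_link_command(linker, objects, other_args, rsp_path):
    """Return the linker argv, moving object paths into a response file if needed.

    Hypothetical helper for illustration; not the function used in the PR.
    """
    full_cmd = [linker, *objects, *other_args]
    if len(subprocess.list2cmdline(full_cmd)) < WINDOWS_CMD_LIMIT:
        return full_cmd  # short enough to pass directly

    # One object path per line; list2cmdline quotes paths that contain spaces.
    lines = [subprocess.list2cmdline([obj]) for obj in objects]
    # Assumption: the linker accepts a UTF-8 encoded response file.
    Path(rsp_path).write_text("\n".join(lines) + "\n", encoding="utf-8")

    # MSVC-style linkers expand "@file" into the arguments listed in that file.
    return [linker, f"@{rsp_path}", *other_args]
```

With 2669 objects, as in the log above, only the short `@...rsp` argument and the remaining linker flags stay on the command line, which is consistent with the reported reduction from 592201 to 929 characters.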

Triton backend:

The default requirement is triton==3.5.1, but aiter does not provide it on Windows yet.

Fix: use triton-windows instead.

@Apophis3158
Contributor Author

Apophis3158 commented Apr 29, 2026

After getting rid of MSVC link.exe, building only requires a few commands:

.\.venv\Scripts\activate.ps1

$ROCM_ROOT = rocm-sdk path --root
$env:ROCM_HOME = $ROCM_ROOT
$env:PATH = "$ROCM_ROOT\lib\llvm\bin;$ROCM_ROOT\bin;$env:PATH"

$env:DISTUTILS_USE_SDK = "1"

uv pip install --no-build-isolation -v .

I don't know whether it's a defect of the ROCm SDK that the link step uses MSVC's linker rather than LLVM's.
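
The PATH ordering in the commands above puts the SDK's LLVM tools ahead of everything else. A build script could resolve the linker the same way. The sketch below is only an illustration under assumptions: `find_lld_link` and its parameters are hypothetical names, and the `lib/llvm/bin` layout is taken from the paths shown in this thread rather than from any documented guarantee.

```python
import os
import shutil
from pathlib import Path


def find_lld_link(rocm_root, fallback_path=None):
    """Look for lld-link in the ROCm SDK's LLVM bin directory, then on PATH.

    Hypothetical helper; returns the resolved path or None if nothing is found.
    """
    sdk_llvm_bin = Path(rocm_root) / "lib" / "llvm" / "bin"
    if fallback_path is None:
        fallback_path = os.environ.get("PATH", "")
    # SDK directory first, then the remaining non-empty PATH entries.
    entries = [str(sdk_llvm_bin)]
    entries += [p for p in fallback_path.split(os.pathsep) if p]
    # On Windows, shutil.which applies PATHEXT, so "lld-link" matches lld-link.exe.
    return shutil.which("lld-link", path=os.pathsep.join(entries))
```

Calling `find_lld_link(os.environ["ROCM_HOME"])` after running the commands above would be one way to check which linker a build is likely to pick up.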

@jammm
Contributor

jammm commented Apr 29, 2026

Can you try building again? #2517 was merged, which should fix your issue.

@jammm
Contributor

jammm commented Apr 29, 2026

> And I don't know if it's a defect of the ROCm SDK that the link tool using MSVC's rather than LLVM's.

It's more to do with setuptools, which is why pypa/distutils#406 is trying to fix it. Perhaps the lack of link.exe in your PATH led setuptools to find an alternative linker. But it still needs MSVC headers to do builds with, so I feel it's not really a long-term fix.

@crashingalexsan

Can confirm building works for me and CK backend works (tested on GFX1201)

@jammm
Contributor

jammm commented Apr 30, 2026 via email

@crashingalexsan

Yeah... sorry. I tested main with the #2517 patch. We should be able to close this PR.

@astrelsky

> > And I don't know if it's a defect of the ROCm SDK that the link tool using MSVC's rather than LLVM's.
>
> it's more to do with setuptools, which is why pypa/distutils#406 is trying to fix it. Perhaps the lack of link.exe in your PATH led setuptools to find an alternative linker. But it still needs MSVC headers to do builds with, so it's not really a long-term fix I feel.

To be honest, it shouldn't fix anything at all, because the problem is a Windows limitation. I suspect the use of lld or some other linker caused something different to happen elsewhere that worked around the problem.

@Apophis3158
Contributor Author

I'm quite sure AMD's intention was to build CK with LLVM, but since MSVC's linker works fine, it doesn't matter. The linking issue I encountered may also be related to uv, but rather than struggling with a linker, it is better to see how to make ccache/sccache run.

With $env:PYTORCH_NVCC = "sccache.exe $ROCM_ROOT\bin\hipcc.exe" set, I got this:

  clang: error: cannot find libdevice for sm_52; provide path to different CUDA installation via '--cuda-path', or pass '-nocudalib' to build without linking with libdevice
  clang: error: cannot find CUDA installation; provide its path via '--cuda-path', or pass '-nocudainc' to build without CUDA includes

@Apophis3158
Contributor Author

@jammm This PR now only contains the fix for the Triton backend issue. Can you take a look?

@jammm
Contributor

jammm commented May 6, 2026

lgtm!

@jammm
Contributor

jammm commented May 6, 2026

@tridao can we merge this too please? This is for Windows support with triton-windows.

@tridao
Member

tridao commented May 6, 2026

Is there a version of triton-windows we should pin?

@Apophis3158
Contributor Author

Not needed for now; aiter uses the latest version to run its smoke test.

@jammm
Contributor

jammm commented May 6, 2026

I would pin it to >=3.6.0. I'm not sure if 3.5.x has proper AMD support.

@tridao
Member

tridao commented May 6, 2026

Let's pin it to >=3.6.0, then we'll merge.

@Apophis3158
Contributor Author

Pinned triton-windows>=3.6.0
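
For readers following along, a platform-dependent requirement of this kind could be expressed roughly as below. This is a hedged sketch, not the actual change in this PR: the function name `triton_requirement` is hypothetical, and only the two version specifiers come from the discussion above.

```python
import sys


def triton_requirement(platform=sys.platform):
    """Pick the Triton requirement string for the given platform.

    Hypothetical helper; the real setup script may structure this differently.
    """
    if platform == "win32":
        # Windows: use the triton-windows package, pinned as agreed in this thread.
        return "triton-windows>=3.6.0"
    # Other platforms keep the default requirement mentioned in the PR description.
    return "triton==3.5.1"
```

The resulting string would typically be placed in the package's list of install requirements.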

@tridao tridao merged commit 495ef79 into Dao-AILab:main May 6, 2026