
Add --use-flash-attention flag.#7223

Merged
comfyanonymous merged 2 commits into Comfy-Org:master from FeepingCreature:master
Mar 14, 2025

@FeepingCreature
Contributor

FeepingCreature commented Mar 13, 2025

This is useful on AMD systems, as FA (FlashAttention) builds are still 10% faster than PyTorch cross-attention. Even without torch.compile, this can bring SDXL 1024x1024 from 5.5s to 5s.

(I did a bench over at https://www.reddit.com/r/buildapc/comments/1j2zfbv/9070_xt_vs_7900_xtx/mhkmv0x/ )

@FeepingCreature
Contributor Author

FeepingCreature commented Mar 13, 2025

(For testing, I recommend the @gel-crabs branch: pip install -U git+https://github.com/gel-crabs/flash-attention-gfx11@headdim512. Make sure to pip uninstall flash-attn first!)

@comfyanonymous
Member

One of the ruff failures is my fault, but can you fix the other one?

@FeepingCreature
Contributor Author

better?

comfyanonymous merged commit 7aceb9f into Comfy-Org:master on Mar 14, 2025
5 checks passed
meimeilook pushed a commit to meimeilook/ComfyUI that referenced this pull request Mar 14, 2025
@mcmonkey4eva
Contributor

JSYK: a user on torch 2.3 reports that this causes a total launch failure, even when the flag is not active.
[image]

@FeepingCreature
Contributor Author

Huh. Okay, I'll make that conditional.
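
One way "conditional" could look is sketched below. It assumes the launch failure comes from code that runs at import time whether or not the flag is set. The helper name `load_flash_attn_func` and its `enabled` and `module_name` parameters are hypothetical and not taken from the ComfyUI source; only the `flash_attn` package and its `flash_attn_func` export are real names.

```python
import importlib


def load_flash_attn_func(enabled: bool, module_name: str = "flash_attn"):
    """Return flash_attn_func if the flag is on and the package imports, else None.

    Hypothetical helper: nothing related to the optional dependency is
    touched unless the user explicitly asked for it.
    """
    if not enabled:
        # Flag is off: never import the optional dependency at all.
        return None
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Package is missing (or one of its own imports is): fall back.
        return None
    return getattr(module, "flash_attn_func", None)
```

A caller would check the result for `None` and fall back to the default attention path in that case. A broken native build can raise errors other than `ImportError`, so a real implementation might need a broader handler.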

@bigcat88
Contributor

bigcat88 commented Mar 15, 2025

> (For testing, I recommend the @gel-crabs branch: pip install -U git+https://github.com/gel-crabs/flash-attention-gfx11@headdim512. Make sure to pip uninstall flash-attn first!)

Just a note for people who will test it: you need to disable the iGPU if you have an AMD motherboard (and re-enable it after building flash-attention), or run

export HIP_VISIBLE_DEVICES=0
export ROCR_VISIBLE_DEVICES=0

before installing flash-attention-gfx11@headdim512

reference: vladmandic/sdnext#3515
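
For anyone driving the build from a script rather than a shell, the same two variables can be passed to the pip subprocess. This is a minimal sketch, assuming device index 0 is the discrete GPU as in the exports above; `hip_build_env` and `pip_install_command` are hypothetical helper names.

```python
import os
import sys


def hip_build_env(base_env=None, device_index="0"):
    """Return a copy of the environment with one GPU visible to ROCm."""
    env = dict(os.environ if base_env is None else base_env)
    # Same variables as the shell exports above.
    env["HIP_VISIBLE_DEVICES"] = device_index
    env["ROCR_VISIBLE_DEVICES"] = device_index
    return env


def pip_install_command(spec):
    """Build the pip command line for the current interpreter."""
    return [sys.executable, "-m", "pip", "install", "-U", spec]


# Usage (not executed here, since it would start a long build):
#   import subprocess
#   spec = "git+https://github.com/gel-crabs/flash-attention-gfx11@headdim512"
#   subprocess.run(pip_install_command(spec), env=hip_build_env(), check=True)
```

Because the variables are set only on the copied environment, the parent process and any later GPU work are unaffected.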

bigcat88 added a commit to Visionatrix/Visionatrix that referenced this pull request Mar 15, 2025
@githust66

> (For testing, I recommend the @gel-crabs branch: pip install -U git+https://github.com/gel-crabs/flash-attention-gfx11@headdim512. Make sure to pip uninstall flash-attn first!)

Building with pip install -U git+https://github.com/gel-crabs/flash-attention-gfx11@headdim512 fails on ROCm 6.4 with an error.

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 13, 2025

Can confirm...

      In file included from /tmp/pip-req-build-0mn1epxx/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_noncasual_gfx110x_hip.hip:25:
      In file included from /tmp/pip-req-build-0mn1epxx/csrc/flash_attn_rocm/src/flash_runner_hip.hpp:30:
      In file included from /tmp/pip-req-build-0mn1epxx/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker_hip.hpp:27:
      In file included from /tmp/pip-req-build-0mn1epxx/csrc/flash_attn_rocm/src/fwd_device_gemm_template_hip.hpp:27:
      In file included from /tmp/pip-req-build-0mn1epxx/csrc/flash_attn_rocm/src/device_gemm_trait_hip.hpp:45:
      In file included from /tmp/pip-req-build-0mn1epxx/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_query_attention_forward_wmma_hip.hpp:23:
      In file included from /tmp/pip-req-build-0mn1epxx/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_gemm_softmax_gemm_wmma_cshuffle_hip.hpp:12:
      In file included from /tmp/pip-req-build-0mn1epxx/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_selector_hip.hpp:9:
      In file included from /tmp/pip-req-build-0mn1epxx/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_v1_hip.hpp:8:
      /tmp/pip-req-build-0mn1epxx/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_xdlops_hip.hpp:870:32: error: no member named 'a_origin' in 'BlockwiseGemmXdlops_v2<BlockSize, FloatAB, FloatAcc, ATileDesc, BTileDesc, AMmaTileDesc, BMmaTileDesc, MPerBlock, NPerBlock, KPerBlock, MPerXDL, NPerXDL, MRepeat, NRepeat, KPack, TransposeC, AMmaKStride, BMmaKStride>'
        870 |         : a_thread_copy_(other.a_origin), b_thread_copy_(other.b_origin)
            |                          ~~~~~ ^
      /tmp/pip-req-build-0mn1epxx/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_xdlops_hip.hpp:870:64: error: no member named 'b_origin' in 'BlockwiseGemmXdlops_v2<BlockSize, FloatAB, FloatAcc, ATileDesc, BTileDesc, AMmaTileDesc, BMmaTileDesc, MPerBlock, NPerBlock, KPerBlock, MPerXDL, NPerXDL, MRepeat, NRepeat, KPack, TransposeC, AMmaKStride, BMmaKStride>'
        870 |         : a_thread_copy_(other.a_origin), b_thread_copy_(other.b_origin)
            |                                                          ~~~~~ ^

Wonder what AMD broke this time.

Lemme have a look.

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 13, 2025

Uh oh. Does anyone have flash_attn_rocm checked out? Howiejay's upstream branch for rocm/composable_kernel seems to be gone.

Ping @dejay-vu? The upstream CK commit in https://github.com/ROCm/flash-attention/tree/howiejay/navi_support can no longer be found.

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 13, 2025

Lucky: I found an old checkout with howiejay's branch.

I pushed a fork of howiejay's CK work and also cherry-picked the fix to the ROCm 6.4 issue. This seems to work:

pip install -U git+https://github.com/FeepingCreature/flash-attention-gfx11@gel-crabs-headdim512

Branch is at https://github.com/FeepingCreature/composable_kernel/tree/howiejayz/supports_all_arch

@githust66

> Lucky: I found an old checkout with howiejay's branch.
>
> I pushed a fork of howiejay's CK work and also cherry-picked the fix to the ROCm 6.4 issue. This seems to work:
>
> pip install -U git+https://github.com/FeepingCreature/flash-attention-gfx11@gel-crabs-headdim512
>
> Branch is at https://github.com/FeepingCreature/composable_kernel/tree/howiejayz/supports_all_arch

Thank you for your efforts. The package now installs and works normally. However, in my testing with torch 2.6 + ROCm 6.4, it is slower than --use-pytorch-cross-attention.

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 15, 2025

What sort of speeds are you seeing? On Pytorch nightly, I get 3.7it/s with Pytorch cross attention and 4it/s with Flash Attention on my 7900 XTX.

@githust66

githust66 commented Apr 15, 2025

> What sort of speeds are you seeing? On Pytorch nightly, I get 3.7it/s with Pytorch cross attention and 4it/s with Flash Attention on my 7900 XTX.

Is there a nightly version of PyTorch for ROCm 6.4 now? I used the PyTorch 2.6.0 + ROCm 6.4 build officially provided by AMD. On my 7900 XT, in the most basic Flux text-to-image workflow, I get 1.47s/it with PyTorch cross attention and 1.59s/it with Flash Attention; in the flux everything migration workflow, I get 4.12s/it with PyTorch cross attention and 5.22s/it with Flash Attention.
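
The figures quoted so far in this thread mix it/s (higher is better) and s/it (lower is better), which makes them easy to misread. The sketch below converts both to a relative speedup of Flash Attention over PyTorch cross attention. The function names are hypothetical, and the inputs are just the numbers reported above, measured on different cards, models, and software stacks, so they are not directly comparable with each other.

```python
def speedup_from_rate(baseline_it_per_s, candidate_it_per_s):
    """Relative speedup when both values are iterations per second."""
    return candidate_it_per_s / baseline_it_per_s - 1.0


def speedup_from_time(baseline_s_per_it, candidate_s_per_it):
    """Relative speedup when both values are seconds per iteration (or per image)."""
    return baseline_s_per_it / candidate_s_per_it - 1.0


# Baseline = PyTorch cross attention, candidate = Flash Attention.
print(f"{speedup_from_time(5.5, 5.0):+.1%}")    # SDXL 1024x1024, s/image: +10.0%
print(f"{speedup_from_rate(3.7, 4.0):+.1%}")    # 7900 XTX, it/s:          +8.1%
print(f"{speedup_from_time(1.47, 1.59):+.1%}")  # 7900 XT, Flux, s/it:     -7.5%
print(f"{speedup_from_time(4.12, 5.22):+.1%}")  # 7900 XT, Flux, s/it:     -21.1%
```

A positive value means Flash Attention was faster in that report; a negative value means it was slower.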

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 15, 2025

Huh. Try SDXL so we're comparing the same thing? I don't know how Flux works.

You can just use Pytorch nightly with ROCm 6.4, you don't have to use the "officially supported" ones. Usually it works.

@YarvixPA

Could I use this with an RTX 3060?

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 15, 2025

Probably! I don't know how much it'll do for you though.

(Note that you can't use the pip FlashAttention repos recommended here; those are AMD-only. But with NVIDIA you can install the upstream FlashAttention instead.)

@githust66

> Huh. Try SDXL so we're comparing the same thing? I don't know how Flux works.
>
> You can just use Pytorch nightly with ROCm 6.4, you don't have to use the "officially supported" ones. Usually it works.

Okay, I'll give it a try with the SDXL model later. To explain: in my previous environment, PyTorch 2.7 + ROCm 6.3, --use-flash-attention was faster than --use-pytorch-cross-attention. On ROCm 6.4, the order has reversed.

@githust66

> Huh. Try SDXL so we're comparing the same thing? I don't know how Flux works.
>
> You can just use Pytorch nightly with ROCm 6.4, you don't have to use the "officially supported" ones. Usually it works.

The PyTorch nightly build is only available for ROCm 6.3. I couldn't find one for ROCm 6.4.

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 16, 2025

You can just use the one from Pytorch.org: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3 (it still works on 6.4).

Odd, I didn't observe any performance change at all.

@FeepingCreature
Contributor Author

Jeez, yeah okay, I'm at 3000 MHz. No wonder.

@Hakim3i

Hakim3i commented Apr 20, 2025

> Jeez, yeah okay, I'm at 3000 MHz. No wonder.

Can you test whether there is a difference between flash-attn and cross-attention? Since I can't compile it, is it worth it?

Have you tried sage attention?

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 21, 2025

I reliably get 10% more on FA, but some people have reported differently.

SageAttn absolutely does not support AMD (they write manual shaders).

@Hakim3i

Hakim3i commented Apr 21, 2025

> I reliably get 10% more on FA, but some people have reported differently.
>
> SageAttn absolutely does not support AMD (they write manual shaders).

When compiling, I get these errors:


> (venv) kim@Asus:~/ComfyUI-cp312$ pip install -U git+https://github.com/FeepingCreature/flash-attention-gfx11@gel-crabs-headdim512
> Collecting git+https://github.com/FeepingCreature/flash-attention-gfx11@gel-crabs-headdim512
>   Cloning https://github.com/FeepingCreature/flash-attention-gfx11 (to revision gel-crabs-headdim512) to /tmp/pip-req-build-e29df3z_
>   Running command git clone --filter=blob:none --quiet https://github.com/FeepingCreature/flash-attention-gfx11 /tmp/pip-req-build-e29df3z_
>   Running command git checkout -b gel-crabs-headdim512 --track origin/gel-crabs-headdim512
>   Switched to a new branch 'gel-crabs-headdim512'
>   branch 'gel-crabs-headdim512' set up to track 'origin/gel-crabs-headdim512'.
>   Resolved https://github.com/FeepingCreature/flash-attention-gfx11 to commit 65ba5c649d112aeab5794c8cca76bc04b8c12f3c
>   Running command git submodule update --init --recursive -q
>   Preparing metadata (setup.py) ... done
> Requirement already satisfied: torch in ./venv/lib/python3.12/site-packages (from flash_attn==2.0.4) (2.6.0+rocm6.4.0.git2fb0ac2b)
> Requirement already satisfied: einops in ./venv/lib/python3.12/site-packages (from flash_attn==2.0.4) (0.8.1)
> Requirement already satisfied: packaging in ./venv/lib/python3.12/site-packages (from flash_attn==2.0.4) (25.0)
> Collecting ninja (from flash_attn==2.0.4)
>   Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB)
> Requirement already satisfied: filelock in ./venv/lib/python3.12/site-packages (from torch->flash_attn==2.0.4) (3.18.0)
> Requirement already satisfied: typing-extensions>=4.10.0 in ./venv/lib/python3.12/site-packages (from torch->flash_attn==2.0.4) (4.13.2)
> Requirement already satisfied: setuptools in ./venv/lib/python3.12/site-packages (from torch->flash_attn==2.0.4) (79.0.0)
> Requirement already satisfied: sympy==1.13.1 in ./venv/lib/python3.12/site-packages (from torch->flash_attn==2.0.4) (1.13.1)
> Requirement already satisfied: networkx in ./venv/lib/python3.12/site-packages (from torch->flash_attn==2.0.4) (3.4.2)
> Requirement already satisfied: jinja2 in ./venv/lib/python3.12/site-packages (from torch->flash_attn==2.0.4) (3.1.6)
> Requirement already satisfied: fsspec in ./venv/lib/python3.12/site-packages (from torch->flash_attn==2.0.4) (2025.3.2)
> Requirement already satisfied: pytorch-triton-rocm==3.2.0+rocm6.4.0.git6da9e660 in ./venv/lib/python3.12/site-packages (from torch->flash_attn==2.0.4) (3.2.0+rocm6.4.0.git6da9e660)
> Requirement already satisfied: mpmath<1.4,>=1.1.0 in ./venv/lib/python3.12/site-packages (from sympy==1.13.1->torch->flash_attn==2.0.4) (1.3.0)
> Requirement already satisfied: MarkupSafe>=2.0 in ./venv/lib/python3.12/site-packages (from jinja2->torch->flash_attn==2.0.4) (3.0.2)
> Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB)
> Building wheels for collected packages: flash_attn
>   Building wheel for flash_attn (setup.py) ... error
>   error: subprocess-exited-with-error
>   
>   × python setup.py bdist_wheel did not run successfully.
>   │ exit code: 1
>   ╰─> [600 lines of output]
>       
>       
>       torch.__version__  = 2.6.0+rocm6.4.0.git2fb0ac2b
>       
>       
>       RTZ IS USED
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/ck.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/ck.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/device_prop.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/device_prop.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include/ck/library/utility/device_memory.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include/ck/library/utility/device_memory_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/gemm_specialization.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/gemm_specialization.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/tensor_specialization.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/tensor_specialization.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/integral_constant.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/integral_constant.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/enable_if.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/enable_if.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/type.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/type.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/number.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/number.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/math.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/math.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/sequence.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/sequence.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional2.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional2.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/tuple.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/tuple.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/statically_indexed_array.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/statically_indexed_array.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/data_type.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/data_type.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/math_v2.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/math_v2.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/get_id.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/get_id.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/f8_utils.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/f8_utils.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/random_gen.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/random_gen.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/type_convert.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/type_convert.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/binary_element_wise_operation.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/binary_element_wise_operation.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/quantization_operation.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/quantization_operation.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/element_wise_operation.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/element_wise_operation.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/utils.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/utils_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/params.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/params_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/array.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/array.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/sequence_helper.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/sequence_helper.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional4.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional4.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/tuple_helper.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/tuple_helper.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/container_element_picker.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/container_element_picker.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/container_helper.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/container_helper.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/array_multi_index.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/array_multi_index_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/statically_indexed_array_multi_index.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/statically_indexed_array_multi_index_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/multi_index.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/multi_index_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional3.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional3_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/ignore.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/ignore.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/magic_division.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/magic_division.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/c_style_pointer_cast.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/c_style_pointer_cast.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/is_known_at_compile_time.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/is_known_at_compile_time.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/transpose_vectors.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/transpose_vectors.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/inner_product.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/inner_product.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/thread_group.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/thread_group.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/debug.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/debug_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_wave_read_first_lane.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_wave_read_first_lane.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/generic_memory_space_atomic.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/generic_memory_space_atomic.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/synchronization.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/synchronization_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_address_space.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_address_space.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/static_buffer.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/static_buffer.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/dynamic_buffer.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/dynamic_buffer.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_inline_asm.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_inline_asm.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_xdlops.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_xdlops.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/philox_rand.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/philox_rand.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/multi_index_transform.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/multi_index_transform_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/multi_index_transform_helper.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/multi_index_transform_helper_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_helper.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_helper_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/tensor_layout.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/tensor_layout.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/stream_config.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/stream_config.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_base.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_base.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/masking_specialization.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/masking_specialization.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_grouped_gemm_softmax_gemm_permute.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_grouped_gemm_softmax_gemm_permute.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/matrix_padder.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/matrix_padder_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_adaptor.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_adaptor_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/block_to_ctile_map.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/block_to_ctile_map_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_space_filling_curve.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_space_filling_curve_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/warp/xdlops_gemm_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_xdlops.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_xdlops_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_v1.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_v1_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_v2.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_v2_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_selector.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_selector_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/cluster_descriptor.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/cluster_descriptor_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor/static_tensor.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor/static_tensor.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v3r1.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v3r1_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/thread_group_tensor_slice_transfer_v4r1.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/thread_group_tensor_slice_transfer_v4r1_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v6r1.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v6r1_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/thread_group_tensor_slice_transfer_v6r1.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/thread_group_tensor_slice_transfer_v6r1_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_enums.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_enums.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_common.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_common.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_operator.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_operator.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_functions_accumulate.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_functions_accumulate.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/get_shift.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/get_shift.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/reduction_functions_blockwise.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/reduction_functions_blockwise_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/reduction_functions_threadwise.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/reduction_functions_threadwise.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_softmax.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_softmax_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_dropout.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_dropout.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_fwd_xdl_cshuffle_v2.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_fwd_xdl_cshuffle_v2_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/operator_transform/transform_contraction_to_gemm.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/operator_transform/transform_contraction_to_gemm_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/hip_check_error.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/hip_check_error.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/kernel_launch.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/kernel_launch_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_fwd_xdl_cshuffle_v2.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_fwd_xdl_cshuffle_v2_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_light_v1.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_light_v1_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_ydotygrad.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_ydotygrad_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_bwd_xdl_cshuffle_qloop_light_v1.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_bwd_xdl_cshuffle_qloop_light_v1_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_light_v2.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_light_v2_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_bwd_xdl_cshuffle_qloop_light_v2.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_bwd_xdl_cshuffle_qloop_light_v2_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_batched_gemm_softmax_gemm_permute.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_batched_gemm_softmax_gemm_permute.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_fwd_xdl_cshuffle_v2.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_fwd_xdl_cshuffle_v2_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_bwd_xdl_cshuffle_qloop_light_v1.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_bwd_xdl_cshuffle_qloop_light_v1_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_bwd_xdl_cshuffle_qloop_light_v2.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_bwd_xdl_cshuffle_qloop_light_v2_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_wmma.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_wmma.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/warp/wmma_gemm.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/warp/wmma_gemm_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_wmma.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_wmma_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_gemm_softmax_gemm_wmma_cshuffle.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_gemm_softmax_gemm_wmma_cshuffle_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/operator_transform/transform_contraction_to_gemm_arraybase.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/operator_transform/transform_contraction_to_gemm_arraybase_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_query_attention_forward_wmma.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_query_attention_forward_wmma_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_multi_query_attention_forward_wmma.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_multi_query_attention_forward_wmma_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_gemm_trait.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_gemm_trait_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/bwd_device_gemm_template.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/bwd_device_gemm_template_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/bwd_device_gemm_invoker.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/bwd_device_gemm_invoker_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_template.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_template_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/static_switch.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/static_switch.hpp [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_runner.hpp -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_runner_hip.hpp [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/flash_api.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/flash_api_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_noncasual_gfx110x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_noncasual_gfx110x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_casual_gfx110x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_casual_gfx110x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_casual_gfx110x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_casual_gfx110x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_casual_gfx110x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_casual_gfx110x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_casual_gfx110x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_casual_gfx110x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_casual_gfx110x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_casual_gfx110x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_noncasual_gfx110x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_noncasual_gfx110x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_casual_gfx110x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_casual_gfx110x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_noncasual_gfx110x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_noncasual_gfx110x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_memory.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_memory_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_noncasual_gfx110x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_noncasual_gfx110x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_memory_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_memory_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_noncasual_gfx110x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_noncasual_gfx110x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_casual_gfx110x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_casual_gfx110x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_noncasual_gfx110x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_noncasual_gfx110x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_casual_gfx110x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_casual_gfx110x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_noncausal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_noncasual_gfx110x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_noncasual_gfx110x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_causal_gfx9x.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_noncasual_gfx110x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_noncasual_gfx110x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip -> /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
>       Successfully preprocessed all matching files.
>       Total number of unsupported CUDA function calls: 0
>       
>       
>       Total number of replaced kernel launches: 10
>       /home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/dist.py:759: SetuptoolsDeprecationWarning: License classifiers are deprecated.
>       !!
>       
>               ********************************************************************************
>               Please consider removing the following classifiers in favor of a SPDX license expression:
>       
>               License :: OSI Approved :: BSD License
>       
>               See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
>               ********************************************************************************
>       
>       !!
>         self._finalize_license_expression()
>       running bdist_wheel
>       running build
>       running build_py
>       creating build/lib.linux-x86_64-cpython-312/flash_attn
>       copying flash_attn/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-312/flash_attn
>       copying flash_attn/bert_padding.py -> build/lib.linux-x86_64-cpython-312/flash_attn
>       copying flash_attn/flash_blocksparse_attn_interface.py -> build/lib.linux-x86_64-cpython-312/flash_attn
>       copying flash_attn/fused_softmax.py -> build/lib.linux-x86_64-cpython-312/flash_attn
>       copying flash_attn/flash_blocksparse_attention.py -> build/lib.linux-x86_64-cpython-312/flash_attn
>       copying flash_attn/flash_attn_triton_og.py -> build/lib.linux-x86_64-cpython-312/flash_attn
>       copying flash_attn/flash_attn_triton.py -> build/lib.linux-x86_64-cpython-312/flash_attn
>       copying flash_attn/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn
>       creating build/lib.linux-x86_64-cpython-312/flash_attn/losses
>       copying flash_attn/losses/cross_entropy.py -> build/lib.linux-x86_64-cpython-312/flash_attn/losses
>       copying flash_attn/losses/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/losses
>       creating build/lib.linux-x86_64-cpython-312/flash_attn/modules
>       copying flash_attn/modules/embedding.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
>       copying flash_attn/modules/mha.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
>       copying flash_attn/modules/block.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
>       copying flash_attn/modules/mlp.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
>       copying flash_attn/modules/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
>       creating build/lib.linux-x86_64-cpython-312/flash_attn/layers
>       copying flash_attn/layers/rotary.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers
>       copying flash_attn/layers/patch_embed.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers
>       copying flash_attn/layers/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers
>       creating build/lib.linux-x86_64-cpython-312/flash_attn/ops
>       copying flash_attn/ops/rms_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
>       copying flash_attn/ops/layer_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
>       copying flash_attn/ops/activations.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
>       copying flash_attn/ops/fused_dense.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
>       copying flash_attn/ops/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
>       creating build/lib.linux-x86_64-cpython-312/flash_attn/utils
>       copying flash_attn/utils/distributed.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
>       copying flash_attn/utils/pretrained.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
>       copying flash_attn/utils/benchmark.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
>       copying flash_attn/utils/generation.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
>       copying flash_attn/utils/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
>       creating build/lib.linux-x86_64-cpython-312/flash_attn/models
>       copying flash_attn/models/falcon.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
>       copying flash_attn/models/gptj.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
>       copying flash_attn/models/bert.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
>       copying flash_attn/models/gpt_neox.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
>       copying flash_attn/models/llama.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
>       copying flash_attn/models/vit.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
>       copying flash_attn/models/gpt.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
>       copying flash_attn/models/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
>       copying flash_attn/models/opt.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
>       running build_ext
>       building 'flash_attn_2_cuda' extension
>       creating /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm
>       creating /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src
>       Emitting ninja build file /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/build.ninja...
>       Compiling objects...
>       Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
>       [1/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [2/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [3/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [4/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [5/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [6/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [7/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [8/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [9/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [10/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [11/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [12/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [13/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [14/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [15/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [16/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [17/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_memory_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/device_memory_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [18/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [19/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [20/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [21/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [22/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [23/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [24/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [25/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [26/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [27/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [28/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [29/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [30/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [31/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [32/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [33/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [34/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [35/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [36/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [37/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [38/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [39/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [40/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [41/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [42/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [43/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [44/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [45/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [46/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [47/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [48/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [49/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [50/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/flash_api_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/flash_api_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       FAILED: /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/flash_api_hip.o
>       /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/flash_api_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/flash_api_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/flash_api_hip.hip:14:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_runner_hip.hpp:30:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_template_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_gemm_trait_hip.hpp:45:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_query_attention_forward_wmma_hip.hpp:17:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_hip.hpp:7:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header_hip.hpp:37:
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:32:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          32 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:47:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          47 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       2 errors generated when compiling for gfx1032.
>       failed to execute:/opt/rocm-6.4.0/lib/llvm/bin/clang++  --offload-arch=native  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c -x hip /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/flash_api_hip.hip -o "/tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/flash_api_hip.o" -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [51/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_noncasual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_noncasual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       FAILED: /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_noncasual_gfx110x_hip.o
>       /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_noncasual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_noncasual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_noncasual_gfx110x_hip.hip:25:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_runner_hip.hpp:30:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_template_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_gemm_trait_hip.hpp:45:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_query_attention_forward_wmma_hip.hpp:17:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_hip.hpp:7:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header_hip.hpp:37:
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:32:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          32 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:47:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          47 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       2 errors generated when compiling for gfx1032.
>       failed to execute:/opt/rocm-6.4.0/lib/llvm/bin/clang++  --offload-arch=native  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c -x hip /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_noncasual_gfx110x_hip.hip -o "/tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_noncasual_gfx110x_hip.o" -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [52/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_noncasual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_noncasual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       FAILED: /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_noncasual_gfx110x_hip.o
>       /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_noncasual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_noncasual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_noncasual_gfx110x_hip.hip:25:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_runner_hip.hpp:30:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_template_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_gemm_trait_hip.hpp:45:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_query_attention_forward_wmma_hip.hpp:17:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_hip.hpp:7:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header_hip.hpp:37:
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:32:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          32 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:47:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          47 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       2 errors generated when compiling for gfx1032.
>       failed to execute:/opt/rocm-6.4.0/lib/llvm/bin/clang++  --offload-arch=native  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c -x hip /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_noncasual_gfx110x_hip.hip -o "/tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_noncasual_gfx110x_hip.o" -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [53/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_casual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_casual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       FAILED: /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_casual_gfx110x_hip.o
>       /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_casual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_casual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_casual_gfx110x_hip.hip:25:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_runner_hip.hpp:30:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_template_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_gemm_trait_hip.hpp:45:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_query_attention_forward_wmma_hip.hpp:17:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_hip.hpp:7:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header_hip.hpp:37:
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:32:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          32 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:47:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          47 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       2 errors generated when compiling for gfx1032.
>       failed to execute:/opt/rocm-6.4.0/lib/llvm/bin/clang++  --offload-arch=native  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c -x hip /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_casual_gfx110x_hip.hip -o "/tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_casual_gfx110x_hip.o" -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [54/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_casual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_casual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       FAILED: /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_casual_gfx110x_hip.o
>       /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_casual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_casual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_casual_gfx110x_hip.hip:25:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_runner_hip.hpp:30:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_template_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_gemm_trait_hip.hpp:45:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_query_attention_forward_wmma_hip.hpp:17:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_hip.hpp:7:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header_hip.hpp:37:
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:32:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          32 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:47:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          47 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       2 errors generated when compiling for gfx1032.
>       failed to execute:/opt/rocm-6.4.0/lib/llvm/bin/clang++  --offload-arch=native  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c -x hip /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_casual_gfx110x_hip.hip -o "/tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_bf16_casual_gfx110x_hip.o" -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [55/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_casual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_casual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       FAILED: /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_casual_gfx110x_hip.o
>       /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_casual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_casual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_casual_gfx110x_hip.hip:25:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_runner_hip.hpp:30:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_template_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_gemm_trait_hip.hpp:45:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_query_attention_forward_wmma_hip.hpp:17:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_hip.hpp:7:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header_hip.hpp:37:
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:32:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          32 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:47:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          47 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       2 errors generated when compiling for gfx1032.
>       failed to execute:/opt/rocm-6.4.0/lib/llvm/bin/clang++  --offload-arch=native  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c -x hip /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_casual_gfx110x_hip.hip -o "/tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_fp16_casual_gfx110x_hip.o" -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [56/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_noncasual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_noncasual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       FAILED: /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_noncasual_gfx110x_hip.o
>       /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_noncasual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_noncasual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_noncasual_gfx110x_hip.hip:25:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_runner_hip.hpp:30:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_template_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_gemm_trait_hip.hpp:45:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_query_attention_forward_wmma_hip.hpp:17:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_hip.hpp:7:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header_hip.hpp:37:
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:32:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          32 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:47:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          47 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       2 errors generated when compiling for gfx1032.
>       failed to execute:/opt/rocm-6.4.0/lib/llvm/bin/clang++  --offload-arch=native  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c -x hip /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_noncasual_gfx110x_hip.hip -o "/tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_noncasual_gfx110x_hip.o" -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [57/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_noncasual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_noncasual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       FAILED: /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_noncasual_gfx110x_hip.o
>       /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_noncasual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_noncasual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_noncasual_gfx110x_hip.hip:25:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_runner_hip.hpp:30:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_template_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_gemm_trait_hip.hpp:45:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_query_attention_forward_wmma_hip.hpp:17:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_hip.hpp:7:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header_hip.hpp:37:
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:32:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          32 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:47:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          47 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       2 errors generated when compiling for gfx1032.
>       failed to execute:/opt/rocm-6.4.0/lib/llvm/bin/clang++  --offload-arch=native  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c -x hip /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_noncasual_gfx110x_hip.hip -o "/tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_gqa_fp16_noncasual_gfx110x_hip.o" -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       [58/58] /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_casual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_casual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       FAILED: /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_casual_gfx110x_hip.o
>       /opt/rocm-6.4.0/bin/hipcc  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_casual_gfx110x_hip.hip -o /tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_casual_gfx110x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_casual_gfx110x_hip.hip:25:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_runner_hip.hpp:30:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/fwd_device_gemm_template_hip.hpp:27:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/device_gemm_trait_hip.hpp:45:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_query_attention_forward_wmma_hip.hpp:17:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_hip.hpp:7:
>       In file included from /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header_hip.hpp:37:
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:32:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          32 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:47:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
>          47 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
>             |                                                ^
>       2 errors generated when compiling for gfx1032.
>       failed to execute:/opt/rocm-6.4.0/lib/llvm/bin/clang++  --offload-arch=native  -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/composable_kernel/library/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/TH -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THC -I/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.4.0/include -I/home/kim/ComfyUI-cp312/venv/include -I/usr/include/python3.12 -c -c -x hip /tmp/pip-req-build-e29df3z_/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_casual_gfx110x_hip.hip -o "/tmp/pip-req-build-e29df3z_/build/temp.linux-x86_64-cpython-312/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_mqa_bf16_casual_gfx110x_hip.o" -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -O3 -std=c++17 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -D__WMMA__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
>       ninja: build stopped: subcommand failed.
>       Traceback (most recent call last):
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2215, in _run_ninja_build
>           subprocess.run(
>         File "/usr/lib/python3.12/subprocess.py", line 571, in run
>           raise CalledProcessError(retcode, process.args,
>       subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
>       
>       The above exception was the direct cause of the following exception:
>       
>       Traceback (most recent call last):
>         File "<string>", line 2, in <module>
>         File "<pip-setuptools-caller>", line 34, in <module>
>         File "/tmp/pip-req-build-e29df3z_/setup.py", line 380, in <module>
>           setup(
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/__init__.py", line 117, in setup
>           return distutils.core.setup(**attrs)
>                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/_distutils/core.py", line 186, in setup
>           return run_commands(dist)
>                  ^^^^^^^^^^^^^^^^^^
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/_distutils/core.py", line 202, in run_commands
>           dist.run_commands()
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1002, in run_commands
>           self.run_command(cmd)
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/dist.py", line 1104, in run_command
>           super().run_command(command)
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
>           cmd_obj.run()
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/command/bdist_wheel.py", line 370, in run
>           self.run_command("build")
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
>           self.distribution.run_command(command)
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/dist.py", line 1104, in run_command
>           super().run_command(command)
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
>           cmd_obj.run()
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/_distutils/command/build.py", line 135, in run
>           self.run_command(cmd_name)
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
>           self.distribution.run_command(command)
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/dist.py", line 1104, in run_command
>           super().run_command(command)
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
>           cmd_obj.run()
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/command/build_ext.py", line 99, in run
>           _build_ext.run(self)
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 368, in run
>           self.build_extensions()
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 906, in build_extensions
>           build_ext.build_extensions(self)
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 484, in build_extensions
>           self._build_extensions_serial()
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 510, in _build_extensions_serial
>           self.build_extension(ext)
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/command/build_ext.py", line 264, in build_extension
>           _build_ext.build_extension(self, ext)
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/Cython/Distutils/build_ext.py", line 135, in build_extension
>           super(build_ext, self).build_extension(ext)
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 565, in build_extension
>           objects = self.compiler.compile(
>                     ^^^^^^^^^^^^^^^^^^^^^^
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 719, in unix_wrap_ninja_compile
>           _write_ninja_file_and_compile_objects(
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1875, in _write_ninja_file_and_compile_objects
>           _run_ninja_build(
>         File "/home/kim/ComfyUI-cp312/venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2231, in _run_ninja_build
>           raise RuntimeError(message) from e
>       RuntimeError: Error compiling objects for extension
>       [end of output]
>   
>   note: This error originates from a subprocess, and is likely not a problem with pip.
>   ERROR: Failed building wheel for flash_attn
>   Running setup.py clean for flash_attn
> Failed to build flash_attn
> ERROR: Failed to build installable wheels for some pyproject.toml based projects (flash_attn)
> (venv) kim@Asus:~/ComfyUI-cp312$ 

@githust66

My 7900 XT running SDXL only reaches 1 it/s, while your 7900 XTX reaches up to 5 it/s, which is five times faster than mine. There shouldn't be such a big gap, should there?

@Hakim3i

Hakim3i commented Apr 21, 2025

WSL2 Ubuntu 24.04 LTS and HIP 6.2.4
The first two results are without OC, the second is with OC.
I'll let the screenshot speak for itself.

image

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 21, 2025

@Hakim3i Isn't that 1.5, not SDXL? It's loading SD1ClipModel.

Re your build error:

  2 errors generated when compiling for gfx1032.

That's a 6600XT I think? I think they forgot to handle that specific GPU, lol. I can try to patch it but no guarantee.
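
The failing target can be read straight from the build output. Below is a minimal sketch, assuming the pip log has been saved as text. The helper names `failed_arches` and `unexpected_arches` are hypothetical, and the `gfx110` prefix is an assumption inferred from the `gfx110x` kernel file names in the log, not something flash-attn or pip provides:

```python
import re

# Matches compiler summary lines such as
# "2 errors generated when compiling for gfx1032."
ARCH_ERROR = re.compile(r"errors? generated when compiling for (gfx[0-9a-f]+)")


def failed_arches(log_text: str) -> set[str]:
    """Return every GPU architecture the compiler reported errors for."""
    return set(ARCH_ERROR.findall(log_text))


def unexpected_arches(log_text: str, expected_prefix: str = "gfx110") -> set[str]:
    """Return failing architectures outside the expected family.

    expected_prefix is an assumption taken from the gfx110x kernel file
    names in the log; adjust it for other builds.
    """
    return {a for a in failed_arches(log_text) if not a.startswith(expected_prefix)}


sample = "2 errors generated when compiling for gfx1032.\n"
print(unexpected_arches(sample))  # {'gfx1032'}
```

A non-empty result suggests that `--offload-arch=native` picked up a GPU other than the one the kernels were written for.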

@githust66 Do you mean 9070 XT? Yeah, AMD is really, really bad at providing timely support for their own consumer hardware. It took two years for the 7900 XTX to get supported. It'll probably get better, maybe.

@FeepingCreature
Contributor Author

@Hakim3i Try to build again now.

@Hakim3i

Hakim3i commented Apr 21, 2025

> @Hakim3i Try to build again now.

Sorry, silly me: I have two GPUs and forgot to hide the second one when building flash-attn. My integrated GPU won't display on Ubuntu, and I haven't found a fix.
Success under WSL2. It is a different SDXL model, but that doesn't matter because the same steps were used to test Ubuntu vs. WSL, and the difference between them is close to 1 it/s. I'm now downloading SDXL from Hugging Face so the test is as accurate as possible, with no overclocking.
These are the results with sd_xl_base_1.0.safetensors and the basic ComfyUI workflow at 1024x1024.

Workflow:
image

--use-pytorch-cross-attention
image

--use-flash-attention
image

Bonus flash-attn with OC:
image
image

Btw, when I built flash-attn, these two custom nodes stopped importing, and I had thought they were just broken under Linux:
0.1 seconds (IMPORT FAILED): /home/kim/ComfyUI/custom_nodes/ComfyUI-Easy-Use
0.8 seconds (IMPORT FAILED): /home/kim/ComfyUI/custom_nodes/was-node-suite-comfyui
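
One way to narrow down import failures like these is to check which modules Python can locate, without actually importing them. Below is a minimal sketch using only the standard library. The helper name `locate` is hypothetical, and the module names passed in are assumptions for this setup: `flash_attn` is the package that was just built, and `rotary_emb` is the extension module named in the traceback in this comment.

```python
import importlib.util


def locate(module_names):
    """Map each top-level module name to True if Python can find it.

    find_spec only looks the module up; for top-level names it does not
    execute the module, so a broken package cannot crash this check.
    """
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


# Module names are assumptions for this setup, not a fixed list.
for name, found in locate(["flash_attn", "rotary_emb"]).items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If `flash_attn` is found while `rotary_emb` is missing, that would be consistent with the `ModuleNotFoundError` in the traceback.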

Loading: ComfyUI-Impact-Pack (V8.11)

[Impact Pack] Wildcards loading done.
Traceback (most recent call last):
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 1967, in _get_module
return importlib.import_module("." + module_name, self.__name__)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 995, in exec_module
File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/models/blip/modeling_blip.py", line 29, in <module>
from ...modeling_utils import PreTrainedModel
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 62, in <module>
from .integrations.flash_attention import flash_attention_forward
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/integrations/flash_attention.py", line 5, in <module>
from ..modeling_flash_attention_utils import _flash_attention_forward, flash_attn_supports_top_left_mask
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 38, in <module>
from flash_attn.layers.rotary import apply_rotary_emb # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/flash_attn/layers/rotary.py", line 10, in <module>
import rotary_emb
ModuleNotFoundError: No module named 'rotary_emb'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/kim/ComfyUI/nodes.py", line 2128, in load_custom_node
module_spec.loader.exec_module(module)
File "", line 995, in exec_module
File "", line 488, in _call_with_frames_removed
File "/home/kim/ComfyUI/custom_nodes/was-node-suite-comfyui/init.py", line 1, in
from .WAS_Node_Suite import NODE_CLASS_MAPPINGS
File "/home/kim/ComfyUI/custom_nodes/was-node-suite-comfyui/WAS_Node_Suite.py", line 2415, in
from transformers import BlipProcessor, BlipForConditionalGeneration, BlipForQuestionAnswering
File "", line 1412, in _handle_fromlist
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 1956, in getattr
value = getattr(module, name)
^^^^^^^^^^^^^^^^^^^^^
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 1955, in getattr
module = self._get_module(self._class_to_module[name])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 1969, in _get_module
raise RuntimeError(
RuntimeError: Failed to import transformers.models.blip.modeling_blip because of the following error (look up to see its traceback):
No module named 'rotary_emb'

Cannot import /home/kim/ComfyUI/custom_nodes/was-node-suite-comfyui module for custom nodes: Failed to import transformers.models.blip.modeling_blip because of the following error (look up to see its traceback):
No module named 'rotary_emb'
Total VRAM 24490 MB, total RAM 60274 MB
pytorch version: 2.6.0+rocm6.4.0.git2fb0ac2b
AMD arch: gfx1100
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7900 XTX : native
Traceback (most recent call last):
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 1967, in _get_module
return importlib.import_module("." + module_name, self.name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/importlib/init.py", line 90, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 1387, in _gcd_import
File "", line 1360, in _find_and_load
File "", line 1331, in _find_and_load_unlocked
File "", line 935, in _load_unlocked
File "", line 995, in exec_module
File "", line 488, in _call_with_frames_removed
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 62, in
from .integrations.flash_attention import flash_attention_forward
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/integrations/flash_attention.py", line 5, in
from ..modeling_flash_attention_utils import _flash_attention_forward, flash_attn_supports_top_left_mask
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 38, in
from flash_attn.layers.rotary import apply_rotary_emb # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/flash_attn/layers/rotary.py", line 10, in
import rotary_emb
ModuleNotFoundError: No module named 'rotary_emb'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 820, in _get_module
return importlib.import_module("." + module_name, self.name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/importlib/init.py", line 90, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 1387, in _gcd_import
File "", line 1360, in _find_and_load
File "", line 1331, in _find_and_load_unlocked
File "", line 935, in _load_unlocked
File "", line 995, in exec_module
File "", line 488, in _call_with_frames_removed
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/diffusers/loaders/peft.py", line 38, in
from .lora_base import _fetch_state_dict, _func_optionally_disable_offloading
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/diffusers/loaders/lora_base.py", line 51, in
from transformers import PreTrainedModel
File "", line 1412, in _handle_fromlist
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 1955, in getattr
module = self._get_module(self._class_to_module[name])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 1969, in _get_module
raise RuntimeError(
RuntimeError: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
No module named 'rotary_emb'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/kim/ComfyUI/nodes.py", line 2128, in load_custom_node
module_spec.loader.exec_module(module)
File "", line 995, in exec_module
File "", line 488, in _call_with_frames_removed
File "/home/kim/ComfyUI/custom_nodes/ComfyUI-Easy-Use/init.py", line 20, in
imported_module = importlib.import_module(".py.nodes.{}".format(module_name), name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/importlib/init.py", line 90, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 1387, in _gcd_import
File "", line 1360, in _find_and_load
File "", line 1331, in _find_and_load_unlocked
File "", line 935, in _load_unlocked
File "", line 995, in exec_module
File "", line 488, in _call_with_frames_removed
File "/home/kim/ComfyUI/custom_nodes/ComfyUI-Easy-Use/py/nodes/preSampling.py", line 7, in
from ..modules.layer_diffuse import LayerMethod
File "/home/kim/ComfyUI/custom_nodes/ComfyUI-Easy-Use/py/modules/layer_diffuse/init.py", line 12, in
from .model import ModelPatcher, TransparentVAEDecoder, calculate_weight_adjust_channel
File "/home/kim/ComfyUI/custom_nodes/ComfyUI-Easy-Use/py/modules/layer_diffuse/model.py", line 23, in
from diffusers.models.unets.unet_2d_blocks import UNetMidBlock2D, get_down_block, get_up_block
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/diffusers/models/unets/init.py", line 6, in
from .unet_2d import UNet2DModel
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/diffusers/models/unets/unet_2d.py", line 24, in
from .unet_2d_blocks import UNetMidBlock2D, get_down_block, get_up_block
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 36, in
from ..transformers.dual_transformer_2d import DualTransformer2DModel
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/diffusers/models/transformers/init.py", line 6, in
from .cogvideox_transformer_3d import CogVideoXTransformer3DModel
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 22, in
from ...loaders import PeftAdapterMixin
File "", line 1412, in _handle_fromlist
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 810, in getattr
module = self._get_module(self._class_to_module[name])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 822, in _get_module
raise RuntimeError(
RuntimeError: Failed to import diffusers.loaders.peft because of the following error (look up to see its traceback):
Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
No module named 'rotary_emb'

Cannot import /home/kim/ComfyUI/custom_nodes/ComfyUI-Easy-Use module for custom nodes: Failed to import diffusers.loaders.peft because of the following error (look up to see its traceback):
Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
No module named 'rotary_emb'
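
Both failures above bottom out in the same `import rotary_emb` inside `flash_attn/layers/rotary.py`. A minimal sketch for checking which of these modules the active venv can find, without actually importing them (the helper name `missing_modules` is made up for illustration):

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-initialised modules.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Module names taken from the traceback above.
    print(missing_modules(["flash_attn", "rotary_emb"]))
```

Run it with the venv's own interpreter; if `rotary_emb` is listed, the flash-attn build in that venv does not ship the compiled rotary extension.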

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 21, 2025

Huh. Neat! Your FA perf seems to be about where mine normally is, so it seems the 5it/s pytorch cross attention is the odd one out. Could you try with a fresh Comfy install? Asking for a fresh Pytorch install seems excessive.

Wait I can't read.

Yeah yours seems just sort of 10% faster than mine. I wonder if it's a GPU power limit thing, maybe the Windows drivers are more aggressive. Or just a better gpu?

@Hakim3i

Hakim3i commented Apr 21, 2025

Huh. Neat! Your FA perf seems to be about where mine normally is, so it seems the 5it/s pytorch cross attention is the odd one out. Could you try with a fresh Comfy install? Asking for a fresh Pytorch install seems excessive.

Wait I can't read.

Yeah yours seems just sort of 10% faster than mine. I wonder if it's a GPU power limit thing, maybe the Windows drivers are more aggressive. Or just a better gpu?

It might be the GPU drivers, I don't know for sure. I get more it/s under WSL than under Ubuntu, but the problem with WSL is that it's a VM, and when I run WAN2.1 it just doesn't work: it uses all 64 GB of my RAM.

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 21, 2025

Can you check what your TDP limit is set to? Mine is at 339W by default. Monitoring software claims it's power limited.

@Hakim3i

Hakim3i commented Apr 21, 2025

Can you check what your TDP limit is set to? Mine is at 339W by default. Monitoring software claims it's power limited.

This is with OC, hovering around 460W with some spikes:
image

This is without OC, hovering around 400W with some spikes:
image

@Hakim3i

Hakim3i commented Apr 21, 2025

I want to run the flash-attn test under Ubuntu, but I don't want to break my ComfyUI workflow, and it looks like I'd need to hack the custom node?
I cannot install rotary_emb.
image

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 21, 2025

I can't recall having needed to install that manually. I don't know what's going on with that error.

I think 339W PPT Sustained is the same setting I have... so we have the same value there. That doesn't explain it.

Try just making a separate ComfyUI folder? Then you can run with only default nodes.

@Hakim3i

Hakim3i commented Apr 21, 2025

File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 38, in
from flash_attn.layers.rotary import apply_rotary_emb # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kim/ComfyUI/venv/lib/python3.12/site-packages/flash_attn/layers/rotary.py", line 10, in
import rotary_emb
ModuleNotFoundError: No module named 'rotary_emb'

Yep, I can make a copy, but as you can see I get more it/s with flash-attn, so I might just use it.

@FeepingCreature
Contributor Author

I mean - yay? Mission accomplished? :)

@Hakim3i

Hakim3i commented Apr 21, 2025

I mean - yay? Mission accomplished? :)

50% of the workflows on the internet use WAS Node Suite, so I kinda need to get it working.

@FeepingCreature
Contributor Author

Yeah I really don't know what's going on with that, maybe back flash_attn out for now.

@Hakim3i

Hakim3i commented Apr 21, 2025

Yeah I really don't know what's going on with that, maybe back flash_attn out for now.

For ComfyUI-Easy-Use: pip install diffusers==0.25.1
And for WAS Node Suite I needed to stub out the BLIP classes in the source code:

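# Stand-ins for the transformers BLIP model classes, which fail to import here.
class BlipForConditionalGeneration:
    @classmethod
    def from_pretrained(cls, model_id, cache_dir=None):
        # Return a stub instance (not None) so the chained .to(...) call in BlipWrapper works.
        return cls()

    def to(self, device):
        return self

    def eval(self):
        return self

    def generate(self, **inputs):
        # No real model behind the stub, so there is nothing to generate.
        return None


class BlipForQuestionAnswering:
    @classmethod
    def from_pretrained(cls, model_id, cache_dir=None):
        # Return a stub instance (not None) so the chained .to(...) call in BlipWrapper works.
        return cls()

    def to(self, device):
        return self

    def eval(self):
        return self

    def generate(self, **inputs):
        # No real model behind the stub, so there is nothing to generate.
        return None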


class BlipWrapper:
    def __init__(self, caption_model_id="Salesforce/blip-image-captioning-base", vqa_model_id="Salesforce/blip-vqa-base", device="cuda", cache_dir=None):
        self.device = "cpu"  # Simulating that it's always using CPU
        self.caption_processor = BlipProcessor.from_pretrained(caption_model_id, cache_dir=cache_dir)
        self.caption_model = BlipForConditionalGeneration.from_pretrained(caption_model_id, cache_dir=cache_dir).to(self.device)
        self.vqa_processor = BlipProcessor.from_pretrained(vqa_model_id, cache_dir=cache_dir)
        self.vqa_model = BlipForQuestionAnswering.from_pretrained(vqa_model_id, cache_dir=cache_dir).to(self.device)

    def generate_caption(self, image, min_length=50, max_length=100, num_beams=5, no_repeat_ngram_size=2, early_stopping=False):
        # Return None to simulate no output
        return None

    def answer_question(self, image, question, min_length=50, max_length=100, num_beams=5, no_repeat_ngram_size=2, early_stopping=False):
        # Return None to simulate no output
        return None

But my WAN workflow no longer works, and I'm also getting a Florence issue:
cannot import name 'apply_rotary_emb' from 'flash_attn.layers.rotary' (/home/kim/ComfyUI/venv/lib/python3.12/site-packages/flash_attn/layers/rotary.py)

And an error on WAN:
miopenStatusUnknownError

But the basic workflow still works, so this flash-attn build is just hit and miss.

@FeepingCreature
Contributor Author

Okay I've updated to 25.04 and now I also get the rotary_emb issue, lol.

I guess upstream switched to triton-based, I'mma go see if I can steal that commit.

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 21, 2025

Pushed a fix maybe? Issue has gone away locally, no idea if the upstream code actually works. It looks pretty generic.

@Hakim3i

Hakim3i commented Apr 21, 2025

Pushed a fix maybe? Issue has gone away locally, no idea if the upstream code actually works. It looks pretty generic.

So I just need to recompile?

@FeepingCreature
Contributor Author

FeepingCreature commented Apr 21, 2025

Yep, rerun the pip. I'm just pushing to that branch.

(For others: pip install -U git+https://github.com/FeepingCreature/flash-attention-gfx11@gel-crabs-headdim512)

edit: Woah, nice. After the update to Ubuntu 25.04 I'm seeing FA+torch.compile at 5.04it/s.

@Hakim3i

Hakim3i commented Apr 21, 2025

Yep, rerun the pip. I'm just pushing to that branch.

(For others: pip install -U git+https://github.com/FeepingCreature/flash-attention-gfx11@gel-crabs-headdim512)

edit: Woah, nice. After the update to Ubuntu 25.04 I'm seeing FA+torch.compile at 5.04it/s.

That's great.
I copied my ComfyUI ubuntu folder but I get this error:

kim@Asus:~/Comfyui-cp312-flashattn$ source venv/bin/activate
(venv) kim@Asus:~/Comfyui-cp312-flashattn$ pip install -U git+https://github.com/FeepingCreature/flash-attention-gfx11@gel-crabs-headdim512
Collecting git+https://github.com/FeepingCreature/flash-attention-gfx11@gel-crabs-headdim512
  Cloning https://github.com/FeepingCreature/flash-attention-gfx11 (to revision gel-crabs-headdim512) to /tmp/pip-req-build-ggs9hsd_
  Running command git clone --filter=blob:none --quiet https://github.com/FeepingCreature/flash-attention-gfx11 /tmp/pip-req-build-ggs9hsd_
  Running command git checkout -b gel-crabs-headdim512 --track origin/gel-crabs-headdim512
  Switched to a new branch 'gel-crabs-headdim512'
  branch 'gel-crabs-headdim512' set up to track 'origin/gel-crabs-headdim512'.
  Resolved https://github.com/FeepingCreature/flash-attention-gfx11 to commit 82755f4665437578e4be8518aad870f2b737ef3b
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [20 lines of output]
      Traceback (most recent call last):
        File "/home/kim/Comfyui-cp312-dev/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/home/kim/Comfyui-cp312-dev/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/kim/Comfyui-cp312-dev/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-sxed4g6n/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 331, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=[])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-sxed4g6n/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 301, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-sxed4g6n/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 512, in run_setup
          super().run_setup(setup_script=setup_script)
        File "/tmp/pip-build-env-sxed4g6n/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 317, in run_setup
          exec(code, locals())
        File "<string>", line 16, in <module>
      ModuleNotFoundError: No module named 'torch'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
(venv) kim@Asus:~/Comfyui-cp312-flashattn$

image

@FeepingCreature
Contributor Author

I... can not tell you what is up with that, that error makes no sense.

@Hakim3i

Hakim3i commented Apr 21, 2025

I... can not tell you what is up with that, that error makes no sense.

I have torch from the AMD repo, if that clears things up?
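
A likely explanation, going by the paths in the log: the traceback runs inside /tmp/pip-build-env-sxed4g6n, pip's isolated build environment. That environment contains only the declared build dependencies, not the torch installed in the venv, so an `import torch` at the top of setup.py fails there. A minimal sketch of a pre-flight check plus a non-isolated install command, assuming the repository is already cloned locally (the function names are hypothetical; `--no-build-isolation` is a standard pip option):

```python
import importlib.util
import sys


def torch_visible():
    """True if torch can be found by the interpreter that will run the build."""
    return importlib.util.find_spec("torch") is not None


def install_command(source_dir):
    """Build the pip command that installs source_dir without build isolation."""
    return [sys.executable, "-m", "pip", "install", source_dir, "--no-build-isolation"]


if __name__ == "__main__":
    print("torch visible:", torch_visible())
    # "." assumes the current directory is the flash-attention checkout.
    print(" ".join(install_command(".")))
```

The printed command can be run as-is or handed to `subprocess.run(..., check=True)`. With isolation disabled, the build sees whatever torch the venv already has, including a ROCm build.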

@Hakim3i

Hakim3i commented Apr 22, 2025

I... can not tell you what is up with that, that error makes no sense.

I was able to use flashattn like this:

git clone -b gel-crabs-headdim512 https://github.com/FeepingCreature/flash-attention-gfx11.git
cd flash-attention-gfx11
git submodule update --init --recursive
source /home/kim/Comfyui-cp312-dev/venv/bin/activate
pip install ninja cmake
pip install . --no-build-isolation
python setup.py install

It works and I'm getting 5.10 it/s with OC.
But I get these warnings:

(venv) kim@Asus:~/Comfyui-cp312-flashattn/flash-attention-gfx11$ pip list | grep flash
DEPRECATION: Loading egg at /home/kim/Comfyui-cp312-dev/venv/lib/python3.12/site-packages/flash_attn-2.0.4-py3.12-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
flash_attn                 2.0.4
flash_attn                 2.0.4
(venv) kim@Asus:~/Comfyui-cp312-flashattn/flash-attention-gfx11$ import flash_attn
print(flash_attn.__file__)
Command 'import' not found, but can be installed with:
sudo apt install graphicsmagick-imagemagick-compat  # version 1.4+really1.3.42-1, or
sudo apt install imagemagick-6.q16                  # version 8:6.9.11.60+dfsg-1.6ubuntu1
sudo apt install imagemagick-6.q16hdri              # version 8:6.9.11.60+dfsg-1.6ubuntu1
bash: syntax error near unexpected token `flash_attn.__file__'
(venv) kim@Asus:~/Comfyui-cp312-flashattn/flash-attention-gfx11$ ls /home/kim/Comfyui-cp312-dev/venv/lib/python3.12/site-packages | grep flash
flash_attn-2.0.4-py3.12-linux-x86_64.egg
(venv) kim@Asus:~/Comfyui-cp312-flashattn/flash-attention-gfx11$ cd ..
(venv) kim@Asus:~/Comfyui-cp312-flashattn$ ./run.sh
[START] Security scan
DEPRECATION: Loading egg at /home/kim/Comfyui-cp312-dev/venv/lib/python3.12/site-packages/flash_attn-2.0.4-py3.12-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330

And holy, it is much faster: I can make a 3-second 480p video in 160 seconds, whereas with --use-pytorch-cross-attention it takes about 230 seconds.

Edit: I asked ChatGPT and it delivered. Here's what I did:

export BUILD_TARGET=rocm
export FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE
python setup.py clean --all
python setup.py bdist_wheel

and I got the .whl package. Then:
pip install dist/flash_attn-*.whl

@FeepingCreature
Contributor Author

Nice, congrats on the performance! That warning is harmless, don't worry about it.
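
For anyone who wants to see where the warning comes from: it points at the .egg entry in site-packages, most likely left behind by `python setup.py install`. A minimal sketch that lists such entries and reports where a module resolves from, which is what the `import flash_attn` lines typed into bash above were meant to do (helper names are hypothetical):

```python
import importlib.util
import site
from pathlib import Path


def find_eggs(site_dir, prefix):
    """List .egg entries in site_dir whose name starts with prefix."""
    root = Path(site_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir()
        if p.name.startswith(prefix) and p.name.endswith(".egg")
    )


def module_origin(name):
    """Return the file a top-level module would load from, or None if absent."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    for site_dir in site.getsitepackages():
        print(site_dir, find_eggs(site_dir, "flash_attn"))
    print(module_origin("flash_attn"))
```

Run it with the venv's Python rather than pasting the statements into the shell.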

@jnolck

jnolck commented Feb 1, 2026

There have been two updates on this Triton branch: Dao-AILab/flash-attention#2217 and Dao-AILab/flash-attention#2178. That one's already in; the other is still being reviewed. They cover Flash Attention 3 and Infinity Cache usage. Can you take a look? It's probably super easy for you to integrate that here.
