Adds platform-specific warp mask to frag_layout_swizzle.cuh #7

Merged
diptorupd merged 1 commit into amd-integration from update/arch-specific-warp-mask on Oct 2, 2025
Conversation

@diptorupd
Collaborator

The mask values for the `__shfl_xor_sync` calls in the original CUDA version of `frag_layout_swizzle.cuh` were designed for 32-thread warps. That design made the header incompatible with CDNA3, which has 64-thread warps. This PR adds a platform-specific `WARP_FULL_MASK` constant to support both 32-thread and 64-thread warps.

Copilot AI review requested due to automatic review settings October 2, 2025 19:19
@diptorupd diptorupd self-assigned this Oct 2, 2025

Copilot AI left a comment

Pull Request Overview

This PR adds platform-specific warp mask constants to support both CUDA (32-thread warps) and HIP/CDNA3 (64-thread warps) architectures. The change replaces hardcoded 32-bit masks with a platform-conditional WARP_FULL_MASK constant.

  • Introduces platform-specific WARP_FULL_MASK constant (32-bit for CUDA, 64-bit for HIP)
  • Replaces hardcoded 0xffffffff masks in __shfl_xor_sync calls with the new constant
  • Enables compatibility with CDNA3's 64-thread warps while maintaining CUDA support


@diptorupd diptorupd requested a review from demandal25 October 2, 2025 19:26
@diptorupd diptorupd merged commit 78b665e into amd-integration Oct 2, 2025
1 check passed
@diptorupd diptorupd deleted the update/arch-specific-warp-mask branch October 2, 2025 19:32
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
This PR removes the `libtorch` dependency and deletes `test_page.cpp`, the only unit test that used libtorch. A pytest already covers page testing, and we will use that for validation.

Removing the libtorch dependency speeds up Docker builds and trims additional dependencies.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
diptorupd added a commit that referenced this pull request Dec 5, 2025
The mask values for the `__shfl_xor_sync` calls in the original CUDA version of `frag_layout_swizzle.cuh` were designed for 32-thread warps. That design made the header incompatible with CDNA3, which has 64-thread warps. This PR adds a platform-specific `WARP_FULL_MASK` constant to support both 32-thread and 64-thread warps.
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
This PR removes the `libtorch` dependency and deletes `test_page.cpp`, the only unit test that used libtorch. A pytest already covers page testing, and we will use that for validation.

Removing the libtorch dependency speeds up Docker builds and trims additional dependencies.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
The mask values for the `__shfl_xor_sync` calls in the original CUDA version of `frag_layout_swizzle.cuh` were designed for 32-thread warps. That design made the header incompatible with CDNA3, which has 64-thread warps. This PR adds a platform-specific `WARP_FULL_MASK` constant to support both 32-thread and 64-thread warps.
diptorupd pushed a commit that referenced this pull request Feb 2, 2026
This PR removes the `libtorch` dependency and deletes `test_page.cpp`, the only unit test that used libtorch. A pytest already covers page testing, and we will use that for validation.

Removing the libtorch dependency speeds up Docker builds and trims additional dependencies.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
diptorupd added a commit that referenced this pull request Feb 2, 2026
The mask values for the `__shfl_xor_sync` calls in the original CUDA version of `frag_layout_swizzle.cuh` were designed for 32-thread warps. That design made the header incompatible with CDNA3, which has 64-thread warps. This PR adds a platform-specific `WARP_FULL_MASK` constant to support both 32-thread and 64-thread warps.

3 participants