Adds platform-specific warp mask to frag_layout_swizzle.cuh #7

Merged
diptorupd merged 1 commit into amd-integration from update/arch-specific-warp-mask on Oct 2, 2025
Conversation

@diptorupd
Collaborator

The mask values for the `__shfl_xor_sync` calls in the original CUDA version of `frag_layout_swizzle.cuh` were designed for 32-thread warps. That design made the header incompatible with CDNA3, which has 64-thread warps. This PR adds a platform-specific `WARP_FULL_MASK` constant to support both 32-thread and 64-thread warps.

Copilot AI review requested due to automatic review settings October 2, 2025 19:19
@diptorupd diptorupd self-assigned this Oct 2, 2025

Copilot AI left a comment

Pull Request Overview

This PR adds platform-specific warp mask constants to support both CUDA (32-thread warps) and HIP/CDNA3 (64-thread warps) architectures. The change replaces hardcoded 32-bit masks with a platform-conditional WARP_FULL_MASK constant.

  • Introduces platform-specific WARP_FULL_MASK constant (32-bit for CUDA, 64-bit for HIP)
  • Replaces hardcoded 0xffffffff masks in __shfl_xor_sync calls with the new constant
  • Enables compatibility with CDNA3's 64-thread warps while maintaining CUDA support


@diptorupd diptorupd requested a review from demandal25 October 2, 2025 19:26
@diptorupd diptorupd merged commit 78b665e into amd-integration Oct 2, 2025
1 check passed
@diptorupd diptorupd deleted the update/arch-specific-warp-mask branch October 2, 2025 19:32
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
This PR removes the `libtorch` dependency and deletes `test_page.cpp`, the only unit test that used libtorch. A pytest already covers page testing, and we will use that for validation.

Removing the libtorch dependency speeds up Docker builds and trims additional dependencies.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
diptorupd added a commit that referenced this pull request Dec 5, 2025
The mask values for the `__shfl_xor_sync` calls in the original CUDA version of `frag_layout_swizzle.cuh` were designed for 32-thread warps. That design made the header incompatible with CDNA3, which has 64-thread warps. This PR adds a platform-specific `WARP_FULL_MASK` constant to support both 32-thread and 64-thread warps.
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
This PR removes the `libtorch` dependency and deletes `test_page.cpp`, the only unit test that used libtorch. A pytest already covers page testing, and we will use that for validation.

Removing the libtorch dependency speeds up Docker builds and trims additional dependencies.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
The mask values for the `__shfl_xor_sync` calls in the original CUDA version of `frag_layout_swizzle.cuh` were designed for 32-thread warps. That design made the header incompatible with CDNA3, which has 64-thread warps. This PR adds a platform-specific `WARP_FULL_MASK` constant to support both 32-thread and 64-thread warps.
diptorupd pushed a commit that referenced this pull request Feb 2, 2026
This PR removes the `libtorch` dependency and deletes `test_page.cpp`, the only unit test that used libtorch. A pytest already covers page testing, and we will use that for validation.

Removing the libtorch dependency speeds up Docker builds and trims additional dependencies.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
diptorupd added a commit that referenced this pull request Feb 2, 2026
The mask values for the `__shfl_xor_sync` calls in the original CUDA version of `frag_layout_swizzle.cuh` were designed for 32-thread warps. That design made the header incompatible with CDNA3, which has 64-thread warps. This PR adds a platform-specific `WARP_FULL_MASK` constant to support both 32-thread and 64-thread warps.

3 participants