Feature/layout transform A mat to B mat #5

Merged: diptorupd merged 10 commits into amd-integration from feature/layout_transform_Amat_to_B_mat on Oct 1, 2025

@diptorupd
Collaborator

Added a new transpose_inter_quad_fragments function that permutes MMA matrix fragments in registers across specific thread quads. The function is required to transpose an MMA tile from A layout to B layout and vice versa.

Renamed transpose_4x4_half_registers to transpose_intra_quad_fragments.

Added unit tests.
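
For readers unfamiliar with the intra/inter split, here is a host-side sketch of one way the two names can be read: a 16x16 tile is transposed by first transposing every 4x4 block in place and then swapping blocks across the diagonal. This is only a CPU model under stated assumptions. The tile size, the block size, and every function name below are hypothetical, and the code does not reproduce the register layout or the cross-lane exchanges used in mma_hip.h.

```cpp
#include <array>
#include <cassert>  // for callers that want to assert on the check below
#include <utility>

constexpr int N = 16;  // tile edge (assumed 16x16 MMA tile)
constexpr int Q = 4;   // block edge (assumed 4x4 blocks)
using Tile = std::array<std::array<int, N>, N>;

// Hypothetical "intra" step: transpose each 4x4 block where it sits.
void transpose_within_blocks(Tile& t) {
  for (int br = 0; br < N; br += Q)
    for (int bc = 0; bc < N; bc += Q)
      for (int i = 0; i < Q; ++i)
        for (int j = i + 1; j < Q; ++j)
          std::swap(t[br + i][bc + j], t[br + j][bc + i]);
}

// Hypothetical "inter" step: exchange block (r, c) with block (c, r).
void swap_blocks_across_diagonal(Tile& t) {
  for (int br = 0; br < N; br += Q)
    for (int bc = br + Q; bc < N; bc += Q)
      for (int i = 0; i < Q; ++i)
        for (int j = 0; j < Q; ++j)
          std::swap(t[br + i][bc + j], t[bc + i][br + j]);
}

// Fill a tile with distinct values so a misplaced element is detectable.
Tile make_tile() {
  Tile t{};
  for (int r = 0; r < N; ++r)
    for (int c = 0; c < N; ++c) t[r][c] = r * N + c;
  return t;
}

// Returns true when the two steps together equal a plain transpose.
bool two_step_equals_transpose() {
  const Tile ref = make_tile();
  Tile t = ref;
  transpose_within_blocks(t);
  swap_blocks_across_diagonal(t);
  for (int r = 0; r < N; ++r)
    for (int c = 0; c < N; ++c)
      if (t[r][c] != ref[c][r]) return false;
  return true;
}
```

Calling two_step_equals_transpose() compares the two-step result element by element against a direct transpose. The two steps commute, so the order chosen here is arbitrary.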

Copilot AI left a comment

Pull Request Overview

This PR introduces a new transpose_inter_quad_fragments function that enables transforming MMA matrix fragments between A and B layouts in registers, along with renaming transpose_4x4_half_registers to transpose_intra_quad_fragments for clarity.

  • Added transpose_inter_quad_fragments function for inter-quad register permutation
  • Renamed transpose_4x4_half_registers to transpose_intra_quad_fragments and updated function references
  • Added comprehensive unit tests for layout transformations and matrix loading patterns

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| test_mfma_fp32_16x16x16fp16.cpp | Updated to use new function names and simplified matrix indexing |
| test_layout_transform.cpp | New comprehensive test file for layout transformation functionality |
| mma_ops.hpp | Updated API to use new function names and corrected template constraints |
| mma_hip.h | Added new transpose function, renamed existing one, and improved documentation |
| mma_debug_utils_hip.h | New debug utilities for testing matrix layout transformations |

Comment on lines +196 to +199
  transpose_intra_quad_fragments(reinterpret_cast<uint32_t*>(s_frag));
  f16x4 a = reinterpret_cast<const f16x4*>(s_frag)[0];
  f16x4 b = {f16(1.0f), f16(1.0f), f16(1.0f), f16(1.0f)};
- f32x4 c = {0.f, 0.f, 0.f, 0.f};
+ f32x4 c = {d[0], d[1], d[2], d[3]};

Copilot AI Sep 29, 2025

The transpose_intra_quad_fragments call modifies s_frag in-place but this side effect is not documented in the function signature or comments. Consider adding a comment explaining why the transpose is necessary for the rowsum operation.
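
As background for the rowsum remark, the all-ones `b` fragment in the snippet suggests the row sums are obtained by multiplying by a matrix of ones and accumulating into the existing result. Below is a plain CPU sketch of that identity, assuming a 16x16 tile; the function names are hypothetical and this is not the MFMA code path from the test.

```cpp
#include <array>
#include <cassert>  // for callers that want to assert on the results

constexpr int N = 16;  // assumed tile edge
using Mat = std::array<std::array<float, N>, N>;

// Reference multiply-accumulate: D = A * B + C.
Mat matmul_accumulate(const Mat& a, const Mat& b, const Mat& c) {
  Mat d = c;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      for (int k = 0; k < N; ++k) d[i][j] += a[i][k] * b[k][j];
  return d;
}

// Row sums via A * ones: every column of the product holds the row sums,
// so reading column 0 is enough.
std::array<float, N> rowsum_via_ones(const Mat& a) {
  Mat ones{};
  for (auto& row : ones) row.fill(1.0f);
  const Mat zero{};  // value-initialized accumulator, all 0.0f
  const Mat d = matmul_accumulate(a, ones, zero);
  std::array<float, N> sums{};
  for (int i = 0; i < N; ++i) sums[i] = d[i][0];
  return sums;
}

// Small integer-valued example so float arithmetic stays exact.
Mat make_example() {
  Mat a{};
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) a[i][j] = static_cast<float>(i + j);
  return a;
}
```

With make_example(), row i sums to 16 * i + 120, which rowsum_via_ones reproduces. Passing a non-zero accumulator to matmul_accumulate instead of zero adds the new row sums to earlier partial results.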

@diptorupd changed the title from "Feature/layout transform amat to b mat" to "Feature/layout transform A mat to B mat" on Sep 29, 2025
@diptorupd force-pushed the feature/layout_transform_Amat_to_B_mat branch from 01573de to 4421cc1 on October 1, 2025 16:39
@diptorupd merged commit fb4cd49 into amd-integration on Oct 1, 2025
1 check passed
@diptorupd deleted the feature/layout_transform_Amat_to_B_mat branch on October 1, 2025 17:09
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang formatting of headers.
Clang-formatting the headers causes compilation issues: the compiler is
unable to find `HIP WARP SYNC INTRINSICS`, causing failures. Disabling
clang format fixes these issues.

```
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.31 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.36 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed    3.35 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  114.08 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.22 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  559.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 719.07 sec
```
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
In this PR, we add infra for enabling decode via flashinfer gpu_iface.
This PR does not change existing infrastructure and we can still build
decode using AOT and JIT.

Tested locally:
```
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.12 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  541.87 sec
```

We will have a follow-up PR for enabling AOT decode using flashinfer
gpu_iface.
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
The C++ test suite was using `hipified` headers. In this PR, we port the unit tests over to use `gpu_iface`. This is necessary because the next step is to move the build infrastructure to `gpu_iface`.

This PR has been tested locally:
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.40 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.40 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed  985.27 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  112.40 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.46 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  556.81 sec

100% tests passed, 0 tests failed out of 6
```

To replicate the tests:
```
cd flashinfer/libflashinfer/tests/hip
mkdir build && cd build/
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
make
ctest
```
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
In this PR I remove the `libtorch` dependency and
`test_page.cpp`. `test_page.cpp` is the only unit test that uses
libtorch. However, we also have a pytest for testing page, and we will use
that for validation.

Removing the libtorch dependency will help us speed up Docker builds and
avoid pulling in additional dependencies.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
diptorupd added a commit that referenced this pull request Dec 5, 2025
Added a new transpose_inter_quad_fragments function that permutes MMA matrix fragments in registers across specific thread quads. The function is required to transpose an MMA tile from A layout to B layout and vice versa.

Renamed transpose_4x4_half_registers to transpose_intra_quad_fragments.
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
diptorupd pushed a commit that referenced this pull request Feb 2, 2026
diptorupd pushed a commit that referenced this pull request Feb 2, 2026
diptorupd pushed a commit that referenced this pull request Feb 2, 2026
diptorupd pushed a commit that referenced this pull request Feb 2, 2026
diptorupd added a commit that referenced this pull request Feb 2, 2026