Feature/layout transform A mat to B mat #5
Pull Request Overview
This PR introduces a new transpose_inter_quad_fragments function that enables transforming MMA matrix fragments between A and B layouts in registers, along with renaming transpose_4x4_half_registers to transpose_intra_quad_fragments for clarity.
- Added `transpose_inter_quad_fragments` function for inter-quad register permutation
- Renamed `transpose_4x4_half_registers` to `transpose_intra_quad_fragments` and updated function references
- Added comprehensive unit tests for layout transformations and matrix loading patterns
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test_mfma_fp32_16x16x16fp16.cpp | Updated to use new function names and simplified matrix indexing |
| test_layout_transform.cpp | New comprehensive test file for layout transformation functionality |
| mma_ops.hpp | Updated API to use new function names and corrected template constraints |
| mma_hip.h | Added new transpose function, renamed existing one, and improved documentation |
| mma_debug_utils_hip.h | New debug utilities for testing matrix layout transformations |
libflashinfer/include/gpu_iface/backend/hip/mma_debug_utils_hip.h
transpose_intra_quad_fragments(reinterpret_cast<uint32_t*>(s_frag));
f16x4 a = reinterpret_cast<const f16x4*>(s_frag)[0];
f16x4 b = {f16(1.0f), f16(1.0f), f16(1.0f), f16(1.0f)};
f32x4 c = {0.f, 0.f, 0.f, 0.f};
f32x4 c = {d[0], d[1], d[2], d[3]};
The transpose_intra_quad_fragments call modifies s_frag in-place but this side effect is not documented in the function signature or comments. Consider adding a comment explaining why the transpose is necessary for the rowsum operation.
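The snippet under review multiplies the fragment by an all-ones `b` operand and accumulates into `c`. The identity behind that pattern is that `A * ones + C` places the sum of each row of `A`, plus the running value, in every column of that row. The following is a minimal host-side sketch of that identity only; `Tile`, `Column`, `mma_reference` and `rowsum_via_mma` are hypothetical names, the 4x4 tile size is chosen for brevity, and nothing here models the MFMA intrinsic, the register layout, or the reason the transpose is needed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side model; not the HIP/MFMA code from this PR.
constexpr std::size_t kTile = 4;
using Tile = std::array<std::array<float, kTile>, kTile>;
using Column = std::array<float, kTile>;

// Reference multiply-accumulate: returns a * b + c.
Tile mma_reference(const Tile& a, const Tile& b, const Tile& c) {
  Tile d = c;
  for (std::size_t i = 0; i < kTile; ++i)
    for (std::size_t j = 0; j < kTile; ++j)
      for (std::size_t k = 0; k < kTile; ++k)
        d[i][j] += a[i][k] * b[k][j];
  return d;
}

// Row sums of `a`, added to a running per-row value, computed by
// multiplying with an all-ones matrix instead of an explicit reduction.
Column rowsum_via_mma(const Tile& a, const Column& running) {
  Tile ones;
  for (auto& row : ones) row.fill(1.0f);

  // Seed the accumulator with the running sums (one value per row).
  Tile acc{};
  for (std::size_t i = 0; i < kTile; ++i) acc[i].fill(running[i]);

  const Tile d = mma_reference(a, ones, acc);

  Column sums{};
  for (std::size_t i = 0; i < kTile; ++i) {
    // Every column of row i holds the same value, so any one will do.
    assert(d[i][0] == d[i][kTile - 1]);
    sums[i] = d[i][0];
  }
  return sums;
}
```

Calling `rowsum_via_mma` repeatedly with the previous result as `running` accumulates row sums across tiles, which is one plausible reading of the `c = {d[0], d[1], d[2], d[3]}` line in the diff.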
01573de to 4421cc1
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang-format for headers.
Running clang-format on the headers causes compilation issues: the
compiler can no longer find `HIP WARP SYNC INTRINSICS`, and the build
fails. Disabling clang-format for headers fixes these issues.
```
Start 1: MathTest
1/6 Test #1: MathTest ......................... Passed 3.31 sec
Start 2: PosEncTest
2/6 Test #2: PosEncTest ....................... Passed 3.36 sec
Start 3: CascadeTest
3/6 Test #3: CascadeTest ...................... Passed 3.35 sec
Start 4: PageTest
4/6 Test #4: PageTest ......................... Passed 114.08 sec
Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest ................. Passed 35.22 sec
Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest .................. Passed 559.75 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 719.07 sec
```
This PR adds infrastructure for enabling decode via the flashinfer
`gpu_iface`. It does not change the existing infrastructure, so decode
can still be built using AOT and JIT.
Tested locally:
```
Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest ................. Passed 35.12 sec
Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest .................. Passed 541.87 sec
```
A follow-up PR will enable AOT decode using the flashinfer
`gpu_iface`.
The C++ test suite was using `hipified` headers. This PR ports the unit tests over to `gpu_iface`. This is necessary because the next step is to move the build infrastructure to `gpu_iface`.
This PR has been tested locally:
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/6 Test #1: MathTest ......................... Passed 3.40 sec
Start 2: PosEncTest
2/6 Test #2: PosEncTest ....................... Passed 3.40 sec
Start 3: CascadeTest
3/6 Test #3: CascadeTest ...................... Passed 985.27 sec
Start 4: PageTest
4/6 Test #4: PageTest ......................... Passed 112.40 sec
Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest ................. Passed 35.46 sec
Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest .................. Passed 556.81 sec
100% tests passed, 0 tests failed out of 6
```
To reproduce the tests:
```
cd flashinfer/libflashinfer/tests/hip
mkdir build && cd build/
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
make
ctest
```
This PR removes the `libtorch` dependency and `test_page.cpp`, which
was the only unit test that used libtorch. Page is still covered by a
pytest, which we will use for validation.
Removing the libtorch dependency will speed up Docker builds and
remove additional dependencies.
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/8 Test #1: MathTest ............................ Passed 0.31 sec
Start 2: PosEncTest
2/8 Test #2: PosEncTest .......................... Passed 0.31 sec
Start 3: CascadeTest
3/8 Test #3: CascadeTest ......................... Passed 1369.12 sec
Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest .................... Passed 7726.35 sec
Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest ..................... Passed 811.61 sec
Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 ......... Passed 0.30 sec
Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ... Passed 0.28 sec
Start 8: test_rowsum
8/8 Test #8: test_rowsum ......................... Passed 0.27 sec
100% tests passed, 0 tests failed out of 8
```
Added a new transpose_inter_quad_fragments function that permutes MMA matrix fragments in registers across specific thread quads. The function is required to transpose an MMA tile from A layout to B layout and vice versa. Renamed transpose_4x4_half_registers to transpose_intra_quad_fragments.
Added unit tests
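To make the intra-quad rename concrete, here is a small host-side model of what a 4x4 transpose within a quad means when each of four lanes holds four values. This is a sketch under stated assumptions, not the implementation in `mma_hip.h`: `QuadRegs`, `transpose_quad_model` and `check_round_trip` are hypothetical names, the values are plain `int` rather than packed half-precision registers, and which lanes form a quad on the real hardware is not modeled. The inter-quad permutation is not modeled either, since the description does not spell it out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical model: regs[lane][slot] is the value lane `lane` holds
// in its register slot `slot`. Four lanes form one quad.
constexpr std::size_t kQuad = 4;
using QuadRegs = std::array<std::array<int, kQuad>, kQuad>;

// After the call, lane i / slot j holds what lane j / slot i held before.
// On a GPU this exchange happens between threads; here it is a plain
// in-place matrix transpose over the modeled registers.
void transpose_quad_model(QuadRegs& regs) {
  for (std::size_t lane = 0; lane < kQuad; ++lane)
    for (std::size_t slot = lane + 1; slot < kQuad; ++slot)
      std::swap(regs[lane][slot], regs[slot][lane]);
}

// Self-check: transposing twice must restore the original contents.
void check_round_trip(QuadRegs regs) {
  const QuadRegs original = regs;
  transpose_quad_model(regs);
  transpose_quad_model(regs);
  assert(regs == original);
}
```

Because the model mutates `regs` in place, it mirrors the in-place behavior the review comment points out for `s_frag`: callers that still need the original layout have to keep a copy or transpose back.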