Feature/layout transform A mat to B mat by diptorupd · Pull Request #5 · ROCm/flashinfer

diptorupd · 2025-09-29T20:39:17Z

Added a new transpose_inter_quad_fragments function that permutes MMA matrix fragments in registers across specific thread quads. The function is required to transpose an MMA tile from A layout to B layout and vice versa.

Rename transpose_4x4_half_registers to transpose_intra_quad_fragments

Added unit tests
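
To make the idea of an in-register transpose within a thread quad concrete, here is a minimal host-side C++ model. It is a sketch under stated assumptions, not the code in mma_hip.h: it assumes each of the four lanes in a quad holds four half-precision values forming one row of a 4x4 block, and the names QuadTile and transpose_quad_model are hypothetical. The device function exchanges registers between lanes rather than swapping array entries. The inter-quad variant applies a permutation across quads that this PR text does not spell out, so it is not modelled here.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical host-side model: lanes[l][r] is register slot r of lane l.
// Each entry stands in for the raw 16 bits of one half-precision value.
using QuadTile = std::array<std::array<std::uint16_t, 4>, 4>;

// Transpose the 4x4 block in place, so that lane l, which started with
// row l of the block, ends up holding column l.
inline void transpose_quad_model(QuadTile& lanes) {
  for (int l = 0; l < 4; ++l) {
    for (int r = l + 1; r < 4; ++r) {
      std::swap(lanes[l][r], lanes[r][l]);
    }
  }
}
```

Filling the tile with the values 4 * lane + slot and calling transpose_quad_model leaves lane 0 holding 0, 4, 8, 12, which is the first column of the original block.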

Copilot

Pull Request Overview

This PR introduces a new transpose_inter_quad_fragments function that enables transforming MMA matrix fragments between A and B layouts in registers, along with renaming transpose_4x4_half_registers to transpose_intra_quad_fragments for clarity.

Added transpose_inter_quad_fragments function for inter-quad register permutation
Renamed transpose_4x4_half_registers to transpose_intra_quad_fragments and updated function references
Added comprehensive unit tests for layout transformations and matrix loading patterns

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

File	Description
test_mfma_fp32_16x16x16fp16.cpp	Updated to use new function names and simplified matrix indexing
test_layout_transform.cpp	New comprehensive test file for layout transformation functionality
mma_ops.hpp	Updated API to use new function names and corrected template constraints
mma_hip.h	Added new transpose function, renamed existing one, and improved documentation
mma_debug_utils_hip.h	New debug utilities for testing matrix layout transformations

libflashinfer/include/gpu_iface/mma_ops.hpp

libflashinfer/include/gpu_iface/backend/hip/mma_debug_utils_hip.h

libflashinfer/tests/hip/test_mfma_fp32_16x16x16fp16.cpp

Copilot · 2025-09-29T20:40:50Z

libflashinfer/include/gpu_iface/backend/hip/mma_hip.h

+  transpose_intra_quad_fragments(reinterpret_cast<uint32_t*>(s_frag));
  f16x4 a = reinterpret_cast<const f16x4*>(s_frag)[0];
  f16x4 b = {f16(1.0f), f16(1.0f), f16(1.0f), f16(1.0f)};
-  f32x4 c = {0.f, 0.f, 0.f, 0.f};
+  f32x4 c = {d[0], d[1], d[2], d[3]};


The transpose_intra_quad_fragments call modifies s_frag in-place but this side effect is not documented in the function signature or comments. Consider adding a comment explaining why the transpose is necessary for the rowsum operation.
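
The diff above multiplies the fragment by a vector of ones and seeds the accumulator from d instead of zero. The following host-side C++ sketch shows why that yields an accumulating row sum. It is a plain reference loop under assumed 16x16 tile dimensions (taken from the test name test_mfma_fp32_16x16x16fp16), not the MFMA intrinsic, and the names Tile, kTile and rowsum_by_ones_matmul are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kTile = 16;  // assumed tile edge, for illustration only
using Tile = std::array<std::array<float, kTile>, kTile>;

// Reference for D = A * Ones + D: every column j of row i receives the sum of
// row i of `a`, added to whatever the accumulator already held.
inline void rowsum_by_ones_matmul(const Tile& a, Tile& acc) {
  for (std::size_t i = 0; i < kTile; ++i) {
    for (std::size_t j = 0; j < kTile; ++j) {
      float sum = acc[i][j];  // seed from the running value, not from zero
      for (std::size_t k = 0; k < kTile; ++k) {
        sum += a[i][k] * 1.0f;  // the B operand is all ones
      }
      acc[i][j] = sum;
    }
  }
}
```

Seeding the accumulator with zeros, as in the removed line of the diff, would discard the running total on every call. Seeding it from d lets repeated calls accumulate row sums across tiles.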

This PR fixes some of the unit test failures that occur in Single Decode. It also disables clang formatting of headers. The clang format of headers causes compilation issues. The compiler is unable to find `HIP WARP SYNC INTRINSICS` causing failures. Disabling clang format fixes these issues

```
Start 1: MathTest
1/6 Test #1: MathTest ......................... Passed 3.31 sec
Start 2: PosEncTest
2/6 Test #2: PosEncTest ....................... Passed 3.36 sec
Start 3: CascadeTest
3/6 Test #3: CascadeTest ...................... Passed 3.35 sec
Start 4: PageTest
4/6 Test #4: PageTest ......................... Passed 114.08 sec
Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest ................. Passed 35.22 sec
Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest .................. Passed 559.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 719.07 sec
```

In this PR, we add infra for enabling decode via flashinfer gpu_iface. This PR does not change existing infrastructure and we can still build decode using AOT and JIT. Tested locally

```
Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest ................. Passed 35.12 sec
Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest .................. Passed 541.87 sec
```

We will have a follow up PR for enabling AOT decode using flashinfer gpu_iface

CPP test suite was using `hipified` headers. In this PR, we port over unit tests to use `gpu_iface`. This is necessary for us as the next step is to move the build infrastructure to use `gpu_iface`

This PR has been tested locally

```
Test project /root/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/6 Test #1: MathTest ......................... Passed 3.40 sec
Start 2: PosEncTest
2/6 Test #2: PosEncTest ....................... Passed 3.40 sec
Start 3: CascadeTest
3/6 Test #3: CascadeTest ...................... Passed 985.27 sec
Start 4: PageTest
4/6 Test #4: PageTest ......................... Passed 112.40 sec
Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest ................. Passed 35.46 sec
Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest .................. Passed 556.81 sec

100% tests passed, 0 tests failed out of 6
```

To replicate the tests

```
cd flashinfer/libflashinfer/tests/hip
mkdir build && cd build/
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
make
ctest
```

In this PR I remove the `libtorch` dependency and removed `test_page.cpp`. `test_page.cpp` is the only unit test that uses libtorch. However, we also have a pytest for testing page. We will use that for validation. Removing the libtorch dependency will help us speed docker builds and remove additional dependencies.

```
Test project /root/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/8 Test #1: MathTest ............................ Passed 0.31 sec
Start 2: PosEncTest
2/8 Test #2: PosEncTest .......................... Passed 0.31 sec
Start 3: CascadeTest
3/8 Test #3: CascadeTest ......................... Passed 1369.12 sec
Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest .................... Passed 7726.35 sec
Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest ..................... Passed 811.61 sec
Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 ......... Passed 0.30 sec
Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ... Passed 0.28 sec
Start 8: test_rowsum
8/8 Test #8: test_rowsum ......................... Passed 0.27 sec

100% tests passed, 0 tests failed out of 8
```

diptorupd assigned rtmadduri and demandal25 Sep 29, 2025

diptorupd requested review from Copilot, demandal25 and rtmadduri September 29, 2025 20:39

diptorupd assigned diptorupd and unassigned rtmadduri and demandal25 Sep 29, 2025

Copilot AI reviewed Sep 29, 2025

diptorupd changed the title ~~Feature/layout transform amat to b mat~~ Feature/layout transform A mat to B mat Sep 29, 2025

diptorupd added 7 commits October 1, 2025 12:33

Fix load_fragment

583d096

Fix Array OOO access in debug function

9fd2613

Fixes based on Copilot review.

61e990f

New function to block transpose an MMA tile

f4cfc68

Update function name in top-level API

e60bc82

Rename function

95e2fe2

Fix pre-commit

4421cc1

diptorupd force-pushed the feature/layout_transform_Amat_to_B_mat branch from 01573de to 4421cc1 Compare October 1, 2025 16:39

diptorupd added 3 commits October 1, 2025 12:42

Fix typo in func name

a76e92d

Fix typo in comment

c19475c

Fix

4c1d9fc

demandal25 approved these changes Oct 1, 2025

diptorupd merged commit fb4cd49 into amd-integration Oct 1, 2025
1 check passed

diptorupd deleted the feature/layout_transform_Amat_to_B_mat branch October 1, 2025 17:09

diptorupd merged 10 commits into amd-integration from feature/layout_transform_Amat_to_B_mat

4 participants
