Adds debug utility functions for CDNA3 MMA ops. #3

Merged: diptorupd merged 2 commits into amd-integration from feature/mma_debug_utils on Oct 1, 2025
@diptorupd
Collaborator

This PR adds debug utility functions for CDNA3 MMA (matrix multiply-accumulate) operations in the HIP backend. The utilities initialize, load, print, and write matrix data held in shared-memory arrays.

Key changes:

  • Adds debug utilities for CDNA3 MMA operations
  • Provides functions for matrix fragment loading with A and B layout patterns
  • Includes utilities for printing and visualizing matrix data in shared memory
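
To illustrate the initialize-and-print idea, here is a minimal host-side sketch. It is not the code from this PR: the PR's utilities operate on shared memory on the GPU, and their names and signatures are not shown in this conversation. The helper names `init_tile_sequential` and `format_tile` are hypothetical, and the tile is assumed to be a row-major `float` array.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): fill a rows x cols row-major tile
// with 0, 1, 2, ... so each value encodes its own linear index. A misplaced
// element is then easy to spot when the tile is printed.
std::vector<float> init_tile_sequential(std::size_t rows, std::size_t cols) {
  std::vector<float> tile(rows * cols);
  for (std::size_t i = 0; i < tile.size(); ++i) {
    tile[i] = static_cast<float>(i);
  }
  return tile;
}

// Hypothetical helper (not from the PR): render a row-major tile as text,
// one matrix row per line, each value right-aligned in a 4-character column.
std::string format_tile(const std::vector<float>& tile, std::size_t rows,
                        std::size_t cols) {
  assert(tile.size() == rows * cols);
  std::ostringstream out;
  out << std::fixed << std::setprecision(0);
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      out << std::setw(4) << tile[r * cols + c];
    }
    out << '\n';
  }
  return out.str();
}
```

Calling `format_tile(init_tile_sequential(2, 3), 2, 3)` yields two lines, `   0   1   2` and `   3   4   5`. A device-side version would presumably restrict printing to a single thread after a barrier, but that detail is an assumption.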

Copilot AI left a comment

Pull Request Overview

This PR adds debug utility functions specifically for CDNA3 MMA (Matrix Multiply Accumulate) operations in the HIP backend. The utilities provide comprehensive debugging capabilities for matrix operations in GPU shared memory.

  • Adds functions for initializing, loading, and printing matrix data in shared memory arrays
  • Implements matrix fragment loading with distinct A and B layout patterns for CDNA3 architecture
  • Provides utilities for visualizing and writing matrix data back to shared memory
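
The "distinct A and B layout patterns" can be pictured with a small host-side sketch. It assumes the lane-to-element mapping commonly described for a 16x16x16 fp16 MFMA on a 64-lane wavefront, where each lane holds 4 consecutive elements along K. This PR does not show its actual mapping, so treat the mapping, the constants, and the names `a_fragment_coord`, `b_fragment_coord`, and `covers_tile_exactly_once` as hypothetical.

```cpp
#include <cassert>

constexpr int kWavefrontSize = 64;  // lanes per wavefront (assumed)
constexpr int kTileDim = 16;        // M = N = K = 16 (assumed tile shape)
constexpr int kElemsPerLane = 4;    // 16 * 16 elements / 64 lanes

struct Coord {
  int row;
  int col;
};

// Assumed A layout (M x K): a lane owns row (lane % 16) and the four
// K-columns starting at 4 * (lane / 16).
inline Coord a_fragment_coord(int lane, int elem) {
  assert(lane >= 0 && lane < kWavefrontSize);
  assert(elem >= 0 && elem < kElemsPerLane);
  return Coord{lane % kTileDim, kElemsPerLane * (lane / kTileDim) + elem};
}

// Assumed B layout (K x N): a lane owns column (lane % 16) and the four
// K-rows starting at 4 * (lane / 16), i.e. the transpose of the A pattern.
inline Coord b_fragment_coord(int lane, int elem) {
  assert(lane >= 0 && lane < kWavefrontSize);
  assert(elem >= 0 && elem < kElemsPerLane);
  return Coord{kElemsPerLane * (lane / kTileDim) + elem, lane % kTileDim};
}

// Sanity check: a valid layout touches every tile element exactly once.
inline bool covers_tile_exactly_once(Coord (*coord_of)(int, int)) {
  int hits[kTileDim][kTileDim] = {};
  for (int lane = 0; lane < kWavefrontSize; ++lane) {
    for (int elem = 0; elem < kElemsPerLane; ++elem) {
      const Coord c = coord_of(lane, elem);
      ++hits[c.row][c.col];
    }
  }
  for (int r = 0; r < kTileDim; ++r) {
    for (int c = 0; c < kTileDim; ++c) {
      if (hits[r][c] != 1) return false;
    }
  }
  return true;
}
```

Under this assumed mapping, lane 17 holds A elements (1, 4) through (1, 7) and B elements (4, 1) through (7, 1). A coverage check like `covers_tile_exactly_once` is a cheap way to validate a layout function before using it to interpret printed fragments.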

@diptorupd diptorupd force-pushed the feature/mma_debug_utils branch from fb6e23a to 68aa735 Compare October 1, 2025 16:21
@diptorupd diptorupd merged commit 4518727 into amd-integration Oct 1, 2025
1 check passed
@diptorupd diptorupd deleted the feature/mma_debug_utils branch October 2, 2025 16:45
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
This PR introduces a patch to includes

Tested with unit tests:

```
Test project /root/amd_eng/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/4 Test #1: MathTest .........................   Passed    3.25 sec
    Start 2: PosEncTest
2/4 Test #2: PosEncTest .......................   Passed    3.25 sec
    Start 3: CascadeTest
3/4 Test #3: CascadeTest ......................   Passed    3.24 sec
    Start 4: PageTest
4/4 Test #4: PageTest .........................   Passed  161.15 sec

100% tests passed, 0 tests failed out of 4
```
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang-format for headers.
Formatting the headers with clang-format causes compilation issues: the
compiler is unable to find `HIP WARP SYNC INTRINSICS`, which causes
failures. Disabling clang-format for headers fixes these issues.

```
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.31 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.36 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed    3.35 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  114.08 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.22 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  559.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 719.07 sec
```
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
The C++ test suite was using `hipified` headers. In this PR, we port the unit tests over to `gpu_iface`. This is necessary because the next step is to move the build infrastructure to `gpu_iface`.

This PR has been tested locally:
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.40 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.40 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed  985.27 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  112.40 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.46 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  556.81 sec

100% tests passed, 0 tests failed out of 6
```

To replicate the tests:
```
cd flashinfer/libflashinfer/tests/hip
mkdir build && cd build/
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
make
ctest
```
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
In this PR I remove the `libtorch` dependency and `test_page.cpp`.
`test_page.cpp` is the only unit test that uses libtorch. However, we
also have a pytest that tests page, and we will use that for validation.

Removing the libtorch dependency will help us speed up Docker builds and
drop additional dependencies.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
diptorupd added a commit that referenced this pull request Dec 5, 2025
* Adds debug utility functions for CDNA3 MMA ops.
zhenhantech pushed five commits to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026, with the same commit messages as the Dec 5, 2025 entries above.
diptorupd pushed four commits and added one commit that referenced this pull request Feb 2, 2026, with the same commit messages as the Dec 5, 2025 entries above.