Enable FP8 support for Flashinfer ROCm decode kernels on CDNA3#40

Merged
diptorupd merged 5 commits into ROCm:amd-integration from rtmadduri:feature/try-fp8
Nov 12, 2025

Conversation

@rtmadduri (Collaborator)

This PR enables support for the `__hip_fp8_e4m3fnuz` and `__hip_fp8_e5m2` dtypes for the decode kernels.

This PR adds:

- PyTorch support for the `__hip_fp8` variants for both AOT and JIT
- Additional utility conversion functions to move between `__hip_fp8_e4m3fnuz`, `__hip_fp8_e5m2`, `__half`, and `float`
- Modifications to the chunking logic to accommodate the fp8 dtypes
- A new batch decode pytest for fp8

Note: This PR does not add `RoPE-Llama` support for the `__hip_fp8_*` variants. This will be addressed in a separate PR.
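For reference, the e4m3fnuz layout these conversions target can be sketched in a few lines of pure Python. This is an illustration of the bit format only, not the HIP conversion code added by this PR; the function name is invented:

```python
def decode_e4m3fnuz(bits: int) -> float:
    """Decode one byte in e4m3fnuz layout: 1 sign bit, 4 exponent bits,
    3 mantissa bits, exponent bias 8, no infinities, and 0x80 as the
    single NaN encoding."""
    if bits == 0x80:
        return float("nan")
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> 3) & 0xF
    man = bits & 0x7
    if exp == 0:  # subnormal range
        return sign * (man / 8.0) * 2.0 ** -7
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 8)

print(decode_e4m3fnuz(0x40))  # 1.0
print(decode_e4m3fnuz(0x7F))  # 240.0, the largest finite e4m3fnuz value
```

The narrow range (max 240 for e4m3fnuz, with e5m2 trading mantissa bits for a wider exponent range) is why separate conversion utilities to and from `__half` and `float` are needed.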

PyTest results - `tests/test_batch_decode_kernels_hip_fp8.py`

```
===================== 864 passed, 4 warnings in 55.39s =====================
```

Running the entire test suite - `scripts/run_hip_tests.sh`

```
=========17252 passed, 18 skipped, 12 warnings in 332.42s (0:05:32) =========
```
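The chunking change listed above plausibly amounts to making chunk sizes dtype-aware: under a fixed on-chip memory budget, a 1-byte fp8 element fits twice as many KV entries per chunk as a 2-byte half element. A toy sketch; all names and numbers here are invented for illustration and are not taken from the PR:

```python
def kv_entries_per_chunk(budget_bytes: int, head_dim: int, itemsize: int) -> int:
    """How many KV-cache entries of head_dim elements fit in budget_bytes."""
    return budget_bytes // (head_dim * itemsize)

half_chunk = kv_entries_per_chunk(64 * 1024, head_dim=128, itemsize=2)
fp8_chunk = kv_entries_per_chunk(64 * 1024, head_dim=128, itemsize=1)
print(half_chunk, fp8_chunk)  # 256 512
```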

@demandal25 demandal25 self-requested a review November 10, 2025 15:08
@diptorupd diptorupd changed the title Enable FP8 support for Flashinfer ROCm on CDNA3 Enable FP8 support for Flashinfer ROCm decode kernels on CDNA3 Nov 10, 2025

@demandal25 (Collaborator) left a comment


Left some minor comments and questions for clarification.

@demandal25 demandal25 self-requested a review November 11, 2025 18:17

@demandal25 (Collaborator) left a comment


Thanks for addressing the comments.

```
FLASHINFER_ENABLE_FP8="OFF"
FLASHINFER_ENABLE_FP8_E4M3="OFF"
FLASHINFER_ENABLE_FP8_E5M2="OFF"
FLASHINFER_ENABLE_FP8="ON"
```


Looking at Options.cmake: setting `FLASHINFER_ENABLE_FP8` already sets `FLASHINFER_ENABLE_FP8_E4M3` and `FLASHINFER_ENABLE_FP8_E5M2` to true, so we should only use the `FLASHINFER_ENABLE_FP8` flag.
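A minimal sketch of the relationship being described; the actual Options.cmake may be structured differently:

```cmake
# The umbrella flag implies both per-variant flags, so only
# FLASHINFER_ENABLE_FP8 needs to be set explicitly.
option(FLASHINFER_ENABLE_FP8 "Enable FP8 support" OFF)
if(FLASHINFER_ENABLE_FP8)
  set(FLASHINFER_ENABLE_FP8_E4M3 ON)
  set(FLASHINFER_ENABLE_FP8_E5M2 ON)
endif()
```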



Can be fixed later.

Signed-off-by: Debasis Mandal <Debasis.Mandal@amd.com>
@diptorupd diptorupd merged commit 1d29f90 into ROCm:amd-integration Nov 12, 2025
1 check passed
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang formatting of headers: reformatting the
headers causes compilation issues in which the compiler is unable to
find the `HIP WARP SYNC INTRINSICS`, so disabling clang format fixes
these failures.

```
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.31 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.36 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed    3.35 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  114.08 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.22 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  559.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 719.07 sec
```
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
diptorupd pushed a commit to diptorupd/flashinfer that referenced this pull request Jan 28, 2026
diptorupd pushed a commit to diptorupd/flashinfer that referenced this pull request Jan 28, 2026