Updates to permuted_smem.cuh by diptorupd · Pull Request #8 · ROCm/flashinfer

diptorupd · 2025-10-02T19:37:45Z

- Updated the upcast_size function to always require a VectorWidthBits template parameter.
- Added the load_fragment, load_fragment_4x4_transposed, store_fragment, load_vector_async, store_64b, and store_vector functions.

Copilot

Pull Request Overview

Updates the permuted_smem.cuh header to standardize template parameters and add new memory operation functions. The PR modifies the upcast_size function to require a VectorWidthBits template parameter and introduces several new fragment and vector operations for different platforms.

- Standardized the upcast_size template parameter name from NumBits to VectorWidthBits and removed its default value.
- Added new fragment operations (load_fragment, load_fragment_4x4_transposed, store_fragment) with platform-specific implementations.
- Added vector operations (load_vector_async, store_64b, store_vector) to complement the existing memory functions.
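To make the upcast_size change concrete, here is a minimal host-only sketch. It is not the code from permuted_smem.cuh: the PR text only states that the parameter was renamed from NumBits to VectorWidthBits and lost its default value, so the signature, the return type, and the helper `upcast_size_examples` are assumptions made for illustration. `std::uint16_t` stands in for a 16-bit half type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, not the code from permuted_smem.cuh. The signature and
// return type are assumptions; the PR only says that VectorWidthBits is now a
// required template parameter (no default value).
template <typename T, std::size_t VectorWidthBits>
constexpr std::uint32_t upcast_size() {
  // Assumes 8-bit bytes.
  static_assert(VectorWidthBits % (8 * sizeof(T)) == 0,
                "vector width must hold a whole number of elements");
  // Number of T elements that fit in one vector of VectorWidthBits bits.
  return static_cast<std::uint32_t>(VectorWidthBits / (8 * sizeof(T)));
}

// Illustrative helper (not part of the PR): callers must spell out the width.
inline void upcast_size_examples() {
  assert((upcast_size<std::uint16_t, 128>() == 8));  // 128-bit vector of 16-bit values
  assert((upcast_size<std::uint16_t, 64>() == 4));   // 64-bit vector of 16-bit values
  assert((upcast_size<float, 128>() == 4));          // 128-bit vector of 32-bit values
}
```

With no default value, a call such as `upcast_size<std::uint16_t>()` no longer compiles in this sketch, so every call site has to state which vector width it assumes.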

libflashinfer/include/flashinfer/attention/generic/permuted_smem.cuh

This PR improves the CMake build system for the C++ API of flashinfer, making it more modular, easier to configure, and easier to integrate into downstream projects. The changes introduce a proper component-based installation for different usage scenarios of the C++ API, add CMake-based dependency management, and offer a cleaner build organization.

The C++ sources, tests, and benchmarks are now reorganized inside a `libflashinfer` directory. The original directories are left unmodified.

    flashinfer/
    ├── CMakeLists.txt
    ├── include/
    │   └── flashinfer/
    └── src/

    flashinfer/
    ├── CMakeLists.txt
    ├── cmake/
    │   └── utils/
    │   │   └── ConfigureTargets.cmake
    │   ├── Config.cmake.in
    │   ├── Dependencies.cmake
    │   ├── Options.cmake
    ├── libflashinfer/
    │   ├── CMakeLists.txt
    │   ├── include/
    │   │   └── flashinfer/
    │   │       ├── attention/
    │   │       └── distributed/
    │   │       └── gemm/
    │   ├── tests/
    │   │   ├── CMakeLists.txt # Consistent test configuration
    │   │   └── fp8/ # FP8-specific tests
    │   ├── utils/
    │   │   └── fp8/ # FP8 utility files
    │   └── benchmarks/
    │       ├── CMakeLists.txt # Standardized benchmark setup
    │       └── fp8/ # FP8-specific benchmarks

- Separated flashinfer CMake option definitions into distinct files for better organization
- Created utility modules for configuring tests and benchmarks consistently
- Consolidated dependency management in a single location
- Added a flashinferconfig so that downstream projects can call `find_package(flashinfer)` in their CMake scripts.
- The C++ API is broken down into multiple components: **Headers, Kernels, TVMBinding, Distributed.**
- Components can be configured and built separately
- Ex1. Configuring the C++ API to build AOT kernels, C++ unit tests and benchmarks

```bash
cmake .. -GNinja -DFLASHINFER_CUDA_ARCHITECTURES=80 -DFLASHINFER_BUILD_KERNELS=ON -DFLASHINFER_UNITTESTS=ON -DFLASHINFER_CXX_BENCHMARKS=ON -DFLASHINFER_CUTLASS_DIR=../3rdparty/cutlass -DCMAKE_INSTALL_PREFIX=~/devel/install
```

- Ex2. Build various components as needed

```bash
# Build the test_single_decode unit tests
cmake --build . --target test_single_decode
# Build all unit tests (broken right now)
cmake --build . --target build_tests
# Build bench_single_decode
cmake --build . --target bench_single_decode
# Build all benchmarks (broken right now)
cmake --build . --target build_benchmarks
```

- Ex3. Install components

```bash
cmake --build . --target install
```

- Added CMake targets to generate sources based on the same logic as in the current `setup.py`. Generated sources are now placed under the `CMAKE_CURRENT_BINARY_DIR`
- Ex4. Generate kernel sources

```bash
cmake --build . --target generate_kernels
```

These changes are primarily meant to streamline adding new CMake config options for a proposed HIP back end. They could also serve as a base for converting the build system to a fully unified CMake-based build using `scikit-build`, with a single configuration system for both the C++ and Python APIs.

This PR removes the `libtorch` dependency and `test_page.cpp`, the only unit test that uses libtorch. A pytest for page already exists, and we will use that for validation. Removing the libtorch dependency will speed up Docker builds and cut additional dependencies.

```Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................ Passed 0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest .......................... Passed 0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest ......................... Passed 1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest .................... Passed 7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest ..................... Passed 811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 ......... Passed 0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ... Passed 0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum ......................... Passed 0.27 sec

100% tests passed, 0 tests failed out of 8
```

* Updates to permuted_smem.cuh
- Updated the upcast_size function to always require a VectorWidthBits template parameter.
- Added the load_fragment, load_fragment_and_quad_transpose, store_fragment, load_vector_async, store_64b, and store_vector functions.

Updates to permuted_smem.cuh

0feba7c

- Updated the upcast_size function to always require a VectorWidthBits template parameter.
- Added the load_fragment, load_fragment_4x4_transposed, store_fragment, load_vector_async, store_64b, and store_vector functions.

diptorupd requested review from Copilot and demandal25 October 2, 2025 19:37

Copilot AI reviewed Oct 2, 2025

libflashinfer/include/flashinfer/attention/generic/permuted_smem.cuh (outdated, resolved)

libflashinfer/include/flashinfer/attention/generic/permuted_smem.cuh (resolved)

demandal25 approved these changes Oct 2, 2025

diptorupd added 2 commits October 2, 2025 15:48

Update per copilot suggestions

3f64e21

Update per copilot suggestions

133282d

diptorupd merged commit 650f542 into amd-integration Oct 2, 2025
1 check passed

diptorupd deleted the update/permuted_smem branch October 2, 2025 19:50
