Updates to permuted_smem.cuh #8

Merged
diptorupd merged 3 commits into amd-integration from update/permuted_smem
Oct 2, 2025

@diptorupd
Collaborator

  • Updated the upcast_size function to always require a VectorWidthBits template parameter
  • Added the load_fragment, load_fragment_4x4_transposed, store_fragment, load_vector_async, store_64b, and store_vector functions.


Copilot AI left a comment

Pull Request Overview

Updates the permuted_smem.cuh header to standardize template parameters and add new memory operation functions. The PR modifies the upcast_size function to require a VectorWidthBits template parameter and introduces several new fragment and vector operations for different platforms.

  • Standardized the upcast_size function template parameter naming from NumBits to VectorWidthBits and removed default value
  • Added new fragment operations (load_fragment, load_fragment_4x4_transposed, store_fragment) with platform-specific implementations
  • Added vector operations (load_vector_async, store_64b, store_vector) to complement existing memory functions
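
As a rough illustration of the first point, a width parameter with no default might look like the sketch below. Only the names `upcast_size` and `VectorWidthBits` come from this PR. The function body, the divisibility check, and the use of `std::uint16_t` as a stand-in for a 16-bit half type are assumptions made for illustration, and the actual header may compute the value differently.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, not the code from permuted_smem.cuh.
// Returns how many elements of type T fit in one vector of VectorWidthBits bits.
// VectorWidthBits has no default value, so every call site must state the width.
template <typename T, std::size_t VectorWidthBits>
constexpr std::uint32_t upcast_size() {
  static_assert(VectorWidthBits % (8 * sizeof(T)) == 0,
                "vector width must hold a whole number of elements");
  return static_cast<std::uint32_t>(VectorWidthBits / (8 * sizeof(T)));
}

// Compile-time checks; std::uint16_t stands in for a 16-bit half type.
static_assert(upcast_size<std::uint16_t, 128>() == 8, "eight 16-bit elements per 128-bit vector");
static_assert(upcast_size<std::uint16_t, 64>() == 4, "four 16-bit elements per 64-bit vector");
```

With this shape, a call such as `upcast_size<T>()` no longer compiles, which forces callers to choose between, say, 64-bit and 128-bit vectors explicitly.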

@diptorupd diptorupd merged commit 650f542 into amd-integration Oct 2, 2025
1 check passed
@diptorupd diptorupd deleted the update/permuted_smem branch October 2, 2025 19:50
diptorupd added a commit that referenced this pull request Dec 5, 2025
This PR improves the CMake build system for the C++ API of flashinfer,
making it more modular, easier to configure, and overall easier to
integrate into downstream projects. The changes introduce a proper
component-based installation for different usage scenarios for the C++
API, add CMake-based dependency management, and generally offer a
cleaner build organization.

The C++ sources, tests, and benchmarks are now reorganized inside a
`libflashinfer` directory. The original directories are left unmodified.

Original layout:

flashinfer/
├── CMakeLists.txt
├── include/
│   └── flashinfer/
└── src/

New layout:

flashinfer/
├── CMakeLists.txt
├── cmake/
│   ├── utils/
│   │   └── ConfigureTargets.cmake
│   ├── Config.cmake.in
│   ├── Dependencies.cmake
│   └── Options.cmake
├── libflashinfer/
│   ├── CMakeLists.txt
│   ├── include/
│   │   └── flashinfer/
│   │       ├── attention/
│   │       ├── distributed/
│   │       └── gemm/
│   ├── tests/
│   │   ├── CMakeLists.txt          # Consistent test configuration
│   │   └── fp8/                    # FP8-specific tests
│   ├── utils/
│   │   └── fp8/                    # FP8 utility files
│   └── benchmarks/
│       ├── CMakeLists.txt          # Standardized benchmark setup
│       └── fp8/                    # FP8-specific benchmarks

- Separated flashinfer CMake option definitions into distinct files for
better organization
- Created utility modules for configuring tests and benchmarks
consistently
- Consolidated dependency management in a single location
- Added a flashinfer CMake config so that downstream projects can call
`find_package(flashinfer)` in their CMake scripts.

- The C++ API is broken down into multiple components: **Headers,
  Kernels, TVMBinding, Distributed.**
- Components can be configured and built separately
    - Ex1. Configure the C++ API to build AOT kernels, C++ unit tests, and
      benchmarks
      ```bash
      cmake .. -GNinja -DFLASHINFER_CUDA_ARCHITECTURES=80 \
        -DFLASHINFER_BUILD_KERNELS=ON -DFLASHINFER_UNITTESTS=ON \
        -DFLASHINFER_CXX_BENCHMARKS=ON \
        -DFLASHINFER_CUTLASS_DIR=../3rdparty/cutlass \
        -DCMAKE_INSTALL_PREFIX=~/devel/install
      ```
    - Ex2. Build various components as needed
      ```bash
      # Build the test_single_decode unit tests
      cmake --build . --target test_single_decode
      # Build all unit tests (broken right now)
      cmake --build . --target build_tests
      # Build bench_single_decode
      cmake --build . --target bench_single_decode
      # Build all benchmarks (broken right now)
      cmake --build . --target build_benchmarks
      ```
    - Ex3. Install components
      ```bash
      cmake --build . --target install
      ```
- Added CMake targets to generate sources based on the same logic as in
  the current `setup.py`. Generated sources are now placed under
  `CMAKE_CURRENT_BINARY_DIR`.

    - Ex4. Generate kernel sources
      ```bash
      cmake --build . --target generate_kernels
      ```

These changes are primarily meant to make it easier to add new CMake
config options for a proposed HIP back end. However, they could also be
built upon to convert the build system to a fully unified CMake-based
one using `scikit-build`, with a single configuration system for both
the C++ and Python APIs.
diptorupd pushed a commit that referenced this pull request Dec 5, 2025
In this PR I remove the `libtorch` dependency and `test_page.cpp`.
`test_page.cpp` is the only unit test that uses libtorch. However, we
also have a pytest for testing page, and we will use that for
validation.

Removing the libtorch dependency will help us speed up Docker builds
and avoid pulling in additional dependencies.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
diptorupd added a commit that referenced this pull request Dec 5, 2025
* Updates to permuted_smem.cuh

    - Updated the upcast_size function to always require a VectorWidthBits tparam
    - Added load_fragment, load_fragment_and_quad_transpose, store_fragment, load_vector_async,
      store_64b, and store_vector functions.
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
zhenhantech pushed a commit to zhenhantech/flashinfer that referenced this pull request Jan 9, 2026
diptorupd added a commit that referenced this pull request Feb 2, 2026
diptorupd pushed a commit that referenced this pull request Feb 2, 2026
diptorupd added a commit that referenced this pull request Feb 2, 2026