Add tmp buffer and rotary mode to BatchDecode wrapper #2

Merged

yzh119 merged 1 commit into main from batch-decode-tmp-rotary on Sep 13, 2023
Conversation

@MasterJH5574 (Collaborator)

No description provided.

@yzh119 (Collaborator) left a comment:

LGTM

@yzh119 yzh119 merged commit 3d1f5b3 into main Sep 13, 2023
@MasterJH5574 MasterJH5574 deleted the batch-decode-tmp-rotary branch September 18, 2023 13:43
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
This PR fixes some of the unit test failures that occur in Single Decode. It also disables clang-format for the headers: formatting the headers causes compilation failures because the compiler is then unable to find `HIP WARP SYNC INTRINSICS`. Disabling clang-format for those headers fixes these issues.

```
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.31 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.36 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed    3.35 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  114.08 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.22 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  559.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 719.07 sec
```
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
The C++ test suite was using `hipified` headers. In this PR, we port the unit tests over to `gpu_iface`. This is necessary because the next step is to move the build infrastructure to `gpu_iface` as well.

This PR has been tested locally:
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.40 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.40 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed  985.27 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  112.40 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.46 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  556.81 sec

100% tests passed, 0 tests failed out of 6
```

To replicate the tests:
```
cd flashinfer/libflashinfer/tests/hip
mkdir build && cd build/
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
make
ctest
```
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
In this PR I remove the `libtorch` dependency and delete `test_page.cpp`, the only unit test that uses libtorch. We also have a pytest that covers page, and we will use that for validation instead.

Removing the libtorch dependency will help speed up Docker builds and drops an additional dependency.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
bobboli pushed a commit to bobboli/flashinfer that referenced this pull request Feb 17, 2026
yzh119 pushed a commit that referenced this pull request Feb 25, 2026
<!-- .github/pull_request_template.md -->

## 📌 Description

To fix the following bug:
When the CuteDSL MoE kernels were ported from TensorRT-LLM to
FlashInfer, the mPtrPermutedIdxToExpandedIdx field was accidentally
dropped from the routing kernel's DataBase struct in RoutingKernel.h.
TRT-LLM's routing kernel produces three reverse-mapping outputs:

1. mPtrExpandedIdxToPermutedIdx[expandedIdx] = permutedIdx — forward
mapping
2. mPtrPermutedIdxToExpandedIdx[permutedIdx] = expandedIdx — reverse to
expanded index (token_idx * topk + k)
3. mPtrPermutedIdxToTokenIdx[permutedIdx] = tokenIdx — reverse to token
index only

FlashInfer's port kept only #1 and #3, dropping #2. The binding in
moe_utils_binding.cu then had to wire the Python buffer
permuted_idx_to_expanded_idx to the only available reverse-mapping field
— mPtrPermutedIdxToTokenIdx — which writes plain tokenIdx instead of
expandedIdx.

### The Impact
The CuteDSL kernels (GEMM1 gather, moe_output_memset, GEMM2 finalize)
all expect expanded indices and derive the token index via expanded_idx
// topk. When they received plain tokenIdx instead, they computed
tokenIdx // topk — yielding the wrong A row for gather, wrong zero-init
for memset, and wrong scatter position + wrong routing scale for
finalize.


## Summary by CodeRabbit

* **Refactor**
  * Refined MoE (Mixture of Experts) routing infrastructure by extending index-mapping capabilities across multiple kernel implementations to improve internal data-flow consistency.

* **Tests**
  * Strengthened the accuracy-validation threshold from 0.925 to 0.97, with adjusted error-tolerance parameters, ensuring more rigorous testing of MoE operations under FP4 quantization.

ameynaik-hub pushed a commit to ameynaik-hub/flashinfer that referenced this pull request Mar 18, 2026