Skip to content

[Wrapper] TVM wrapper for batch-decode kernel without RoPE#1

Merged
yzh119 merged 1 commit intomainfrom
batch-decode-wrapper
Sep 11, 2023
Merged

[Wrapper] TVM wrapper for batch-decode kernel without RoPE#1
yzh119 merged 1 commit intomainfrom
batch-decode-wrapper

Conversation

@MasterJH5574
Copy link
Collaborator

Will extend with RoPE later on.

Copy link
Collaborator

@yzh119 yzh119 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yzh119 yzh119 merged commit 77242ef into main Sep 11, 2023
@MasterJH5574 MasterJH5574 deleted the batch-decode-wrapper branch September 18, 2023 13:43
JackFram pushed a commit to JackFram/flashinfer that referenced this pull request May 4, 2025
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang formatting of headers.
The clang format of headers causes compilation issues. The compiler is
unable to find `HIP WARP SYNC INTRINSICS` causing failures. Disabling
clang format fixes these issues

```
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.31 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.36 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed    3.35 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  114.08 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.22 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  559.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 719.07 sec
```
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
CPP test suite was using `hipified` headers. In this PR, we port over unit tests to use `gpu_iface`. This is necessary for us as the next step is to move the build infrastructure to use `gpu_iface`

This PR has been tested locally 
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.40 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.40 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed  985.27 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  112.40 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.46 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  556.81 sec

100% tests passed, 0 tests failed out of 6
```

To replicate the tests
```
cd flashinfer/libflashinfer/tests/hip
```
```
mkdir build && cd build/
```
```
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
```
```
make
```
```
ctest
```
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
In this PR I remove the `libtorch` dependency and removed
`test_page.cpp`. `test_page.cpp` is the only unit test that uses
libtorch. However, we also have a pytest for testing page. We will use
that for validation.

Removing the libtorch dependency will help us speed docker builds and
remove additional dependencies.


```Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
wangbo981016 pushed a commit to meituan-longcat/flashinfer that referenced this pull request Feb 5, 2026
Update to v0.5.2 and opt cuda graph launch config for MTP situation
* fix q len for MTP;
* release: Bump version for v0.5.2 release (flashinfer-ai#2057)

<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
  * Version updated to 0.5.2

<!-- end of auto-generated comment: release notes by coderabbit.ai -->;
* [BUG] Fix trtllm-gen fp4 moe renormalize routing (flashinfer-ai#2049)

<!-- .github/pull_request_template.md -->

## 📌 Description

Temporarily disable `routingIndicesBlockKernel` as it's not compatible
with the current packing format (topk-id and expert weights are packed
into a 32 bit tensor). This solves the issue
flashinfer-ai#2032

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Forced multi-block MoE execution to avoid sporadic single-block
selection and improve stability with certain workloads.

* **New Features**
* Added an alternative packed top‑k routing input path that propagates
routing scores when present.

* **Tests**
* Added a comprehensive parametrized test validating routed fused MoE
across token counts, model sizes, expert counts and multiple
quantization modes.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Co-authored-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>;
* test: Skip test_fp8_quantize.py on Hopper (flashinfer-ai#2052)

<!-- .github/pull_request_template.md -->

## 📌 Description

The unit test `test_fp8_quantize.py` currently fails on sm90. 

Root cause: The test file tests the accuracy of `mxfp8_quantize()`.
However, in
[fp8_quantization.py](https://github.com/flashinfer-ai/flashinfer/blob/adb0e89fdee0a3140a43982bc3bef4e79ce20046/flashinfer/fp8_quantization.py#L7),
the `mxfp8_quantize()`'s underlying module only exists for
`gen_mxfp8_quantization_sm100_module` with no sm90 support.

Current PR changes test file to skip for pre-SM100 SM archs as they are
not supported..

Results:
* Before current PR on SM90: `72 failed, 40 passed in 2.69s`
* After current PR on SM90: `40 passed, 72 skipped in 1.41s`
* Before current PR on SM120: `112 passed  in 1.59s`
* After current PR on SM120: `112 passed in 1.54s` (expected to be the
same as before)

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Tests**
* Added conditional checks to skip FP8 quantization tests on GPUs that
lack required computational capabilities.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->;
* Add support for topkPacked input in block-level renormalize (flashinfer-ai#2051)

<!-- .github/pull_request_template.md -->

## 📌 Description

Add support for topkPacked input in block-level renormalize

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Performance**
* Optimized routing layer efficiency through improved index handling in
specialized processing configurations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>;
* chore: Update CODEOWNERS (flashinfer-ai#1984)

## Summary

This PR updates the CODEOWNERS file based on git commit history analysis
from the last 180 days.

## Changes

- Updated `.github/CODEOWNERS` with current code ownership based on:
  - Commit frequency
  - File coverage
  - Commit recency

## How to Review

1. Review the changes to `.github/CODEOWNERS`
2. Verify that the assigned owners are appropriate for each module
3. Make manual adjustments if needed before merging

## Notes

- This is an automated PR generated weekly
- Minimum commits threshold: 1
- Analysis period: 180 days
- Directory depth: 3 levels
- Top N owners per module: 5

---

🤖 This PR was automatically generated by the [update-codeowners
workflow](.github/workflows/update-codeowners.yml)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated code ownership assignments and reorganized related section
mappings for internal development processes.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: flashinfer-bot <flashinfer-bot@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>;
* Update trtllm-gen fused moe routing kernel and add more kernels (flashinfer-ai#1955)

<!-- .github/pull_request_template.md -->

## 📌 Description
co-work with @IwakuraRein 
- update the trtllm-gen fused moe headers
- add new kernels for trtllm-gen fused moe
  - for NvFp4, add tile 256
  - for MxFp8 x MxFp4, add 128, 256
  - for FP8 per-tensor, add 192, 256
  - for FP8 block scale, add 128
 - update the logics of `computeSelectedTileN`
 - add `tune_max_num_tokens` to FP8 per-tensor and FP8 block scale
 - rename `TLLM_GEN_BMM_CUBIN_PATH` to `TLLM_GEN_GEMM_CUBIN_PATH`
 - add `TLLM_GEN_EXPORT_FLASHINFER`

**NOTE: split-k kernels are temporarily disabled as they cause failure
in renormalize + expert 256 tests.**

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Expanded MoE tiling (adds 128/192/256), FP8 per‑tensor MoE path,
FP8/FP4 autotuner benchmark, and new tune_max_num_tokens tuning
parameter.

* **Improvements**
* Router now supports tile‑based (non‑power‑of‑two) layouts and
propagates explicit valid M/N/K for safer sizing; autotuner logs include
exception details; added export/compile flags and clearer kernel error
messages.

* **Bug Fixes**
* Relaxed strict padding/power‑of‑two checks and made log2 handling
safer.

* **Tests**
* Extended MoE tests to cover new FP8 block‑scale and routing scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>;
* Fix dtype of output scales from mnnvl_moe_alltoallv_prepare_without_allgather (flashinfer-ai#2048)

<!-- .github/pull_request_template.md -->

## 📌 Description

During flashinfer-ai#1641 the dtype
of output scales in
moePrepare(mnnvl_moe_alltoallv_prepare_without_allgather) was accidently
changed from float to int32. This PR fixes that.

## 🔍 Related Issues

Fix flashinfer-ai#2040

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Corrected tensor type validation for mixture-of-experts scale
preparation so scales are validated and handled as float32, preventing
type mismatches with downstream float operations.
* Ensured scale tensors are created on the same device as expert
identifiers, keeping tensor placement consistent across distributed
processing and avoiding cross-device issues.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>;
* test: Fix test_sampling.py on Spark (flashinfer-ai#2042)

<!-- .github/pull_request_template.md -->

## 📌 Description

Current PR fixes `test_sampling.py::test_softmax` on Spark by inserting
a `torch.cuda.synchronize()` before calling the softmax function.

tl; dr why it works: PDL is enabled in these tests. Investigation shows
that when PDL is enabled, `logits.view(-1).index_fill_(0, inf_idx,
float("-inf"))` that prepares the inputs overlaps with the `probs =
flashinfer.sampling.softmax(logits, temperature=temperature_arr)`
function itself. Hence, we need to ensure that the input preparation is
complete before running the softmax function to get the correct output.


#### Observations
`test_sampling.py::test_softmax` fails on select cases Spark. Example
output
```
# pytest tests/utils/test_sampling.py::test_softmax
=================================================================================================================================================== test session starts ===================================================================================================================================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /flashinfer
configfile: pytest.ini
collected 324 items                                    
...
================================================================================================================================================= short test summary info =================================================================================================================================================
FAILED tests/utils/test_sampling.py::test_softmax[True-True-1.0-normal_distribution(std=1)-128256-989] - AssertionError: assert False
FAILED tests/utils/test_sampling.py::test_softmax[True-True-1.0-normal_distribution(std=5)-128256-989] - AssertionError: assert False
FAILED tests/utils/test_sampling.py::test_softmax[True-True-1.0-gumbel_distribution(beta=0.1)-128256-989] - AssertionError: assert False
======================================================================================================================================== 3 failed, 321 passed, 1 warning in 10.33s
```

Observations from debugging:
* When outputs are printed, rows containing all `nan`s are produced in
the output of `probs = flashinfer.sampling.softmax(logits)`
* Surprisingly, the test passes with `CUDA_LAUNCH_BLOCKING=1 pytest
tests/utils/test_sampling.py::test_softmax`
* `compute-sanitizer` does not detect any IMAs
* Running only a failed test results in a pass:
```
$ pytest tests/utils/test_sampling.py::test_softmax[True-True-1.0-normal_distribution\(std=1\)-128256-989]
...
1 passed, 1 warning in 0.80s
```

Towards a fix:
* I empirically find that the test passes:
* when the reference `torch.softmax()` is called before
`flashinfer.sampling.softmax()` (currently reference is called after)
* when pdl is disabled in [line
67](https://github.com/flashinfer-ai/flashinfer/blob/main/tests/utils/test_sampling.py#L67)
with `probs = flashinfer.sampling.softmax(logits,
temperature=temperature_arr,enable_pdf=False)`
* when `torch.cuda.synchronize()` is inserted in the line 64 as in this
PR.
```
    if neg_inf_input:
        # assign random logits to -inf
        num_inf = torch.randint(0, logits.numel() - 1, (), device=logits.device).item()
        inf_idx = torch.randperm(logits.numel(), device=logits.device)[:num_inf]
        logits.view(-1).index_fill_(0, inf_idx, float("-inf"))
        torch.cuda.synchronize() ## This fixes the issue for some reason!

    if temperature_arr:
        temperature_arr = torch.full((batch_size,), temperature, device="cuda:0")
        probs = flashinfer.sampling.softmax(logits, temperature=temperature_arr)
        logits_scaled = logits / temperature_arr.unsqueeze(-1)
```
but **does not fix the issue if I place the synchronization any
earlier**

An nsys profile shows that surprisingly the
`logits.view(-1).index_fill_(0, inf_idx, float("-inf"))` and
`flashinfer.sampling.softmax(logits, temperature=temperature_arr)` can
overlap execution when pdl is enabled.
<img width="1243" height="640" alt="Screenshot 2025-11-04 at 5 49 50 PM"
src="https://github.com/user-attachments/assets/950ab8ab-0843-49c8-8411-ff81c00c34a6"
/>

This means that the softmax kernel is launching before inputs are done
being prepared when `neg_inf_input=True`. Hence, placing a
`torch.cuda.synchronize()` after the fill or disabling pdl can solve the
issue. With the current PR, the nsys timeline changes to:
<img width="1240" height="643" alt="Screenshot 2025-11-04 at 5 51 32 PM"
src="https://github.com/user-attachments/assets/aae63a88-d7cd-4661-8476-6d8c581879b2"
/>
and the unit test passes.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **Bug Fixes**
* Improved synchronization of concurrent operations to ensure proper
execution order and prevent potential timing-related issues.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->;
* fix: support both pip and uv pip for finding flashinfer-python package (flashinfer-ai#2043)

Update getJitIncludeDirs() to try pip first, then fallback to uv pip if
pip is not available. This ensures compatibility with both standard pip
and uv pip package managers when locating the flashinfer-python
installation for JIT compilation include paths.

The command now uses shell OR operator (||) to attempt pip first, and
only falls back to uv pip if the first command fails.
```
pytest -xs tests/moe/test_trtllm_cutlass_fused_moe.py::test_moe_fp8_block_scaling
============================================================================================================================================================ test session starts =============================================================================================================================================================
platform linux -- Python 3.10.12, pytest-8.4.2, pluggy-1.6.0
rootdir: /home/scratch.dmoss_gpu_1/repos/flashinfer
configfile: pytest.ini
collected 1 item                                                                                                                                                                                                                                                                                                                             

tests/moe/test_trtllm_cutlass_fused_moe.py [TensorRT-LLM][INFO] Compiling JIT runtime gemm_swapAB_256_128_128_16_128_2_82_8_1_GroupedWithOffset with options: 
[TensorRT-LLM][INFO] -std=c++17 
[TensorRT-LLM][INFO] --gpu-architecture=sm_90a 
[TensorRT-LLM][INFO] --ptxas-options=-allow-expensive-optimizations=true 
[TensorRT-LLM][INFO] --ptxas-options=--register-usage-level=10 
[TensorRT-LLM][INFO] --diag-suppress=161,174,177,940 
[TensorRT-LLM][INFO] -D__FORCE_INCLUDE_CUDA_FP16_HPP_FROM_FP16_H__=1 
[TensorRT-LLM][INFO] -D__FORCE_INCLUDE_CUDA_BF16_HPP_FROM_BF16_H__=1 
[TensorRT-LLM][INFO] -O3 
[TensorRT-LLM][INFO] -cubin 
[TensorRT-LLM][INFO] --expt-relaxed-constexpr 
[TensorRT-LLM][INFO] --expt-extended-lambda 
[TensorRT-LLM][INFO] --compiler-options=-fPIC,-O3,-Wno-deprecated-declarations,-Wno-abi 
[TensorRT-LLM][INFO] -I/home/scratch.dmoss_gpu_1/repos/flashinfer/flashinfer/data/csrc/nv_internal/tensorrt_llm 
[TensorRT-LLM][INFO] 

[TensorRT-LLM][INFO] Generated kernel code:

#ifdef __CUDACC_RTC__
#ifndef NVRTC_JIT_COMPILATION
#define NVRTC_JIT_COMPILATION
#endif

#include <deep_gemm/nvrtc_std.cuh>

#else

#include <string>
#include <cuda.h>

#endif

#include <cuda_bf16.h>
#include <cuda_fp8.h>
#include <deep_gemm/nvrtc_cutlass.cuh>
#include <deep_gemm/fp8_gemm_impl.cuh>

using namespace deep_gemm;

using SchedulerType =
typename SchedulerSelectorSwapAB<GemmType::GroupedWithOffset, 256, 128, 128, 16, 128, 2, 1>::type;

__global__ void dummy_kernel() {
  void *ptr = (void *)&fp8_gemm_kernel_swapAB<256, 128, 128, 16, 128, 2, 8, 128, 128, 1, SchedulerType, GroupedWithOffsetSchedulerInputSwapAB>;
}

[TensorRT-LLM][INFO] NVCC compilation took 3064 ms
[TensorRT-LLM][INFO] Compilation log:

[TensorRT-LLM][INFO] Successfully copied kernel files to cache directory: /home/dmoss/.tensorrt_llm/cache/gemm_swapAB_256_128_128_16_128_2_82_8_1_GroupedWithOffset
[TensorRT-LLM][INFO] Compiling JIT runtime gemm_swapAB_128_128_128_16_128_2_82_8_1_GroupedWithOffset with options: 
[TensorRT-LLM][INFO] -std=c++17 
[TensorRT-LLM][INFO] --gpu-architecture=sm_90a 
[TensorRT-LLM][INFO] --ptxas-options=-allow-expensive-optimizations=true 
[TensorRT-LLM][INFO] --ptxas-options=--register-usage-level=10 
[TensorRT-LLM][INFO] --diag-suppress=161,174,177,940 
[TensorRT-LLM][INFO] -D__FORCE_INCLUDE_CUDA_FP16_HPP_FROM_FP16_H__=1 
[TensorRT-LLM][INFO] -D__FORCE_INCLUDE_CUDA_BF16_HPP_FROM_BF16_H__=1 
[TensorRT-LLM][INFO] -O3 
[TensorRT-LLM][INFO] -cubin 
[TensorRT-LLM][INFO] --expt-relaxed-constexpr 
[TensorRT-LLM][INFO] --expt-extended-lambda 
[TensorRT-LLM][INFO] --compiler-options=-fPIC,-O3,-Wno-deprecated-declarations,-Wno-abi 
[TensorRT-LLM][INFO] -I/home/scratch.dmoss_gpu_1/repos/flashinfer/flashinfer/data/csrc/nv_internal/tensorrt_llm 
[TensorRT-LLM][INFO] 

[TensorRT-LLM][INFO] Generated kernel code:

#ifdef __CUDACC_RTC__
#ifndef NVRTC_JIT_COMPILATION
#define NVRTC_JIT_COMPILATION
#endif

#include <deep_gemm/nvrtc_std.cuh>

#else

#include <string>
#include <cuda.h>

#endif

#include <cuda_bf16.h>
#include <cuda_fp8.h>
#include <deep_gemm/nvrtc_cutlass.cuh>
#include <deep_gemm/fp8_gemm_impl.cuh>

using namespace deep_gemm;

using SchedulerType =
typename SchedulerSelectorSwapAB<GemmType::GroupedWithOffset, 128, 128, 128, 16, 128, 2, 1>::type;

__global__ void dummy_kernel() {
  void *ptr = (void *)&fp8_gemm_kernel_swapAB<128, 128, 128, 16, 128, 2, 8, 128, 128, 1, SchedulerType, GroupedWithOffsetSchedulerInputSwapAB>;
}

[TensorRT-LLM][INFO] NVCC compilation took 1479 ms
[TensorRT-LLM][INFO] Compilation log:

[TensorRT-LLM][INFO] Successfully copied kernel files to cache directory: /home/dmoss/.tensorrt_llm/cache/gemm_swapAB_128_128_128_16_128_2_82_8_1_GroupedWithOffset
.

============================================================================================================================================================= 1 passed in 9.02s ==============================================================================================================================================================
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved package detection compatibility for alternative package
management tool installations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->;
* use scalar for kv_scale in xqa (flashinfer-ai#2033)

<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Breaking Changes**
* Public xqa/xqa_mla entry points now accept kv_scale as a plain float
(default 1.0) instead of a 1-element tensor. Update call sites
accordingly.

* **Documentation**
  * Docstrings updated to reflect kv_scale as float.

* **Tests**
* Tests updated to pass scalar kv_scale, with added parameterization and
conditional skip for FP8 kv-cache scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>;
* Support cc common check decorator for empty backends (flashinfer-ai#2015)

<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Improved backend/compute-capability validation with clearer errors and
correct fallback when backend-specific checks are absent.

* **New Features**
* Decorated functions expose runtime attributes to query backend
availability and choices.
  * Default-backend behavior: kernels use a default when none is passed.

* **Compatibility**
* Expanded supported compute-capability set and raised minimum cuDNN
package requirements.

* **Tests**
* Added tests for empty-backend common-checks and default-backend
behavior.

* **Chores**
  * Version bumped to 0.5.1.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->;
* perf: Speed up fp4 quantization for small batch with swizzling for cutlass MoE (flashinfer-ai#2025)

<!-- .github/pull_request_template.md -->

## 📌 Description

Performance optimization for `fp4_quantize()` function. The performance
issue was raised in issues flashinfer-ai#1734 and flashinfer-ai#2021

Observed behavior was slow performance when `is_sf_swizzled_layout=True`
(as opposed to False). Root cause of the issue was

* Excessive Padding Overhead: Swizzled layouts require row padding to
tile boundaries where `SWIZZLED_128x4` pads to multiples of 128 rows and
`SWIZZLED_8x4` pads to multiples of 8 rows
* This means `For batch_size=1` with SWIZZLED_128x4: 127 out of 128 rows
are padding (99.2% wasted work)
* Sequential Processing: The original grid launch used grid.x = min(m,
multiProcessorCount * numBlocksPerSM), so:
For batch_size=1: only 1 block launched
* This single block iterated sequentially over all 128 padded rows
* Each padding row still computed scale factors, checked bounds, and
performed conditional logic
* No Fast Path: Every row (real or padding) went through the same
expensive code path with multiple conditional branches

The fix:
1. Kernel-Level Early Exit Fast Path (`quantization.cuh`): Added branch
divergence optimization with separate handling for padding vs. data rows
- Padding rows now execute ~10× fewer instructions; Eliminates memory
loads/stores for input/output data on padding rows; Reduces register
pressure and divergence overhead

2. Host-Level Parallel Grid Launch (`quantization.cu`): Modified grid
calculation to launch blocks proportional to padded rows instead of
actual rows:
- For batch_size=1 with SWIZZLED_128x4: launches up to 128 blocks
instead of 1; Each block processes 1 row in parallel instead of
sequentially; overall tries to achieve full GPU occupancy even with
small batch sizes

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->


`fp4_quantize()` performance before fix:
```
$ python3 bench_fp4_quantize.py 
+------------+---------------------+-------------------------+
| batch size | swizzled_times (us) | non_swizzled_times (us) |
+------------+---------------------+-------------------------+
|    1.0     |        71.52        |          3.136          |
|    2.0     |       37.152        |          3.168          |
|    4.0     |       19.904        |          3.168          |
|    8.0     |       11.296        |           3.2           |
|    16.0    |        7.103        |          3.296          |
|    32.0    |        4.96         |          3.376          |
|    64.0    |        4.128        |          3.487          |
|   128.0    |        3.808        |          3.648          |
|   256.0    |        4.32         |          4.161          |
|   512.0    |        5.472        |          5.184          |
+------------+---------------------+-------------------------+
```
After fix in current PR:
```
$ python3 bench_fp4_quantize.py 
+------------+---------------------+-------------------------+
| batch size | swizzled_times (us) | non_swizzled_times (us) |
+------------+---------------------+-------------------------+
|    1.0     |        3.456        |          3.264          |
|    2.0     |        3.488        |          3.296          |
|    4.0     |        3.536        |          3.296          |
|    8.0     |        3.52         |          3.296          |
|    16.0    |        3.52         |          3.456          |
|    32.0    |        3.696        |          3.488          |
|    64.0    |        3.744        |          3.584          |
|   128.0    |        3.936        |          3.776          |
|   256.0    |        4.384        |          4.288          |
|   512.0    |        5.568        |          5.248          |
+------------+---------------------+-------------------------+
```

where the `bench_fp4_quantize.py` script used to benchmark (adopted from
flashinfer-ai#1734) :
```
from flashinfer.testing.utils import bench_gpu_time_with_cupti
from flashinfer import fp4_quantize
import torch
import numpy as np
import pandas as pd
from tabulate import tabulate

A_scale = torch.randn(16).cuda().float()
bsz = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
swizzled_times = []
for bs in bsz:
    A = torch.randn(bs, 5120).cuda().to(torch.bfloat16)
    t = np.median(bench_gpu_time_with_cupti(
            lambda: fp4_quantize(A, A_scale, is_sf_swizzled_layout=True),
            dry_run_iters = 10, 
            repeat_iters = 100,
            )
        ) * 1000
    swizzled_times.append(t)

non_swizzled_times = []
for bs in bsz:
    A = torch.randn(bs, 5120).cuda().to(torch.bfloat16)
    t = np.median(bench_gpu_time_with_cupti(
        lambda: fp4_quantize(A, A_scale, is_sf_swizzled_layout=False),
            dry_run_iters = 10, 
            repeat_iters = 100,
            )
        ) * 1000
    non_swizzled_times.append(t)


summary_df = pd.DataFrame({
    "batch size": bsz,
    "swizzled_times (us)": swizzled_times,
    "non_swizzled_times (us)": non_swizzled_times,
})

# Round numeric columns to three decimals before printing
summary_df_rounded = summary_df.copy()
summary_df_rounded["batch size"] = summary_df_rounded["batch size"].astype(int)
summary_df_rounded["swizzled_times (us)"] = summary_df_rounded["swizzled_times (us)"].round(3)
summary_df_rounded["non_swizzled_times (us)"] = summary_df_rounded["non_swizzled_times (us)"].round(3)
print(tabulate(summary_df_rounded, headers='keys', tablefmt='pretty', showindex=False))
```

## 🔍 Related Issues

flashinfer-ai#1734 
flashinfer-ai#2021 

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Improved quantization for swizzled memory layouts by adjusting how
effective processing rows are computed to better utilize GPU resources.
* Added early-exit handling for padding-only rows so padding outputs are
zeroed without processing data.
* Ensured consistent zeroing of scale/format outputs for padded columns
across all quantization paths.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->;
* bugfix: fix failed unittest `test_green_ctx` and `test_jit_example` on spark (sm_121) (flashinfer-ai#1951)

<!-- .github/pull_request_template.md -->

## 📌 Description

There are three failed unittests on spark (sm_121):
* tests/utils/test_green_ctx.py
* tests/utils/test_jit_example.py
* tests/utils/test_sampling.py

First one is because spark has small number of SMs (48) and we don't
have a guard on green context splitting.
Second one is an unknown issue (logits don't match with reference) and
probably related to barriers on sm_121, xfail now and will fix later.

The last one will be fixed by another PR from @bkryu , this PR fixes the
first two issues.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Tests now pre-check GPU resources and auto-skip with informative
messages including available and requested SM counts to avoid spurious
failures.
* Added a conditional xfail for GPUs with compute capability 12.1 to
avoid false negatives on that hardware.
* Tightened a sampling test by adding a relative tolerance for more
robust numerical validation.

* **Bug Fixes**
* Improved runtime error handling to surface clearer guidance when GPU
SM resources are insufficient.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>;
* Update Docker CI tags to 20251104-d528f0c (flashinfer-ai#2041)

This PR updates the Docker CI image tags to the latest version:
`20251104-d528f0c`

Updated images:
- flashinfer/flashinfer-ci-cu126:20251104-d528f0c
- flashinfer/flashinfer-ci-cu128:20251104-d528f0c
- flashinfer/flashinfer-ci-cu129:20251104-d528f0c
- flashinfer/flashinfer-ci-cu130:20251104-d528f0c

Auto-generated by [release-ci-docker
workflow](https://github.com/flashinfer-ai/flashinfer/actions/runs/19084098717)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated Docker image tags to latest versions for CUDA 12.6, 12.8,
12.9, and 13.0 distributions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: yzh119 <11773619+yzh119@users.noreply.github.com>;
* test: Mark test_fp8_prefill.py as xfail on SM90 (flashinfer-ai#2038)

<!-- .github/pull_request_template.md -->

## 📌 Description

`test_fp8_prefill.py` is currently failing on SM90, but consumes too
much time to run/fail, causing unit-tests to time out.

--Current PR marks it as xfail so that unit tests can progress
forward.--

Update: Root cause of failure is because mixed precision attention is
not available on `fa3` backend, but the attention prefill wrapper
automatically selects `backend='fa3'` on SM90.

Fix is to explicitly specify the `backend='fa2'` so that fa2 is always
used.

Status after fix:
```
$ pytest tests/attention/test_fp8_prefill.py
=================================================================================================================================================== test session starts ===================================================================================================================================================
...
collected 768 items                                                                                                                                                                                                                                                                                                       

tests/attention/test_fp8_prefill.py ............................................................................................................................................................................................................................................................................... [ 35%]
................................................................................................................................................................................................................................................................................................................... [ 75%]
..............................................................................................................................................................................................                                                                                                                      [100%]
======================================================================================================================================= 768 passed, 1 warning in 131.42s (0:02:11) ========================================================================================================================================

```

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Adjusted FP8/FP16 attention test configuration to explicitly select a
backend during prefill/decoding, stabilizing test behavior across
environments.

* **Public API**
* Constructors now accept an explicit backend parameter to allow
selecting the backend used for KV cache operations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->;
* ci: Update cudnn version requirements in CI container (flashinfer-ai#2039)

<!-- .github/pull_request_template.md -->

## 📌 Description

cuDNN versions specified in CI container setup
(`docker/install/install_python_packages.sh`) are currently 9.11 and
9.12.

In unit testing, this causes issues as `mm_fp4(backend='cudnn')` is not
supported on Spark (sm121) for older cuDNN versions in cu130.

Failure is due to cuDNN version shipped with container being too old. In
the [latest container build pipeline
output](https://github.com/flashinfer-ai/flashinfer/actions/runs/18778064727/job/53577233568#step:6:727),
cudnn 9.13.0.50 is installed
```
flashinfer-ai#16 207.0 Requirement already satisfied: nvidia-cudnn-cu13>=9.12.0.46 in /opt/conda/envs/py312/lib/python3.12/site-packages (9.13.0.50)
flashinfer-ai#16 207.0 Requirement already satisfied: nvidia-cublas in /opt/conda/envs/py312/lib/python3.12/site-packages (from nvidia-cudnn-cu13>=9.12.0.46) (13.0.0.19)
```

Current PR updates the minimum cudnn version for both
[cu12](https://pypi.org/project/nvidia-cudnn-cu12/#history) and
[cu13](https://pypi.org/project/nvidia-cudnn-cu13/#history) to
9.14.0.64.

cudnn 9.13 --> unit test fails with 180 failed, 270 passed, 2790
skipped, 1 warning in 8.97s
```
# pytest tests/gemm/test_mm_fp4.py 
=================================================================================================================================================== test session starts ===================================================================================================================================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /flashinfer
configfile: pytest.ini
collected 3240 items    
...
FAILED tests/gemm/test_mm_fp4.py::test_mm_fp4[mxfp4_alpha-False-True-cudnn-res_dtype1-512-512-256] - cudnn._compiled_module.cudnnGraphNotSupportedError: No valid engine configs for Matmul_MUL_
FAILED tests/gemm/test_mm_fp4.py::test_mm_fp4[mxfp4_alpha-False-True-cudnn-res_dtype1-512-512-512] - cudnn._compiled_module.cudnnGraphNotSupportedError: No valid engine configs for Matmul_MUL_
================================================================================================================================ 180 failed, 270 passed, 2790 skipped, 1 warning in 8.97s =================================================================================================================================

```
cudnn 9.14 --> unit test passes with 450 passed, 2790 skipped, 1 warning
in 5.37s
```
# pytest tests/gemm/test_mm_fp4.py 
=================================================================================================================================================== test session starts ===================================================================================================================================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /flashinfer
configfile: pytest.ini
collected 3240 items                                                                                                                                                                                                                                                                                                      

tests/gemm/test_mm_fp4.py 
...
====================================================================================================================================== 450 passed, 2790 skipped, 1 warning in 5.37s =======================================================================================================================================

```

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated internal dependencies for improved system stability and
compatibility.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->;
* release: Bump version for v0.5.1 release (flashinfer-ai#2031)

<!-- .github/pull_request_template.md -->

## 📌 Description

Update `version.txt`

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
  * Version updated to 0.5.1

<!-- end of auto-generated comment: release notes by coderabbit.ai -->;
* Updated decorator to support unspecified default (flashinfer-ai#2026)

<!-- .github/pull_request_template.md -->

## 📌 Description

Updated decorator to support unspecified default. This was causing
issues when calling mm_fp4 without backend specified.
Also added SM 110 as a supported backend on the cutlass backend (mm_fp4)

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * FP4 Cutlass GEMM now supports the SM110 GPU compute capability.

* **Bug Fixes**
* Kernels called without an explicit backend now consistently use the
default backend.

* **Tests**
* Added a unit test to verify default backend selection and correct
results when backend is omitted.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->;
* test: Enable xfailed trtllm decode long seqlen tests and update microbenchmark (flashinfer-ai#2018)

<!-- .github/pull_request_template.md -->

## 📌 Description


[tests/attention/test_trtllm_gen_attention.py](https://github.com/flashinfer-ai/flashinfer/blob/v0.5.0rc2/tests/attention/test_trtllm_gen_attention.py#L1021-L1076)
was failing and therefore marked xfail.

PR flashinfer-ai#2002 fixed the underlying root cause. Current PR thus removed the
`xfail` marker so that these long seqlen cases could be fixed moving
forward.

Additionally, PR flashinfer-ai#2002 revealed a bug in the microbenchmark script where
[trtllm_batch_decode_with_kv_cache](https://github.com/flashinfer-ai/flashinfer/blob/v0.5.0rc2/flashinfer/decode.py#L2082-L2083)
explicitly requires the workspace to
zhou-yuxin pushed a commit to zhou-yuxin/flashinfer that referenced this pull request Feb 11, 2026
yzh119 pushed a commit that referenced this pull request Feb 25, 2026
<!-- .github/pull_request_template.md -->

## 📌 Description

To fix the following bug:
When the CuteDSL MoE kernels were ported from TensorRT-LLM to
FlashInfer, the mPtrPermutedIdxToExpandedIdx field was accidentally
dropped from the routing kernel's DataBase struct in RoutingKernel.h.
TRT-LLM's routing kernel produces three reverse-mapping outputs:

1. mPtrExpandedIdxToPermutedIdx[expandedIdx] = permutedIdx — forward
mapping
2. mPtrPermutedIdxToExpandedIdx[permutedIdx] = expandedIdx — reverse to
expanded index (token_idx * topk + k)
3. mPtrPermutedIdxToTokenIdx[permutedIdx] = tokenIdx — reverse to token
index only

FlashInfer's port kept only #1 and #3, dropping #2. The binding in
moe_utils_binding.cu then had to wire the Python buffer
permuted_idx_to_expanded_idx to the only available reverse-mapping field
— mPtrPermutedIdxToTokenIdx — which writes plain tokenIdx instead of
expandedIdx.
The Impact
The CuteDSL kernels (GEMM1 gather, moe_output_memset, GEMM2 finalize)
all expect expanded indices and derive the token index via expanded_idx
// topk. When they received plain tokenIdx instead, they computed
tokenIdx // topk — yielding the wrong A row for gather, wrong zero-init
for memset, and wrong scatter position + wrong routing scale for
finalize.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Refined MOE (Mixture of Experts) routing infrastructure by extending
index mapping capabilities across multiple kernel implementations to
improve internal data flow consistency.

* **Tests**
* Strengthened accuracy validation thresholds from 0.925 to 0.97 with
adjusted error tolerance parameters, ensuring more rigorous testing of
MOE operations under FP4 quantization conditions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
ameynaik-hub pushed a commit to ameynaik-hub/flashinfer that referenced this pull request Mar 18, 2026
<!-- .github/pull_request_template.md -->

## 📌 Description

To fix the following bug:
When the CuteDSL MoE kernels were ported from TensorRT-LLM to
FlashInfer, the mPtrPermutedIdxToExpandedIdx field was accidentally
dropped from the routing kernel's DataBase struct in RoutingKernel.h.
TRT-LLM's routing kernel produces three reverse-mapping outputs:

1. mPtrExpandedIdxToPermutedIdx[expandedIdx] = permutedIdx — forward
mapping
2. mPtrPermutedIdxToExpandedIdx[permutedIdx] = expandedIdx — reverse to
expanded index (token_idx * topk + k)
3. mPtrPermutedIdxToTokenIdx[permutedIdx] = tokenIdx — reverse to token
index only

FlashInfer's port kept only flashinfer-ai#1 and flashinfer-ai#3, dropping flashinfer-ai#2. The binding in
moe_utils_binding.cu then had to wire the Python buffer
permuted_idx_to_expanded_idx to the only available reverse-mapping field
— mPtrPermutedIdxToTokenIdx — which writes plain tokenIdx instead of
expandedIdx.
The Impact
The CuteDSL kernels (GEMM1 gather, moe_output_memset, GEMM2 finalize)
all expect expanded indices and derive the token index via expanded_idx
// topk. When they received plain tokenIdx instead, they computed
tokenIdx // topk — yielding the wrong A row for gather, wrong zero-init
for memset, and wrong scatter position + wrong routing scale for
finalize.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Refined MOE (Mixture of Experts) routing infrastructure by extending
index mapping capabilities across multiple kernel implementations to
improve internal data flow consistency.

* **Tests**
* Strengthened accuracy validation thresholds from 0.925 to 0.97 with
adjusted error tolerance parameters, ensuring more rigorous testing of
MOE operations under FP4 quantization conditions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants