
Fix unordered map crash with TRTLLM MoE kernels#2695

Closed
danisereb wants to merge 5 commits into flashinfer-ai:main from danisereb:fix_unordered_map_crash

Conversation

@danisereb (Contributor) commented Mar 5, 2026

📌 Description

This PR aims to fix a bug in the TRTLLM MoE kernels.

The bug was discovered in flashinfer v0.6.5, following a fix that prevents fallback to the default autotuner tactic:
#2617

Bug Description

The Python AutoTuner profiles MoE kernels by creating input tensors at power-of-2 "bucket" sizes (1, 2, 4, ..., 4096) and benchmarking different kernel configurations (tactics) for each bucket.

The buckets are generated by the Python function get_last_power_of_2_num_tokens_buckets, which is called in MoERunner.refine_tuning_config.
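As a rough illustration, the bucket list can be sketched as follows (a hypothetical stand-in for get_last_power_of_2_num_tokens_buckets; the real flashinfer helper may differ in signature and details):

```python
def power_of_2_buckets(max_tokens: int = 4096) -> list[int]:
    """Sketch of power-of-2 bucket generation: 1, 2, 4, ..., max_tokens.

    Hypothetical reconstruction of get_last_power_of_2_num_tokens_buckets;
    not the actual flashinfer code.
    """
    buckets, n = [], 1
    while n <= max_tokens:
        buckets.append(n)
        n *= 2
    return buckets
```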

MoERunner.refine_tuning_config is in turn called by:

  • trtllm_fp4_block_scale_moe_op
  • trtllm_fp8_block_scale_moe_op
  • trtllm_bf16_moe_op
  • etc.

Each tactic is a pair [tile_N, config] where tile_N is a tile size for batching tokens across experts and config is a specific kernel variant for that tile. The autotuner picks the fastest tactic per bucket and caches it, keyed by the bucketed input shapes.

During inference, the actual num_tokens (e.g., 1624) is rounded down to the nearest power of 2 (1024) to look up the cached tactic (last_positive_power_of_2), which is then passed as two integers to the C++ launcher.
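The round-down lookup can be sketched like this (a minimal stand-in for last_positive_power_of_2; the actual helper lives in the flashinfer autotuner):

```python
def last_positive_power_of_2(n: int) -> int:
    """Round a positive int down to the nearest power of 2 (sketch of the
    cache-key computation described above; not the actual flashinfer code)."""
    assert n > 0
    return 1 << (n.bit_length() - 1)

# e.g. an actual num_tokens of 1624 looks up the tactic cached for bucket 1024
```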

The C++ launcher is located in file csrc/trtllm_fused_moe_kernel_launcher.cu.
For NVFP4, the relevant function is trtllm_fp4_block_scale_moe and the two integers that represent the tactic are Array<int64_t> config_index.

The config_index is the only tactic-related data passed from Python to the C++ code.
Python's full autotune search space and results (candidate tactic list, timing data, ranking, cache internals) are not transferred to C++ as a structure.

The C++ MoE kernel launcher receives the actual tensors and the tactic, then independently computes which tile sizes are appropriate for the actual num_tokens using function computeSelectedTileN.

This function calculates the average tokens per expert, rounds up to the next power of 2, and selects a small neighborhood of tiles around that value. It then builds launcher objects only for those selected tiles in an unordered_map, and looks up the tactic's tile_N in that map.
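Mirrored in Python, the tile selection described above looks roughly like this (a sketch based on the description; the supported-tile list and the exact neighborhood rules in the real computeSelectedTileN may differ):

```python
def next_power_of_two(x: float) -> int:
    """Smallest power of 2 >= x."""
    n = 1
    while n < x:
        n *= 2
    return n

def compute_selected_tile_n(supported, num_tokens, top_k, num_experts):
    """Python mirror (sketch) of the C++ tile selection: round the average
    tokens-per-expert up to a power of 2, then keep a small neighborhood
    of supported tiles around that value (capped at 256)."""
    avg_tokens_per_expert = num_tokens * top_k / num_experts
    center = min(next_power_of_two(avg_tokens_per_expert), 256)
    # keep the tile below the rounded average, the average itself,
    # and the next couple of sizes above it
    neighborhood = {center // 2, center, center * 2, center * 4}
    return sorted(t for t in supported if t in neighborhood and t <= 256)
```

With top_k=22 and num_experts=512, num_tokens=2048 yields the tile set [64, 128, 256] under this sketch, while num_tokens=3583 yields only [128, 256].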

The conflict arises because Python rounds num_tokens down for bucketing while C++ rounds the derived average up for tile selection, and these are applied to different values (raw num_tokens vs. num_tokens * top_k / num_experts).
A tactic cached for the smaller bucketed num_tokens tends to favor smaller tiles, while the C++ launcher for the larger actual num_tokens selects larger tiles and excludes the small ones from its map. When the cached tile_N is not in the C++ launcher's map, unordered_map::at throws, crashing the process.

The fix builds launchers for all supported tiles so that any tactic the autotuner returns is always found.

Note

Using the same rounding direction in both Python and C++ does not solve the problem because they round different values. The Python autotuner rounds num_tokens directly (via last_positive_power_of_2) to compute a cache bucket, while the C++ launcher rounds a derived value, num_tokens * top_k / num_experts (the average tokens per expert), to select tile sizes.

Even if both used the same rounding function, the derived average for the bucketed num_tokens and the actual num_tokens can land on different sides of a power-of-2 boundary whenever top_k / num_experts is not itself a power of 2.
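A quick check (using top_k=22 and num_experts=512 as an illustrative non-power-of-2 ratio) shows two token counts from the same bucket landing on different sides of a power-of-2 boundary:

```python
def round_up_pow2(x: float) -> int:
    """Smallest power of 2 >= x."""
    n = 1
    while n < x:
        n *= 2
    return n

top_k, num_experts = 22, 512
bucketed, actual = 2048, 3583          # both fall in the bucket [2048, 4095]

avg_bucketed = bucketed * top_k / num_experts   # 88.0
avg_actual = actual * top_k / num_experts       # ~153.96

# The derived averages round up to different powers of 2, so the C++
# launcher would select different tile neighborhoods for the two counts.
print(round_up_pow2(avg_bucketed), round_up_pow2(avg_actual))  # 128 256
```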

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

@coderabbitai bot commented Mar 5, 2026

Review skipped: draft detected.

@gemini-code-assist (bot)
Summary of Changes


This pull request addresses a critical crash in TRTLLM MoE kernels that occurred due to a discrepancy in how tile sizes were determined between the Python autotuner and the C++ launcher. The autotuner, after a previous fix, could cache tactics with tile_N values that the C++ launcher, using a different rounding logic for num_tokens, might not include in its unordered_map of available launchers, leading to a runtime error. The core change involves expanding the C++ launcher's tile selection to encompass all supported tile sizes, thereby guaranteeing that any valid cached tactic can be successfully looked up. A comprehensive regression test has been added to validate this fix.

Highlights

  • Fix for unordered_map::at crash: Resolved a RuntimeError: unordered_map::at crash in TRTLLM MoE kernels caused by a mismatch between the Python autotuner's cached tile_N and the C++ launcher's selected tile set.
  • C++ Launcher Logic Update: Modified the C++ kernel launcher to build launchers for all supported tile sizes, ensuring that any tile_N returned by the autotuner is always found in the launcher's map.
  • Default Tactic Handling: Adjusted the default case handling for tile_N and config in the C++ launcher to re-compute tile_N using computeSelectedTileN and explicitly set config to -1 when a default tactic is needed.
  • New Regression Test: Introduced a new Python regression test (test_autotuner_tile_mismatch.py) to specifically reproduce and verify the fix for the tile mismatch bug, including an SM100 integration test.


Changelog
  • csrc/trtllm_fused_moe_kernel_launcher.cu
    • Updated tile selection logic in trtllm_bf16_moe, trtllm_fp8_per_tensor_scale_moe, trtllm_fp8_block_scale_moe, trtllm_fp4_block_scale_moe, and trtllm_mxint4_block_scale_moe functions to initialize selected_tile_nums with all supported tiles instead of a subset from computeSelectedTileN.
    • Modified default tile_N and config handling to re-evaluate computeSelectedTileN for tile_N and set config to -1, allowing the runner to choose a default configuration.
  • tests/autotuner/test_autotuner_tile_mismatch.py
    • Added a new test file to verify the fix for the unordered_map::at crash in MoE kernels.
    • Included helper functions _next_power_of_two and compute_selected_tile_n to mirror C++ tile selection logic in Python.
    • Implemented TileAwareDummyRunner to simulate MoE runner behavior for testing.
    • Added tests to confirm tile set differences between bucketed and actual num_tokens, autotuner cache behavior, and an SM100 integration test for bf16_moe.
Activity
  • The pull request author identified a latent bug exposed by a previous fix related to autotuner bucketing, where cache hits started occurring, leading to the unordered_map::at crash.
  • The author implemented changes in the C++ kernel launcher to ensure all supported tile configurations are available.
  • A new regression test was added to validate the fix and prevent future regressions, including an integration test for SM100 architectures.

@gemini-code-assist bot left a comment

Code Review

This pull request addresses a critical crash in the TRTLLM MoE kernels caused by a mismatch in tile selection logic between the Python autotuner and the C++ kernel launcher. While the fix correctly prevents the unordered_map::at exception by creating launchers for all supported tiles, it introduces a potential Denial of Service vulnerability. The code still uses launchers_map.at(tile_N) without verifying its existence in the launchers_map, which can lead to a process crash if an invalid tactic is provided. This critical check is missing in 5 instances. Additionally, there is a suggestion to refactor duplicated code in csrc/trtllm_fused_moe_kernel_launcher.cu to improve maintainability.

@danisereb danisereb changed the title Fix unordered map crash with TRTLLM MoE kernels Fix unordered map crash with TRTLLM MoE kernels (v0.6.5) Mar 5, 2026
@danisereb danisereb changed the title Fix unordered map crash with TRTLLM MoE kernels (v0.6.5) Fix unordered map crash with TRTLLM MoE kernels Mar 5, 2026
@aleozlx aleozlx added the v0.6.6 release blocker and op: moe labels Mar 5, 2026
@aleozlx aleozlx self-assigned this Mar 5, 2026
@danisereb (Contributor, Author)

This PR is still WIP and should not be a blocker for v0.6.6.

@aleozlx aleozlx removed the v0.6.6 release blocker label Mar 5, 2026
@aleozlx aleozlx mentioned this pull request Mar 5, 2026
aleozlx pushed a commit that referenced this pull request Mar 6, 2026

## 📌 Description

PR #2617 added a fix that solves "using fallback tactic" for TRTLLM MoE
kernels.

But after running more tests (`lm_eval`) with flashinfer v0.6.5 another
issue was found -
an error from C++ file `csrc/trtllm_fused_moe_kernel_launcher.cu` (key
not found in `launchers_map.at(tile_N)`).

Fixing this is probably not simple, more details in this draft PR
(**NOT** for v0.6.6):
#2695

In order to prevent the crash, the change in `_find_nearest_profile`
will be reverted (to match flashinfer v0.6.4).

The relevant AutoTuner tests were marked with "skip":
```
tests/autotuner/test_autotuner_core.py::test_find_nearest_profile_moe_shared_num_tokens_axis[1000-512] SKIPPED (_find_nearest_profile linked-dimension mapping was reverted;...)
tests/autotuner/test_autotuner_core.py::test_find_nearest_profile_moe_shared_num_tokens_axis[4000-2048] SKIPPED (_find_nearest_profile linked-dimension mapping was reverted...)
tests/autotuner/test_autotuner_core.py::test_find_nearest_profile_moe_shared_num_tokens_axis[8000-4096] SKIPPED (_find_nearest_profile linked-dimension mapping was reverted...)
tests/autotuner/test_autotuner_core.py::test_find_nearest_profile_moe_shared_num_tokens_axis[12000-8192] SKIPPED (_find_nearest_profile linked-dimension mapping was reverte...)
tests/autotuner/test_autotuner_core.py::test_find_nearest_profile_moe_same_bucket_same_profile SKIPPED (_find_nearest_profile linked-dimension mapping was reverted; re-enab...)
tests/autotuner/test_autotuner_core.py::test_find_nearest_profile_maps_all_linked_dims SKIPPED (_find_nearest_profile linked-dimension mapping was reverted; re-enable when ...)
```

The rest of the AutoTuner tests all pass:
```
pytest --tb short  tests/autotuner/
================================================================================= test session starts ==================================================================================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0
rootdir: /my_home/workspace/dani_flashinfer
configfile: pytest.ini
plugins: anyio-4.12.1
collected 39 items                                                                                                                                                                     

tests/autotuner/test_autotuner_bmm_fp8.py ............                                                                                                                           [ 30%]
tests/autotuner/test_autotuner_core.py ...........ssssss..........                                                                                                               [100%]

============================================================================ 33 passed, 6 skipped in 0.95s =============================================================================
```

**Using this branch, the failure from
`trtllm_fused_moe_kernel_launcher.cu` does not happen.**

**vLLM main still uses flashinfer v0.6.4 (that does not include PR
#2617).**

**This change should be included in flashinfer v0.6.6 (for use by
vLLM).**


## 🔍 Related Issues


## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes




## Summary by CodeRabbit

* **Tests**
* Temporarily disabled three autotuner tests pending restoration of
linked-dimension bucket propagation functionality. Tests will be
re-enabled once related features are restored.

@IwakuraRein (Collaborator)

Is it possible to just include all possible launchers in the launchers_map? I.e.,

for (int32_t curr_tile_N : selected_tile_nums) => for (int32_t curr_tile_N : mSupportedTileNums)

Assuming the overhead of instantiating a launcher is small.

@IwakuraRein (Collaborator)

The root cause of the tileN limitation in the trtllm-gen MoE launcher is that when tuning the trtllm-gen MoE, the search space is the product of all FC1 and FC2 kernels. To prevent the autotuner from running forever, we limited the available tileN values in the tuning.

In hindsight, a better approach would be to expose TrtllmGenBatchedGemmRunner to Python so that FC1 and FC2 can be tuned independently.

@danisereb (Contributor, Author)

@IwakuraRein if you have a better fix I don't mind if you open a PR (and I can close this PR).

I don't have a lot of context with TRTLLM code, so instead I chose to merge this fix:
#2697

@amitz-nv (Contributor) commented Mar 15, 2026

I'm currently working on a fix.

Explanation of the problem:

Different sets of values may be returned from computeSelectedTileN for different num_tokens values in the same range.

For example:
Nemotron 3 Super - topK=22, num_experts=512
Let's look at the range [2048, 4095], with NVFP4:

  • num_tokens=2048:
num_tokens*topK/num_experts = 2048*22/512 = 88

rounds up to 128, so it would return (64, 128, 256)

  • num_tokens=3583:
num_tokens*topK/num_experts = 3583*22/512 = 153.957

rounds up to 256, so it would return (128, 256).

If tileN=64 was found to be the fastest, the autotuner would call the CPP with tileN=64 for any num_tokens in the range [2048, 4095]. When given num_tokens=3583 (or any num_tokens > 2978), launcher_map.at(tileN=64) would crash.

So I think the main question here is: how do we want to handle (in this example) any num_tokens > 2978?
Assume that the tileN chosen for num_tokens=2048 is still the best, and simply add support for that tileN for any num_tokens in the autotuned range [2048, 4095]?
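The 2978 crossover quoted above can be checked with a small script (same round-up-to-power-of-2 rule as described; purely illustrative):

```python
def round_up_pow2(x: float) -> int:
    """Smallest power of 2 >= x."""
    n = 1
    while n < x:
        n *= 2
    return n

TOP_K, NUM_EXPERTS = 22, 512

# Largest num_tokens in the bucket [2048, 4095] whose average tokens per
# expert still rounds up to 128 rather than 256.
crossover = max(n for n in range(2048, 4096)
                if round_up_pow2(n * TOP_K / NUM_EXPERTS) == 128)
print(crossover)  # 2978
```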

frankwang28 pushed a commit to frankwang28/flashinfer that referenced this pull request Mar 18, 2026
ameynaik-hub pushed a commit to ameynaik-hub/flashinfer that referenced this pull request Mar 18, 2026
aleozlx pushed a commit that referenced this pull request Mar 20, 2026
…e launchers for all supported tileN in trtllm fused MoE (#2821)

## 📌 Description

It fixes two autotuner-related bugs:
1. Restores the autotuner fix that was reverted in
#2697
2. Fixes the issue that
#2697 revealed: the trtllm fused MoE kernel launcher crashes when it
receives a tileN that is supported but was filtered out by
`computeSelectedTileN`. The fix creates kernel launchers for all
supported tileN values.

This PR continues the work in
#2695 by @danisereb to
revert bugfix 1 and to fix bug 2.

More technical details:
### Bug 1:
When given a num_tokens that isn't a power of 2, the autotuner (Python
side) fails to find the appropriate entry in its cache, so it falls
back to the default, which means passing `[-1, -1]` as the
`(tileN, tactic)` to the C++ side.
It was fixed in [this
PR](https://github.com/flashinfer-ai/flashinfer/pull/2617/changes#diff-1964ab957d8185d04b0d5f0cb02d0c7c0a3260ac0a6c573167af6875ab0b0e87L729-L734)
but soon after merge, it was reverted
[here](#2697), as it
exposed the next bug.

### Bug 2 (exposed after fixing bug 1):
Crash in the fused MoE kernel launcher on the forward pass for some
values of num_tokens. The crash is at `launchers_map.at(tile_N)` in
`trtllm_fused_moe_kernel_launcher.cu`. It happens because:
The Python side of the autotuner profiles num_tokens values that are
powers of 2, and each such value represents the range up to the next
power of 2, e.g. the profile for the range `[2048, 4095]` is done at
num_tokens=2048.

The `computeSelectedTileN` function in `trtllm_fused_moe_kernel_launcher.cu`
reduces the set of supported tileN values (to shrink the autotuner's
search space) by choosing specific entries from the sorted list of
supported tileN: `roundUpToPowerOfTwo(num_tokens * topK / numExperts)`,
the value below it, and the next 2 values (max value is 256). So
num_tokens values in the same range can get different sets of tileN
values.
For example, on Nemotron 3 Super NVFP4:
- `num_tokens=2048` -> `2048*22/512 = 88`, which rounds up to 128, so
the tileN set is `(64, 128, 256)`
- `num_tokens=3003` -> `3003*22/512 = 129.03`, which rounds up to 256,
so the tileN set is `(128, 256)`
In case `tileN=64` was found to be the fastest on `num_tokens=2048` for
range `[2048, 4095]`, when given `num_tokens=3003`, the python side
would pass `[64, someTactic]` to the CPP, but for `num_tokens=3003`,
there's no launcher for `tileN=64` as `computeSelectedTileN` filtered it
out.


## 🔍 Related Issues



## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes



## Summary by CodeRabbit

* **Bug Fixes**
* Stricter MoE tile validation and ensured all supported tiles are
available at launch to avoid missing kernel configurations.
* Autotuner mapping for linked dynamic dimensions now yields consistent
cached bucket values.

* **Tests**
* Added SM100 MoE autotuner integration tests (including
invalid-cached-tactic checks).
* Re-enabled and expanded autotuner unit tests and added a test utility
to reset the autotuner.

---------

Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Co-authored-by: Daniel Serebrenik <daserebrenik@nvidia.com>
@danisereb (Contributor, Author)

This PR fixed the issue:
#2821

So my PR is no longer needed.

@danisereb danisereb closed this Mar 22, 2026