fix: Autotuner _find_nearest_profile non-power-of-2 num_tokens, create launchers for all supported tileN in trtllm fused MoE (#2821)
Conversation
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
…lashinfer-ai#2617 Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
…est of supporting tileN that was filtered out by computeSelectedTileN Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
No actionable comments were generated in the recent review. 🎉
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough
Validated MoE tile selection, added helpers to resolve default tile/config, and expanded launcher construction to include all supported tiles. Adjusted autotuner bucket mapping to propagate a single mapped bucket across linked dimensions. Added/reset autotuner test utilities and new SM100 MoE integration tests.
🚥 Pre-merge checks: 2 passed, 1 failed (warning)
Summary of Changes (Gemini Code Assist)
This pull request significantly enhances the robustness and correctness of the autotuner and fused Mixture-of-Experts (MoE) kernel launchers. It rectifies a long-standing problem where the autotuner's cache lookup for non-power-of-2 …
Code Review
This pull request effectively addresses two bugs related to the autotuner. The first fix correctly handles non-power-of-2 num_tokens for cache lookups by ensuring all linked dimensions are mapped to the same bucket. The second fix prevents crashes in the fused MoE kernel launcher by creating launchers for all supported tileN values, rather than a filtered subset. The C++ changes also improve robustness by using find instead of at for map lookups and encapsulating fallback logic. The accompanying tests, including new integration tests, are comprehensive. I've found one issue in a new test case and provided a suggestion to fix it.
```python
tune_inputs = [torch.empty((bucket_start, hidden_size), dtype=torch.float32)]
tuning_config = TuningConfig(
    dynamic_tensor_specs=(
        DynamicTensorSpec(
            input_idx=(0,),
            dim_idx=(0,),
            gen_tuning_buckets=tuning_buckets,
            map_to_tuning_buckets=lambda x: min(
                last_positive_power_of_2(x), tune_max
            ),
        ),
    ),
)
with autotune(tune_mode=True):
    tuner.choose_one("test_same_bucket", [runner], tuning_config, tune_inputs)
```
The test logic for tuning is flawed. It only populates the autotuner cache for a single bucket (for num_tokens in [512, 1024)), but the subsequent inference checks assert expected tactics for three different buckets. The checks for buckets that were not tuned will fail because they will result in a cache miss and receive the fallback tactic, not the expected one.
To fix this, the tuning step should be performed for a representative num_tokens from each of the three bucket ranges being tested to ensure the cache is populated correctly before the inference-time checks.
```python
tuning_config = TuningConfig(
    dynamic_tensor_specs=(
        DynamicTensorSpec(
            input_idx=(0,),
            dim_idx=(0,),
            gen_tuning_buckets=tuning_buckets,
            map_to_tuning_buckets=lambda x: min(
                last_positive_power_of_2(x), tune_max
            ),
        ),
    ),
)
with autotune(tune_mode=True):
    # Tune for a representative num_tokens from each bucket range defined in fake_profile
    # to populate the cache correctly for the inference checks below.
    for tune_tokens in [bucket_start // 2, bucket_start, bucket_end]:
        tune_inputs = [torch.empty((tune_tokens, hidden_size), dtype=torch.float32)]
        tuner.choose_one("test_same_bucket", [runner], tuning_config, tune_inputs)
```
…ame_cached_tactic Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py`:
- Around line 221-253: The loop over supported_tile_n_values is not validating
each tile because trtllm_get_valid_moe_configs/computeSelectedTileN may not
include some tile_n and AutoTuner.choose_one consults AutoTuner.profiling_cache,
so later iterations reuse a cached tactic; fix by forcing a fresh profiling per
tile: in the loop, monkeypatch AutoTuner._profile_single_kernel as you already
do, then clear or overwrite AutoTuner.get().profiling_cache (or call
AutoTuner.get().clear_cache()) at the start of each iteration to avoid cache
hits, and after tuning assert that the cached tactic chosen has the expected
tile_n (inspect the cached entry’s tile_n) or alternatively bypass choose_one by
injecting the target [tile_n, config] directly into AutoTuner.profiling_cache
before calling _run_bf16_moe_infer so each iteration exercises and verifies the
requested tile_n.
In `@tests/autotuner/utils.py`:
- Around line 4-9: reset_autotuner currently clears cache, statistics and sets
is_tuning_mode but does not reset the internal counter that autotune() uses;
update the helper (reset_autotuner / AutoTuner.get()) to also reset the
AutoTuner._active_tuning_contexts counter back to zero (or its default empty
state) so that autotune() will not incorrectly derive tuning mode from leftover
contexts; locate the AutoTuner instance via AutoTuner.get(), set its
_active_tuning_contexts to the appropriate empty/zero value, then return the
tuner.
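The cache-reset pattern suggested in the first finding can be illustrated with a minimal, self-contained sketch. The `FakeAutoTuner` class below is a stand-in invented for this example (only `profiling_cache`, `clear_cache`, and a toy `choose_one` are modeled); it is not the actual flashinfer `AutoTuner` API.

```python
# Illustrative stand-in for a singleton autotuner whose choose_one() consults
# a profiling cache, as described in the review finding above.
class FakeAutoTuner:
    _instance = None

    def __init__(self):
        self.profiling_cache = {}

    @classmethod
    def get(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def clear_cache(self):
        self.profiling_cache.clear()

    def choose_one(self, key, tile_n):
        # A cache hit returns the previously profiled tactic; otherwise
        # "profile" with the current tile_n and cache the result.
        if key not in self.profiling_cache:
            self.profiling_cache[key] = tile_n
        return self.profiling_cache[key]


chosen_without_reset, chosen_with_reset = [], []

for tile_n in (64, 128, 256):
    chosen_without_reset.append(FakeAutoTuner.get().choose_one("moe", tile_n))

for tile_n in (64, 128, 256):
    tuner = FakeAutoTuner.get()
    tuner.clear_cache()  # fresh profiling per tile, as the review suggests
    chosen_with_reset.append(tuner.choose_one("moe", tile_n))

print(chosen_without_reset)  # every iteration reuses the tactic cached for tile 64
print(chosen_with_reset)     # each iteration actually exercises its own tile
```

Without the reset, the loop silently validates only the first tile; with it, each iteration exercises its own `tile_n`.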
ℹ️ Review info
Configuration used: defaults | Review profile: CHILL | Plan: Pro | Run ID: 960704f7-b832-4d0c-a83f-dadd2535665e
📒 Files selected for processing (5)
- csrc/trtllm_fused_moe_kernel_launcher.cu
- flashinfer/autotuner.py
- tests/autotuner/test_autotuner_core.py
- tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py
- tests/autotuner/utils.py
```python
def reset_autotuner() -> AutoTuner:
    tuner = AutoTuner.get()
    tuner.clear_cache()
    tuner.reset_statistics()
    tuner.is_tuning_mode = False
    return tuner
```
Reset _active_tuning_contexts in the shared helper.
autotune() derives is_tuning_mode from _active_tuning_contexts, so leaving that counter untouched can leak tuning mode across tests even after this helper forces is_tuning_mode = False.
🛠 Suggested fix

```diff
 def reset_autotuner() -> AutoTuner:
     tuner = AutoTuner.get()
-    tuner.clear_cache()
-    tuner.reset_statistics()
-    tuner.is_tuning_mode = False
+    with tuner._lock:
+        tuner.clear_cache()
+        tuner.reset_statistics()
+        tuner._active_tuning_contexts = 0
+        tuner.is_tuning_mode = False
     return tuner
```
IwakuraRein left a comment:
LGTM. Thanks for the fix.
…otuner before every tileN Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
/bot run
[SUCCESS] Pipeline #46565266: 14/20 passed
📌 Description
This PR fixes two autotuner-related bugs:

1. The autotuner (Python side) failing to find a cache entry for non-power-of-2 num_tokens.
2. A crash in the fused MoE kernel launcher when the autotuner requests a tileN that was filtered out by computeSelectedTileN, fixed by creating kernel launchers for all supported tileN values.

This PR continues the work in #2695 by @danisereb to revert bugfix 1 and to fix bug 2.
More technical details:

Bug 1:
When given a num_tokens that isn't a power of 2, the autotuner (Python side) fails to find its appropriate entry in the autotuner cache, so it falls back to passing the default, which means passing [-1, -1] as the (tileN, tactic) to the C++ side.
It was fixed in this PR but, soon after merge, it was reverted here, as it exposed the next bug.
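To make bug 1 concrete, here is a hedged sketch of why the unmapped cache lookup misses. `last_positive_power_of_2` mirrors the helper used in the tests quoted above; the cache layout and the `(64, 3)` entry are made up for illustration and are not flashinfer's actual data structures.

```python
def last_positive_power_of_2(x: int) -> int:
    # Largest power of 2 that is <= x (assumes x >= 1).
    p = 1
    while p * 2 <= x:
        p *= 2
    return p

# During tuning only power-of-2 token counts are profiled, so the cache is
# keyed by bucket; the (tileN, tactic) value is illustrative.
profiling_cache = {2048: (64, 3)}

def lookup(num_tokens: int):
    # Bug 1: looking up the raw num_tokens (e.g. 3003) misses the cache and
    # falls back to the default (-1, -1); mapping to the bucket first hits.
    bucket = last_positive_power_of_2(num_tokens)
    return profiling_cache.get(bucket, (-1, -1))

print(profiling_cache.get(3003, (-1, -1)))  # unmapped lookup: cache miss -> (-1, -1)
print(lookup(3003))                         # bucket 2048 covers [2048, 4095] -> (64, 3)
```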
Bug 2 (exposed after fixing bug 1):
A crash in the fused MoE kernel launcher on the forward pass for some values of num_tokens. The crash is at launchers_map.at(tile_N) in trtllm_fused_moe_kernel_launcher.cu. It happens because:

- The Python side of the autotuner profiles num_tokens values that are powers of 2, and each such value represents the range up to the next power of 2. E.g., the profile for the range [2048, 4095] is done at num_tokens=2048.
- The computeSelectedTileN function in trtllm_fused_moe_kernel_launcher.cu reduces the set of supported tileN values (to shrink the autotuner's search space) by choosing specific values from the sorted list of supported tileN: roundUpToPowerOfTwo(num_tokens * topK / numExperts), its previous value, and its next two values (the max value is 256). So num_tokens values in the same range can get different sets of tileN values.

For example, on Nemotron 3 Super NVFP4 (topK=22, numExperts=512):

- num_tokens=2048 → 2048 * 22 / 512 = 88, which rounds up to 128, so the tileN set is (64, 128, 256)
- num_tokens=3003 → 3003 * 22 / 512 = 129.03, which rounds up to 256, so the tileN set is (128, 256)
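The selection rule and the two worked examples above can be reproduced with a short sketch. The SUPPORTED_TILE_N list and the Python helper names here are assumptions for illustration; the real logic is the C++ computeSelectedTileN in trtllm_fused_moe_kernel_launcher.cu.

```python
import math

SUPPORTED_TILE_N = [8, 16, 32, 64, 128, 256]  # assumed sorted list of supported tiles

def round_up_to_power_of_two(x: float) -> int:
    return 1 << math.ceil(math.log2(x))

def compute_selected_tile_n(num_tokens: int, top_k: int, num_experts: int) -> list:
    # roundUpToPowerOfTwo(num_tokens * topK / numExperts), capped at 256...
    target = min(round_up_to_power_of_two(num_tokens * top_k / num_experts), 256)
    idx = SUPPORTED_TILE_N.index(target)
    # ...plus its previous value and its next two values, clamped to the list.
    return SUPPORTED_TILE_N[max(idx - 1, 0): idx + 3]

# Nemotron 3 Super NVFP4: topK=22, numExperts=512
print(compute_selected_tile_n(2048, 22, 512))  # -> [64, 128, 256]
print(compute_selected_tile_n(3003, 22, 512))  # -> [128, 256]
```

Both calls fall in the same profiling bucket [2048, 4095], yet produce different tileN sets, which is exactly the mismatch described next.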
If tileN=64 was found to be the fastest at num_tokens=2048 for the range [2048, 4095], then when given num_tokens=3003 the Python side would pass [64, someTactic] to the C++ side, but for num_tokens=3003 there is no launcher for tileN=64, since computeSelectedTileN filtered it out.

🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed pre-commit by running pip install pre-commit (or used my preferred method).
- I have installed the hooks with pre-commit install.
- I have run the hooks with pre-commit run --all-files and fixed any reported issues.

🧪 Tests
- Tests have been added/updated as needed (unittest, etc.).

Reviewer Notes
Summary by CodeRabbit
Bug Fixes
- Fixed autotuner bucket mapping for non-power-of-2 num_tokens and expanded fused MoE launcher construction to cover all supported tileN values.

Tests
- Added shared autotuner test utilities and new SM100 MoE integration tests.