
fix: Autotuner _find_nearest_profile non-power-of-2 num_tokens, create launchers for all supported tileN in trtllm fused MoE#2821

Merged
aleozlx merged 12 commits into flashinfer-ai:main from amitz-nv:fix_unordered_map_crash
Mar 20, 2026

Conversation

@amitz-nv
Contributor

@amitz-nv amitz-nv commented Mar 19, 2026

📌 Description

This PR fixes two autotuner-related bugs:

  1. Re-apply the autotuner fix that was reverted in "Undo fix to AutoTuner find_nearest_profile" (#2697).
  2. Fix the issue that #2697 revealed: the trtllm fused MoE kernel launcher crashes when it receives a tileN value that is supported but was filtered out by computeSelectedTileN. The fix creates kernel launchers for all supported tileN values.

This PR continues the work in #2695 by @danisereb, reinstating bugfix 1 and fixing bug 2.

More technical details:

Bug 1:

When given a num_tokens that isn't a power of 2, the autotuner (Python side) fails to find the corresponding entry in the autotuner cache, so it falls back to the default, passing [-1, -1] as the (tileN, tactic) to the C++ side.
This was fixed in an earlier PR, but soon after merge it was reverted in #2697, as it exposed the next bug.
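The broken lookup can be sketched in a few lines of Python (a minimal sketch with hypothetical names — `last_positive_power_of_2` and the cache layout are assumptions for illustration, not flashinfer's actual code):

```python
def last_positive_power_of_2(x: int) -> int:
    # Largest power of 2 that is <= x (for x >= 1); e.g. 3003 -> 2048.
    return 1 << (x.bit_length() - 1)

# Hypothetical cache keyed by the profiled power-of-2 bucket.
profiling_cache = {2048: (64, 7)}  # bucket -> (tileN, tactic)

num_tokens = 3003  # not a power of 2

# Buggy lookup: 3003 is not a cache key, so the caller falls back to [-1, -1].
assert profiling_cache.get(num_tokens) is None

# Fixed lookup: map to the bucket first; all of [2048, 4095] maps to 2048.
bucket = last_positive_power_of_2(num_tokens)
assert (bucket, profiling_cache[bucket]) == (2048, (64, 7))
```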

Bug 2 (exposed after fixing bug 1):

A crash occurs in the fused MoE kernel launcher on the forward pass for some values of num_tokens, at launchers_map.at(tile_N) in trtllm_fused_moe_kernel_launcher.cu. It happens because:
The Python side of the autotuner profiles num_tokens values that are powers of 2, and each such value represents the range up to the next power of 2.
e.g.: the profile for the range [2048, 4095] is done with num_tokens=2048.

The computeSelectedTileN function in trtllm_fused_moe_kernel_launcher.cu reduces the set of supported tileN values (to shrink the autotuner's search space) by choosing specific values from the sorted list of supported tileN values: roundUpToPowerOfTwo(num_tokens * topK / numExperts), the value before it, and the next two values (capped at 256). As a result, num_tokens values within the same profiling range can get different tileN sets.
For example, on Nemotron 3 Super NVFP4:

  • num_tokens=2048 -> 2048*22/512 = 88, which rounds up to 128, so the tileN set is (64, 128, 256)
  • num_tokens=3003 -> 3003*22/512 = 129.03, which rounds up to 256, so the tileN set is (128, 256)

If tileN=64 was found to be the fastest at num_tokens=2048 for the range [2048, 4095], then when given num_tokens=3003 the Python side would pass [64, someTactic] to the C++ side, but for num_tokens=3003 there is no launcher for tileN=64, as computeSelectedTileN filtered it out.
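The mismatch can be reproduced with a small Python sketch (the supported tileN list and helper names here are assumptions for illustration; the real logic is the C++ computeSelectedTileN in trtllm_fused_moe_kernel_launcher.cu):

```python
import math

# Assumed sorted list of supported tileN values.
SUPPORTED_TILE_N = [8, 16, 32, 64, 128, 256]

def round_up_to_power_of_two(x: float) -> int:
    return 1 << math.ceil(math.log2(x))

def compute_selected_tile_n(num_tokens: int, top_k: int, num_experts: int):
    # Pivot: roundUpToPowerOfTwo(num_tokens * topK / numExperts), capped at 256.
    pivot = min(round_up_to_power_of_two(num_tokens * top_k / num_experts), 256)
    i = SUPPORTED_TILE_N.index(pivot)
    # Keep the previous value, the pivot, and the next two values.
    return SUPPORTED_TILE_N[max(i - 1, 0): i + 3]

# Nemotron 3 Super NVFP4: topK=22, numExperts=512.
assert compute_selected_tile_n(2048, 22, 512) == [64, 128, 256]  # pivot 128
assert compute_selected_tile_n(3003, 22, 512) == [128, 256]      # pivot 256
# tileN=64, chosen at the 2048 bucket, has no launcher at num_tokens=3003:
assert 64 not in compute_selected_tile_n(3003, 22, 512)
```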

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Stricter MoE tile validation and ensured all supported tiles are available at launch to avoid missing kernel configurations.
    • Autotuner mapping for linked dynamic dimensions now yields consistent cached bucket values.
  • Tests

    • Added SM100 MoE autotuner integration tests (including invalid-cached-tactic checks).
    • Re-enabled and expanded autotuner unit tests and added a test utility to reset the autotuner.

danisereb and others added 10 commits March 19, 2026 17:28
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
…lashinfer-ai#2617

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
…est of supporting tileN that was filtered out by computeSelectedTileN

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai bot commented Mar 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0435f527-dc56-4b24-9bec-bfd90cacaadd

📥 Commits

Reviewing files that changed from the base of the PR and between c67f600 and 7799f19.

📒 Files selected for processing (1)
  • tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py
✅ Files skipped from review due to trivial changes (1)
  • tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py

📝 Walkthrough

Walkthrough

Validated MoE tile selection, added helpers to resolve default tile/config, and expanded launcher construction to include all supported tiles. Adjusted autotuner bucket mapping to propagate a single mapped bucket across linked dimensions. Added/reset autotuner test utilities and new SM100 MoE integration tests.

Changes

Cohort / File(s) Summary
MoE Kernel Launcher Core
csrc/trtllm_fused_moe_kernel_launcher.cu
Added runtime checks in computeSelectedTileN (requires exact tile membership), added selectDefaultTileN() and resolveMoeTileAndConfig() helpers, and changed all MoE launcher entrypoints to build launchers for all supported tiles and use find+FLASHINFER_CHECK for resolved tile lookup.
Autotuner Core Logic
flashinfer/autotuner.py
Changed _find_nearest_profile to map a dynamic-bucket once per DynamicTensorSpec and apply that mapped value to all linked (input,dim) pairs to ensure consistent bucket resolution across linked dimensions.
Test Utilities & Unit Tests
tests/autotuner/utils.py, tests/autotuner/test_autotuner_core.py
Added reset_autotuner() utility; updated tests to use it, re-enabled several MoE-related _find_nearest_profile tests, added TileTacticDummyRunner, and added a new cache-consistency test asserting same-bucket inference reuses cached tactic.
MoE Autotuner Integration Tests
tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py
New SM100-only integration tests for trtllm_bf16_moe covering autotune/infer flows, weight layout helpers, forced profiling biases, cache population checks, and assertions for invalid/corrupted cached tactics.
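The autotuner change above — mapping a dynamic bucket once per DynamicTensorSpec and propagating it to all linked dimensions — can be sketched as follows (structures and names are hypothetical, not flashinfer's actual classes):

```python
def map_linked_shapes(shapes, specs):
    """shapes: {(input_idx, dim_idx): size}.
    specs: list of (linked_pairs, map_fn), one per DynamicTensorSpec.
    Returns the bucketed shapes used as the cache key."""
    mapped = dict(shapes)
    for linked_pairs, map_fn in specs:
        # Map once, using the first linked dimension as the representative...
        bucket = map_fn(shapes[linked_pairs[0]])
        # ...then apply the same bucket to every linked (input, dim) pair,
        # so all linked dims resolve to a consistent cached bucket value.
        for pair in linked_pairs:
            mapped[pair] = bucket
    return mapped

# Two tensors whose dim 0 is linked to num_tokens=3003:
shapes = {(0, 0): 3003, (1, 0): 3003}
specs = [([(0, 0), (1, 0)], lambda x: 1 << (x.bit_length() - 1))]
assert map_linked_shapes(shapes, specs) == {(0, 0): 2048, (1, 0): 2048}
```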

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

op: moe-routing, ready

Suggested reviewers

  • yzh119
  • nv-yunzheq
  • djmmoss
  • cyx-6
  • aleozlx
  • bkryu

Poem

🐰✨ I hopped through tiles both new and old,
I checked each bucket, brave and bold.
All launchers gathered, not a one left out,
Autotune sings, the rabbits shout! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 52.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: fixing autotuner's non-power-of-2 num_tokens handling and creating launchers for all supported tileN in trtllm fused MoE.
Description check ✅ Passed The description provides comprehensive technical details on both bugs, their root causes, the fix approach, and includes pre-commit and test updates sections from the template.


@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and correctness of the autotuner and fused Mixture-of-Experts (MoE) kernel launchers. It rectifies a long-standing problem where the autotuner's cache lookup for non-power-of-2 num_tokens could fail, and critically, prevents crashes in MoE kernels by ensuring that all supported tileN configurations have corresponding launchers available at runtime. These changes streamline the interaction between the Python-based autotuner and the C++ kernel implementations, leading to more stable and predictable performance for MoE operations.

Highlights

  • Autotuner Cache Fix: Resolved an issue where the autotuner failed to find appropriate cache entries for num_tokens values that are not powers of two, leading to fallback to default tactics. The Python-side autotuner now correctly maps non-power-of-2 num_tokens to tuning buckets by applying the mapped bucket value to all linked dimensions.
  • Fused MoE Kernel Launcher Robustness: Addressed a crash in the TRTLLM fused MoE kernel launcher that occurred when the autotuner selected a tileN value that was supported but filtered out by the computeSelectedTileN function. Launchers are now created for all supported tileN values, ensuring that any valid tileN chosen by the autotuner can be found and executed.
  • Tactic Resolution Logic: Introduced a new resolveMoeTileAndConfig function in the C++ backend to centralize the logic for resolving (tileN, config) pairs from the Python autotuner, including handling [-1, -1] fallback tactics gracefully.
  • Expanded Test Coverage: Added new integration tests for the TRTLLM fused MoE autotuner to specifically verify that all supported tileN values can be used for inference without crashes, and that invalid tactics are correctly handled.
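The fallback handling described in the Tactic Resolution Logic point can be illustrated with a Python sketch (names, the dict layout, and the default-config field are hypothetical; the real resolveMoeTileAndConfig is C++ in trtllm_fused_moe_kernel_launcher.cu):

```python
def resolve_moe_tile_and_config(tile_n, config, launchers, default_tile_n):
    # A [-1, -1] pair from the Python autotuner means "no cached choice".
    if tile_n == -1:
        tile_n = default_tile_n
    # find + explicit check instead of a crashing map.at(tile_n).
    if tile_n not in launchers:
        raise ValueError(f"no launcher built for tileN={tile_n}")
    if config == -1:
        config = launchers[tile_n]["default_config"]
    return tile_n, config

launchers = {64: {"default_config": 0}, 128: {"default_config": 3}}
assert resolve_moe_tile_and_config(-1, -1, launchers, 128) == (128, 3)
assert resolve_moe_tile_and_config(64, 5, launchers, 128) == (64, 5)
```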




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses two bugs related to the autotuner. The first fix correctly handles non-power-of-2 num_tokens for cache lookups by ensuring all linked dimensions are mapped to the same bucket. The second fix prevents crashes in the fused MoE kernel launcher by creating launchers for all supported tileN values, rather than a filtered subset. The C++ changes also improve robustness by using find instead of at for map lookups and encapsulating fallback logic. The accompanying tests, including new integration tests, are comprehensive. I've found one issue in a new test case and provided a suggestion to fix it.

Comment on lines +438 to +452
    tune_inputs = [torch.empty((bucket_start, hidden_size), dtype=torch.float32)]
    tuning_config = TuningConfig(
        dynamic_tensor_specs=(
            DynamicTensorSpec(
                input_idx=(0,),
                dim_idx=(0,),
                gen_tuning_buckets=tuning_buckets,
                map_to_tuning_buckets=lambda x: min(
                    last_positive_power_of_2(x), tune_max
                ),
            ),
        ),
    )
    with autotune(tune_mode=True):
        tuner.choose_one("test_same_bucket", [runner], tuning_config, tune_inputs)

high

The test logic for tuning is flawed. It only populates the autotuner cache for a single bucket (for num_tokens in [512, 1024)), but the subsequent inference checks assert expected tactics for three different buckets. The checks for buckets that were not tuned will fail because they will result in a cache miss and receive the fallback tactic, not the expected one.

To fix this, the tuning step should be performed for a representative num_tokens from each of the three bucket ranges being tested to ensure the cache is populated correctly before the inference-time checks.

    tuning_config = TuningConfig(
        dynamic_tensor_specs=(
            DynamicTensorSpec(
                input_idx=(0,),
                dim_idx=(0,),
                gen_tuning_buckets=tuning_buckets,
                map_to_tuning_buckets=lambda x: min(
                    last_positive_power_of_2(x), tune_max
                ),
            ),
        ),
    )
    with autotune(tune_mode=True):
        # Tune for a representative num_tokens from each bucket range defined in fake_profile
        # to populate the cache correctly for the inference checks below.
        for tune_tokens in [bucket_start // 2, bucket_start, bucket_end]:
            tune_inputs = [torch.empty((tune_tokens, hidden_size), dtype=torch.float32)]
            tuner.choose_one("test_same_bucket", [runner], tuning_config, tune_inputs)

Contributor Author


Fixed

@amitz-nv amitz-nv changed the title fix: Autotuner insert cache non-power-of-2 num_tokens, create launcher for all supported tileN in trtllm fused MoE fix: Autotuner _find_nearest_profile non-power-of-2 num_tokens, create launcher for all supported tileN in trtllm fused MoE Mar 19, 2026
@amitz-nv amitz-nv changed the title fix: Autotuner _find_nearest_profile non-power-of-2 num_tokens, create launcher for all supported tileN in trtllm fused MoE fix: Autotuner _find_nearest_profile non-power-of-2 num_tokens, create launchers for all supported tileN in trtllm fused MoE Mar 19, 2026
…ame_cached_tactic

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py`:
- Around line 221-253: The loop over supported_tile_n_values is not validating
each tile because trtllm_get_valid_moe_configs/computeSelectedTileN may not
include some tile_n and AutoTuner.choose_one consults AutoTuner.profiling_cache,
so later iterations reuse a cached tactic; fix by forcing a fresh profiling per
tile: in the loop, monkeypatch AutoTuner._profile_single_kernel as you already
do, then clear or overwrite AutoTuner.get().profiling_cache (or call
AutoTuner.get().clear_cache()) at the start of each iteration to avoid cache
hits, and after tuning assert that the cached tactic chosen has the expected
tile_n (inspect the cached entry’s tile_n) or alternatively bypass choose_one by
injecting the target [tile_n, config] directly into AutoTuner.profiling_cache
before calling _run_bf16_moe_infer so each iteration exercises and verifies the
requested tile_n.

In `@tests/autotuner/utils.py`:
- Around line 4-9: reset_autotuner currently clears cache, statistics and sets
is_tuning_mode but does not reset the internal counter that autotune() uses;
update the helper (reset_autotuner / AutoTuner.get()) to also reset the
AutoTuner._active_tuning_contexts counter back to zero (or its default empty
state) so that autotune() will not incorrectly derive tuning mode from leftover
contexts; locate the AutoTuner instance via AutoTuner.get(), set its
_active_tuning_contexts to the appropriate empty/zero value, then return the
tuner.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 960704f7-b832-4d0c-a83f-dadd2535665e

📥 Commits

Reviewing files that changed from the base of the PR and between 623db38 and c67f600.

📒 Files selected for processing (5)
  • csrc/trtllm_fused_moe_kernel_launcher.cu
  • flashinfer/autotuner.py
  • tests/autotuner/test_autotuner_core.py
  • tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py
  • tests/autotuner/utils.py

Comment on lines +4 to +9
def reset_autotuner() -> AutoTuner:
    tuner = AutoTuner.get()
    tuner.clear_cache()
    tuner.reset_statistics()
    tuner.is_tuning_mode = False
    return tuner

⚠️ Potential issue | 🟡 Minor

Reset _active_tuning_contexts in the shared helper.

autotune() derives is_tuning_mode from _active_tuning_contexts, so leaving that counter untouched can leak tuning mode across tests even after this helper forces is_tuning_mode = False.

🛠 Suggested fix
 def reset_autotuner() -> AutoTuner:
     tuner = AutoTuner.get()
-    tuner.clear_cache()
-    tuner.reset_statistics()
-    tuner.is_tuning_mode = False
+    with tuner._lock:
+        tuner.clear_cache()
+        tuner.reset_statistics()
+        tuner._active_tuning_contexts = 0
+        tuner.is_tuning_mode = False
     return tuner


@IwakuraRein IwakuraRein left a comment


LGTM. Thanks for the fix.

…otuner before every tileN

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@aleozlx
Collaborator

aleozlx commented Mar 20, 2026

/bot run

@aleozlx aleozlx added run-ci and removed run-ci labels Mar 20, 2026
@flashinfer-bot
Collaborator

GitLab MR !436 has been created, and the CI pipeline #46565266 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #46565266: 14/20 passed

@aleozlx aleozlx merged commit e679e45 into flashinfer-ai:main Mar 20, 2026
37 of 68 checks passed