fix: Fix autotuner crash on meta-device tensor in `trtllm_fp4_block_scale_routed_moe` (#2916)
Conversation
Code Review
This pull request fixes a crash in the autotuner for FP4 routed MoE by ensuring that placeholder tensors are created on the actual device instead of the `meta` device, which lacks storage for C++ kernel interaction. It also introduces logic to skip routing when `routing_logits` are absent and adds a regression test to verify the fix. I have no feedback to provide as the changes correctly address the issue.
/bot run
cc @trevor-m who is working on integrating routed moe into SGL |
[FAILED] Pipeline #47290871: 5/20 passed |
/bot run
[FAILED] Pipeline #47381824: 11/20 passed |
📌 Description

Summary

`RuntimeError: Cannot pack tensors on meta` is raised when `trtllm_fp4_block_scale_routed_moe` is called with autotuning enabled.

Root Cause
When `trtllm_fp4_block_scale_routed_moe` is called, `routing_logits` is `None` because routing has already been done (pre-computed `topk_ids` are provided instead). To give the autotuner a tensor with the right shape/dtype for profile generation, a placeholder was created with `device="meta"`.

This worked without autotuning because `choose_one` returns early (it only inspects `.size()` for the cache key and never passes the tensor to a kernel).

With autotuning enabled, `choose_one` enters the profiling loop, which calls `_create_tensor_like` on the placeholder. That method copies `origin_tensor.device`, so the derived profiling tensor is also on `"meta"`. When the profiling path calls `MoERunner.forward`, this meta tensor is passed to the C++ kernel via TVM FFI, which attempts DLPack conversion and fails: the meta device has no real memory.

Fix
Three changes in `flashinfer/fused_moe/core.py`:

- Replaced `device="meta"` with `device=hidden_states.device` — the placeholder is now a real CUDA tensor, so the autotuner can safely derive profiling tensors from it.
- Pass `skip_routing=(routing_logits is None)` through `kwargs` to `choose_one`, signaling that routing was pre-computed.
- In `MoERunner.forward`, set `routing_logits = None` when `skip_routing=True` — this ensures the C++ kernel takes the same no-routing code path during profiling as it does in production. Without this, the autotuner would profile with routing computation enabled (random `routing_logits` data), potentially selecting a suboptimal tactic for the actual inference path where routing is skipped.
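A hedged sketch of the first two changes (the helper name `make_routing_placeholder` is illustrative, not FlashInfer's actual code, and CPU stands in for CUDA so the snippet runs anywhere):

```python
import torch

def make_routing_placeholder(hidden_states, num_experts):
    # The fix: allocate the placeholder on hidden_states' actual device
    # instead of device="meta", so tensors the autotuner derives from it
    # have real storage and survive DLPack export.
    return torch.empty(hidden_states.shape[0], num_experts,
                       dtype=torch.bfloat16,
                       device=hidden_states.device)

hidden_states = torch.randn(16, 64)   # CPU stands in for the CUDA device here
routing_logits = None                 # routing was pre-computed upstream

placeholder = make_routing_placeholder(hidden_states, num_experts=8)

# Second change: signal "routing already done" to the tuner via kwargs,
# e.g. choose_one(..., skip_routing=(routing_logits is None)). Inside
# MoERunner.forward, skip_routing=True maps routing_logits back to None so
# profiling exercises the same no-routing path as production.
skip_routing = routing_logits is None
```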
Unit test changes

Added a `test_fp4_routed_moe_autotune_no_crash` regression test in `tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py`. The test calls `trtllm_fp4_block_scale_routed_moe` inside `autotune(True)` with `num_tokens=1` and `num_tokens=16`, verifying no crash occurs.

The main branch fails the newly added tests before the changes in `flashinfer/fused_moe/core.py`; after the fix they pass.
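The shape of that regression test can be sketched with stubs (a stand-in only: the real test needs a GPU and calls FlashInfer's actual kernel, whereas `autotune` and `fake_routed_moe` here are mocks):

```python
import contextlib
import torch

@contextlib.contextmanager
def autotune(enabled):
    # Stub for FlashInfer's autotune context manager, which toggles profiling.
    yield

def fake_routed_moe(hidden_states, topk_ids):
    # Stub kernel: with the fix, its routing placeholder lives on the same
    # device as hidden_states, so profiling-derived tensors have storage.
    placeholder = torch.empty(hidden_states.shape[0], 8,
                              device=hidden_states.device)
    assert torch.empty_like(placeholder).device.type != "meta"
    return hidden_states.clone()

def test_fp4_routed_moe_autotune_no_crash():
    for num_tokens in (1, 16):
        hidden_states = torch.randn(num_tokens, 64)
        topk_ids = torch.zeros(num_tokens, 2, dtype=torch.int32)
        with autotune(True):
            out = fake_routed_moe(hidden_states, topk_ids)
        assert out.shape == (num_tokens, 64)

test_fp4_routed_moe_autotune_no_crash()
```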
🔍 Related Issues
#2023
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- Installed the hooks with `pre-commit install`.
- Ran `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- Added or updated tests as needed (e.g., `unittest`, etc.).

Reviewer Notes