fix: Fix autotuner crash on meta-device tensor in `trtllm_fp4_block_scale_routed_moe` (#2916)
Conversation
Code Review
This pull request fixes a crash in the autotuner for FP4 routed MoE by ensuring that placeholder tensors are created on the actual device instead of the `meta` device, which lacks storage for C++ kernel interaction. It also introduces logic to skip routing when `routing_logits` are absent and adds a regression test to verify the fix. I have no feedback to provide as the changes correctly address the issue.
/bot run
cc @trevor-m who is working on integrating routed moe into SGL |
[FAILED] Pipeline #47290871: 5/20 passed |
/bot run
[FAILED] Pipeline #47381824: 11/20 passed |
📌 Description

Summary

`RuntimeError: Cannot pack tensors on meta` is raised when `trtllm_fp4_block_scale_routed_moe` is called with autotuning enabled.

Root Cause
When `trtllm_fp4_block_scale_routed_moe` is called, `routing_logits` is `None` because routing has already been done (pre-computed `topk_ids` are provided instead). To give the autotuner a tensor with the right shape/dtype for profile generation, a placeholder was created with `device="meta"`.

This worked without autotuning because `choose_one` returns early (it only inspects `.size()` for the cache key and never passes the tensor to a kernel).

With autotuning enabled, `choose_one` enters the profiling loop, which calls `_create_tensor_like` on the placeholder. That method copies `origin_tensor.device`, so the derived profiling tensor is also on `"meta"`. When the profiling path calls `MoERunner.forward`, this meta tensor is passed to the C++ kernel via TVM FFI, which attempts DLPack conversion and fails: the meta device has no real memory.

Fix
Three changes in `flashinfer/fused_moe/core.py`:

- Replaced `device="meta"` with `device=hidden_states.device` — the placeholder is now a real CUDA tensor, so the autotuner can safely derive profiling tensors from it.
- Pass `skip_routing=(routing_logits is None)` through `kwargs` to `choose_one`, signaling that routing was pre-computed.
- In `MoERunner.forward`, set `routing_logits = None` when `skip_routing=True` — this ensures the C++ kernel takes the same no-routing code path during profiling as it does in production. Without this, the autotuner would profile with routing computation enabled (random `routing_logits` data), potentially selecting a suboptimal tactic for the actual inference path where routing is skipped.
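A hedged sketch of the first two changes (the helper name `make_routing_placeholder` is illustrative, not FlashInfer's actual code, and CPU stands in for CUDA so the snippet runs anywhere):

```python
import torch

def make_routing_placeholder(hidden_states, num_experts):
    # The fix: allocate the placeholder on hidden_states' actual device
    # instead of device="meta", so tensors the autotuner derives from it
    # have real storage and survive DLPack export.
    return torch.empty(hidden_states.shape[0], num_experts,
                       dtype=torch.bfloat16,
                       device=hidden_states.device)

hidden_states = torch.randn(16, 64)   # CPU stands in for the CUDA device here
routing_logits = None                 # routing was pre-computed upstream

placeholder = make_routing_placeholder(hidden_states, num_experts=8)

# Second change: signal "routing already done" to the tuner via kwargs,
# e.g. choose_one(..., skip_routing=(routing_logits is None)). Inside
# MoERunner.forward, skip_routing=True maps routing_logits back to None so
# profiling exercises the same no-routing path as production.
skip_routing = routing_logits is None
```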
Unit test changes

Added a `test_fp4_routed_moe_autotune_no_crash` regression test in `tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py`. The test calls `trtllm_fp4_block_scale_routed_moe` inside `autotune(True)` with `num_tokens=1` and `num_tokens=16`, verifying no crash occurs.

The main branch fails the newly added tests before the changes in `flashinfer/fused_moe/core.py`; after the fix they pass.
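The shape of that regression test can be sketched with stubs (a stand-in only: the real test needs a GPU and calls FlashInfer's actual kernel, whereas `autotune` and `fake_routed_moe` here are mocks):

```python
import contextlib
import torch

@contextlib.contextmanager
def autotune(enabled):
    # Stub for FlashInfer's autotune context manager, which toggles profiling.
    yield

def fake_routed_moe(hidden_states, topk_ids):
    # Stub kernel: with the fix, its routing placeholder lives on the same
    # device as hidden_states, so profiling-derived tensors have storage.
    placeholder = torch.empty(hidden_states.shape[0], 8,
                              device=hidden_states.device)
    assert torch.empty_like(placeholder).device.type != "meta"
    return hidden_states.clone()

def test_fp4_routed_moe_autotune_no_crash():
    for num_tokens in (1, 16):
        hidden_states = torch.randn(num_tokens, 64)
        topk_ids = torch.zeros(num_tokens, 2, dtype=torch.int32)
        with autotune(True):
            out = fake_routed_moe(hidden_states, topk_ids)
        assert out.shape == (num_tokens, 64)

test_fp4_routed_moe_autotune_no_crash()
```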
🔍 Related Issues
#2023
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- Installed the hooks with `pre-commit install`.
- Ran `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- Added or updated tests as needed (e.g., `unittest`, etc.).

Reviewer Notes