Commit b5abcf8

Fix fused_moe fallback issue. (NVIDIA#3652)
min_latency_mode is set to False only during the warmup phase, so when it becomes True during inference, all tactics fall back to the default one, causing a perf regression.

Signed-off-by: Yukun He <[email protected]>
1 parent 7822945 commit b5abcf8

File tree

1 file changed (+3, −1 lines)


tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

Lines changed: 3 additions & 1 deletion
@@ -126,7 +126,9 @@ def fused_moe(
         (2, 0, ((0, ), lambda x: x)),
     ))

-    min_latency_tensor = torch.empty(1) if min_latency_mode else torch.empty(0)
+    # TODO: set min_latency_mode always to False due to the error in the moe_kernels
+    min_latency_tensor = torch.empty(0)
+
     # allocate workspace for profiling
     moe_runner = MoERunner(
         x_dtype=input.dtype,
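The regression mechanism described in the commit message can be illustrated with a minimal, hypothetical sketch of an autotuning tactic cache. The names here (`TacticCache`, `profile`, `lookup`) are illustrative only, not the actual TensorRT-LLM API: tactics profiled during warmup are cached per configuration, so if warmup only ever runs with `min_latency_mode=False`, a `True` value at inference time misses the cache and falls back to the default tactic.

```python
class TacticCache:
    """Hypothetical sketch of a per-configuration autotuning cache."""

    def __init__(self):
        # Maps (shape, min_latency_mode) -> best tactic id found by profiling.
        self._best = {}

    def profile(self, shape, min_latency_mode):
        # Pretend profiling determined tactic 7 is fastest for this config.
        self._best[(shape, min_latency_mode)] = 7

    def lookup(self, shape, min_latency_mode, default=-1):
        # A cache miss returns the default tactic: the perf regression
        # this commit works around.
        return self._best.get((shape, min_latency_mode), default)


cache = TacticCache()
# Warmup only ever profiles with min_latency_mode=False.
cache.profile((128, 4096), min_latency_mode=False)

print(cache.lookup((128, 4096), False))  # profiled tactic: 7
print(cache.lookup((128, 4096), True))   # cache miss, default tactic: -1
```

Pinning `min_latency_tensor` to `torch.empty(0)` keeps inference on the same configuration that was profiled during warmup, so lookups always hit the cache.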
