Commit b5abcf8

Fix fused_moe fallback issue. (NVIDIA#3652)
min_latency_mode is set to False only during the warmup phase, so when it becomes True during inference, all tactics fall back to the default one, causing a perf regression.

Signed-off-by: Yukun He <[email protected]>
1 parent 7822945 commit b5abcf8

File tree

1 file changed (+3, −1 lines)


tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

Lines changed: 3 additions & 1 deletion
@@ -126,7 +126,9 @@ def fused_moe(
         (2, 0, ((0, ), lambda x: x)),
     ))

-    min_latency_tensor = torch.empty(1) if min_latency_mode else torch.empty(0)
+    # TODO: set min_latency_mode always to False due to the error in the moe_kernels
+    min_latency_tensor = torch.empty(0)
+
     # allocate workspace for profiling
     moe_runner = MoERunner(
         x_dtype=input.dtype,
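The regression mechanism described in the commit message can be illustrated with a minimal, hypothetical sketch of an autotuning tactic cache. The names here (`TacticCache`, `profile`, `lookup`) are illustrative only, not the actual TensorRT-LLM API: tactics profiled during warmup are cached per configuration, so if warmup only ever runs with `min_latency_mode=False`, a `True` value at inference time misses the cache and falls back to the default tactic.

```python
class TacticCache:
    """Hypothetical sketch of a per-configuration autotuning cache."""

    def __init__(self):
        # Maps (shape, min_latency_mode) -> best tactic id found by profiling.
        self._best = {}

    def profile(self, shape, min_latency_mode):
        # Pretend profiling determined tactic 7 is fastest for this config.
        self._best[(shape, min_latency_mode)] = 7

    def lookup(self, shape, min_latency_mode, default=-1):
        # A cache miss returns the default tactic: the perf regression
        # this commit works around.
        return self._best.get((shape, min_latency_mode), default)


cache = TacticCache()
# Warmup only ever profiles with min_latency_mode=False.
cache.profile((128, 4096), min_latency_mode=False)

print(cache.lookup((128, 4096), False))  # profiled tactic: 7
print(cache.lookup((128, 4096), True))   # cache miss, default tactic: -1
```

Pinning `min_latency_tensor` to `torch.empty(0)` keeps inference on the same configuration that was profiled during warmup, so lookups always hit the cache.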
