Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -298,7 +298,11 @@ def blockscaled_contiguous_grouped_gemm_finalize_fusion_nvfp4(
expanded_idx = token_idx * topk + topk_idx. Invalid rows have -1.
token_final_scales: Router scaling factors, shape (seq_len, topk), float32/bf16/fp16
out: Optional output tensor, shape (seq_len, n). Created if None.
This tensor is used for atomic accumulation, so it should be zero-initialized.
This tensor is used for atomic accumulation. If `out` is
provided, it must already be zero-initialized by the caller.
If `out` is None, this function allocates a zero-initialized
output tensor. Passing a non-zeroed `out` buffer will silently
produce incorrect results.
Comment on lines +301 to +305
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

This turns the public out= path into a silent accumulation trap.

blockscaled_contiguous_grouped_gemm_finalize_fusion_nvfp4 is still a flashinfer_api, so removing the internal zero on caller-supplied buffers breaks existing out= call sites with wrong answers rather than a loud failure. Please keep the zero-free fast path internal, or gate it behind an explicit opt-in, and preserve overwrite semantics on the public entry point.

Also applies to: 321-325, 410-420

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@flashinfer/fused_moe/cute_dsl/blockscaled_contiguous_grouped_gemm_finalize_fusion.py`
around lines 301 - 305, The public API function
blockscaled_contiguous_grouped_gemm_finalize_fusion_nvfp4 currently removes
zero-initialization on caller-supplied out buffers, turning legitimate out=
usage into a silent accumulation bug; restore the original overwrite semantics
so when callers pass a non-None out buffer it is always zeroed (or explicitly
documented to be required zeroed), and move the zero-free fast path into an
internal helper (or gate it behind an explicit opt-in flag) used only by
internal callers; update the implementations referenced around the other similar
sites (the same pattern at the blocks around lines 321-325 and 410-420) to
follow the same approach so public entry points preserve overwrite semantics
while internal optimized paths can bypass zeroing when explicitly opted-in.

ab_dtype: Data type for A and B matrices. Default: "float4_e2m1fn"
sf_dtype: Data type for scale factors. Default: "float8_e4m3fn"
out_dtype: Data type for output matrix. Default: "bfloat16"
Expand All @@ -314,6 +318,11 @@ def blockscaled_contiguous_grouped_gemm_finalize_fusion_nvfp4(

Notes:
- The output tensor is modified in-place using atomic adds for scatter-reduction.
- When out is provided it is NOT zeroed internally; the caller
must ensure the buffer is zeroed before each invocation.
In the main CuteDSL MoE path, _moe_core_impl handles this by
zeroing the active output slice before GEMM2, typically on an
auxiliary stream overlapped with GEMM1.
- Call create_finalize_fusion_tensors() to create permuted_idx_to_expanded_idx and token_final_scales.
- Requires SM100 (Blackwell) GPU architecture
- The finalize fusion eliminates the need for a separate moe_unpermute kernel
Expand Down Expand Up @@ -398,16 +407,17 @@ def blockscaled_contiguous_grouped_gemm_finalize_fusion_nvfp4(
f"cluster_shape_mn={cluster_shape_mn}, shape=({permuted_m}, {n}, {k}, {num_experts})"
)

# Create output tensor if not provided (zero-initialized for atomic adds)
# Create output tensor if not provided (zero-initialized for atomic adds).
# If out is provided, the caller is responsible for zeroing it before
# this call. The GEMM2 epilogue uses atomic scatter-add
# (out[token_idx] += ...), so any non-zero residual would corrupt
# results.
if out is None:
out = torch.zeros(
(seq_len, n),
dtype=cutlass_to_torch_dtype(out_dtype_cutlass),
device=a.device,
)
else:
# Ensure output is zero for proper accumulation
out.zero_()

# Get SM count
if sm_count is None:
Expand Down
50 changes: 27 additions & 23 deletions flashinfer/fused_moe/cute_dsl/fused_moe.py
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,6 @@
from .moe_utils import (
allocate_moe_sort_buffers,
get_max_num_permuted_tokens,
moe_output_memset,
moe_sort,
)
from .blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion import (
Expand Down Expand Up @@ -143,7 +142,7 @@ def _moe_core_impl(
This function handles:
1. moe_sort: Token routing computation
2. GEMM1 + SwiGLU: First projection with activation
3. Async moe_output_memset: Zero output buffer (overlapped with GEMM1)
3. Async output zero: Zero output buffer (overlapped with GEMM1)
4. GEMM2 + Finalize: Second projection with atomic scatter

Args:
Expand Down Expand Up @@ -183,13 +182,20 @@ def _moe_core_impl(
num_tokens = token_selected_experts.size(0)
hidden_size = w2_weight.size(1)

# Allocate output if not provided
# Allocate output if not provided. The caller (wrapper or functional
# API) should pass a [:num_tokens] slice of the pre-allocated buffer
# when using CUDA graphs. The buffer is zeroed in Step 3 below.
if moe_output is None:
moe_output = torch.empty(
(num_tokens, hidden_size),
dtype=output_dtype,
device=x.device,
)
else:
assert moe_output.size(0) == num_tokens, (
f"moe_output must be sliced to num_tokens rows before calling "
f"_moe_core_impl (got {moe_output.size(0)}, expected {num_tokens})"
)
Comment on lines +195 to +198
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This assertion is a valuable safeguard. To improve debuggability when this assertion fails, I recommend enhancing the error message to include the actual and expected tensor sizes. This provides immediate, actionable context to the developer, reducing debugging time.

Suggested change
assert moe_output.size(0) == num_tokens, (
"moe_output must be sliced to num_tokens rows before calling _moe_core_impl"
)
assert moe_output.size(0) == num_tokens, (
f"moe_output has {moe_output.size(0)} rows, but expected {num_tokens}. "
"It must be sliced to num_tokens rows before calling _moe_core_impl."
)

Comment on lines +185 to +198
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't require an exact-row moe_output here.

The new assert breaks callers that reuse a larger pre-allocated output buffer, even though the optimization only needs a [:num_tokens] view. Accept size(0) >= num_tokens, validate the hidden dimension, and slice locally so the fast path stays intact without changing the public contract.

Proposed fix
     if moe_output is None:
         moe_output = torch.empty(
             (num_tokens, hidden_size),
             dtype=output_dtype,
             device=x.device,
         )
     else:
-        assert moe_output.size(0) == num_tokens, (
-            "moe_output must be sliced to num_tokens rows before calling _moe_core_impl"
-        )
+        if moe_output.size(0) < num_tokens or moe_output.size(1) != hidden_size:
+            raise ValueError(
+                "moe_output must have shape [>= num_tokens, hidden_size]"
+            )
+        moe_output = moe_output[:num_tokens]
+        if not moe_output.is_contiguous():
+            raise ValueError("moe_output[:num_tokens] must be contiguous")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/cute_dsl/fused_moe.py` around lines 185 - 197, The
assert in the _moe_core_impl input handling is too strict: allow callers to pass
a larger preallocated moe_output buffer by checking moe_output.size(0) >=
num_tokens and moe_output.size(1) == hidden_size (validate hidden dimension and
dtype/device if desired), then locally slice moe_output =
moe_output[:num_tokens] before using it; keep the existing allocation path when
moe_output is None (using torch.empty((num_tokens, hidden_size),
dtype=output_dtype, device=x.device)) so the fast path and CUDA-graph slice
semantics remain intact.


# Get stream resources if using async memset
if use_async_memset:
Expand Down Expand Up @@ -246,28 +252,23 @@ def _moe_core_impl(
)
)

# Step 3: Async moe_output_memset on auxiliary stream
# Step 3: Zero the active output slice before GEMM2 finalize.
# Finalize uses atomic scatter-add into `moe_output`, so it must start
# from zero each call. We zero only the active slice, not the full
# preallocated buffer. We do not use `moe_output_memset` here because
# FlashInfer's port always invokes the sparse kernel, missing the
# TRT-LLM dispatch that falls back to cudaMemsetAsync (dense zero)
# when !enable_alltoall || ep_size <= top_k. A dense zero of the
# active slice is correct for all configurations.
# TODO: add the TRTLLM all-to-all and `moe_output_memset` behavior
if use_async_memset:
max_num_permuted_tokens = get_max_num_permuted_tokens(
num_tokens, top_k, num_local_experts, tile_size
)
with torch.cuda.stream(aux_stream):
main_event.wait()
moe_output_memset(
output=moe_output,
tile_idx_to_mn_limit=tile_idx_to_mn_limit,
expanded_idx_to_permuted_idx=expanded_idx_to_permuted_idx,
permuted_idx_to_expanded_idx=permuted_idx_to_expanded_idx,
num_non_exiting_tiles=num_non_exiting_tiles,
max_num_permuted_tokens=max_num_permuted_tokens,
top_k=top_k,
tile_size=tile_size,
)
moe_output.zero_()
memset_event.record()
memset_event.wait()
else:
# Simple zero without async
moe_output[:num_tokens].zero_()
moe_output.zero_()

# Step 4: GEMM2 + Finalize
blockscaled_contiguous_grouped_gemm_finalize_fusion_nvfp4(
Expand Down Expand Up @@ -431,7 +432,8 @@ def _allocate_buffers(self) -> None:
(scale_size,), dtype=torch.uint8, device=self.device
)

# Final output
# Final output — sliced to [:num_tokens] before each forward pass,
# then zeroed before GEMM2 finalize, typically on aux_stream.
self._moe_output = torch.empty(
(self.max_num_tokens, self.hidden_size),
dtype=self.output_dtype,
Expand Down Expand Up @@ -497,7 +499,8 @@ def _forward_with_tactic(
gemm1_out_scale=self._gemm1_output_scale if self.use_cuda_graph else None,
moe_output=moe_output
if moe_output is not None
else (self._moe_output if self.use_cuda_graph else None),
# Slice the CUDA-graph buffer to the active batch.
else (self._moe_output[: x.shape[0]] if self.use_cuda_graph else None),
aux_stream=self._aux_stream,
main_event=self._main_event,
memset_event=self._memset_event,
Expand Down Expand Up @@ -550,9 +553,10 @@ def run(
f"num_tokens ({num_tokens}) exceeds max_num_tokens ({self.max_num_tokens})"
)

# Allocate output buffer if not using pre-allocated one
# Slice the pre-allocated buffer to the active batch so that
# _moe_core_impl only zeros num_tokens rows, not max_num_tokens.
if self.use_cuda_graph:
moe_output = self._moe_output
moe_output = self._moe_output[:num_tokens]
else:
moe_output = torch.empty(
(num_tokens, self.hidden_size),
Expand Down
Loading