Fix: CUDA illegal memory access in MoE three-step sort fallback (num_tokens > 256) by jhaotingc · Pull Request #3011 · flashinfer-ai/flashinfer

jhaotingc · 2026-04-07T23:27:15Z

📌 Description

Fixes a CUDA illegal memory access crash in cutlass_fused_moe with use_deepseek_fp8_block_scale=True when num_tokens > 256.

Root Cause

When num_tokens > 256, the fused single-block sort kernel (fusedBuildExpertMapsSortFirstTokenKernel, max BLOCK_SIZE=256) cannot handle the batch and falls back to threeStepBuildExpertMapsSortFirstToken.

The three-step fallback's blockExpertPrefixSumKernel uses break after finding the first match of a given expert in a token's top_k list. When a token has duplicate expert selections in its top_k (e.g., experts=[5, 88, 88, 65, ...]), only the first occurrence is recorded — the second slot's entry in unpermuted_row_to_permuted_row is never written.

finalizeMoeRoutingKernel then reads these uninitialized entries as row indices into the GEMM output buffer, causing wild pointer dereferences and CUDA illegal memory access.
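The failure mode can be reproduced on the host with a small model. The sketch below is not the kernel's code: `build_row_map_first_match`, `kUnwritten`, and the k-major slot layout (`k * num_tokens + token`, borrowed from the `source_row + k_idx * num_rows` indexing quoted in the review) are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Marker for "never written". In the real buffer such a slot would hold
// whatever garbage was already in device memory.
constexpr int kUnwritten = -12345;

// Hypothetical host-side model of a first-match-then-break scan.
// token_experts is row-major [num_tokens][top_k]; the returned map is
// indexed k-major: slot = k * num_tokens + token.
std::vector<int> build_row_map_first_match(const std::vector<int>& token_experts,
                                           int num_tokens, int top_k,
                                           int num_experts) {
  std::vector<int> row_map(static_cast<std::size_t>(num_tokens) * top_k, kUnwritten);
  int next_permuted_row = 0;
  for (int expert = 0; expert < num_experts; ++expert) {
    for (int token = 0; token < num_tokens; ++token) {
      for (int k = 0; k < top_k; ++k) {
        if (token_experts[token * top_k + k] == expert) {
          row_map[k * num_tokens + token] = next_permuted_row++;
          break;  // stops at the first match, so a duplicate slot is skipped
        }
      }
    }
  }
  return row_map;
}
```

For a single token with experts `[5, 88, 88, 65]`, slots 0, 1, and 3 receive permuted rows, while slot 2 (the second `88`) keeps `kUnwritten`.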

Fix

Initialize unpermuted_row_to_permuted_row to -1 before the three-step fallback runs, so that every entry the fallback skips holds a known sentinel instead of uninitialized memory. This relies on finalizeMoeRoutingKernel checking expert_id validity before it reads the permutation array, so that entries left at -1 are never used as row indices.

cudaMemsetAsync(unpermuted_row_to_permuted_row, -1, expanded_num_rows * sizeof(int), stream);
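Passing -1 works here because memset-style calls fill bytes, not ints: the low byte of -1 is 0xFF, and four 0xFF bytes read back as the int -1. A minimal host-side analogue using `std::memset`, assuming a 4-byte two's-complement `int` (the helper name `filled_with_memset` is hypothetical):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Host-side analogue of the byte-wise fill: every byte of the buffer is set
// to the low byte of `value`, exactly as memset does.
std::vector<int> filled_with_memset(std::size_t count, int value) {
  std::vector<int> buf(count, 0);
  if (count > 0) {
    std::memset(buf.data(), value, count * sizeof(int));
  }
  return buf;
}
```

The same trick does not generalize to other sentinels: filling with the value 1 yields 0x01010101 (16843009) per int, not 1.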

Performance Impact

Negligible. cudaMemsetAsync is asynchronous with respect to the host (no CPU-GPU sync), CUDA-graph compatible, and fills less than 16 KB on the GPU (under 1 us; the Perf estimate below budgets 2 us per layer). It only runs on the fallback path (num_tokens > 256).

Test plan

1. Serve with FlashInfer MoE (the fixed path)

export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_USE_FLASHINFER_MOE_FP8=1

vllm serve /scratch_omniml_data_3/hf-local/Qwen/Qwen3.5-397B-A17B-FP8 \
  --tensor-parallel-size 4 --enable-expert-parallel \
  --async-scheduling --max-num-seqs 512 \
  --max-num-batched-tokens 4096 --max-model-len 4096 \
  --gpu-memory-utilization 0.9 --no-enable-prefix-caching \
  --trust-remote-code --port 8001 \
  --compilation-config '{"compile_sizes":[1,2,4,8,16,32,64,128,256,512],"cudagraph_capture_sizes":[1,2,4,8,16,32,64,128,256,512]}'

2. Serve with Triton MoE (baseline for accuracy comparison)

export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve /scratch_omniml_data_3/hf-local/Qwen/Qwen3.5-397B-A17B-FP8 \
  --tensor-parallel-size 4 --enable-expert-parallel \
  --async-scheduling --max-num-seqs 512 \
  --max-num-batched-tokens 4096 --max-model-len 4096 \
  --gpu-memory-utilization 0.9 --no-enable-prefix-caching \
  --trust-remote-code --port 8001 \
  --compilation-config '{"compile_sizes":[1,2,4,8,16,32,64,128,256,512],"cudagraph_capture_sizes":[1,2,4,8,16,32,64,128,256,512]}' \
  --kernel-config '{"moe_backend": "triton"}'

3. Accuracy test (lm_eval GSM8K)

lm_eval --model local-completions \
    --model_args base_url=http://localhost:8001/v1/completions,model=/scratch_omniml_data_3/hf-local/Qwen/Qwen3.5-397B-A17B-FP8,num_concurrent=128,tokenized_requests=False \
    --tasks gsm8k \
    --batch_size 128

4. Results

FlashInfer MoE (with fix)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8165|±  |0.0107|
|     |       |strict-match    |     5|exact_match|↑  |0.7991|±  |0.0110|

Triton MoE (baseline)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8036|±  |0.0109|
|     |       |strict-match    |     5|exact_match|↑  |0.7976|±  |0.0111|

Summary

|Backend|GSM8K (flexible)|GSM8K (strict)|Crash at CONC 16+?|
|-------|---------------:|-------------:|------------------|
|FlashInfer MoE (before fix)|N/A (crash)|N/A (crash)|Yes|
|FlashInfer MoE (after fix)|0.8165|0.7991|No|
|Triton MoE (baseline)|0.8036|0.7976|No|

Perf

Memset overhead per step = 94 layers × 2 us = 188 us = 0.188 ms
Step time = 50 ms
Slowdown = 0.188 / 50 = 0.38%

~0.4% slowdown — negligible. And this only happens on the three-step path (num_tokens > 256). For decode steps with fewer tokens, the fused path runs and there's zero overhead.
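The estimate above is easy to re-run for other layer counts or step times. A small sketch, where `memset_overhead_fraction` is a hypothetical helper and the inputs (94 layers, 2 us per memset, 50 ms per step) are the figures quoted in this description:

```cpp
// Fraction of one step spent in the extra memsets:
// (layers * per-memset time in us) / (step time converted to us).
constexpr double memset_overhead_fraction(int num_layers, double memset_us,
                                          double step_ms) {
  return (num_layers * memset_us) / (step_ms * 1000.0);
}
```

`memset_overhead_fraction(94, 2.0, 50.0)` evaluates to 0.00376, which is the 0.38% figure rounded up to ~0.4% in the text.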

🔍 Related Issues

vllm-project/vllm#39244

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

I have installed pre-commit by running pip install pre-commit (or used your preferred method).
I have installed the hooks with pre-commit install.
I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

Tests have been added or updated as needed.
All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

Bug Fixes
- Improved mixture-of-experts backend stability and reliability when processing edge cases with tokens not matched to local expert assignments.

gemini-code-assist

Code Review

This pull request introduces an asynchronous memory initialization to -1 for permutation arrays within the MoE three-step fallback path to prevent the use of uninitialized memory. While this is a necessary step, review feedback indicates that it does not fully resolve potential illegal memory accesses in the finalizeMoeRoutingKernel, which still lacks checks for negative indices. Additionally, the code comments should be updated to accurately reflect that unpopulated entries are primarily caused by duplicate expert selections rather than tokens not matching local experts.

gemini-code-assist · 2026-04-07T23:29:49Z

@@ -3800,6 +3800,9 @@ void CutlassMoeFCRunner<T, WeightType, OutputType, InputType, BackBoneType, IsMX

    if (!fused_prologue_result) {
      TLLM_LOG_TRACE("Falling back to unfused prologue");
+      // Fix: zero-init permutation arrays before three-step fallback
+      // The three-step path may not populate all entries (e.g. tokens not matching local experts)
+      cudaMemsetAsync(unpermuted_row_to_permuted_row, -1, expanded_num_rows * sizeof(int), stream);


While initializing unpermuted_row_to_permuted_row to -1 prevents using uninitialized memory, it does not fully resolve the illegal memory access issue in all configurations.

Specifically, finalizeMoeRoutingKernel (used when enable_alltoall is false or ep_size == 1) does not check if the retrieved permuted row index is negative before using it as a pointer offset (see line 1733). If a token has duplicate expert selections, the three-step fallback only records the first occurrence due to the break in blockExpertPrefixSumKernel (line 564), leaving the entries for subsequent occurrences as -1. Accessing expanded_permuted_rows_v + (-1) * num_elems_in_padded_col will still result in an illegal memory access.

While finalizeMoeRoutingNoFillingKernel (used with all-to-all) happens to avoid this by only processing the first local expert per token, the standard finalizeMoeRoutingKernel remains vulnerable. You should add a check in the finalize kernels to skip entries where the permuted row index is negative.
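The guard being requested can be illustrated with a scalar host-side model. This is a sketch under stated assumptions, not the kernel: `finalize_with_guard` is a hypothetical function, each permuted row is reduced to a single float, and the map is indexed k-major as `k * num_tokens + token`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical scalar model of a finalize step: each token's output is the
// scale-weighted sum of the permuted rows its top_k slots map to.
std::vector<float> finalize_with_guard(const std::vector<float>& permuted_rows,
                                       const std::vector<int>& row_map,
                                       const std::vector<float>& scales,
                                       int num_tokens, int top_k) {
  std::vector<float> out(num_tokens, 0.0f);
  for (int token = 0; token < num_tokens; ++token) {
    for (int k = 0; k < top_k; ++k) {
      const std::int64_t permuted_row = row_map[k * num_tokens + token];
      if (permuted_row < 0) {
        continue;  // sentinel: this slot was never mapped, so do not index with it
      }
      out[token] += scales[token * top_k + k] * permuted_rows[permuted_row];
    }
  }
  return out;
}
```

Skipping the sentinel avoids the out-of-bounds read. Whether the routing weight of a dropped duplicate should be accounted for elsewhere is a separate question that this thread does not settle.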

gemini-code-assist · 2026-04-07T23:29:49Z

@@ -3800,6 +3800,9 @@ void CutlassMoeFCRunner<T, WeightType, OutputType, InputType, BackBoneType, IsMX

    if (!fused_prologue_result) {
      TLLM_LOG_TRACE("Falling back to unfused prologue");
+      // Fix: zero-init permutation arrays before three-step fallback
+      // The three-step path may not populate all entries (e.g. tokens not matching local experts)


The comment here is slightly inaccurate. The reason the three-step path may not populate all entries is not primarily due to "tokens not matching local experts" (those are already handled by the expert_id check in the finalize kernels), but rather due to the fact that blockExpertPrefixSumKernel only records the first occurrence of an expert for a given token, skipping any duplicate selections of the same expert in the token's top-k list.

// The three-step path may not populate all entries if a token has duplicate expert selections

…allback

When num_tokens > 256, the fused sort kernel cannot handle the batch and falls back to threeStepBuildExpertMapsSortFirstToken. This three-step path does not populate all entries of unpermuted_row_to_permuted_row, leaving some entries uninitialized with garbage values. finalizeMoeRoutingKernel then reads these garbage values as row indices into the GEMM output buffer, causing CUDA illegal memory access.

The fix initializes the array to -1 before the three-step path runs. Entries with -1 are safely skipped by finalizeMoeRoutingKernel because the corresponding expert_id check filters them out via continue.

Crash repro: cutlass_fused_moe with use_deepseek_fp8_block_scale=True, num_tokens=257+, 128 experts, top_k=8, hidden=5120, intermediate=8960.

Verified: BS=256,257,288,384,512 all pass with 10 replays each.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>

coderabbitai · 2026-04-07T23:39:54Z

📝 Walkthrough

Walkthrough

A cudaMemsetAsync call was added to initialize the unpermuted_row_to_permuted_row buffer to -1 before invoking threeStepBuildExpertMapsSortFirstToken in the unfused prologue fallback path. This ensures a known sentinel value for permutation map entries when not all indices may be populated.

Changes

|Cohort / File(s)|Summary|
|----------------|-------|
|CUTLASS MOE Kernel Initialization `csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh`|Added memory initialization via `cudaMemsetAsync` to set `unpermuted_row_to_permuted_row` to `-1` for `expanded_num_rows` entries before expert map construction.|

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

run-ci

Suggested reviewers

yzh119
cyx-6
jimmyzho
yongwww
djmmoss
sricketts
aleozlx

Poem

🐰✨ A dash of -1 in the async stream,
Sentinel values fulfill a dream—
Every row now knows its place,
Before the expert map sets pace!
Initialization's gentle touch,
Prevents lost tokens oh so much. 🌙

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

|Check name|Status|Explanation|
|----------|------|-----------|
|Title check|✅ Passed|The title clearly and specifically describes the fix: a CUDA illegal memory access in MoE three-step sort fallback triggered when num_tokens > 256.|
|Docstring Coverage|✅ Passed|No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.|
|Description check|✅ Passed|The PR description is comprehensive and well-structured, covering the bug root cause, fix details, performance impact, and extensive test plan with quantitative results.|


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh`:
- Around line 3803-3805: The sentinel value -1 stored in
unpermuted_row_to_permuted_row must be checked by consumers before being used as
an index; update the finalize code paths that read this map (the two
finalize/consume sites referenced) to guard before dereferencing—e.g., read int
perm = unpermuted_row_to_permuted_row[i]; if (perm < 0) { /* skip / handle
unmapped row */ } else { use perm as index }—so unmapped entries are skipped or
handled safely instead of being used as array indices.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b5cdbe3a-e058-4810-a3ef-e0ae09213a2b

📥 Commits

Reviewing files that changed from the base of the PR and between 1fd6305 and e26a72b.

📒 Files selected for processing (1)

csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh

coderabbitai · 2026-04-07T23:43:06Z

+      // Fix: zero-init permutation arrays before three-step fallback
+      // The three-step path may not populate all entries (e.g. tokens not matching local experts)
+      cudaMemsetAsync(unpermuted_row_to_permuted_row, -1, expanded_num_rows * sizeof(int), stream);


⚠️ Potential issue | 🔴 Critical

-1 sentinel needs a consumer-side guard before use.

Line 3805 initializes missing map entries to -1, but both finalize paths consume that map as an index without checking for negative values (Line 1728 and Line 1830). This can still dereference invalid rows for unmapped entries.

Proposed fix

--- a/csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
+++ b/csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
@@
-  int64_t const expanded_permuted_row = unpermuted_row_to_permuted_row[expanded_original_row];
+  int64_t const expanded_permuted_row = unpermuted_row_to_permuted_row[expanded_original_row];
+  if (expanded_permuted_row < 0) {
+    continue;
+  }
@@
-  int64_t const expanded_permuted_row_from_k_idx =
-      unpermuted_row_to_permuted_row[source_row + k_idx * num_rows];
+  int64_t const expanded_permuted_row_from_k_idx =
+      unpermuted_row_to_permuted_row[source_row + k_idx * num_rows];
+  if (expanded_permuted_row_from_k_idx < 0) {
+    continue;
+  }


jhaotingc · 2026-04-14T16:45:35Z

Fixed in VLLM by vllm-project/vllm#39391

jhaotingc requested review from IwakuraRein, aleozlx, jiahanc, nv-yunzheq, samuellees and yzh119 as code owners April 7, 2026 23:27

flashinfer-bot added the op: moe label Apr 7, 2026

gemini-code-assist Bot reviewed Apr 7, 2026


jhaotingc force-pushed the fix/moe-permutation-uninit-crash branch from 9c67848 to e26a72b Compare April 7, 2026 23:39

coderabbitai Bot reviewed Apr 7, 2026


jhaotingc closed this Apr 14, 2026

jhaotingc wants to merge 1 commit into flashinfer-ai:main from jhaotingc:fix/moe-permutation-uninit-crash
