Skip to content

Fix: CUDA illegal memory access in MoE three-step sort fallback (num_tokens > 256)#3011

Closed
jhaotingc wants to merge 1 commit into
flashinfer-ai:mainfrom
jhaotingc:fix/moe-permutation-uninit-crash
Closed

Fix: CUDA illegal memory access in MoE three-step sort fallback (num_tokens > 256)#3011
jhaotingc wants to merge 1 commit into
flashinfer-ai:mainfrom
jhaotingc:fix/moe-permutation-uninit-crash

Conversation

@jhaotingc
Copy link
Copy Markdown
Contributor

@jhaotingc jhaotingc commented Apr 7, 2026

📌 Description

Fixes a CUDA illegal memory access crash in cutlass_fused_moe with use_deepseek_fp8_block_scale=True when num_tokens > 256.

Root Cause

When num_tokens > 256, the fused single-block sort kernel (fusedBuildExpertMapsSortFirstTokenKernel, max BLOCK_SIZE=256) cannot handle the batch and falls back to threeStepBuildExpertMapsSortFirstToken.

The three-step fallback's blockExpertPrefixSumKernel uses break after finding the first match of a given expert in a token's top_k list. When a token has duplicate expert selections in its top_k (e.g., experts=[5, 88, 88, 65, ...]), only the first occurrence is recorded — the second slot's entry in unpermuted_row_to_permuted_row is never written.

finalizeMoeRoutingKernel then reads these uninitialized entries as row indices into the GEMM output buffer, causing wild pointer dereferences and CUDA illegal memory access.

Fix

Zero-initialize unpermuted_row_to_permuted_row with -1 before the three-step fallback runs. This is safe because finalizeMoeRoutingKernel checks expert_id validity before accessing the permutation array — entries with -1 produce valid (though redundant) expert lookups that are handled correctly.

cudaMemsetAsync(unpermuted_row_to_permuted_row, -1, expanded_num_rows * sizeof(int), stream);

Performance Impact

None. cudaMemsetAsync is fully async (no CPU-GPU sync), CUDA-graph compatible, and memsets <16KB on the GPU (<1 us). It only runs on the fallback path (num_tokens > 256).

Test plan

1. Serve with FlashInfer MoE (the fixed path)

export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_USE_FLASHINFER_MOE_FP8=1

vllm serve /scratch_omniml_data_3/hf-local/Qwen/Qwen3.5-397B-A17B-FP8 \
  --tensor-parallel-size 4 --enable-expert-parallel \
  --async-scheduling --max-num-seqs 512 \
  --max-num-batched-tokens 4096 --max-model-len 4096 \
  --gpu-memory-utilization 0.9 --no-enable-prefix-caching \
  --trust-remote-code --port 8001 \
  --compilation-config '{"compile_sizes":[1,2,4,8,16,32,64,128,256,512],"cudagraph_capture_sizes":[1,2,4,8,16,32,64,128,256,512]}'

2. Serve with Triton MoE (baseline for accuracy comparison)

export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve /scratch_omniml_data_3/hf-local/Qwen/Qwen3.5-397B-A17B-FP8 \
  --tensor-parallel-size 4 --enable-expert-parallel \
  --async-scheduling --max-num-seqs 512 \
  --max-num-batched-tokens 4096 --max-model-len 4096 \
  --gpu-memory-utilization 0.9 --no-enable-prefix-caching \
  --trust-remote-code --port 8001 \
  --compilation-config '{"compile_sizes":[1,2,4,8,16,32,64,128,256,512],"cudagraph_capture_sizes":[1,2,4,8,16,32,64,128,256,512]}' \
  --kernel-config '{"moe_backend": "triton"}'

3. Accuracy test (lm_eval GSM8K)

lm_eval --model local-completions \
    --model_args base_url=http://localhost:8001/v1/completions,model=/scratch_omniml_data_3/hf-local/Qwen/Qwen3.5-397B-A17B-FP8,num_concurrent=128,tokenized_requests=False \
    --tasks gsm8k \
    --batch_size 128

4. Results

FlashInfer MoE (with fix)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8165|±  |0.0107|
|     |       |strict-match    |     5|exact_match|↑  |0.7991|±  |0.0110|

Triton MoE (baseline)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8036|±  |0.0109|
|     |       |strict-match    |     5|exact_match|↑  |0.7976|±  |0.0111|

Summary

Backend GSM8K (flexible) GSM8K (strict) Crash at CONC 16+?
FlashInfer MoE (before fix) N/A (crash) N/A (crash) Yes
FlashInfer MoE (after fix) 0.8165 0.7991 No
Triton MoE (baseline) 0.8036 0.7976 No

Perf

image

Memset overhead per step = 94 layers × 2 us = 188 us = 0.188 ms
Step time = 50 ms
Slowdown = 0.188 / 50 = 0.38%

~0.4% slowdown — negligible. And this only happens on the three-step path (num_tokens > 256). For decode steps with fewer tokens, the fused path runs and there's zero overhead.

🔍 Related Issues

vllm-project/vllm#39244

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes
    • Improved mixture-of-experts backend stability and reliability when processing edge cases with tokens not matched to local expert assignments.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an asynchronous memory initialization to -1 for permutation arrays within the MoE three-step fallback path to prevent the use of uninitialized memory. While this is a necessary step, review feedback indicates that it does not fully resolve potential illegal memory accesses in the finalizeMoeRoutingKernel, which still lacks checks for negative indices. Additionally, the code comments should be updated to accurately reflect that unpopulated entries are primarily caused by duplicate expert selections rather than tokens not matching local experts.

@@ -3800,6 +3800,9 @@ void CutlassMoeFCRunner<T, WeightType, OutputType, InputType, BackBoneType, IsMX

if (!fused_prologue_result) {
TLLM_LOG_TRACE("Falling back to unfused prologue");
// Fix: zero-init permutation arrays before three-step fallback
// The three-step path may not populate all entries (e.g. tokens not matching local experts)
cudaMemsetAsync(unpermuted_row_to_permuted_row, -1, expanded_num_rows * sizeof(int), stream);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

While initializing unpermuted_row_to_permuted_row to -1 prevents using uninitialized memory, it does not fully resolve the illegal memory access issue in all configurations.

Specifically, finalizeMoeRoutingKernel (used when enable_alltoall is false or ep_size == 1) does not check if the retrieved permuted row index is negative before using it as a pointer offset (see line 1733). If a token has duplicate expert selections, the three-step fallback only records the first occurrence due to the break in blockExpertPrefixSumKernel (line 564), leaving the entries for subsequent occurrences as -1. Accessing expanded_permuted_rows_v + (-1) * num_elems_in_padded_col will still result in an illegal memory access.

While finalizeMoeRoutingNoFillingKernel (used with all-to-all) happens to avoid this by only processing the first local expert per token, the standard finalizeMoeRoutingKernel remains vulnerable. You should add a check in the finalize kernels to skip entries where the permuted row index is negative.

@@ -3800,6 +3800,9 @@ void CutlassMoeFCRunner<T, WeightType, OutputType, InputType, BackBoneType, IsMX

if (!fused_prologue_result) {
TLLM_LOG_TRACE("Falling back to unfused prologue");
// Fix: zero-init permutation arrays before three-step fallback
// The three-step path may not populate all entries (e.g. tokens not matching local experts)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment here is slightly inaccurate. The reason the three-step path may not populate all entries is not primarily due to "tokens not matching local experts" (those are already handled by the expert_id check in the finalize kernels), but rather due to the fact that blockExpertPrefixSumKernel only records the first occurrence of an expert for a given token, skipping any duplicate selections of the same expert in the token's top-k list.

      // The three-step path may not populate all entries if a token has duplicate expert selections

…allback

When num_tokens > 256, the fused sort kernel cannot handle the batch and
falls back to threeStepBuildExpertMapsSortFirstToken. This three-step path
does not populate all entries of unpermuted_row_to_permuted_row, leaving
some entries uninitialized with garbage values.

finalizeMoeRoutingKernel then reads these garbage values as row indices
into the GEMM output buffer, causing CUDA illegal memory access.

The fix initializes the array to -1 before the three-step path runs.
Entries with -1 are safely skipped by finalizeMoeRoutingKernel because
the corresponding expert_id check filters them out via continue.

Crash repro: cutlass_fused_moe with use_deepseek_fp8_block_scale=True,
num_tokens=257+, 128 experts, top_k=8, hidden=5120, intermediate=8960.

Verified: BS=256,257,288,384,512 all pass with 10 replays each.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
@jhaotingc jhaotingc force-pushed the fix/moe-permutation-uninit-crash branch from 9c67848 to e26a72b Compare April 7, 2026 23:39
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 7, 2026

📝 Walkthrough

Walkthrough

A cudaMemsetAsync call was added to initialize the unpermuted_row_to_permuted_row buffer to -1 before invoking threeStepBuildExpertMapsSortFirstToken in the unfused prologue fallback path. This ensures a known sentinel value for permutation map entries when not all indices may be populated.

Changes

Cohort / File(s) Summary
CUTLASS MOE Kernel Initialization
csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
Added memory initialization via cudaMemsetAsync to set unpermuted_row_to_permuted_row to -1 for expanded_num_rows entries before expert map construction.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

run-ci

Suggested reviewers

  • yzh119
  • cyx-6
  • jimmyzho
  • yongwww
  • djmmoss
  • sricketts
  • aleozlx

Poem

🐰✨ A dash of -1 in the async stream,
Sentinel values fulfill a dream—
Every row now knows its place,
Before the expert map sets pace!
Initialization's gentle touch,
Prevents lost tokens oh so much. 🌙

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the fix: a CUDA illegal memory access in MoE three-step sort fallback triggered when num_tokens > 256.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering the bug root cause, fix details, performance impact, and extensive test plan with quantitative results.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh`:
- Around line 3803-3805: The sentinel value -1 stored in
unpermuted_row_to_permuted_row must be checked by consumers before being used as
an index; update the finalize code paths that read this map (the two
finalize/consume sites referenced) to guard before dereferencing—e.g., read int
perm = unpermuted_row_to_permuted_row[i]; if (perm < 0) { /* skip / handle
unmapped row */ } else { use perm as index }—so unmapped entries are skipped or
handled safely instead of being used as array indices.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b5cdbe3a-e058-4810-a3ef-e0ae09213a2b

📥 Commits

Reviewing files that changed from the base of the PR and between 1fd6305 and e26a72b.

📒 Files selected for processing (1)
  • csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh

Comment on lines +3803 to +3805
// Fix: zero-init permutation arrays before three-step fallback
// The three-step path may not populate all entries (e.g. tokens not matching local experts)
cudaMemsetAsync(unpermuted_row_to_permuted_row, -1, expanded_num_rows * sizeof(int), stream);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

-1 sentinel needs a consumer-side guard before use.

Line 3805 initializes missing map entries to -1, but both finalize paths consume that map as an index without checking for negative values (Line 1728 and Line 1830). This can still dereference invalid rows for unmapped entries.

Proposed fix
--- a/csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
+++ b/csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
@@
-      int64_t const expanded_permuted_row = unpermuted_row_to_permuted_row[expanded_original_row];
+      int64_t const expanded_permuted_row = unpermuted_row_to_permuted_row[expanded_original_row];
+      if (expanded_permuted_row < 0) {
+        continue;
+      }
@@
-        int64_t const expanded_permuted_row_from_k_idx =
-            unpermuted_row_to_permuted_row[source_row + k_idx * num_rows];
+        int64_t const expanded_permuted_row_from_k_idx =
+            unpermuted_row_to_permuted_row[source_row + k_idx * num_rows];
+        if (expanded_permuted_row_from_k_idx < 0) {
+          continue;
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh` around lines
3803 - 3805, The sentinel value -1 stored in unpermuted_row_to_permuted_row must
be checked by consumers before being used as an index; update the finalize code
paths that read this map (the two finalize/consume sites referenced) to guard
before dereferencing—e.g., read int perm = unpermuted_row_to_permuted_row[i]; if
(perm < 0) { /* skip / handle unmapped row */ } else { use perm as index }—so
unmapped entries are skipped or handled safely instead of being used as array
indices.

@jhaotingc
Copy link
Copy Markdown
Contributor Author

Fixed in VLLM by vllm-project/vllm#39391

@jhaotingc jhaotingc closed this Apr 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants