Fix: CUDA illegal memory access in MoE three-step sort fallback (num_tokens > 256)#3011
Fix: CUDA illegal memory access in MoE three-step sort fallback (num_tokens > 256)#3011jhaotingc wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces an asynchronous memory initialization to -1 for permutation arrays within the MoE three-step fallback path to prevent the use of uninitialized memory. While this is a necessary step, review feedback indicates that it does not fully resolve potential illegal memory accesses in the finalizeMoeRoutingKernel, which still lacks checks for negative indices. Additionally, the code comments should be updated to accurately reflect that unpopulated entries are primarily caused by duplicate expert selections rather than tokens not matching local experts.
| @@ -3800,6 +3800,9 @@ void CutlassMoeFCRunner<T, WeightType, OutputType, InputType, BackBoneType, IsMX | |||
|
|
|||
| if (!fused_prologue_result) { | |||
| TLLM_LOG_TRACE("Falling back to unfused prologue"); | |||
| // Fix: zero-init permutation arrays before three-step fallback | |||
| // The three-step path may not populate all entries (e.g. tokens not matching local experts) | |||
| cudaMemsetAsync(unpermuted_row_to_permuted_row, -1, expanded_num_rows * sizeof(int), stream); | |||
There was a problem hiding this comment.
While initializing unpermuted_row_to_permuted_row to -1 prevents using uninitialized memory, it does not fully resolve the illegal memory access issue in all configurations.
Specifically, finalizeMoeRoutingKernel (used when enable_alltoall is false or ep_size == 1) does not check if the retrieved permuted row index is negative before using it as a pointer offset (see line 1733). If a token has duplicate expert selections, the three-step fallback only records the first occurrence due to the break in blockExpertPrefixSumKernel (line 564), leaving the entries for subsequent occurrences as -1. Accessing expanded_permuted_rows_v + (-1) * num_elems_in_padded_col will still result in an illegal memory access.
While finalizeMoeRoutingNoFillingKernel (used with all-to-all) happens to avoid this by only processing the first local expert per token, the standard finalizeMoeRoutingKernel remains vulnerable. You should add a check in the finalize kernels to skip entries where the permuted row index is negative.
| @@ -3800,6 +3800,9 @@ void CutlassMoeFCRunner<T, WeightType, OutputType, InputType, BackBoneType, IsMX | |||
|
|
|||
| if (!fused_prologue_result) { | |||
| TLLM_LOG_TRACE("Falling back to unfused prologue"); | |||
| // Fix: zero-init permutation arrays before three-step fallback | |||
| // The three-step path may not populate all entries (e.g. tokens not matching local experts) | |||
There was a problem hiding this comment.
The comment here is slightly inaccurate. The reason the three-step path may not populate all entries is not primarily due to "tokens not matching local experts" (those are already handled by the expert_id check in the finalize kernels), but rather due to the fact that blockExpertPrefixSumKernel only records the first occurrence of an expert for a given token, skipping any duplicate selections of the same expert in the token's top-k list.
// The three-step path may not populate all entries if a token has duplicate expert selections
…allback When num_tokens > 256, the fused sort kernel cannot handle the batch and falls back to threeStepBuildExpertMapsSortFirstToken. This three-step path does not populate all entries of unpermuted_row_to_permuted_row, leaving some entries uninitialized with garbage values. finalizeMoeRoutingKernel then reads these garbage values as row indices into the GEMM output buffer, causing CUDA illegal memory access. The fix initializes the array to -1 before the three-step path runs. Entries with -1 are safely skipped by finalizeMoeRoutingKernel because the corresponding expert_id check filters them out via continue. Crash repro: cutlass_fused_moe with use_deepseek_fp8_block_scale=True, num_tokens=257+, 128 experts, top_k=8, hidden=5120, intermediate=8960. Verified: BS=256,257,288,384,512 all pass with 10 replays each. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
9c67848 to
e26a72b
Compare
📝 WalkthroughWalkthroughA Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh`:
- Around line 3803-3805: The sentinel value -1 stored in
unpermuted_row_to_permuted_row must be checked by consumers before being used as
an index; update the finalize code paths that read this map (the two
finalize/consume sites referenced) to guard before dereferencing—e.g., read int
perm = unpermuted_row_to_permuted_row[i]; if (perm < 0) { /* skip / handle
unmapped row */ } else { use perm as index }—so unmapped entries are skipped or
handled safely instead of being used as array indices.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b5cdbe3a-e058-4810-a3ef-e0ae09213a2b
📒 Files selected for processing (1)
csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
| // Fix: zero-init permutation arrays before three-step fallback | ||
| // The three-step path may not populate all entries (e.g. tokens not matching local experts) | ||
| cudaMemsetAsync(unpermuted_row_to_permuted_row, -1, expanded_num_rows * sizeof(int), stream); |
There was a problem hiding this comment.
-1 sentinel needs a consumer-side guard before use.
Line 3805 initializes missing map entries to -1, but both finalize paths consume that map as an index without checking for negative values (Line 1728 and Line 1830). This can still dereference invalid rows for unmapped entries.
Proposed fix
--- a/csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
+++ b/csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
@@
- int64_t const expanded_permuted_row = unpermuted_row_to_permuted_row[expanded_original_row];
+ int64_t const expanded_permuted_row = unpermuted_row_to_permuted_row[expanded_original_row];
+ if (expanded_permuted_row < 0) {
+ continue;
+ }
@@
- int64_t const expanded_permuted_row_from_k_idx =
- unpermuted_row_to_permuted_row[source_row + k_idx * num_rows];
+ int64_t const expanded_permuted_row_from_k_idx =
+ unpermuted_row_to_permuted_row[source_row + k_idx * num_rows];
+ if (expanded_permuted_row_from_k_idx < 0) {
+ continue;
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh` around lines
3803 - 3805, The sentinel value -1 stored in unpermuted_row_to_permuted_row must
be checked by consumers before being used as an index; update the finalize code
paths that read this map (the two finalize/consume sites referenced) to guard
before dereferencing—e.g., read int perm = unpermuted_row_to_permuted_row[i]; if
(perm < 0) { /* skip / handle unmapped row */ } else { use perm as index }—so
unmapped entries are skipped or handled safely instead of being used as array
indices.
|
Fixed in VLLM by vllm-project/vllm#39391 |
📌 Description
Fixes a CUDA illegal memory access crash in cutlass_fused_moe with use_deepseek_fp8_block_scale=True when num_tokens > 256.
Root Cause
When num_tokens > 256, the fused single-block sort kernel (fusedBuildExpertMapsSortFirstTokenKernel, max BLOCK_SIZE=256) cannot handle the batch and falls back to threeStepBuildExpertMapsSortFirstToken.
The three-step fallback's blockExpertPrefixSumKernel uses break after finding the first match of a given expert in a token's top_k list. When a token has duplicate expert selections in its top_k (e.g., experts=[5, 88, 88, 65, ...]), only the first occurrence is recorded — the second slot's entry in unpermuted_row_to_permuted_row is never written.
finalizeMoeRoutingKernel then reads these uninitialized entries as row indices into the GEMM output buffer, causing wild pointer dereferences and CUDA illegal memory access.
Fix
Zero-initialize unpermuted_row_to_permuted_row with -1 before the three-step fallback runs. This is safe because finalizeMoeRoutingKernel checks expert_id validity before accessing the permutation array — entries with -1 produce valid (though redundant) expert lookups that are handled correctly.
Performance Impact
None. cudaMemsetAsync is fully async (no CPU-GPU sync), CUDA-graph compatible, and memsets <16KB on the GPU (<1 us). It only runs on the fallback path (num_tokens > 256).
Test plan
1. Serve with FlashInfer MoE (the fixed path)
2. Serve with Triton MoE (baseline for accuracy comparison)
3. Accuracy test (lm_eval GSM8K)
lm_eval --model local-completions \ --model_args base_url=http://localhost:8001/v1/completions,model=/scratch_omniml_data_3/hf-local/Qwen/Qwen3.5-397B-A17B-FP8,num_concurrent=128,tokenized_requests=False \ --tasks gsm8k \ --batch_size 1284. Results
FlashInfer MoE (with fix)
Triton MoE (baseline)
Summary
Perf
Memset overhead per step = 94 layers × 2 us = 188 us = 0.188 ms
Step time = 50 ms
Slowdown = 0.188 / 50 = 0.38%
~0.4% slowdown — negligible. And this only happens on the three-step path (num_tokens > 256). For decode steps with fewer tokens, the fused path runs and there's zero overhead.
🔍 Related Issues
vllm-project/vllm#39244
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
pre-commitby runningpip install pre-commit(or used your preferred method).pre-commit install.pre-commit run --all-filesand fixed any reported issues.🧪 Tests
unittest, etc.).Reviewer Notes
Summary by CodeRabbit