Fix: CUDA illegal memory access in MoE three-step sort fallback (num_tokens > 256) #3011
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3800,6 +3800,9 @@ void CutlassMoeFCRunner<T, WeightType, OutputType, InputType, BackBoneType, IsMX | |
|
|
||
| if (!fused_prologue_result) { | ||
| TLLM_LOG_TRACE("Falling back to unfused prologue"); | ||
| // Fix: zero-init permutation arrays before three-step fallback | ||
| // The three-step path may not populate all entries (e.g. tokens not matching local experts) | ||
| cudaMemsetAsync(unpermuted_row_to_permuted_row, -1, expanded_num_rows * sizeof(int), stream); | ||
|
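A note on the fill value in the added `cudaMemsetAsync` call: the function writes a single byte value into every byte of the range, so passing `-1` sets each byte to `0xFF` and every `int` reads back as `-1` (the diff comment says "zero-init", but the call passes `-1`). The sketch below is a host-side analogue using `std::memset`, which has the same per-byte semantics; `byteFill` is a hypothetical helper and not part of the PR.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-side analogue of the cudaMemsetAsync call in the diff.
// std::memset, like cudaMemset, writes one byte value into every byte.
std::vector<int> byteFill(std::size_t count, int byte_value)
{
    std::vector<int> buffer(count, 0);
    // Size is given in bytes, matching `expanded_num_rows * sizeof(int)` in the diff.
    std::memset(buffer.data(), byte_value, count * sizeof(int));
    return buffer;
}
```

Filling with `-1` or `0` produces the matching integer in every element, because all bytes of those values are identical. Any other fill value (for example `1`) would not yield that integer per element, which is why a byte-wise fill only works as a sentinel initializer for values like `-1`.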
Contributor
While initializing Specifically, While
Comment on lines +3803 to +3805
Contributor
Line 3805 initializes missing map entries to
Proposed fix
--- a/csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
+++ b/csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
@@
- int64_t const expanded_permuted_row = unpermuted_row_to_permuted_row[expanded_original_row];
+ int64_t const expanded_permuted_row = unpermuted_row_to_permuted_row[expanded_original_row];
+ if (expanded_permuted_row < 0) {
+ continue;
+ }
@@
- int64_t const expanded_permuted_row_from_k_idx =
- unpermuted_row_to_permuted_row[source_row + k_idx * num_rows];
+ int64_t const expanded_permuted_row_from_k_idx =
+ unpermuted_row_to_permuted_row[source_row + k_idx * num_rows];
+ if (expanded_permuted_row_from_k_idx < 0) {
+ continue;
+ }
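The guard proposed above can be illustrated with a simplified host-side sketch. It assumes the map is indexed as `source_row + k_idx * num_rows`, as in the proposed diff, and that unpopulated entries hold a negative sentinel. The function name `reduceTokenSkippingUnmapped` and the flat `permuted_values` buffer are hypothetical stand-ins for the real finalize kernels, which are not shown in this review.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a finalize-style reduction; NOT the real kernel.
// Sums the permuted values for one token, skipping unmapped (negative) entries.
float reduceTokenSkippingUnmapped(std::vector<float> const& permuted_values,
    std::vector<int> const& unpermuted_row_to_permuted_row, int64_t source_row, int64_t num_rows,
    int top_k)
{
    float acc = 0.0f;
    for (int k_idx = 0; k_idx < top_k; ++k_idx)
    {
        int64_t const expanded_permuted_row
            = unpermuted_row_to_permuted_row[source_row + k_idx * num_rows];
        if (expanded_permuted_row < 0)
        {
            // Sentinel entry: nothing was written for this (token, k) pair.
            // Without this check the read below would index out of bounds.
            continue;
        }
        acc += permuted_values[expanded_permuted_row];
    }
    return acc;
}
```

With a map of `{0, 2, -1, 1}` for two tokens and `top_k = 2`, token 0 contributes only its first selection, while token 1 sums both of its selections.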
||
| threeStepBuildExpertMapsSortFirstToken( | ||
| token_selected_experts, permuted_token_selected_experts_, permuted_row_to_unpermuted_row_, | ||
| unpermuted_row_to_permuted_row, expert_first_token_offset_, blocked_expert_counts_, | ||
|
|
||
The comment here is slightly inaccurate. The three-step path may leave entries unpopulated not primarily because of "tokens not matching local experts" (those are already handled by the `expert_id` check in the finalize kernels), but because `blockExpertPrefixSumKernel` records only the first occurrence of an expert for a given token and skips any duplicate selections of the same expert in the token's top-k list.
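To make the duplicate-selection case concrete, here is a simplified host-side model of a map builder that records only the first matching top-k slot per (expert, token) pair. It is an illustration of the behavior described in this comment, not the actual `blockExpertPrefixSumKernel`; the function name `buildMapFirstOccurrenceOnly`, the `[num_tokens][top_k]` input layout, and the sequential row assignment are assumptions made for the example.

```cpp
#include <vector>

// Hypothetical host-side model; NOT the actual blockExpertPrefixSumKernel.
// token_selected_experts is assumed to be laid out as [num_tokens][top_k].
// The map is indexed as token + k * num_tokens, matching the
// `source_row + k_idx * num_rows` indexing quoted in the review above.
std::vector<int> buildMapFirstOccurrenceOnly(std::vector<int> const& token_selected_experts,
    int num_tokens, int top_k, int num_experts)
{
    // Start from the -1 sentinel, as the fix in this PR does.
    std::vector<int> unpermuted_row_to_permuted_row(num_tokens * top_k, -1);
    int next_permuted_row = 0;
    for (int expert = 0; expert < num_experts; ++expert)
    {
        for (int token = 0; token < num_tokens; ++token)
        {
            for (int k = 0; k < top_k; ++k)
            {
                if (token_selected_experts[token * top_k + k] == expert)
                {
                    unpermuted_row_to_permuted_row[token + k * num_tokens] = next_permuted_row++;
                    break; // only the first occurrence is recorded; duplicates are skipped
                }
            }
        }
    }
    return unpermuted_row_to_permuted_row;
}
```

For two tokens with `top_k = 2`, where token 0 selects expert 0 twice and token 1 selects experts 1 and 0, the entry for token 0's second slot is never written and keeps the sentinel. If the buffer were left uninitialized instead, that slot would hold arbitrary data, which is consistent with the illegal memory access this PR addresses.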