Second part of refactoring the routing part #2993
Conversation
📝 Walkthrough

Adds a dynamic single-block routing kernel and dispatch path (bounded by token/expert limits), removes host-side PDL overlap orchestration and last-kernel pre-copy, simplifies the runPostTopKPipeline signature (now computes histogram threads internally), relaxes DeepSeekV3 routing-logits dtype checks, and updates tests/autotuner for packed TopK semantics.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Launcher as Launcher (prepare_routing)
    participant Dispatch as Routing Dispatch (run)
    participant DynBlock as DynBlock Kernel
    participant Static as Static/Cluster/Coop Kernels
    participant PostTopK as Post-TopK Pipeline
    Launcher->>Dispatch: run(Data, stream) (dtype selection simplified)
    alt token<=16 && experts<=512 (useDynBlock)
        Dispatch->>DynBlock: launchDynBlockKernel(data, stream)
        DynBlock->>DynBlock: allocate dynamic shared memory
        DynBlock->>DynBlock: build TopK→histograms, warpExclusiveScan
        DynBlock->>PostTopK: write permutation & CTA configs
        PostTopK->>PostTopK: runPostTopKPipeline(data, stream)
    else fallback
        Dispatch->>Static: launchBlock/Cluster/Coop or histogram/offset kernels
        Static->>PostTopK: produce histograms/offsets
        PostTopK->>PostTopK: runPostTopKPipeline(data, stream)
    end
    PostTopK-->>Launcher: routing complete
```
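The alt branch above amounts to a small host-side predicate. A hypothetical Python model of the selection order described in the diagram (not the actual host code; the static-block token limit of 4 is an assumption, while the 16-token/512-expert bounds come from the walkthrough):

```python
# Assumed limits; only the 16-token / 512-expert dyn-block bounds are stated
# in the walkthrough, the static-block limit of 4 is illustrative.
BLOCK_KERNEL_MAX_NUM_TOKENS = 4
DYN_BLOCK_MAX_NUM_TOKENS = 16
DYN_BLOCK_MAX_NUM_EXPERTS = 512

def select_routing_kernel(num_tokens: int, num_experts: int) -> str:
    """Pick a kernel family in the order the dispatch path uses."""
    if num_tokens <= BLOCK_KERNEL_MAX_NUM_TOKENS:
        return "static_block"
    if num_tokens <= DYN_BLOCK_MAX_NUM_TOKENS and num_experts <= DYN_BLOCK_MAX_NUM_EXPERTS:
        return "dyn_block"
    return "cluster_or_coop"
```

Both branches then feed the same post-TopK pipeline, which is why the dyn-block path only needs to produce the permutation and CTA configs before handing off.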
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Code Review
This pull request introduces a dynamic block kernel for MoE routing to handle a wider range of tokens (up to 16) and experts (up to 512), filling the gap between the static block kernel and cluster/cooperative kernels. It also refactors the routing policy dispatching for better maintainability and simplifies the Programmatic Dependency Launch (PDL) logic by removing the mPdlOverlapWithNext flag. Additionally, new tiers for up to 1024 experts have been added to support larger models. The review feedback suggests enhancing the warning messages for missing tiers by recommending the next appropriate tiered expert count using getMaxNumExperts instead of the raw runtime value.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu (1)
535-540: ⚠️ Potential issue | 🔴 Critical — Keep `mPtrExpertCounts` valid for the single-cluster path.

After Line 540, `launchClusterKernel()` receives `mPtrExpertCounts == nullptr`, but `routingIndicesClusterKernel()` still funnels into `routingPermutation()`, and that code unconditionally `atomicAdd`s through `params.mPtrExpertCounts` in `include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh:573-576` and `1003-1006`. That turns the single-cluster path into an illegal global write. If you want this buffer to become optional here, `routingPermutation()` needs a null-safe variant first.

Suggested fix

```diff
   if (!useSingleCluster) {
     FLASHINFER_CHECK(data.mPtrExpertCounts != nullptr,
                      "When `#tokens` is large, `mPtrExpertCounts` is a required input.");
-  } else {
-    data.mPtrExpertCounts = nullptr;
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu` around lines 535-540: when useSingleCluster is true you set data.mPtrExpertCounts = nullptr, which later causes routingPermutation() (called via routingIndicesClusterKernel() and launchClusterKernel()) to perform an illegal atomicAdd through params.mPtrExpertCounts; instead ensure mPtrExpertCounts is a valid device pointer for the single-cluster path by allocating or reusing a small zero-initialized device buffer and assigning its pointer to data.mPtrExpertCounts before calling launchClusterKernel(), make the buffer size match the expected expert count, keep its lifetime until after the kernels complete, and free or reuse it appropriately (alternatively implement a null-safe routingPermutation(), but the immediate fix is to provide a valid zeroed device buffer).

csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu (1)
857-865: ⚠️ Potential issue | 🟠 Major — Keep the precomputed-topK fast path behind the shared expert-count validation.

This branch reaches `runPostTopKPipeline(data, stream)` before the `mNumExperts <= MaxSupportedExperts` guard below. In `csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_common.cu:40-73`, the refactored helper derives `numThreadsHist` from `getMaxNumExperts(data.mNumExperts)`, which returns 0 for unsupported expert counts, so this path can fall into an invalid launch instead of failing cleanly. Move the shared validation above the fast path or validate inside `runPostTopKPipeline()`.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu` around lines 857-865: the fast path that calls runPostTopKPipeline(data, stream) is executed before the shared expert-count validation, risking invalid kernel launches when getMaxNumExperts(data.mNumExperts) would return 0 for unsupported expert counts; move the guard that checks data.mNumExperts <= MaxSupportedExperts (or equivalently validate getMaxNumExperts(data.mNumExperts) > 0) to run before the branch that checks data.mPtrTopKIds / data.mPtrTopKPacked && data.mPtrScores so the precomputed-topK fast path runs only for supported expert counts, or alternatively add the same expert-count validation at the start of runPostTopKPipeline to enforce the check there (ensure references to data.mNumExperts, MaxSupportedExperts, getMaxNumExperts, and runPostTopKPipeline are covered).
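The ordering problem this comment describes can be sketched in a few lines of Python. `get_max_num_experts` below is a hypothetical stand-in for the real tier-rounding helper, and the tier set is an assumption; the point is only that the shared guard must run before the fast path.

```python
# Hypothetical model of the requested control flow, not the real host code.
MAX_SUPPORTED_TIERS = (128, 256, 512, 1024)  # assumed tier set

def get_max_num_experts(num_experts: int) -> int:
    """Round up to the smallest supported tier; 0 means unsupported."""
    for tier in MAX_SUPPORTED_TIERS:
        if num_experts <= tier:
            return tier
    return 0

def run_routing(num_experts: int, has_precomputed_topk: bool) -> str:
    # Shared validation first: otherwise an unsupported expert count reaches
    # the post-TopK pipeline, where 0 histogram threads means an invalid launch.
    if get_max_num_experts(num_experts) == 0:
        raise ValueError(f"unsupported expert count: {num_experts}")
    if has_precomputed_topk:
        return "runPostTopKPipeline"  # fast path, now safely guarded
    return "full_routing"
```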
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu`:
- Around line 453-455: The predicate that computes isLocal uses localExpIdx <
params.mNumLocalExperts but must account for stride expansion; update the checks
that compute isLocal (around the use of localExpIdx,
params.mLocalExpertsStartIdx, params.mNumLocalExperts,
params.mLocalExpertsStrideLog2) to compare against the stride-expanded extent
(i.e., use params.mNumLocalExperts << params.mLocalExpertsStrideLog2 or
equivalent) so experts in the upper part of the strided window are considered
local; apply the same fix to the other occurrences referenced (the similar
predicates around the blocks at the other two locations that check localExpIdx
and mNumLocalExperts).
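A minimal Python sketch of the corrected bound, assuming the strided-window semantics the comment describes (parameter names follow the fields it references; this is illustrative, not the kernel code):

```python
def is_local_expert(exp_idx: int, start_idx: int, num_local: int, stride_log2: int) -> bool:
    """Compare against the stride-expanded extent (num_local << stride_log2)
    rather than num_local, so experts in the upper part of the strided
    window still count as local."""
    local_idx = exp_idx - start_idx
    return 0 <= local_idx < (num_local << stride_log2)
```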
- Around line 485-515: numCtaPerExpert and tmpCountPerExpert are left as
cluster-counts but later code and MnLimit expect per-CTA counts; multiply the
cluster counts by params.mClusterSizeInBatchDim (or the equivalent cluster-size
field) after computing them in the branch that sets
numCtaPerExpert/tmpCountPerExpert so they become per-CTA units, and ensure any
writes to mPtrNumNonExitingCtas, mPtrCtaIdxXyToBatchIdx, and
mPtrCtaIdxXyToMnLimit (and the code paths around
ctaOffsetPerExpert/expertScanCountsPerExpert) use these converted per-CTA
values; apply the same fix to the other blocks the reviewer called out (the
other two similar ranges).
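The unit conversion this comment asks for amounts to one multiplication; a hedged sketch (field name taken from the comment, cluster semantics assumed):

```python
def to_per_cta_counts(cluster_counts, cluster_size_in_batch_dim):
    """Convert per-cluster CTA counts into per-CTA units by scaling with the
    cluster size (mClusterSizeInBatchDim), as the MnLimit bookkeeping expects."""
    return [c * cluster_size_in_batch_dim for c in cluster_counts]
```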
- Around line 559-562: The call to cudaTriggerProgrammaticLaunchCompletion() is
currently placed before the Phase 5 permutation loop and can release dependent
grids while Phase 5 is still writing mPtrExpandedIdxToPermutedIdx,
mPtrPermutedIdxToExpandedIdx, and mPtrPermutedIdxToTokenIdx; move the
conditional call to cudaTriggerProgrammaticLaunchCompletion() (inside the `#if`
CUDA_ARCH >= 900 and guarded by params.mUsePdl) to immediately after the Phase 5
permutation loop finishes (after the block that populates those three buffers)
so PDL-dependent grids are only triggered once all stores are complete.
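The ordering constraint here is the classic release-after-write pattern. A CPU analogy using `threading.Event` as a stand-in for `cudaTriggerProgrammaticLaunchCompletion()` (buffer names echo the comment; everything else is illustrative):

```python
import threading

# Stand-ins for the Phase-5 output buffers named in the comment.
buffers = {"expanded_to_permuted": [], "permuted_to_token": []}
done = threading.Event()

def producer():
    buffers["expanded_to_permuted"].extend(range(4))  # Phase-5 permutation writes
    buffers["permuted_to_token"].extend(range(4))
    done.set()  # trigger only once all stores are complete

def consumer(out):
    done.wait()  # a PDL-dependent grid would start here
    out.append(len(buffers["expanded_to_permuted"]))

result = []
t_cons = threading.Thread(target=consumer, args=(result,))
t_prod = threading.Thread(target=producer)
t_cons.start(); t_prod.start()
t_prod.join(); t_cons.join()
```

Setting the event before the writes (the bug the comment flags) would let the consumer observe partially written buffers.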
In `@include/flashinfer/trtllm/fused_moe/RoutingCustomPolicy.cuh`:
- Around line 633-635: The warning currently stringifies
"decltype(preProc_)+decltype(postProc_)" because LAUNCH_ROUTING_FOR_POLICY uses
`#PreProc`/`#PostProc` but LAUNCH_ROUTING_CUSTOM passes
decltype(preProc_)/decltype(postProc_); fix by passing pre-stringified policy
identifiers (or separate name args) instead of decltype(...) or by moving the
warning construction into dispatchRoutingPolicy so it can build a human-readable
policy name from the actual pre/post proc types. Concretely, update the
LAUNCH_ROUTING_CUSTOM invocation inside dispatchRoutingPolicy to supply the
original policy symbol names (e.g., PreProc, PostProc or provided name strings)
to LAUNCH_ROUTING_FOR_POLICY (or extend dispatchRoutingPolicy to accept and
forward policy-name strings), and adjust the macro call sites using
dispatchRoutingPolicy, LAUNCH_ROUTING_FOR_POLICY, preProc_, and postProc_
accordingly so the warning prints the real policy names rather than
decltype(...) text.
---
Outside diff comments:
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu`:
- Around line 857-865: The fast-path that calls runPostTopKPipeline(data,
stream) is executed before the shared expert-count validation, risking invalid
kernel launches when getMaxNumExperts(data.mNumExperts) would return 0 for
unsupported expert counts; move the guard that checks data.mNumExperts <=
MaxSupportedExperts (or equivalently validate getMaxNumExperts(data.mNumExperts)
> 0) to run before the branch that checks data.mPtrTopKIds / data.mPtrTopKPacked
&& data.mPtrScores so the precomputed-topK fast path runs only for supported
expert counts, or alternatively add the same expert-count validation at the
start of runPostTopKPipeline to enforce the check there (ensure references to
data.mNumExperts, MaxSupportedExperts, getMaxNumExperts, and runPostTopKPipeline
are covered).
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu`:
- Around line 535-540: When useSingleCluster is true you set
data.mPtrExpertCounts = nullptr which later causes routingPermutation() (called
via routingIndicesClusterKernel() and launchClusterKernel()) to perform illegal
atomicAdd through params.mPtrExpertCounts; instead ensure mPtrExpertCounts is a
valid device pointer for the single-cluster path by allocating or reusing a
small zero-initialized device buffer and assigning its pointer to
data.mPtrExpertCounts before calling launchClusterKernel(), make the buffer size
match the expected expert count, keep its lifetime until after the kernels
complete, and free or reuse it appropriately (alternatively implement a
null-safe routingPermutation(), but the immediate fix is to provide a valid
zeroed device buffer).
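In Python terms, the failure mode above is an unconditional write through a pointer that may be null. A toy model (a plain list stands in for the device buffer, `+=` for the atomicAdd) of why a zero-initialized buffer is the minimal fix:

```python
def routing_permutation(topk_ids, expert_counts):
    """Toy stand-in for routingPermutation(): it always writes through the
    counts buffer, so the caller must never pass None."""
    for e in topk_ids:
        expert_counts[e] += 1  # stands in for the unconditional atomicAdd
    return expert_counts

num_experts = 8
counts = routing_permutation([1, 1, 3, 7], [0] * num_experts)  # zeroed buffer: fine
# routing_permutation([1], None) would raise, just as the null pointer turns
# the single-cluster path into an illegal global write.
```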
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 160912fe-a37a-42c1-8539-cf0c3aa4429f
📒 Files selected for processing (9)
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_common.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
- csrc/trtllm_fused_moe_kernel_launcher.cu
- include/flashinfer/trtllm/fused_moe/RoutingCustomPolicy.cuh
- include/flashinfer/trtllm/fused_moe/RoutingDevKernel.h
- include/flashinfer/trtllm/fused_moe/RoutingKernel.h
- tests/moe/test_trtllm_gen_fused_moe.py
/bot run

[SUCCESS] Pipeline #47829409: 10/20 passed
jiahanc left a comment:

LGTM. Thanks for contribution!
@flashinfer-bot run

/bot run

cc @leejnau @trevor-m for vis since this is related to DSV3 and MiniMax2.

[SUCCESS] Pipeline #47894433: 10/20 passed

/bot run
…PdlOverlapWithNext; Remove DeepSeekV3 float32 logits constraint from kernel launchers

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/moe/test_trtllm_gen_fused_moe.py (1)
4048-4063: Add one MiniMax2 case above 512 experts.

This config only exercises MiniMax2 at 256 experts, so it never hits the new `Tier<1024, 32>` fallback. The 1024-expert coverage added in this PR is DeepSeekV3-only, so a MiniMax2 regression in the new dispatch tier would still slip through.

🧪 Possible follow-up matrix entry

```diff
+        pytest.param(
+            {
+                "num_experts": 1024,
+                "top_k": 6,
+                "padding": 8,
+                "n_groups": None,
+                "top_k_groups": None,
+                "routed_scaling": None,
+                "has_routing_bias": True,
+                "routing_method_type": RoutingMethodType.MiniMax2,
+                "compatible_moe_impls": [FP8BlockScaleMoe],
+                "compatible_intermediate_size": [512],
+                "enable_autotune": False,
+            },
+            id="MiniMax2_1024e",
+        ),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/moe/test_trtllm_gen_fused_moe.py` around lines 4048 - 4063, Add a new pytest param exercising RoutingMethodType.MiniMax2 with num_experts >= 1024 so the test hits the Tier<1024, 32> fallback; replicate the existing MiniMax2 dict (keys: "num_experts", "top_k", "padding", "n_groups", "top_k_groups", "routed_scaling", "has_routing_bias", "routing_method_type", "compatible_moe_impls", "compatible_intermediate_size", "enable_autotune") but set "num_experts" to 1024 (or above) and give it a distinct id such as "MiniMax2_1024e" so the new dispatch tier is covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Line 1876: The torch.topk call currently assigns two values but only uses the
second; replace the unused binding topk_values with an underscore (_) to avoid
the unused-variable lint error — i.e., change the assignment of
torch.topk(selection_scores, k=top_k, dim=-1) so the first element is named _
and the second remains topk_idx (refer to the existing torch.topk(...) call and
the topk_idx variable).
---
Nitpick comments:
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 4048-4063: Add a new pytest param exercising
RoutingMethodType.MiniMax2 with num_experts >= 1024 so the test hits the
Tier<1024, 32> fallback; replicate the existing MiniMax2 dict (keys:
"num_experts", "top_k", "padding", "n_groups", "top_k_groups", "routed_scaling",
"has_routing_bias", "routing_method_type", "compatible_moe_impls",
"compatible_intermediate_size", "enable_autotune") but set "num_experts" to 1024
(or above) and give it a distinct id such as "MiniMax2_1024e" so the new
dispatch tier is covered.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cd0be958-d3c7-4c9e-bea8-5adf48f99e2a
📒 Files selected for processing (4)
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu
- flashinfer/fused_moe/core.py
- tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py
- tests/moe/test_trtllm_gen_fused_moe.py
🚧 Files skipped from review as they are similar to previous changes (1)
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu
…o prevent uninitialized memory from duplicate expert IDs

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>

Force-pushed 930f9b8 to ce01e94 (Compare)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
flashinfer/fused_moe/core.py (1)
1306-1327: ⚠️ Potential issue | 🟠 Major — Use a tuning-only `routing_logits` placeholder for routed BF16/FP8.

`DynamicTensorSpec` still treats input slot 1 as a dynamic tensor, but these routed branches pass `None` there. Unlike the FP4 path below, that leaves autotuning without shape/dtype/device metadata for `routing_logits`, so `trtllm_bf16_routed_moe` and `trtllm_fp8_block_scale_routed_moe` won't tune the packed-topk path reliably. Build the same placeholder tensor FP4 uses and pass `skip_routing=(routing_logits is None)` so `forward()` nulls it before the real op call.

Also applies to: 1662-1692
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fused_moe/core.py` around lines 1306 - 1327, The tuning call for BF16/FP8 routed paths uses inputs = [output, routing_logits, ...] but when routing_logits is None DynamicTensorSpec loses dtype/shape/device metadata; create the same tuning-only placeholder tensor used by the FP4 path (a DynamicTensorSpec / dummy tensor matching expected routing_logits shape/dtype/device) when routing_logits is None, insert it into inputs before calling tuner.choose_one (same place as the existing inputs list and the "flashinfer::trtllm_bf16_moe" call), and pass skip_routing=(routing_logits is None) into the MoERunner/forward invocation so forward() will null the real routing_logits at runtime while autotuning still receives valid metadata for the routed packed-topk path; apply the same change to the analogous block around where lines 1662-1692 are handled.
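A dependency-free sketch of the suggested placeholder flow. A dataclass stands in for `torch.Tensor` here, and the shape/dtype/device values are illustrative assumptions, not the real tuner contract:

```python
from dataclasses import dataclass

@dataclass
class TensorMeta:
    """Stand-in for a torch.Tensor; only the metadata the tuner needs."""
    shape: tuple
    dtype: str
    device: str

def build_tuning_inputs(routing_logits, num_tokens, num_experts):
    """When the routed path gets no logits, hand the tuner a placeholder with
    real shape/dtype/device metadata and remember to null it again at op time
    via skip_routing."""
    skip_routing = routing_logits is None
    if skip_routing:
        routing_logits = TensorMeta((num_tokens, num_experts), "bfloat16", "cuda")
    return routing_logits, skip_routing
```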
🧹 Nitpick comments (2)
tests/moe/test_trtllm_gen_fused_moe.py (2)
3900-3967: Please cover the MiniMax2 side of the new >512-expert tier.

This targeted 1024-expert test only exercises DeepSeekV3, but the new fallback tier is also meant to protect MiniMax2. One MiniMax2 case above 512 experts would close that gap.
3969-4118: The dtype-flexibility matrix misses the per-tensor launcher.

The relaxed routing-logits dtype change touched both `trtllm_fp8_block_scale_moe` and `trtllm_fp8_per_tensor_scale_moe`, but this matrix only runs `FP8BlockScaleMoe`. Adding one `FP8PerTensorMoe` DeepSeekV3 case would cover the second changed entry point.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/moe/test_trtllm_gen_fused_moe.py` around lines 3969 - 4118, The test matrix only instantiates FP8BlockScaleMoe but the recent dtype change also affects the per-tensor implementation; update the test to include FP8PerTensorMoe by adding a pytest.param for FP8PerTensorMoe in the moe_impl parametrization (same style as the existing FP8BlockScaleMoe entry, e.g., id="FP8_PerTensor_DeepSeek"), and add or duplicate the DeepSeekV3 routing_config case so its "compatible_moe_impls" includes FP8PerTensorMoe (and adjust the routing_config id if needed) so test_routing_dtype_flexibility covers trtllm_fp8_per_tensor_scale_moe as well as trtllm_fp8_block_scale_moe.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu`:
- Around line 900-903: The predicate for useDynBlock should consider the
dispatched tier size instead of raw data.mNumExperts: replace the current check
that uses data.mNumExperts with a call to queryDispatchedMaxExperts() (or an
equivalent dispatched_experts value) so useDynBlock only becomes true when the
dispatched expert count <= DynBlockKernelMaxNumExperts; update the condition for
useDynBlock (and any downstream branch that leads to launchDynBlockKernel or
Tier<1024, 32> specializations) to gate on queryDispatchedMaxExperts() <=
DynBlockKernelMaxNumExperts to prevent selecting the 1024-expert dynamic path
when only 512 real experts exist.
---
Outside diff comments:
In `@flashinfer/fused_moe/core.py`:
- Around line 1306-1327: The tuning call for BF16/FP8 routed paths uses inputs =
[output, routing_logits, ...] but when routing_logits is None DynamicTensorSpec
loses dtype/shape/device metadata; create the same tuning-only placeholder
tensor used by the FP4 path (a DynamicTensorSpec / dummy tensor matching
expected routing_logits shape/dtype/device) when routing_logits is None, insert
it into inputs before calling tuner.choose_one (same place as the existing
inputs list and the "flashinfer::trtllm_bf16_moe" call), and pass
skip_routing=(routing_logits is None) into the MoERunner/forward invocation so
forward() will null the real routing_logits at runtime while autotuning still
receives valid metadata for the routed packed-topk path; apply the same change
to the analogous block around where lines 1662-1692 are handled.
---
Nitpick comments:
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 3969-4118: The test matrix only instantiates FP8BlockScaleMoe but
the recent dtype change also affects the per-tensor implementation; update the
test to include FP8PerTensorMoe by adding a pytest.param for FP8PerTensorMoe in
the moe_impl parametrization (same style as the existing FP8BlockScaleMoe entry,
e.g., id="FP8_PerTensor_DeepSeek"), and add or duplicate the DeepSeekV3
routing_config case so its "compatible_moe_impls" includes FP8PerTensorMoe (and
adjust the routing_config id if needed) so test_routing_dtype_flexibility covers
trtllm_fp8_per_tensor_scale_moe as well as trtllm_fp8_block_scale_moe.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4677e238-b9b3-43a8-99f7-12a76b6d1fb0
📒 Files selected for processing (11)
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_common.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
- csrc/trtllm_fused_moe_kernel_launcher.cu
- flashinfer/fused_moe/core.py
- include/flashinfer/trtllm/fused_moe/RoutingCustomPolicy.cuh
- include/flashinfer/trtllm/fused_moe/RoutingDevKernel.h
- include/flashinfer/trtllm/fused_moe/RoutingKernel.h
- tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py
- tests/moe/test_trtllm_gen_fused_moe.py
🚧 Files skipped from review as they are similar to previous changes (5)
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
- include/flashinfer/trtllm/fused_moe/RoutingDevKernel.h
- tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
- include/flashinfer/trtllm/fused_moe/RoutingKernel.h
…zeLog2 Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu (1)
895-917: ⚠️ Potential issue | 🟠 Major — Gate dyn-block selection on the dispatched expert tier.

`launchDynBlockKernel()` on Line 609 sizes the launch from `queryDispatchedMaxExperts(data)`, but `useDynBlock` still checks raw `data.mNumExperts`. If policy dispatch rounds a 512-expert case up to the 1024 tier, this branch can still admit a dyn-block launch above the intended cap.

Proposed fix

```diff
   static int const smMajor = tensorrt_llm::common::getSMVersion() / 10;
+  int32_t const dispatchedMaxExperts = queryDispatchedMaxExperts(data);
   bool const useStaticBlock = data.mNumTokens <= BlockKernelMaxNumTokens;
   bool const useDynBlock = !useStaticBlock && data.mNumTokens <= DynBlockKernelMaxNumTokens &&
-                           data.mNumExperts <= DynBlockKernelMaxNumExperts;
+                           dispatchedMaxExperts <= DynBlockKernelMaxNumExperts;
   bool const useSingleBlock = useStaticBlock || useDynBlock;
```

Run this read-only check to confirm whether the dispatched tier can exceed `data.mNumExperts` on the dyn-block path:

```bash
#!/bin/bash
set -euo pipefail
policy_file="$(fd -i 'RoutingCustomPolicy\.cuh' | head -n1)"
custom_file="$(fd -i 'trtllm_fused_moe_routing_custom\.cu' | head -n1)"
rg -n -C2 'DynBlockKernelMaxNumExperts|queryDispatchedMaxExperts|Tier<1024,\s*32>|useDynBlock|launchDynBlockKernel' \
  "$policy_file" "$custom_file"
```

Expected result: if `queryDispatchedMaxExperts(...)` can resolve to `1024` while `data.mNumExperts` is still `<= DynBlockKernelMaxNumExperts`, `useDynBlock` should gate on the dispatched tier instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu` around lines 895 - 917, The dyn-block selection currently uses data.mNumExperts (via useDynBlock) but launchDynBlockKernel sizes its launch using queryDispatchedMaxExperts(data), so a policy that rounds up dispatched tiers can let dyn-block run above the intended cap; fix by computing dispatchedMax = queryDispatchedMaxExperts(data) early and replace the useDynBlock condition to check dispatchedMax <= DynBlockKernelMaxNumExperts (and <= DynBlockKernelMaxNumExperts && data.mNumTokens <= DynBlockKernelMaxNumTokens), then use dispatchedMax when sizing the kernel launch in launchDynBlockKernel and any related num-expert decision logic (referencing useDynBlock, launchDynBlockKernel, and queryDispatchedMaxExperts).
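To make the gap concrete, here is a toy model in which a hypothetical tier table has no 512 entry, so a raw-count check and a dispatched-tier check disagree. The tier set is an assumption chosen purely to exhibit the rounding-up scenario the comment describes:

```python
DYN_BLOCK_MAX_NUM_TOKENS = 16
DYN_BLOCK_MAX_NUM_EXPERTS = 512
TIERS = (128, 256, 1024)  # assumed tier table without a 512 entry

def query_dispatched_max_experts(num_experts: int) -> int:
    """Round the runtime expert count up to the nearest compiled tier."""
    return next(t for t in TIERS if num_experts <= t)

def use_dyn_block(num_tokens: int, num_experts: int) -> bool:
    # Gate on the dispatched tier, not the raw count, so a case that policy
    # dispatch rounds past 512 cannot take the dyn-block path.
    dispatched = query_dispatched_max_experts(num_experts)
    return num_tokens <= DYN_BLOCK_MAX_NUM_TOKENS and dispatched <= DYN_BLOCK_MAX_NUM_EXPERTS
```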
🧹 Nitpick comments (1)
include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh (1)
379-389: Refresh the `MnLimit` comment.

The code now derives limits directly from `mPaddingLog2`/`mTileTokensDim`, but the nearby note still says `ctaTile = cgaTile / clusterSize`. That comment describes the removed model and is now misleading when auditing these formulas.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh` around lines 379 - 389, Update the misleading comment about MnLimits to reflect the current calculation: replace the outdated "ctaTile = cgaTile / clusterSize" note with a brief explanation that mnLimit1/mnLimit2 are computed from either params.mPaddingLog2 (via mulLog2) or params.mTileTokensDim (via mulTileN) depending on params.mIsPow2, and that the result is stored in params.mPtrCtaIdxXyToMnLimit for index (ctaOffset[e] + cta); ensure the comment references mulLog2, mulTileN, params.mPaddingLog2, params.mTileTokensDim, and params.mPtrCtaIdxXyToMnLimit so the intent matches the code.
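The two formulas the refreshed comment should describe differ only in how the tile size enters; a small numeric sketch (function names mirror `mulLog2`/`mulTileN`, values illustrative):

```python
def mul_log2(x: int, log2: int) -> int:
    """Power-of-two path: multiply by shifting with mPaddingLog2."""
    return x << log2

def mul_tile_n(x: int, tile_n: int) -> int:
    """General path: multiply by mTileTokensDim directly."""
    return x * tile_n

def mn_limit(count: int, is_pow2: bool, padding_log2: int, tile_tokens_dim: int) -> int:
    # Selection mirrors the params.mIsPow2 branch described in the comment.
    return mul_log2(count, padding_log2) if is_pow2 else mul_tile_n(count, tile_tokens_dim)
```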
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu`:
- Around line 575-585: The code reads smemOffset[offset] unconditionally even
for non-local experts; change logic so smemOffset is only loaded when isLocal is
true: keep computing localExpIdx and isLocal as-is, but move the read of
smemOffset[offset] (and the computation of offsetWithinExpert) into the branch
guarded by isLocal, then compute permutedIdx = offsetForExpert +
offsetWithinExpert only inside that branch and set permutedIdx to -1 otherwise;
references: smemKIdx, smemOffset, localExpIdx, isLocal, offsetWithinExpert,
expertScanCountsPerExpert, permutedIdx.
---
Duplicate comments:
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu`:
- Around line 895-917: The dyn-block selection currently uses data.mNumExperts
(via useDynBlock) but launchDynBlockKernel sizes its launch using
queryDispatchedMaxExperts(data), so a policy that rounds up dispatched tiers can
let dyn-block run above the intended cap; fix by computing dispatchedMax =
queryDispatchedMaxExperts(data) early and replace the useDynBlock condition to
check dispatchedMax <= DynBlockKernelMaxNumExperts (and <=
DynBlockKernelMaxNumExperts && data.mNumTokens <= DynBlockKernelMaxNumTokens),
then use dispatchedMax when sizing the kernel launch in launchDynBlockKernel and
any related num-expert decision logic (referencing useDynBlock,
launchDynBlockKernel, and queryDispatchedMaxExperts).
---
Nitpick comments:
In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh`:
- Around line 379-389: Update the misleading comment about MnLimits to reflect
the current calculation: replace the outdated "ctaTile = cgaTile / clusterSize"
note with a brief explanation that mnLimit1/mnLimit2 are computed from either
params.mPaddingLog2 (via mulLog2) or params.mTileTokensDim (via mulTileN)
depending on params.mIsPow2, and that the result is stored in
params.mPtrCtaIdxXyToMnLimit for index (ctaOffset[e] + cta); ensure the comment
references mulLog2, mulTileN, params.mPaddingLog2, params.mTileTokensDim, and
params.mPtrCtaIdxXyToMnLimit so the intent matches the code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ce548420-32d5-48af-abdf-a9b3702d828a
📒 Files selected for processing (6)
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
- csrc/trtllm_fused_moe_runner.cu
- include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh
- include/flashinfer/trtllm/fused_moe/RoutingKernel.h
- include/flashinfer/trtllm/fused_moe/runner.h
🚧 Files skipped from review as they are similar to previous changes (1)
- include/flashinfer/trtllm/fused_moe/RoutingKernel.h
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Force-pushed d6739ec to b72471b (Compare)

/bot run
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
/bot run
…PdlOverlapWithNext; Remove DeepSeekV3 float32 logits constraint from kernel launchers
📌 Description
- The dynamic single-block routing kernel (routingIndicesDynBlockKernel) comes from TensorRT-LLM: [None][perf] add Dynamic SMEM block routing in MOE NVIDIA/TensorRT-LLM#12456. Related modifications were made by refactoring LAUNCH_ROUTING_CUSTOM with dispatchRoutingPolicy and queryDispatchedMaxExperts.
- Added a new tier (Tier<1024, 32>) to support future models with >512 experts using the DeepSeek nGroup≤1 / MiniMax2 routing policy.
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [x] I have installed pre-commit by running pip install pre-commit (or used your preferred method).
- [x] I have installed the hooks with pre-commit install.
- [x] I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.
- [x] Tests have been added or updated as needed (unittest, etc.).