
Second part of refactoring the routing part#2993

Merged
jiahanc merged 7 commits into flashinfer-ai:main from ChristinaZ:refactor_routing_part2
Apr 13, 2026

Conversation

@ChristinaZ
Contributor

@ChristinaZ ChristinaZ commented Apr 6, 2026

…PdlOverlapWithNext; Remove DeepSeekV3 float32 logits constraint from kernel launchers

📌 Description

  1. Add a dynamic block kernel (routingIndicesDynBlockKernel) ported from TensorRT-LLM ([None][perf] add Dynamic SMEM block routing in MOE NVIDIA/TensorRT-LLM#12456). Made the related modifications by refactoring LAUNCH_ROUTING_CUSTOM with dispatchRoutingPolicy and queryDispatchedMaxExperts.
  2. Simplify PDL (Programmatic Dependent Launch) handling for routing kernels, now that the related PDL bug has been fixed.
  3. Added a default fallback tier (Tier<1024, 32>) to support future models with >512 experts using the DeepSeek nGroup≤1 / MiniMax2 routing policy.
  4. Remove DeepSeekV3 float32 logits constraint
  5. Improve policy tier dispatch error messages
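For illustration, the tier dispatch described in items 1 and 3 can be modeled as choosing the smallest compiled tier that covers the runtime expert count. The tier list and function names below are assumptions for the sketch, not the actual C++ API; only the (1024, 32) fallback tier is taken from this PR's description.

```python
# Hypothetical tier table: (max experts, experts per thread). Only the
# (1024, 32) fallback tier is confirmed by this PR; the rest are illustrative.
TIERS = [(128, 8), (256, 16), (512, 32), (1024, 32)]

def query_dispatched_max_experts(num_experts: int) -> int:
    """Return the expert bound of the smallest tier covering num_experts,
    or 0 if no tier supports it (mirroring a failed dispatch)."""
    for max_experts, _ in TIERS:
        if num_experts <= max_experts:
            return max_experts
    return 0
```

With this model, a model with 600 experts would dispatch into the new 1024-expert fallback tier rather than failing.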

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

python -m pytest tests/moe/test_trtllm_gen_fused_moe.py -k "test_dyn_block_kernel_routing or test_tier_1024_experts_routing" -xvs
python3 -m pytest tests/moe/test_trtllm_gen_fused_moe.py -k "test_routing_dtype_flexibility" -xvs
  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Dynamic single-block routing for small token/expert workloads to improve performance.
  • Improvements

    • Added a 1024-expert policy tier and clearer tier-dispatch diagnostics.
    • More flexible routing/logits dtype handling and simplified kernel overlap/launch synchronization.
    • Automatic histogram-thread sizing and removal of legacy cluster-size/overlap public flags.
    • Autotuner now initializes packed TopK with per-token unique expert IDs.
  • Tests

    • Added tests for dynamic-block routing, 1024-expert tier, MiniMax2 routing, and routing dtype combinations.
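The per-token unique expert-ID initialization mentioned above matters because duplicate expert IDs within a token can address the same permutation slot and leave memory uninitialized. A hedged sketch of the idea, using plain-Python sampling without replacement as a stand-in for the test's randperm-based approach:

```python
import random

def init_topk_ids(num_tokens: int, num_experts: int, top_k: int):
    """Sample top_k distinct expert IDs per token, so no token
    repeats an expert ID (sampling without replacement)."""
    return [random.sample(range(num_experts), top_k) for _ in range(num_tokens)]
```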

@coderabbitai
Contributor

coderabbitai bot commented Apr 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds a dynamic single-block routing kernel and dispatch path (bounded by token/expert limits), removes host-side PDL overlap orchestration and last-kernel pre-copy, simplifies runPostTopKPipeline signature (now computes histogram threads internally), relaxes DeepSeekV3 routing-logits dtype checks, and updates tests/autotuner for packed TopK semantics.

Changes

  • Dynamic single-block routing (csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu):
    Added routingIndicesDynBlockKernel, warpExclusiveScan, and launchDynBlockKernel; uses dynamic shared memory to build TopK→histogram→prefix in-block; initializes expanded→permuted mappings to -1; host selects useDynBlock; removed host-side PDL overlap updates.
  • Common routing entry & post-TopK (csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_common.cu):
    Added routingCustom::launchDynBlockKernel declaration; changed the runPostTopKPipeline signature to drop numThreadsHist (now computed internally); replaced the last-kernel lastKernelData copy with direct customData usage; updated explicit instantiations.
  • Routing callers / early-return paths (csrc/fused_moe/trtllm_backend/.../trtllm_fused_moe_routing_deepseek.cu, .../trtllm_fused_moe_routing_llama4.cu):
    Removed the local numThreadsHist computation; call sites now invoke runPostTopKPipeline(data, stream); removed host-side mPdlOverlapWithNext toggles; adjusted cluster/tiling arithmetic.
  • Policy dispatch, tiers & limits (include/flashinfer/trtllm/fused_moe/RoutingCustomPolicy.cuh):
    Added DynBlockKernelMaxNumTokens=16 and DynBlockKernelMaxNumExperts=512; added dispatchRoutingPolicy() and queryDispatchedMaxExperts() utilities; extended the tier list (added Tier<1024,32>); improved dispatch diagnostics.
  • PDL launch attribute change (include/flashinfer/trtllm/fused_moe/RoutingDevKernel.h):
    LAUNCH_PDL_ROUTING now sets programmatic-stream-serialization based solely on data.mUsePdl (removed the mPdlOverlapWithNext dependency).
  • Routing kernel API / Data struct (include/flashinfer/trtllm/fused_moe/RoutingKernel.h):
    Removed DataBase::mPdlOverlapWithNext, mClusterSizeInBatchDim, and mClusterSizeLog2; removed the corresponding KernelParamsBase members; updated the runPostTopKPipeline template declaration to (DataType const&, void*).
  • Launcher & dtype checks (csrc/trtllm_fused_moe_kernel_launcher.cu):
    Removed the DeepSeekV3-only requirement that routing_logits be float32; the score dtype is now chosen from the input routing_logits.dtype() unconditionally.
  • Runner/autotuner & packed TopK init (flashinfer/fused_moe/core.py, tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py):
    Removed the MoEInputs dataclass; added MoERunner._init_packed_topk and dynamic tensor initializers; the tuner now receives an ordered tensor list; autotuner test topk generation changed to per-token unique randperm sampling for routed cases.
  • Routing tests & references (tests/moe/test_trtllm_gen_fused_moe.py):
    Added routing_reference_minimax2 and tests for dynamic-block routing, 1024-expert tier routing, and routing dtype flexibility; extended routing-method parameters.
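Among the new device helpers, warpExclusiveScan computes an exclusive prefix sum so that each expert's output offset can be derived from the per-expert histogram. A minimal Python model of the scan semantics (the semantics only, not the CUDA warp-shuffle implementation):

```python
def warp_exclusive_scan(vals):
    """Exclusive prefix sum: out[i] = sum(vals[:i]); also returns the total.
    Models the semantics of a warp-level exclusive scan in plain Python."""
    out, running = [], 0
    for v in vals:
        out.append(running)   # offset before adding this lane's value
        running += v
    return out, running
```

Applied to a per-expert histogram, the scanned values give each expert's starting slot in the permuted token layout, and the total gives the overall expanded count.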

Sequence Diagram

sequenceDiagram
    participant Launcher as Launcher (prepare_routing)
    participant Dispatch as Routing Dispatch (run)
    participant DynBlock as DynBlock Kernel
    participant Static as Static/Cluster/Coop Kernels
    participant PostTopK as Post-TopK Pipeline

    Launcher->>Dispatch: run(Data, stream) (dtype selection simplified)
    alt token<=16 && experts<=512 (useDynBlock)
        Dispatch->>DynBlock: launchDynBlockKernel(data, stream)
        DynBlock->>DynBlock: allocate dynamic shared memory
        DynBlock->>DynBlock: build TopK→histograms, warpExclusiveScan
        DynBlock->>PostTopK: write permutation & CTA configs
        PostTopK->>PostTopK: runPostTopKPipeline(data, stream)
    else fallback
        Dispatch->>Static: launchBlock/Cluster/Coop or histogram/offset kernels
        Static->>PostTopK: produce histograms/offsets
        PostTopK->>PostTopK: runPostTopKPipeline(data, stream)
    end
    PostTopK-->>Launcher: routing complete
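The host-side selection shown in the diagram can be sketched as a simple predicate over the two limits (the constant names mirror DynBlockKernelMaxNumTokens=16 and DynBlockKernelMaxNumExperts=512 from this PR; the function name is an assumption for illustration):

```python
DYN_BLOCK_MAX_NUM_TOKENS = 16    # mirrors DynBlockKernelMaxNumTokens
DYN_BLOCK_MAX_NUM_EXPERTS = 512  # mirrors DynBlockKernelMaxNumExperts

def select_routing_path(num_tokens: int, num_experts: int) -> str:
    """Dyn-block path only when both limits hold; otherwise the
    static/cluster/cooperative fallback path."""
    if num_tokens <= DYN_BLOCK_MAX_NUM_TOKENS and num_experts <= DYN_BLOCK_MAX_NUM_EXPERTS:
        return "dyn_block"
    return "fallback"
```

Note that one review comment below argues the expert-count side of this predicate should use the dispatched tier size rather than the raw runtime expert count.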

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

run-ci, op: moe-routing

Suggested reviewers

  • aleozlx
  • yzh119
  • cyx-6
  • IwakuraRein
  • bkryu
  • jimmyzho
  • nv-yunzheq

Poem

🐇 I tunneled through kernels, found a nimble block,

Shared-mem burrows built TopK by the clock.
No host-flag paddles now to push or pry,
Tiers grow taller, tests chase across the sky.
Hop—small blocks dance, the routing rabbits fly!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 40.91%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check: ❓ Inconclusive. The title "Second part of refactoring the routing part" is vague and generic and doesn't convey the specific changes or main objectives. Resolution: use a more specific title, such as "Add dynamic block routing kernel and simplify PDL handling" or "Refactor routing kernels: add dynamic block kernel and remove PDL overlap control".
✅ Passed checks (1 passed)
  • Description check: ✅ Passed. The PR description includes all required sections from the template (Description, Related Issues, Pre-commit Checks, Tests, Reviewer Notes) with substantial content detailing the changes, implementation notes, and completed test commands.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a dynamic block kernel for MoE routing that handles up to 16 tokens and up to 512 experts, filling the gap between the static block kernel and the cluster/cooperative kernels. It also refactors routing-policy dispatch for better maintainability and simplifies the Programmatic Dependent Launch (PDL) logic by removing the mPdlOverlapWithNext flag. Additionally, new tiers for up to 1024 experts have been added to support larger models. The review feedback suggests enhancing the warning messages for missing tiers by recommending the next appropriate tiered expert count via getMaxNumExperts instead of reporting the raw runtime value.

Comment thread include/flashinfer/trtllm/fused_moe/RoutingCustomPolicy.cuh Outdated
Comment thread include/flashinfer/trtllm/fused_moe/RoutingCustomPolicy.cuh Outdated
Comment thread tests/moe/test_trtllm_gen_fused_moe.py
@jiahanc jiahanc added the run-ci label Apr 6, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu (1)

535-540: ⚠️ Potential issue | 🔴 Critical

Keep mPtrExpertCounts valid for the single-cluster path.

After Line 540, launchClusterKernel() receives mPtrExpertCounts == nullptr, but routingIndicesClusterKernel() still funnels into routingPermutation(), and that code unconditionally atomicAdds through params.mPtrExpertCounts in include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh:573-576 and 1003-1006. That turns the single-cluster path into an illegal global write. If you want this buffer to become optional here, routingPermutation() needs a null-safe variant first.

Suggested fix
     if (!useSingleCluster) {
       FLASHINFER_CHECK(data.mPtrExpertCounts != nullptr,
                        "When `#tokens` is large, `mPtrExpertCounts` is a required input.");
-    } else {
-      data.mPtrExpertCounts = nullptr;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu` around
lines 535 - 540, When useSingleCluster is true you set data.mPtrExpertCounts =
nullptr which later causes routingPermutation() (called via
routingIndicesClusterKernel() and launchClusterKernel()) to perform illegal
atomicAdd through params.mPtrExpertCounts; instead ensure mPtrExpertCounts is a
valid device pointer for the single-cluster path by allocating or reusing a
small zero-initialized device buffer and assigning its pointer to
data.mPtrExpertCounts before calling launchClusterKernel(), make the buffer size
match the expected expert count, keep its lifetime until after the kernels
complete, and free or reuse it appropriately (alternatively implement a
null-safe routingPermutation(), but the immediate fix is to provide a valid
zeroed device buffer).
csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu (1)

857-865: ⚠️ Potential issue | 🟠 Major

Keep the precomputed-topK fast path behind the shared expert-count validation.

This branch reaches runPostTopKPipeline(data, stream) before the mNumExperts <= MaxSupportedExperts guard below. In csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_common.cu:40-73, the refactored helper derives numThreadsHist from getMaxNumExperts(data.mNumExperts), which returns 0 for unsupported expert counts, so this path can fall into an invalid launch instead of failing cleanly. Move the shared validation above the fast path or validate inside runPostTopKPipeline().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu` around
lines 857 - 865, The fast-path that calls runPostTopKPipeline(data, stream) is
executed before the shared expert-count validation, risking invalid kernel
launches when getMaxNumExperts(data.mNumExperts) would return 0 for unsupported
expert counts; move the guard that checks data.mNumExperts <=
MaxSupportedExperts (or equivalently validate getMaxNumExperts(data.mNumExperts)
> 0) to run before the branch that checks data.mPtrTopKIds / data.mPtrTopKPacked
&& data.mPtrScores so the precomputed-topK fast path runs only for supported
expert counts, or alternatively add the same expert-count validation at the
start of runPostTopKPipeline to enforce the check there (ensure references to
data.mNumExperts, MaxSupportedExperts, getMaxNumExperts, and runPostTopKPipeline
are covered).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu`:
- Around line 453-455: The predicate that computes isLocal uses localExpIdx <
params.mNumLocalExperts but must account for stride expansion; update the checks
that compute isLocal (around the use of localExpIdx,
params.mLocalExpertsStartIdx, params.mNumLocalExperts,
params.mLocalExpertsStrideLog2) to compare against the stride-expanded extent
(i.e., use params.mNumLocalExperts << params.mLocalExpertsStrideLog2 or
equivalent) so experts in the upper part of the strided window are considered
local; apply the same fix to the other occurrences referenced (the similar
predicates around the blocks at the other two locations that check localExpIdx
and mNumLocalExperts).
- Around line 485-515: numCtaPerExpert and tmpCountPerExpert are left as
cluster-counts but later code and MnLimit expect per-CTA counts; multiply the
cluster counts by params.mClusterSizeInBatchDim (or the equivalent cluster-size
field) after computing them in the branch that sets
numCtaPerExpert/tmpCountPerExpert so they become per-CTA units, and ensure any
writes to mPtrNumNonExitingCtas, mPtrCtaIdxXyToBatchIdx, and
mPtrCtaIdxXyToMnLimit (and the code paths around
ctaOffsetPerExpert/expertScanCountsPerExpert) use these converted per-CTA
values; apply the same fix to the other blocks the reviewer called out (the
other two similar ranges).
- Around line 559-562: The call to cudaTriggerProgrammaticLaunchCompletion() is
currently placed before the Phase 5 permutation loop and can release dependent
grids while Phase 5 is still writing mPtrExpandedIdxToPermutedIdx,
mPtrPermutedIdxToExpandedIdx, and mPtrPermutedIdxToTokenIdx; move the
conditional call to cudaTriggerProgrammaticLaunchCompletion() (inside the `#if`
CUDA_ARCH >= 900 and guarded by params.mUsePdl) to immediately after the Phase 5
permutation loop finishes (after the block that populates those three buffers)
so PDL-dependent grids are only triggered once all stores are complete.

In `@include/flashinfer/trtllm/fused_moe/RoutingCustomPolicy.cuh`:
- Around line 633-635: The warning currently stringifies
"decltype(preProc_)+decltype(postProc_)" because LAUNCH_ROUTING_FOR_POLICY uses
`#PreProc/`#PostProc but LAUNCH_ROUTING_CUSTOM passes
decltype(preProc_)/decltype(postProc_); fix by passing pre-stringified policy
identifiers (or separate name args) instead of decltype(...) or by moving the
warning construction into dispatchRoutingPolicy so it can build a human-readable
policy name from the actual pre/post proc types. Concretely, update the
LAUNCH_ROUTING_CUSTOM invocation inside dispatchRoutingPolicy to supply the
original policy symbol names (e.g., PreProc, PostProc or provided name strings)
to LAUNCH_ROUTING_FOR_POLICY (or extend dispatchRoutingPolicy to accept and
forward policy-name strings), and adjust the macro call sites using
dispatchRoutingPolicy, LAUNCH_ROUTING_FOR_POLICY, preProc_, and postProc_
accordingly so the warning prints the real policy names rather than
decltype(...) text.

---

Outside diff comments:
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu`:
- Around line 857-865: The fast-path that calls runPostTopKPipeline(data,
stream) is executed before the shared expert-count validation, risking invalid
kernel launches when getMaxNumExperts(data.mNumExperts) would return 0 for
unsupported expert counts; move the guard that checks data.mNumExperts <=
MaxSupportedExperts (or equivalently validate getMaxNumExperts(data.mNumExperts)
> 0) to run before the branch that checks data.mPtrTopKIds / data.mPtrTopKPacked
&& data.mPtrScores so the precomputed-topK fast path runs only for supported
expert counts, or alternatively add the same expert-count validation at the
start of runPostTopKPipeline to enforce the check there (ensure references to
data.mNumExperts, MaxSupportedExperts, getMaxNumExperts, and runPostTopKPipeline
are covered).

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu`:
- Around line 535-540: When useSingleCluster is true you set
data.mPtrExpertCounts = nullptr which later causes routingPermutation() (called
via routingIndicesClusterKernel() and launchClusterKernel()) to perform illegal
atomicAdd through params.mPtrExpertCounts; instead ensure mPtrExpertCounts is a
valid device pointer for the single-cluster path by allocating or reusing a
small zero-initialized device buffer and assigning its pointer to
data.mPtrExpertCounts before calling launchClusterKernel(), make the buffer size
match the expected expert count, keep its lifetime until after the kernels
complete, and free or reuse it appropriately (alternatively implement a
null-safe routingPermutation(), but the immediate fix is to provide a valid
zeroed device buffer).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 160912fe-a37a-42c1-8539-cf0c3aa4429f

📥 Commits

Reviewing files that changed from the base of the PR and between 19b825d and 50f8f28.

📒 Files selected for processing (9)
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_common.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
  • csrc/trtllm_fused_moe_kernel_launcher.cu
  • include/flashinfer/trtllm/fused_moe/RoutingCustomPolicy.cuh
  • include/flashinfer/trtllm/fused_moe/RoutingDevKernel.h
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.h
  • tests/moe/test_trtllm_gen_fused_moe.py

Comment thread csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu
Comment thread csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu
Comment thread csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu Outdated
Comment thread include/flashinfer/trtllm/fused_moe/RoutingCustomPolicy.cuh Outdated
@jiahanc
Collaborator

jiahanc commented Apr 6, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !510 has been created, and the CI pipeline #47829409 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #47829409: 10/20 passed

@jiahanc jiahanc added run-ci and removed run-ci labels Apr 6, 2026
Collaborator

@jiahanc jiahanc left a comment


LGTM. Thanks for the contribution!

@jiahanc
Collaborator

jiahanc commented Apr 6, 2026

@flashinfer-bot run

@nvpohanh
Contributor

nvpohanh commented Apr 7, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !510 has been updated with latest changes, and the CI pipeline #47894433 is currently running. I'll report back once the pipeline job completes.

@nvpohanh
Contributor

nvpohanh commented Apr 7, 2026

Add a dynamic block kernel (routingIndicesDynBlockKernel) ported from TensorRT-LLM (NVIDIA/TensorRT-LLM#12456). Made the related modifications by refactoring LAUNCH_ROUTING_CUSTOM with dispatchRoutingPolicy and queryDispatchedMaxExperts
Simplify PDL (Programmatic Dependent Launch) handling for routing kernels, now that the related PDL bug has been fixed.
Added a default fallback tier (Tier<1024, 32>) to support future models with >512 experts using the DeepSeek nGroup≤1 / MiniMax2 routing policy.
Remove DeepSeekV3 float32 logits constraint
Improve policy tier dispatch error messages

cc @leejnau @trevor-m for vis since this is related to DSV3 and MiniMax2.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #47894433: 10/20 passed

@jiahanc jiahanc removed the run-ci label Apr 10, 2026
@jiahanc
Collaborator

jiahanc commented Apr 10, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !510 has been created, and the CI pipeline #48254600 is currently running. I'll report back once the pipeline job completes.

…PdlOverlapWithNext; Remove DeepSeekV3 float32 logits constraint from kernel launchers

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/moe/test_trtllm_gen_fused_moe.py (1)

4048-4063: Add one MiniMax2 case above 512 experts.

This config only exercises MiniMax2 at 256 experts, so it never hits the new Tier<1024, 32> fallback. The 1024-expert coverage added in this PR is DeepSeekV3-only, so a MiniMax2 regression in the new dispatch tier would still slip through.

🧪 Possible follow-up matrix entry
+        pytest.param(
+            {
+                "num_experts": 1024,
+                "top_k": 6,
+                "padding": 8,
+                "n_groups": None,
+                "top_k_groups": None,
+                "routed_scaling": None,
+                "has_routing_bias": True,
+                "routing_method_type": RoutingMethodType.MiniMax2,
+                "compatible_moe_impls": [FP8BlockScaleMoe],
+                "compatible_intermediate_size": [512],
+                "enable_autotune": False,
+            },
+            id="MiniMax2_1024e",
+        ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/moe/test_trtllm_gen_fused_moe.py` around lines 4048 - 4063, Add a new
pytest param exercising RoutingMethodType.MiniMax2 with num_experts >= 1024 so
the test hits the Tier<1024, 32> fallback; replicate the existing MiniMax2 dict
(keys: "num_experts", "top_k", "padding", "n_groups", "top_k_groups",
"routed_scaling", "has_routing_bias", "routing_method_type",
"compatible_moe_impls", "compatible_intermediate_size", "enable_autotune") but
set "num_experts" to 1024 (or above) and give it a distinct id such as
"MiniMax2_1024e" so the new dispatch tier is covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Line 1876: The torch.topk call currently assigns two values but only uses the
second; replace the unused binding topk_values with an underscore (_) to avoid
the unused-variable lint error — i.e., change the assignment of
torch.topk(selection_scores, k=top_k, dim=-1) so the first element is named _
and the second remains topk_idx (refer to the existing torch.topk(...) call and
the topk_idx variable).

---

Nitpick comments:
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 4048-4063: Add a new pytest param exercising
RoutingMethodType.MiniMax2 with num_experts >= 1024 so the test hits the
Tier<1024, 32> fallback; replicate the existing MiniMax2 dict (keys:
"num_experts", "top_k", "padding", "n_groups", "top_k_groups", "routed_scaling",
"has_routing_bias", "routing_method_type", "compatible_moe_impls",
"compatible_intermediate_size", "enable_autotune") but set "num_experts" to 1024
(or above) and give it a distinct id such as "MiniMax2_1024e" so the new
dispatch tier is covered.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cd0be958-d3c7-4c9e-bea8-5adf48f99e2a

📥 Commits

Reviewing files that changed from the base of the PR and between 9a250b2 and 930f9b8.

📒 Files selected for processing (4)
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu
  • flashinfer/fused_moe/core.py
  • tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py
  • tests/moe/test_trtllm_gen_fused_moe.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu

Comment thread tests/moe/test_trtllm_gen_fused_moe.py Outdated
…o prevent uninitialized memory from duplicate expert IDs

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
@ChristinaZ ChristinaZ force-pushed the refactor_routing_part2 branch from 930f9b8 to ce01e94 on April 11, 2026 at 14:19
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/fused_moe/core.py (1)

1306-1327: ⚠️ Potential issue | 🟠 Major

Use a tuning-only routing_logits placeholder for routed BF16/FP8.

DynamicTensorSpec still treats input slot 1 as a dynamic tensor, but these routed branches pass None there. Unlike the FP4 path below, that leaves autotuning without shape/dtype/device metadata for routing_logits, so trtllm_bf16_routed_moe and trtllm_fp8_block_scale_routed_moe won't tune the packed-topk path reliably. Build the same placeholder tensor FP4 uses and pass skip_routing=(routing_logits is None) so forward() nulls it before the real op call.

Also applies to: 1662-1692

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/core.py` around lines 1306 - 1327, The tuning call for
BF16/FP8 routed paths uses inputs = [output, routing_logits, ...] but when
routing_logits is None DynamicTensorSpec loses dtype/shape/device metadata;
create the same tuning-only placeholder tensor used by the FP4 path (a
DynamicTensorSpec / dummy tensor matching expected routing_logits
shape/dtype/device) when routing_logits is None, insert it into inputs before
calling tuner.choose_one (same place as the existing inputs list and the
"flashinfer::trtllm_bf16_moe" call), and pass skip_routing=(routing_logits is
None) into the MoERunner/forward invocation so forward() will null the real
routing_logits at runtime while autotuning still receives valid metadata for the
routed packed-topk path; apply the same change to the analogous block around
where lines 1662-1692 are handled.
🧹 Nitpick comments (2)
tests/moe/test_trtllm_gen_fused_moe.py (2)

3900-3967: Please cover the MiniMax2 side of the new >512-expert tier.

This targeted 1024-expert test only exercises DeepSeekV3, but the new fallback tier is also meant to protect MiniMax2. One MiniMax2 case above 512 experts would close that gap.


3969-4118: The dtype-flexibility matrix misses the per-tensor launcher.

The relaxed routing-logits dtype change touched both trtllm_fp8_block_scale_moe and trtllm_fp8_per_tensor_scale_moe, but this matrix only runs FP8BlockScaleMoe. Adding one FP8PerTensorMoe DeepSeekV3 case would cover the second changed entry point.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/moe/test_trtllm_gen_fused_moe.py` around lines 3969 - 4118, The test
matrix only instantiates FP8BlockScaleMoe but the recent dtype change also
affects the per-tensor implementation; update the test to include
FP8PerTensorMoe by adding a pytest.param for FP8PerTensorMoe in the moe_impl
parametrization (same style as the existing FP8BlockScaleMoe entry, e.g.,
id="FP8_PerTensor_DeepSeek"), and add or duplicate the DeepSeekV3 routing_config
case so its "compatible_moe_impls" includes FP8PerTensorMoe (and adjust the
routing_config id if needed) so test_routing_dtype_flexibility covers
trtllm_fp8_per_tensor_scale_moe as well as trtllm_fp8_block_scale_moe.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu`:
- Around line 900-903: The predicate for useDynBlock should consider the
dispatched tier size instead of raw data.mNumExperts: replace the current check
that uses data.mNumExperts with a call to queryDispatchedMaxExperts() (or an
equivalent dispatched_experts value) so useDynBlock only becomes true when the
dispatched expert count <= DynBlockKernelMaxNumExperts; update the condition for
useDynBlock (and any downstream branch that leads to launchDynBlockKernel or
Tier<1024, 32> specializations) to gate on queryDispatchedMaxExperts() <=
DynBlockKernelMaxNumExperts to prevent selecting the 1024-expert dynamic path
when only 512 real experts exist.

---

Outside diff comments:
In `@flashinfer/fused_moe/core.py`:
- Around line 1306-1327: The tuning call for BF16/FP8 routed paths uses inputs =
[output, routing_logits, ...] but when routing_logits is None DynamicTensorSpec
loses dtype/shape/device metadata; create the same tuning-only placeholder
tensor used by the FP4 path (a DynamicTensorSpec / dummy tensor matching
expected routing_logits shape/dtype/device) when routing_logits is None, insert
it into inputs before calling tuner.choose_one (same place as the existing
inputs list and the "flashinfer::trtllm_bf16_moe" call), and pass
skip_routing=(routing_logits is None) into the MoERunner/forward invocation so
forward() will null the real routing_logits at runtime while autotuning still
receives valid metadata for the routed packed-topk path; apply the same change
to the analogous block around where lines 1662-1692 are handled.
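The placeholder idea can be sketched without torch; `TensorMeta`, `make_routing_logits_placeholder`, and `build_tuning_inputs` are hypothetical names used only to show the metadata the tuner needs when routing_logits is None:

```python
from dataclasses import dataclass

@dataclass
class TensorMeta:
    """Minimal stand-in for the shape/dtype/device metadata the tuner inspects."""
    shape: tuple
    dtype: str
    device: str

def make_routing_logits_placeholder(num_tokens, num_experts,
                                    dtype="bfloat16", device="cuda:0"):
    """Tuning-only stand-in matching the expected routing_logits metadata."""
    return TensorMeta((num_tokens, num_experts), dtype, device)

def build_tuning_inputs(output, routing_logits, num_tokens, num_experts):
    """Insert a placeholder for autotuning; forward() nulls the real logits later."""
    skip_routing = routing_logits is None
    if skip_routing:
        routing_logits = make_routing_logits_placeholder(num_tokens, num_experts)
    return [output, routing_logits], skip_routing
```

The real change would build an actual dummy tensor (as the FP4 path does) and pass `skip_routing` through to the MoERunner/forward invocation.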


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4677e238-b9b3-43a8-99f7-12a76b6d1fb0

📥 Commits

Reviewing files that changed from the base of the PR and between 930f9b8 and ce01e94.

📒 Files selected for processing (11)
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_common.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
  • csrc/trtllm_fused_moe_kernel_launcher.cu
  • flashinfer/fused_moe/core.py
  • include/flashinfer/trtllm/fused_moe/RoutingCustomPolicy.cuh
  • include/flashinfer/trtllm/fused_moe/RoutingDevKernel.h
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.h
  • tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py
  • tests/moe/test_trtllm_gen_fused_moe.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
  • include/flashinfer/trtllm/fused_moe/RoutingDevKernel.h
  • tests/autotuner/test_trtllm_fused_moe_autotuner_integration.py
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.h

…zeLog2

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu (1)

895-917: ⚠️ Potential issue | 🟠 Major

Gate dyn-block selection on the dispatched expert tier.

launchDynBlockKernel() on Line 609 sizes the launch from queryDispatchedMaxExperts(data), but useDynBlock still checks raw data.mNumExperts. If policy dispatch rounds a 512-expert case up to the 1024 tier, this branch can still admit a dyn-block launch above the intended cap.

Proposed fix
   static int const smMajor = tensorrt_llm::common::getSMVersion() / 10;
+  int32_t const dispatchedMaxExperts = queryDispatchedMaxExperts(data);
   bool const useStaticBlock = data.mNumTokens <= BlockKernelMaxNumTokens;
   bool const useDynBlock = !useStaticBlock && data.mNumTokens <= DynBlockKernelMaxNumTokens &&
-                           data.mNumExperts <= DynBlockKernelMaxNumExperts;
+                           dispatchedMaxExperts <= DynBlockKernelMaxNumExperts;
   bool const useSingleBlock = useStaticBlock || useDynBlock;

Run this read-only check to confirm whether the dispatched tier can exceed data.mNumExperts on the dyn-block path:

#!/bin/bash
set -euo pipefail

policy_file="$(fd -i 'RoutingCustomPolicy\.cuh' | head -n1)"
custom_file="$(fd -i 'trtllm_fused_moe_routing_custom\.cu' | head -n1)"

rg -n -C2 'DynBlockKernelMaxNumExperts|queryDispatchedMaxExperts|Tier<1024,\s*32>|useDynBlock|launchDynBlockKernel' \
  "$policy_file" "$custom_file"

Expected result: if queryDispatchedMaxExperts(...) can resolve to 1024 while data.mNumExperts is still <= DynBlockKernelMaxNumExperts, useDynBlock should gate on the dispatched tier instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu` around
lines 895 - 917: the dyn-block selection currently gates on data.mNumExperts
(via useDynBlock), but launchDynBlockKernel sizes its launch using
queryDispatchedMaxExperts(data), so a policy that rounds dispatched tiers up can
let dyn-block run above the intended cap. Fix by computing dispatchedMax =
queryDispatchedMaxExperts(data) early and changing the useDynBlock condition to
check dispatchedMax <= DynBlockKernelMaxNumExperts (together with the existing
data.mNumTokens <= DynBlockKernelMaxNumTokens check), then use dispatchedMax
when sizing the kernel launch in launchDynBlockKernel and in any related
num-expert decision logic (referencing useDynBlock, launchDynBlockKernel, and
queryDispatchedMaxExperts).
🧹 Nitpick comments (1)
include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh (1)

379-389: Refresh the MnLimit comment.

The code now derives limits directly from mPaddingLog2 / mTileTokensDim, but the nearby note still says ctaTile = cgaTile / clusterSize. That comment describes the removed model and is now misleading when auditing these formulas.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh` around lines 379 -
389, Update the misleading comment about MnLimits to reflect the current
calculation: replace the outdated "ctaTile = cgaTile / clusterSize" note with a
brief explanation that mnLimit1/mnLimit2 are computed from either
params.mPaddingLog2 (via mulLog2) or params.mTileTokensDim (via mulTileN)
depending on params.mIsPow2, and that the result is stored in
params.mPtrCtaIdxXyToMnLimit for index (ctaOffset[e] + cta); ensure the comment
references mulLog2, mulTileN, params.mPaddingLog2, params.mTileTokensDim, and
params.mPtrCtaIdxXyToMnLimit so the intent matches the code.
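The selection the refreshed comment should document can be modeled with pure-Python stand-ins for mulLog2 and mulTileN (the helper signatures are assumptions; the real code writes the result into mPtrCtaIdxXyToMnLimit):

```python
def mul_log2(n, padding_log2):
    """Power-of-two path: multiply by 2**paddingLog2 via a shift."""
    return n << padding_log2

def mul_tile_n(n, tile_tokens_dim):
    """General path: multiply by the tile token dimension."""
    return n * tile_tokens_dim

def mn_limit(n, is_pow2, padding_log2, tile_tokens_dim):
    """Select the limit computation the way the kernel does, based on mIsPow2."""
    return mul_log2(n, padding_log2) if is_pow2 else mul_tile_n(n, tile_tokens_dim)
```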
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu`:
- Around line 575-585: The code reads smemOffset[offset] unconditionally even
for non-local experts; change logic so smemOffset is only loaded when isLocal is
true: keep computing localExpIdx and isLocal as-is, but move the read of
smemOffset[offset] (and the computation of offsetWithinExpert) into the branch
guarded by isLocal, then compute permutedIdx = offsetForExpert +
offsetWithinExpert only inside that branch and set permutedIdx to -1 otherwise;
references: smemKIdx, smemOffset, localExpIdx, isLocal, offsetWithinExpert,
expertScanCountsPerExpert, permutedIdx.
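A host-side Python model of the guarded read (names are illustrative; the real code operates on shared memory inside the kernel, and only the control flow is meant to match the suggested fix):

```python
def permuted_indices(top_k_expert_ids, local_expert_begin, local_expert_end,
                     smem_offset, expert_base_offsets):
    """For each selected expert, read its running offset only when it is local."""
    out = []
    for expert in top_k_expert_ids:
        local_idx = expert - local_expert_begin
        is_local = local_expert_begin <= expert < local_expert_end
        if is_local:
            # Read (and bump) the per-expert offset only on the local path.
            offset_within_expert = smem_offset[local_idx]
            smem_offset[local_idx] += 1
            out.append(expert_base_offsets[local_idx] + offset_within_expert)
        else:
            out.append(-1)  # non-local experts get a sentinel permuted index
    return out
```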



ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ce548420-32d5-48af-abdf-a9b3702d828a

📥 Commits

Reviewing files that changed from the base of the PR and between ce01e94 and 5f13079.

📒 Files selected for processing (6)
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_custom.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
  • csrc/trtllm_fused_moe_runner.cu
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.h
  • include/flashinfer/trtllm/fused_moe/runner.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.h

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
@ChristinaZ ChristinaZ force-pushed the refactor_routing_part2 branch from d6739ec to b72471b Compare April 12, 2026 03:54
@ChristinaZ
Contributor Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !510 has been updated with latest changes, and the CI pipeline #48314914 is currently running. I'll report back once the pipeline job completes.

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
@ChristinaZ
Contributor Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !510 has been updated with latest changes, and the CI pipeline #48321283 is currently running. I'll report back once the pipeline job completes.

@jiahanc jiahanc added the run-ci label Apr 12, 2026
@jiahanc jiahanc enabled auto-merge (squash) April 12, 2026 12:47
@jiahanc jiahanc merged commit 86c8357 into flashinfer-ai:main Apr 13, 2026
42 of 43 checks passed