
[feat] Add 2048 experts and 32 Top K #2744

Merged
jiahanc merged 4 commits into flashinfer-ai:main from jiahanc:add2048Experts
Mar 17, 2026

Conversation


@jiahanc jiahanc commented Mar 10, 2026

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or via another preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Expanded MoE routing/renormalize to support up to 2,048 experts and top-k up to 32; backend reorganized to enable larger configurations.
  • Bug Fixes

    • Clamped token counts in kernel launches to prevent oversized grid launches.
  • Performance

    • Reworked routing/launch paths for improved scalability and throughput with large expert/top-k settings.
  • Tests

    • Added test scenarios covering large-expert (2,048) + top-k (32) configurations.
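The grid-clamp bug fix above can be sketched as follows. This is an illustrative Python model of the launch-sizing logic (the real code is CUDA C++ in csrc/trtllm_fused_moe_kernel_launcher.cu); `MAX_GRID_TOKENS` and `plan_token_grid` are hypothetical names, and only the min(8192, numTokens) clamp itself comes from the PR.

```python
# Illustrative model of clamping the token grid dimension. MAX_GRID_TOKENS and
# plan_token_grid are hypothetical names; the actual change lives in
# csrc/trtllm_fused_moe_kernel_launcher.cu as min(8192, numTokens).
MAX_GRID_TOKENS = 8192

def plan_token_grid(num_tokens):
    """Return (grid_dim, iterations_per_block).

    Instead of launching one block per token (which produces oversized grids
    for very large batches), the grid is clamped and each block covers the
    remaining tokens with a grid-stride loop.
    """
    grid_dim = min(MAX_GRID_TOKENS, num_tokens)
    # Each block processes tokens blockIdx, blockIdx + grid_dim, ... so it
    # needs ceil(num_tokens / grid_dim) iterations to cover all tokens.
    iters_per_block = (num_tokens + grid_dim - 1) // grid_dim
    return grid_dim, iters_per_block
```

With this scheme the grid never exceeds 8192 blocks regardless of batch size, while small batches still get one block per token.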


coderabbitai bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Split monolithic MoE routing into modular DeepSeek and Renormalize backends: added common headers, compile-time dispatch macros, multiple kernel launchers and run() orchestrators; updated KernelParams/top-K support, DevKernel macros, build/JIT source lists, and tests; removed legacy single-file routing implementations.

Changes

Cohort / File(s) Summary
DeepSeek — Common & Launchers
csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh, csrc/fused_moe/trtllm_backend/routingDeepSeek/launch*.cu
New DeepSeek constants, getMaxNumExperts, LAUNCH_DEEPSEEK/LAUNCH_ROUTING_DEEPSEEK dispatch macros; split kernel launchers (main, cluster, coop, histogram, init counts, offsets); SM90+ guards and kernel implementations.
DeepSeek — Orchestrator
csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
New run() entry point and forward declarations that validate inputs and dispatch to DeepSeek launchers (replaces deleted monolithic implementation).
Renormalize — Common & Launchers
csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh, csrc/fused_moe/trtllm_backend/routingRenormalize/launch*.cu
New Renormalize top‑K utilities, getMaxNumExperts, LAUNCH_ROUTING_RENORMALIZE dispatch macro; split launchers (block, cluster, histogram, histogram‑scores, init counts, offsets) and kernels.
Renormalize — Orchestrator
csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_renormalize.cu
New run() orchestrator that validates, selects single/multi‑kernel flows, and invokes Renormalize launchers (replaces deleted monolith).
Removed monoliths
csrc/trtllm_fused_moe_routing_deepseek.cu, csrc/trtllm_fused_moe_routing_renormalize.cu
Deleted legacy single-file routing modules—functionality reimplemented as split backend files.
Kernel params & headers
include/flashinfer/trtllm/fused_moe/RoutingKernel.h, .../RoutingKernel.cuh, .../RoutingKernelTopK.cuh
Added MaxNumTopExperts template param and per-kernel layout changes; removed some Data dtype fields; extended TopK support to 32; added IsPowerOf2/Sort and revised reduceTopK semantics and per-block/thread grouping.
DevKernel macros & dispatch
include/flashinfer/trtllm/fused_moe/DevKernel.h
Wrapped LAUNCH_PDL in do/while; replaced legacy DeepSeek macros; added LAUNCH_ROUTING_WITH_NUM_EXPERTS_FORCE_FLOAT_INPUT; parameterized dispatch on numExperts/numTopExperts and forceFloatInput.
Orchestrator & runner tweaks
csrc/trtllm_fused_moe_kernel_launcher.cu, csrc/trtllm_fused_moe_routing_llama4.cu, csrc/fused_moe/trtllm_backend/trtllm_fused_moe_runner.cu
Clamped token grid dims to min(8192, numTokens); renamed NumExpertsLimit → MaxSupportedExperts; promoted runImpl → run in llama4; removed some DeepSeek dtype inits.
Build/JIT changes
flashinfer/jit/fused_moe.py, flashinfer/jit/moe_utils.py, flashinfer/jit/cpp_ext.py
Reorganized CUDA source lists to new backend-prefixed paths; split routing into multiple compilation units; obj naming adjusted to include parent-dir prefix.
Helpers & tests
include/flashinfer/trtllm/common/cudaUtils.h, tests/moe/test_trtllm_gen_fused_moe.py
Added getMultiProcessorCount() helper; relaxed FP8 Renormalize top‑K limit to 32; added large-experts test case (2048 experts, top_k=32) and removed prior assertion.
Bindings & runtime dtype init
csrc/moe_utils_binding.cu, csrc/trtllm_fused_moe_runner.cu
Removed explicit initialization of mDtypeBias and mDtypeScore for certain DeepSeek paths; remaining dtype init sets mDtypeExpW only.

Sequence Diagram(s)

sequenceDiagram
    rect rgba(100,150,240,0.5)
    participant Host as Host
    end
    rect rgba(120,200,150,0.5)
    participant Orch as Orchestrator (run)
    end
    rect rgba(240,180,120,0.5)
    participant Macro as Launcher / Dispatch Macro
    end
    rect rgba(200,120,220,0.5)
    participant GPU as GPU Kernels
    end

    Host->>Orch: call run(Data)
    Orch->>Orch: validate inputs & pick backend (DeepSeek / Renormalize)
    Orch->>Macro: select LAUNCH_ROUTING_* macro and specific launcher
    Macro->>GPU: configure & launch kernel(s) (main/cluster/coop/histogram/offsets)
    GPU-->>Macro: kernel(s) complete (optional PDL sync)
    Macro-->>Orch: device buffers updated
    Orch-->>Host: return / completion

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

op: moe

Suggested reviewers

  • yzh119
  • cyx-6
  • IwakuraRein
  • kahyunnam
  • jimmyzho
  • nv-yunzheq

Poem

🐰 I hopped through headers, split the files,
Launchers lined up in tidy piles.
Kernels wake and macros sing,
Experts routed—bounce and spring.
A carrot for each compiled mile!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 51.02%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: The title clearly identifies the main feature being added: support for 2048 experts and 32 Top K, which aligns with the PR's core objective of extending expert and top-k capacity limits.
  • Description check — ✅ Passed: The PR description provides sufficient context: it references the TensorRT-LLM integration, states the objectives (2048 experts and 32 TopK support, MOE refactoring), and confirms all checklist items are completed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@gemini-code-assist

Warning

Gemini is experiencing higher than usual traffic and was unable to create the summary. Please try again in a few hours by commenting /gemini summary.

@jiahanc jiahanc marked this pull request as ready for review March 11, 2026 03:26

jiahanc commented Mar 11, 2026

/bot run

@flashinfer-bot

GitLab MR !400 has been created, and the CI pipeline #45848569 is currently running. I'll report back once the pipeline job completes.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
include/flashinfer/trtllm/fused_moe/RoutingKernel.h (1)

279-303: ⚠️ Potential issue | 🟠 Major

Thread DoSoftmaxBeforeTopK into the renormalize packed-pointer type.

csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu derives packed entries as PackedScoreIdx<conditional_t<DoSoftmaxBeforeTopK, float, InputT>>, but this KernelParams still stores and casts mPtrTopKPacked as PackedScoreIdx<OutputT>*. When DoSoftmaxBeforeTopK == true and OutputT != float, the packed buffer layout no longer matches the kernel contract.

Suggested fix
 template <typename InputT_, typename OutputT_, int MaxNumExperts_, int MaxNumTopExperts_,
           bool DoSoftmaxBeforeTopK_, bool isPow2_, bool UsePdl_>
 struct KernelParams : public KernelParamsBase<InputT_, OutputT_, MaxNumExperts_, MaxNumTopExperts_,
                                               isPow2_, UsePdl_> {
   using InputT = InputT_;
   using OutputT = OutputT_;
+  using PackedT =
+      PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK_, float, InputT_>>;
 
   static constexpr bool DoSoftmaxBeforeTopK = DoSoftmaxBeforeTopK_;
 
-  PackedScoreIdx<OutputT>* mPtrTopKPacked = nullptr;
+  PackedT* mPtrTopKPacked = nullptr;
@@
-    params.mPtrTopKPacked = (PackedScoreIdx<OutputT>*)data.mPtrTopKPacked;
+    params.mPtrTopKPacked = static_cast<PackedT*>(data.mPtrTopKPacked);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.h` around lines 279 - 303,
KernelParams currently stores mPtrTopKPacked as PackedScoreIdx<OutputT>*, but
the launched kernel expects PackedScoreIdx<conditional_t<DoSoftmaxBeforeTopK,
float, InputT>>; change mPtrTopKPacked's type declaration in struct KernelParams
to use the same conditional type (e.g.,
PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK, float, OutputT>>* or, if
the kernel uses InputT, PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK,
float, InputT>>*), and update the assignment in KernelParams::setKernelParams to
cast data.mPtrTopKPacked to that same conditional pointer type so the stored
pointer type matches the kernel contract (refer to symbols KernelParams,
mPtrTopKPacked, PackedScoreIdx, and setKernelParams).
🧹 Nitpick comments (1)
flashinfer/jit/moe_utils.py (1)

79-92: Extract the DeepSeek routing manifest into a shared helper/constant.

This file list is now duplicated here and in flashinfer/jit/fused_moe.py. The next TensorRT-LLM sync is likely to touch one list and miss the other, which would make the two JIT modules drift out of sync.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/jit/moe_utils.py` around lines 79 - 92, The DeepSeek routing CUDA
source list is duplicated between moe_utils.py and fused_moe.py; extract that
list into a single shared constant (e.g., DEEPSEEK_ROUTING_SOURCES or
DEEPSEEK_ROUTING_MANIFEST) in a common module (create a small helper or export
it from moe_utils.py) and replace the inline lists in both jit modules with
imports of that constant so both trtllm_fused_moe routing lists reference the
single source of truth.
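The deduplication suggested above can be sketched like this. The constant name follows the reviewer's own suggestion (DEEPSEEK_ROUTING_SOURCES); the file list shown is abbreviated from the launcher files named elsewhere in this review, and the module layout is an assumption, not the PR's actual code.

```python
# Sketch of the single-source-of-truth refactor suggested by the review.
# DEEPSEEK_ROUTING_SOURCES is the reviewer's suggested constant name; the
# exact manifest and home module are illustrative assumptions.
from pathlib import Path

BACKEND_DIR = Path("csrc/fused_moe/trtllm_backend/routingDeepSeek")

DEEPSEEK_ROUTING_SOURCES = [
    BACKEND_DIR / name
    for name in (
        "launchMainKernel.cu",
        "launchClusterKernel.cu",
        "launchCoopKernel.cu",
        "launchHistogramKernel.cu",
        "launchInitExpertCounts.cu",
        "launchOffsetsKernel.cu",
    )
]

# Both JIT modules would then import this list instead of repeating it:
#   from flashinfer.jit.moe_utils import DEEPSEEK_ROUTING_SOURCES
```

A TensorRT-LLM sync then only has to touch one list, so moe_utils.py and fused_moe.py cannot drift apart.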
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu`:
- Around line 189-195: Before calling cudaTriggerProgrammaticLaunchCompletion()
when KernelParams::UsePdl is true, ensure a system-wide memory fence so writes
to mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrNumNonExitingCtas,
mPtrTotalNumPaddedTokens and other global metadata are visible to the
subsequently-launched kernel; add an appropriate global fence (e.g.,
__threadfence_system() or your platform's equivalent) after the existing
__syncthreads() and before invoking cudaTriggerProgrammaticLaunchCompletion(),
keeping the conditional on KernelParams::UsePdl and preserving existing comments
and ordering.

In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh`:
- Around line 39-52: getMaxNumExperts currently returns 0 on unsupported inputs
which allows callers like LAUNCH_ROUTING_DEEPSEEK to divide-by-zero or silently
skip launches; change getMaxNumExperts to fail fast instead of returning 0 by
logging the error and aborting (or throwing/asserting) so an invalid numExperts
causes a deterministic crash, and apply the same change to the analogous
function in the same file (the similar block around lines 72-88) so neither
function ever returns 0 for unsupported expert counts.

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu`:
- Around line 69-78: The code writes to params.mPtrExpandedIdxToPermutedIdx
unconditionally in the expertIdx == -1 branch; guard that write with a
null-check so we only assign when the optional mapping was provided. Locate the
block handling params.mPtrTopKIds / expertIdx (within the warpIdx/laneIdx loop
using MaxNumExperts and smemKIdx) and wrap the
params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = -1
assignment inside an if (params.mPtrExpandedIdxToPermutedIdx != nullptr) check
(keeping the same index calculation).

In
`@csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh`:
- Around line 99-126: The macros currently treat any mTopK > NumTop16Experts as
MaxSupportedTopExperts, which causes mTopK > MaxSupportedTopExperts (e.g., 33+)
to fall through and index out-of-bounds; update LAUNCH_ROUTING_WITH_TOPK to
explicitly handle the 32-wide case and reject larger values: add an else-if
branch checking if data.mTopK <= NumTop32Experts and call
LAUNCH_ROUTING_WITH_NUM_EXPERTS with NumTop32Experts, and replace the final else
with an error/log and no-launch path for unsupported data.mTopK (referencing
data.mTopK, LAUNCH_ROUTING_WITH_TOPK, LAUNCH_ROUTING_WITH_NUM_EXPERTS,
NumTop8Experts, NumTop16Experts, NumTop32Experts, MaxSupportedTopExperts).

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu`:
- Around line 59-60: The code caches the SM count in a static variable (smCount)
which uses tensorrt_llm::common::getMultiProcessorCount() that is
device-specific; remove the static so the SM count is retrieved per-call (e.g.,
make smCount a non-static local variable where numBlocksCoop is computed) and
recompute numBlocksCoop from that per-device value (also guard numBlocksCoop to
not go negative if needed) to avoid pinning launch size to the first device that
hits this code.
- Around line 47-72: Validate data.mTopK before using it in the maxTokensCoop
calculation and ensure mTopK is within valid bounds relative to mNumExperts and
constants: check data.mTopK > 0 and data.mTopK <= data.mNumExperts and
data.mTopK <= MaxSupportedTopExperts early (before computing int const
maxTokensCoop = ... / data.mTopK), and reorder/adjust the later guards involving
data.mPtrTopKIds, data.mNumExperts, MaxSupportedTopExperts, and
MaxSupportedExpertCount so that the precomputed-TopK path cannot execute with
mTopK == 0 or mTopK > mNumExperts; reference the symbols data.mTopK,
data.mNumExperts, data.mPtrTopKIds, maxTokensCoop, MaxSupportedTopExperts, and
MaxSupportedExpertCount.

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu`:
- Around line 242-247: The code writes
params.mPtrPermutedIdxToExpandedIdx[permutedIdx] without checking isTokenRouted,
which allows writing invalid slots for local-but-unselected experts; update the
condition on the mPtrPermutedIdxToExpandedIdx write to mirror the other
reverse-map guard by requiring both isLocalExpert and isTokenRouted (i.e., only
write params.mPtrPermutedIdxToExpandedIdx[permutedIdx] when isTokenRouted is
true), keeping the existing use of permutedIdx, tokenIdx, finalExpertOffset and
localOffsetToken to compute the index.

In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 870-871: The runtime check using TVM_FFI_ICHECK currently enforces
args->top_k <= 32 but the error string still says "top_k<=10", so update the
diagnostic to match the enforced limit: change the message string in the
TVM_FFI_ICHECK call that references args->top_k to say "top_k<=32 && top_k>0"
(keep the same surrounding text "Current routing kernel (no groups, renormalize)
only supports ..."). Ensure you edit the string associated with the
TVM_FFI_ICHECK invocation so runtime diagnostics accurately reflect the new
limit.

In `@flashinfer/jit/fused_moe.py`:
- Around line 264-297: The sources list in fused_moe.py contains duplicate
basenames (e.g., launchClusterKernel.cu, launchHistogramKernel.cu,
launchInitExpertCounts.cu, launchOffsetsKernel.cu) that will collide when
JitSpec.get_object_paths() and generate_ninja_build_for_op() convert sources to
object names using source.with_suffix(object_suffix).name; fix by either (A)
deduplicating/renaming the entries in the sources list in fused_moe.py to ensure
unique basenames per compilation unit (move into uniquely named files or prefix
filenames by directory), or (B) change the object-naming logic in
JitSpec.get_object_paths() / generate_ninja_build_for_op() to include path
components or a short hash (e.g., incorporate source.parent.name or a hash of
source) when constructing the object filename to disambiguate
otherwise-identical basenames.
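Option (B) above can be sketched as a small naming helper. The change summary for flashinfer/jit/cpp_ext.py says object naming was adjusted to include a parent-dir prefix, so this sketch models that approach; `obj_name_for` is a hypothetical helper name, not the function in the PR.

```python
# Sketch of disambiguating object names for sources that share a basename
# (e.g. launchClusterKernel.cu exists under both routingDeepSeek/ and
# routingRenormalize/): fold the parent directory into the object filename.
# obj_name_for is a hypothetical name for illustration.
from pathlib import Path

def obj_name_for(source, object_suffix=".o"):
    # "routingDeepSeek/launchClusterKernel.cu"
    #   -> "routingDeepSeek_launchClusterKernel.o"
    source = Path(source)
    return f"{source.parent.name}_{source.stem}{object_suffix}"
```

Because the parent directory participates in the name, the two identically named compilation units no longer collide in the ninja build.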

In `@include/flashinfer/trtllm/common/cudaUtils.h`:
- Around line 272-277: The getMultiProcessorCount function lacks CUDA error
checking and can return garbage; wrap the cudaGetDevice and
cudaDeviceGetAttribute calls with the same error-checking macro used elsewhere
(e.g., FLASHINFER_CHECK) or check their cudaError_t return values and handle
failures before returning count. Update the function (getMultiProcessorCount) to
call FLASHINFER_CHECK(cudaGetDevice(...)) and
FLASHINFER_CHECK(cudaDeviceGetAttribute(..., cudaDevAttrMultiProcessorCount,
device)) (or equivalent error handling) so failures are logged/handled and a
safe default or abort is used instead of returning an unchecked count.

In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh`:
- Around line 440-453: The code paths that set idx only when mPtrTopKIds or
mPtrTopKWeights are present leave idx uninitialized when mPtrTopKIds==nullptr
and mPtrTopKWeights==nullptr; always decode params.mPtrTopKPacked[expandedIdx]
into scoreIdx and assign idx = scoreIdx.idx (and set mPtrTopKWeights if present)
before computing localExpertIdx/isLocalExpert so
atomicAdd(&smemExpertCount[idx], 1) uses a valid index; update the branch around
params.mPtrTopKIds/mPtrTopKWeights to unconditionally read params.mPtrTopKPacked
and populate idx first, then conditionally write params.mPtrTopKWeights.

In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 2872-2892: This 2048-expert config (num_experts=2048, top_k=32,
RoutingMethodType.Renormalize) must be removed from the large parametrized
matrix and isolated into a small targeted smoke test with autotune disabled:
create a new focused test that uses the same dict but sets enable_autotune=False
(and a distinct id like "RoutingRenormalize_large_experts_smoke"), and remove or
replace the pytest.param entry from the full matrix so the heavy case no longer
participates in the combinatorial num_tokens × zero_hidden_states × moe_impl ×
weight_processing × activation_type runs.
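The isolated smoke test suggested above could take roughly this shape. The dict keys mirror the parametrization described in tests/moe/test_trtllm_gen_fused_moe.py, but `run_moe_test` and `enable_autotune` are assumed helper names, and the pytest wiring is shown only in comments; treat this as a shape sketch, not a drop-in test.

```python
# Sketch of the focused smoke-test configuration recommended by the review:
# one fixed 2048-expert / top_k=32 case with autotune disabled, instead of
# the full combinatorial matrix. Helper names below are hypothetical.
LARGE_EXPERT_SMOKE = {
    "num_experts": 2048,
    "top_k": 32,
    "routing_method": "Renormalize",  # RoutingMethodType.Renormalize in the real test
}

# In the test module this would become a single targeted case, e.g.:
#   @pytest.mark.parametrize("cfg", [pytest.param(
#       LARGE_EXPERT_SMOKE, id="RoutingRenormalize_large_experts_smoke")])
#   def test_large_experts_smoke(cfg):
#       run_moe_test(cfg, enable_autotune=False)  # hypothetical helper
```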

---

Outside diff comments:
In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.h`:
- Around line 279-303: KernelParams currently stores mPtrTopKPacked as
PackedScoreIdx<OutputT>*, but the launched kernel expects
PackedScoreIdx<conditional_t<DoSoftmaxBeforeTopK, float, InputT>>; change
mPtrTopKPacked's type declaration in struct KernelParams to use the same
conditional type (e.g., PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK,
float, OutputT>>* or, if the kernel uses InputT,
PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK, float, InputT>>*), and
update the assignment in KernelParams::setKernelParams to cast
data.mPtrTopKPacked to that same conditional pointer type so the stored pointer
type matches the kernel contract (refer to symbols KernelParams, mPtrTopKPacked,
PackedScoreIdx, and setKernelParams).

---

Nitpick comments:
In `@flashinfer/jit/moe_utils.py`:
- Around line 79-92: The DeepSeek routing CUDA source list is duplicated between
moe_utils.py and fused_moe.py; extract that list into a single shared constant
(e.g., DEEPSEEK_ROUTING_SOURCES or DEEPSEEK_ROUTING_MANIFEST) in a common module
(create a small helper or export it from moe_utils.py) and replace the inline
lists in both jit modules with imports of that constant so both trtllm_fused_moe
routing lists reference the single source of truth.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3886ca25-8b46-4874-8000-8b83bfc701bc

📥 Commits

Reviewing files that changed from the base of the PR and between fe06b91 and cdf67c7.

📒 Files selected for processing (31)
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchClusterKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchHistogramKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchInitExpertCounts.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchMainKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchOffsetsKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchClusterKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramScoresKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchInitExpertCounts.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchOffsetsKernel.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_dev_kernel.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_renormalize.cu
  • csrc/trtllm_fused_moe_kernel_launcher.cu
  • csrc/trtllm_fused_moe_routing_deepseek.cu
  • csrc/trtllm_fused_moe_routing_renormalize.cu
  • csrc/trtllm_fused_moe_runner.cu
  • flashinfer/jit/cpp_ext.py
  • flashinfer/jit/fused_moe.py
  • flashinfer/jit/moe_utils.py
  • include/flashinfer/trtllm/common/cudaUtils.h
  • include/flashinfer/trtllm/fused_moe/DevKernel.h
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.h
  • include/flashinfer/trtllm/fused_moe/RoutingKernelTopK.cuh
  • tests/moe/test_trtllm_gen_fused_moe.py
💤 Files with no reviewable changes (2)
  • csrc/trtllm_fused_moe_routing_renormalize.cu
  • csrc/trtllm_fused_moe_routing_deepseek.cu

Comment on lines +189 to +195
// trigger the secondary kernel when using PDL
// We can't do it earlier because FC1 depends on the mPtrCtaIdxXyToBatchIdx,
// mPtrCtaIdxXyToMnLimit, mPtrNumNonExitingCtas and mPtrTotalNumPaddedTokens
// TODO: this is not sufficient to ensure visibility in the next kernel!
if constexpr (KernelParams::UsePdl) {
cudaTriggerProgrammaticLaunchCompletion();
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Locate and examine the file
fd "launchCoopKernel.cu" --type f

Repository: flashinfer-ai/flashinfer

Length of output: 133


🏁 Script executed:

# If found, read the relevant section with context
cat -n csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu | head -250 | tail -100

Repository: flashinfer-ai/flashinfer

Length of output: 5119


🏁 Script executed:

# Search for PDL usage patterns in the codebase to understand the context
rg "cudaTriggerProgrammaticLaunchCompletion|cudaGridDependencySynchronize" -B 3 -A 3

Repository: flashinfer-ai/flashinfer

Length of output: 41002


🏁 Script executed:

# Search for cooperative groups usage and grid.sync patterns
rg "grid\.sync|__threadfence|cg::sync" --type cu --type h -B 2 -A 2

Repository: flashinfer-ai/flashinfer

Length of output: 94


🏁 Script executed:

# Check kernel launch configuration and block/grid structure in this file
rg "dim3|blockDim|gridDim" csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu -B 2 -A 2

Repository: flashinfer-ai/flashinfer

Length of output: 283


🏁 Script executed:

# Search for where FC1 or subsequent kernels use cudaGridDependencySynchronize
rg "cudaGridDependencySynchronize" csrc/fused_moe/trtllm_backend/routingDeepSeek/ -B 5 -A 5

Repository: flashinfer-ai/flashinfer

Length of output: 3524


🏁 Script executed:

# Check what kernel is launched after routingIndicesCoopKernel and if it synchronizes
rg "LAUNCH_ROUTING_DEEPSEEK|routingIndicesCoopKernel" csrc/fused_moe/trtllm_backend/routingDeepSeek/ -B 3 -A 8

Repository: flashinfer-ai/flashinfer

Length of output: 12940


🏁 Script executed:

# Look for mPtrCtaIdxXyToBatchIdx and mPtrCtaIdxXyToMnLimit to understand memory scope
rg "mPtrCtaIdxXyToBatchIdx|mPtrCtaIdxXyToMnLimit" csrc/ -B 2 -A 2 | head -80

Repository: flashinfer-ai/flashinfer

Length of output: 6867


🏁 Script executed:

# Check if __threadfence or __threadfence_system is used anywhere in PDL contexts
rg "__threadfence" csrc/fused_moe/trtllm_backend/ -B 3 -A 3

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check if kernel launches with single or multiple blocks
rg "launchCoopKernel|numBlocksCoop" csrc/ --type cu -B 2 -A 2 | grep -A 5 "numBlocksCoop"

Repository: flashinfer-ai/flashinfer

Length of output: 94


🏁 Script executed:

# Verify CUDA PDL semantics in comments/docs within repo
rg "PDL|Programmatic Dependent Launch" include/ --type h -B 2 -A 2 | head -60

Repository: flashinfer-ai/flashinfer

Length of output: 7785


Add explicit memory ordering before PDL trigger to ensure metadata visibility.

Line 192 documents the issue: the current code does not guarantee visibility of mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, and scalar outputs to the next kernel before cudaTriggerProgrammaticLaunchCompletion() releases it.

The block-level __syncthreads() at line 187 is insufficient for global device memory. A system-wide fence is required before signaling PDL completion to ensure all threads' writes to global metadata are visible before the next kernel launches.

🛠️ Proposed fix
   // TODO: this is not sufficient to ensure visibility in the next kernel!
   if constexpr (KernelParams::UsePdl) {
+    __threadfence_system();
     cudaTriggerProgrammaticLaunchCompletion();
   }

Verify this aligns with your CUDA toolkit's PDL memory-ordering contract and whether __threadfence_system() or an alternative primitive (e.g., grid synchronization) is required for your target SM version.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu` around
lines 189 - 195, Before calling cudaTriggerProgrammaticLaunchCompletion() when
KernelParams::UsePdl is true, ensure a system-wide memory fence so writes to
mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrNumNonExitingCtas,
mPtrTotalNumPaddedTokens and other global metadata are visible to the
subsequently-launched kernel; add an appropriate global fence (e.g.,
__threadfence_system() or your platform's equivalent) after the existing
__syncthreads() and before invoking cudaTriggerProgrammaticLaunchCompletion(),
keeping the conditional on KernelParams::UsePdl and preserving existing comments
and ordering.

Comment on lines +39 to +52
int constexpr getMaxNumExperts(int32_t numExperts) {
if (numExperts <= topk::MaxNumExpertsUnit) {
return topk::MaxNumExpertsUnit;
} else if (numExperts <= NumDeepseekExperts) {
return NumDeepseekExperts;
} else if (numExperts <= NumKimiK2Experts) {
return NumKimiK2Experts;
} else if (numExperts <= NumNemotronExperts) {
return NumNemotronExperts;
} else {
TLLM_LOG_ERROR("Unsupported numExperts");
return 0;
}
}

⚠️ Potential issue | 🟠 Major

Fail fast on unsupported expert counts.

getMaxNumExperts() returns 0, and LAUNCH_ROUTING_DEEPSEEK only logs on the unsupported path. The new launchers use that value as numThreadsHist and as a divisor in grid sizing, so an out-of-range mNumExperts can degrade into divide-by-zero or a silently skipped launch rather than a deterministic failure.

Also applies to: 72-88

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh`
around lines 39 - 52, getMaxNumExperts currently returns 0 on unsupported inputs
which allows callers like LAUNCH_ROUTING_DEEPSEEK to divide-by-zero or silently
skip launches; change getMaxNumExperts to fail fast instead of returning 0 by
logging the error and aborting (or throwing/asserting) so an invalid numExperts
causes a deterministic crash, and apply the same change to the analogous
function in the same file (the similar block around lines 72-88) so neither
function ever returns 0 for unsupported expert counts.
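The fail-fast behavior recommended above can be modeled in a few lines. The tier constants are placeholders (the real values of MaxNumExpertsUnit / NumDeepseekExperts / NumKimiK2Experts / NumNemotronExperts live in RoutingDeepSeekCommon.cuh and are not given here); the point is only that an unsupported count raises instead of returning 0 into grid-sizing division.

```python
# Illustrative Python model of a fail-fast getMaxNumExperts. Tier values are
# placeholders, ascending; the real constants are defined in
# RoutingDeepSeekCommon.cuh.
EXPERT_TIERS = [128, 256, 384, 1024]

def get_max_num_experts(num_experts):
    for tier in EXPERT_TIERS:
        if num_experts <= tier:
            return tier
    # Fail fast: never return 0, which downstream launch math would use as
    # a divisor or thread count.
    raise ValueError(f"Unsupported numExperts: {num_experts}")
```

In the CUDA code the equivalent would be a TLLM_LOG_ERROR followed by an abort/assert rather than a `return 0;`.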

Comment on lines +69 to +78
if (params.mPtrTopKIds != nullptr) {
if (validToken) {
if (laneIdx < params.mTopK) {
auto expertIdx = params.mPtrTopKIds[warpIdx * params.mTopK + laneIdx];
if (expertIdx != -1) {
int offset = warpIdx * MaxNumExperts + expertIdx;
smemKIdx[offset] = static_cast<int8_t>(laneIdx);
} else {
params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
}

⚠️ Potential issue | 🔴 Critical

Guard the optional mPtrExpandedIdxToPermutedIdx write.

mPtrExpandedIdxToPermutedIdx is optional, but the expertIdx == -1 branch dereferences it unconditionally. A caller that passes precomputed mPtrTopKIds with padded -1 entries and does not request the expanded-index mapping will crash here.

Suggested fix
-        } else {
-          params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
+        } else if (params.mPtrExpandedIdxToPermutedIdx != nullptr) {
+          params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu` around
lines 69 - 78, The code writes to params.mPtrExpandedIdxToPermutedIdx
unconditionally in the expertIdx == -1 branch; guard that write with a
null-check so we only assign when the optional mapping was provided. Locate the
block handling params.mPtrTopKIds / expertIdx (within the warpIdx/laneIdx loop
using MaxNumExperts and smemKIdx) and wrap the
params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = -1
assignment inside an if (params.mPtrExpandedIdxToPermutedIdx != nullptr) check
(keeping the same index calculation).
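The null-guard the fix asks for is the usual optional-output pattern; a host-side sketch with hypothetical names:

```cpp
#include <cstdint>

// Optional-output pattern: only mark the padded slot when the caller
// actually provided the expanded->permuted mapping buffer.
void markPaddedSlot(int32_t* expandedIdxToPermutedIdx /* may be nullptr */,
                    int32_t topK, int32_t warpIdx, int32_t laneIdx) {
  if (expandedIdxToPermutedIdx != nullptr) {
    expandedIdxToPermutedIdx[warpIdx * topK + laneIdx] = -1;  // padded entry
  }
  // When the buffer is absent, the padded entry is simply skipped.
}
```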

Comment on lines +99 to +126
// Helper macro: dispatch on topK tier for a given numExperts tier.
#define LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1, numExperts) \
if (data.mTopK <= NumTop8Experts) { \
LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1, numExperts, NumTop8Experts); \
} else if (data.mTopK <= NumTop16Experts) { \
LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1, numExperts, NumTop16Experts); \
} else { \
LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1, numExperts, MaxSupportedTopExperts); \
}

#define LAUNCH_ROUTING_RENORMALIZE(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1) \
if (data.mNumExperts <= NumExperts128Experts) { \
LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, stream, \
extraFlag1, NumExperts128Experts); \
} else if (data.mNumExperts <= NumExperts512Experts) { \
LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, stream, \
extraFlag1, NumExperts512Experts); \
} else if (data.mNumExperts <= MaxSupportedExperts) { \
LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, stream, \
extraFlag1, MaxSupportedExperts); \
} else { \
TLLM_LOG_ERROR("Unsupported numExperts"); \
}

⚠️ Potential issue | 🔴 Critical

Reject mTopK > 32 instead of falling through to the 32-wide specialization.

The final else currently catches both the supported 17..32 range and unsupported larger values. When data.mTopK is 33+, the downstream renormalize kernels still index fixed-size [MaxNumTopExperts] buffers with laneIdx < params.mTopK, which will walk past those 32-entry arrays.

🛠️ Proposed fix
 #define LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize,    \
                                  stream, extraFlag1, numExperts)                               \
   if (data.mTopK <= NumTop8Experts) {                                                          \
     LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
                                     stream, extraFlag1, numExperts, NumTop8Experts);           \
   } else if (data.mTopK <= NumTop16Experts) {                                                  \
     LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
                                     stream, extraFlag1, numExperts, NumTop16Experts);          \
-  } else {                                                                                     \
+  } else if (data.mTopK <= MaxSupportedTopExperts) {                                           \
     LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
                                     stream, extraFlag1, numExperts, MaxSupportedTopExperts);   \
+  } else {                                                                                     \
+    TLLM_LOG_ERROR("Unsupported topK");                                                        \
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh`
around lines 99 - 126, The macros currently treat any mTopK > NumTop16Experts as
MaxSupportedTopExperts, which causes mTopK > MaxSupportedTopExperts (e.g., 33+)
to fall through and index out-of-bounds; update LAUNCH_ROUTING_WITH_TOPK to
explicitly handle the 32-wide case and reject larger values: add an else-if
branch checking if data.mTopK <= NumTop32Experts and call
LAUNCH_ROUTING_WITH_NUM_EXPERTS with NumTop32Experts, and replace the final else
with an error/log and no-launch path for unsupported data.mTopK (referencing
data.mTopK, LAUNCH_ROUTING_WITH_TOPK, LAUNCH_ROUTING_WITH_NUM_EXPERTS,
NumTop8Experts, NumTop16Experts, NumTop32Experts, MaxSupportedTopExperts).
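The tier selection the macro performs can be sketched as a plain function; constants are illustrative stand-ins for NumTop8Experts / NumTop16Experts / MaxSupportedTopExperts:

```cpp
#include <stdexcept>

// Illustrative tier widths mirroring the macro's specializations.
constexpr int kTop8 = 8;
constexpr int kTop16 = 16;
constexpr int kTopMax = 32;

// Maps a runtime topK to its specialization tier, rejecting values above
// the widest tier instead of silently clamping them to 32.
int selectTopKTier(int topK) {
  if (topK <= 0) throw std::invalid_argument("topK must be positive");
  if (topK <= kTop8) return kTop8;
  if (topK <= kTop16) return kTop16;
  if (topK <= kTopMax) return kTopMax;
  throw std::invalid_argument("Unsupported topK");
}
```

The macro version would put the rejection in the final else branch (log and skip the launch), which is exactly what the proposed diff does.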

Comment on lines +47 to +72
FLASHINFER_CHECK(!data.mUseRoutingSoftmax, "Routing with softmax not implemented yet");
int const numBlocks = data.mNumTokens;
int const numThreadsHist = getMaxNumExperts(data.mNumExperts);

bool const useSingleCluster = data.mNumTokens <= 1024;
if (!useSingleCluster) {
FLASHINFER_CHECK(data.mPtrExpertCounts != nullptr,
"When #tokens is large, `mPtrExpertCounts` is a required input.");
} else {
data.mPtrExpertCounts = nullptr;
}

static int const smCount = tensorrt_llm::common::getMultiProcessorCount();
int const numBlocksCoop = smCount - 8;

int const maxTokensCoop = (numBlocksCoop * numThreadsHist * 64) / data.mTopK;
if (data.mPtrTopKIds == nullptr) {
FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
"Routing kernel expects %d to be at most #experts %d", MaxSupportedTopExperts,
data.mNumExperts);
FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
"Routing kernel expects #experts %d <= %d", data.mNumExperts,
MaxSupportedExpertCount);
FLASHINFER_CHECK(data.mTopK <= MaxSupportedTopExperts,
"Routing kernel expects topK experts <= %d, got %d", MaxSupportedTopExperts,
data.mTopK);

⚠️ Potential issue | 🔴 Critical

Validate mTopK before using it, and compare it to mNumExperts.

Line 62 divides by mTopK before any bound check, and the later data.mNumExperts >= MaxSupportedTopExperts guard rejects valid smaller-expert configs while still leaving the precomputed-TopK path unvalidated. mTopK == 0 is UB here, and data.mPtrTopKIds != nullptr can still send topK > 32 or topK > mNumExperts into the downstream kernels.

Suggested fix
   FLASHINFER_CHECK(!data.mUseRoutingSoftmax, "Routing with softmax not implemented yet");
+  FLASHINFER_CHECK(data.mTopK > 0 && data.mTopK <= MaxSupportedTopExperts,
+                   "Routing kernel expects topK in [1, %d], got %d",
+                   MaxSupportedTopExperts, data.mTopK);
+  FLASHINFER_CHECK(data.mNumExperts >= data.mTopK,
+                   "Routing kernel expects topK %d <= `#experts` %d",
+                   data.mTopK, data.mNumExperts);
   int const numBlocks = data.mNumTokens;
   int const numThreadsHist = getMaxNumExperts(data.mNumExperts);
@@
-    FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
-                     "Routing kernel expects %d to be at most `#experts` %d", MaxSupportedTopExperts,
-                     data.mNumExperts);
     FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
                      "Routing kernel expects `#experts` %d <= %d", data.mNumExperts,
                      MaxSupportedExpertCount);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu` around
lines 47 - 72, Validate data.mTopK before using it in the maxTokensCoop
calculation and ensure mTopK is within valid bounds relative to mNumExperts and
constants: check data.mTopK > 0 and data.mTopK <= data.mNumExperts and
data.mTopK <= MaxSupportedTopExperts early (before computing int const
maxTokensCoop = ... / data.mTopK), and reorder/adjust the later guards involving
data.mPtrTopKIds, data.mNumExperts, MaxSupportedTopExperts, and
MaxSupportedExpertCount so that the precomputed-TopK path cannot execute with
mTopK == 0 or mTopK > mNumExperts; reference the symbols data.mTopK,
data.mNumExperts, data.mPtrTopKIds, maxTokensCoop, MaxSupportedTopExperts, and
MaxSupportedExpertCount.
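The validate-before-use ordering the fix calls for, as a minimal host-side sketch (function and parameter names are illustrative):

```cpp
#include <stdexcept>

constexpr int kMaxSupportedTopK = 32;

// Bound topK (and relate it to numExperts) before it becomes a divisor.
int computeMaxTokensCoop(int numBlocksCoop, int numThreadsHist,
                         int numExperts, int topK) {
  if (topK <= 0 || topK > kMaxSupportedTopK)
    throw std::invalid_argument("topK out of [1, 32]");
  if (topK > numExperts)
    throw std::invalid_argument("topK exceeds numExperts");
  // Safe: topK is known to be a positive divisor here.
  return (numBlocksCoop * numThreadsHist * 64) / topK;
}
```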

Comment on lines +870 to 871
TVM_FFI_ICHECK(args->top_k <= 32 && args->top_k > 0)
<< "Current routing kernel (no groups, renormalize) only supports top_k<=10 && top_k>0.";

⚠️ Potential issue | 🟡 Minor

Update the validation message to match the new limit.

Line 871 still says top_k<=10, but Line 870 now accepts top_k<=32. That will produce misleading runtime diagnostics for this path.

Proposed fix
-      TVM_FFI_ICHECK(args->top_k <= 32 && args->top_k > 0)
-          << "Current routing kernel (no groups, renormalize) only supports top_k<=10 && top_k>0.";
+      TVM_FFI_ICHECK(args->top_k <= 32 && args->top_k > 0)
+          << "Current routing kernel (no groups, renormalize) only supports top_k<=32 && top_k>0.";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/trtllm_fused_moe_kernel_launcher.cu` around lines 870 - 871, The runtime
check using TVM_FFI_ICHECK currently enforces args->top_k <= 32 but the error
string still says "top_k<=10", so update the diagnostic to match the enforced
limit: change the message string in the TVM_FFI_ICHECK call that references
args->top_k to say "top_k<=32 && top_k>0" (keep the same surrounding text
"Current routing kernel (no groups, renormalize) only supports ..."). Ensure you
edit the string associated with the TVM_FFI_ICHECK invocation so runtime
diagnostics accurately reflect the new limit.

Comment on lines +272 to +277
inline int getMultiProcessorCount() {
int device, count;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device);
return count;
}

⚠️ Potential issue | 🟠 Major

Missing error handling for CUDA API calls.

The function does not check the return values of cudaGetDevice and cudaDeviceGetAttribute. This is inconsistent with the error handling pattern used elsewhere in this file (e.g., isCapturing at Line 115-116 uses FLASHINFER_CHECK). If either call fails, count may contain garbage, leading to incorrect kernel launch configurations.

🛡️ Proposed fix to add error checking
 inline int getMultiProcessorCount() {
   int device, count;
-  cudaGetDevice(&device);
-  cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device);
+  FLASHINFER_CHECK(cudaGetDevice(&device) == cudaSuccess,
+                   "CUDA error in cudaGetDevice");
+  FLASHINFER_CHECK(cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device) == cudaSuccess,
+                   "CUDA error in cudaDeviceGetAttribute");
   return count;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/flashinfer/trtllm/common/cudaUtils.h` around lines 272 - 277, The
getMultiProcessorCount function lacks CUDA error checking and can return
garbage; wrap the cudaGetDevice and cudaDeviceGetAttribute calls with the same
error-checking macro used elsewhere (e.g., FLASHINFER_CHECK) or check their
cudaError_t return values and handle failures before returning count. Update the
function (getMultiProcessorCount) to call FLASHINFER_CHECK(cudaGetDevice(...))
and FLASHINFER_CHECK(cudaDeviceGetAttribute(..., cudaDevAttrMultiProcessorCount,
device)) (or equivalent error handling) so failures are logged/handled and a
safe default or abort is used instead of returning an unchecked count.
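An off-GPU sketch of the guarded-query pattern. The CHECK_OK macro is a hypothetical stand-in for FLASHINFER_CHECK, and the callbacks stand in for cudaGetDevice / cudaDeviceGetAttribute returning success or failure:

```cpp
#include <stdexcept>

// Hypothetical stand-in for FLASHINFER_CHECK: fail fast on a false condition.
#define CHECK_OK(cond, msg) \
  do {                      \
    if (!(cond)) throw std::runtime_error(msg); \
  } while (0)

// Guarded query: count is only returned when both status checks passed,
// so garbage can never propagate into launch configuration.
int getCountChecked(bool (*getDevice)(int*), bool (*getAttr)(int*, int)) {
  int device = 0, count = 0;
  CHECK_OK(getDevice(&device), "cudaGetDevice failed");
  CHECK_OK(getAttr(&count, device), "cudaDeviceGetAttribute failed");
  return count;
}
```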

Comment on lines 440 to 453
if (params.mPtrTopKIds != nullptr) {
idx = params.mPtrTopKIds[expandedIdx];
} else {
// If params.mPtrTopKIds != nullptr, we don't need to store the weights
if (params.mPtrTopKWeights != nullptr) {
scoreIdx = params.mPtrTopKPacked[expandedIdx];
idx = scoreIdx.idx;
params.mPtrTopKWeights[expandedIdx] = static_cast<OutputT>(scoreIdx.score);
}
}
// check whether this expert is local to our GPU at all and ignore if not
auto localExpertIdx = idx - params.mLocalExpertsStartIdx;
auto isLocalExpert = localExpertIdx >= 0 && localExpertIdx < localExpertExtent &&
-                    (localExpertIdx & params.mLocalExpertsStrideLog2) == 0;
+                    (localExpertIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;
if (isLocalExpert) {
atomicAdd(&smemExpertCount[idx], 1);

⚠️ Potential issue | 🔴 Critical

Always decode mPtrTopKPacked when mPtrTopKIds is null.

idx is needed for histogramming regardless of whether a standalone weight buffer is requested. In the packed-only path (mPtrTopKIds == nullptr && mPtrTopKWeights == nullptr), this branch leaves idx uninitialized and can send atomicAdd to a garbage expert slot, corrupting counts before the offsets stage runs.

🐛 Proposed fix
     if (params.mPtrTopKIds != nullptr) {
       idx = params.mPtrTopKIds[expandedIdx];
     } else {
-      if (params.mPtrTopKWeights != nullptr) {
-        scoreIdx = params.mPtrTopKPacked[expandedIdx];
-        idx = scoreIdx.idx;
+      scoreIdx = params.mPtrTopKPacked[expandedIdx];
+      idx = scoreIdx.idx;
+      if (params.mPtrTopKWeights != nullptr) {
         params.mPtrTopKWeights[expandedIdx] = static_cast<OutputT>(scoreIdx.score);
       }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh` around lines 440 -
453, The code paths that set idx only when mPtrTopKIds or mPtrTopKWeights are
present leave idx uninitialized when mPtrTopKIds==nullptr and
mPtrTopKWeights==nullptr; always decode params.mPtrTopKPacked[expandedIdx] into
scoreIdx and assign idx = scoreIdx.idx (and set mPtrTopKWeights if present)
before computing localExpertIdx/isLocalExpert so
atomicAdd(&smemExpertCount[idx], 1) uses a valid index; update the branch around
params.mPtrTopKIds/mPtrTopKWeights to unconditionally read params.mPtrTopKPacked
and populate idx first, then conditionally write params.mPtrTopKWeights.
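A host-side sketch of the unconditional-decode shape the fix describes (struct and function names are illustrative):

```cpp
#include <cstdint>

struct PackedScoreIdx {  // stand-in for the packed score/index entry
  float score;
  int32_t idx;
};

// Always decode the packed entry when no explicit top-k id buffer is given,
// so idx is initialized before it feeds the expert histogram; the weight
// write stays optional.
int32_t resolveExpertIdx(const int32_t* topKIds, const PackedScoreIdx* packed,
                         float* weightsOut /* may be nullptr */, int expandedIdx) {
  if (topKIds != nullptr) return topKIds[expandedIdx];
  PackedScoreIdx s = packed[expandedIdx];
  if (weightsOut != nullptr) weightsOut[expandedIdx] = s.score;
  return s.idx;
}
```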

Comment on lines +2872 to +2892
pytest.param(
{
"num_experts": 2048,
"top_k": 32,
"padding": 8,
"n_groups": None,
"top_k_groups": None,
"routed_scaling": None,
"has_routing_bias": False,
"routing_method_type": RoutingMethodType.Renormalize,
"compatible_moe_impls": [
FP8BlockScaleMoe,
FP4Moe,
BF16Moe,
MxInt4BlockScaleMoe,
],
"compatible_intermediate_size": [384],
"enable_autotune": True,
},
id="RoutingRenormalize_large_experts",
),

⚠️ Potential issue | 🟠 Major

This 2048-expert case is too large for the full parametrized matrix.

With hidden_size=1024 and gated activations, this single config already allocates about 3 GiB for gemm1_weights plus 1.5 GiB for gemm2_weights in bf16, before quantized/shuffled copies or reference tensors. Because it also participates in the full num_tokens × zero_hidden_states × moe_impl × weight_processing × activation_type matrix with autotune enabled, it is very likely to become an OOM/timeout source on common CI GPUs. Please isolate the 2048/32 coverage into a small targeted smoke test and keep autotune off there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/moe/test_trtllm_gen_fused_moe.py` around lines 2872 - 2892, This
2048-expert config (num_experts=2048, top_k=32, RoutingMethodType.Renormalize)
must be removed from the large parametrized matrix and isolated into a small
targeted smoke test with autotune disabled: create a new focused test that uses
the same dict but sets enable_autotune=False (and a distinct id like
"RoutingRenormalize_large_experts_smoke"), and remove or replace the
pytest.param entry from the full matrix so the heavy case no longer participates
in the combinatorial num_tokens × zero_hidden_states × moe_impl ×
weight_processing × activation_type runs.

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu (1)

69-78: ⚠️ Potential issue | 🔴 Critical

Guard the optional mPtrExpandedIdxToPermutedIdx write.

mPtrExpandedIdxToPermutedIdx is optional, but the expertIdx == -1 branch at line 77 dereferences it unconditionally. A caller that passes precomputed mPtrTopKIds with padded -1 entries without requesting the expanded-index mapping will crash here.

Note: The same pointer is correctly guarded later at lines 242-244.

Suggested fix
         } else {
-          params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
+          if (params.mPtrExpandedIdxToPermutedIdx != nullptr) {
+            params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
+          }
         }


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu` around
lines 69 - 78, The branch handling expertIdx == -1 writes to the optional buffer
params.mPtrExpandedIdxToPermutedIdx unguarded; guard that write by checking
params.mPtrExpandedIdxToPermutedIdx != nullptr before assigning int32_t{-1} (so
when mPtrExpandedIdxToPermutedIdx is null we skip the write), keeping the
existing behavior for smemKIdx and using the same null-check pattern as later
code; locate this in the block that reads params.mPtrTopKIds, using
warpIdx/laneIdx/MaxNumExperts to find the correct index.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu`:
- Around line 69-78: The branch handling expertIdx == -1 writes to the optional
buffer params.mPtrExpandedIdxToPermutedIdx unguarded; guard that write by
checking params.mPtrExpandedIdxToPermutedIdx != nullptr before assigning
int32_t{-1} (so when mPtrExpandedIdxToPermutedIdx is null we skip the write),
keeping the existing behavior for smemKIdx and using the same null-check pattern
as later code; locate this in the block that reads params.mPtrTopKIds, using
warpIdx/laneIdx/MaxNumExperts to find the correct index.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9b5e2c2d-a01d-435d-8308-22fdc910cbf7

📥 Commits

Reviewing files that changed from the base of the PR and between ce401d7 and 77d13ac.

📒 Files selected for processing (2)
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_renormalize.cu

@jiahanc
Collaborator Author

jiahanc commented Mar 11, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !400 has been updated with latest changes, and the CI pipeline #45886576 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[CANCELING] Pipeline #45886576: canceled

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (5)
include/flashinfer/trtllm/common/cudaUtils.h (1)

272-277: ⚠️ Potential issue | 🟠 Major

Add CUDA error checks in getMultiProcessorCount.

Line 274 and Line 275 ignore CUDA return codes; on failure, count can be invalid and propagate bad launch configuration decisions.

🔧 Proposed fix
 inline int getMultiProcessorCount() {
   int device, count;
-  cudaGetDevice(&device);
-  cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device);
+  FLASHINFER_CHECK(cudaGetDevice(&device) == cudaSuccess,
+                   "CUDA error in cudaGetDevice");
+  FLASHINFER_CHECK(cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device) == cudaSuccess,
+                   "CUDA error in cudaDeviceGetAttribute");
   return count;
 }
#!/bin/bash
# Verify whether CUDA calls in getMultiProcessorCount are guarded.
rg -n -A8 -B2 'inline int getMultiProcessorCount\(\)' include/flashinfer/trtllm/common/cudaUtils.h
rg -n 'FLASHINFER_CHECK\(cudaGetDevice|FLASHINFER_CHECK\(cudaDeviceGetAttribute' include/flashinfer/trtllm/common/cudaUtils.h
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/flashinfer/trtllm/common/cudaUtils.h` around lines 272 - 277, The
function getMultiProcessorCount currently calls cudaGetDevice and
cudaDeviceGetAttribute without checking return values; wrap those calls with the
project's CUDA-check macro (e.g., FLASHINFER_CHECK or equivalent) or explicitly
check their cudaError_t results, log or handle failures, and return a safe
sentinel (e.g., 0 or -1) when an error occurs instead of returning an
uninitialized count; update references to cudaGetDevice and
cudaDeviceGetAttribute inside getMultiProcessorCount to use the chosen
error-checking pattern so callers never receive an invalid count.
csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu (1)

246-247: ⚠️ Potential issue | 🟠 Major

Guard mPtrPermutedIdxToExpandedIdx writes with isTokenRouted.

At Line 246, writing reverse-map entries for isLocalExpert without requiring routing selection can overwrite valid slots for subsequent routed tokens. Mirror the guard used by the other reverse-map write.

💡 Proposed fix
-      if (params.mPtrPermutedIdxToExpandedIdx != nullptr && isLocalExpert) {
+      if (params.mPtrPermutedIdxToExpandedIdx != nullptr && isLocalExpert &&
+          isTokenRouted) {
         params.mPtrPermutedIdxToExpandedIdx[permutedIdx] = tokenIdx;
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu` around
lines 246 - 247, The reverse-map write to params.mPtrPermutedIdxToExpandedIdx
guarded only by isLocalExpert can overwrite slots for tokens that were not
routed; update the guard to require both isLocalExpert and isTokenRouted (mirror
the other reverse-map write) before assigning
params.mPtrPermutedIdxToExpandedIdx[permutedIdx] = tokenIdx so only routed
tokens create reverse-map entries (use the same condition structure as the other
reverse-map write to locate and fix the assignment).
tests/moe/test_trtllm_gen_fused_moe.py (1)

2872-2892: ⚠️ Potential issue | 🟠 Major

Isolate this 2048-expert config into a targeted smoke test.

This config allocates ~5 GiB for weights alone (2048 experts × gated intermediate × hidden_size in bf16) before quantized copies. Since it participates in the full parametrized matrix with enable_autotune=True, it risks CI OOM/timeout on standard GPUs.

Consider:

  1. Move to a separate focused test with enable_autotune=False
  2. Limit to a single moe_impl and weight_processing combination
  3. Use the smallest num_tokens value


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/moe/test_trtllm_gen_fused_moe.py` around lines 2872 - 2892, This large
2048-expert pytest.param (id "RoutingRenormalize_large_experts") should be
pulled out of the full parametrized matrix and converted into a focused smoke
test to avoid CI OOMs: create a new dedicated test that uses this dict but set
"enable_autotune": False, reduce "compatible_moe_impls" to a single
implementation (pick one from the current list, e.g., BF16Moe), restrict any
"weight_processing" combinations to a single option, and set the test's input
size to the minimum by lowering "num_tokens" to the smallest allowed value; keep
the original "id" for traceability and ensure the new test is run separately
from the large parametrized suite.
csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu (1)

189-195: ⚠️ Potential issue | 🔴 Critical

Insert a global fence before completing the PDL launch.

Line 192 already calls out the problem: __syncthreads() only orders the CTA, but the dependent kernel consumes global metadata written above. Without a device/system fence before cudaTriggerProgrammaticLaunchCompletion(), the next kernel can observe stale mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrPermutedIdxSize, or mPtrNumNonExitingCtas.

Suggested fix
   if constexpr (KernelParams::UsePdl) {
+    __threadfence_system();
     cudaTriggerProgrammaticLaunchCompletion();
   }
According to the official NVIDIA CUDA documentation, does `cudaTriggerProgrammaticLaunchCompletion()` require an explicit `__threadfence_system()` (or another memory fence) to guarantee that prior global-memory writes are visible to the dependent kernel before programmatic launch completion?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu` around
lines 189 - 195, The programmatic launch completion lacks a system-wide memory
fence: when KernelParams::UsePdl is true, insert a device/system fence (e.g.,
call __threadfence_system() or equivalent) immediately before the existing
cudaTriggerProgrammaticLaunchCompletion() so that prior global writes to
mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrPermutedIdxSize,
mPtrNumNonExitingCtas, and mPtrTotalNumPaddedTokens are visible to the dependent
kernel; update the block guarded by if constexpr (KernelParams::UsePdl) to call
the global fence first, then call cudaTriggerProgrammaticLaunchCompletion().
csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu (1)

69-78: ⚠️ Potential issue | 🔴 Critical

Guard the optional expanded-index mapping write.

mPtrExpandedIdxToPermutedIdx is optional in this backend, but the expertIdx == -1 branch dereferences it unconditionally. A top-k-id caller that omits that mapping buffer will hit a null device pointer here.

Suggested fix
         if (expertIdx != -1) {
           int offset = warpIdx * MaxNumExperts + expertIdx;
           smemKIdx[offset] = static_cast<int8_t>(laneIdx);
-        } else {
+        } else if (params.mPtrExpandedIdxToPermutedIdx != nullptr) {
           params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu` around
lines 69 - 78, The branch handling expertIdx == -1 writes unconditionally to the
optional device pointer params.mPtrExpandedIdxToPermutedIdx, which can be null;
modify the code in the launchBlockKernel logic so you first test if
params.mPtrExpandedIdxToPermutedIdx != nullptr before performing the write
(params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = -1); if
the pointer is null simply skip that write (leave smemKIdx handling unchanged),
ensuring all accesses to params.mPtrExpandedIdxToPermutedIdx are guarded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchClusterKernel.cu`:
- Around line 25-27: The __launch_bounds__ macro in the
routingIndicesClusterKernel definition uses KernelParams::MaxNumExperts
unconditionally which can exceed CUDA's 1024-thread-per-block limit; change the
launch bound to cap MaxNumExperts at 1024 (i.e., use a conditional expression
like KernelParams::MaxNumExperts <= 1024 ? KernelParams::MaxNumExperts : 1024)
for the __launch_bounds__ on the routingIndicesClusterKernel function; also
apply the same conditional cap fix to the identical usage in launchCoopKernel.cu
to prevent compile/runtime failures when MaxNumExperts > 1024.

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu`:
- Around line 127-130: The isLocal predicate incorrectly checks localExpIdx <
params.mNumLocalExperts but must account for the strided extent when expert
interleaving is enabled; change the upper bound to (params.mNumLocalExperts <<
params.mLocalExpertsStrideLog2) (i.e., use the strided local-expert extent) so
that localExpIdx is compared against the true number of interleaved slots, and
update the same logic where it appears (the other occurrences around the blocks
referencing localExpIdx/isLocal at the later sites). Ensure you keep the
existing bitmask check for stride alignment ((localExpIdx & ((1 <<
params.mLocalExpertsStrideLog2) - 1)) == 0) and only replace the plain
mNumLocalExperts bound with the shifted value.

---

Duplicate comments:
In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu`:
- Around line 189-195: The programmatic launch completion lacks a system-wide
memory fence: when KernelParams::UsePdl is true, insert a device/system fence
(e.g., call __threadfence_system() or equivalent) immediately before the
existing cudaTriggerProgrammaticLaunchCompletion() so that prior global writes
to mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrPermutedIdxSize,
mPtrNumNonExitingCtas, and mPtrTotalNumPaddedTokens are visible to the dependent
kernel; update the block guarded by if constexpr (KernelParams::UsePdl) to call
the global fence first, then call cudaTriggerProgrammaticLaunchCompletion().

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu`:
- Around line 69-78: The branch handling expertIdx == -1 writes unconditionally
to the optional device pointer params.mPtrExpandedIdxToPermutedIdx, which can be
null; modify the code in the launchBlockKernel logic so you first test if
params.mPtrExpandedIdxToPermutedIdx != nullptr before performing the write
(params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = -1); if
the pointer is null simply skip that write (leave smemKIdx handling unchanged),
ensuring all accesses to params.mPtrExpandedIdxToPermutedIdx are guarded.

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu`:
- Around line 246-247: The reverse-map write to
params.mPtrPermutedIdxToExpandedIdx guarded only by isLocalExpert can overwrite
slots for tokens that were not routed; update the guard to require both
isLocalExpert and isTokenRouted (mirror the other reverse-map write) before
assigning params.mPtrPermutedIdxToExpandedIdx[permutedIdx] = tokenIdx so only
routed tokens create reverse-map entries (use the same condition structure as
the other reverse-map write to locate and fix the assignment).

In `@include/flashinfer/trtllm/common/cudaUtils.h`:
- Around line 272-277: The function getMultiProcessorCount currently calls
cudaGetDevice and cudaDeviceGetAttribute without checking return values; wrap
those calls with the project's CUDA-check macro (e.g., FLASHINFER_CHECK or
equivalent) or explicitly check their cudaError_t results, log or handle
failures, and return a safe sentinel (e.g., 0 or -1) when an error occurs
instead of returning an uninitialized count; update references to cudaGetDevice
and cudaDeviceGetAttribute inside getMultiProcessorCount to use the chosen
error-checking pattern so callers never receive an invalid count.

In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 2872-2892: This large 2048-expert pytest.param (id
"RoutingRenormalize_large_experts") should be pulled out of the full
parametrized matrix and converted into a focused smoke test to avoid CI OOMs:
create a new dedicated test that uses this dict but set "enable_autotune":
False, reduce "compatible_moe_impls" to a single implementation (pick one from
the current list, e.g., BF16Moe), restrict any "weight_processing" combinations
to a single option, and set the test's input size to the minimum by lowering
"num_tokens" to the smallest allowed value; keep the original "id" for
traceability and ensure the new test is run separately from the large
parametrized suite.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: df6f5495-6b6d-40ac-97d5-297d77567843

📥 Commits

Reviewing files that changed from the base of the PR and between eb2c0e3 and eea25a1.

📒 Files selected for processing (32)
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchClusterKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchHistogramKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchInitExpertCounts.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchMainKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchOffsetsKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchClusterKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramScoresKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchInitExpertCounts.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchOffsetsKernel.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_dev_kernel.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_renormalize.cu
  • csrc/moe_utils_binding.cu
  • csrc/trtllm_fused_moe_kernel_launcher.cu
  • csrc/trtllm_fused_moe_routing_deepseek.cu
  • csrc/trtllm_fused_moe_routing_renormalize.cu
  • csrc/trtllm_fused_moe_runner.cu
  • flashinfer/jit/cpp_ext.py
  • flashinfer/jit/fused_moe.py
  • flashinfer/jit/moe_utils.py
  • include/flashinfer/trtllm/common/cudaUtils.h
  • include/flashinfer/trtllm/fused_moe/DevKernel.h
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.h
  • include/flashinfer/trtllm/fused_moe/RoutingKernelTopK.cuh
  • tests/moe/test_trtllm_gen_fused_moe.py
💤 Files with no reviewable changes (3)
  • csrc/moe_utils_binding.cu
  • csrc/trtllm_fused_moe_routing_renormalize.cu
  • csrc/trtllm_fused_moe_routing_deepseek.cu
🚧 Files skipped from review as they are similar to previous changes (7)
  • csrc/trtllm_fused_moe_kernel_launcher.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramScoresKernel.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_dev_kernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramKernel.cu
  • csrc/trtllm_fused_moe_runner.cu
  • flashinfer/jit/cpp_ext.py
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu

Comment on lines +127 to +130
int expert = threadIdx.x * ExpertsPerThread + e;
auto localExpIdx = expert - params.mLocalExpertsStartIdx;
auto isLocal = localExpIdx >= 0 && localExpIdx < params.mNumLocalExperts &&
(localExpIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;
Contributor

⚠️ Potential issue | 🔴 Critical

Use the strided local-expert extent here.

These predicates use localExpIdx < params.mNumLocalExperts, but the other routing paths already treat the local range as mNumLocalExperts << mLocalExpertsStrideLog2. With expert interleaving enabled, this block path will misclassify valid local experts as remote and produce wrong counts, CTA mappings, and permuted indices.

Suggested fix
+  auto localExpertExtent = params.mNumLocalExperts << params.mLocalExpertsStrideLog2;
   int accExpertCount[ExpertsPerThread];
 #pragma unroll
   for (int e = 0; e < ExpertsPerThread; e++) {
     int expert = threadIdx.x * ExpertsPerThread + e;
     auto localExpIdx = expert - params.mLocalExpertsStartIdx;
-    auto isLocal = localExpIdx >= 0 && localExpIdx < params.mNumLocalExperts &&
+    auto isLocal = localExpIdx >= 0 && localExpIdx < localExpertExtent &&
                    (localExpIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;
@@
     int expert = threadIdx.x * ExpertsPerThread + e;
     auto localExpIdx = expert - params.mLocalExpertsStartIdx;
-    auto isLocal = localExpIdx >= 0 && localExpIdx < params.mNumLocalExperts &&
+    auto isLocal = localExpIdx >= 0 && localExpIdx < localExpertExtent &&
                    (localExpIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;
@@
         if (smemKIdx[offset] >= 0) {
           auto localExpIdx = expert - params.mLocalExpertsStartIdx;
-          auto isLocal = localExpIdx >= 0 && localExpIdx < params.mNumLocalExperts &&
+          auto isLocal = localExpIdx >= 0 && localExpIdx < localExpertExtent &&
                          (localExpIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;

Also applies to: 183-185, 233-235

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu` around
lines 127 - 130, The isLocal predicate incorrectly checks localExpIdx <
params.mNumLocalExperts but must account for the strided extent when expert
interleaving is enabled; change the upper bound to (params.mNumLocalExperts <<
params.mLocalExpertsStrideLog2) (i.e., use the strided local-expert extent) so
that localExpIdx is compared against the true number of interleaved slots, and
update the same logic where it appears (the other occurrences around the blocks
referencing localExpIdx/isLocal at the later sites). Ensure you keep the
existing bitmask check for stride alignment ((localExpIdx & ((1 <<
params.mLocalExpertsStrideLog2) - 1)) == 0) and only replace the plain
mNumLocalExperts bound with the shifted value.

@jiahanc
Collaborator Author

jiahanc commented Mar 16, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !400 has been updated with latest changes, and the CI pipeline #46225337 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #46225337: 9/20 passed

@jiahanc jiahanc enabled auto-merge (squash) March 17, 2026 01:36
Collaborator

@IwakuraRein IwakuraRein left a comment


LGTM

jiahanc added 3 commits March 16, 2026 21:00
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
@jiahanc jiahanc merged commit abf080a into flashinfer-ai:main Mar 17, 2026
29 checks passed
frankwang28 pushed a commit to frankwang28/flashinfer that referenced this pull request Mar 18, 2026
<!-- .github/pull_request_template.md -->

## 📌 Description


- Integrate NVIDIA/TensorRT-LLM#11510 to support
2048 num of experts and 32 TopK in renormalize
- Refactor MOE cu files

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Expanded MoE routing/renormalize to support up to 2,048 experts and
top-k up to 32; backend reorganized to enable larger configurations.

* **Bug Fixes**
* Clamped token counts in kernel launches to prevent oversized grid
launches.

* **Performance**
* Reworked routing/launch paths for improved scalability and throughput
with large expert/top-k settings.

* **Tests**
* Added test scenarios covering large-expert (2,048) + top-k (32)
configurations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
ameynaik-hub pushed a commit to ameynaik-hub/flashinfer that referenced this pull request Mar 18, 2026
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>
4 participants