
[feat] Add 2048 experts and 32 Top K #2744

Merged
jiahanc merged 4 commits into flashinfer-ai:main from jiahanc:add2048Experts
Mar 17, 2026

Conversation


@jiahanc jiahanc commented Mar 10, 2026

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or via another preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Expanded MoE routing/renormalize to support up to 2,048 experts and top-k up to 32; backend reorganized to enable larger configurations.
  • Bug Fixes

    • Clamped token counts in kernel launches to prevent oversized grid launches.
  • Performance

    • Reworked routing/launch paths for improved scalability and throughput with large expert/top-k settings.
  • Tests

    • Added test scenarios covering large-expert (2,048) + top-k (32) configurations.
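The grid-clamp bug fix above can be sketched as follows. This is an illustrative Python model of the launch-sizing logic (the real code is CUDA C++ in csrc/trtllm_fused_moe_kernel_launcher.cu); `MAX_GRID_TOKENS` and `plan_token_grid` are hypothetical names, and only the min(8192, numTokens) clamp itself comes from the PR.

```python
# Illustrative model of clamping the token grid dimension. MAX_GRID_TOKENS and
# plan_token_grid are hypothetical names; the actual change lives in
# csrc/trtllm_fused_moe_kernel_launcher.cu as min(8192, numTokens).
MAX_GRID_TOKENS = 8192

def plan_token_grid(num_tokens):
    """Return (grid_dim, iterations_per_block).

    Instead of launching one block per token (which produces oversized grids
    for very large batches), the grid is clamped and each block covers the
    remaining tokens with a grid-stride loop.
    """
    grid_dim = min(MAX_GRID_TOKENS, num_tokens)
    # Each block processes tokens blockIdx, blockIdx + grid_dim, ... so it
    # needs ceil(num_tokens / grid_dim) iterations to cover all tokens.
    iters_per_block = (num_tokens + grid_dim - 1) // grid_dim
    return grid_dim, iters_per_block
```

With this scheme the grid never exceeds 8192 blocks regardless of batch size, while small batches still get one block per token.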


coderabbitai bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Split monolithic MoE routing into modular DeepSeek and Renormalize backends: added common headers, compile-time dispatch macros, multiple kernel launchers and run() orchestrators; updated KernelParams/top-K support, DevKernel macros, build/JIT source lists, and tests; removed legacy single-file routing implementations.

Changes

Cohort / File(s) Summary
DeepSeek — Common & Launchers
csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh, csrc/fused_moe/trtllm_backend/routingDeepSeek/launch*.cu
New DeepSeek constants, getMaxNumExperts, LAUNCH_DEEPSEEK/LAUNCH_ROUTING_DEEPSEEK dispatch macros; split kernel launchers (main, cluster, coop, histogram, init counts, offsets); SM90+ guards and kernel implementations.
DeepSeek — Orchestrator
csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
New run() entry point and forward declarations that validate inputs and dispatch to DeepSeek launchers (replaces deleted monolithic implementation).
Renormalize — Common & Launchers
csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh, csrc/fused_moe/trtllm_backend/routingRenormalize/launch*.cu
New Renormalize top‑K utilities, getMaxNumExperts, LAUNCH_ROUTING_RENORMALIZE dispatch macro; split launchers (block, cluster, histogram, histogram‑scores, init counts, offsets) and kernels.
Renormalize — Orchestrator
csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_renormalize.cu
New run() orchestrator that validates, selects single/multi‑kernel flows, and invokes Renormalize launchers (replaces deleted monolith).
Removed monoliths
csrc/trtllm_fused_moe_routing_deepseek.cu, csrc/trtllm_fused_moe_routing_renormalize.cu
Deleted legacy single-file routing modules—functionality reimplemented as split backend files.
Kernel params & headers
include/flashinfer/trtllm/fused_moe/RoutingKernel.h, .../RoutingKernel.cuh, .../RoutingKernelTopK.cuh
Added MaxNumTopExperts template param and per-kernel layout changes; removed some Data dtype fields; extended TopK support to 32; added IsPowerOf2/Sort and revised reduceTopK semantics and per-block/thread grouping.
DevKernel macros & dispatch
include/flashinfer/trtllm/fused_moe/DevKernel.h
Wrapped LAUNCH_PDL in do/while; replaced legacy DeepSeek macros; added LAUNCH_ROUTING_WITH_NUM_EXPERTS_FORCE_FLOAT_INPUT; parameterized dispatch on numExperts/numTopExperts and forceFloatInput.
Orchestrator & runner tweaks
csrc/trtllm_fused_moe_kernel_launcher.cu, csrc/trtllm_fused_moe_routing_llama4.cu, csrc/fused_moe/trtllm_backend/trtllm_fused_moe_runner.cu
Clamped token grid dims to min(8192, numTokens); renamed NumExpertsLimit → MaxSupportedExperts; promoted runImpl → run in llama4; removed some DeepSeek dtype inits.
Build/JIT changes
flashinfer/jit/fused_moe.py, flashinfer/jit/moe_utils.py, flashinfer/jit/cpp_ext.py
Reorganized CUDA source lists to new backend-prefixed paths; split routing into multiple compilation units; obj naming adjusted to include parent-dir prefix.
Helpers & tests
include/flashinfer/trtllm/common/cudaUtils.h, tests/moe/test_trtllm_gen_fused_moe.py
Added getMultiProcessorCount() helper; relaxed FP8 Renormalize top‑K limit to 32; added large-experts test case (2048 experts, top_k=32) and removed prior assertion.
Bindings & runtime dtype init
csrc/moe_utils_binding.cu, csrc/trtllm_fused_moe_runner.cu
Removed explicit initialization of mDtypeBias and mDtypeScore for certain DeepSeek paths; remaining dtype init sets mDtypeExpW only.

Sequence Diagram(s)

sequenceDiagram
    rect rgba(100,150,240,0.5)
    participant Host as Host
    end
    rect rgba(120,200,150,0.5)
    participant Orch as Orchestrator (run)
    end
    rect rgba(240,180,120,0.5)
    participant Macro as Launcher / Dispatch Macro
    end
    rect rgba(200,120,220,0.5)
    participant GPU as GPU Kernels
    end

    Host->>Orch: call run(Data)
    Orch->>Orch: validate inputs & pick backend (DeepSeek / Renormalize)
    Orch->>Macro: select LAUNCH_ROUTING_* macro and specific launcher
    Macro->>GPU: configure & launch kernel(s) (main/cluster/coop/histogram/offsets)
    GPU-->>Macro: kernel(s) complete (optional PDL sync)
    Macro-->>Orch: device buffers updated
    Orch-->>Host: return / completion

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

op: moe

Suggested reviewers

  • yzh119
  • cyx-6
  • IwakuraRein
  • kahyunnam
  • jimmyzho
  • nv-yunzheq

Poem

🐰 I hopped through headers, split the files,
Launchers lined up in tidy piles.
Kernels wake and macros sing,
Experts routed—bounce and spring.
A carrot for each compiled mile!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 51.02%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: The title clearly identifies the main feature being added: support for 2048 experts and 32 Top K, which aligns with the PR's core objective of extending expert and top-k capacity limits.
  • Description check — ✅ Passed: The PR description provides sufficient context: it references the TensorRT-LLM integration, states the objectives (2048 experts and 32 TopK support, MOE refactoring), and confirms all checklist items are completed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@gemini-code-assist

Warning

Gemini is experiencing higher than usual traffic and was unable to create the summary. Please try again in a few hours by commenting /gemini summary.

@jiahanc jiahanc marked this pull request as ready for review March 11, 2026 03:26

jiahanc commented Mar 11, 2026

/bot run

@flashinfer-bot

GitLab MR !400 has been created, and the CI pipeline #45848569 is currently running. I'll report back once the pipeline job completes.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
include/flashinfer/trtllm/fused_moe/RoutingKernel.h (1)

279-303: ⚠️ Potential issue | 🟠 Major

Thread DoSoftmaxBeforeTopK into the renormalize packed-pointer type.

csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu derives packed entries as PackedScoreIdx<conditional_t<DoSoftmaxBeforeTopK, float, InputT>>, but this KernelParams still stores and casts mPtrTopKPacked as PackedScoreIdx<OutputT>*. When DoSoftmaxBeforeTopK == true and OutputT != float, the packed buffer layout no longer matches the kernel contract.

Suggested fix
 template <typename InputT_, typename OutputT_, int MaxNumExperts_, int MaxNumTopExperts_,
           bool DoSoftmaxBeforeTopK_, bool isPow2_, bool UsePdl_>
 struct KernelParams : public KernelParamsBase<InputT_, OutputT_, MaxNumExperts_, MaxNumTopExperts_,
                                               isPow2_, UsePdl_> {
   using InputT = InputT_;
   using OutputT = OutputT_;
+  using PackedT =
+      PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK_, float, InputT_>>;
 
   static constexpr bool DoSoftmaxBeforeTopK = DoSoftmaxBeforeTopK_;
 
-  PackedScoreIdx<OutputT>* mPtrTopKPacked = nullptr;
+  PackedT* mPtrTopKPacked = nullptr;
@@
-    params.mPtrTopKPacked = (PackedScoreIdx<OutputT>*)data.mPtrTopKPacked;
+    params.mPtrTopKPacked = static_cast<PackedT*>(data.mPtrTopKPacked);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.h` around lines 279 - 303,
KernelParams currently stores mPtrTopKPacked as PackedScoreIdx<OutputT>*, but
the launched kernel expects PackedScoreIdx<conditional_t<DoSoftmaxBeforeTopK,
float, InputT>>; change mPtrTopKPacked's type declaration in struct KernelParams
to use the same conditional type (e.g.,
PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK, float, OutputT>>* or, if
the kernel uses InputT, PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK,
float, InputT>>*), and update the assignment in KernelParams::setKernelParams to
cast data.mPtrTopKPacked to that same conditional pointer type so the stored
pointer type matches the kernel contract (refer to symbols KernelParams,
mPtrTopKPacked, PackedScoreIdx, and setKernelParams).
🧹 Nitpick comments (1)
flashinfer/jit/moe_utils.py (1)

79-92: Extract the DeepSeek routing manifest into a shared helper/constant.

This file list is now duplicated here and in flashinfer/jit/fused_moe.py. The next TensorRT-LLM sync is likely to touch one list and miss the other, which would make the two JIT modules drift out of sync.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/jit/moe_utils.py` around lines 79 - 92, The DeepSeek routing CUDA
source list is duplicated between moe_utils.py and fused_moe.py; extract that
list into a single shared constant (e.g., DEEPSEEK_ROUTING_SOURCES or
DEEPSEEK_ROUTING_MANIFEST) in a common module (create a small helper or export
it from moe_utils.py) and replace the inline lists in both jit modules with
imports of that constant so both trtllm_fused_moe routing lists reference the
single source of truth.
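The deduplication suggested above can be sketched like this. The constant name follows the reviewer's own suggestion (DEEPSEEK_ROUTING_SOURCES); the file list shown is abbreviated from the launcher files named elsewhere in this review, and the module layout is an assumption, not the PR's actual code.

```python
# Sketch of the single-source-of-truth refactor suggested by the review.
# DEEPSEEK_ROUTING_SOURCES is the reviewer's suggested constant name; the
# exact manifest and home module are illustrative assumptions.
from pathlib import Path

BACKEND_DIR = Path("csrc/fused_moe/trtllm_backend/routingDeepSeek")

DEEPSEEK_ROUTING_SOURCES = [
    BACKEND_DIR / name
    for name in (
        "launchMainKernel.cu",
        "launchClusterKernel.cu",
        "launchCoopKernel.cu",
        "launchHistogramKernel.cu",
        "launchInitExpertCounts.cu",
        "launchOffsetsKernel.cu",
    )
]

# Both JIT modules would then import this list instead of repeating it:
#   from flashinfer.jit.moe_utils import DEEPSEEK_ROUTING_SOURCES
```

A TensorRT-LLM sync then only has to touch one list, so moe_utils.py and fused_moe.py cannot drift apart.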
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu`:
- Around line 189-195: Before calling cudaTriggerProgrammaticLaunchCompletion()
when KernelParams::UsePdl is true, ensure a system-wide memory fence so writes
to mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrNumNonExitingCtas,
mPtrTotalNumPaddedTokens and other global metadata are visible to the
subsequently-launched kernel; add an appropriate global fence (e.g.,
__threadfence_system() or your platform's equivalent) after the existing
__syncthreads() and before invoking cudaTriggerProgrammaticLaunchCompletion(),
keeping the conditional on KernelParams::UsePdl and preserving existing comments
and ordering.

In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh`:
- Around line 39-52: getMaxNumExperts currently returns 0 on unsupported inputs
which allows callers like LAUNCH_ROUTING_DEEPSEEK to divide-by-zero or silently
skip launches; change getMaxNumExperts to fail fast instead of returning 0 by
logging the error and aborting (or throwing/asserting) so an invalid numExperts
causes a deterministic crash, and apply the same change to the analogous
function in the same file (the similar block around lines 72-88) so neither
function ever returns 0 for unsupported expert counts.

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu`:
- Around line 69-78: The code writes to params.mPtrExpandedIdxToPermutedIdx
unconditionally in the expertIdx == -1 branch; guard that write with a
null-check so we only assign when the optional mapping was provided. Locate the
block handling params.mPtrTopKIds / expertIdx (within the warpIdx/laneIdx loop
using MaxNumExperts and smemKIdx) and wrap the
params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = -1
assignment inside an if (params.mPtrExpandedIdxToPermutedIdx != nullptr) check
(keeping the same index calculation).

In
`@csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh`:
- Around line 99-126: The macros currently treat any mTopK > NumTop16Experts as
MaxSupportedTopExperts, which causes mTopK > MaxSupportedTopExperts (e.g., 33+)
to fall through and index out-of-bounds; update LAUNCH_ROUTING_WITH_TOPK to
explicitly handle the 32-wide case and reject larger values: add an else-if
branch checking if data.mTopK <= NumTop32Experts and call
LAUNCH_ROUTING_WITH_NUM_EXPERTS with NumTop32Experts, and replace the final else
with an error/log and no-launch path for unsupported data.mTopK (referencing
data.mTopK, LAUNCH_ROUTING_WITH_TOPK, LAUNCH_ROUTING_WITH_NUM_EXPERTS,
NumTop8Experts, NumTop16Experts, NumTop32Experts, MaxSupportedTopExperts).

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu`:
- Around line 59-60: The code caches the SM count in a static variable (smCount)
which uses tensorrt_llm::common::getMultiProcessorCount() that is
device-specific; remove the static so the SM count is retrieved per-call (e.g.,
make smCount a non-static local variable where numBlocksCoop is computed) and
recompute numBlocksCoop from that per-device value (also guard numBlocksCoop to
not go negative if needed) to avoid pinning launch size to the first device that
hits this code.
- Around line 47-72: Validate data.mTopK before using it in the maxTokensCoop
calculation and ensure mTopK is within valid bounds relative to mNumExperts and
constants: check data.mTopK > 0 and data.mTopK <= data.mNumExperts and
data.mTopK <= MaxSupportedTopExperts early (before computing int const
maxTokensCoop = ... / data.mTopK), and reorder/adjust the later guards involving
data.mPtrTopKIds, data.mNumExperts, MaxSupportedTopExperts, and
MaxSupportedExpertCount so that the precomputed-TopK path cannot execute with
mTopK == 0 or mTopK > mNumExperts; reference the symbols data.mTopK,
data.mNumExperts, data.mPtrTopKIds, maxTokensCoop, MaxSupportedTopExperts, and
MaxSupportedExpertCount.

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu`:
- Around line 242-247: The code writes
params.mPtrPermutedIdxToExpandedIdx[permutedIdx] without checking isTokenRouted,
which allows writing invalid slots for local-but-unselected experts; update the
condition on the mPtrPermutedIdxToExpandedIdx write to mirror the other
reverse-map guard by requiring both isLocalExpert and isTokenRouted (i.e., only
write params.mPtrPermutedIdxToExpandedIdx[permutedIdx] when isTokenRouted is
true), keeping the existing use of permutedIdx, tokenIdx, finalExpertOffset and
localOffsetToken to compute the index.

In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 870-871: The runtime check using TVM_FFI_ICHECK currently enforces
args->top_k <= 32 but the error string still says "top_k<=10", so update the
diagnostic to match the enforced limit: change the message string in the
TVM_FFI_ICHECK call that references args->top_k to say "top_k<=32 && top_k>0"
(keep the same surrounding text "Current routing kernel (no groups, renormalize)
only supports ..."). Ensure you edit the string associated with the
TVM_FFI_ICHECK invocation so runtime diagnostics accurately reflect the new
limit.

In `@flashinfer/jit/fused_moe.py`:
- Around line 264-297: The sources list in fused_moe.py contains duplicate
basenames (e.g., launchClusterKernel.cu, launchHistogramKernel.cu,
launchInitExpertCounts.cu, launchOffsetsKernel.cu) that will collide when
JitSpec.get_object_paths() and generate_ninja_build_for_op() convert sources to
object names using source.with_suffix(object_suffix).name; fix by either (A)
deduplicating/renaming the entries in the sources list in fused_moe.py to ensure
unique basenames per compilation unit (move into uniquely named files or prefix
filenames by directory), or (B) change the object-naming logic in
JitSpec.get_object_paths() / generate_ninja_build_for_op() to include path
components or a short hash (e.g., incorporate source.parent.name or a hash of
source) when constructing the object filename to disambiguate
otherwise-identical basenames.
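Option (B) above can be sketched as a small naming helper. The change summary for flashinfer/jit/cpp_ext.py says object naming was adjusted to include a parent-dir prefix, so this sketch models that approach; `obj_name_for` is a hypothetical helper name, not the function in the PR.

```python
# Sketch of disambiguating object names for sources that share a basename
# (e.g. launchClusterKernel.cu exists under both routingDeepSeek/ and
# routingRenormalize/): fold the parent directory into the object filename.
# obj_name_for is a hypothetical name for illustration.
from pathlib import Path

def obj_name_for(source, object_suffix=".o"):
    # "routingDeepSeek/launchClusterKernel.cu"
    #   -> "routingDeepSeek_launchClusterKernel.o"
    source = Path(source)
    return f"{source.parent.name}_{source.stem}{object_suffix}"
```

Because the parent directory participates in the name, the two identically named compilation units no longer collide in the ninja build.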

In `@include/flashinfer/trtllm/common/cudaUtils.h`:
- Around line 272-277: The getMultiProcessorCount function lacks CUDA error
checking and can return garbage; wrap the cudaGetDevice and
cudaDeviceGetAttribute calls with the same error-checking macro used elsewhere
(e.g., FLASHINFER_CHECK) or check their cudaError_t return values and handle
failures before returning count. Update the function (getMultiProcessorCount) to
call FLASHINFER_CHECK(cudaGetDevice(...)) and
FLASHINFER_CHECK(cudaDeviceGetAttribute(..., cudaDevAttrMultiProcessorCount,
device)) (or equivalent error handling) so failures are logged/handled and a
safe default or abort is used instead of returning an unchecked count.

In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh`:
- Around line 440-453: The code paths that set idx only when mPtrTopKIds or
mPtrTopKWeights are present leave idx uninitialized when mPtrTopKIds==nullptr
and mPtrTopKWeights==nullptr; always decode params.mPtrTopKPacked[expandedIdx]
into scoreIdx and assign idx = scoreIdx.idx (and set mPtrTopKWeights if present)
before computing localExpertIdx/isLocalExpert so
atomicAdd(&smemExpertCount[idx], 1) uses a valid index; update the branch around
params.mPtrTopKIds/mPtrTopKWeights to unconditionally read params.mPtrTopKPacked
and populate idx first, then conditionally write params.mPtrTopKWeights.

In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 2872-2892: This 2048-expert config (num_experts=2048, top_k=32,
RoutingMethodType.Renormalize) must be removed from the large parametrized
matrix and isolated into a small targeted smoke test with autotune disabled:
create a new focused test that uses the same dict but sets enable_autotune=False
(and a distinct id like "RoutingRenormalize_large_experts_smoke"), and remove or
replace the pytest.param entry from the full matrix so the heavy case no longer
participates in the combinatorial num_tokens × zero_hidden_states × moe_impl ×
weight_processing × activation_type runs.
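The isolated smoke test suggested above could take roughly this shape. The dict keys mirror the parametrization described in tests/moe/test_trtllm_gen_fused_moe.py, but `run_moe_test` and `enable_autotune` are assumed helper names, and the pytest wiring is shown only in comments; treat this as a shape sketch, not a drop-in test.

```python
# Sketch of the focused smoke-test configuration recommended by the review:
# one fixed 2048-expert / top_k=32 case with autotune disabled, instead of
# the full combinatorial matrix. Helper names below are hypothetical.
LARGE_EXPERT_SMOKE = {
    "num_experts": 2048,
    "top_k": 32,
    "routing_method": "Renormalize",  # RoutingMethodType.Renormalize in the real test
}

# In the test module this would become a single targeted case, e.g.:
#   @pytest.mark.parametrize("cfg", [pytest.param(
#       LARGE_EXPERT_SMOKE, id="RoutingRenormalize_large_experts_smoke")])
#   def test_large_experts_smoke(cfg):
#       run_moe_test(cfg, enable_autotune=False)  # hypothetical helper
```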

---

Outside diff comments:
In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.h`:
- Around line 279-303: KernelParams currently stores mPtrTopKPacked as
PackedScoreIdx<OutputT>*, but the launched kernel expects
PackedScoreIdx<conditional_t<DoSoftmaxBeforeTopK, float, InputT>>; change
mPtrTopKPacked's type declaration in struct KernelParams to use the same
conditional type (e.g., PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK,
float, OutputT>>* or, if the kernel uses InputT,
PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK, float, InputT>>*), and
update the assignment in KernelParams::setKernelParams to cast
data.mPtrTopKPacked to that same conditional pointer type so the stored pointer
type matches the kernel contract (refer to symbols KernelParams, mPtrTopKPacked,
PackedScoreIdx, and setKernelParams).

---

Nitpick comments:
In `@flashinfer/jit/moe_utils.py`:
- Around line 79-92: The DeepSeek routing CUDA source list is duplicated between
moe_utils.py and fused_moe.py; extract that list into a single shared constant
(e.g., DEEPSEEK_ROUTING_SOURCES or DEEPSEEK_ROUTING_MANIFEST) in a common module
(create a small helper or export it from moe_utils.py) and replace the inline
lists in both jit modules with imports of that constant so both trtllm_fused_moe
routing lists reference the single source of truth.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3886ca25-8b46-4874-8000-8b83bfc701bc

📥 Commits

Reviewing files that changed from the base of the PR and between fe06b91 and cdf67c7.

📒 Files selected for processing (31)
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchClusterKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchHistogramKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchInitExpertCounts.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchMainKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchOffsetsKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchClusterKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramScoresKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchInitExpertCounts.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchOffsetsKernel.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_dev_kernel.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_renormalize.cu
  • csrc/trtllm_fused_moe_kernel_launcher.cu
  • csrc/trtllm_fused_moe_routing_deepseek.cu
  • csrc/trtllm_fused_moe_routing_renormalize.cu
  • csrc/trtllm_fused_moe_runner.cu
  • flashinfer/jit/cpp_ext.py
  • flashinfer/jit/fused_moe.py
  • flashinfer/jit/moe_utils.py
  • include/flashinfer/trtllm/common/cudaUtils.h
  • include/flashinfer/trtllm/fused_moe/DevKernel.h
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.h
  • include/flashinfer/trtllm/fused_moe/RoutingKernelTopK.cuh
  • tests/moe/test_trtllm_gen_fused_moe.py
💤 Files with no reviewable changes (2)
  • csrc/trtllm_fused_moe_routing_renormalize.cu
  • csrc/trtllm_fused_moe_routing_deepseek.cu

Comment on lines +189 to +195
// trigger the secondary kernel when using PDL
// We can't do it earlier because FC1 depends on the mPtrCtaIdxXyToBatchIdx,
// mPtrCtaIdxXyToMnLimit, mPtrNumNonExitingCtas and mPtrTotalNumPaddedTokens
// TODO: this is not sufficient to ensure visibility in the next kernel!
if constexpr (KernelParams::UsePdl) {
cudaTriggerProgrammaticLaunchCompletion();
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Locate and examine the file
fd "launchCoopKernel.cu" --type f

Repository: flashinfer-ai/flashinfer

Length of output: 133


🏁 Script executed:

# If found, read the relevant section with context
cat -n csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu | head -250 | tail -100

Repository: flashinfer-ai/flashinfer

Length of output: 5119


🏁 Script executed:

# Search for PDL usage patterns in the codebase to understand the context
rg "cudaTriggerProgrammaticLaunchCompletion|cudaGridDependencySynchronize" -B 3 -A 3

Repository: flashinfer-ai/flashinfer

Length of output: 41002


🏁 Script executed:

# Search for cooperative groups usage and grid.sync patterns
rg "grid\.sync|__threadfence|cg::sync" --type cu --type h -B 2 -A 2

Repository: flashinfer-ai/flashinfer

Length of output: 94


🏁 Script executed:

# Check kernel launch configuration and block/grid structure in this file
rg "dim3|blockDim|gridDim" csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu -B 2 -A 2

Repository: flashinfer-ai/flashinfer

Length of output: 283


🏁 Script executed:

# Search for where FC1 or subsequent kernels use cudaGridDependencySynchronize
rg "cudaGridDependencySynchronize" csrc/fused_moe/trtllm_backend/routingDeepSeek/ -B 5 -A 5

Repository: flashinfer-ai/flashinfer

Length of output: 3524


🏁 Script executed:

# Check what kernel is launched after routingIndicesCoopKernel and if it synchronizes
rg "LAUNCH_ROUTING_DEEPSEEK|routingIndicesCoopKernel" csrc/fused_moe/trtllm_backend/routingDeepSeek/ -B 3 -A 8

Repository: flashinfer-ai/flashinfer

Length of output: 12940


🏁 Script executed:

# Look for mPtrCtaIdxXyToBatchIdx and mPtrCtaIdxXyToMnLimit to understand memory scope
rg "mPtrCtaIdxXyToBatchIdx|mPtrCtaIdxXyToMnLimit" csrc/ -B 2 -A 2 | head -80

Repository: flashinfer-ai/flashinfer

Length of output: 6867


🏁 Script executed:

# Check if __threadfence or __threadfence_system is used anywhere in PDL contexts
rg "__threadfence" csrc/fused_moe/trtllm_backend/ -B 3 -A 3

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check if kernel launches with single or multiple blocks
rg "launchCoopKernel|numBlocksCoop" csrc/ --type cu -B 2 -A 2 | grep -A 5 "numBlocksCoop"

Repository: flashinfer-ai/flashinfer

Length of output: 94


🏁 Script executed:

# Verify CUDA PDL semantics in comments/docs within repo
rg "PDL|Programmatic Dependent Launch" include/ --type h -B 2 -A 2 | head -60

Repository: flashinfer-ai/flashinfer

Length of output: 7785


Add explicit memory ordering before PDL trigger to ensure metadata visibility.

Line 192 documents the issue: the current code does not guarantee visibility of mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, and scalar outputs to the next kernel before cudaTriggerProgrammaticLaunchCompletion() releases it.

The block-level __syncthreads() at line 187 is insufficient for global device memory. A system-wide fence is required before signaling PDL completion to ensure all threads' writes to global metadata are visible before the next kernel launches.

🛠️ Proposed fix
   // TODO: this is not sufficient to ensure visibility in the next kernel!
   if constexpr (KernelParams::UsePdl) {
+    __threadfence_system();
     cudaTriggerProgrammaticLaunchCompletion();
   }

Verify this aligns with your CUDA toolkit's PDL memory-ordering contract and whether __threadfence_system() or an alternative primitive (e.g., grid synchronization) is required for your target SM version.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu` around
lines 189 - 195, Before calling cudaTriggerProgrammaticLaunchCompletion() when
KernelParams::UsePdl is true, ensure a system-wide memory fence so writes to
mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrNumNonExitingCtas,
mPtrTotalNumPaddedTokens and other global metadata are visible to the
subsequently-launched kernel; add an appropriate global fence (e.g.,
__threadfence_system() or your platform's equivalent) after the existing
__syncthreads() and before invoking cudaTriggerProgrammaticLaunchCompletion(),
keeping the conditional on KernelParams::UsePdl and preserving existing comments
and ordering.

Comment on lines +39 to +52
int constexpr getMaxNumExperts(int32_t numExperts) {
if (numExperts <= topk::MaxNumExpertsUnit) {
return topk::MaxNumExpertsUnit;
} else if (numExperts <= NumDeepseekExperts) {
return NumDeepseekExperts;
} else if (numExperts <= NumKimiK2Experts) {
return NumKimiK2Experts;
} else if (numExperts <= NumNemotronExperts) {
return NumNemotronExperts;
} else {
TLLM_LOG_ERROR("Unsupported numExperts");
return 0;
}
}

⚠️ Potential issue | 🟠 Major

Fail fast on unsupported expert counts.

getMaxNumExperts() returns 0, and LAUNCH_ROUTING_DEEPSEEK only logs on the unsupported path. The new launchers use that value as numThreadsHist and as a divisor in grid sizing, so an out-of-range mNumExperts can degrade into divide-by-zero or a silently skipped launch rather than a deterministic failure.

Also applies to: 72-88

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh`
around lines 39 - 52, getMaxNumExperts currently returns 0 on unsupported inputs
which allows callers like LAUNCH_ROUTING_DEEPSEEK to divide-by-zero or silently
skip launches; change getMaxNumExperts to fail fast instead of returning 0 by
logging the error and aborting (or throwing/asserting) so an invalid numExperts
causes a deterministic crash, and apply the same change to the analogous
function in the same file (the similar block around lines 72-88) so neither
function ever returns 0 for unsupported expert counts.
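The fail-fast behavior recommended above can be modeled in a few lines. The tier constants are placeholders (the real values of MaxNumExpertsUnit / NumDeepseekExperts / NumKimiK2Experts / NumNemotronExperts live in RoutingDeepSeekCommon.cuh and are not given here); the point is only that an unsupported count raises instead of returning 0 into grid-sizing division.

```python
# Illustrative Python model of a fail-fast getMaxNumExperts. Tier values are
# placeholders, ascending; the real constants are defined in
# RoutingDeepSeekCommon.cuh.
EXPERT_TIERS = [128, 256, 384, 1024]

def get_max_num_experts(num_experts):
    for tier in EXPERT_TIERS:
        if num_experts <= tier:
            return tier
    # Fail fast: never return 0, which downstream launch math would use as
    # a divisor or thread count.
    raise ValueError(f"Unsupported numExperts: {num_experts}")
```

In the CUDA code the equivalent would be a TLLM_LOG_ERROR followed by an abort/assert rather than a `return 0;`.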

Comment on lines +69 to +78
if (params.mPtrTopKIds != nullptr) {
if (validToken) {
if (laneIdx < params.mTopK) {
auto expertIdx = params.mPtrTopKIds[warpIdx * params.mTopK + laneIdx];
if (expertIdx != -1) {
int offset = warpIdx * MaxNumExperts + expertIdx;
smemKIdx[offset] = static_cast<int8_t>(laneIdx);
} else {
params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
}

⚠️ Potential issue | 🔴 Critical

Guard the optional mPtrExpandedIdxToPermutedIdx write.

mPtrExpandedIdxToPermutedIdx is optional, but the expertIdx == -1 branch dereferences it unconditionally. A caller that passes precomputed mPtrTopKIds with padded -1 entries and does not request the expanded-index mapping will crash here.

Suggested fix
-        } else {
-          params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
+        } else if (params.mPtrExpandedIdxToPermutedIdx != nullptr) {
+          params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu` around
lines 69 - 78, The code writes to params.mPtrExpandedIdxToPermutedIdx
unconditionally in the expertIdx == -1 branch; guard that write with a
null-check so we only assign when the optional mapping was provided. Locate the
block handling params.mPtrTopKIds / expertIdx (within the warpIdx/laneIdx loop
using MaxNumExperts and smemKIdx) and wrap the
params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = -1
assignment inside an if (params.mPtrExpandedIdxToPermutedIdx != nullptr) check
(keeping the same index calculation).
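The null-guard the fix asks for is the usual optional-output pattern; a host-side sketch with hypothetical names:

```cpp
#include <cstdint>

// Optional-output pattern: only mark the padded slot when the caller
// actually provided the expanded->permuted mapping buffer.
void markPaddedSlot(int32_t* expandedIdxToPermutedIdx /* may be nullptr */,
                    int32_t topK, int32_t warpIdx, int32_t laneIdx) {
  if (expandedIdxToPermutedIdx != nullptr) {
    expandedIdxToPermutedIdx[warpIdx * topK + laneIdx] = -1;  // padded entry
  }
  // When the buffer is absent, the padded entry is simply skipped.
}
```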

Comment on lines +99 to +126
// Helper macro: dispatch on topK tier for a given numExperts tier.
#define LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1, numExperts) \
if (data.mTopK <= NumTop8Experts) { \
LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1, numExperts, NumTop8Experts); \
} else if (data.mTopK <= NumTop16Experts) { \
LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1, numExperts, NumTop16Experts); \
} else { \
LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1, numExperts, MaxSupportedTopExperts); \
}

#define LAUNCH_ROUTING_RENORMALIZE(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1) \
if (data.mNumExperts <= NumExperts128Experts) { \
LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, stream, \
extraFlag1, NumExperts128Experts); \
} else if (data.mNumExperts <= NumExperts512Experts) { \
LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, stream, \
extraFlag1, NumExperts512Experts); \
} else if (data.mNumExperts <= MaxSupportedExperts) { \
LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, stream, \
extraFlag1, MaxSupportedExperts); \
} else { \
TLLM_LOG_ERROR("Unsupported numExperts"); \
}

⚠️ Potential issue | 🔴 Critical

Reject mTopK > 32 instead of falling through to the 32-wide specialization.

The final else currently catches both the supported 17..32 range and unsupported larger values. When data.mTopK is 33+, the downstream renormalize kernels still index fixed-size [MaxNumTopExperts] buffers with laneIdx < params.mTopK, which will walk past those 32-entry arrays.

🛠️ Proposed fix
 #define LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize,    \
                                  stream, extraFlag1, numExperts)                               \
   if (data.mTopK <= NumTop8Experts) {                                                          \
     LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
                                     stream, extraFlag1, numExperts, NumTop8Experts);           \
   } else if (data.mTopK <= NumTop16Experts) {                                                  \
     LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
                                     stream, extraFlag1, numExperts, NumTop16Experts);          \
-  } else {                                                                                     \
+  } else if (data.mTopK <= MaxSupportedTopExperts) {                                           \
     LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
                                     stream, extraFlag1, numExperts, MaxSupportedTopExperts);   \
+  } else {                                                                                     \
+    TLLM_LOG_ERROR("Unsupported topK");                                                        \
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh`
around lines 99 - 126, The macros currently treat any mTopK > NumTop16Experts as
MaxSupportedTopExperts, which causes mTopK > MaxSupportedTopExperts (e.g., 33+)
to fall through and index out-of-bounds; update LAUNCH_ROUTING_WITH_TOPK to
explicitly handle the 32-wide case and reject larger values: add an else-if
branch checking if data.mTopK <= NumTop32Experts and call
LAUNCH_ROUTING_WITH_NUM_EXPERTS with NumTop32Experts, and replace the final else
with an error/log and no-launch path for unsupported data.mTopK (referencing
data.mTopK, LAUNCH_ROUTING_WITH_TOPK, LAUNCH_ROUTING_WITH_NUM_EXPERTS,
NumTop8Experts, NumTop16Experts, NumTop32Experts, MaxSupportedTopExperts).
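The tier selection the macro performs can be sketched as a plain function; constants are illustrative stand-ins for NumTop8Experts / NumTop16Experts / MaxSupportedTopExperts:

```cpp
#include <stdexcept>

// Illustrative tier widths mirroring the macro's specializations.
constexpr int kTop8 = 8;
constexpr int kTop16 = 16;
constexpr int kTopMax = 32;

// Maps a runtime topK to its specialization tier, rejecting values above
// the widest tier instead of silently clamping them to 32.
int selectTopKTier(int topK) {
  if (topK <= 0) throw std::invalid_argument("topK must be positive");
  if (topK <= kTop8) return kTop8;
  if (topK <= kTop16) return kTop16;
  if (topK <= kTopMax) return kTopMax;
  throw std::invalid_argument("Unsupported topK");
}
```

The macro version would put the rejection in the final else branch (log and skip the launch), which is exactly what the proposed diff does.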

Comment on lines +47 to +72
FLASHINFER_CHECK(!data.mUseRoutingSoftmax, "Routing with softmax not implemented yet");
int const numBlocks = data.mNumTokens;
int const numThreadsHist = getMaxNumExperts(data.mNumExperts);

bool const useSingleCluster = data.mNumTokens <= 1024;
if (!useSingleCluster) {
FLASHINFER_CHECK(data.mPtrExpertCounts != nullptr,
"When #tokens is large, `mPtrExpertCounts` is a required input.");
} else {
data.mPtrExpertCounts = nullptr;
}

static int const smCount = tensorrt_llm::common::getMultiProcessorCount();
int const numBlocksCoop = smCount - 8;

int const maxTokensCoop = (numBlocksCoop * numThreadsHist * 64) / data.mTopK;
if (data.mPtrTopKIds == nullptr) {
FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
"Routing kernel expects %d to be at most #experts %d", MaxSupportedTopExperts,
data.mNumExperts);
FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
"Routing kernel expects #experts %d <= %d", data.mNumExperts,
MaxSupportedExpertCount);
FLASHINFER_CHECK(data.mTopK <= MaxSupportedTopExperts,
"Routing kernel expects topK experts <= %d, got %d", MaxSupportedTopExperts,
data.mTopK);

⚠️ Potential issue | 🔴 Critical

Validate mTopK before using it, and compare it to mNumExperts.

Line 62 divides by mTopK before any bound check, and the later data.mNumExperts >= MaxSupportedTopExperts guard rejects valid smaller-expert configs while still leaving the precomputed-TopK path unvalidated. mTopK == 0 is UB here, and data.mPtrTopKIds != nullptr can still send topK > 32 or topK > mNumExperts into the downstream kernels.

Suggested fix
   FLASHINFER_CHECK(!data.mUseRoutingSoftmax, "Routing with softmax not implemented yet");
+  FLASHINFER_CHECK(data.mTopK > 0 && data.mTopK <= MaxSupportedTopExperts,
+                   "Routing kernel expects topK in [1, %d], got %d",
+                   MaxSupportedTopExperts, data.mTopK);
+  FLASHINFER_CHECK(data.mNumExperts >= data.mTopK,
+                   "Routing kernel expects topK %d <= `#experts` %d",
+                   data.mTopK, data.mNumExperts);
   int const numBlocks = data.mNumTokens;
   int const numThreadsHist = getMaxNumExperts(data.mNumExperts);
@@
-    FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
-                     "Routing kernel expects %d to be at most `#experts` %d", MaxSupportedTopExperts,
-                     data.mNumExperts);
     FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
                      "Routing kernel expects `#experts` %d <= %d", data.mNumExperts,
                      MaxSupportedExpertCount);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu` around
lines 47 - 72, Validate data.mTopK before using it in the maxTokensCoop
calculation and ensure mTopK is within valid bounds relative to mNumExperts and
constants: check data.mTopK > 0 and data.mTopK <= data.mNumExperts and
data.mTopK <= MaxSupportedTopExperts early (before computing int const
maxTokensCoop = ... / data.mTopK), and reorder/adjust the later guards involving
data.mPtrTopKIds, data.mNumExperts, MaxSupportedTopExperts, and
MaxSupportedExpertCount so that the precomputed-TopK path cannot execute with
mTopK == 0 or mTopK > mNumExperts; reference the symbols data.mTopK,
data.mNumExperts, data.mPtrTopKIds, maxTokensCoop, MaxSupportedTopExperts, and
MaxSupportedExpertCount.
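The validate-before-use ordering the fix calls for, as a minimal host-side sketch (function and parameter names are illustrative):

```cpp
#include <stdexcept>

constexpr int kMaxSupportedTopK = 32;

// Bound topK (and relate it to numExperts) before it becomes a divisor.
int computeMaxTokensCoop(int numBlocksCoop, int numThreadsHist,
                         int numExperts, int topK) {
  if (topK <= 0 || topK > kMaxSupportedTopK)
    throw std::invalid_argument("topK out of [1, 32]");
  if (topK > numExperts)
    throw std::invalid_argument("topK exceeds numExperts");
  // Safe: topK is known to be a positive divisor here.
  return (numBlocksCoop * numThreadsHist * 64) / topK;
}
```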

Comment on lines +870 to 871
TVM_FFI_ICHECK(args->top_k <= 32 && args->top_k > 0)
<< "Current routing kernel (no groups, renormalize) only supports top_k<=10 && top_k>0.";

⚠️ Potential issue | 🟡 Minor

Update the validation message to match the new limit.

Line 871 still says top_k<=10, but Line 870 now accepts top_k<=32. That will produce misleading runtime diagnostics for this path.

Proposed fix
-      TVM_FFI_ICHECK(args->top_k <= 32 && args->top_k > 0)
-          << "Current routing kernel (no groups, renormalize) only supports top_k<=10 && top_k>0.";
+      TVM_FFI_ICHECK(args->top_k <= 32 && args->top_k > 0)
+          << "Current routing kernel (no groups, renormalize) only supports top_k<=32 && top_k>0.";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/trtllm_fused_moe_kernel_launcher.cu` around lines 870 - 871, The runtime
check using TVM_FFI_ICHECK currently enforces args->top_k <= 32 but the error
string still says "top_k<=10", so update the diagnostic to match the enforced
limit: change the message string in the TVM_FFI_ICHECK call that references
args->top_k to say "top_k<=32 && top_k>0" (keep the same surrounding text
"Current routing kernel (no groups, renormalize) only supports ..."). Ensure you
edit the string associated with the TVM_FFI_ICHECK invocation so runtime
diagnostics accurately reflect the new limit.

Comment on lines +272 to +277
inline int getMultiProcessorCount() {
int device, count;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device);
return count;
}

⚠️ Potential issue | 🟠 Major

Missing error handling for CUDA API calls.

The function does not check the return values of cudaGetDevice and cudaDeviceGetAttribute. This is inconsistent with the error handling pattern used elsewhere in this file (e.g., isCapturing at Line 115-116 uses FLASHINFER_CHECK). If either call fails, count may contain garbage, leading to incorrect kernel launch configurations.

🛡️ Proposed fix to add error checking
 inline int getMultiProcessorCount() {
   int device, count;
-  cudaGetDevice(&device);
-  cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device);
+  FLASHINFER_CHECK(cudaGetDevice(&device) == cudaSuccess,
+                   "CUDA error in cudaGetDevice");
+  FLASHINFER_CHECK(cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device) == cudaSuccess,
+                   "CUDA error in cudaDeviceGetAttribute");
   return count;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/flashinfer/trtllm/common/cudaUtils.h` around lines 272 - 277, The
getMultiProcessorCount function lacks CUDA error checking and can return
garbage; wrap the cudaGetDevice and cudaDeviceGetAttribute calls with the same
error-checking macro used elsewhere (e.g., FLASHINFER_CHECK) or check their
cudaError_t return values and handle failures before returning count. Update the
function (getMultiProcessorCount) to call FLASHINFER_CHECK(cudaGetDevice(...))
and FLASHINFER_CHECK(cudaDeviceGetAttribute(..., cudaDevAttrMultiProcessorCount,
device)) (or equivalent error handling) so failures are logged/handled and a
safe default or abort is used instead of returning an unchecked count.
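An off-GPU sketch of the guarded-query pattern. The CHECK_OK macro is a hypothetical stand-in for FLASHINFER_CHECK, and the callbacks stand in for cudaGetDevice / cudaDeviceGetAttribute returning success or failure:

```cpp
#include <stdexcept>

// Hypothetical stand-in for FLASHINFER_CHECK: fail fast on a false condition.
#define CHECK_OK(cond, msg) \
  do {                      \
    if (!(cond)) throw std::runtime_error(msg); \
  } while (0)

// Guarded query: count is only returned when both status checks passed,
// so garbage can never propagate into launch configuration.
int getCountChecked(bool (*getDevice)(int*), bool (*getAttr)(int*, int)) {
  int device = 0, count = 0;
  CHECK_OK(getDevice(&device), "cudaGetDevice failed");
  CHECK_OK(getAttr(&count, device), "cudaDeviceGetAttribute failed");
  return count;
}
```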

Comment on lines 440 to 453
if (params.mPtrTopKIds != nullptr) {
idx = params.mPtrTopKIds[expandedIdx];
} else {
// If params.mPtrTopKIds != nullptr, we don't need to store the weights
if (params.mPtrTopKWeights != nullptr) {
scoreIdx = params.mPtrTopKPacked[expandedIdx];
idx = scoreIdx.idx;
params.mPtrTopKWeights[expandedIdx] = static_cast<OutputT>(scoreIdx.score);
}
}
// check whether this expert is local to our GPU at all and ignore if not
auto localExpertIdx = idx - params.mLocalExpertsStartIdx;
auto isLocalExpert = localExpertIdx >= 0 && localExpertIdx < localExpertExtent &&
-                    (localExpertIdx & params.mLocalExpertsStrideLog2) == 0;
+                    (localExpertIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;
if (isLocalExpert) {
atomicAdd(&smemExpertCount[idx], 1);

⚠️ Potential issue | 🔴 Critical

Always decode mPtrTopKPacked when mPtrTopKIds is null.

idx is needed for histogramming regardless of whether a standalone weight buffer is requested. In the packed-only path (mPtrTopKIds == nullptr && mPtrTopKWeights == nullptr), this branch leaves idx uninitialized and can send atomicAdd to a garbage expert slot, corrupting counts before the offsets stage runs.

🐛 Proposed fix
     if (params.mPtrTopKIds != nullptr) {
       idx = params.mPtrTopKIds[expandedIdx];
     } else {
-      if (params.mPtrTopKWeights != nullptr) {
-        scoreIdx = params.mPtrTopKPacked[expandedIdx];
-        idx = scoreIdx.idx;
+      scoreIdx = params.mPtrTopKPacked[expandedIdx];
+      idx = scoreIdx.idx;
+      if (params.mPtrTopKWeights != nullptr) {
         params.mPtrTopKWeights[expandedIdx] = static_cast<OutputT>(scoreIdx.score);
       }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh` around lines 440 -
453, The code paths that set idx only when mPtrTopKIds or mPtrTopKWeights are
present leave idx uninitialized when mPtrTopKIds==nullptr and
mPtrTopKWeights==nullptr; always decode params.mPtrTopKPacked[expandedIdx] into
scoreIdx and assign idx = scoreIdx.idx (and set mPtrTopKWeights if present)
before computing localExpertIdx/isLocalExpert so
atomicAdd(&smemExpertCount[idx], 1) uses a valid index; update the branch around
params.mPtrTopKIds/mPtrTopKWeights to unconditionally read params.mPtrTopKPacked
and populate idx first, then conditionally write params.mPtrTopKWeights.
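A host-side sketch of the unconditional-decode shape the fix describes (struct and function names are illustrative):

```cpp
#include <cstdint>

struct PackedScoreIdx {  // stand-in for the packed score/index entry
  float score;
  int32_t idx;
};

// Always decode the packed entry when no explicit top-k id buffer is given,
// so idx is initialized before it feeds the expert histogram; the weight
// write stays optional.
int32_t resolveExpertIdx(const int32_t* topKIds, const PackedScoreIdx* packed,
                         float* weightsOut /* may be nullptr */, int expandedIdx) {
  if (topKIds != nullptr) return topKIds[expandedIdx];
  PackedScoreIdx s = packed[expandedIdx];
  if (weightsOut != nullptr) weightsOut[expandedIdx] = s.score;
  return s.idx;
}
```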

Comment on lines +2872 to +2892
pytest.param(
{
"num_experts": 2048,
"top_k": 32,
"padding": 8,
"n_groups": None,
"top_k_groups": None,
"routed_scaling": None,
"has_routing_bias": False,
"routing_method_type": RoutingMethodType.Renormalize,
"compatible_moe_impls": [
FP8BlockScaleMoe,
FP4Moe,
BF16Moe,
MxInt4BlockScaleMoe,
],
"compatible_intermediate_size": [384],
"enable_autotune": True,
},
id="RoutingRenormalize_large_experts",
),

⚠️ Potential issue | 🟠 Major

This 2048-expert case is too large for the full parametrized matrix.

With hidden_size=1024 and gated activations, this single config already allocates about 3 GiB for gemm1_weights plus 1.5 GiB for gemm2_weights in bf16, before quantized/shuffled copies or reference tensors. Because it also participates in the full num_tokens × zero_hidden_states × moe_impl × weight_processing × activation_type matrix with autotune enabled, it is very likely to become an OOM/timeout source on common CI GPUs. Please isolate the 2048/32 coverage into a small targeted smoke test and keep autotune off there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/moe/test_trtllm_gen_fused_moe.py` around lines 2872 - 2892, This
2048-expert config (num_experts=2048, top_k=32, RoutingMethodType.Renormalize)
must be removed from the large parametrized matrix and isolated into a small
targeted smoke test with autotune disabled: create a new focused test that uses
the same dict but sets enable_autotune=False (and a distinct id like
"RoutingRenormalize_large_experts_smoke"), and remove or replace the
pytest.param entry from the full matrix so the heavy case no longer participates
in the combinatorial num_tokens × zero_hidden_states × moe_impl ×
weight_processing × activation_type runs.

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu (1)

69-78: ⚠️ Potential issue | 🔴 Critical

Guard the optional mPtrExpandedIdxToPermutedIdx write.

mPtrExpandedIdxToPermutedIdx is optional, but the expertIdx == -1 branch at line 77 dereferences it unconditionally. A caller that passes precomputed mPtrTopKIds with padded -1 entries without requesting the expanded-index mapping will crash here.

Note: The same pointer is correctly guarded later at lines 242-244.

Suggested fix
         } else {
-          params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
+          if (params.mPtrExpandedIdxToPermutedIdx != nullptr) {
+            params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
+          }
         }


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu` around
lines 69 - 78, The branch handling expertIdx == -1 writes to the optional buffer
params.mPtrExpandedIdxToPermutedIdx unguarded; guard that write by checking
params.mPtrExpandedIdxToPermutedIdx != nullptr before assigning int32_t{-1} (so
when mPtrExpandedIdxToPermutedIdx is null we skip the write), keeping the
existing behavior for smemKIdx and using the same null-check pattern as later
code; locate this in the block that reads params.mPtrTopKIds, using
warpIdx/laneIdx/MaxNumExperts to find the correct index.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu`:
- Around line 69-78: The branch handling expertIdx == -1 writes to the optional
buffer params.mPtrExpandedIdxToPermutedIdx unguarded; guard that write by
checking params.mPtrExpandedIdxToPermutedIdx != nullptr before assigning
int32_t{-1} (so when mPtrExpandedIdxToPermutedIdx is null we skip the write),
keeping the existing behavior for smemKIdx and using the same null-check pattern
as later code; locate this in the block that reads params.mPtrTopKIds, using
warpIdx/laneIdx/MaxNumExperts to find the correct index.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9b5e2c2d-a01d-435d-8308-22fdc910cbf7

📥 Commits

Reviewing files that changed from the base of the PR and between ce401d7 and 77d13ac.

📒 Files selected for processing (2)
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_renormalize.cu

@jiahanc
Collaborator Author

jiahanc commented Mar 11, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !400 has been updated with latest changes, and the CI pipeline #45886576 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[CANCELING] Pipeline #45886576: canceled

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (5)
include/flashinfer/trtllm/common/cudaUtils.h (1)

272-277: ⚠️ Potential issue | 🟠 Major

Add CUDA error checks in getMultiProcessorCount.

Line 274 and Line 275 ignore CUDA return codes; on failure, count can be invalid and propagate bad launch configuration decisions.

🔧 Proposed fix
 inline int getMultiProcessorCount() {
   int device, count;
-  cudaGetDevice(&device);
-  cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device);
+  FLASHINFER_CHECK(cudaGetDevice(&device) == cudaSuccess,
+                   "CUDA error in cudaGetDevice");
+  FLASHINFER_CHECK(cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device) == cudaSuccess,
+                   "CUDA error in cudaDeviceGetAttribute");
   return count;
 }
#!/bin/bash
# Verify whether CUDA calls in getMultiProcessorCount are guarded.
rg -n -A8 -B2 'inline int getMultiProcessorCount\(\)' include/flashinfer/trtllm/common/cudaUtils.h
rg -n 'FLASHINFER_CHECK\(cudaGetDevice|FLASHINFER_CHECK\(cudaDeviceGetAttribute' include/flashinfer/trtllm/common/cudaUtils.h
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/flashinfer/trtllm/common/cudaUtils.h` around lines 272 - 277, The
function getMultiProcessorCount currently calls cudaGetDevice and
cudaDeviceGetAttribute without checking return values; wrap those calls with the
project's CUDA-check macro (e.g., FLASHINFER_CHECK or equivalent) or explicitly
check their cudaError_t results, log or handle failures, and return a safe
sentinel (e.g., 0 or -1) when an error occurs instead of returning an
uninitialized count; update references to cudaGetDevice and
cudaDeviceGetAttribute inside getMultiProcessorCount to use the chosen
error-checking pattern so callers never receive an invalid count.
csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu (1)

246-247: ⚠️ Potential issue | 🟠 Major

Guard mPtrPermutedIdxToExpandedIdx writes with isTokenRouted.

At Line 246, writing reverse-map entries for isLocalExpert without requiring routing selection can overwrite valid slots for subsequent routed tokens. Mirror the guard used by the other reverse-map write.

💡 Proposed fix
-      if (params.mPtrPermutedIdxToExpandedIdx != nullptr && isLocalExpert) {
+      if (params.mPtrPermutedIdxToExpandedIdx != nullptr && isLocalExpert &&
+          isTokenRouted) {
         params.mPtrPermutedIdxToExpandedIdx[permutedIdx] = tokenIdx;
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu` around
lines 246 - 247, The reverse-map write to params.mPtrPermutedIdxToExpandedIdx
guarded only by isLocalExpert can overwrite slots for tokens that were not
routed; update the guard to require both isLocalExpert and isTokenRouted (mirror
the other reverse-map write) before assigning
params.mPtrPermutedIdxToExpandedIdx[permutedIdx] = tokenIdx so only routed
tokens create reverse-map entries (use the same condition structure as the other
reverse-map write to locate and fix the assignment).
tests/moe/test_trtllm_gen_fused_moe.py (1)

2872-2892: ⚠️ Potential issue | 🟠 Major

Isolate this 2048-expert config into a targeted smoke test.

This config allocates ~5 GiB for weights alone (2048 experts × gated intermediate × hidden_size in bf16) before quantized copies. Since it participates in the full parametrized matrix with enable_autotune=True, it risks CI OOM/timeout on standard GPUs.

Consider:

  1. Move to a separate focused test with enable_autotune=False
  2. Limit to a single moe_impl and weight_processing combination
  3. Use the smallest num_tokens value


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/moe/test_trtllm_gen_fused_moe.py` around lines 2872 - 2892, This large
2048-expert pytest.param (id "RoutingRenormalize_large_experts") should be
pulled out of the full parametrized matrix and converted into a focused smoke
test to avoid CI OOMs: create a new dedicated test that uses this dict but set
"enable_autotune": False, reduce "compatible_moe_impls" to a single
implementation (pick one from the current list, e.g., BF16Moe), restrict any
"weight_processing" combinations to a single option, and set the test's input
size to the minimum by lowering "num_tokens" to the smallest allowed value; keep
the original "id" for traceability and ensure the new test is run separately
from the large parametrized suite.
csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu (1)

189-195: ⚠️ Potential issue | 🔴 Critical

Insert a global fence before completing the PDL launch.

Line 192 already calls out the problem: __syncthreads() only orders the CTA, but the dependent kernel consumes global metadata written above. Without a device/system fence before cudaTriggerProgrammaticLaunchCompletion(), the next kernel can observe stale mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrPermutedIdxSize, or mPtrNumNonExitingCtas.

Suggested fix
   if constexpr (KernelParams::UsePdl) {
+    __threadfence_system();
     cudaTriggerProgrammaticLaunchCompletion();
   }
According to the official NVIDIA CUDA documentation, does `cudaTriggerProgrammaticLaunchCompletion()` require an explicit `__threadfence_system()` (or another memory fence) to guarantee that prior global-memory writes are visible to the dependent kernel before programmatic launch completion?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu` around
lines 189 - 195, The programmatic launch completion lacks a system-wide memory
fence: when KernelParams::UsePdl is true, insert a device/system fence (e.g.,
call __threadfence_system() or equivalent) immediately before the existing
cudaTriggerProgrammaticLaunchCompletion() so that prior global writes to
mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrPermutedIdxSize,
mPtrNumNonExitingCtas, and mPtrTotalNumPaddedTokens are visible to the dependent
kernel; update the block guarded by if constexpr (KernelParams::UsePdl) to call
the global fence first, then call cudaTriggerProgrammaticLaunchCompletion().
csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu (1)

69-78: ⚠️ Potential issue | 🔴 Critical

Guard the optional expanded-index mapping write.

mPtrExpandedIdxToPermutedIdx is optional in this backend, but the expertIdx == -1 branch dereferences it unconditionally. A top-k-id caller that omits that mapping buffer will hit a null device pointer here.

Suggested fix
         if (expertIdx != -1) {
           int offset = warpIdx * MaxNumExperts + expertIdx;
           smemKIdx[offset] = static_cast<int8_t>(laneIdx);
-        } else {
+        } else if (params.mPtrExpandedIdxToPermutedIdx != nullptr) {
           params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu` around
lines 69 - 78, The branch handling expertIdx == -1 writes unconditionally to the
optional device pointer params.mPtrExpandedIdxToPermutedIdx, which can be null;
modify the code in the launchBlockKernel logic so you first test if
params.mPtrExpandedIdxToPermutedIdx != nullptr before performing the write
(params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = -1); if
the pointer is null simply skip that write (leave smemKIdx handling unchanged),
ensuring all accesses to params.mPtrExpandedIdxToPermutedIdx are guarded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchClusterKernel.cu`:
- Around line 25-27: The __launch_bounds__ macro in the
routingIndicesClusterKernel definition uses KernelParams::MaxNumExperts
unconditionally which can exceed CUDA's 1024-thread-per-block limit; change the
launch bound to cap MaxNumExperts at 1024 (i.e., use a conditional expression
like KernelParams::MaxNumExperts <= 1024 ? KernelParams::MaxNumExperts : 1024)
for the __launch_bounds__ on the routingIndicesClusterKernel function; also
apply the same conditional cap fix to the identical usage in launchCoopKernel.cu
to prevent compile/runtime failures when MaxNumExperts > 1024.

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu`:
- Around line 127-130: The isLocal predicate incorrectly checks localExpIdx <
params.mNumLocalExperts but must account for the strided extent when expert
interleaving is enabled; change the upper bound to (params.mNumLocalExperts <<
params.mLocalExpertsStrideLog2) (i.e., use the strided local-expert extent) so
that localExpIdx is compared against the true number of interleaved slots, and
update the same logic where it appears (the other occurrences around the blocks
referencing localExpIdx/isLocal at the later sites). Ensure you keep the
existing bitmask check for stride alignment ((localExpIdx & ((1 <<
params.mLocalExpertsStrideLog2) - 1)) == 0) and only replace the plain
mNumLocalExperts bound with the shifted value.

---

Duplicate comments:
In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu`:
- Around line 189-195: The programmatic launch completion lacks a system-wide
memory fence: when KernelParams::UsePdl is true, insert a device/system fence
(e.g., call __threadfence_system() or equivalent) immediately before the
existing cudaTriggerProgrammaticLaunchCompletion() so that prior global writes
to mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrPermutedIdxSize,
mPtrNumNonExitingCtas, and mPtrTotalNumPaddedTokens are visible to the dependent
kernel; update the block guarded by if constexpr (KernelParams::UsePdl) to call
the global fence first, then call cudaTriggerProgrammaticLaunchCompletion().

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu`:
- Around line 69-78: The branch handling expertIdx == -1 writes unconditionally
to the optional device pointer params.mPtrExpandedIdxToPermutedIdx, which can be
null; modify the code in the launchBlockKernel logic so you first test if
params.mPtrExpandedIdxToPermutedIdx != nullptr before performing the write
(params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = -1); if
the pointer is null simply skip that write (leave smemKIdx handling unchanged),
ensuring all accesses to params.mPtrExpandedIdxToPermutedIdx are guarded.

In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu`:
- Around line 246-247: The reverse-map write to
params.mPtrPermutedIdxToExpandedIdx guarded only by isLocalExpert can overwrite
slots for tokens that were not routed; update the guard to require both
isLocalExpert and isTokenRouted (mirror the other reverse-map write) before
assigning params.mPtrPermutedIdxToExpandedIdx[permutedIdx] = tokenIdx so only
routed tokens create reverse-map entries (use the same condition structure as
the other reverse-map write to locate and fix the assignment).

In `@include/flashinfer/trtllm/common/cudaUtils.h`:
- Around line 272-277: The function getMultiProcessorCount currently calls
cudaGetDevice and cudaDeviceGetAttribute without checking return values; wrap
those calls with the project's CUDA-check macro (e.g., FLASHINFER_CHECK or
equivalent) or explicitly check their cudaError_t results, log or handle
failures, and return a safe sentinel (e.g., 0 or -1) when an error occurs
instead of returning an uninitialized count; update references to cudaGetDevice
and cudaDeviceGetAttribute inside getMultiProcessorCount to use the chosen
error-checking pattern so callers never receive an invalid count.

In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 2872-2892: This large 2048-expert pytest.param (id
"RoutingRenormalize_large_experts") should be pulled out of the full
parametrized matrix and converted into a focused smoke test to avoid CI OOMs:
create a new dedicated test that uses this dict but set "enable_autotune":
False, reduce "compatible_moe_impls" to a single implementation (pick one from
the current list, e.g., BF16Moe), restrict any "weight_processing" combinations
to a single option, and set the test's input size to the minimum by lowering
"num_tokens" to the smallest allowed value; keep the original "id" for
traceability and ensure the new test is run separately from the large
parametrized suite.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: df6f5495-6b6d-40ac-97d5-297d77567843

📥 Commits

Reviewing files that changed from the base of the PR and between eb2c0e3 and eea25a1.

📒 Files selected for processing (32)
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchClusterKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchHistogramKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchInitExpertCounts.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchMainKernel.cu
  • csrc/fused_moe/trtllm_backend/routingDeepSeek/launchOffsetsKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchClusterKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramScoresKernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchInitExpertCounts.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchOffsetsKernel.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_dev_kernel.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_renormalize.cu
  • csrc/moe_utils_binding.cu
  • csrc/trtllm_fused_moe_kernel_launcher.cu
  • csrc/trtllm_fused_moe_routing_deepseek.cu
  • csrc/trtllm_fused_moe_routing_renormalize.cu
  • csrc/trtllm_fused_moe_runner.cu
  • flashinfer/jit/cpp_ext.py
  • flashinfer/jit/fused_moe.py
  • flashinfer/jit/moe_utils.py
  • include/flashinfer/trtllm/common/cudaUtils.h
  • include/flashinfer/trtllm/fused_moe/DevKernel.h
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh
  • include/flashinfer/trtllm/fused_moe/RoutingKernel.h
  • include/flashinfer/trtllm/fused_moe/RoutingKernelTopK.cuh
  • tests/moe/test_trtllm_gen_fused_moe.py
💤 Files with no reviewable changes (3)
  • csrc/moe_utils_binding.cu
  • csrc/trtllm_fused_moe_routing_renormalize.cu
  • csrc/trtllm_fused_moe_routing_deepseek.cu
🚧 Files skipped from review as they are similar to previous changes (7)
  • csrc/trtllm_fused_moe_kernel_launcher.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramScoresKernel.cu
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_dev_kernel.cu
  • csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramKernel.cu
  • csrc/trtllm_fused_moe_runner.cu
  • flashinfer/jit/cpp_ext.py
  • csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu

Comment on lines +127 to +130
int expert = threadIdx.x * ExpertsPerThread + e;
auto localExpIdx = expert - params.mLocalExpertsStartIdx;
auto isLocal = localExpIdx >= 0 && localExpIdx < params.mNumLocalExperts &&
(localExpIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;
Contributor

⚠️ Potential issue | 🔴 Critical

Use the strided local-expert extent here.

These predicates use localExpIdx < params.mNumLocalExperts, but the other routing paths already treat the local range as mNumLocalExperts << mLocalExpertsStrideLog2. With expert interleaving enabled, this block path will misclassify valid local experts as remote and produce wrong counts, CTA mappings, and permuted indices.

Suggested fix
+  auto localExpertExtent = params.mNumLocalExperts << params.mLocalExpertsStrideLog2;
   int accExpertCount[ExpertsPerThread];
 #pragma unroll
   for (int e = 0; e < ExpertsPerThread; e++) {
     int expert = threadIdx.x * ExpertsPerThread + e;
     auto localExpIdx = expert - params.mLocalExpertsStartIdx;
-    auto isLocal = localExpIdx >= 0 && localExpIdx < params.mNumLocalExperts &&
+    auto isLocal = localExpIdx >= 0 && localExpIdx < localExpertExtent &&
                    (localExpIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;
@@
     int expert = threadIdx.x * ExpertsPerThread + e;
     auto localExpIdx = expert - params.mLocalExpertsStartIdx;
-    auto isLocal = localExpIdx >= 0 && localExpIdx < params.mNumLocalExperts &&
+    auto isLocal = localExpIdx >= 0 && localExpIdx < localExpertExtent &&
                    (localExpIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;
@@
         if (smemKIdx[offset] >= 0) {
           auto localExpIdx = expert - params.mLocalExpertsStartIdx;
-          auto isLocal = localExpIdx >= 0 && localExpIdx < params.mNumLocalExperts &&
+          auto isLocal = localExpIdx >= 0 && localExpIdx < localExpertExtent &&
                          (localExpIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;

Also applies to: 183-185, 233-235

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu` around
lines 127 - 130, The isLocal predicate incorrectly checks localExpIdx <
params.mNumLocalExperts but must account for the strided extent when expert
interleaving is enabled; change the upper bound to (params.mNumLocalExperts <<
params.mLocalExpertsStrideLog2) (i.e., use the strided local-expert extent) so
that localExpIdx is compared against the true number of interleaved slots, and
update the same logic where it appears (the other occurrences around the blocks
referencing localExpIdx/isLocal at the later sites). Ensure you keep the
existing bitmask check for stride alignment ((localExpIdx & ((1 <<
params.mLocalExpertsStrideLog2) - 1)) == 0) and only replace the plain
mNumLocalExperts bound with the shifted value.

@jiahanc
Collaborator Author

jiahanc commented Mar 16, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !400 has been updated with latest changes, and the CI pipeline #46225337 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #46225337: 9/20 passed

@jiahanc jiahanc enabled auto-merge (squash) March 17, 2026 01:36
Collaborator

@IwakuraRein IwakuraRein left a comment


LGTM

jiahanc added 3 commits March 16, 2026 21:00
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
@jiahanc jiahanc merged commit abf080a into flashinfer-ai:main Mar 17, 2026
29 checks passed
frankwang28 pushed a commit to frankwang28/flashinfer that referenced this pull request Mar 18, 2026
<!-- .github/pull_request_template.md -->

## 📌 Description


- Integrate NVIDIA/TensorRT-LLM#11510 to support
2048 num of experts and 32 TopK in renormalize
- Refactor MOE cu files

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Expanded MoE routing/renormalize to support up to 2,048 experts and
top-k up to 32; backend reorganized to enable larger configurations.

* **Bug Fixes**
* Clamped token counts in kernel launches to prevent oversized grid
launches.

* **Performance**
* Reworked routing/launch paths for improved scalability and throughput
with large expert/top-k settings.

* **Tests**
* Added test scenarios covering large-expert (2,048) + top-k (32)
configurations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
ameynaik-hub pushed a commit to ameynaik-hub/flashinfer that referenced this pull request Mar 18, 2026
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>
4 participants