
[None][feat] Refactor the routing part in trtllmgen#12246

Open
ChristinaZ wants to merge 1 commit into NVIDIA:main from ChristinaZ:refactor_routing

Conversation

Collaborator

@ChristinaZ ChristinaZ commented Mar 16, 2026

Refactor the routing part in trtllmgen. This includes adding selectTopKPolicy to make it easier to add customized routing methods, and disabling PDL for the last routing kernel as a workaround.

Add support for float32 and bfloat16 for the input, bias, and output data types in routing.
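As a rough illustration of the sigmoid-renorm style of routing this PR adds, here is a plain-Python reference sketch. The function name and the exact selection/normalization order are assumptions for illustration; the real kernels operate on fp32/bf16 tensors on device.

```python
import math

def sigmoid_renorm_routing(logits, top_k):
    """Reference sketch: sigmoid each logit, pick the top-k experts,
    then renormalize the selected scores so they sum to 1.

    `logits` is a flat list of per-expert scores for one token; this is an
    illustration, not the kernel's exact contract.
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Select the k highest-scoring experts (descending by score).
    topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in topk)
    weights = [scores[i] / total for i in topk]
    return topk, weights
```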

Summary by CodeRabbit

Release Notes

  • New Features

    • Added configurable routing policies (preprocessing, postprocessing) for flexible expert selection strategies.
    • Introduced SigmoidRenorm routing method for enhanced routing flexibility.
    • Added cluster-based tiling support for improved memory and execution efficiency.
    • Extended routing configuration to support custom bias and scaling parameters.
  • Bug Fixes & Improvements

    • Optimized kernel execution orchestration with improved synchronization.
    • Enhanced support for various expert configurations and token batch sizes.
  • Tests

    • Added comprehensive routing policy test coverage.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Contributor

coderabbitai bot commented Mar 16, 2026

📝 Walkthrough


Major refactoring of MOE routing kernel infrastructure: removes legacy DeepSeek and Renormalize routing implementations, consolidates into a unified policy-driven routing system in a new routing/ subdirectory with custom kernels, launch macros, and post-TopK pipeline utilities. Updates public runner API and torch bindings to support new routing methods including SigmoidRenorm and MiniMax2.

Changes

Cohort / File(s) Summary
Removed Legacy Routing Implementations
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
Deleted 94 lines from DevKernel.h (removed complex launch routing macros) and entire files for old RoutingDeepSeek.cu (156 lines) and RoutingRenormalize.cu (108 lines) containing legacy kernel implementations and dispatch logic.
Removed DeepSeek Routing Common Utilities
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/*
Deleted entire DeepSeekCommon.cuh header (115 lines) and all DeepSeek launch wrapper files (launchClusterKernel.cu, launchCoopKernel.cu, launchHistogramKernel.cu, launchInitExpertCounts.cu, launchMainKernel.cu, launchOffsetsKernel.cu totaling 737 lines) containing kernel implementations and old routing orchestration logic.
Removed Renormalize Routing Common Utilities
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/*
Deleted RoutingRenormalizeCommon.cuh header (160 lines) and all renormalize launch wrappers (launchBlockKernel.cu, launchClusterKernel.cu, launchHistogramKernel.cu, launchHistogramScoresKernel.cu, launchInitExpertCounts.cu, launchOffsetsKernel.cu totaling 623 lines) containing kernel implementations and dispatch logic.
New Unified Routing Infrastructure
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustom.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustomPolicy.cuh, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingDevKernel.h
Added 673 lines of custom routing kernel implementations with block/cluster/histogram kernels, 769 lines of policy framework supporting preprocessing/postprocessing policies with tiered dispatch, and 172 lines of new launch macros (LAUNCH_PDL_ROUTING, LAUNCH_ROUTING_WITH_POLICIES, etc.) replacing old ad-hoc routing.
New DeepSeek Implementation at New Location
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingDeepSeek.cu
Added 620 lines implementing comprehensive DeepSeek routing pipeline with main kernel, cluster kernel, and launch wrappers for multi-path execution, replacing old separate launcher files with unified control flow.
Post-TopK Pipeline and Kernel Enhancements
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingFromTopKIds.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingKernel.cuh, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingKernel.h
Added 127 lines for shared post-TopK pipeline utility, expanded RoutingKernel.cuh with 382 new lines adding coop kernel and helper functions, updated RoutingKernel.h with 103 new lines for cluster metadata, runtime PDL control, and policy-driven parameter structs.
Runner and Public API Updates
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h
Extended Runner constructor to accept clusterSizeInBatchDim, updated run() signature to include dtypeRoutingLogits parameter, refactored routing paths to use new routingCustom::run with policy-driven configuration, replaced CTA nomenclature with CGA for consistency (234 new lines in .cu, 37 new lines in .h).
Python Torch Module Integration
tensorrt_llm/_torch/modules/fused_moe/routing.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py, tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py, tensorrt_llm/_torch/models/modeling_minimaxm2.py
Added SigmoidRenormMoeRoutingMethod class and RoutingMethodType.SigmoidRenorm enum value, extended fused_moe to handle MiniMax2 and SigmoidRenorm routing methods, added SM-version-based constraint logic, updated MiniMaxM2MoE with bias_dtype selection based on moe_backend.
Torch Operator Call Sites
cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp, cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp, cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp, cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp, cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
Updated all MoE operator call sites to compute and propagate dtypeRoutingLogits parameter to routing kernel, relaxed routing_logits type checking to accept BFloat16 or Float32 uniformly across routing methods.
Test Framework Refactoring
cpp/tests/unit_tests/kernels/routing/routingTest.h, cpp/tests/unit_tests/kernels/routing/routingTest.cpp, cpp/tests/unit_tests/kernels/CMakeLists.txt
Replaced old RoutingKernelTestParam constructor-based interface with fluent builder pattern, added preprocessType/postprocessType fields, introduced useTopKPackedAsInput and invalidExpertIdValue parameters, extended test utility functions to handle new input paths and policy-driven verification.
New Custom Routing Test Suite
cpp/tests/unit_tests/kernels/routing/routingCustomTest.cpp
Added 1458 lines of comprehensive test fixture RoutingCustomKernelTest exercising policy combinations (Softmax, Sigmoid, SigmoidBias, ScaledSumNormalize, SumNormalize, None), input formats (scores, TopK, packed TopK), and multiple execution paths (block, cluster, coop, device).
Updated Routing Method Tests
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp, cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp
Extended DeepSeekTest with TopKPackedAsInput variants and float32 bias test paths, updated Llama4Test to apply sigmoid to top-K scores with builder pattern construction, both now use updated mDtypeOutput field.
Removed Renormalize Tests
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
Deleted entire 387-line test file exercising Renormalize routing variants; functionality merged into RoutingCustomTest with policy-driven parametrization.
Serial Test Reference Implementations
tests/unittest/_torch/thop/serial/test_moe.py, tests/unittest/_torch/modules/moe/moe_test_utils.py, tests/unittest/_torch/modules/moe/test_moe_module.py
Added routing_reference_minimax and routing_reference_cohere_sigmoid reference implementations supporting MiniMax2 and SigmoidRenorm routing tests, extended parametrizations to include new routing methods, added use_flashinfer parameter plumbing for autotune logic.
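To make the Python-side enum change above concrete, here is a hypothetical mirror of the extended RoutingMethodType and of the fp32 pinning noted later in this review for the new methods. The member values are assumptions; the real enum lives in tensorrt_llm/_torch/modules/fused_moe/routing.py.

```python
from enum import Enum

# Hypothetical mirror of the enum extension; member values are illustrative,
# not copied from routing.py.
class RoutingMethodType(Enum):
    Default = 0
    DeepSeekV3 = 1
    Llama4 = 2
    MiniMax2 = 3
    SigmoidRenorm = 4  # added by this PR

def pinned_to_fp32_in_tests(method: RoutingMethodType) -> bool:
    # Mirrors the test pinning flagged in the review below: the serial MoE
    # tests currently force fp32 expert logits for these routing methods.
    return method in (RoutingMethodType.DeepSeekV3,
                      RoutingMethodType.MiniMax2,
                      RoutingMethodType.SigmoidRenorm)
```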

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes
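The new routingCustomTest.cpp described above exercises policy combinations and input formats. As a sketch of how such a grid can be enumerated, here is a small Python helper; the split of the listed policies into preprocess/postprocess groups is an assumption for illustration, since the actual pairings are defined in the test fixture.

```python
from itertools import product

# Policy and input-format names come from the walkthrough above; the
# pre/post grouping is illustrative, not the fixture's exact partition.
PREPROCESS = ["Softmax", "Sigmoid", "SigmoidBias", "None"]
POSTPROCESS = ["ScaledSumNormalize", "SumNormalize", "None"]
INPUT_FORMATS = ["scores", "topk", "topk_packed"]

def policy_grid():
    """Enumerate (preprocess, postprocess, input_format) combinations."""
    return list(product(PREPROCESS, POSTPROCESS, INPUT_FORMATS))
```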


Suggested reviewers

  • yweng0828
🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 20.17%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check — ⚠️ Warning. The pull request description is incomplete and lacks critical information required by the repository template. Resolution: add a comprehensive description explaining the refactoring goals, technical approach, and justification; complete the Test Coverage section listing relevant tests; verify all PR Checklist items are addressed; and provide a proper JIRA/GitHub issue reference in the title.
  • Title check — ❓ Inconclusive. The title is vague and generic, using non-descriptive terms like 'refactor' without specifying what routing aspects are being refactored or the primary improvements. Resolution: replace with a more specific title that highlights the main change, such as '[None][feat] Refactor routing subsystem with policy-driven design and multi-dtype support'.



Comment @coderabbitai help to get the list of available commands and usage tips.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year in this modified file.

Line 2 still ends at 2025 after 2026 edits.

As per coding guidelines "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of the latest meaningful modification."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp` at line 2,
Update the file header's copyright year from 2025 to 2026 in the top comment
(the line containing "Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights
reserved.") so it reflects the latest meaningful modification; replace "2025"
with "2026".
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year in the file header.

Line 2 still ends at 2025 even though this file has 2026 changes.

As per coding guidelines "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of the latest meaningful modification."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu` at line 2,
Update the file header copyright line that currently reads "Copyright (c)
2022-2025, NVIDIA CORPORATION." to include the latest modification year (change
2025 to 2026) so it becomes "Copyright (c) 2022-2026, NVIDIA CORPORATION.";
locate and edit the header string at the top of the file (the comment containing
that copyright) to apply this single-line change.
cpp/tests/unit_tests/kernels/routing/routingTest.h (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the header year for this modified file.

Line 2 still ends at 2025 after 2026 changes.

As per coding guidelines "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of the latest meaningful modification."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/unit_tests/kernels/routing/routingTest.h` at line 2, Update the
top-of-file copyright header in routingTest.h to reflect the latest modification
year by changing the ending year from 2025 to 2026; locate the file's copyright
comment at the very beginning (the header comment block) and replace "2022-2025"
with "2022-2026" so it matches the coding guideline requiring the latest
meaningful modification year.
🧹 Nitpick comments (10)
tests/unittest/_torch/modules/moe/test_moe_module.py (1)

1153-1153: Avoid duplicate seq-len source of truth.

Line 1153 hardcodes [1,8] while SEQ_LENS is already defined as [1, 8]. Prefer seq_lens=SEQ_LENS to avoid drift in future edits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modules/moe/test_moe_module.py` at line 1153, The test
hardcodes seq_lens=[1,8] for CI while SEQ_LENS is already defined as [1, 8];
replace the conditional expression that sets seq_lens to the literal with a
single reference to SEQ_LENS (i.e., set seq_lens=SEQ_LENS) so the test uses one
source of truth (refer to SEQ_LENS and the seq_lens parameter in the test
invocation in test_moe_module.py).
tensorrt_llm/_torch/modules/fused_moe/routing.py (1)

442-472: Add explicit config/runtime validation for num_experts and top_k.

Line 453 stores num_experts, but it is never enforced. A mismatch currently fails later inside torch.topk with less actionable errors. Early checks would make misconfiguration failures clearer.

Proposed hardening diff
 class SigmoidRenormMoeRoutingMethod(BaseMoeRoutingMethod):
@@
     def __init__(
         self,
         top_k: int,
         num_experts: int,
         renormalize: bool = True,
         output_dtype: torch.dtype = torch.float32,
     ):
         super().__init__()
+        if top_k <= 0:
+            raise ValueError(f"top_k must be > 0, got {top_k}")
+        if num_experts <= 0:
+            raise ValueError(f"num_experts must be > 0, got {num_experts}")
+        if top_k > num_experts:
+            raise ValueError(
+                f"top_k ({top_k}) must be <= num_experts ({num_experts})")
         self.top_k = top_k
         self.num_experts = num_experts
         self.renormalize = renormalize
         self.output_dtype = output_dtype
@@
     def apply(self,
               router_logits: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
+        if router_logits.shape[-1] != self.num_experts:
+            raise ValueError(
+                f"Expected router_logits.shape[-1] == {self.num_experts}, "
+                f"got {router_logits.shape[-1]}")
         scores = torch.sigmoid(router_logits)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/routing.py` around lines 442 - 472, The
constructor of SigmoidRenormMoeRoutingMethod stores top_k and num_experts but
does not validate them, causing obscure failures in apply when torch.topk is
called; add explicit runtime/config validation in __init__
(SigmoidRenormMoeRoutingMethod.__init__) to ensure num_experts is a positive
integer and top_k is an integer between 1 and num_experts (inclusive), and raise
a clear ValueError if these conditions are not met (include the offending values
in the message) so misconfiguration fails fast and with actionable output.
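The fail-fast validation suggested above can be exercised standalone. This is a sketch under the assumption that the checks live in a plain helper; in the actual proposal they sit in SigmoidRenormMoeRoutingMethod.__init__ and apply(), and the tensor-shape check is omitted here.

```python
# Hypothetical standalone version of the suggested constructor validation.
def validate_routing_config(top_k: int, num_experts: int) -> None:
    if top_k <= 0:
        raise ValueError(f"top_k must be > 0, got {top_k}")
    if num_experts <= 0:
        raise ValueError(f"num_experts must be > 0, got {num_experts}")
    if top_k > num_experts:
        raise ValueError(
            f"top_k ({top_k}) must be <= num_experts ({num_experts})")
```

With these checks, a misconfiguration such as top_k=9 with num_experts=8 fails immediately with an actionable message instead of surfacing later inside torch.topk.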
tests/unittest/_torch/modules/moe/moe_test_utils.py (1)

287-309: Consider deleting this commented-out skip block or adding a note about the fix.

This skip block was protecting against a known accuracy issue with NVFP4 tactic[204] at seq_len=1 with large model configs. Commenting it out implies the issue is resolved, but leaving the code as a comment creates ambiguity:

  • If the PDL workaround (mentioned in PR objectives) resolves this issue, delete the block entirely and optionally add a brief comment noting the fix.
  • If there's uncertainty about whether the fix is permanent, consider keeping the skip logic active until confirmed.

Leaving dead code in comments makes it unclear whether this is intentional or accidental, and future maintainers may not know whether re-enabling it is appropriate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modules/moe/moe_test_utils.py` around lines 287 - 309,
The commented-out skip block checking quant_algo == QuantAlgo.NVFP4,
swiglu_gptoss_style, seq_len == 1, num_experts >= 256, and
model_config.hidden_size >= 7168 should be either removed or made explicit:
either delete the entire commented block (including the explanatory text and the
return message) if the PDL fix permanently resolves the NVFP4 tactic[204]
accuracy bug, or re-enable it as an active guard (restore the if + return) and
add a one-line TODO comment referencing the PR/PDL that fixed it and tests that
must pass before removal; locate the block by the variables quant_algo,
QuantAlgo.NVFP4, swiglu_gptoss_style, seq_len, num_experts, and
model_config.hidden_size in moe_test_utils.py to apply the change.
tests/unittest/_torch/thop/serial/test_moe.py (2)

941-944: Remove commented-out @pytest.mark.skip blocks after unskipping.

Keeping these decorators as comments adds noise and makes test intent harder to read.

Also applies to: 1079-1082, 2074-2077, 2303-2306

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/thop/serial/test_moe.py` around lines 941 - 944, Remove
the commented-out pytest skip decorators (lines containing "#
`@pytest.mark.skip`") in tests/unittest/_torch/thop/serial/test_moe.py so the test
files are clean and intent is clear; locate occurrences of the commented
"@pytest.mark.skip" blocks (notably the blocks around the ranges mentioned:
941-944, 1079-1082, 2074-2077, 2303-2306) and delete those commented lines,
leaving the surrounding test functions and their actual decorators intact (no
other functional changes).

1488-1493: Add bf16 logits coverage for the new routing methods.

Both FP4 and FP8-FP4 paths hard-pin RoutingMethodType.MiniMax2 and RoutingMethodType.SigmoidRenorm to torch.float, so bf16 routing-input behavior for these new methods is not exercised.

💡 Suggested direction
-        if routing_method_type in (RoutingMethodType.DeepSeekV3, RoutingMethodType.MiniMax2, RoutingMethodType.SigmoidRenorm):
-            expert_logits = torch.randn((num_tokens, num_experts),
-                                        device='cuda').to(torch.float)
+        if routing_method_type in (RoutingMethodType.DeepSeekV3, RoutingMethodType.MiniMax2, RoutingMethodType.SigmoidRenorm):
+            # Consider parametrizing this dtype to cover both fp32 and bf16.
+            expert_logits = torch.randn(
+                (num_tokens, num_experts), device='cuda', dtype=torch.float
+            )

Replicate the same parametrized dtype strategy in both run_moe_fp4_test and run_moe_fp8_fp4_test.

Also applies to: 1864-1869

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/thop/serial/test_moe.py` around lines 1488 - 1493, The
test currently forces expert_logits to torch.float for
RoutingMethodType.MiniMax2 and RoutingMethodType.SigmoidRenorm, skipping bf16
coverage; update both run_moe_fp4_test and run_moe_fp8_fp4_test to use the same
parametrized dtype selection as the other routing methods so these two new
routing types also get a bf16 branch—locate the expert_logits assignment (uses
routing_method_type and expert_logits) and mirror the dtype logic used elsewhere
(create bf16 logits for the non-excluded methods and ensure MiniMax2 and
SigmoidRenorm are included in bf16 coverage).
cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp (1)

23-24: Consider moving namespace alias inside torch_ext namespace for consistency.

The btg namespace alias is declared at file scope before TRTLLM_NAMESPACE_BEGIN, whereas other MOE files (e.g., fp8BlockScaleMoe.cpp, fp4BlockScaleMoe.cpp) declare it inside namespace torch_ext. This works but creates a minor inconsistency.

Suggested change for consistency
-namespace btg = batchedGemm::trtllm::gen;
-
 TRTLLM_NAMESPACE_BEGIN

 namespace torch_ext
 {
+namespace btg = batchedGemm::trtllm::gen;
+
 // Sort
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp` around lines 23 - 24, The
file-level namespace alias btg = batchedGemm::trtllm::gen is declared outside
the TRTLLM_NAMESPACE_BEGIN/namespace torch_ext block which is inconsistent with
other MOE files; move the alias declaration inside the namespace torch_ext
(i.e., after TRTLLM_NAMESPACE_BEGIN and before its usages) so btg is scoped the
same as in fp8BlockScaleMoe.cpp and fp4BlockScaleMoe.cpp and update any
references if necessary.
cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp (1)

1-2: Consider updating copyright year.

The copyright header shows 2022-2024, but per coding guidelines, TensorRT-LLM source files should contain the year of the latest meaningful modification. Since this file is being modified in 2026, consider updating to 2022-2026.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp` around lines 1 - 2, Update
the copyright header at the top of fp8PerTensorScaleMoe.cpp to reflect the
latest modification year (change "2022-2024" to "2022-2026"); locate the header
comment block at the top of the file and replace the year range so it matches
project guidelines for the most recent meaningful change.
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustomPolicy.cuh (3)

410-415: Consider stronger error handling for unsupported numExperts.

Returning 0 after logging an error may lead to subtle downstream issues if callers don't explicitly check the return value. Consider using TLLM_CHECK or throwing an exception to fail fast on invalid configurations.

However, if constexpr evaluation at compile-time is required, the current approach may be necessary.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustomPolicy.cuh`
around lines 410 - 415, The code logs "Unsupported numExperts" and returns 0,
which can silently propagate invalid state; replace the return with a fail-fast
mechanism: use TLLM_CHECK(false, "Unsupported numExperts: %d", numExperts) or
throw a std::runtime_error with the same formatted message in the function that
contains this block in RoutingCustomPolicy.cuh, so callers cannot continue with
an invalid configuration; if this path must be valid in a constexpr context,
instead use a compile-time guard (static_assert) for unsupported numExperts
values and keep the runtime check for non-constexpr builds.

355-376: Consider using 'k' prefix for constants.

The namespace-scope static constexpr constants should use the 'k' prefix per coding guidelines. For example:

  • NumExperts128Experts → kNumExperts128
  • NumTop8Experts → kNumTop8
  • NumThreads → kNumThreads
  • MaxSupportedExperts → kMaxSupportedExperts

As per coding guidelines: "Constants naming should be uppercase snakecase with prefix 'k' (e.g., kDIGIT_NUM = 10)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustomPolicy.cuh`
around lines 355 - 376, The file defines namespace-scope static constexpr
constants that don't follow the project's constant naming convention; rename
each constant to use the 'k' prefix and uppercase snake_case per guidelines
(e.g., NumExperts128Experts -> kNUM_EXPERTS_128, NumTop8Experts -> kNUM_TOP_8,
NumThreads -> kNUM_THREADS, MaxSupportedExperts -> kMAX_SUPPORTED_EXPERTS,
MaxNumTokensSingleCluster -> kMAX_NUM_TOKENS_SINGLE_CLUSTER,
MaxNumTokensSingleClusterScores -> kMAX_NUM_TOKENS_SINGLE_CLUSTER_SCORES,
BlockKernelMaxNumTokens -> kBLOCK_KERNEL_MAX_NUM_TOKENS) and update all usages;
specifically update declarations for symbols NumExperts128Experts,
NumExperts160Experts, NumExperts256Experts, NumExperts384Experts,
NumExperts512Experts, NumExperts576Experts, MaxSupportedExperts, NumTop4Experts,
NumTop8Experts, NumTop16Experts, NumTop22Experts, MaxSupportedTopExperts,
NumThreads, NumWarps, MaxNumTokensSingleCluster,
MaxNumTokensSingleClusterScores, and BlockKernelMaxNumTokens to their k-prefixed
uppercase snake_case equivalents and fix any references throughout the
translation unit.

16-16: Consider using preprocessor guard format for consistency.

The coding guidelines specify using TRTLLM_<FILENAME_IN_CAPS>_H format for header guards. While #pragma once is widely supported, consider using:

`#ifndef` TRTLLM_ROUTING_CUSTOM_POLICY_CUH
`#define` TRTLLM_ROUTING_CUSTOM_POLICY_CUH
// ... content ...
`#endif` // TRTLLM_ROUTING_CUSTOM_POLICY_CUH

As per coding guidelines: "Use a preprocessor guard in C++ header files with the format TRTLLM_<FILENAME_IN_CAPS>_H".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustomPolicy.cuh`
at line 16, Replace the lone '#pragma once' in RoutingCustomPolicy.cuh with the
project preprocessor guard format; remove '#pragma once' and add an include
guard using the macro name TRTLLM_ROUTING_CUSTOM_POLICY_CUH (wrap the entire
header between `#ifndef` TRTLLM_ROUTING_CUSTOM_POLICY_CUH / `#define`
TRTLLM_ROUTING_CUSTOM_POLICY_CUH and close with `#endif` //
TRTLLM_ROUTING_CUSTOM_POLICY_CUH).
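A guard of the required shape can be derived mechanically from the filename. The sketch below is an illustrative helper (not part of the PR) that builds TRTLLM_<FILENAME_IN_CAPS>_H from a CamelCase header name; dropping the extension before uppercasing is an assumption, since the guideline does not pin that down.

```python
import re

# Illustrative helper: derive the guard macro the guideline asks for from a
# header filename. Extension handling is an assumption.
def expected_guard(filename: str) -> str:
    stem = re.sub(r"\.(cuh|h|hpp)$", "", filename)
    # FILENAME_IN_CAPS: split CamelCase words, uppercase, join with '_'.
    words = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", stem)
    return "TRTLLM_" + "_".join(w.upper() for w in words) + "_H"
```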
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustom.cu`:
- Around line 603-605: The single-cluster kernel is being selected only by token
count but must also be gated by device compute capability; change the
useSingleCluster boolean to require SM90 or newer in addition to data.mNumTokens
<= MaxNumTokensSingleClusterScores (e.g. compute capability major >= 9 or via
your project's helper like isSM90OrGreater()), and apply the same guard where
the identical cluster-vs-block decision is made (the other block that sets
useSingleCluster/useSingleBlock). Ensure you reference and update the existing
symbols useSingleCluster, useSingleBlock, MaxNumTokensSingleClusterScores, and
BlockKernelMaxNumTokens so the single-cluster path is never chosen on pre-SM90
devices.
- Around line 137-144: scoreIdx.idx loaded from params.mPtrTopKPacked can be
invalid (e.g., -1 or >= MaxNumExperts) and is used to compute offset and index
smemKIdx and params.mPtrTopKWeights; add a bounds check before using it: verify
scoreIdx.idx is >= 0 and < MaxNumExperts (or < params.numExperts if that symbol
exists) and only then compute offset = warpIdx * MaxNumExperts + scoreIdx.idx
and write smemKIdx[offset] and params.mPtrTopKWeights[expandedIdx]; if the index
is out of range, skip the shared-memory write and set
params.mPtrTopKWeights[expandedIdx] to a safe default (e.g., zero) or leave it
unchanged per surrounding logic to avoid out-of-bounds writes.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustomPolicy.cuh`:
- Around line 108-111: The comment block for "SigmoidBias" is duplicated; remove
the redundant copy so only a single documentation block remains for SigmoidBias
in RoutingCustomPolicy.cuh (the duplicate describes "applies sigmoid(score) +
bias[expertIdx] for topK selection" and is repeated). Keep one clear occurrence
and delete the second duplicate to avoid redundant comments near the SigmoidBias
description.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingDeepSeek.cu`:
- Around line 499-512: The precomputed DeepSeek scales (passed via
mPtrTopKIds/mPtrTopKWeights) should not be sent through runPostTopKPipeline()
because that helper assumes RoutingPostprocessType::Softmax and will re-softmax
already-final scales; instead detect the DeepSeek case (e.g., when mPtrTopKIds
!= nullptr and mPtrTopKWeights != nullptr and routing method is DeepSeek) and
route it through the non-softmax post-topK path or an alternate API that
preserves final scales (avoid RoutingPostprocessType::Softmax). Update the
branch around runPostTopKPipeline to call the DeepSeek-aware post-topK handler
(or pass a parameter to disable softmax) when mPtrTopKWeights contains final
token scales so the TopK-input path matches the logits path for DeepSeek.
- Around line 524-528: The check that gates "permuted index" work only ensures
mPtrTopKPacked and mPtrPermutedIdxSize, but kernels in step 2 also
unconditionally write grouped-GEMM launch-config buffers
(mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrNumNonExitingCtas), causing
possible segfaults; update the validation in the same conditional (and the
similar blocks covering the 560-611 region) to require those three buffers as
well when any of mPtrExpandedIdxToPermutedIdx, mPtrPermutedIdxToExpandedIdx, or
mPtrPermutedIdxToTokenIdx (or when mPtrPermutedIdxSize is non-zero) is present,
using TLLM_CHECK_WITH_INFO and the same error message style so callers get a
clear failure rather than a crash.
- Around line 162-167: The early return inside the KernelParams::UseGroups
branch (checking warpIdx >= params.mNumExpertGroups) exits inactive warps before
the block-wide barriers and causes CTA deadlocks; instead, keep all warps alive
through the kernel-level __syncthreads() calls and only skip per-warp work.
Replace the return with a warp-inactive flag or branch so inactive warps do no
work but still reach the later __syncthreads(); update the same pattern in the
other occurrence that checks warpIdx against params.mNumExpertGroups (the second
block around the routing logic) so inactive warps participate in barriers while
performing no work.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingFromTopKIds.cu`:
- Around line 84-103: routingCustom::launchInitExpertCounts is called in both
branches, but data.mPtrExpertCounts is only validated on the non-coop path, so
the coop path (selected via canUseCoop/useCoop and numBlocksCoop) can
dereference a device null pointer. Move (or duplicate) the
TLLM_CHECK_WITH_INFO asserting data.mPtrExpertCounts != nullptr so it runs
before the coop/non-coop decision, i.e., before any launchInitExpertCounts
invocation, keeping the existing error message unchanged.
- Around line 62-64: The single-cluster path is only safe on SM90+ devices, but
useSingleCluster currently checks only the token count. Require both
data.mNumTokens <= routingCustom::MaxNumTokensSingleClusterScores and an
SM90-or-newer device (e.g., via an existing helper like isSM90OrGreater(), or
cudaGetDeviceProperties and deviceProp.major/deviceProp.minor), so the cluster
kernel variant can never be chosen on pre-SM90 hardware.
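The two fixes above can be sketched together as host-side selection logic. This is a minimal illustration, not the actual runner code: `Data`, the constant value, and `isSM90OrGreater` are stand-ins mirroring the names in the comments, a plain `assert` stands in for `TLLM_CHECK_WITH_INFO`, and the SM major version is passed in explicitly instead of being queried from the device.

```cpp
#include <cassert>

// Hypothetical stand-ins for the symbols named in the review comments.
constexpr int MaxNumTokensSingleClusterScores = 1024; // assumed value
struct Data
{
    int mNumTokens = 0;
    int* mPtrExpertCounts = nullptr;
};

// Assumption: the real code would consult cudaGetDeviceProperties; here the
// SM major version is a parameter so the logic is testable on the host.
bool isSM90OrGreater(int smMajor)
{
    return smMajor >= 9;
}

// Sketch: validate mPtrExpertCounts BEFORE any launch-path decision, and
// gate the single-cluster variant on both token count and SM90+.
bool chooseSingleCluster(Data const& data, int smMajor)
{
    // Null check hoisted above the coop/non-coop selection, so no branch
    // can reach launchInitExpertCounts with a null expert-counts buffer.
    assert(data.mPtrExpertCounts != nullptr && "expert counts buffer required");
    return data.mNumTokens <= MaxNumTokensSingleClusterScores
        && isSM90OrGreater(smMajor);
}
```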

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingLlama4.cu`:
- Around line 558-574: The fast-path early return that forwards to
runPostTopKPipeline when mPtrTopKIds or mPtrTopKPacked (without mPtrScores) is
present must also validate mPtrExpertCounts whenever the downstream cooperative
path can be chosen. Add a TLLM_CHECK_WITH_INFO that data.mPtrExpertCounts !=
nullptr whenever data.mPtrPermutedIdxSize != nullptr (or, more generally,
whenever runPostTopKPipeline may call launchInitExpertCounts/launchCoopKernel),
keep the existing mPtrTopKWeights check when mPtrTopKIds is set, and call
runPostTopKPipeline only after the expert-counts check passes.

In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu`:
- Around line 59-63: In Runner::Runner, validate that mClusterSizeInBatchDim is
a positive power of two and divides mTileTokensDim before any later use; then
compute and assign mClusterSizeLog2 = computeLog2(mClusterSizeInBatchDim) in
the constructor after validation. On failure, throw a clear
std::invalid_argument or assert with a descriptive message so invalid values
cannot propagate into the kernel shift math.
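A minimal sketch of this constructor-time validation, with the Runner class elided: the names (`mClusterSizeInBatchDim`, `mTileTokensDim`, `computeLog2`) follow the review comment, but this free function is illustrative rather than the actual implementation.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Simple log2 for a power-of-two input (stand-in for the real computeLog2).
inline int computeLog2(uint32_t v)
{
    int log2 = 0;
    while (v >>= 1)
        ++log2;
    return log2;
}

// Validate before use, then derive the shift amount; throws on bad input so
// invalid values cannot reach the kernel shift math.
int validateClusterSize(uint32_t clusterSizeInBatchDim, uint32_t tileTokensDim)
{
    // Positive power of two: non-zero with exactly one bit set.
    bool const isPow2 = clusterSizeInBatchDim != 0
        && (clusterSizeInBatchDim & (clusterSizeInBatchDim - 1)) == 0;
    if (!isPow2 || tileTokensDim % clusterSizeInBatchDim != 0)
    {
        throw std::invalid_argument(
            "mClusterSizeInBatchDim must be a positive power of two dividing "
            "mTileTokensDim, got "
            + std::to_string(clusterSizeInBatchDim));
    }
    return computeLog2(clusterSizeInBatchDim); // mClusterSizeLog2
}
```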

In `@cpp/tests/unit_tests/kernels/routing/routingCustomTest.cpp`:
- Line 1458: The closing brace of the namespace carries the generic comment
"// end namespace"; replace it with "// namespace <actual_namespace_name>",
naming the namespace declared earlier in the file, so the closing brace
explicitly names the namespace per the guideline.

In `@cpp/tests/unit_tests/kernels/routing/routingTest.h`:
- Around line 404-408: build() overrides the withNormTopkProb(bool) setter on
RoutingKernelTestParam. Preserve explicit caller intent by tracking whether
withNormTopkProb was invoked (e.g., a bool flag like normTopkProbSet) and only
compute/reset normTopkProb inside build() when that flag is false; update
withNormTopkProb() to set the flag and make build() (and any
routingMethod-based reassignment) respect it so explicit calls are not
overwritten.
- Around line 311-315: Validate mExpertParallelization and
mExpertParallelizationId before using them to derive local ranges: require
mExpertParallelization > 0, numExperts divisible by mExpertParallelization (or
handle the remainder explicitly), and 0 <= mExpertParallelizationId <
mExpertParallelization. Add these checks where the partitioning occurs (the
local-range derivation referenced by withExpertParallelization) and
reject invalid inputs via return/log/assert so the code never divides by zero
or indexes out of bounds when computing per-expert ranges.
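Both test-helper fixes can be sketched on a stripped-down builder. The field names follow the review comments, but `Param` is a hypothetical stand-in for RoutingKernelTestParam and the routingMethod-based default in build() is assumed for illustration.

```cpp
#include <stdexcept>

struct Param
{
    int numExperts = 128;
    bool normTopkProb = false;
    bool normTopkProbSet = false; // tracks explicit caller intent
    int mExpertParallelization = 1;
    int mExpertParallelizationId = 0;

    Param& withNormTopkProb(bool v)
    {
        normTopkProb = v;
        normTopkProbSet = true; // build() must not override this
        return *this;
    }

    Param& withExpertParallelization(int ep, int epId)
    {
        // Validate before any partitioning math: no division by zero,
        // no remainder, no out-of-range rank id.
        if (ep <= 0 || numExperts % ep != 0)
            throw std::invalid_argument("invalid mExpertParallelization");
        if (epId < 0 || epId >= ep)
            throw std::invalid_argument("invalid mExpertParallelizationId");
        mExpertParallelization = ep;
        mExpertParallelizationId = epId;
        return *this;
    }

    Param& build()
    {
        // Only derive the default when the caller did not set it explicitly.
        if (!normTopkProbSet)
            normTopkProb = true; // assumed routingMethod-based default
        return *this;
    }
};
```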

In `@tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py`:
- Around line 700-720: get_tuning_config() is cached solely by ep_size, while
get_constraint_specs() varies by SM (via get_sm_version()), so the wrong
hidden_states_scale constraints can be reused across SM90 vs SM100/103.
Include the SM version/compute capability in the caching key for
get_tuning_config (and any related tuner cache), e.g., the tuple (ep_size,
sm_version), and adjust cache lookups, insertions, and any invalidation logic
accordingly.

In `@tests/unittest/_torch/modules/moe/moe_test_utils.py`:
- Around line 354-368: The NVFP4 + DeepEP protective skip was commented out,
leaving QuantAlgo.NVFP4 unprotected (see deepep_crash_quant_algos and the
commented if-block). Either re-enable the guard that checks QuantAlgo.NVFP4
with num_experts >= 256 and model_config.hidden_size >= 7168 so the function
returns the crash warning (including comm_method) and skips those configs, or,
if there is a confirmed fix, delete the block and add a clear comment
referencing the fix and a changelog/issue ID.

In `@tests/unittest/_torch/modules/moe/test_moe_module.py`:
- Around line 260-262: supports_autotuner_capture is called with three
arguments (backend_type, quant_algo, use_flashinfer) but its signature accepts
only two, raising a TypeError. Either drop the extraneous use_flashinfer
argument (i.e., call supports_autotuner_capture(backend_type, quant_algo)) or
update the definition to accept and handle use_flashinfer if that behavior is
intended, adjusting the surrounding if condition accordingly.

In `@tests/unittest/_torch/thop/serial/test_moe.py`:
- Around line 312-316: routing_reference_minimax assumes routing_bias is
present and raises an unhelpful attribute error when it is missing. Add an
explicit check at the start of the function (e.g., if routing_bias is None)
and raise a clear ValueError or TypeError stating that MiniMax2 routing
requires routing_bias, so callers understand why the subsequent
routing_bias.to(torch.float) call needs it.

---

Outside diff comments:
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu`:
- Line 2: Update the file-header copyright line from "Copyright (c) 2022-2025,
NVIDIA CORPORATION." to "Copyright (c) 2022-2026, NVIDIA CORPORATION." so it
includes the latest modification year.

In `@cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp`:
- Line 2: Update the file header's copyright year range from "2022-2025" to
"2022-2026" in the top comment so it reflects the latest meaningful
modification.

In `@cpp/tests/unit_tests/kernels/routing/routingTest.h`:
- Line 2: Update the top-of-file copyright header in routingTest.h from
"2022-2025" to "2022-2026", per the coding guideline requiring the latest
meaningful modification year.

---

Nitpick comments:
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustomPolicy.cuh`:
- Around line 410-415: The code logs "Unsupported numExperts" and returns 0,
silently propagating an invalid state. Fail fast instead: use TLLM_CHECK(false,
"Unsupported numExperts: %d", numExperts) or throw a std::runtime_error with
the same formatted message so callers cannot continue with an invalid
configuration. If this path must be valid in a constexpr context, use a
compile-time guard (static_assert) for unsupported numExperts values and keep
the runtime check for non-constexpr builds.
- Around line 355-376: The namespace-scope static constexpr constants don't
follow the project's constant naming convention. Rename each to the k-prefixed
uppercase snake_case form (e.g., NumExperts128Experts -> kNUM_EXPERTS_128,
NumTop8Experts -> kNUM_TOP_8, NumThreads -> kNUM_THREADS, MaxSupportedExperts
-> kMAX_SUPPORTED_EXPERTS, MaxNumTokensSingleCluster ->
kMAX_NUM_TOKENS_SINGLE_CLUSTER, MaxNumTokensSingleClusterScores ->
kMAX_NUM_TOKENS_SINGLE_CLUSTER_SCORES, BlockKernelMaxNumTokens ->
kBLOCK_KERNEL_MAX_NUM_TOKENS) and update all usages. The affected symbols are
NumExperts128Experts, NumExperts160Experts, NumExperts256Experts,
NumExperts384Experts, NumExperts512Experts, NumExperts576Experts,
MaxSupportedExperts, NumTop4Experts, NumTop8Experts, NumTop16Experts,
NumTop22Experts, MaxSupportedTopExperts, NumThreads, NumWarps,
MaxNumTokensSingleCluster, MaxNumTokensSingleClusterScores, and
BlockKernelMaxNumTokens; fix any references throughout the translation unit.
- Line 16: Replace the lone '#pragma once' in RoutingCustomPolicy.cuh with the
project's include-guard format: wrap the entire header between `#ifndef`
TRTLLM_ROUTING_CUSTOM_POLICY_CUH_H / `#define`
TRTLLM_ROUTING_CUSTOM_POLICY_CUH_H and close with `#endif` //
TRTLLM_ROUTING_CUSTOM_POLICY_CUH_H.
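Two of the nitpicks above (the include guard and the fail-fast path) can be sketched in one small header-style fragment. This is illustrative only: the constant values are assumed, `threadsForExperts` is a hypothetical function, and `std::runtime_error` stands in for the TLLM_CHECK machinery.

```cpp
// Project-style include guard replacing '#pragma once'.
#ifndef TRTLLM_ROUTING_CUSTOM_POLICY_CUH_H
#define TRTLLM_ROUTING_CUSTOM_POLICY_CUH_H

#include <stdexcept>
#include <string>

// Example k-prefixed constants per the naming nitpick (values assumed).
static constexpr int kNUM_EXPERTS_128 = 128;
static constexpr int kNUM_EXPERTS_256 = 256;

inline int threadsForExperts(int numExperts)
{
    if (numExperts == kNUM_EXPERTS_128 || numExperts == kNUM_EXPERTS_256)
        return numExperts; // one thread per expert, for illustration only
    // Fail fast instead of logging "Unsupported numExperts" and returning 0.
    throw std::runtime_error("Unsupported numExperts: " + std::to_string(numExperts));
}

#endif // TRTLLM_ROUTING_CUSTOM_POLICY_CUH_H
```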

In `@cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp`:
- Around line 23-24: The file-level namespace alias btg =
batchedGemm::trtllm::gen is declared outside the
TRTLLM_NAMESPACE_BEGIN/namespace torch_ext block, inconsistent with other MoE
files. Move the alias inside namespace torch_ext (after TRTLLM_NAMESPACE_BEGIN
and before its usages) so btg is scoped the same as in fp8BlockScaleMoe.cpp
and fp4BlockScaleMoe.cpp, updating any references if necessary.

In `@cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp`:
- Around line 1-2: Update the copyright header at the top of
fp8PerTensorScaleMoe.cpp from "2022-2024" to "2022-2026" so it reflects the
latest meaningful modification year per project guidelines.

In `@tensorrt_llm/_torch/modules/fused_moe/routing.py`:
- Around line 442-472: SigmoidRenormMoeRoutingMethod.__init__ stores top_k and
num_experts without validating them, causing obscure failures later when
apply() calls torch.topk. Validate in __init__ that num_experts is a positive
integer and top_k is an integer between 1 and num_experts (inclusive), raising
a clear ValueError that includes the offending values so misconfiguration
fails fast with actionable output.

In `@tests/unittest/_torch/modules/moe/moe_test_utils.py`:
- Around line 287-309: The commented-out skip block (checking quant_algo ==
QuantAlgo.NVFP4, swiglu_gptoss_style, seq_len == 1, num_experts >= 256, and
model_config.hidden_size >= 7168) should either be deleted entirely, including
the explanatory text and return message, if the PDL fix permanently resolves
the NVFP4 tactic[204] accuracy bug, or re-enabled as an active guard (restore
the if + return) with a one-line TODO referencing the PR/PDL that fixed it and
the tests that must pass before removal.

In `@tests/unittest/_torch/modules/moe/test_moe_module.py`:
- Line 1153: The test hardcodes seq_lens=[1, 8] for CI even though SEQ_LENS is
already defined as [1, 8]; replace the literal in the conditional expression
with seq_lens=SEQ_LENS so the test has a single source of truth.

In `@tests/unittest/_torch/thop/serial/test_moe.py`:
- Around line 941-944: Remove the commented-out pytest skip decorators (lines
containing "# `@pytest.mark.skip`") in
tests/unittest/_torch/thop/serial/test_moe.py, notably the blocks around lines
941-944, 1079-1082, 2074-2077, and 2303-2306, leaving the surrounding test
functions and their active decorators intact (no other functional changes), so
the test files stay clean and intent is clear.
- Around line 1488-1493: The test forces expert_logits to torch.float for
RoutingMethodType.MiniMax2 and RoutingMethodType.SigmoidRenorm, skipping bf16
coverage. Update both run_moe_fp4_test and run_moe_fp8_fp4_test to use the
same parametrized dtype selection as the other routing methods so these two
new routing types also get a bf16 branch: locate the expert_logits assignment
(which uses routing_method_type) and mirror the dtype logic used elsewhere so
MiniMax2 and SigmoidRenorm are included in bf16 coverage.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 64bd34ae-1efa-4290-8e55-af1992902818

📥 Commits

Reviewing files that changed from the base of the PR and between 93b0dc7 and fda4458.

📒 Files selected for processing (48)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/IntFastDiv.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustom.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustomPolicy.cuh
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingDeepSeek.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingDevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingFromTopKIds.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingKernel.cuh
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingKernelTopK.cuh
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingLlama4.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/RoutingDeepSeekCommon.cuh
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchClusterKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchCoopKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchHistogramKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchInitExpertCounts.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchMainKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchOffsetsKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/RoutingRenormalizeCommon.cuh
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchBlockKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchClusterKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchHistogramKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchHistogramScoresKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchInitExpertCounts.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchOffsetsKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h
  • cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
  • cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp
  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • cpp/tests/unit_tests/kernels/CMakeLists.txt
  • cpp/tests/unit_tests/kernels/routing/routingCustomTest.cpp
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
  • cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • cpp/tests/unit_tests/kernels/routing/routingTest.cpp
  • cpp/tests/unit_tests/kernels/routing/routingTest.h
  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
  • tensorrt_llm/_torch/models/modeling_minimaxm2.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/routing.py
  • tests/unittest/_torch/modules/moe/moe_test_utils.py
  • tests/unittest/_torch/modules/moe/test_moe_module.py
  • tests/unittest/_torch/thop/serial/test_moe.py
💤 Files with no reviewable changes (18)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchHistogramKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/RoutingDeepSeekCommon.cuh
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchOffsetsKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchHistogramKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchMainKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchHistogramScoresKernel.cu
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/RoutingRenormalizeCommon.cuh
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchInitExpertCounts.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchOffsetsKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchBlockKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchCoopKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchInitExpertCounts.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchClusterKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchClusterKernel.cu

@ChristinaZ ChristinaZ force-pushed the refactor_routing branch 2 times, most recently from 6b22b9d to 5ea7944 Compare March 22, 2026 13:53
@ChristinaZ
Collaborator Author

/bot run --add-multi-gpu-test --test-backend "pytorch"

@tensorrt-cicd
Collaborator

PR_Github #39823 [ run ] triggered by Bot. Commit: 5ea7944 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39823 [ run ] completed with state FAILURE. Commit: 5ea7944
/LLM/main/L0_MergeRequest_PR pipeline #31000 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@ChristinaZ
Collaborator Author

/bot run --add-multi-gpu-test --test-backend "pytorch"

@tensorrt-cicd
Collaborator

PR_Github #39909 [ run ] triggered by Bot. Commit: b9f9920 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39909 [ run ] completed with state FAILURE. Commit: b9f9920
/LLM/main/L0_MergeRequest_PR pipeline #31076 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@ChristinaZ
Collaborator Author

/bot run --add-multi-gpu-test --test-backend "pytorch"

Add support for float32 and bfloat16 for the input, bias and output data type in routing

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
@ChristinaZ
Collaborator Author

/bot run --add-multi-gpu-test --test-backend "pytorch"

@tensorrt-cicd
Collaborator

PR_Github #40338 [ run ] triggered by Bot. Commit: 3a22d88 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40338 [ run ] completed with state SUCCESS. Commit: 3a22d88
/LLM/main/L0_MergeRequest_PR pipeline #31443 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@ChristinaZ
Collaborator Author

/bot run --add-multi-gpu-test --test-backend "pytorch"

@tensorrt-cicd
Collaborator

PR_Github #40410 [ run ] triggered by Bot. Commit: 3a22d88 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40410 [ run ] completed with state FAILURE. Commit: 3a22d88
/LLM/main/L0_MergeRequest_PR pipeline #31503 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@ChristinaZ
Collaborator Author

/bot run --add-multi-gpu-test --test-backend "pytorch"

1 similar comment
@ChristinaZ
Collaborator Author

/bot run --add-multi-gpu-test --test-backend "pytorch"

@tensorrt-cicd
Collaborator

PR_Github #40642 [ run ] triggered by Bot. Commit: 3a22d88 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40643 [ run ] triggered by Bot. Commit: 3a22d88 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40642 [ run ] completed with state ABORTED. Commit: 3a22d88

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40643 [ run ] completed with state SUCCESS. Commit: 3a22d88
/LLM/main/L0_MergeRequest_PR pipeline #31678 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

