[None][feat] Add routing support for the new model for both cutlass and trtllm moe backend by ChristinaZ · Pull Request #9792 · NVIDIA/TensorRT-LLM

ChristinaZ · 2025-12-08T14:12:17Z

Summary by CodeRabbit

Release Notes

New Features
- Support for larger Mixture-of-Experts models with up to 512 experts.
- Increased maximum top-k selection capacity from 10 to 22 experts.
Improvements
- Enhanced routing validation and constraint checking for expert selection.
- Stricter group-based routing requirements for improved stability.
Tests
- Added comprehensive test coverage for 512-expert configurations.

Description

Add routing support for the new model (up to 512 experts, top-k up to 22) to both the CUTLASS and TRTLLM MoE backends.

Test Coverage


pytest -s tests/unittest/_torch/thop/parallel/test_noaux_tc.py

cd cpp/build
make -j$(nproc) google-tests
./tests/unit_tests/kernels/routingKernelsTest --gtest_filter=RoutingDeepSeekKernelTest/*

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai · 2025-12-08T14:20:44Z

📝 Walkthrough

The pull request extends TensorRT-LLM's MoE routing kernels to support higher expert counts (up to 512) and increased top-k values (up to 22). Changes introduce new compile-time constants, add a MaxNumTopExperts template parameter to kernel configurations, relax validation bounds, tighten group-related conditionals, and expand test coverage across C++ kernels, Python routing logic, and unit tests.

Changes

Cohort / File(s)	Summary
Kernel constant and template parameter definitions `cpp/tensorrt_llm/kernels/noAuxTcKernels.cu`, `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h`, `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh`	Introduces compile-time constants (`NumNemotronExperts=512`, `MaxSupportedExpertCount`, `MaxSupportedTopExperts=22`, `DefaultMaxNumTopExperts=8`). Adds `MaxNumTopExperts_` template parameter to `KernelParams` struct. Renames `MaxNumTopK` to `MaxSupportedTopExperts` (value 10→22). Updates top-k kernel template signature with new `MaxNumTopExperts` parameter.
Core routing kernel implementations `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu`, `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu`	Replaces hard-coded expert/top-k bounds with parameterized constants. Updates array sizing, memory access patterns, and validation checks to use new bounds. Adjusts group-based routing logic to enforce index bounds. Extends `getMaxNumExperts` to handle Nemotron-scale experts. Replaces `MaxNumTopExperts` with `MaxSupportedTopExperts` throughout.
Kernel launch infrastructure `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h`, `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu`	Updates `LAUNCH_ROUTING_WITH_NUM_EXPERTS_FORCE_FLOAT_INPUT` macro to propagate new `numTopExperts` parameter through all branches. Increases DeepSeek routing validation limit from top_k≤8 to top_k≤22.
Torch op (thop) validation logic `cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp`, `cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp`, `cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp`, `cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp`	Tightens group-related condition checks from `n_group != 0` to `n_group > 1` (all fp4/fp8 variants), so group-specific validations are skipped when `n_group == 1`. Expands allowed top_k from ≤10 to ≤22 in `mxFp4BlockScaleMoe`.
Python routing implementation `tensorrt_llm/_torch/modules/fused_moe/routing.py`	Updates `Deepseekv3RoutingImpl.noaux_tc` branch for `n_group == 1`: changes threshold logic from `num_experts > 384 or top_k > 8` to `num_experts > 512 or (top_k > 8 and top_k != 22)`. Replaces score masking from element-wise multiplication to `torch.where` with -inf assignment for masked positions.
Test coverage expansion `cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp`, `tests/unittest/_torch/thop/parallel/test_noaux_tc.py`, `tests/unittest/_torch/thop/serial/test_moe.py`	Adds 512-expert test configurations for DeepSeekV3 routing across parallelization modes. Expands test matrix with `(512, 1, 1, 6)` and `(512, 1, 1, 22)` parameter combinations. Increases top_k bound validation from ≤10 to ≤22. Adds new routing test case `RoutingDS_SuperV3`.
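The `n_group == 1` threshold change described for `routing.py` can be read as a simple predicate. The following is a minimal sketch, assuming (as the review comment on that file suggests) that the fused kernel path is used when the condition is false. The function names are hypothetical and are not taken from the PR.

```python
def old_condition(num_experts: int, top_k: int) -> bool:
    """Pre-PR threshold as quoted in the change summary."""
    return num_experts > 384 or top_k > 8


def new_condition(num_experts: int, top_k: int) -> bool:
    """Post-PR threshold as quoted in the change summary."""
    return num_experts > 512 or (top_k > 8 and top_k != 22)


# Illustrative (num_experts, top_k) pairs; a True result means the
# condition fires and the fused path is (by assumption) not used.
for cfg in [(256, 8), (512, 6), (512, 22), (512, 10), (1024, 8)]:
    print(cfg, "old:", old_condition(*cfg), "new:", new_condition(*cfg))
```

Under this reading, a 512-expert configuration with top_k of 6 or 22 no longer trips the condition, while top_k values between 9 and 21 still do.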

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Heterogeneous changes across 15 files: Kernel implementations, validation logic, Python routing, and tests require separate reasoning for each subsystem.
Critical kernel modifications: RoutingDeepSeek.cu and RoutingRenormalize.cu involve intricate changes to memory indexing, bounds checking, and control flow that necessitate careful verification of correctness across expert/top-k combinations.
Template parameter propagation: New MaxNumTopExperts parameter threading through macro invocations and kernel signatures requires tracing consistency across multiple files.
Validation tightening vs. relaxation: Group conditions narrowed (n_group > 1) while top-k limits expanded (≤22), requiring verification that interactions do not create new edge cases or conflicts.

Areas requiring extra attention:

RoutingDeepSeek.cu: Verify bounds checking logic for 512-expert Nemotron path and updated getMaxNumExperts branching.
RoutingRenormalize.cu: Ensure all array indexing, shared memory writes, and validation checks use consistent new bounds (MaxSupportedTopExperts).
tensorrt_llm/_torch/modules/fused_moe/routing.py: Review masking change from multiplication to -inf assignment and threshold logic update (top_k != 22 special case).
Macro propagation in DevKernel.h: Confirm numTopExperts is threaded through all conditional branches without omission.
thop validation consistency: Verify the n_group > 1 condition is applied uniformly across all four thop files and correctly skips group checks for n_group == 1.
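To make the last item concrete, the two guard conditions differ only in how they treat `n_group == 1`. This is a small sketch with hypothetical function names; the actual group-specific validations in the thop files are not reproduced here.

```python
def runs_group_checks_old(n_group: int) -> bool:
    """Guard before the PR: group validations run for any non-zero n_group."""
    return n_group != 0


def runs_group_checks_new(n_group: int) -> bool:
    """Guard after the PR: group validations run only for more than one group."""
    return n_group > 1


for n_group in (0, 1, 8):
    print(n_group, runs_group_checks_old(n_group), runs_group_checks_new(n_group))
```

Only the single-group case changes behavior, which is the case the 512-expert configurations in this PR appear to use.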

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 6.45% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly identifies the main change: adding routing support for a new model across both cutlass and trtllm MoE backends, which aligns with the file summaries.
Description check	✅ Passed	The pull request description is mostly complete with a clear summary and provided test coverage, though the general description could be more detailed about specific changes.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (6)

cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (1)

1-2: Update copyright year to include 2025.

As per coding guidelines, all TensorRT-LLM OSS code files should include the current year in the copyright header.

 /*
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.

cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)

1-2: Update copyright year to include 2025.

As per coding guidelines, all TensorRT-LLM OSS code files should include the current year in the copyright header.

 /*
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.

cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp (1)

1-2: Update copyright year to include 2025.

As per coding guidelines, all TensorRT-LLM OSS code files should include the current year in the copyright header.

 /*
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.

tensorrt_llm/_torch/modules/fused_moe/routing.py (1)

253-301: DeepSeekV3 routing bounds and masking changes look correct; consider a tiny allocation tweak.

The updated n_group == 1 condition cleanly restricts the fused path to num_experts <= 512 and top_k <= 8 (or top_k == 22 for the Nemotron Super v3 special case), which matches the new kernel capabilities. Replacing scores_with_bias *= score_mask with torch.where(..., -inf) correctly prevents masked experts from being reselected by the subsequent top‑k.

If you want to shave a small allocation in torch.where, you could use scores_with_bias.new_full((), float('-inf')) instead of constructing a fresh scalar tensor each call, but this is a minor optimization.
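The masking point can be illustrated without PyTorch. The sketch below uses plain Python lists and assumes biased scores may be negative, which the comment does not state explicitly. The helper names are hypothetical; the actual code uses `torch.where` on tensors.

```python
import math


def topk_indices(scores, k):
    """Return indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def mask_by_multiply(scores, keep):
    """Old-style masking: masked positions become 0.0."""
    return [s * (1.0 if m else 0.0) for s, m in zip(scores, keep)]


def mask_by_neg_inf(scores, keep):
    """New-style masking: masked positions become -inf."""
    return [s if m else -math.inf for s, m in zip(scores, keep)]


scores = [-0.5, 0.9, -0.2, 0.7]      # experts 0 and 2 have negative scores
keep = [True, False, True, False]    # only experts 0 and 2 are allowed

picked_multiply = topk_indices(mask_by_multiply(scores, keep), 2)
picked_neg_inf = topk_indices(mask_by_neg_inf(scores, keep), 2)
print(picked_multiply, picked_neg_inf)
```

Here the multiplicative mask leaves masked experts at 0.0, which outranks the negative scores of the allowed experts, so the top-2 selection returns the masked experts. With -inf, only the allowed experts can be selected.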

cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (1)

31-44: DeepSeek no‑aux top‑k kernel extensions match the new 512/22 config; document the MaxNumTopExperts invariant.

The new expert/top‑k constants, extra MaxNumTopExperts template parameter, multi‑warp shared‑memory initialization, and the invokeNoAuxTc dispatch logic (including the dedicated 512‑expert/22‑way single‑group path) are all consistent with supporting Nemotron Super v3 while preserving existing DeepSeek (≤256‑expert, top_k≤8) group behavior.

These implementations implicitly rely on topk <= MaxNumTopExperts for each instantiated kernel (8 for existing paths, 22 for the 512‑expert specialization). It would be good to keep that invariant explicit in comments or host‑side checks if new DeepSeek configs are added later, so we never accidentally launch with a larger runtime topk than the compile‑time MaxNumTopExperts.

Also applies to: 47-52, 120-129, 166-226, 273-334
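A host-side guard of the kind suggested above might look like the following. This is a hypothetical Python sketch of the check only; the real dispatch lives in CUDA/C++ and is not reproduced here. The two caps are the values named in the review (8 for existing paths, 22 for the 512-expert specialization).

```python
# Caps named in the review; the constant names follow the change summary.
DEFAULT_MAX_NUM_TOP_EXPERTS = 8
MAX_SUPPORTED_TOP_EXPERTS = 22


def check_topk_invariant(topk: int, max_num_top_experts: int) -> None:
    """Hypothetical guard: reject a launch whose runtime topk exceeds the
    cap the kernel was instantiated with."""
    if topk > max_num_top_experts:
        raise ValueError(
            f"topk={topk} exceeds compile-time cap {max_num_top_experts}"
        )


check_topk_invariant(8, DEFAULT_MAX_NUM_TOP_EXPERTS)   # existing paths
check_topk_invariant(22, MAX_SUPPORTED_TOP_EXPERTS)    # 512-expert path
```

Calling `check_topk_invariant(22, DEFAULT_MAX_NUM_TOP_EXPERTS)` raises `ValueError`, which is the accidental launch the comment warns about.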

cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (1)

247-366: Consider adding a test with useTopKAsInput=true for 512-expert configuration.

The new tests cover the score-based routing path well. For completeness, consider adding a test variant similar to DeviceLevelParallelization that tests the pre-computed topK path (useTopKAsInput=true) with 512 experts.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 98db262 and bf73bf5.

📒 Files selected for processing (15)

cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (6 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (10 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (8 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu (1 hunks)
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1 hunks)
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (1 hunks)
cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp (1 hunks)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (2 hunks)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (3 hunks)
tensorrt_llm/_torch/modules/fused_moe/routing.py (2 hunks)
tests/unittest/_torch/thop/parallel/test_noaux_tc.py (1 hunks)
tests/unittest/_torch/thop/serial/test_moe.py (2 hunks)

🧰 Additional context used

📓 Path-based instructions (5)

**/*.{cpp,h,cu}