[None][feat] Add bf16 trtllm-gen moe support through flashinfer. #12738
nv-guomingz wants to merge 1 commit into NVIDIA:main from
Conversation
📝 Walkthrough
This PR adds BF16 unquantized MoE execution support for the TRTLLM backend using FlashInfer. Changes introduce a new BF16 TRTLLM-Gen fused MoE method and backend dispatch path.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client as Client/Configurable MoE
participant Resolver as resolve_moe_cls()
participant Selector as get_moe_cls()
participant TRTLLMGen as TRTLLMGenFusedMoE
participant Backend as MoE Op Backend
participant FlashInfer as FlashInfer/CUDA Ops
Client->>Resolver: resolve_moe_cls(model_config, routing_method, dtype, override_quant_config)
Resolver->>Selector: get_moe_cls(model_config, routing_method, dtype, override_quant_config)
alt has_quant == True
Selector->>TRTLLMGen: Check specific quant predicates
TRTLLMGen-->>Selector: Return if match
else has_quant == False
Selector->>TRTLLMGen: Check BF16 + FlashInfer availability
alt BF16 && FlashInfer available
TRTLLMGen-->>Selector: Return TRTLLMGenFusedMoE
else Missing FlashInfer
Selector-->>Resolver: Raise RuntimeError
end
end
Resolver->>TRTLLMGen: Check routing method support for BF16 unquantized
alt Routing not supported && unquantized
Resolver->>Resolver: Fall back to CutlassFusedMoE
end
Resolver-->>Client: MoE class selected
Client->>TRTLLMGen: create_weights() / load_weights()
TRTLLMGen->>Backend: run_bf16_moe() (if BF16 unquantized)
Backend->>FlashInfer: trtllm_bf16_moe() or trtllm_bf16_routed_moe()
FlashInfer-->>Backend: Output tensor
Backend-->>TRTLLMGen: Result
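The branch structure in the diagram can be sketched as plain Python. The class names and boolean flags below are stand-ins for the real TensorRT-LLM symbols, not the actual API:

```python
class CutlassFusedMoE: ...        # stand-in for the real backend class
class TRTLLMGenFusedMoE: ...      # stand-in for the real backend class

def get_moe_cls(dtype, has_quant, quant_matches_trtllm_gen, flashinfer_available):
    """Quantized predicates are checked first, then the BF16 branch."""
    if has_quant:
        return TRTLLMGenFusedMoE if quant_matches_trtllm_gen else CutlassFusedMoE
    if dtype == "bfloat16":
        if flashinfer_available:
            return TRTLLMGenFusedMoE
        raise RuntimeError(
            "BF16 TRTLLM-Gen path requires FlashInfer trtllm_bf16_moe")
    return CutlassFusedMoE

def resolve_moe_cls(dtype, has_quant, quant_matches_trtllm_gen,
                    flashinfer_available, routing_supported):
    cls = get_moe_cls(dtype, has_quant, quant_matches_trtllm_gen,
                      flashinfer_available)
    # An unsupported routing method on the unquantized path falls back to Cutlass.
    if cls is TRTLLMGenFusedMoE and not has_quant and not routing_supported:
        return CutlassFusedMoE
    return cls
```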
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 6
🧹 Nitpick comments (3)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
5963-5973: Assert the selected MoE backend in the BF16 matrix.
This is the new coverage point for the TRTLLM BF16 MoE path, but it only checks task accuracy. A silent fallback would still pass, so please assert the resolved backend or, at minimum, llm.args.moe_config.backend.
🧪 Minimal assertion
     with LLM(self.MODEL_PATH,
              tensor_parallel_size=tp_size,
              moe_expert_parallel_size=1,
              max_seq_len=4096,
              max_batch_size=32,
              enable_chunked_prefill=True,
              kv_cache_config=kv_cache_config,
              cuda_graph_config=cuda_graph_config,
              moe_config=moe_config) as llm:
+        assert llm.args.moe_config.backend == moe_backend
         task = GSM8K(self.MODEL_NAME)
         task.evaluate(llm, extra_evaluator_kwargs=self.EXTRA_EVALUATOR_KWARGS)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py` around lines 5963 - 5973, The test currently only checks task accuracy but doesn't assert that the intended MoE backend was actually selected, allowing silent fallbacks to pass; after entering the LLM context created with MoeConfig(backend=moe_backend) and LLM(... ) as llm, add an assertion that verifies the resolved backend (for example by checking llm.args.moe_config.backend or another resolved backend field on the llm instance) equals the expected moe_backend so the BF16 MoE path is exercised and failures are caught.
5969-5973: Keep the CUDA-graph ladder aligned with the new batch cap.
max_batch_size is now 32, but the explicit capture sizes defined just above still include 64 and 128. If those entries are honored, this matrix still warms larger graphs than it can ever serve.
🤖 Prompt for AI Agents
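For reference, clipping the ladder to the cap is a one-liner. This is a sketch with a hypothetical explicit ladder, not the repo's config API:

```python
max_batch_size = 32
capture_sizes = [1, 2, 4, 8, 16, 32, 64, 128]  # hypothetical explicit capture ladder

# Drop every capture size the server can never reach.
trimmed = [s for s in capture_sizes if s <= max_batch_size]
print(trimmed)  # [1, 2, 4, 8, 16, 32]
```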
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py` around lines 5969 - 5973, The CUDA-graph capture sizes (defined in the cuda_graph_config/capture ladder) still include 64 and 128 while max_batch_size was changed to 32; update the cuda_graph_config capture sizes used by the context manager (the cuda_graph_config passed into the llm with max_batch_size=32) so none exceed 32: remove or replace the 64 and 128 entries and ensure the ladder ends at or below max_batch_size to prevent warming graphs larger than the serving cap.
tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
700-701: Consider annotating the shared cache with ClassVar.
Mutable default values share state across all instances of the class without making that sharing obvious. This can lead to bugs: changes made through one instance unexpectedly affect all other instances.
The current implementation is intentional (matching the pattern at lines 2664 and 3873), but explicitly annotating the variable with typing.ClassVar indicates that it is intended to be shared across all instances. This also silences the Ruff RUF012 warning and documents the intent.
🛠️ Suggested annotation
+from typing import ClassVar
+
 class BF16TRTLLMGenFusedMoEMethod(UnquantizedFusedMoEMethod):
     # BlockMajorK uses 128-byte K blocks. BF16 has 2 bytes per element.
     block_k = 64
     use_shuffled_weight = True
     weight_layout = TRTLLM_GEN_WEIGHT_LAYOUT_BLOCK_MAJOR_K
     needs_post_load_processing_for_dummy = True
-    _cache_permute_indices: Dict[tuple[tuple[int, ...], str, int],
-                                 torch.Tensor] = {}
+    _cache_permute_indices: ClassVar[Dict[tuple[tuple[int, ...], str, int],
+                                          torch.Tensor]] = {}
Note: The same pattern exists for
NVFP4TRTLLMGenFusedMoEBaseMethod (line 2664) and MXFP4WeightTRTLLMGenFusedMoEMethod (line 3873), which could benefit from similar annotations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py` around lines 700 - 701, Annotate the shared mutable class attribute _cache_permute_indices as a ClassVar to make the shared intent explicit and silence Ruff RUF012: import ClassVar from typing and change the annotation to ClassVar[Dict[tuple[tuple[int, ...], str, int], torch.Tensor]] while keeping the existing initializer ({}); apply the same pattern to the similar shared attributes on NVFP4TRTLLMGenFusedMoEBaseMethod and MXFP4WeightTRTLLMGenFusedMoEMethod so these mutable defaults are clearly documented as class-level shared state.
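The pitfall and the fix are easy to show with a toy class (not the repo's types):

```python
from typing import ClassVar, Dict

class PermuteCache:
    # ClassVar documents that this dict is deliberately shared by all instances.
    _cache: ClassVar[Dict[str, int]] = {}

a, b = PermuteCache(), PermuteCache()
a._cache["w1"] = 7
print(b._cache["w1"])  # 7: both instances read the same class-level dict
```

The annotation does not change runtime behavior; it marks the attribute as class-level state for type checkers and linters.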
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`:
- Around line 192-195: The code temporarily mutates
backend_model_config.skip_create_weights_in_init and
backend_model_config._frozen but does not reliably restore the original values
on all paths; capture the original values (e.g., tmp_skip_create_weights_in_init
and tmp_frozen = backend_model_config._frozen) before mutating, set
skip_create_weights_in_init = True and _frozen = False only for the operation,
and restore both original values in a finally block around the call to
create_moe_backend (and the analogous block at the 240-244 site) so exceptions
do not leak mutated state.
In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py`:
- Around line 75-82: The BF16 branch in resolve_moe_cls/create_moe currently
only checks model_config.pretrained_config.torch_dtype and ignores the call-site
dtype argument, so callers passing dtype=torch.bfloat16 won't select
TRTLLMGenFusedMoE; update both BF16 checks (the block referencing
TRTLLMGenFusedMoE around the current 75-82 and the similar branch at 97-113) to
honor the dtype parameter by treating the branch as true when dtype is
torch.bfloat16 OR model_config.pretrained_config.torch_dtype is torch.bfloat16
(while still requiring not has_quant and the FlashInfer availability check),
i.e., use a combined condition like: not has_quant and (dtype is torch.bfloat16
or getattr(model_config.pretrained_config, "torch_dtype", None) is
torch.bfloat16) before returning TRTLLMGenFusedMoE or raising the same
RuntimeError.
In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py`:
- Around line 323-325: The function signature for
_supports_flashinfer_bf16_routing_method has a hanging indent causing an E125
lint error; fix it by reformatting the parameter continuation so the closing
parenthesis aligns with the opening or move the closing parenthesis to the same
line as the last parameter, e.g. adjust the indentation of the line with
"routing_method: BaseMoeRoutingMethod, ) -> bool:" so it no longer creates a
misaligned continuation; update the def for
_supports_flashinfer_bf16_routing_method accordingly.
In `@tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py`:
- Around line 787-808: This branch unconditionally calls trtllm_bf16_moe when
router_logits is present, but TRTLLMGenFusedMoE._requires_separated_routing
documents BF16 direct-routing is inaccurate; update the guard in the block that
calls trtllm_bf16_moe to fail fast: detect BF16 direct-routing scenarios (use
routing_logits presence and routing_method_type via self.cvt_routing_method_type
or the same predicate used by TRTLLMGenFusedMoE._requires_separated_routing) and
raise a clear RuntimeError (or fallback to FP32/top-k precompute) instead of
dispatching trtllm_bf16_moe; reference trtllm_bf16_moe,
TRTLLMGenFusedMoE._requires_separated_routing, and forward_impl to align
behavior.
In `@tests/integration/test_lists/test-db/l0_b200.yml`:
- Around line 77-80: The test list includes tp2 variants (e.g.,
accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-CUTLASS],
::test_bf16[tp2-TRTLLM], ::test_fp8[tp2]) but the surrounding job is constrained
to a single GPU (system_gpu_count: 1), causing scheduling failures; update the
entries to match the single-GPU lane by either removing the tp2 variants or
replacing them with their tp1 equivalents (e.g., change tp2-* to tp1-* or drop
those lines) so the listed tests are compatible with the 1-GPU configuration.
In `@tests/unittest/_torch/modules/moe/moe_test_utils.py`:
- Around line 238-240: The TRTLLM C++ routing-kernel skips are being applied to
the BF16 FlashInfer path because quant_algo==None still reaches
should_skip_trtllm(); fix by guarding the old C++ routing-kernel checks so they
only run for quantized paths (i.e., require quant_algo is not None and/or
quant_algo in trtllm_gen_quant_algos) before calling should_skip_trtllm(), or
move the BF16 (quant_algo is None) BF16 block ahead of the C++ routing skips;
update the logic around quant_algo, trtllm_gen_quant_algos and
should_skip_trtllm to ensure DefaultMoeRoutingMethod, MiniMaxM2MoeRoutingMethod
and the Llama4 top-k restriction are not applied to the BF16 FlashInfer path
(which should only ban DeepSeekV3).
---
Nitpick comments:
In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py`:
- Around line 700-701: Annotate the shared mutable class attribute
_cache_permute_indices as a ClassVar to make the shared intent explicit and
silence Ruff RUF012: import ClassVar from typing and change the annotation to
ClassVar[Dict[tuple[tuple[int, ...], str, int], torch.Tensor]] while keeping the
existing initializer ({}); apply the same pattern to the similar shared
attributes on NVFP4TRTLLMGenFusedMoEBaseMethod and
MXFP4WeightTRTLLMGenFusedMoEMethod so these mutable defaults are clearly
documented as class-level shared state.
In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py`:
- Around line 5963-5973: The test currently only checks task accuracy but
doesn't assert that the intended MoE backend was actually selected, allowing
silent fallbacks to pass; after entering the LLM context created with
MoeConfig(backend=moe_backend) and LLM(... ) as llm, add an assertion that
verifies the resolved backend (for example by checking
llm.args.moe_config.backend or another resolved backend field on the llm
instance) equals the expected moe_backend so the BF16 MoE path is exercised and
failures are caught.
- Around line 5969-5973: The CUDA-graph capture sizes (defined in the
cuda_graph_config/capture ladder) still include 64 and 128 while max_batch_size
was changed to 32; update the cuda_graph_config capture sizes used by the
context manager (the cuda_graph_config passed into the llm with
max_batch_size=32) so none exceed 32—remove or replace the 64 and 128 entries
and ensure the ladder ends at or below max_batch_size to prevent warming graphs
larger than the serving cap.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5bec131a-2811-443f-ba6b-06d59f667b43
📒 Files selected for processing (11)
- tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
- tensorrt_llm/_torch/modules/fused_moe/create_moe.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tests/integration/test_lists/qa/llm_function_core.txt
- tests/integration/test_lists/qa/llm_function_core_sanity.txt
- tests/integration/test_lists/test-db/l0_b200.yml
- tests/integration/test_lists/waives.txt
- tests/unittest/_torch/modules/moe/moe_test_utils.py
tmp_skip_create_weights_in_init = backend_model_config.skip_create_weights_in_init
backend_model_config._frozen = False
backend_model_config.skip_create_weights_in_init = True
backend_model_config._frozen = True
Config freeze/skip flags are not safely restored across all paths.
Line 193 and Line 195 force _frozen=True instead of restoring the original frozen state, and if create_moe_backend(...) throws, skip_create_weights_in_init is left mutated. This can leak state into subsequent layer construction.
Suggested fix
- tmp_skip_create_weights_in_init = backend_model_config.skip_create_weights_in_init
- backend_model_config._frozen = False
- backend_model_config.skip_create_weights_in_init = True
- backend_model_config._frozen = True
-
- backend = create_moe_backend(
+ original_skip_create_weights_in_init = backend_model_config.skip_create_weights_in_init
+ original_frozen = backend_model_config._frozen
+ try:
+ backend_model_config._frozen = False
+ backend_model_config.skip_create_weights_in_init = True
+ backend_model_config._frozen = original_frozen
+
+ backend = create_moe_backend(
moe_cls=moe_cls,
routing_method=routing_method,
num_experts=self.num_experts,
hidden_size=self.hidden_size,
intermediate_size=self.intermediate_size,
dtype=self.dtype,
reduce_results=self.reduce_results,
model_config=backend_model_config,
aux_stream_dict=self.aux_stream_dict,
weight_loading_mode=self.weight_loading_mode,
bias=kwargs.get("bias", False),
apply_router_weight_on_input=self.apply_router_weight_on_input,
layer_idx=None,
swiglu_alpha=kwargs.get("swiglu_alpha"),
swiglu_beta=kwargs.get("swiglu_beta"),
swiglu_limit=kwargs.get("swiglu_limit"),
init_load_balancer=False,
without_comm=True,
activation_type=self.activation_type,
- )
+ )
+ finally:
+ backend_model_config._frozen = False
+ backend_model_config.skip_create_weights_in_init = original_skip_create_weights_in_init
+ backend_model_config._frozen = original_frozen
@@
- backend_model_config._frozen = False
- backend_model_config.skip_create_weights_in_init = tmp_skip_create_weights_in_init
- backend_model_config._frozen = True
- if not backend_model_config.skip_create_weights_in_init:
+ if not original_skip_create_weights_in_init:
          self.backend.create_weights()
Also applies to: 240-244
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py` around lines 192 -
195, The code temporarily mutates
backend_model_config.skip_create_weights_in_init and
backend_model_config._frozen but does not reliably restore the original values
on all paths; capture the original values (e.g., tmp_skip_create_weights_in_init
and tmp_frozen = backend_model_config._frozen) before mutating, set
skip_create_weights_in_init = True and _frozen = False only for the operation,
and restore both original values in a finally block around the call to
create_moe_backend (and the analogous block at the 240-244 site) so exceptions
do not leak mutated state.
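The restore discipline described above can also be packaged as a context manager. This is a hedged sketch: `FakeConfig` and its attributes merely mimic the real config fields, and `create_moe_backend` is simulated by raising from inside the `with` block.

```python
from contextlib import contextmanager

class FakeConfig:
    """Toy stand-in for the backend model config."""
    def __init__(self):
        self.skip_create_weights_in_init = False
        self._frozen = True

@contextmanager
def temporarily_skip_weight_init(cfg):
    orig_skip, orig_frozen = cfg.skip_create_weights_in_init, cfg._frozen
    try:
        cfg._frozen = False                      # unfreeze to mutate
        cfg.skip_create_weights_in_init = True
        cfg._frozen = orig_frozen                # restore frozen state for the call
        yield cfg
    finally:
        # Runs on success AND on exception, so no mutated state leaks out.
        cfg._frozen = False
        cfg.skip_create_weights_in_init = orig_skip
        cfg._frozen = orig_frozen

cfg = FakeConfig()
try:
    with temporarily_skip_weight_init(cfg):
        raise RuntimeError("create_moe_backend failed")  # simulated failure path
except RuntimeError:
    pass
# Both flags are back to their originals even though the body raised.
print(cfg.skip_create_weights_in_init, cfg._frozen)  # False True
```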
if not has_quant and model_config.pretrained_config is not None and getattr(
        model_config.pretrained_config, "torch_dtype",
        None) == torch.bfloat16:
    if TRTLLMGenFusedMoE._is_flashinfer_fused_moe_available():
        return TRTLLMGenFusedMoE
    raise RuntimeError(
        "TRTLLMGenFusedMoE BF16 path requires FlashInfer fused MoE with "
        "trtllm_bf16_moe support, but it is not available.")
Honor call-site dtype overrides when resolving the BF16 backend.
resolve_moe_cls() takes dtype, but the unquantized BF16 branch still keys off model_config.pretrained_config.torch_dtype only. A caller doing create_moe(..., dtype=torch.bfloat16) will therefore fall back to CutlassFusedMoE whenever the pretrained config is unset or still says float16, so the new FlashInfer-backed TRTLLM path never gets selected.
Also applies to: 97-113
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py` around lines 75 - 82,
The BF16 branch in resolve_moe_cls/create_moe currently only checks
model_config.pretrained_config.torch_dtype and ignores the call-site dtype
argument, so callers passing dtype=torch.bfloat16 won't select
TRTLLMGenFusedMoE; update both BF16 checks (the block referencing
TRTLLMGenFusedMoE around the current 75-82 and the similar branch at 97-113) to
honor the dtype parameter by treating the branch as true when dtype is
torch.bfloat16 OR model_config.pretrained_config.torch_dtype is torch.bfloat16
(while still requiring not has_quant and the FlashInfer availability check),
i.e., use a combined condition like: not has_quant and (dtype is torch.bfloat16
or getattr(model_config.pretrained_config, "torch_dtype", None) is
torch.bfloat16) before returning TRTLLMGenFusedMoE or raising the same
RuntimeError.
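The combined condition reduces to a small predicate (a sketch using string dtypes in place of torch dtypes):

```python
def wants_bf16_trtllm(has_quant, call_dtype, pretrained_dtype):
    """True when the unquantized BF16 TRTLLM-Gen branch should be taken."""
    return not has_quant and "bfloat16" in (call_dtype, pretrained_dtype)

print(wants_bf16_trtllm(False, "bfloat16", "float16"))   # True: call-site dtype wins
print(wants_bf16_trtllm(False, None, "bfloat16"))        # True: config dtype
print(wants_bf16_trtllm(True, "bfloat16", "bfloat16"))   # False: quantized path
```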
def _supports_flashinfer_bf16_routing_method(
        routing_method: BaseMoeRoutingMethod, ) -> bool:
    # FIXME: ban DeepSeekV3 FlashInfer trtllm_bf16_routed_moe() as it appears to have bug
Fix the hanging indent in _supports_flashinfer_bf16_routing_method.
Flake8 is already reporting E125 here, so the lint job will remain red until the continuation line is re-indented or the closing parenthesis is moved.
Minimal fix
 @staticmethod
 def _supports_flashinfer_bf16_routing_method(
-        routing_method: BaseMoeRoutingMethod, ) -> bool:
+        routing_method: BaseMoeRoutingMethod,
+) -> bool:
📝 Committable suggestion
+ ) -> bool:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@staticmethod
def _supports_flashinfer_bf16_routing_method(
        routing_method: BaseMoeRoutingMethod,
) -> bool:
    # FIXME: ban DeepSeekV3 FlashInfer trtllm_bf16_routed_moe() as it appears to have bug
🧰 Tools
🪛 Flake8 (7.3.0)
[error] 324-324: continuation line with same indent as next logical line
(E125)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` around lines
323 - 325, The function signature for _supports_flashinfer_bf16_routing_method
has a hanging indent causing an E125 lint error; fix it by reformatting the
parameter continuation so the closing parenthesis aligns with the opening or
move the closing parenthesis to the same line as the last parameter, e.g. adjust
the indentation of the line with "routing_method: BaseMoeRoutingMethod, ) ->
bool:" so it no longer creates a misaligned continuation; update the def for
_supports_flashinfer_bf16_routing_method accordingly.
if router_logits is not None:
    result = self._fused_moe.trtllm_bf16_moe(
        routing_logits=router_logits,
        routing_bias=routing_bias,
        hidden_states=hidden_states,
        gemm1_weights=gemm1_weights,
        gemm2_weights=gemm2_weights,
        num_experts=num_experts,
        top_k=top_k,
        n_group=n_group,
        topk_group=topk_group,
        intermediate_size=intermediate_size,
        local_expert_offset=local_expert_offset,
        local_num_experts=local_num_experts,
        routed_scaling_factor=routed_scaling_factor,
        routing_method_type=self.cvt_routing_method_type(routing_method_type),
        use_shuffled_weight=use_shuffled_weight,
        weight_layout=weight_layout,
        do_finalize=do_finalize,
        enable_pdl=enable_pdl,
        tune_max_num_tokens=tune_max_num_tokens,
    )
Fail fast on the BF16 direct-routing kernel for now.
TRTLLMGenFusedMoE._requires_separated_routing() now documents that FlashInfer BF16 direct routing has an accuracy bug, but this branch still dispatches trtllm_bf16_moe() whenever router_logits is present. TRTLLMGenFusedMoE.forward_impl() only precomputes top-k on the post-comm path, so single-GPU / legacy callers can still hit this kernel and get silent miscomputations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py` around lines 787 -
808, This branch unconditionally calls trtllm_bf16_moe when router_logits is
present, but TRTLLMGenFusedMoE._requires_separated_routing documents BF16
direct-routing is inaccurate; update the guard in the block that calls
trtllm_bf16_moe to fail fast: detect BF16 direct-routing scenarios (use
routing_logits presence and routing_method_type via self.cvt_routing_method_type
or the same predicate used by TRTLLMGenFusedMoE._requires_separated_routing) and
raise a clear RuntimeError (or fallback to FP32/top-k precompute) instead of
dispatching trtllm_bf16_moe; reference trtllm_bf16_moe,
TRTLLMGenFusedMoE._requires_separated_routing, and forward_impl to align
behavior.
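A minimal shape for that guard (hypothetical helper; the real predicate lives on `TRTLLMGenFusedMoE._requires_separated_routing`):

```python
def dispatch_bf16(router_logits, requires_separated_routing):
    """Fail fast instead of dispatching the known-inaccurate direct-routing kernel."""
    if router_logits is not None and requires_separated_routing:
        raise RuntimeError(
            "FlashInfer BF16 direct routing is inaccurate for this routing method; "
            "precompute top-k and call the routed kernel instead.")
    # With precomputed top-k (router_logits is None), use the routed kernel.
    return "trtllm_bf16_moe" if router_logits is not None else "trtllm_bf16_routed_moe"

print(dispatch_bf16(None, True))         # trtllm_bf16_routed_moe
print(dispatch_bf16([0.3, 0.7], False))  # trtllm_bf16_moe
```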
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-CUTLASS]
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-TRTLLM]
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp1]
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp2]
tp2 cases are inconsistent with this single-GPU lane.
Line 77–Line 80 add tp2 test IDs under a condition constrained to exactly 1 GPU (system_gpu_count: 1). This is likely to fail scheduling/execution in pre-merge l0_b200.
Suggested fix
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp1-CUTLASS]
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp1-TRTLLM]
- - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-CUTLASS]
- - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-TRTLLM]
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp1]
- - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp2]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp1]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration/test_lists/test-db/l0_b200.yml` around lines 77 - 80, The
test list includes tp2 variants (e.g.,
accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-CUTLASS],
::test_bf16[tp2-TRTLLM], ::test_fp8[tp2]) but the surrounding job is constrained
to a single GPU (system_gpu_count: 1), causing scheduling failures; update the
entries to match the single-GPU lane by either removing the tp2 variants or
replacing them with their tp1 equivalents (e.g., change tp2-* to tp1-* or drop
those lines) so the listed tests are compatible with the 1-GPU configuration.
# Quant_algo==None (BF16 path) also falls through and must meet the should_skip_trtllm criteria
if quant_algo is not None and quant_algo not in trtllm_gen_quant_algos:
    return None
Don't apply the TRTLLM C++ routing skips to the BF16 FlashInfer path.
quant_algo is None now reaches should_skip_trtllm(), but the older C++ routing-kernel guard still runs before this BF16 block. That still skips BF16 TRTLLM cases for routing methods like DefaultMoeRoutingMethod, MiniMaxM2MoeRoutingMethod, and the Llama4 top-k restriction, even though the BF16 FlashInfer path now uses separated routing and only bans DeepSeekV3. Most of the new BF16 coverage stays disabled.
Also applies to: 263-272
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unittest/_torch/modules/moe/moe_test_utils.py` around lines 238 - 240,
The TRTLLM C++ routing-kernel skips are being applied to the BF16 FlashInfer
path because quant_algo==None still reaches should_skip_trtllm(); fix by
guarding the old C++ routing-kernel checks so they only run for quantized paths
(i.e., require quant_algo is not None and/or quant_algo in
trtllm_gen_quant_algos) before calling should_skip_trtllm(), or move the BF16
(quant_algo is None) BF16 block ahead of the C++ routing skips; update the logic
around quant_algo, trtllm_gen_quant_algos and should_skip_trtllm to ensure
DefaultMoeRoutingMethod, MiniMaxM2MoeRoutingMethod and the Llama4 top-k
restriction are not applied to the BF16 FlashInfer path (which should only ban
DeepSeekV3).
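The guard the prompt asks for amounts to ordering the checks (toy predicate and algo names, not the test helper's real API):

```python
def should_skip(quant_algo, routing_method,
                trtllm_gen_quant_algos=("NVFP4", "MXFP4"),
                cpp_banned=("DefaultMoeRoutingMethod", "MiniMaxM2MoeRoutingMethod")):
    # BF16 FlashInfer path: only DeepSeekV3 routing is banned.
    if quant_algo is None:
        return routing_method == "DeepSeekV3MoeRoutingMethod"
    if quant_algo not in trtllm_gen_quant_algos:
        return True
    # C++ routing-kernel restrictions apply to quantized paths only.
    return routing_method in cpp_banned

print(should_skip(None, "DefaultMoeRoutingMethod"))     # False: allowed on BF16 path
print(should_skip(None, "DeepSeekV3MoeRoutingMethod"))  # True: BF16 ban
print(should_skip("NVFP4", "DefaultMoeRoutingMethod"))  # True: C++ routing restriction
```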
Force-pushed 7863d16 to 2319281

/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #41692 [ run ] triggered by Bot. Commit:
PR_Github #41692 [ run ] completed with state

/bot run
PR_Github #41778 [ run ] triggered by Bot. Commit:
PR_Github #41778 [ run ] completed with state

/bot run
PR_Github #41826 [ run ] triggered by Bot. Commit:
PR_Github #41826 [ run ] completed with state

Force-pushed 2319281 to a65b400

/bot run
PR_Github #41864 [ run ] triggered by Bot. Commit:
PR_Github #41864 [ run ] completed with state
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus_python_scheduler[ep4-mtp_nextn=0] SKIP (https://nvbugs/5997051)
perf/test_perf_sanity.py::test_e2e[aggr_upload-deepseek_v32_fp4_blackwell-v32_fp4_tep8_mtp3_8k1k] SKIP (https://nvbugs/5997092)
accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8 SKIP (https://nvbugs/6004530)
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B_Instruct_RocketKV::test_auto_dtype SKIP (https://nvbugs/6007197)
Was this TestLlama3_1_8B_Instruct_RocketKV case added by accident? (I had a similar rebase conflict in #12257 and reverted it in later commits.)
Thanks, removed.
@staticmethod
def _supports_flashinfer_bf16_routing_method(
        routing_method: BaseMoeRoutingMethod, ) -> bool:
    # FIXME: ban DeepSeekV3 FlashInfer trtllm_bf16_routed_moe() as it appears to have bug
Note: this will be addressed by flashinfer-ai/flashinfer#2911.
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Force-pushed a65b400 to 4652726
/bot run
PR_Github #41914 [ run ] triggered by Bot. Commit:
PR_Github #41914 [ run ] completed with state

/bot run
PR_Github #41933 [ run ] triggered by Bot. Commit:
PR_Github #41933 [ run ] completed with state

/bot run
PR_Github #42020 [ run ] triggered by Bot. Commit:
PR_Github #42020 [ run ] completed with state

/bot run
PR_Github #42078 [ run ] triggered by Bot. Commit:
PR_Github #42078 [ run ] completed with state

/bot run
PR_Github #42143 [ run ] triggered by Bot. Commit:
PR_Github #42143 [ run ] completed with state

/bot run
PR_Github #42210 [ run ] triggered by Bot. Commit:

/bot run
PR_Github #42315 [ run ] triggered by Bot. Commit:
PR_Github #42315 [ run ] completed with state

/bot run
PR_Github #42444 [ run ] triggered by Bot. Commit:
PR_Github #42444 [ run ] completed with state

/bot run
PR_Github #42485 [ run ] triggered by Bot. Commit:
PR_Github #42485 [ run ] completed with state

/bot run --disable-fail-fast
Summary by CodeRabbit
Release Notes
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.