
[None][feat] Add bf16 trtllm-gen moe support through flashinfer.#12738

Open
nv-guomingz wants to merge 1 commit into NVIDIA:main from nv-guomingz:user/guomingz/qwen3.5-bf16-trtllm-moe

Conversation


@nv-guomingz nv-guomingz commented Apr 3, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added BF16 (unquantized) Mixture of Experts execution mode with FlashInfer backend support
    • Enhanced MoE backend selection logic with improved routing method handling
  • Tests

    • Expanded test coverage for varying tensor parallel sizes and MoE backend configurations (CUTLASS and TRTLLM)

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@nv-guomingz nv-guomingz requested review from a team as code owners April 3, 2026 14:54
@nv-guomingz nv-guomingz requested a review from QiJune April 3, 2026 14:54

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

This PR adds BF16 unquantized MoE execution support for the TRTLLM backend using FlashInfer. Changes introduce a new BF16TRTLLMGenFusedMoEMethod for weight layout conversion, a resolve_moe_cls() function for routing-method-dependent backend selection, FlashInfer-specific validation, and corresponding backend implementations with updated test coverage.
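The selection-with-fallback flow this walkthrough describes can be sketched as follows. The class and function names mirror the PR (resolve_moe_cls, get_moe_cls, TRTLLMGenFusedMoE, CutlassFusedMoE), but the bodies are illustrative stand-ins, not the actual TensorRT-LLM implementation; BF16 is a string stand-in for torch.bfloat16.

```python
BF16 = "bfloat16"  # stand-in for torch.bfloat16


class CutlassFusedMoE:
    pass


class TRTLLMGenFusedMoE:
    flashinfer_available = True  # assumed toggle for this sketch

    @classmethod
    def _is_flashinfer_fused_moe_available(cls):
        return cls.flashinfer_available


def get_moe_cls(has_quant, torch_dtype):
    # Unquantized BF16 prefers the TRTLLM-gen backend, which needs FlashInfer.
    if not has_quant and torch_dtype == BF16:
        if TRTLLMGenFusedMoE._is_flashinfer_fused_moe_available():
            return TRTLLMGenFusedMoE
        raise RuntimeError(
            "TRTLLMGenFusedMoE BF16 path requires FlashInfer fused MoE")
    return CutlassFusedMoE


def resolve_moe_cls(has_quant, torch_dtype, routing_method_supported):
    cls = get_moe_cls(has_quant, torch_dtype)
    # Routing methods the BF16 kernel cannot serve fall back to CUTLASS.
    if cls is TRTLLMGenFusedMoE and not routing_method_supported:
        return CutlassFusedMoE
    return cls
```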

Changes

• Backend Selection & Configuration (tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py, tensorrt_llm/_torch/modules/fused_moe/create_moe.py): Introduced resolve_moe_cls() for routing-method-dependent MoE backend selection with fallback logic. Enhanced get_moe_cls() with a BF16/no-quant path. Updated configurable_moe.py to use a deep-copied backend model config for override handling.
• TRTLLM Gen Fused MoE Implementation (tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py, tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py): Added a BF16 unquantized execution mode with FlashInfer backend selection. Introduced FlashInfer-specific validation methods and changed can_implement() to conditionally accept unquantized mode. Added a run_bf16_moe() method to the backend interface with TRTLLMOpBackend and FlashinferOpBackend implementations.
• Weight Layout & Quantization (tensorrt_llm/_torch/modules/fused_moe/quantization.py): Introduced BF16TRTLLMGenFusedMoEMethod with BlockMajorK weight layout support. Added weight layout constants and helper functions for layout conversion. Extended lifecycle control for post-loading weight processing in FusedMoEMethodBase.
• Test Parameterization (tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/integration/test_lists/qa/llm_function_core.txt, tests/integration/test_lists/qa/llm_function_core_sanity.txt, tests/integration/test_lists/test-db/l0_b200.yml): Parameterized Qwen3.5 35B A3B tests for BF16 and FP8 modes across CUTLASS and TRTLLM backends with variable tensor parallelism. Updated test selection lists to reflect the parameterized variants.
• Test Utilities & Skip Logic (tests/unittest/_torch/modules/moe/moe_test_utils.py, tests/integration/test_lists/waives.txt): Updated should_skip_trtllm to allow the BF16 unquantized path through its constraints. Added alignment validation (128-multiple requirement) for BF16. Adjusted CI skip logic to explicitly enable TRTLLM BF16 unquantized coverage. Updated waive entries.
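The 128-multiple alignment validation mentioned above can be sketched like this. The function name and the choice of which sizes are validated are hypothetical; only the 128-multiple requirement itself comes from the change summary.

```python
def check_bf16_alignment(hidden_size: int, intermediate_size: int) -> None:
    """Reject sizes that are not multiples of 128 on the BF16 path (illustrative)."""
    for name, value in (("hidden_size", hidden_size),
                        ("intermediate_size", intermediate_size)):
        if value % 128 != 0:
            raise ValueError(
                f"{name}={value} must be a multiple of 128 for the "
                "BF16 TRTLLM-gen MoE path")
```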

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client/Configurable MoE
    participant Resolver as resolve_moe_cls()
    participant Selector as get_moe_cls()
    participant TRTLLMGen as TRTLLMGenFusedMoE
    participant Backend as MoE Op Backend
    participant FlashInfer as FlashInfer/CUDA Ops

    Client->>Resolver: resolve_moe_cls(model_config, routing_method, dtype, override_quant_config)
    Resolver->>Selector: get_moe_cls(model_config, routing_method, dtype, override_quant_config)
    
    alt has_quant == True
        Selector->>TRTLLMGen: Check specific quant predicates
        TRTLLMGen-->>Selector: Return if match
    else has_quant == False
        Selector->>TRTLLMGen: Check BF16 + FlashInfer availability
        alt BF16 && FlashInfer available
            TRTLLMGen-->>Selector: Return TRTLLMGenFusedMoE
        else Missing FlashInfer
            Selector-->>Resolver: Raise RuntimeError
        end
    end
    
    Resolver->>TRTLLMGen: Check routing method support for BF16 unquantized
    alt Routing not supported && unquantized
        Resolver->>Resolver: Fall back to CutlassFusedMoE
    end
    
    Resolver-->>Client: MoE class selected
    Client->>TRTLLMGen: create_weights() / load_weights()
    TRTLLMGen->>Backend: run_bf16_moe() (if BF16 unquantized)
    Backend->>FlashInfer: trtllm_bf16_moe() or trtllm_bf16_routed_moe()
    FlashInfer-->>Backend: Output tensor
    Backend-->>TRTLLMGen: Result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 28.26%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Description check (⚠️ Warning): The PR description contains only the unfilled template; the actual description, test coverage details, and substantive PR context are missing. Resolution: explain what bf16 TRT-LLM MoE support via FlashInfer entails and why it was added, and list the specific tests that validate the changes (e.g., the test_bf16 variants for Qwen3.5-35B).
✅ Passed checks (1 passed)
  • Title check (✅ Passed): The title accurately describes the main feature, adding bf16 (bfloat16) support for TensorRT-LLM MoE through FlashInfer, which aligns with the primary changes across all modified files.



Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (3)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

5963-5973: Assert the selected MoE backend in the BF16 matrix.

This is the new coverage point for the TRTLLM BF16 MoE path, but it only checks task accuracy. A silent fallback would still pass, so please assert the resolved backend or, at minimum, llm.args.moe_config.backend.

🧪 Minimal assertion
         with LLM(self.MODEL_PATH,
                  tensor_parallel_size=tp_size,
                  moe_expert_parallel_size=1,
                  max_seq_len=4096,
                  max_batch_size=32,
                  enable_chunked_prefill=True,
                  kv_cache_config=kv_cache_config,
                  cuda_graph_config=cuda_graph_config,
                  moe_config=moe_config) as llm:
+            assert llm.args.moe_config.backend == moe_backend
             task = GSM8K(self.MODEL_NAME)
             task.evaluate(llm,
                           extra_evaluator_kwargs=self.EXTRA_EVALUATOR_KWARGS)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py` around lines 5963 -
5973, The test currently only checks task accuracy but doesn't assert that the
intended MoE backend was actually selected, allowing silent fallbacks to pass;
after entering the LLM context created with MoeConfig(backend=moe_backend) and
LLM(... ) as llm, add an assertion that verifies the resolved backend (for
example by checking llm.args.moe_config.backend or another resolved backend
field on the llm instance) equals the expected moe_backend so the BF16 MoE path
is exercised and failures are caught.

5969-5973: Keep the CUDA-graph ladder aligned with the new batch cap.

max_batch_size is now 32, but the explicit capture sizes defined just above still include 64 and 128. If those entries are honored, this matrix still warms larger graphs than it can ever serve.
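The consistency rule behind this comment can be sketched as a small helper: no CUDA-graph capture size should exceed the serving batch cap. The helper name and config shape are illustrative, not the test's actual cuda_graph_config.

```python
def trim_capture_sizes(capture_sizes, max_batch_size):
    """Drop CUDA-graph capture sizes larger than the serving batch cap."""
    return [size for size in sorted(capture_sizes) if size <= max_batch_size]
```

With a ladder of [1, 2, 4, 8, 16, 32, 64, 128] and max_batch_size=32, the 64 and 128 entries are removed so no graph is warmed that can never be served.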

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py` around lines 5969 -
5973, The CUDA-graph capture sizes (defined in the cuda_graph_config/capture
ladder) still include 64 and 128 while max_batch_size was changed to 32; update
the cuda_graph_config capture sizes used by the context manager (the
cuda_graph_config passed into the llm with max_batch_size=32) so none exceed
32—remove or replace the 64 and 128 entries and ensure the ladder ends at or
below max_batch_size to prevent warming graphs larger than the serving cap.
tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)

700-701: Consider annotating the shared cache with ClassVar.

Mutable default values share state across all instances of a class, which is easy to miss. Changing the attribute through one instance unexpectedly affects all other instances.

The current implementation is intentional (matching the pattern at lines 2664 and 3873), but explicitly annotating the variable with typing.ClassVar indicates that it is intended to be shared across all instances. This also silences the Ruff RUF012 warning and documents the intent.

🛠️ Suggested annotation
+from typing import ClassVar
+
 class BF16TRTLLMGenFusedMoEMethod(UnquantizedFusedMoEMethod):
     # BlockMajorK uses 128-byte K blocks. BF16 has 2 bytes per element.
     block_k = 64
     use_shuffled_weight = True
     weight_layout = TRTLLM_GEN_WEIGHT_LAYOUT_BLOCK_MAJOR_K
     needs_post_load_processing_for_dummy = True
-    _cache_permute_indices: Dict[tuple[tuple[int, ...], str, int],
-                                 torch.Tensor] = {}
+    _cache_permute_indices: ClassVar[Dict[tuple[tuple[int, ...], str, int],
+                                          torch.Tensor]] = {}

Note: The same pattern exists for NVFP4TRTLLMGenFusedMoEBaseMethod (line 2664) and MXFP4WeightTRTLLMGenFusedMoEMethod (line 3873), which could benefit from similar annotations.
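The block_k = 64 constant in the snippet above follows from the comment "BlockMajorK uses 128-byte K blocks. BF16 has 2 bytes per element." A hypothetical helper makes the arithmetic explicit; the name is illustrative, not part of the PR.

```python
BLOCK_K_BYTES = 128  # BlockMajorK uses 128-byte K blocks


def block_k_for(dtype_bytes: int) -> int:
    """Elements per 128-byte K block for a dtype of the given byte width."""
    assert BLOCK_K_BYTES % dtype_bytes == 0
    return BLOCK_K_BYTES // dtype_bytes
```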

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py` around lines 700 -
701, Annotate the shared mutable class attribute _cache_permute_indices as a
ClassVar to make the shared intent explicit and silence Ruff RUF012: import
ClassVar from typing and change the annotation to ClassVar[Dict[tuple[tuple[int,
...], str, int], torch.Tensor]] while keeping the existing initializer ({});
apply the same pattern to the similar shared attributes on
NVFP4TRTLLMGenFusedMoEBaseMethod and MXFP4WeightTRTLLMGenFusedMoEMethod so these
mutable defaults are clearly documented as class-level shared state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5bec131a-2811-443f-ba6b-06d59f667b43

📥 Commits

Reviewing files that changed from the base of the PR and between 1045f38 and 7863d16.

📒 Files selected for processing (11)
  • tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/create_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/qa/llm_function_core_sanity.txt
  • tests/integration/test_lists/test-db/l0_b200.yml
  • tests/integration/test_lists/waives.txt
  • tests/unittest/_torch/modules/moe/moe_test_utils.py

Comment on lines +192 to +195
tmp_skip_create_weights_in_init = backend_model_config.skip_create_weights_in_init
backend_model_config._frozen = False
backend_model_config.skip_create_weights_in_init = True
backend_model_config._frozen = True

⚠️ Potential issue | 🟠 Major

Config freeze/skip flags are not safely restored across all paths.

Line 193 and Line 195 force _frozen=True instead of restoring the original frozen state, and if create_moe_backend(...) throws, skip_create_weights_in_init is left mutated. This can leak state into subsequent layer construction.

Suggested fix
-        tmp_skip_create_weights_in_init = backend_model_config.skip_create_weights_in_init
-        backend_model_config._frozen = False
-        backend_model_config.skip_create_weights_in_init = True
-        backend_model_config._frozen = True
-
-        backend = create_moe_backend(
+        original_skip_create_weights_in_init = backend_model_config.skip_create_weights_in_init
+        original_frozen = backend_model_config._frozen
+        try:
+            backend_model_config._frozen = False
+            backend_model_config.skip_create_weights_in_init = True
+            backend_model_config._frozen = original_frozen
+
+            backend = create_moe_backend(
             moe_cls=moe_cls,
             routing_method=routing_method,
             num_experts=self.num_experts,
             hidden_size=self.hidden_size,
             intermediate_size=self.intermediate_size,
             dtype=self.dtype,
             reduce_results=self.reduce_results,
             model_config=backend_model_config,
             aux_stream_dict=self.aux_stream_dict,
             weight_loading_mode=self.weight_loading_mode,
             bias=kwargs.get("bias", False),
             apply_router_weight_on_input=self.apply_router_weight_on_input,
             layer_idx=None,
             swiglu_alpha=kwargs.get("swiglu_alpha"),
             swiglu_beta=kwargs.get("swiglu_beta"),
             swiglu_limit=kwargs.get("swiglu_limit"),
             init_load_balancer=False,
             without_comm=True,
             activation_type=self.activation_type,
-        )
+            )
+        finally:
+            backend_model_config._frozen = False
+            backend_model_config.skip_create_weights_in_init = original_skip_create_weights_in_init
+            backend_model_config._frozen = original_frozen
@@
-        backend_model_config._frozen = False
-        backend_model_config.skip_create_weights_in_init = tmp_skip_create_weights_in_init
-        backend_model_config._frozen = True
-        if not backend_model_config.skip_create_weights_in_init:
+        if not original_skip_create_weights_in_init:
             self.backend.create_weights()

Also applies to: 240-244

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py` around lines 192 -
195, The code temporarily mutates
backend_model_config.skip_create_weights_in_init and
backend_model_config._frozen but does not reliably restore the original values
on all paths; capture the original values (e.g., tmp_skip_create_weights_in_init
and tmp_frozen = backend_model_config._frozen) before mutating, set
skip_create_weights_in_init = True and _frozen = False only for the operation,
and restore both original values in a finally block around the call to
create_moe_backend (and the analogous block at the 240-244 site) so exceptions
do not leak mutated state.

Comment on lines +75 to +82
if not has_quant and model_config.pretrained_config is not None and getattr(
        model_config.pretrained_config, "torch_dtype",
        None) == torch.bfloat16:
    if TRTLLMGenFusedMoE._is_flashinfer_fused_moe_available():
        return TRTLLMGenFusedMoE
    raise RuntimeError(
        "TRTLLMGenFusedMoE BF16 path requires FlashInfer fused MoE with "
        "trtllm_bf16_moe support, but it is not available.")

⚠️ Potential issue | 🟠 Major

Honor call-site dtype overrides when resolving the BF16 backend.

resolve_moe_cls() takes dtype, but the unquantized BF16 branch still keys off model_config.pretrained_config.torch_dtype only. A caller doing create_moe(..., dtype=torch.bfloat16) will therefore fall back to CutlassFusedMoE whenever the pretrained config is unset or still says float16, so the new FlashInfer-backed TRTLLM path never gets selected.

Also applies to: 97-113
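The combined condition this comment proposes can be sketched as follows. SimpleNamespace stands in for the pretrained config and the string BF16 stands in for torch.bfloat16; the function name is hypothetical.

```python
from types import SimpleNamespace

BF16 = "bfloat16"  # stand-in for torch.bfloat16


def selects_bf16_trtllm(has_quant, dtype, pretrained_config):
    # getattr guards against a missing or None pretrained config.
    cfg_dtype = getattr(pretrained_config, "torch_dtype", None)
    # Honor either the call-site dtype or the pretrained config's dtype.
    return not has_quant and (dtype == BF16 or cfg_dtype == BF16)
```

A caller passing dtype=BF16 now selects the BF16 path even when the pretrained config is unset or says float16.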

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py` around lines 75 - 82,
The BF16 branch in resolve_moe_cls/create_moe currently only checks
model_config.pretrained_config.torch_dtype and ignores the call-site dtype
argument, so callers passing dtype=torch.bfloat16 won't select
TRTLLMGenFusedMoE; update both BF16 checks (the block referencing
TRTLLMGenFusedMoE around the current 75-82 and the similar branch at 97-113) to
honor the dtype parameter by treating the branch as true when dtype is
torch.bfloat16 OR model_config.pretrained_config.torch_dtype is torch.bfloat16
(while still requiring not has_quant and the FlashInfer availability check),
i.e., use a combined condition like: not has_quant and (dtype is torch.bfloat16
or getattr(model_config.pretrained_config, "torch_dtype", None) is
torch.bfloat16) before returning TRTLLMGenFusedMoE or raising the same
RuntimeError.

Comment on lines +323 to +325
def _supports_flashinfer_bf16_routing_method(
    routing_method: BaseMoeRoutingMethod, ) -> bool:
    # FIXME: ban DeepSeekV3 FlashInfer trtllm_bf16_routed_moe() as it appears to have bug

⚠️ Potential issue | 🟡 Minor

Fix the hanging indent in _supports_flashinfer_bf16_routing_method.

Flake8 is already reporting E125 here, so the lint job will remain red until the continuation line is re-indented or the closing parenthesis is moved.

Minimal fix
     @staticmethod
     def _supports_flashinfer_bf16_routing_method(
-        routing_method: BaseMoeRoutingMethod, ) -> bool:
+        routing_method: BaseMoeRoutingMethod,
+    ) -> bool:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     @staticmethod
     def _supports_flashinfer_bf16_routing_method(
-        routing_method: BaseMoeRoutingMethod, ) -> bool:
+        routing_method: BaseMoeRoutingMethod,
+    ) -> bool:
         # FIXME: ban DeepSeekV3 FlashInfer trtllm_bf16_routed_moe() as it appears to have bug
🧰 Tools
🪛 Flake8 (7.3.0)

[error] 324-324: continuation line with same indent as next logical line

(E125)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` around lines
323 - 325, The function signature for _supports_flashinfer_bf16_routing_method
has a hanging indent causing an E125 lint error; fix it by reformatting the
parameter continuation so the closing parenthesis aligns with the opening or
move the closing parenthesis to the same line as the last parameter, e.g. adjust
the indentation of the line with "routing_method: BaseMoeRoutingMethod, ) ->
bool:" so it no longer creates a misaligned continuation; update the def for
_supports_flashinfer_bf16_routing_method accordingly.

Comment on lines +787 to +808
if router_logits is not None:
    result = self._fused_moe.trtllm_bf16_moe(
        routing_logits=router_logits,
        routing_bias=routing_bias,
        hidden_states=hidden_states,
        gemm1_weights=gemm1_weights,
        gemm2_weights=gemm2_weights,
        num_experts=num_experts,
        top_k=top_k,
        n_group=n_group,
        topk_group=topk_group,
        intermediate_size=intermediate_size,
        local_expert_offset=local_expert_offset,
        local_num_experts=local_num_experts,
        routed_scaling_factor=routed_scaling_factor,
        routing_method_type=self.cvt_routing_method_type(routing_method_type),
        use_shuffled_weight=use_shuffled_weight,
        weight_layout=weight_layout,
        do_finalize=do_finalize,
        enable_pdl=enable_pdl,
        tune_max_num_tokens=tune_max_num_tokens,
    )

⚠️ Potential issue | 🟠 Major

Fail fast on the BF16 direct-routing kernel for now.

TRTLLMGenFusedMoE._requires_separated_routing() now documents that FlashInfer BF16 direct routing has an accuracy bug, but this branch still dispatches trtllm_bf16_moe() whenever router_logits is present. TRTLLMGenFusedMoE.forward_impl() only precomputes top-k on the post-comm path, so single-GPU / legacy callers can still hit this kernel and get silent miscomputations.
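A hedged sketch of the guard this comment asks for: refuse the direct-routing BF16 kernel when the routing method requires separated routing. The dispatcher name and the callables it takes are illustrative, not the moe_op_backend.py API.

```python
def dispatch_bf16_moe(router_logits, requires_separated_routing,
                      run_direct, run_separated):
    if router_logits is not None:
        if requires_separated_routing:
            # Fail fast instead of silently miscomputing with the buggy kernel.
            raise RuntimeError(
                "FlashInfer BF16 direct routing is inaccurate for this "
                "routing method; precompute top-k and use separated routing.")
        return run_direct(router_logits)
    return run_separated()
```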

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py` around lines 787 -
808, This branch unconditionally calls trtllm_bf16_moe when router_logits is
present, but TRTLLMGenFusedMoE._requires_separated_routing documents BF16
direct-routing is inaccurate; update the guard in the block that calls
trtllm_bf16_moe to fail fast: detect BF16 direct-routing scenarios (use
routing_logits presence and routing_method_type via self.cvt_routing_method_type
or the same predicate used by TRTLLMGenFusedMoE._requires_separated_routing) and
raise a clear RuntimeError (or fallback to FP32/top-k precompute) instead of
dispatching trtllm_bf16_moe; reference trtllm_bf16_moe,
TRTLLMGenFusedMoE._requires_separated_routing, and forward_impl to align
behavior.

Comment on lines +77 to +80
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-CUTLASS]
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-TRTLLM]
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp1]
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp2]

⚠️ Potential issue | 🟠 Major

tp2 cases are inconsistent with this single-GPU lane.

Line 77–Line 80 add tp2 test IDs under a condition constrained to exactly 1 GPU (system_gpu_count: 1). This is likely to fail scheduling/execution in pre-merge l0_b200.

Suggested fix
   - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp1-CUTLASS]
   - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp1-TRTLLM]
-  - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-CUTLASS]
-  - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-TRTLLM]
   - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp1]
-  - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp2]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-CUTLASS]
-  - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-TRTLLM]
   - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp1]
-  - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp2]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/test_lists/test-db/l0_b200.yml` around lines 77 - 80, The
test list includes tp2 variants (e.g.,
accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-CUTLASS],
::test_bf16[tp2-TRTLLM], ::test_fp8[tp2]) but the surrounding job is constrained
to a single GPU (system_gpu_count: 1), causing scheduling failures; update the
entries to match the single-GPU lane by either removing the tp2 variants or
replacing them with their tp1 equivalents (e.g., change tp2-* to tp1-* or drop
those lines) so the listed tests are compatible with the 1-GPU configuration.

Comment on lines +238 to 240
# Quant_algo==None (BF16 path) also falls through and must meet the should_skip_trtllm criteria
if quant_algo is not None and quant_algo not in trtllm_gen_quant_algos:
    return None
Contributor

⚠️ Potential issue | 🟠 Major

Don't apply the TRTLLM C++ routing skips to the BF16 FlashInfer path.

quant_algo is None now reaches should_skip_trtllm(), but the older C++ routing-kernel guard still runs before this BF16 block. That still skips BF16 TRTLLM cases for routing methods like DefaultMoeRoutingMethod, MiniMaxM2MoeRoutingMethod, and the Llama4 top-k restriction, even though the BF16 FlashInfer path now uses separated routing and only bans DeepSeekV3. Most of the new BF16 coverage stays disabled.

Also applies to: 263-272

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modules/moe/moe_test_utils.py` around lines 238 - 240,
The TRTLLM C++ routing-kernel skips are being applied to the BF16 FlashInfer
path because quant_algo==None still reaches should_skip_trtllm(); fix by
guarding the old C++ routing-kernel checks so they only run for quantized paths
(i.e., require quant_algo is not None and/or quant_algo in
trtllm_gen_quant_algos) before calling should_skip_trtllm(), or move the BF16
(quant_algo is None) block ahead of the C++ routing skips; update the logic
around quant_algo, trtllm_gen_quant_algos and should_skip_trtllm to ensure
DefaultMoeRoutingMethod, MiniMaxM2MoeRoutingMethod and the Llama4 top-k
restriction are not applied to the BF16 FlashInfer path (which should only ban
DeepSeekV3).
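To make the suggested ordering concrete, here is a minimal, self-contained sketch. All names here (the quant-algo set, the banned-routing set, the string-valued routing methods) are hypothetical stand-ins for the helpers mentioned above, not the actual moe_test_utils code:

```python
# Hypothetical sketch of the suggested fix: the legacy C++ routing-kernel
# restrictions apply only to quantized paths, while the BF16 (quant_algo is
# None) FlashInfer path uses separated routing and bans only DeepSeekV3.
TRTLLM_GEN_QUANT_ALGOS = {"FP8", "NVFP4"}  # placeholder set
CPP_ROUTING_BANNED = {"DefaultMoeRoutingMethod", "MiniMaxM2MoeRoutingMethod"}

def should_skip_trtllm(quant_algo, routing_method) -> bool:
    """Return True if the TRTLLM MoE backend should be skipped."""
    if quant_algo is None:
        # BF16 FlashInfer path: only DeepSeekV3 routing is unsupported.
        return routing_method == "DeepSeekV3MoeRoutingMethod"
    if quant_algo not in TRTLLM_GEN_QUANT_ALGOS:
        # Quantization algorithm not supported by trtllm-gen kernels.
        return True
    # Quantized paths keep the C++ routing-kernel restrictions.
    return routing_method in CPP_ROUTING_BANNED
```

With this ordering, BF16 cases using DefaultMoeRoutingMethod are no longer skipped, while DeepSeekV3 remains banned on the BF16 path.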

@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-bf16-trtllm-moe branch from 7863d16 to 2319281 Compare April 3, 2026 15:14
@nv-guomingz
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41692 [ run ] triggered by Bot. Commit: 2319281

@tensorrt-cicd
Collaborator

PR_Github #41692 [ run ] completed with state SUCCESS. Commit: 2319281
/LLM/main/L0_MergeRequest_PR pipeline #32595 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@nv-guomingz
Collaborator Author

/bot run

@nv-guomingz nv-guomingz requested a review from rosenrodt April 4, 2026 01:15
@tensorrt-cicd
Collaborator

PR_Github #41778 [ run ] triggered by Bot. Commit: 2319281

@nv-guomingz nv-guomingz changed the title [None][feat] Add bf16 trtllm moe through flashinfer. [None][feat] Add bf16 trtllm-gen moe through flashinfer. Apr 4, 2026
@nv-guomingz nv-guomingz changed the title [None][feat] Add bf16 trtllm-gen moe through flashinfer. [None][feat] Add bf16 trtllm-gen moe support through flashinfer. Apr 4, 2026
@tensorrt-cicd
Collaborator

PR_Github #41778 [ run ] completed with state FAILURE. Commit: 2319281
/LLM/main/L0_MergeRequest_PR pipeline #32673 completed with status: 'FAILURE'


@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41826 [ run ] triggered by Bot. Commit: 2319281

@tensorrt-cicd
Collaborator

PR_Github #41826 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 4/4.


@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-bf16-trtllm-moe branch from 2319281 to a65b400 Compare April 5, 2026 10:45
@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41864 [ run ] triggered by Bot. Commit: a65b400

@tensorrt-cicd
Collaborator

PR_Github #41864 [ run ] completed with state SUCCESS. Commit: a65b400
/LLM/main/L0_MergeRequest_PR pipeline #32730 completed with status: 'FAILURE'


accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus_python_scheduler[ep4-mtp_nextn=0] SKIP (https://nvbugs/5997051)
perf/test_perf_sanity.py::test_e2e[aggr_upload-deepseek_v32_fp4_blackwell-v32_fp4_tep8_mtp3_8k1k] SKIP (https://nvbugs/5997092)
accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8 SKIP (https://nvbugs/6004530)
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B_Instruct_RocketKV::test_auto_dtype SKIP (https://nvbugs/6007197)
Collaborator

Was this case, TestLlama3_1_8B_Instruct_RocketKV, added by accident? (I hit a similar rebase conflict in #12257 and reverted it in later commits.)

Collaborator Author

Thanks, removed.

@staticmethod
def _supports_flashinfer_bf16_routing_method(
        routing_method: BaseMoeRoutingMethod, ) -> bool:
    # FIXME: ban DeepSeekV3 FlashInfer trtllm_bf16_routed_moe() as it appears to have a bug
Collaborator

Note: this will be addressed by flashinfer-ai/flashinfer#2911.
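For illustration, the guard discussed above can be sketched as follows. The class names below are illustrative stand-ins for the real BaseMoeRoutingMethod hierarchy, not the actual fused-MoE module:

```python
# Illustrative sketch only: minimal routing-method classes standing in for
# the real BaseMoeRoutingMethod hierarchy.
class BaseMoeRoutingMethod:
    pass

class DeepSeekV3MoeRoutingMethod(BaseMoeRoutingMethod):
    pass

class DefaultMoeRoutingMethod(BaseMoeRoutingMethod):
    pass

def supports_flashinfer_bf16_routing_method(
        routing_method: BaseMoeRoutingMethod) -> bool:
    # Mirrors the FIXME in the PR: DeepSeekV3's FlashInfer
    # trtllm_bf16_routed_moe() appears to have a bug, so it is banned
    # until flashinfer-ai/flashinfer#2911 lands; every other routing
    # method is accepted.
    return not isinstance(routing_method, DeepSeekV3MoeRoutingMethod)
```

The separated-routing design is what makes this check so narrow: routing is computed outside the fused kernel, so only the one known-buggy method needs to be rejected.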

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-bf16-trtllm-moe branch from a65b400 to 4652726 Compare April 6, 2026 06:19
@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41914 [ run ] triggered by Bot. Commit: 4652726

@tensorrt-cicd
Collaborator

PR_Github #41914 [ run ] completed with state SUCCESS. Commit: 4652726
/LLM/main/L0_MergeRequest_PR pipeline #32773 completed with status: 'FAILURE'


@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41933 [ run ] triggered by Bot. Commit: 4652726

@tensorrt-cicd
Collaborator

PR_Github #41933 [ run ] completed with state SUCCESS. Commit: 4652726
/LLM/main/L0_MergeRequest_PR pipeline #32790 completed with status: 'FAILURE'


@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42020 [ run ] triggered by Bot. Commit: 4652726

@tensorrt-cicd
Collaborator

PR_Github #42020 [ run ] completed with state FAILURE. Commit: 4652726
/LLM/main/L0_MergeRequest_PR pipeline #32866 completed with status: 'FAILURE'


@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42078 [ run ] triggered by Bot. Commit: 4652726

@tensorrt-cicd
Collaborator

PR_Github #42078 [ run ] completed with state SUCCESS. Commit: 4652726
/LLM/main/L0_MergeRequest_PR pipeline #32917 completed with status: 'FAILURE'


@rosenrodt
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42143 [ run ] triggered by Bot. Commit: 4652726

@tensorrt-cicd
Collaborator

PR_Github #42143 [ run ] completed with state FAILURE. Commit: 4652726
/LLM/main/L0_MergeRequest_PR pipeline #32977 completed with status: 'FAILURE'


@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42210 [ run ] triggered by Bot. Commit: 4652726

@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42315 [ run ] triggered by Bot. Commit: 4652726

@tensorrt-cicd
Collaborator

PR_Github #42315 [ run ] completed with state SUCCESS. Commit: 4652726
/LLM/main/L0_MergeRequest_PR pipeline #33105 completed with status: 'FAILURE'


@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42444 [ run ] triggered by Bot. Commit: 4652726

@tensorrt-cicd
Collaborator

PR_Github #42444 [ run ] completed with state FAILURE. Commit: 4652726
/LLM/main/L0_MergeRequest_PR pipeline #33210 completed with status: 'FAILURE'


@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42485 [ run ] triggered by Bot. Commit: 4652726

@tensorrt-cicd
Collaborator

PR_Github #42485 [ run ] completed with state SUCCESS. Commit: 4652726
/LLM/main/L0_MergeRequest_PR pipeline #33235 completed with status: 'FAILURE'


@nv-guomingz
Collaborator Author

/bot run --disable-fail-fast

4 participants