[None][feat] Add bf16 trtllm-gen moe support through flashinfer. #12738
nv-guomingz wants to merge 1 commit into NVIDIA:main from
Conversation
📝 Walkthrough
This PR adds BF16 unquantized MoE execution support for the TRTLLM backend using FlashInfer. Changes introduce a new BF16 TRTLLM-Gen fused MoE method and backend dispatch path.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client as Client/Configurable MoE
participant Resolver as resolve_moe_cls()
participant Selector as get_moe_cls()
participant TRTLLMGen as TRTLLMGenFusedMoE
participant Backend as MoE Op Backend
participant FlashInfer as FlashInfer/CUDA Ops
Client->>Resolver: resolve_moe_cls(model_config, routing_method, dtype, override_quant_config)
Resolver->>Selector: get_moe_cls(model_config, routing_method, dtype, override_quant_config)
alt has_quant == True
Selector->>TRTLLMGen: Check specific quant predicates
TRTLLMGen-->>Selector: Return if match
else has_quant == False
Selector->>TRTLLMGen: Check BF16 + FlashInfer availability
alt BF16 && FlashInfer available
TRTLLMGen-->>Selector: Return TRTLLMGenFusedMoE
else Missing FlashInfer
Selector-->>Resolver: Raise RuntimeError
end
end
Resolver->>TRTLLMGen: Check routing method support for BF16 unquantized
alt Routing not supported && unquantized
Resolver->>Resolver: Fall back to CutlassFusedMoE
end
Resolver-->>Client: MoE class selected
Client->>TRTLLMGen: create_weights() / load_weights()
TRTLLMGen->>Backend: run_bf16_moe() (if BF16 unquantized)
Backend->>FlashInfer: trtllm_bf16_moe() or trtllm_bf16_routed_moe()
FlashInfer-->>Backend: Output tensor
Backend-->>TRTLLMGen: Result
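The branch structure in the diagram can be sketched as plain Python. The class names and boolean flags below are stand-ins for the real TensorRT-LLM symbols, not the actual API:

```python
class CutlassFusedMoE: ...        # stand-in for the real backend class
class TRTLLMGenFusedMoE: ...      # stand-in for the real backend class

def get_moe_cls(dtype, has_quant, quant_matches_trtllm_gen, flashinfer_available):
    """Quantized predicates are checked first, then the BF16 branch."""
    if has_quant:
        return TRTLLMGenFusedMoE if quant_matches_trtllm_gen else CutlassFusedMoE
    if dtype == "bfloat16":
        if flashinfer_available:
            return TRTLLMGenFusedMoE
        raise RuntimeError(
            "BF16 TRTLLM-Gen path requires FlashInfer trtllm_bf16_moe")
    return CutlassFusedMoE

def resolve_moe_cls(dtype, has_quant, quant_matches_trtllm_gen,
                    flashinfer_available, routing_supported):
    cls = get_moe_cls(dtype, has_quant, quant_matches_trtllm_gen,
                      flashinfer_available)
    # An unsupported routing method on the unquantized path falls back to Cutlass.
    if cls is TRTLLMGenFusedMoE and not has_quant and not routing_supported:
        return CutlassFusedMoE
    return cls
```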
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 6
🧹 Nitpick comments (3)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
5963-5973: Assert the selected MoE backend in the BF16 matrix.
This is the new coverage point for the TRTLLM BF16 MoE path, but it only checks task accuracy. A silent fallback would still pass, so please assert the resolved backend or, at minimum, llm.args.moe_config.backend.
🧪 Minimal assertion
     with LLM(self.MODEL_PATH,
              tensor_parallel_size=tp_size,
              moe_expert_parallel_size=1,
              max_seq_len=4096,
              max_batch_size=32,
              enable_chunked_prefill=True,
              kv_cache_config=kv_cache_config,
              cuda_graph_config=cuda_graph_config,
              moe_config=moe_config) as llm:
+        assert llm.args.moe_config.backend == moe_backend
         task = GSM8K(self.MODEL_NAME)
         task.evaluate(llm, extra_evaluator_kwargs=self.EXTRA_EVALUATOR_KWARGS)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py` around lines 5963 - 5973, The test currently only checks task accuracy but doesn't assert that the intended MoE backend was actually selected, allowing silent fallbacks to pass; after entering the LLM context created with MoeConfig(backend=moe_backend) and LLM(... ) as llm, add an assertion that verifies the resolved backend (for example by checking llm.args.moe_config.backend or another resolved backend field on the llm instance) equals the expected moe_backend so the BF16 MoE path is exercised and failures are caught.
5969-5973: Keep the CUDA-graph ladder aligned with the new batch cap.
max_batch_size is now 32, but the explicit capture sizes defined just above still include 64 and 128. If those entries are honored, this matrix still warms larger graphs than it can ever serve.
🤖 Prompt for AI Agents
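For reference, clipping the ladder to the cap is a one-liner. This is a sketch with a hypothetical explicit ladder, not the repo's config API:

```python
max_batch_size = 32
capture_sizes = [1, 2, 4, 8, 16, 32, 64, 128]  # hypothetical explicit capture ladder

# Drop every capture size the server can never reach.
trimmed = [s for s in capture_sizes if s <= max_batch_size]
print(trimmed)  # [1, 2, 4, 8, 16, 32]
```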
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py` around lines 5969 - 5973, The CUDA-graph capture sizes (defined in the cuda_graph_config/capture ladder) still include 64 and 128 while max_batch_size was changed to 32; update the cuda_graph_config capture sizes used by the context manager (the cuda_graph_config passed into the llm with max_batch_size=32) so none exceed 32: remove or replace the 64 and 128 entries and ensure the ladder ends at or below max_batch_size to prevent warming graphs larger than the serving cap.
tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
700-701: Consider annotating the shared cache with ClassVar.
Mutable default values share state across all instances of the class without making that sharing obvious. This can lead to bugs: changes made through one instance unexpectedly affect all other instances.
The current implementation is intentional (matching the pattern at lines 2664 and 3873), but explicitly annotating the variable with typing.ClassVar indicates that it is intended to be shared across all instances. This also silences the Ruff RUF012 warning and documents the intent.
🛠️ Suggested annotation
+from typing import ClassVar
+
 class BF16TRTLLMGenFusedMoEMethod(UnquantizedFusedMoEMethod):
     # BlockMajorK uses 128-byte K blocks. BF16 has 2 bytes per element.
     block_k = 64
     use_shuffled_weight = True
     weight_layout = TRTLLM_GEN_WEIGHT_LAYOUT_BLOCK_MAJOR_K
     needs_post_load_processing_for_dummy = True
-    _cache_permute_indices: Dict[tuple[tuple[int, ...], str, int],
-                                 torch.Tensor] = {}
+    _cache_permute_indices: ClassVar[Dict[tuple[tuple[int, ...], str, int],
+                                          torch.Tensor]] = {}
Note: The same pattern exists for
NVFP4TRTLLMGenFusedMoEBaseMethod (line 2664) and MXFP4WeightTRTLLMGenFusedMoEMethod (line 3873), which could benefit from similar annotations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py` around lines 700 - 701, Annotate the shared mutable class attribute _cache_permute_indices as a ClassVar to make the shared intent explicit and silence Ruff RUF012: import ClassVar from typing and change the annotation to ClassVar[Dict[tuple[tuple[int, ...], str, int], torch.Tensor]] while keeping the existing initializer ({}); apply the same pattern to the similar shared attributes on NVFP4TRTLLMGenFusedMoEBaseMethod and MXFP4WeightTRTLLMGenFusedMoEMethod so these mutable defaults are clearly documented as class-level shared state.
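The pitfall and the fix are easy to show with a toy class (not the repo's types):

```python
from typing import ClassVar, Dict

class PermuteCache:
    # ClassVar documents that this dict is deliberately shared by all instances.
    _cache: ClassVar[Dict[str, int]] = {}

a, b = PermuteCache(), PermuteCache()
a._cache["w1"] = 7
print(b._cache["w1"])  # 7: both instances read the same class-level dict
```

The annotation does not change runtime behavior; it marks the attribute as class-level state for type checkers and linters.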
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`:
- Around line 192-195: The code temporarily mutates
backend_model_config.skip_create_weights_in_init and
backend_model_config._frozen but does not reliably restore the original values
on all paths; capture the original values (e.g., tmp_skip_create_weights_in_init
and tmp_frozen = backend_model_config._frozen) before mutating, set
skip_create_weights_in_init = True and _frozen = False only for the operation,
and restore both original values in a finally block around the call to
create_moe_backend (and the analogous block at the 240-244 site) so exceptions
do not leak mutated state.
In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py`:
- Around line 75-82: The BF16 branch in resolve_moe_cls/create_moe currently
only checks model_config.pretrained_config.torch_dtype and ignores the call-site
dtype argument, so callers passing dtype=torch.bfloat16 won't select
TRTLLMGenFusedMoE; update both BF16 checks (the block referencing
TRTLLMGenFusedMoE around the current 75-82 and the similar branch at 97-113) to
honor the dtype parameter by treating the branch as true when dtype is
torch.bfloat16 OR model_config.pretrained_config.torch_dtype is torch.bfloat16
(while still requiring not has_quant and the FlashInfer availability check),
i.e., use a combined condition like: not has_quant and (dtype is torch.bfloat16
or getattr(model_config.pretrained_config, "torch_dtype", None) is
torch.bfloat16) before returning TRTLLMGenFusedMoE or raising the same
RuntimeError.
In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py`:
- Around line 323-325: The function signature for
_supports_flashinfer_bf16_routing_method has a hanging indent causing an E125
lint error; fix it by reformatting the parameter continuation so the closing
parenthesis aligns with the opening or move the closing parenthesis to the same
line as the last parameter, e.g. adjust the indentation of the line with
"routing_method: BaseMoeRoutingMethod, ) -> bool:" so it no longer creates a
misaligned continuation; update the def for
_supports_flashinfer_bf16_routing_method accordingly.
In `@tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py`:
- Around line 787-808: This branch unconditionally calls trtllm_bf16_moe when
router_logits is present, but TRTLLMGenFusedMoE._requires_separated_routing
documents BF16 direct-routing is inaccurate; update the guard in the block that
calls trtllm_bf16_moe to fail fast: detect BF16 direct-routing scenarios (use
routing_logits presence and routing_method_type via self.cvt_routing_method_type
or the same predicate used by TRTLLMGenFusedMoE._requires_separated_routing) and
raise a clear RuntimeError (or fallback to FP32/top-k precompute) instead of
dispatching trtllm_bf16_moe; reference trtllm_bf16_moe,
TRTLLMGenFusedMoE._requires_separated_routing, and forward_impl to align
behavior.
In `@tests/integration/test_lists/test-db/l0_b200.yml`:
- Around line 77-80: The test list includes tp2 variants (e.g.,
accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-CUTLASS],
::test_bf16[tp2-TRTLLM], ::test_fp8[tp2]) but the surrounding job is constrained
to a single GPU (system_gpu_count: 1), causing scheduling failures; update the
entries to match the single-GPU lane by either removing the tp2 variants or
replacing them with their tp1 equivalents (e.g., change tp2-* to tp1-* or drop
those lines) so the listed tests are compatible with the 1-GPU configuration.
In `@tests/unittest/_torch/modules/moe/moe_test_utils.py`:
- Around line 238-240: The TRTLLM C++ routing-kernel skips are being applied to
the BF16 FlashInfer path because quant_algo==None still reaches
should_skip_trtllm(); fix by guarding the old C++ routing-kernel checks so they
only run for quantized paths (i.e., require quant_algo is not None and/or
quant_algo in trtllm_gen_quant_algos) before calling should_skip_trtllm(), or
move the BF16 (quant_algo is None) BF16 block ahead of the C++ routing skips;
update the logic around quant_algo, trtllm_gen_quant_algos and
should_skip_trtllm to ensure DefaultMoeRoutingMethod, MiniMaxM2MoeRoutingMethod
and the Llama4 top-k restriction are not applied to the BF16 FlashInfer path
(which should only ban DeepSeekV3).
---
Nitpick comments:
In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py`:
- Around line 700-701: Annotate the shared mutable class attribute
_cache_permute_indices as a ClassVar to make the shared intent explicit and
silence Ruff RUF012: import ClassVar from typing and change the annotation to
ClassVar[Dict[tuple[tuple[int, ...], str, int], torch.Tensor]] while keeping the
existing initializer ({}); apply the same pattern to the similar shared
attributes on NVFP4TRTLLMGenFusedMoEBaseMethod and
MXFP4WeightTRTLLMGenFusedMoEMethod so these mutable defaults are clearly
documented as class-level shared state.
In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py`:
- Around line 5963-5973: The test currently only checks task accuracy but
doesn't assert that the intended MoE backend was actually selected, allowing
silent fallbacks to pass; after entering the LLM context created with
MoeConfig(backend=moe_backend) and LLM(... ) as llm, add an assertion that
verifies the resolved backend (for example by checking
llm.args.moe_config.backend or another resolved backend field on the llm
instance) equals the expected moe_backend so the BF16 MoE path is exercised and
failures are caught.
- Around line 5969-5973: The CUDA-graph capture sizes (defined in the
cuda_graph_config/capture ladder) still include 64 and 128 while max_batch_size
was changed to 32; update the cuda_graph_config capture sizes used by the
context manager (the cuda_graph_config passed into the llm with
max_batch_size=32) so none exceed 32—remove or replace the 64 and 128 entries
and ensure the ladder ends at or below max_batch_size to prevent warming graphs
larger than the serving cap.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5bec131a-2811-443f-ba6b-06d59f667b43
📒 Files selected for processing (11)
- tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
- tensorrt_llm/_torch/modules/fused_moe/create_moe.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tests/integration/test_lists/qa/llm_function_core.txt
- tests/integration/test_lists/qa/llm_function_core_sanity.txt
- tests/integration/test_lists/test-db/l0_b200.yml
- tests/integration/test_lists/waives.txt
- tests/unittest/_torch/modules/moe/moe_test_utils.py
tmp_skip_create_weights_in_init = backend_model_config.skip_create_weights_in_init
backend_model_config._frozen = False
backend_model_config.skip_create_weights_in_init = True
backend_model_config._frozen = True
Config freeze/skip flags are not safely restored across all paths.
Line 193 and Line 195 force _frozen=True instead of restoring the original frozen state, and if create_moe_backend(...) throws, skip_create_weights_in_init is left mutated. This can leak state into subsequent layer construction.
Suggested fix
- tmp_skip_create_weights_in_init = backend_model_config.skip_create_weights_in_init
- backend_model_config._frozen = False
- backend_model_config.skip_create_weights_in_init = True
- backend_model_config._frozen = True
-
- backend = create_moe_backend(
+ original_skip_create_weights_in_init = backend_model_config.skip_create_weights_in_init
+ original_frozen = backend_model_config._frozen
+ try:
+ backend_model_config._frozen = False
+ backend_model_config.skip_create_weights_in_init = True
+ backend_model_config._frozen = original_frozen
+
+ backend = create_moe_backend(
moe_cls=moe_cls,
routing_method=routing_method,
num_experts=self.num_experts,
hidden_size=self.hidden_size,
intermediate_size=self.intermediate_size,
dtype=self.dtype,
reduce_results=self.reduce_results,
model_config=backend_model_config,
aux_stream_dict=self.aux_stream_dict,
weight_loading_mode=self.weight_loading_mode,
bias=kwargs.get("bias", False),
apply_router_weight_on_input=self.apply_router_weight_on_input,
layer_idx=None,
swiglu_alpha=kwargs.get("swiglu_alpha"),
swiglu_beta=kwargs.get("swiglu_beta"),
swiglu_limit=kwargs.get("swiglu_limit"),
init_load_balancer=False,
without_comm=True,
activation_type=self.activation_type,
- )
+ )
+ finally:
+ backend_model_config._frozen = False
+ backend_model_config.skip_create_weights_in_init = original_skip_create_weights_in_init
+ backend_model_config._frozen = original_frozen
@@
- backend_model_config._frozen = False
- backend_model_config.skip_create_weights_in_init = tmp_skip_create_weights_in_init
- backend_model_config._frozen = True
- if not backend_model_config.skip_create_weights_in_init:
+ if not original_skip_create_weights_in_init:
          self.backend.create_weights()
Also applies to: 240-244
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py` around lines 192 -
195, The code temporarily mutates
backend_model_config.skip_create_weights_in_init and
backend_model_config._frozen but does not reliably restore the original values
on all paths; capture the original values (e.g., tmp_skip_create_weights_in_init
and tmp_frozen = backend_model_config._frozen) before mutating, set
skip_create_weights_in_init = True and _frozen = False only for the operation,
and restore both original values in a finally block around the call to
create_moe_backend (and the analogous block at the 240-244 site) so exceptions
do not leak mutated state.
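The restore discipline described above can also be packaged as a context manager. This is a hedged sketch: `FakeConfig` and its attributes merely mimic the real config fields, and `create_moe_backend` is simulated by raising from inside the `with` block.

```python
from contextlib import contextmanager

class FakeConfig:
    """Toy stand-in for the backend model config."""
    def __init__(self):
        self.skip_create_weights_in_init = False
        self._frozen = True

@contextmanager
def temporarily_skip_weight_init(cfg):
    orig_skip, orig_frozen = cfg.skip_create_weights_in_init, cfg._frozen
    try:
        cfg._frozen = False                      # unfreeze to mutate
        cfg.skip_create_weights_in_init = True
        cfg._frozen = orig_frozen                # restore frozen state for the call
        yield cfg
    finally:
        # Runs on success AND on exception, so no mutated state leaks out.
        cfg._frozen = False
        cfg.skip_create_weights_in_init = orig_skip
        cfg._frozen = orig_frozen

cfg = FakeConfig()
try:
    with temporarily_skip_weight_init(cfg):
        raise RuntimeError("create_moe_backend failed")  # simulated failure path
except RuntimeError:
    pass
# Both flags are back to their originals even though the body raised.
print(cfg.skip_create_weights_in_init, cfg._frozen)  # False True
```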
if not has_quant and model_config.pretrained_config is not None and getattr(
        model_config.pretrained_config, "torch_dtype",
        None) == torch.bfloat16:
    if TRTLLMGenFusedMoE._is_flashinfer_fused_moe_available():
        return TRTLLMGenFusedMoE
    raise RuntimeError(
        "TRTLLMGenFusedMoE BF16 path requires FlashInfer fused MoE with "
        "trtllm_bf16_moe support, but it is not available.")
Honor call-site dtype overrides when resolving the BF16 backend.
resolve_moe_cls() takes dtype, but the unquantized BF16 branch still keys off model_config.pretrained_config.torch_dtype only. A caller doing create_moe(..., dtype=torch.bfloat16) will therefore fall back to CutlassFusedMoE whenever the pretrained config is unset or still says float16, so the new FlashInfer-backed TRTLLM path never gets selected.
Also applies to: 97-113
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py` around lines 75 - 82,
The BF16 branch in resolve_moe_cls/create_moe currently only checks
model_config.pretrained_config.torch_dtype and ignores the call-site dtype
argument, so callers passing dtype=torch.bfloat16 won't select
TRTLLMGenFusedMoE; update both BF16 checks (the block referencing
TRTLLMGenFusedMoE around the current 75-82 and the similar branch at 97-113) to
honor the dtype parameter by treating the branch as true when dtype is
torch.bfloat16 OR model_config.pretrained_config.torch_dtype is torch.bfloat16
(while still requiring not has_quant and the FlashInfer availability check),
i.e., use a combined condition like: not has_quant and (dtype is torch.bfloat16
or getattr(model_config.pretrained_config, "torch_dtype", None) is
torch.bfloat16) before returning TRTLLMGenFusedMoE or raising the same
RuntimeError.
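The combined condition reduces to a small predicate (a sketch using string dtypes in place of torch dtypes):

```python
def wants_bf16_trtllm(has_quant, call_dtype, pretrained_dtype):
    """True when the unquantized BF16 TRTLLM-Gen branch should be taken."""
    return not has_quant and "bfloat16" in (call_dtype, pretrained_dtype)

print(wants_bf16_trtllm(False, "bfloat16", "float16"))   # True: call-site dtype wins
print(wants_bf16_trtllm(False, None, "bfloat16"))        # True: config dtype
print(wants_bf16_trtllm(True, "bfloat16", "bfloat16"))   # False: quantized path
```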
def _supports_flashinfer_bf16_routing_method(
        routing_method: BaseMoeRoutingMethod, ) -> bool:
    # FIXME: ban DeepSeekV3 FlashInfer trtllm_bf16_routed_moe() as it appears to have bug
Fix the hanging indent in _supports_flashinfer_bf16_routing_method.
Flake8 is already reporting E125 here, so the lint job will remain red until the continuation line is re-indented or the closing parenthesis is moved.
Minimal fix
 @staticmethod
 def _supports_flashinfer_bf16_routing_method(
-        routing_method: BaseMoeRoutingMethod, ) -> bool:
+        routing_method: BaseMoeRoutingMethod,
+) -> bool:
📝 Committable suggestion
+ ) -> bool:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@staticmethod
def _supports_flashinfer_bf16_routing_method(
        routing_method: BaseMoeRoutingMethod,
) -> bool:
    # FIXME: ban DeepSeekV3 FlashInfer trtllm_bf16_routed_moe() as it appears to have bug
🧰 Tools
🪛 Flake8 (7.3.0)
[error] 324-324: continuation line with same indent as next logical line
(E125)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` around lines
323 - 325, The function signature for _supports_flashinfer_bf16_routing_method
has a hanging indent causing an E125 lint error; fix it by reformatting the
parameter continuation so the closing parenthesis aligns with the opening or
move the closing parenthesis to the same line as the last parameter, e.g. adjust
the indentation of the line with "routing_method: BaseMoeRoutingMethod, ) ->
bool:" so it no longer creates a misaligned continuation; update the def for
_supports_flashinfer_bf16_routing_method accordingly.
if router_logits is not None:
    result = self._fused_moe.trtllm_bf16_moe(
        routing_logits=router_logits,
        routing_bias=routing_bias,
        hidden_states=hidden_states,
        gemm1_weights=gemm1_weights,
        gemm2_weights=gemm2_weights,
        num_experts=num_experts,
        top_k=top_k,
        n_group=n_group,
        topk_group=topk_group,
        intermediate_size=intermediate_size,
        local_expert_offset=local_expert_offset,
        local_num_experts=local_num_experts,
        routed_scaling_factor=routed_scaling_factor,
        routing_method_type=self.cvt_routing_method_type(routing_method_type),
        use_shuffled_weight=use_shuffled_weight,
        weight_layout=weight_layout,
        do_finalize=do_finalize,
        enable_pdl=enable_pdl,
        tune_max_num_tokens=tune_max_num_tokens,
    )
Fail fast on the BF16 direct-routing kernel for now.
TRTLLMGenFusedMoE._requires_separated_routing() now documents that FlashInfer BF16 direct routing has an accuracy bug, but this branch still dispatches trtllm_bf16_moe() whenever router_logits is present. TRTLLMGenFusedMoE.forward_impl() only precomputes top-k on the post-comm path, so single-GPU / legacy callers can still hit this kernel and get silent miscomputations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py` around lines 787 -
808, This branch unconditionally calls trtllm_bf16_moe when router_logits is
present, but TRTLLMGenFusedMoE._requires_separated_routing documents BF16
direct-routing is inaccurate; update the guard in the block that calls
trtllm_bf16_moe to fail fast: detect BF16 direct-routing scenarios (use
routing_logits presence and routing_method_type via self.cvt_routing_method_type
or the same predicate used by TRTLLMGenFusedMoE._requires_separated_routing) and
raise a clear RuntimeError (or fallback to FP32/top-k precompute) instead of
dispatching trtllm_bf16_moe; reference trtllm_bf16_moe,
TRTLLMGenFusedMoE._requires_separated_routing, and forward_impl to align
behavior.
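A minimal shape for that guard (hypothetical helper; the real predicate lives on `TRTLLMGenFusedMoE._requires_separated_routing`):

```python
def dispatch_bf16(router_logits, requires_separated_routing):
    """Fail fast instead of dispatching the known-inaccurate direct-routing kernel."""
    if router_logits is not None and requires_separated_routing:
        raise RuntimeError(
            "FlashInfer BF16 direct routing is inaccurate for this routing method; "
            "precompute top-k and call the routed kernel instead.")
    # With precomputed top-k (router_logits is None), use the routed kernel.
    return "trtllm_bf16_moe" if router_logits is not None else "trtllm_bf16_routed_moe"

print(dispatch_bf16(None, True))         # trtllm_bf16_routed_moe
print(dispatch_bf16([0.3, 0.7], False))  # trtllm_bf16_moe
```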
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-CUTLASS]
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-TRTLLM]
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp1]
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp2]
tp2 cases are inconsistent with this single-GPU lane.
Line 77–Line 80 add tp2 test IDs under a condition constrained to exactly 1 GPU (system_gpu_count: 1). This is likely to fail scheduling/execution in pre-merge l0_b200.
Suggested fix
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp1-CUTLASS]
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp1-TRTLLM]
- - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-CUTLASS]
- - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-TRTLLM]
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp1]
- - accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp2]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[tp1]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration/test_lists/test-db/l0_b200.yml` around lines 77 - 80, The
test list includes tp2 variants (e.g.,
accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_bf16[tp2-CUTLASS],
::test_bf16[tp2-TRTLLM], ::test_fp8[tp2]) but the surrounding job is constrained
to a single GPU (system_gpu_count: 1), causing scheduling failures; update the
entries to match the single-GPU lane by either removing the tp2 variants or
replacing them with their tp1 equivalents (e.g., change tp2-* to tp1-* or drop
those lines) so the listed tests are compatible with the 1-GPU configuration.
# Quant_algo==None (BF16 path) also falls through and must meet the should_skip_trtllm criteria
if quant_algo is not None and quant_algo not in trtllm_gen_quant_algos:
    return None
Don't apply the TRTLLM C++ routing skips to the BF16 FlashInfer path.
quant_algo is None now reaches should_skip_trtllm(), but the older C++ routing-kernel guard still runs before this BF16 block. That still skips BF16 TRTLLM cases for routing methods like DefaultMoeRoutingMethod, MiniMaxM2MoeRoutingMethod, and the Llama4 top-k restriction, even though the BF16 FlashInfer path now uses separated routing and only bans DeepSeekV3. Most of the new BF16 coverage stays disabled.
Also applies to: 263-272
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unittest/_torch/modules/moe/moe_test_utils.py` around lines 238 - 240,
The TRTLLM C++ routing-kernel skips are being applied to the BF16 FlashInfer
path because quant_algo==None still reaches should_skip_trtllm(); fix by
guarding the old C++ routing-kernel checks so they only run for quantized paths
(i.e., require quant_algo is not None and/or quant_algo in
trtllm_gen_quant_algos) before calling should_skip_trtllm(), or move the BF16
(quant_algo is None) BF16 block ahead of the C++ routing skips; update the logic
around quant_algo, trtllm_gen_quant_algos and should_skip_trtllm to ensure
DefaultMoeRoutingMethod, MiniMaxM2MoeRoutingMethod and the Llama4 top-k
restriction are not applied to the BF16 FlashInfer path (which should only ban
DeepSeekV3).
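The guard the prompt asks for amounts to ordering the checks (toy predicate and algo names, not the test helper's real API):

```python
def should_skip(quant_algo, routing_method,
                trtllm_gen_quant_algos=("NVFP4", "MXFP4"),
                cpp_banned=("DefaultMoeRoutingMethod", "MiniMaxM2MoeRoutingMethod")):
    # BF16 FlashInfer path: only DeepSeekV3 routing is banned.
    if quant_algo is None:
        return routing_method == "DeepSeekV3MoeRoutingMethod"
    if quant_algo not in trtllm_gen_quant_algos:
        return True
    # C++ routing-kernel restrictions apply to quantized paths only.
    return routing_method in cpp_banned

print(should_skip(None, "DefaultMoeRoutingMethod"))     # False: allowed on BF16 path
print(should_skip(None, "DeepSeekV3MoeRoutingMethod"))  # True: BF16 ban
print(should_skip("NVFP4", "DefaultMoeRoutingMethod"))  # True: C++ routing restriction
```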
Force-pushed 7863d16 to 2319281

/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #41692 [ run ] triggered by Bot. Commit:
PR_Github #41692 [ run ] completed with state

/bot run
PR_Github #41778 [ run ] triggered by Bot. Commit:
PR_Github #41778 [ run ] completed with state

/bot run
PR_Github #41826 [ run ] triggered by Bot. Commit:
PR_Github #41826 [ run ] completed with state

Force-pushed 2319281 to a65b400

/bot run
PR_Github #41864 [ run ] triggered by Bot. Commit:
PR_Github #41864 [ run ] completed with state
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus_python_scheduler[ep4-mtp_nextn=0] SKIP (https://nvbugs/5997051)
perf/test_perf_sanity.py::test_e2e[aggr_upload-deepseek_v32_fp4_blackwell-v32_fp4_tep8_mtp3_8k1k] SKIP (https://nvbugs/5997092)
accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8 SKIP (https://nvbugs/6004530)
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B_Instruct_RocketKV::test_auto_dtype SKIP (https://nvbugs/6007197)
Was this TestLlama3_1_8B_Instruct_RocketKV case added by accident? (I had a similar rebase conflict in #12257 and reverted it in later commits.)
Thanks, removed.
@staticmethod
def _supports_flashinfer_bf16_routing_method(
        routing_method: BaseMoeRoutingMethod, ) -> bool:
    # FIXME: ban DeepSeekV3 FlashInfer trtllm_bf16_routed_moe() as it appears to have bug
Note: this will be addressed by flashinfer-ai/flashinfer#2911.
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Force-pushed a65b400 to 4652726
/bot run
PR_Github #41914 [ run ] triggered by Bot. Commit:
PR_Github #41914 [ run ] completed with state

/bot run
PR_Github #41933 [ run ] triggered by Bot. Commit:
PR_Github #41933 [ run ] completed with state

/bot run
PR_Github #42020 [ run ] triggered by Bot. Commit:
PR_Github #42020 [ run ] completed with state

/bot run
PR_Github #42078 [ run ] triggered by Bot. Commit:
PR_Github #42078 [ run ] completed with state

/bot run
PR_Github #42143 [ run ] triggered by Bot. Commit:
PR_Github #42143 [ run ] completed with state

/bot run
PR_Github #42210 [ run ] triggered by Bot. Commit:

/bot run
PR_Github #42315 [ run ] triggered by Bot. Commit:
PR_Github #42315 [ run ] completed with state

/bot run
PR_Github #42444 [ run ] triggered by Bot. Commit:
PR_Github #42444 [ run ] completed with state

/bot run
PR_Github #42485 [ run ] triggered by Bot. Commit:
PR_Github #42485 [ run ] completed with state

/bot run --disable-fail-fast
Summary by CodeRabbit
Release Notes
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.