Feat: Trtllm-gen MxFP8 MoE integration #2505
Conversation
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
📝 Walkthrough

Adds an FP8 quantization enum and MxFP8 support across Python, C++ launchers, benchmarks, and tests.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Summary of Changes

Hello @IwakuraRein, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the TensorRT-LLM fused Mixture-of-Experts (MoE) implementation by integrating MxFP8 quantization. This provides a new, flexible FP8 quantization option alongside the existing DeepSeek FP8, allowing fine-grained control over mixed-precision computation. The changes span core kernel logic, benchmarking, and testing, ensuring that the new quantization mode is robustly supported and validated across the system.
Activity
Code Review
This pull request integrates mxfp8 support into the trtllm fused MoE kernels. The changes are extensive, touching benchmark scripts, C++ kernel launchers, and Python bindings. The introduction of Fp8QuantizationType is a good refactoring that makes the code more extensible. The tests have also been updated to cover the new quantization modes.
My review focuses on improving code maintainability by reducing duplication in the benchmark scripts and C++ kernel launcher. I've also pointed out some leftover debugging code and minor issues that should be addressed before merging.
```python
print(f"No autotune: {ms:.3f} ms; with autotune: {ms_tuned:.3f} ms")
```

```python
def bench_trtllm_gen_fused_moe_autotuner_mxint4(
```
This function bench_trtllm_gen_fused_moe_autotuner_mxint4 is very similar to bench_trtllm_gen_fused_moe_autotuner_fp8 and bench_trtllm_gen_fused_moe_autotuner_fp4. To improve maintainability and reduce code duplication, consider refactoring these into a more generic benchmark function or a base class. This could accept quantization functions and the specific MoE kernel as parameters, centralizing the common benchmarking logic.
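As a sketch of that suggested refactor (the names `bench_fused_moe_autotuner`, `quantize_fn`, and `moe_fn` are hypothetical placeholders, not existing flashinfer APIs), the shared timing logic could be factored out roughly like this:

```python
# Hypothetical generic benchmark driver: the per-mode quantization setup and
# the MoE kernel are passed in as callables, so the fp8/fp4/mxint4 variants
# only differ in the arguments they supply.
import time
from typing import Any, Callable


def bench_fused_moe_autotuner(
    quantize_fn: Callable[[Any], Any],
    moe_fn: Callable[[Any], Any],
    inputs: Any,
    warmup: int = 3,
    iters: int = 10,
) -> float:
    """Quantize once, warm up, then return mean latency per call in ms."""
    q_inputs = quantize_fn(inputs)
    for _ in range(warmup):
        moe_fn(q_inputs)
    start = time.perf_counter()
    for _ in range(iters):
        moe_fn(q_inputs)
    return (time.perf_counter() - start) * 1e3 / iters
```

Each existing `bench_trtllm_gen_fused_moe_autotuner_*` function would then reduce to one call with its mode-specific callables.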
```diff
   FusedMoeLauncher::check_moe_common();

   TVM_FFI_ICHECK_EQ(hidden_states.dtype(), dl_float8_e4m3fn) << "hidden_states must be fp8.";
-  TVM_FFI_ICHECK_EQ(hidden_states_scale.dtype(), dl_float32)
-      << "hidden_states_scale must be float.";
-  TVM_FFI_ICHECK_EQ(hidden_states_scale.ndim(), 2) << "hidden_states_scale must be 2D.";
-  TVM_FFI_ICHECK_EQ(hidden_states_scale.size(0), hidden_states.size(1) / 128)
-      << "hidden_states_scale dim0 must match hidden_states dim1 / 128.";
-  TVM_FFI_ICHECK_EQ(hidden_states_scale.size(1), args->num_tokens)
-      << "hidden_states_scale dim1 must match num_tokens.";
+  if (quantization_type == Fp8QuantizationType::DeepSeekFp8) {
+    TVM_FFI_ICHECK_EQ(hidden_states_scale.dtype(), dl_float32)
+        << "hidden_states_scale must be float.";
+    TVM_FFI_ICHECK_EQ(hidden_states_scale.ndim(), 2) << "hidden_states_scale must be 2D.";
+    TVM_FFI_ICHECK_EQ(hidden_states_scale.size(0), hidden_states.size(1) / 128)
+        << "hidden_states_scale dim0 must match hidden_states dim1 / 128.";
+    TVM_FFI_ICHECK_EQ(hidden_states_scale.size(1), args->num_tokens)
+        << "hidden_states_scale dim1 must match num_tokens.";
+  } else if (quantization_type == Fp8QuantizationType::MxFp8) {
+    TVM_FFI_ICHECK_EQ(hidden_states_scale.dtype(), dl_uint8);
+  }

   TVM_FFI_ICHECK_EQ(gemm1_weights.dtype(), dl_float8_e4m3fn) << "gemm1_weights must be fp8.";
   TVM_FFI_ICHECK_EQ(gemm2_weights.dtype(), dl_float8_e4m3fn) << "gemm2_weights must be fp8.";

-  TVM_FFI_ICHECK_EQ(gemm1_weights_scale.dtype(), dl_float32)
-      << "gemm1_weights_scale must be float.";
-  TVM_FFI_ICHECK_EQ(gemm1_weights_scale.ndim(), 3) << "gemm1_weights_scale must be 3D.";
-  TVM_FFI_ICHECK_EQ(gemm1_weights_scale.size(0), args->local_num_experts)
-      << "gemm1_weights_scale has incorrect shape.";
-  TVM_FFI_ICHECK_EQ(args->intermediate_size % 128, 0)
-      << "intermediate_size must be a multiple of 128.";
-  TVM_FFI_ICHECK_EQ(gemm1_weights_scale.size(1), 2 * args->intermediate_size / 128)
-      << "gemm1_weights_scale has incorrect shape.";
-  TVM_FFI_ICHECK_EQ(gemm1_weights_scale.size(2), args->hidden_size / 128)
-      << "gemm1_weights_scale has incorrect shape.";
+  if (quantization_type == Fp8QuantizationType::DeepSeekFp8) {
+    TVM_FFI_ICHECK_EQ(gemm1_weights_scale.dtype(), dl_float32)
+        << "gemm1_weights_scale must be float.";
+    TVM_FFI_ICHECK_EQ(gemm1_weights_scale.ndim(), 3) << "gemm1_weights_scale must be 3D.";
+    TVM_FFI_ICHECK_EQ(gemm1_weights_scale.size(0), args->local_num_experts)
+        << "gemm1_weights_scale has incorrect shape.";
+    TVM_FFI_ICHECK_EQ(args->intermediate_size % 128, 0)
+        << "intermediate_size must be a multiple of 128.";
+    TVM_FFI_ICHECK_EQ(gemm1_weights_scale.size(1), 2 * args->intermediate_size / 128)
+        << "gemm1_weights_scale has incorrect shape.";
+    TVM_FFI_ICHECK_EQ(gemm1_weights_scale.size(2), args->hidden_size / 128)
+        << "gemm1_weights_scale has incorrect shape.";
+  } else if (quantization_type == Fp8QuantizationType::MxFp8) {
+    TVM_FFI_ICHECK_EQ(gemm1_weights_scale.dtype(), dl_uint8)
+        << "gemm1_weights_scale must be uint8.";
+  }

-  TVM_FFI_ICHECK_EQ(gemm2_weights_scale.dtype(), dl_float32)
-      << "gemm2_weights_scale must be float.";
-  TVM_FFI_ICHECK_EQ(gemm2_weights_scale.ndim(), 3) << "gemm2_weights_scale must be 3D.";
-  TVM_FFI_ICHECK_EQ(gemm2_weights_scale.size(0), args->local_num_experts)
-      << "gemm2_weights_scale has incorrect shape.";
-  TVM_FFI_ICHECK_EQ(gemm2_weights_scale.size(1), args->hidden_size / 128)
-      << "gemm2_weights_scale has incorrect shape.";
-  TVM_FFI_ICHECK_EQ(gemm2_weights_scale.size(2), args->intermediate_size / 128)
-      << "gemm2_weights_scale has incorrect shape.";
+  if (quantization_type == Fp8QuantizationType::DeepSeekFp8) {
+    TVM_FFI_ICHECK_EQ(gemm2_weights_scale.dtype(), dl_float32)
+        << "gemm2_weights_scale must be float.";
+    TVM_FFI_ICHECK_EQ(gemm2_weights_scale.ndim(), 3) << "gemm2_weights_scale must be 3D.";
+    TVM_FFI_ICHECK_EQ(gemm2_weights_scale.size(0), args->local_num_experts)
+        << "gemm2_weights_scale has incorrect shape.";
+    TVM_FFI_ICHECK_EQ(gemm2_weights_scale.size(1), args->hidden_size / 128)
+        << "gemm2_weights_scale has incorrect shape.";
+    TVM_FFI_ICHECK_EQ(gemm2_weights_scale.size(2), args->intermediate_size / 128)
+        << "gemm2_weights_scale has incorrect shape.";
+  } else if (quantization_type == Fp8QuantizationType::MxFp8) {
+    TVM_FFI_ICHECK_EQ(gemm2_weights_scale.dtype(), dl_uint8)
+        << "gemm2_weights_scale must be uint8.";
+  }

   check_weights_shape("gemm1");
   check_weights_shape("gemm2");
-  TVM_FFI_ICHECK_EQ(args->intermediate_size % 128, 0)
-      << "intermediate_size must be a multiple of 128.";
+  if (quantization_type == Fp8QuantizationType::DeepSeekFp8) {
+    TVM_FFI_ICHECK_EQ(args->intermediate_size % 128, 0)
+        << "intermediate_size must be a multiple of 128.";
+  }
 }
```
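For readers less fluent in the TVM-FFI check macros, here is a hedged Python mirror of the per-mode `hidden_states_scale` validation above (the function name and the plain string/tuple arguments are illustrative; the launcher operates on DLPack tensors, not these values):

```python
def check_hidden_states_scale(dtype, shape, quant_type, hidden_size, num_tokens):
    """Validate the activation-scale tensor layout for each FP8 mode."""
    if quant_type == "DeepSeekFp8":
        # Per-128-channel fp32 scales, laid out [hidden_size / 128, num_tokens].
        assert dtype == "float32", "hidden_states_scale must be float."
        assert len(shape) == 2, "hidden_states_scale must be 2D."
        assert shape[0] == hidden_size // 128, "dim0 must be hidden_size / 128."
        assert shape[1] == num_tokens, "dim1 must match num_tokens."
    elif quant_type == "MxFp8":
        # MxFP8 packs E8M0 scale bytes; only the dtype is checked here.
        assert dtype == "uint8", "hidden_states_scale must be uint8."
    else:
        raise ValueError(f"unknown quantization type: {quant_type}")
```

This makes the asymmetry explicit: DeepSeek FP8 validates dtype and full shape, while the MxFP8 branch only checks the dtype.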
Hi @IwakuraRein. Currently we use this in sglang; however, it seems we are missing a cubin for some dims. I built from source on this branch at commit 1dc688d. Context: we are building the sglang MXFP8 trtllm_moe runner along with the mm_mxfp8 flashinfer modelopt linear, so this would be quite useful. If it turns out that my usage is wrong, that's user error, but even after inspecting the cubins, it seems this shape should be available. Do you have any ideas? Should there be a tileSize=64 cubin?
@vincentzed Hi. There are tile size 64 cubins for mxfp8. I tried your problem shape and cannot reproduce the error. Could you try pulling the latest commit? 1dc688d won't compile due to a typo, so maybe flashinfer is using the old JIT cache.
Force-pushed from 0adc056 to aae1719
/bot run

[CANCELING] Pipeline #43998281: canceled
⚠️ Outside diff range comments (1)
csrc/trtllm_fused_moe_kernel_launcher.cu (1)
1079-1105: ⚠️ Potential issue | 🔴 Critical

`getValidConfigs` uses the wrong Runner constructor for MxFp8, causing a config mismatch with the runtime.

For MxFp8, `prepare_moe_common` (lines 326–335) constructs the Runner with the two-dtype constructor (passing `mDtypeAct`, `mDtypeWeights`, `activation_type`) when the condition `E4m3 && E4m3 && mUseDeepSeekFp8` is false. However, `getValidConfigs` always uses the weights-only constructor (lines 1091–1094), regardless of `quantization_type`. This means config enumeration and the actual kernel runner see different valid config sets, which is the root cause of "No kernel found" errors at runtime.

Proposed fix: branch `getValidConfigs` to match the `prepare_moe_common` logic:
```diff
 for (int32_t tile_N : selected_tile_nums) {
-  auto moe_runner = std::make_unique<tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner>(
-      dtype_weights,  // dtype_weights for DeepSeek FP8
-      quantization_type == Fp8QuantizationType::DeepSeekFp8,  // useDeepSeekFp8
-      tile_N, use_shuffled_weight, static_cast<batchedGemm::gemm::MatrixLayout>(weight_layout));
+  std::unique_ptr<tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner> moe_runner;
+  if (quantization_type == Fp8QuantizationType::DeepSeekFp8) {
+    moe_runner = std::make_unique<tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner>(
+        dtype_weights, true /* useDeepSeekFp8 */, tile_N, use_shuffled_weight,
+        static_cast<batchedGemm::gemm::MatrixLayout>(weight_layout));
+  } else {
+    // MxFp8: match two-dtype constructor from prepare_moe_common
+    moe_runner = std::make_unique<tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner>(
+        dtype_weights, dtype_weights, false /* useDeepSeekFp8 */, tile_N,
+        ActivationType::Swiglu, use_shuffled_weight,
+        static_cast<batchedGemm::gemm::MatrixLayout>(weight_layout));
+  }
   auto cfgs = moe_runner->getValidConfigIndices(top_k, hidden_size, intermediate_size,
                                                 num_local_experts, num_tokens);
```
🧹 Nitpick comments (2)
csrc/trtllm_fused_moe_kernel_launcher.cu (2)
1004-1012: MxFp8 path does not explicitly set `workspace.activation_output` / `workspace.activation_output_scale`.

Only the DeepSeekFp8 branch (lines 1007–1010) assigns these workspace pointers; the MxFp8 path relies on implicit zero-initialization. Consider explicitly setting them to `nullptr` to be safe against future refactors where `prepare_moe` might be re-entered or the workspace partially reused.

Proposed fix:
```diff
 if (quantization_type == Fp8QuantizationType::DeepSeekFp8) {
   workspace.activation_output = activation_output.data_ptr();
   workspace.activation_output_scale = static_cast<float*>(activation_output_scale.data_ptr());
+} else {
+  workspace.activation_output = nullptr;
+  workspace.activation_output_scale = nullptr;
 }
```
1006-1006: `static_cast<float*>` on a `dl_uint8` tensor for MxFp8: type mismatch in the workspace pointer.

For MxFp8, `gemm1_output_scale` is allocated as `dl_uint8` (line 990), but line 1006 unconditionally casts it to `float*`. The kernel likely consumes only the raw address, but the cast is misleading and could mask bugs if the workspace struct gains type safety. Consider a `void*` intermediate or a comment noting the intentional reinterpretation.
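To make the uint8 scale layout concrete: in MxFP8, each block of 32 fp8 (e4m3) elements shares a power-of-two scale stored as a single biased-exponent byte (E8M0), which is why the scale tensors discussed above are `dl_uint8` rather than `float`. A rough reference sketch, following the OCP MX convention (the kernel's exact rounding and saturation behavior may differ, and exponent clamping to the E8M0 range is omitted here):

```python
import math

FP8_E4M3_MAX = 448.0   # largest finite e4m3 magnitude
BLOCK = 32             # MX block size
E8M0_BIAS = 127        # bias for the power-of-two scale byte


def quantize_mxfp8_block(vals):
    """Quantize one 32-element block: returns (scale_byte, scaled values).

    The shared scale is 2^(floor(log2(amax)) - floor(log2(FP8_E4M3_MAX))),
    stored as one biased-exponent byte. Rounding of elements to actual
    e4m3 codes is omitted; out-of-range elements saturate.
    """
    assert len(vals) == BLOCK
    amax = max(abs(v) for v in vals)
    if amax == 0.0:
        return E8M0_BIAS, [0.0] * BLOCK  # scale = 2^0 for an all-zero block
    exp = math.floor(math.log2(amax)) - math.floor(math.log2(FP8_E4M3_MAX))
    scale = 2.0 ** exp
    scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in vals]
    return exp + E8M0_BIAS, scaled


def dequantize_mxfp8_block(scale_byte, scaled):
    """Invert the block quantization (up to e4m3 rounding/saturation)."""
    scale = 2.0 ** (scale_byte - E8M0_BIAS)
    return [v * scale for v in scaled]
```

Since the scale byte is just a biased exponent, passing the scale buffer around as a raw address is workable, but the `float*` cast flagged above hides that it never holds floats in the MxFP8 path.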
Force-pushed from 3e0dbdd to 03cac02
/bot run

[FAILED] Pipeline #44028049: 14/20 passed
Hey @IwakuraRein, we want to use it with Nemotron models.

Hi @danisereb, currently the cubins for Relu2 are not generated yet. We can add them in another PR.
📌 Description

@HumansAnd #2505 implements mxfp8 for the trtllm backend. However, in SGLang, `--moe-runner-backend flashinfer_trtllm` bypasses SGLang's topk implementation and does not work with expert routing replay in MoE RL. We want to implement `mxfp8 x mxfp8` for `cutlass_fused_moe`, which works with MoE RL training. This PR mainly reuses the existing code path for `WMxfp4AMxfp8Quant`: https://github.com/flashinfer-ai/flashinfer/blob/952b6ab2838d676b4257fcc23bb00f67fdd38efc/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu#L1191

🔍 Related Issues

miles MXFP8/NVFP4 RL roadmap: radixark/miles#615
SGLang FlashInfer MXFP8 integration: sgl-project/sglang#18945

Summary by CodeRabbit

New Features
- Toggleable MXFPX/MXFP8 activation-scaling across MoE inference, updating workspace sizing, kernel selection, block-scaling, and dispatch to enable MXFP8-aware execution and validation.
- Added MXFP8×MXFP8 quantization mode and emitted MXFPX-aware GEMM/kernel variants; public APIs now expose an MXFPX/activation-scaling flag.

Tests
- Added unit tests and helpers for MXFP8 quantization, packing/dequantization, and end-to-end MXFP8×MXFP8 MoE inference validation.
📌 Description
Author: @nekorobov
Add the trtllm-gen mxfp8 moe. It uses the existing `trtllm_fp8_block_scale_moe` api and can be selected by setting `fp8_quantization_type`.

🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Reviewer Notes