
benchmark: Add MXFP4/MXFP8 quantization mode support to FP4 MoE benchmark #2635

Merged
bkryu merged 5 commits into flashinfer-ai:main from bkryu:bench_moe_mxfp4
Feb 25, 2026

Conversation

@bkryu
Collaborator

@bkryu bkryu commented Feb 25, 2026

📌 Description

  • Add --fp4_mode CLI argument to trtllm_fp4_block_scale_moe benchmark with three modes (a sketch of the flag wiring follows this list):
    • nvfp4 (default, existing behavior): NvFP4 weights + NvFP4 hidden states, block_size=16
    • mxfp4_bf16: MXFP4 weights + BF16 hidden states, block_size=32
    • mxfp4_mxfp8: MXFP4 weights + MXFP8 hidden states, block_size=32
  • Add backward compatibility with flashinfer 0.6.0 (pre-0.6.3), where ActivationType was not yet exported from the top-level package and MoE APIs used gated_act_type instead of activation_type
  • Fix CSV output column mismatch: moe.py wrote to cur_res["activation_type"] but the CSV column was "gated_act", causing the field to be silently empty
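For reference, here is a minimal sketch of how the flag might be registered in parse_moe_args. The choices and help text come from this PR's diff; the standalone parser setup and entry point are assumptions for illustration only.

import argparse


def parse_moe_args(argv=None):
    # Sketch only: the real parse_moe_args in benchmarks/routines/moe.py defines
    # many more arguments; this shows just the new --fp4_mode flag.
    parser = argparse.ArgumentParser(description="FP4 MoE benchmark (sketch)")
    parser.add_argument(
        "--fp4_mode",
        type=str,
        default="nvfp4",
        choices=["nvfp4", "mxfp4_mxfp8", "mxfp4_bf16"],
        help=(
            "FP4 quantization mode for trtllm_fp4_block_scale_moe: "
            "nvfp4 (NvFP4 weights + NvFP4 hidden states, block_size=16), "
            "mxfp4_mxfp8 (MXFP4 weights + MXFP8 hidden states, block_size=32), "
            "mxfp4_bf16 (MXFP4 weights + BF16 hidden states, block_size=32)"
        ),
    )
    return parser.parse_args(argv)


# e.g. parse_moe_args(["--fp4_mode", "mxfp4_bf16"]).fp4_mode == "mxfp4_bf16"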

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Added FP4 mode selection for benchmarks (nvfp4 default, mxfp4_mxfp8, mxfp4_bf16) with a CLI flag and updated benchmark output to include activation type and FP4 mode.
    • Benchmarks now report FP-format-aware size/scale behavior for new modes.
  • Chores

    • Backward-compatible handling of older activation argument names.
    • Adjusted quantization and vector-size logic to support mxfp4/mxfp8 modes.

@gemini-code-assist
Contributor

Summary of Changes

Hello @bkryu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the FP4 MoE benchmarking capabilities by introducing support for MXFP4 and MXFP8 quantization modes, allowing for more diverse performance analysis. It also ensures broader compatibility with different versions of the FlashInfer library by adapting to API changes related to activation types. Additionally, a minor but important fix was applied to ensure accurate and complete data logging in the benchmark's CSV output.

Highlights

  • Expanded FP4 MoE Benchmarking: Added a new --fp4_mode CLI argument to the trtllm_fp4_block_scale_moe benchmark, allowing users to select between nvfp4 (default), mxfp4_bf16, and mxfp4_mxfp8 quantization modes. This enables comprehensive performance evaluation across different FP4/MXFP4 weight and hidden state configurations.
  • Backward Compatibility for FlashInfer APIs: Implemented backward compatibility for flashinfer versions prior to 0.6.3, specifically addressing changes in how ActivationType is exported and how MoE APIs handle activation types (transition from gated_act_type to activation_type).
  • CSV Output Fix: Corrected a mismatch in the CSV output where moe.py was writing to cur_res["activation_type"] but the CSV column expected gated_act, leading to silently empty fields. The fp4_mode is also now correctly added to the CSV output.


Changelog
  • benchmarks/routines/flashinfer_benchmark_utils.py
    • Updated _MOE_BENCHMARK_COLUMNS to include activation_type and fp4_mode for consistent CSV output.
  • benchmarks/routines/moe.py
    • Imported inspect module for runtime introspection.
    • Added a try-except block to import ActivationType for backward compatibility with older flashinfer versions.
    • Imported mxfp8_quantize for MXFP8 quantization support.
    • Introduced _ACTIVATION_TO_GATED_ACT mapping and _activation_kwarg helper function to handle activation_type vs. gated_act_type API differences.
    • Added --fp4_mode argument to parse_moe_args with choices for nvfp4, mxfp4_mxfp8, and mxfp4_bf16.
    • Modified testTrtllmFp4BlockScaleMoe to dynamically prepare hidden states and scales based on the selected fp4_mode.
    • Updated run_fp4_moe to use the _activation_kwarg helper for activation_type parameter passing.
    • Dynamically set input_format and weight_format in run_fp4_moe based on the chosen fp4_mode.
    • Added fp4_mode to the cur_res dictionary for CSV logging in run_fp4_moe.
    • Updated run_fp8_per_tensor_moe to use the _activation_kwarg helper for activation_type parameter passing.
  • benchmarks/routines/moe_utils.py
    • Adjusted sf_vec_size in quantize_fp4 to be 32 when use_ue8m0 is true, otherwise 16.
    • Added mxfp8 format to get_effective_bytes calculation for bandwidth estimation (see the sketch after this changelog).
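To make the bandwidth accounting concrete, here is a rough sketch of the per-element byte model described above. The mxfp4 and mxfp8 figures appear in the diff quoted later in this thread; the nvfp4 value of 0.5 + 1/16 and the exact function layout are assumptions.

def get_effective_bytes(fmt):
    # Sketch of effective bytes per element used for bandwidth estimates.
    if fmt == "bf16":
        return 2.0
    if fmt == "nvfp4":
        # 4-bit value + 1 fp8 scale per 16-element block (assumed)
        return 0.5 + 1 / 16
    if fmt == "mxfp4":
        # 4-bit value + 1 ue8m0 scale per 32-element block
        return 0.5 + 1 / 32
    if fmt == "mxfp8":
        # 1 e4m3 + 1 ue8m0 scale per 32-element block
        return 1.0 + 1 / 32
    raise ValueError(f"Unknown format: {fmt}")
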
Activity
  • No specific review comments or activities have been recorded for this pull request yet.

@coderabbitai
Contributor

coderabbitai bot commented Feb 25, 2026

📝 Walkthrough

Walkthrough

Adds FP4 mode selection and ActivationType compatibility to MoE benchmark code: updates output columns (removes gated_act, adds activation_type and fp4_mode), introduces --fp4_mode CLI flag, adds activation-kwarg helper and FP4/FP8 quantization/path branching.

Changes

  • Output Column Updates (benchmarks/routines/flashinfer_benchmark_utils.py): Removed gated_act and added activation_type and fp4_mode entries in the moe output columns.
  • MoE CLI, FP4/FP8 Pathing & Activation Handling (benchmarks/routines/moe.py): Added ActivationType import fallback, _activation_kwarg() helper and _ACTIVATION_TO_GATED_ACT mapping; new CLI --fp4_mode (nvfp4, mxfp4_mxfp8, mxfp4_bf16); integrated mxfp8_quantize and fp4_mode branching across FP4/FP8 quantization, scale/reshape logic, activation-kwarg propagation, and runtime result metadata.
  • Quantization Utilities / Vector Sizes (benchmarks/routines/moe_utils.py): Added SF_VEC_SIZE dict; sf_vec_size selection now depends on FP4 mode/use_ue8m0; extended get_effective_bytes() to support mxfp8 and adjusted fp4 effective-byte calculations.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant CLI as CLI
  participant Bench as Benchmark Runner
  participant Q as Quantizer (moe_utils)
  participant Kernel as Fused MoE Kernel
  participant Store as Result Collector

  CLI->>Bench: parse args (--fp4_mode, activation_type, other flags)
  Bench->>Q: prepare/quantize tensors (fp4_mode, sf_vec_size)
  Q-->>Bench: quantized tensors, scales, metadata
  Bench->>Kernel: invoke kernel (passes activation kwarg via _activation_kwarg)
  Kernel-->>Bench: runtime metrics, outputs
  Bench->>Store: append result (includes fp4_mode, activation_type, formats)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

run-ci

Suggested reviewers

  • Anerudhan
  • cyx-6
  • jiahanc
  • nv-yunzheq

Poem

🐰 I hopped through code with nimble feet,
Swapped an old gate for a brand-new beat,
FP4 modes now dance in line,
Activation types align just fine,
Benchmarks sing — quantized and neat! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 72.73%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): the PR title accurately describes the main change, adding MXFP4/MXFP8 quantization mode support to the FP4 MoE benchmark, which is the primary objective of the changeset.
  • Description check (✅ Passed): the PR description includes all required sections from the template: a detailed description of changes, a related issues section (though empty), and a completed pre-commit and testing checklist.



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for MXFP4/MXFP8 quantization modes to the FP4 MoE benchmark, including a new --fp4_mode CLI argument. It also introduces backward compatibility for older flashinfer versions by dynamically handling ActivationType and gated_act_type API differences, and fixes a CSV output column mismatch.

The changes are well-structured and the backward compatibility is handled cleanly. However, I've found a critical issue related to the quantization of hidden_states. The swizzled layout for scale factors requires the input tensor's row count to be a multiple of 128, but hidden_states can have an arbitrary number of rows (num_tokens). This will lead to runtime errors. I've provided suggestions to use a non-swizzled layout for activations to fix this.
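To illustrate the arithmetic behind this concern (the 128-row tiling requirement is taken from the comment above; the helper name and padding approach are hypothetical):

def rows_after_swizzle_padding(num_tokens, tile_rows=128):
    # If the swizzled scale-factor layout tiles rows in groups of 128, an
    # arbitrary num_tokens must be rounded up before scales can be laid out.
    return ((num_tokens + tile_rows - 1) // tile_rows) * tile_rows


for n in (1, 100, 128, 200):
    print(n, "->", rows_after_swizzle_padding(n))  # 128, 128, 128, 256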

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/routines/moe.py (1)

1358-1383: ⚠️ Potential issue | 🟡 Minor

activation_type not written to cur_res in testTrtllmFp8BlockScaleMoe.

The "activation_type" column now exists in output_column_dict["moe"], and testTrtllmFp8PerTensorScaleMoe correctly populates cur_res["activation_type"], but testTrtllmFp8BlockScaleMoe omits this field, leaving it empty in CSV output. Since activation_type is accepted by neither the FP8 block-scale kernel (not passed at all in run_fp8_block_moe) nor written to cur_res, the omission is consistent but the CSV column will always be blank for this routine.

Consider adding cur_res["activation_type"] = args.activation_type.name for completeness, or if the FP8 block-scale kernel genuinely ignores activation type, add a comment explaining why it's excluded.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/routines/moe.py` around lines 1358 - 1383, The CSV result for
testTrtllmFp8BlockScaleMoe omits the activation_type field causing an empty
column; update the code that builds cur_res in testTrtllmFp8BlockScaleMoe to
include cur_res["activation_type"] = args.activation_type.name (or
args.activation_type) so the activation_type is recorded like in
testTrtllmFp8PerTensorScaleMoe, and if run_fp8_block_moe genuinely ignores
activation type add a brief comment near run_fp8_block_moe explaining why
activation_type is not used.
🧹 Nitpick comments (2)
benchmarks/routines/moe.py (2)

543-544: Redundant .to(torch.bfloat16): hidden_states is already BF16.

hidden_states is created as torch.bfloat16 at the create_trtllm_moe_test_data call. The cast is a no-op and adds a small overhead in setup.

♻️ Suggested fix
     if fp4_mode == "mxfp4_bf16":
-        hidden_states_fp4 = hidden_states.to(torch.bfloat16)
+        hidden_states_fp4 = hidden_states
         hidden_states_scale_linear_fp4 = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/routines/moe.py` around lines 543 - 544, Remove the redundant cast
to bfloat16: in the block handling fp4_mode == "mxfp4_bf16" (where the variable
hidden_states is set), delete the .to(torch.bfloat16) call or guard it so you
only cast if hidden_states.dtype is not torch.bfloat16; this touches the
fp4_mode check around hidden_states assignment (created by
create_trtllm_moe_test_data) — simply rely on the input being BF16 or perform a
dtype check before casting to avoid the no-op overhead.

61-68: inspect.signature is called on every benchmark iteration — cache the result.

_activation_kwarg is invoked inside run_fp4_moe and run_fp8_per_tensor_moe, which are called for each iteration by bench_gpu_time. inspect.signature has non-trivial overhead. Pre-compute it once before the benchmark loop.

♻️ Suggested fix — compute kwargs once outside the closure
-def _activation_kwarg(fn, activation_type: ActivationType) -> dict:
-    """Return the correct activation keyword argument for *fn* in the installed version."""
-    sig = inspect.signature(fn)
-    if "activation_type" in sig.parameters:
-        return {"activation_type": activation_type.value}
-    if "gated_act_type" in sig.parameters:
-        return {"gated_act_type": _ACTIVATION_TO_GATED_ACT.get(activation_type, 0)}
-    return {}

Then at each call site, pre-compute outside the inner closure:

# Before defining run_fp4_moe / run_fp8_per_tensor_moe:
activation_kwargs = _activation_kwarg(trtllm_fp4_block_scale_moe, activation_type)

def run_fp4_moe(...):
    return trtllm_fp4_block_scale_moe(
        ...,
        **activation_kwargs,
    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/routines/moe.py` around lines 61 - 68, The helper
_activation_kwarg currently calls inspect.signature(fn) on every benchmark
iteration; compute and cache the activation kwargs once before the benchmark
loop and pass them into the per-iteration functions instead of calling
_activation_kwarg repeatedly. Concretely, call
_activation_kwarg(trtllm_fp4_block_scale_moe, activation_type) (and any other
target model functions) once and store the returned dict (e.g.,
activation_kwargs), then update run_fp4_moe and run_fp8_per_tensor_moe to
accept/use the precomputed activation_kwargs and spread them into the model
invocation instead of calling _activation_kwarg inside the per-iteration
closure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/routines/moe.py`:
- Around line 563-572: The code currently replaces
hidden_states_scale_linear_fp4 with a synthetic all-ones tensor when its element
count mismatches expected_scale_elems but only prints an info message if
args.verbose >= 1; change this to always emit a warning (unconditional) when the
fallback happens and include context (current and expected sizes and that an
all-ones tensor is being used) so users see the substitution even at default
verbosity. Locate the block using expected_scale_elems and
hidden_states_scale_linear_fp4 and replace the gated print with a call that
always logs or prints a clear warning message before creating the
torch.ones(...) fallback (include device and dtype details in the message if
helpful). Ensure behavior of assigning hidden_states_scale_linear_fp4 =
torch.ones(expected_scale_elems, device=device, dtype=torch.float8_e4m3fn)
remains unchanged.
- Around line 66-67: The code silently falls back to 0 for unknown
activation_type in the legacy gated_act_type path; update the block that checks
"gated_act_type" to detect when activation_type is not a key in
_ACTIVATION_TO_GATED_ACT and emit a warning (e.g., warnings.warn or
logger.warning) including the unexpected activation_type and that we're using
the legacy fallback value, then return the fallback {"gated_act_type":
_ACTIVATION_TO_GATED_ACT.get(activation_type, 0)} as before; reference the
activation_type variable and the _ACTIVATION_TO_GATED_ACT mapping to locate
where to add the warning.
- Around line 547-551: The code is reinterpreting uint8 scale bytes as float8
via hs_scale.view(torch.float8_e4m3fn) which is unsafe; replace that
reinterpretation by explicitly converting the scale values from the returned
uint8 representation into the float8 numeric type (or use an existing helper
like a mxfp8 scale decode/dequantize function) before reshaping. In other words,
locate mxfp8_quantize and the variables hs_scale and
hidden_states_scale_linear_fp4 and change the
hs_scale.view(torch.float8_e4m3fn).reshape(...) step to perform a proper
conversion/cast from torch.uint8 to torch.float8_e4m3fn (or call the mxfp8
scale-decoding routine) and then reshape to (num_tokens, -1), rather than
reinterpreting raw bytes.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9826c26 and 2371df9.

📒 Files selected for processing (3)
  • benchmarks/routines/flashinfer_benchmark_utils.py
  • benchmarks/routines/moe.py
  • benchmarks/routines/moe_utils.py

@bkryu bkryu self-assigned this Feb 25, 2026
@bkryu bkryu added the ready and benchmark (Pertains to performance benchmarking) labels Feb 25, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
benchmarks/routines/moe.py (2)

549-551: Redundant .to(torch.bfloat16): hidden_states is already bfloat16.

create_trtllm_moe_test_data constructs hidden_states as dtype=torch.bfloat16, so the cast on line 550 is a no-op. Not harmful, but a minor clarity issue.

♻️ Proposed fix
     if fp4_mode == "mxfp4_bf16":
-        hidden_states_fp4 = hidden_states.to(torch.bfloat16)
+        hidden_states_fp4 = hidden_states  # already bfloat16
         hidden_states_scale_linear_fp4 = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/routines/moe.py` around lines 549 - 551, Redundant cast: remove
the unnecessary .to(torch.bfloat16) call and assign hidden_states_fp4 =
hidden_states directly in the fp4_mode == "mxfp4_bf16" branch (leave
hidden_states_scale_linear_fp4 = None); locate the branch using the fp4_mode
variable and the hidden_states_fp4/hidden_states names in
create_trtllm_moe_test_data to make the change.

68-72: Ruff TRY003 — consider shortening inline exception message or wrapping in a custom exception.

Both this block (line 68–72) and the equivalent at lines 554–558 are flagged by Ruff TRY003 for long messages outside the exception class. For a benchmark file the impact is minimal, but aligning with the linter keeps CI clean.

♻️ Suggested fix (example for lines 68–72)
-        if activation_type not in _ACTIVATION_TO_GATED_ACT:
-            raise ValueError(
-                f"Activation type {activation_type.name} is not supported by the "
-                f"installed flashinfer version (pre-0.6.3 only supports "
-                f"{[k.name for k in _ACTIVATION_TO_GATED_ACT]})"
-            )
+        if activation_type not in _ACTIVATION_TO_GATED_ACT:
+            supported = [k.name for k in _ACTIVATION_TO_GATED_ACT]
+            raise ValueError(
+                f"Activation {activation_type.name!r} unsupported pre-0.6.3; supported: {supported}"
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/routines/moe.py` around lines 68 - 72, The ValueError raised for
unsupported activations has an overly long inline message; shorten it to a
concise message like "Unsupported activation type: {activation_type.name}" and
move the detailed list of supported activations (from _ACTIVATION_TO_GATED_ACT)
into either a separate variable used for logging or into a small custom
exception class (e.g., ActivationNotSupported) that formats the full detail in
its __str__; update the raise sites (the ValueError instances around
activation_type and the equivalent at the other location) to use the short
message or raise the new ActivationNotSupported to satisfy Ruff TRY003 while
preserving the detailed info elsewhere for debugging.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2371df9 and f4567b6.

📒 Files selected for processing (1)
  • benchmarks/routines/moe.py

choices=["nvfp4", "mxfp4_mxfp8", "mxfp4_bf16"],
help=(
"FP4 quantization mode for trtllm_fp4_block_scale_moe: "
"nvfp4 (NvFP4 weights + NvFP4 hidden states, block_size=16), "
Collaborator


If these block_size values are hardcoded numbers, we can create a utility function or a dict to store the block_size, e.g.:

sf_vec_size = {
    "nvfp4": 16,
    "mxfp4": 32,
}
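If adopted, such a dict might be consulted roughly like this (a sketch; the helper name is made up and the actual lookup in quantize_fp4 may differ):

SF_VEC_SIZE = {
    "nvfp4": 16,
    "mxfp4": 32,
}


def pick_sf_vec_size(use_ue8m0):
    # Mirrors the PR's "32 when use_ue8m0 is true, otherwise 16" selection.
    return SF_VEC_SIZE["mxfp4"] if use_ue8m0 else SF_VEC_SIZE["nvfp4"]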

Collaborator Author


Addressed in latest commit d149911

return 0.5 + 1 / 32
elif fmt == "mxfp8":
# 1 e4m3 + 1 ue8m0 scale per 32-element block
return 1.0 + 1 / 32
Collaborator


ditto

Collaborator Author


Ditto; addressed in latest commit d149911

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
benchmarks/routines/moe_utils.py (1)

752-752: sf_vec_size = 16 should use SF_VEC_SIZE["nvfp4"] for consistency.

This function is the only remaining site with a hardcoded block size now that SF_VEC_SIZE exists.

♻️ Proposed fix
-    sf_vec_size = 16
+    sf_vec_size = SF_VEC_SIZE["nvfp4"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/routines/moe_utils.py` at line 752, Replace the hardcoded block
size assignment "sf_vec_size = 16" with the centralized constant lookup
SF_VEC_SIZE["nvfp4"] so the function uses the shared SF_VEC_SIZE map; locate the
occurrence of the variable sf_vec_size in the file and change its value to be
assigned from SF_VEC_SIZE["nvfp4"] (ensuring SF_VEC_SIZE is imported/available
in the module).
benchmarks/routines/moe.py (1)

62-75: ValueError for unsupported activation on pre-0.6.3 APIs is a good improvement over a silent fallback.

Minor: Ruff flags the long inline message at lines 69-73 (TRY003). Consider moving it into a dedicated exception class or shortening it if you want to silence linting.
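One possible shape for such a dedicated exception (purely illustrative; the class name mirrors the agent prompt below and is not part of the PR):

class UnsupportedActivationError(ValueError):
    # Keeps the detailed message out of the raise site, one way to satisfy TRY003.
    def __init__(self, activation_name, supported):
        super().__init__(
            f"Activation {activation_name!r} is not supported by the installed "
            f"flashinfer version (supported: {supported})"
        )


# raise UnsupportedActivationError(activation_type.name,
#                                  [k.name for k in _ACTIVATION_TO_GATED_ACT])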

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/routines/moe.py` around lines 62 - 75, The long inline ValueError
message in _activation_kwarg triggers Ruff TRY003; replace the multi-line inline
message with a concise raise using a dedicated exception class or a shorter
message: define a custom exception (e.g., UnsupportedActivationError) near the
top of the module that formats or stores the full multi-line guidance, then in
_activation_kwarg raise UnsupportedActivationError(activation_type,
list(_ACTIVATION_TO_GATED_ACT)) or raise ValueError with a single-line message
that references the custom exception or points to docs; reference the symbols
_activation_kwarg, ActivationType, _ACTIVATION_TO_GATED_ACT and the ValueError
raise site when making the change.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f4567b6 and d149911.

📒 Files selected for processing (2)
  • benchmarks/routines/moe.py
  • benchmarks/routines/moe_utils.py

@bkryu bkryu merged commit fcc47b8 into flashinfer-ai:main Feb 25, 2026
24 checks passed
@bkryu bkryu deleted the bench_moe_mxfp4 branch February 26, 2026 01:44
ameynaik-hub pushed a commit to ameynaik-hub/flashinfer that referenced this pull request Mar 18, 2026
…mark (flashinfer-ai#2635)


Labels

benchmark (Pertains to performance benchmarking), ready

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants