
[performance] optimize for nvfp4 #2268

Merged
yzh119 merged 1 commit into flashinfer-ai:main from Bruce-x-1997:bruce_optimize_nvfp4_main
Dec 27, 2025

Conversation

@Bruce-x-1997 (Contributor) commented Dec 25, 2025

📌 Description

I found that the nvfp4 implementation achieves only a 1.3-1.4x speedup over fp8 on the deepseek-v3-0324 model. Since fp4 peak PFLOPS is twice that of fp8, there should be room for further optimization.

With this PR applied, we get an extra 10-15% speedup on fp4:
1369.89 / 1192.91 = 1.148, i.e. ~15% speedup

test cmd

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 1000 --random-output 1000 --max-concurrency 60 --port 30000 --host 0.0.0.0

accuracy

+------------------+-----------+----------+----------+-------+---------+---------+
| Model            | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+==================+===========+==========+==========+=======+=========+=========+
| DeepSeek-V3-0324 | aime24    | mean_acc | default  |   300 |  0.5467 | default |
+------------------+-----------+----------+----------+-------+---------+---------+ 

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Refactor
    • Optimized FP4/FP8 quantization paths with improved register efficiency
    • Enhanced kernel launch configuration to improve GPU occupancy and performance
    • Streamlined accumulation processes to reduce memory footprint


Signed-off-by: bruce.xu <bruce.xu@gmicloud.ai>
@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @Bruce-x-1997, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on a series of performance optimizations for the NVFP4 implementation, addressing observed bottlenecks where FP4 was not achieving its theoretical speedup over FP8. The changes primarily involve reducing register usage in CUDA kernels, streamlining floating-point conversions and scale factor calculations, and fine-tuning kernel launch configurations to improve GPU occupancy and overall efficiency for FP4 operations.

Highlights

  • Efficient FP4 Conversion: Introduced a new __device__ function fp32_pair_to_e2m1 using PTX assembly for optimized float2 to e2m1 conversion, reducing register pressure and enabling pipelined processing.
  • Scale Factor (SF) Calculation Optimization: Refactored the scale factor calculation within cvt_warp_fp16_to_fp4 to directly compute the quantized scale factor and output scale, minimizing intermediate storage and reducing reciprocal operations.
  • Reduced Register Usage in All-Reduce: Modified the allreduce_sum function to process elements one at a time when using Fp32Acc, significantly decreasing register pressure by avoiding large temporary arrays.
  • FP4 Kernel Launch Tuning: Implemented specific kernel launch optimizations for FP4 quantization, dynamically selecting block_size values (e.g., 160, 192, 128) to enhance GPU occupancy and performance for FP4 kernels.
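The e2m1 device function itself is not quoted in this thread, but its effect can be illustrated with a host-side reference. This is a sketch, not the PR's PTX-based implementation: it assumes the usual e2m1 semantics (representable magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, round to nearest, satfinite clamping to ±6) and glosses over the exact tie-breaking of `cvt.rn`.

```cpp
#include <cassert>
#include <cmath>

// Host-side reference for e2m1 quantization (sketch, not the kernel code).
// e2m1 (1 sign, 2 exponent, 1 mantissa bit) represents the magnitudes
// {0, 0.5, 1, 1.5, 2, 3, 4, 6}; satfinite clamps anything larger to 6.
inline float quantize_e2m1_ref(float x) {
  static const float grid[] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
  float mag = std::fabs(x);
  float best = grid[0];
  for (float g : grid) {
    // Nearest-value search; cvt.rn tie-breaking details are omitted here.
    if (std::fabs(mag - g) < std::fabs(mag - best)) best = g;
  }
  return x < 0.0f ? -best : best;
}
```

The device function packs two such conversions per PTX instruction (`e2m1x2`), which is where the register savings come from.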


@coderabbitai (Contributor) commented Dec 25, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

This pull request optimizes FP4/FP8 quantization paths in the allreduce fusion CUDA kernel by introducing a new e2m1 conversion helper function, refactoring quantization flows to reduce register pressure, optimizing accumulator usage, and adjusting kernel launch parameters for improved occupancy.

Changes

Cohort / File(s) Summary
FP4/FP8 Quantization Optimization
include/flashinfer/comm/trtllm_allreduce_fusion.cuh
Added fp32_pair_to_e2m1() device function for inline PTX conversion of FP32 pairs to e2m1 format. Refactored cvt_warp_fp16_to_fp4() to compute quantized SF directly and use RECIPROCAL_6 constant, replacing intermediate SFValue calculations and consolidating per-iteration accumulation. Optimized allreduce_sum() to use scalar per-element accumulators instead of arrays. Enhanced allreduce_fusion_kernel_launcher() with FP4-specific block size pre-optimization logic before SM count checks to improve kernel occupancy.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Registers were crowded, quantization took space,
So we packed the e2m1 at a brisk GPU pace,
Scalar accumulators hop where arrays once stayed,
FP4 optimization in the kernel we've made,
Reduced pressure, improved flow—efficiency's won the race! 🚀

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check: Passed. The title '[performance]optimize for nvfp4' directly relates to the main objective of the PR, which is to optimize the nvfp4 implementation for better performance, though it is somewhat generic about which aspects are optimized.
  • Docstring Coverage: Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.
  • Description check: Passed. The PR description contains performance motivation, benchmark results, and testing claims, but lacks critical implementation details about the changes.


@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces several well-reasoned performance optimizations for the nvfp4 implementation. The changes focus on reducing register pressure, improving instruction efficiency, and tuning kernel launch parameters. The optimizations, such as pipelining the quantization process and simplifying scale factor calculations, are sound. I have one suggestion to improve the code's readability and maintainability.

Comment on lines +1433 to +1452
if constexpr (GetQuantType<Pattern> == QuantType::kFP4) {
// Try to use 160 as block_size if possible (better occupancy for FP4)
if (threads_per_token % 160 == 0 && 160 <= max_threads_per_block && 160 >= 128) {
block_size = 160;
cluster_size = threads_per_token / 160;
if (cluster_size > 8) cluster_size = 8;
}
// Fallback: try 192, 128 if 160 doesn't work
else if (threads_per_token % 192 == 0 && 192 <= max_threads_per_block && 192 >= 128) {
block_size = 192;
cluster_size = threads_per_token / 192;
if (cluster_size > 8) cluster_size = 8;
} else if (threads_per_token % 128 == 0 && 128 <= max_threads_per_block) {
block_size = 128;
cluster_size = threads_per_token / 128;
if (cluster_size > 8) cluster_size = 8;
}
// Update threads_per_block to match block_size for SM count check
threads_per_block = block_size;
}

Severity: medium

The logic for selecting a special block size for FP4 kernels is a bit repetitive and contains some redundant conditions. This can be refactored to be more concise and easier to read.

Specifically:

  • The condition 160 >= 128 (and similar for 192) is always true and can be removed.
  • The block of code to update cluster_size and cap it at 8 is repeated in each branch.

I've suggested a refactoring that consolidates this logic, making it cleaner and more maintainable without changing the functionality.

  if constexpr (GetQuantType<Pattern> == QuantType::kFP4) {
    int new_block_size = 0;
    // Try to use 160 as block_size if possible (better occupancy for FP4)
    if (threads_per_token % 160 == 0 && 160 <= max_threads_per_block) {
      new_block_size = 160;
    } else if (threads_per_token % 192 == 0 && 192 <= max_threads_per_block) {
      // Fallback: try 192, 128 if 160 doesn't work
      new_block_size = 192;
    } else if (threads_per_token % 128 == 0 && 128 <= max_threads_per_block) {
      new_block_size = 128;
    }

    if (new_block_size > 0) {
      block_size = new_block_size;
      cluster_size = threads_per_token / block_size;
      if (cluster_size > 8) {
        cluster_size = 8;
      }
      // Update threads_per_block to match block_size for SM count check
      threads_per_block = block_size;
    }
  }

A collaborator replied:

gemini's suggestion looks reasonable.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 421433e and 9efd439.

📒 Files selected for processing (1)
  • include/flashinfer/comm/trtllm_allreduce_fusion.cuh
🔇 Additional comments (5)
include/flashinfer/comm/trtllm_allreduce_fusion.cuh (5)

534-552: LGTM - New helper function for pipelined e2m1 conversion.

The function correctly uses PTX cvt.rn.satfinite.e2m1x2.f32 for sm_100+ and falls back to returning 0 for older architectures, consistent with other conversion functions in this file. The operand ordering (%2, %1 = pair.y, pair.x) matches the convention used in fp32_vec_to_e2m1.


674-694: Good optimization - Reduced register pressure in quantization loop.

Processing one float2 pair at a time instead of collecting into a 4-element array reduces register usage from 32 bytes to 8 bytes. The bit-packing with (i * 8) shifts correctly places each byte in the appropriate position of the uint32_t result.


1132-1157: Good optimization - Per-element accumulation reduces register footprint.

The refactored loop processes each element position independently, accumulating across all ranks before immediately casting back to type T. This eliminates the intermediate float acc[VEC_SIZE] array, reducing register usage. The result is mathematically equivalent (within FP associativity tolerances).
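The shape of this change can be sketched on the host with made-up names (this is not the kernel code, just an illustration of the accumulator restructuring): summing one element position across all ranks at a time produces the same results as filling a full accumulator array first, while keeping only one scalar live.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-based form: a full VEC_SIZE accumulator vector stays live throughout.
std::vector<float> allreduce_array(const std::vector<std::vector<float>>& vals) {
  std::vector<float> acc(vals[0].size(), 0.0f);
  for (const auto& rank : vals)
    for (size_t i = 0; i < acc.size(); ++i) acc[i] += rank[i];
  return acc;
}

// Per-element form: one scalar accumulator per position, written back immediately.
std::vector<float> allreduce_scalar(const std::vector<std::vector<float>>& vals) {
  std::vector<float> out(vals[0].size());
  for (size_t i = 0; i < out.size(); ++i) {
    float acc = 0.0f;  // single live scalar instead of a whole array
    for (const auto& rank : vals) acc += rank[i];
    out[i] = acc;
  }
  return out;
}
```

On a GPU the payoff is register pressure, not instruction count: the compiler no longer has to keep `VEC_SIZE` partial sums resident at once.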


1456-1469: SM count adjustment loop looks correct.

The loop properly adjusts both threads_per_block and block_size (for FP4) when scaling down cluster_size due to SM count constraints. The separation of concerns between FP4 and non-FP4 paths is appropriate.


663-667: The outputScale formula is mathematically correct. Both the old and new implementations compute the same result: SFScaleVal / quantized_sf, where quantized_sf = fp8(SFScaleVal * vecMax / 6.0). The recipe comment notation is admittedly confusing, but algebraically simplifies to the same formula in both versions.
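The equivalence claimed here is plain algebra; a minimal host-side sketch (illustrative names, not the kernel's code) shows both forms side by side:

```cpp
#include <cassert>
#include <cmath>

// Old path (sketch): reciprocal followed by multiply.
float output_scale_old(float sf_scale, float quantized_sf) {
  return quantized_sf != 0.0f ? sf_scale * (1.0f / quantized_sf) : 0.0f;
}

// New path (sketch): a single division, saving one reciprocal and one multiply.
// Algebraically identical; float results may differ by roughly one ulp.
float output_scale_new(float sf_scale, float quantized_sf) {
  return quantized_sf != 0.0f ? sf_scale / quantized_sf : 0.0f;
}
```

As Edenzzzz notes later in the thread, the throughput difference between the two forms was not measurable in practice.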

Comment on lines +1433 to +1452
if constexpr (GetQuantType<Pattern> == QuantType::kFP4) {
// Try to use 160 as block_size if possible (better occupancy for FP4)
if (threads_per_token % 160 == 0 && 160 <= max_threads_per_block && 160 >= 128) {
block_size = 160;
cluster_size = threads_per_token / 160;
if (cluster_size > 8) cluster_size = 8;
}
// Fallback: try 192, 128 if 160 doesn't work
else if (threads_per_token % 192 == 0 && 192 <= max_threads_per_block && 192 >= 128) {
block_size = 192;
cluster_size = threads_per_token / 192;
if (cluster_size > 8) cluster_size = 8;
} else if (threads_per_token % 128 == 0 && 128 <= max_threads_per_block) {
block_size = 128;
cluster_size = threads_per_token / 128;
if (cluster_size > 8) cluster_size = 8;
}
// Update threads_per_block to match block_size for SM count check
threads_per_block = block_size;
}

⚠️ Potential issue | 🟠 Major

Potential correctness issue: threads_per_token may not be fully covered when cluster_size is capped.

When threads_per_token / block_size > 8, the cluster_size is capped at 8 but block_size is not adjusted to compensate. This breaks the invariant threads_per_token == block_size * cluster_size.

Example with hidden_dim=12800, VEC_SIZE=8:

  • threads_per_token = 1600
  • block_size = 160, cluster_size = 10 → capped to 8
  • Effective threads = 160 * 8 = 1280 < 1600

This would result in some elements not being processed.

🔎 Suggested fix: Recalculate block_size after capping cluster_size
     if (threads_per_token % 160 == 0 && 160 <= max_threads_per_block && 160 >= 128) {
       block_size = 160;
       cluster_size = threads_per_token / 160;
-      if (cluster_size > 8) cluster_size = 8;
+      if (cluster_size > 8) {
+        cluster_size = 8;
+        // Recalculate block_size to ensure full coverage
+        block_size = threads_per_token / cluster_size;
+      }
     }

Apply similar logic to the 192 and 128 fallback cases.
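The arithmetic behind this finding can be checked in isolation (a sketch with hypothetical helper names, using the numbers from the example above):

```cpp
#include <cassert>

struct LaunchShape {
  int block_size;
  int cluster_size;
};

// Cap-only variant (the behavior flagged above): block_size is left unchanged,
// so block_size * cluster_size can under-cover threads_per_token.
LaunchShape capped_only(int threads_per_token, int block_size) {
  int cluster = threads_per_token / block_size;
  if (cluster > 8) cluster = 8;
  return {block_size, cluster};
}

// Suggested fix: recompute block_size after capping so the product exactly
// covers threads_per_token again.
LaunchShape capped_and_recomputed(int threads_per_token, int block_size) {
  int cluster = threads_per_token / block_size;
  if (cluster > 8) {
    cluster = 8;
    block_size = threads_per_token / cluster;
  }
  return {block_size, cluster};
}
```

With threads_per_token = 1600 and block_size = 160, the cap-only variant covers 160 × 8 = 1280 threads, while the recomputed variant restores full coverage at 200 × 8 = 1600.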

Committable suggestion skipped: line range outside the PR's diff.


@yzh119 (Collaborator) commented Dec 25, 2025

/bot run

@flashinfer-bot (Collaborator):
GitLab MR !219 has been created, and the CI pipeline #40815147 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot (Collaborator):
[FAILED] Pipeline #40815147: 11/20 passed

@yzh119 (Collaborator) left a comment
cc @timlee0212 for viz.

// Convert the input to float.
float2 fp2Vals[details::CVT_FP4_ELTS_PER_THREAD / 2];
// Convert the input to float and quantize (pipelined to reduce register usage).
// Optimization: use single float2 instead of array to reduce register pressure from 32 bytes to 8

Do you have any profiling results showing the register usage (e.g. from cuobjdump or ncu)?

Reply from the author (@Bruce-x-1997):
oh, I will give it these days

@yzh119 merged commit 4de2a45 into flashinfer-ai:main Dec 27, 2025
4 checks passed
: 0.0f;
// Optimization: mathematically equivalent to SFScaleVal / quantized_sf, but more efficient
// (reduces 1 reciprocal call and 1 multiply operation)
float outputScale = quantized_sf != 0 ? SFScaleVal / quantized_sf : 0.0f;
@Edenzzzz (Contributor) commented Dec 28, 2025
Interesting, I thought division would take a few times longer than reciprocal approximation--curious if you ablated this

EDIT: I tried and it doesn't make noticeable diff

timlee0212 added a commit to timlee0212/flashinfer that referenced this pull request Jan 6, 2026
timlee0212 added a commit to timlee0212/flashinfer that referenced this pull request Jan 12, 2026

// FP4 optimization: apply BEFORE SM count check to avoid being overridden
// This allows FP4 to use smaller block_size even when cluster_num is large
if constexpr (GetQuantType<Pattern> == QuantType::kFP4) {
@timlee0212 (Contributor) commented Jan 12, 2026
Hi @Bruce-x-1997, will this occupancy optimization be triggered in the DeepSeek case mentioned in this PR? 7168/8 = 896 is not divisible by either 160 or 192, so 128 will be used, which is typically a preferred block_size. I have tried the other optimizations mentioned in this PR but did not find a noticeable difference (~0.05 us faster). I haven't checked the register usage, though. Does the 10-15% improvement mentioned in the PR result from using this block size (and possibly a different cluster size) instead of the original heuristic?
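The divisibility claim in this comment is easy to verify with a sketch of the candidate-selection heuristic (hypothetical helper, mirroring the branch order quoted earlier in the thread):

```cpp
#include <cassert>
#include <initializer_list>

// Sketch of the FP4 block-size candidate search: first divisor of
// threads_per_token among {160, 192, 128} that fits the block limit wins.
int fp4_block_size(int threads_per_token, int max_threads_per_block) {
  for (int candidate : {160, 192, 128})
    if (threads_per_token % candidate == 0 && candidate <= max_threads_per_block)
      return candidate;
  return 0;  // no candidate divides threads_per_token
}
```

For the DeepSeek-V3 shape (7168/8 = 896 threads per token), only the 128 branch fires, just as the comment observes, so the 160/192 preferences never apply there.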
