
[performance] optimize for nvfp4 #2268

Merged
yzh119 merged 1 commit into flashinfer-ai:main from Bruce-x-1997:bruce_optimize_nvfp4_main
Dec 27, 2025

Conversation

@Bruce-x-1997 (Contributor) commented Dec 25, 2025

📌 Description

I found that the nvfp4 implementation achieves only a 1.3-1.4x speedup over fp8 on the deepseek-v3-0324 model. Since fp4 peak PFLOPS is twice that of fp8, there should be room for further optimization.

With this PR applied, we get an extra 10-15% speedup on fp4:
1369.89 / 1192.91 = 1.148, i.e. ~15% speedup

test cmd

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 1000 --random-output 1000 --max-concurrency 60 --port 30000 --host 0.0.0.0

accuracy

+------------------+-----------+----------+----------+-------+---------+---------+
| Model            | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+==================+===========+==========+==========+=======+=========+=========+
| DeepSeek-V3-0324 | aime24    | mean_acc | default  |   300 |  0.5467 | default |
+------------------+-----------+----------+----------+-------+---------+---------+ 

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Refactor
    • Optimized FP4/FP8 quantization paths with improved register efficiency
    • Enhanced kernel launch configuration to improve GPU occupancy and performance
    • Streamlined accumulation processes to reduce memory footprint


Signed-off-by: bruce.xu <bruce.xu@gmicloud.ai>
@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @Bruce-x-1997, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on a series of performance optimizations for the NVFP4 implementation, addressing observed bottlenecks where FP4 was not achieving its theoretical speedup over FP8. The changes primarily involve reducing register usage in CUDA kernels, streamlining floating-point conversions and scale factor calculations, and fine-tuning kernel launch configurations to improve GPU occupancy and overall efficiency for FP4 operations.

Highlights

  • Efficient FP4 Conversion: Introduced a new __device__ function fp32_pair_to_e2m1 using PTX assembly for optimized float2 to e2m1 conversion, reducing register pressure and enabling pipelined processing.
  • Scale Factor (SF) Calculation Optimization: Refactored the scale factor calculation within cvt_warp_fp16_to_fp4 to directly compute the quantized scale factor and output scale, minimizing intermediate storage and reducing reciprocal operations.
  • Reduced Register Usage in All-Reduce: Modified the allreduce_sum function to process elements one at a time when using Fp32Acc, significantly decreasing register pressure by avoiding large temporary arrays.
  • FP4 Kernel Launch Tuning: Implemented specific kernel launch optimizations for FP4 quantization, dynamically selecting block_size values (e.g., 160, 192, 128) to enhance GPU occupancy and performance for FP4 kernels.
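The e2m1 device function itself is not quoted in this thread, but its effect can be illustrated with a host-side reference. This is a sketch, not the PR's PTX-based implementation: it assumes the usual e2m1 semantics (representable magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, round to nearest, satfinite clamping to ±6) and glosses over the exact tie-breaking of `cvt.rn`.

```cpp
#include <cassert>
#include <cmath>

// Host-side reference for e2m1 quantization (sketch, not the kernel code).
// e2m1 (1 sign, 2 exponent, 1 mantissa bit) represents the magnitudes
// {0, 0.5, 1, 1.5, 2, 3, 4, 6}; satfinite clamps anything larger to 6.
inline float quantize_e2m1_ref(float x) {
  static const float grid[] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
  float mag = std::fabs(x);
  float best = grid[0];
  for (float g : grid) {
    // Nearest-value search; cvt.rn tie-breaking details are omitted here.
    if (std::fabs(mag - g) < std::fabs(mag - best)) best = g;
  }
  return x < 0.0f ? -best : best;
}
```

The device function packs two such conversions per PTX instruction (`e2m1x2`), which is where the register savings come from.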


@coderabbitai (Contributor) commented Dec 25, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

This pull request optimizes FP4/FP8 quantization paths in the allreduce fusion CUDA kernel by introducing a new e2m1 conversion helper function, refactoring quantization flows to reduce register pressure, optimizing accumulator usage, and adjusting kernel launch parameters for improved occupancy.

Changes

Cohort / File(s) Summary
FP4/FP8 Quantization Optimization
include/flashinfer/comm/trtllm_allreduce_fusion.cuh
Added fp32_pair_to_e2m1() device function for inline PTX conversion of FP32 pairs to e2m1 format. Refactored cvt_warp_fp16_to_fp4() to compute quantized SF directly and use RECIPROCAL_6 constant, replacing intermediate SFValue calculations and consolidating per-iteration accumulation. Optimized allreduce_sum() to use scalar per-element accumulators instead of arrays. Enhanced allreduce_fusion_kernel_launcher() with FP4-specific block size pre-optimization logic before SM count checks to improve kernel occupancy.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Registers were crowded, quantization took space,
So we packed the e2m1 at a brisk GPU pace,
Scalar accumulators hop where arrays once stayed,
FP4 optimization in the kernel we've made,
Reduced pressure, improved flow—efficiency's won the race! 🚀

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check: Passed. The title '[performance]optimize for nvfp4' directly relates to the main objective of the PR, which is to optimize the nvfp4 implementation for better performance, though it is somewhat generic about which aspects are optimized.
  • Docstring Coverage: Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.
  • Description check: Passed. The PR description contains performance motivation, benchmark results, and testing claims, but lacks critical implementation details about the changes.


@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces several well-reasoned performance optimizations for the nvfp4 implementation. The changes focus on reducing register pressure, improving instruction efficiency, and tuning kernel launch parameters. The optimizations, such as pipelining the quantization process and simplifying scale factor calculations, are sound. I have one suggestion to improve the code's readability and maintainability.

Comment on lines +1433 to +1452
if constexpr (GetQuantType<Pattern> == QuantType::kFP4) {
// Try to use 160 as block_size if possible (better occupancy for FP4)
if (threads_per_token % 160 == 0 && 160 <= max_threads_per_block && 160 >= 128) {
block_size = 160;
cluster_size = threads_per_token / 160;
if (cluster_size > 8) cluster_size = 8;
}
// Fallback: try 192, 128 if 160 doesn't work
else if (threads_per_token % 192 == 0 && 192 <= max_threads_per_block && 192 >= 128) {
block_size = 192;
cluster_size = threads_per_token / 192;
if (cluster_size > 8) cluster_size = 8;
} else if (threads_per_token % 128 == 0 && 128 <= max_threads_per_block) {
block_size = 128;
cluster_size = threads_per_token / 128;
if (cluster_size > 8) cluster_size = 8;
}
// Update threads_per_block to match block_size for SM count check
threads_per_block = block_size;
}

Severity: medium

The logic for selecting a special block size for FP4 kernels is a bit repetitive and contains some redundant conditions. This can be refactored to be more concise and easier to read.

Specifically:

  • The condition 160 >= 128 (and similar for 192) is always true and can be removed.
  • The block of code to update cluster_size and cap it at 8 is repeated in each branch.

I've suggested a refactoring that consolidates this logic, making it cleaner and more maintainable without changing the functionality.

  if constexpr (GetQuantType<Pattern> == QuantType::kFP4) {
    int new_block_size = 0;
    // Try to use 160 as block_size if possible (better occupancy for FP4)
    if (threads_per_token % 160 == 0 && 160 <= max_threads_per_block) {
      new_block_size = 160;
    } else if (threads_per_token % 192 == 0 && 192 <= max_threads_per_block) {
      // Fallback: try 192, 128 if 160 doesn't work
      new_block_size = 192;
    } else if (threads_per_token % 128 == 0 && 128 <= max_threads_per_block) {
      new_block_size = 128;
    }

    if (new_block_size > 0) {
      block_size = new_block_size;
      cluster_size = threads_per_token / block_size;
      if (cluster_size > 8) {
        cluster_size = 8;
      }
      // Update threads_per_block to match block_size for SM count check
      threads_per_block = block_size;
    }
  }

A collaborator replied:

gemini's suggestion looks reasonable.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 421433e and 9efd439.

📒 Files selected for processing (1)
  • include/flashinfer/comm/trtllm_allreduce_fusion.cuh
🔇 Additional comments (5)
include/flashinfer/comm/trtllm_allreduce_fusion.cuh (5)

534-552: LGTM - New helper function for pipelined e2m1 conversion.

The function correctly uses PTX cvt.rn.satfinite.e2m1x2.f32 for sm_100+ and falls back to returning 0 for older architectures, consistent with other conversion functions in this file. The operand ordering (%2, %1 = pair.y, pair.x) matches the convention used in fp32_vec_to_e2m1.


674-694: Good optimization - Reduced register pressure in quantization loop.

Processing one float2 pair at a time instead of collecting into a 4-element array reduces register usage from 32 bytes to 8 bytes. The bit-packing with (i * 8) shifts correctly places each byte in the appropriate position of the uint32_t result.


1132-1157: Good optimization - Per-element accumulation reduces register footprint.

The refactored loop processes each element position independently, accumulating across all ranks before immediately casting back to type T. This eliminates the intermediate float acc[VEC_SIZE] array, reducing register usage. The result is mathematically equivalent (within FP associativity tolerances).
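The shape of this change can be sketched on the host with made-up names (this is not the kernel code, just an illustration of the accumulator restructuring): summing one element position across all ranks at a time produces the same results as filling a full accumulator array first, while keeping only one scalar live.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-based form: a full VEC_SIZE accumulator vector stays live throughout.
std::vector<float> allreduce_array(const std::vector<std::vector<float>>& vals) {
  std::vector<float> acc(vals[0].size(), 0.0f);
  for (const auto& rank : vals)
    for (size_t i = 0; i < acc.size(); ++i) acc[i] += rank[i];
  return acc;
}

// Per-element form: one scalar accumulator per position, written back immediately.
std::vector<float> allreduce_scalar(const std::vector<std::vector<float>>& vals) {
  std::vector<float> out(vals[0].size());
  for (size_t i = 0; i < out.size(); ++i) {
    float acc = 0.0f;  // single live scalar instead of a whole array
    for (const auto& rank : vals) acc += rank[i];
    out[i] = acc;
  }
  return out;
}
```

On a GPU the payoff is register pressure, not instruction count: the compiler no longer has to keep `VEC_SIZE` partial sums resident at once.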


1456-1469: SM count adjustment loop looks correct.

The loop properly adjusts both threads_per_block and block_size (for FP4) when scaling down cluster_size due to SM count constraints. The separation of concerns between FP4 and non-FP4 paths is appropriate.


663-667: The outputScale formula is mathematically correct. Both the old and new implementations compute the same result: SFScaleVal / quantized_sf, where quantized_sf = fp8(SFScaleVal * vecMax / 6.0). The recipe comment notation is admittedly confusing, but algebraically simplifies to the same formula in both versions.
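The equivalence claimed here is plain algebra; a minimal host-side sketch (illustrative names, not the kernel's code) shows both forms side by side:

```cpp
#include <cassert>
#include <cmath>

// Old path (sketch): reciprocal followed by multiply.
float output_scale_old(float sf_scale, float quantized_sf) {
  return quantized_sf != 0.0f ? sf_scale * (1.0f / quantized_sf) : 0.0f;
}

// New path (sketch): a single division, saving one reciprocal and one multiply.
// Algebraically identical; float results may differ by roughly one ulp.
float output_scale_new(float sf_scale, float quantized_sf) {
  return quantized_sf != 0.0f ? sf_scale / quantized_sf : 0.0f;
}
```

As Edenzzzz notes later in the thread, the throughput difference between the two forms was not measurable in practice.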

Comment on lines +1433 to +1452
if constexpr (GetQuantType<Pattern> == QuantType::kFP4) {
// Try to use 160 as block_size if possible (better occupancy for FP4)
if (threads_per_token % 160 == 0 && 160 <= max_threads_per_block && 160 >= 128) {
block_size = 160;
cluster_size = threads_per_token / 160;
if (cluster_size > 8) cluster_size = 8;
}
// Fallback: try 192, 128 if 160 doesn't work
else if (threads_per_token % 192 == 0 && 192 <= max_threads_per_block && 192 >= 128) {
block_size = 192;
cluster_size = threads_per_token / 192;
if (cluster_size > 8) cluster_size = 8;
} else if (threads_per_token % 128 == 0 && 128 <= max_threads_per_block) {
block_size = 128;
cluster_size = threads_per_token / 128;
if (cluster_size > 8) cluster_size = 8;
}
// Update threads_per_block to match block_size for SM count check
threads_per_block = block_size;
}

⚠️ Potential issue | 🟠 Major

Potential correctness issue: threads_per_token may not be fully covered when cluster_size is capped.

When threads_per_token / block_size > 8, the cluster_size is capped at 8 but block_size is not adjusted to compensate. This breaks the invariant threads_per_token == block_size * cluster_size.

Example with hidden_dim=12800, VEC_SIZE=8:

  • threads_per_token = 1600
  • block_size = 160, cluster_size = 10 → capped to 8
  • Effective threads = 160 * 8 = 1280 < 1600

This would result in some elements not being processed.

🔎 Suggested fix: Recalculate block_size after capping cluster_size
     if (threads_per_token % 160 == 0 && 160 <= max_threads_per_block && 160 >= 128) {
       block_size = 160;
       cluster_size = threads_per_token / 160;
-      if (cluster_size > 8) cluster_size = 8;
+      if (cluster_size > 8) {
+        cluster_size = 8;
+        // Recalculate block_size to ensure full coverage
+        block_size = threads_per_token / cluster_size;
+      }
     }

Apply similar logic to the 192 and 128 fallback cases.
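The arithmetic behind this finding can be checked in isolation (a sketch with hypothetical helper names, using the numbers from the example above):

```cpp
#include <cassert>

struct LaunchShape {
  int block_size;
  int cluster_size;
};

// Cap-only variant (the behavior flagged above): block_size is left unchanged,
// so block_size * cluster_size can under-cover threads_per_token.
LaunchShape capped_only(int threads_per_token, int block_size) {
  int cluster = threads_per_token / block_size;
  if (cluster > 8) cluster = 8;
  return {block_size, cluster};
}

// Suggested fix: recompute block_size after capping so the product exactly
// covers threads_per_token again.
LaunchShape capped_and_recomputed(int threads_per_token, int block_size) {
  int cluster = threads_per_token / block_size;
  if (cluster > 8) {
    cluster = 8;
    block_size = threads_per_token / cluster;
  }
  return {block_size, cluster};
}
```

With threads_per_token = 1600 and block_size = 160, the cap-only variant covers 160 × 8 = 1280 threads, while the recomputed variant restores full coverage at 200 × 8 = 1600.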

Committable suggestion skipped: line range outside the PR's diff.


@yzh119 (Collaborator) commented Dec 25, 2025

/bot run

@flashinfer-bot (Collaborator):
GitLab MR !219 has been created, and the CI pipeline #40815147 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot (Collaborator):
[FAILED] Pipeline #40815147: 11/20 passed

@yzh119 (Collaborator) left a comment
cc @timlee0212 for viz.

// Convert the input to float.
float2 fp2Vals[details::CVT_FP4_ELTS_PER_THREAD / 2];
// Convert the input to float and quantize (pipelined to reduce register usage).
// Optimization: use single float2 instead of array to reduce register pressure from 32 bytes to 8

Do you have any profiling results showing the register usage (e.g. from cuobjdump or ncu)?

Reply from the author (@Bruce-x-1997):
oh, I will give it these days

@yzh119 merged commit 4de2a45 into flashinfer-ai:main Dec 27, 2025
4 checks passed
: 0.0f;
// Optimization: mathematically equivalent to SFScaleVal / quantized_sf, but more efficient
// (reduces 1 reciprocal call and 1 multiply operation)
float outputScale = quantized_sf != 0 ? SFScaleVal / quantized_sf : 0.0f;
@Edenzzzz (Contributor) commented Dec 28, 2025
Interesting, I thought division would take a few times longer than reciprocal approximation--curious if you ablated this

EDIT: I tried and it doesn't make noticeable diff

timlee0212 added a commit to timlee0212/flashinfer that referenced this pull request Jan 6, 2026
timlee0212 added a commit to timlee0212/flashinfer that referenced this pull request Jan 12, 2026

// FP4 optimization: apply BEFORE SM count check to avoid being overridden
// This allows FP4 to use smaller block_size even when cluster_num is large
if constexpr (GetQuantType<Pattern> == QuantType::kFP4) {
@timlee0212 (Contributor) commented Jan 12, 2026
Hi @Bruce-x-1997, will this occupancy optimization be triggered in the DeepSeek case mentioned in this PR? 7168/8 = 896 is not divisible by either 160 or 192, so 128 will be used, which is typically a preferred block_size. I have tried the other optimizations mentioned in this PR but did not find a noticeable difference (~0.05 us faster). I haven't checked the register usage, though. Does the 10-15% improvement mentioned in the PR result from using this block size (and possibly a different cluster size) instead of the original heuristic?
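The divisibility claim in this comment is easy to verify with a sketch of the candidate-selection heuristic (hypothetical helper, mirroring the branch order quoted earlier in the thread):

```cpp
#include <cassert>
#include <initializer_list>

// Sketch of the FP4 block-size candidate search: first divisor of
// threads_per_token among {160, 192, 128} that fits the block limit wins.
int fp4_block_size(int threads_per_token, int max_threads_per_block) {
  for (int candidate : {160, 192, 128})
    if (threads_per_token % candidate == 0 && candidate <= max_threads_per_block)
      return candidate;
  return 0;  // no candidate divides threads_per_token
}
```

For the DeepSeek-V3 shape (7168/8 = 896 threads per token), only the 128 branch fires, just as the comment observes, so the 160/192 preferences never apply there.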
