[performance] optimize for nvfp4 #2268
Conversation
Signed-off-by: bruce.xu <bruce.xu@gmicloud.ai>
Summary of Changes (Gemini Code Assist)

This pull request focuses on a series of performance optimizations for the NVFP4 implementation, addressing observed bottlenecks where FP4 was not achieving its theoretical speedup over FP8. The changes primarily involve reducing register usage in CUDA kernels, streamlining floating-point conversions and scale factor calculations, and fine-tuning kernel launch configurations to improve GPU occupancy and overall efficiency for FP4 operations.
📝 Walkthrough

This pull request optimizes FP4/FP8 quantization paths in the allreduce fusion CUDA kernel by introducing a new e2m1 conversion helper function, refactoring quantization flows to reduce register pressure, optimizing accumulator usage, and adjusting kernel launch parameters for improved occupancy.
Code Review
This pull request introduces several well-reasoned performance optimizations for the nvfp4 implementation. The changes focus on reducing register pressure, improving instruction efficiency, and tuning kernel launch parameters. The optimizations, such as pipelining the quantization process and simplifying scale factor calculations, are sound. I have one suggestion to improve the code's readability and maintainability.
```cpp
if constexpr (GetQuantType<Pattern> == QuantType::kFP4) {
  // Try to use 160 as block_size if possible (better occupancy for FP4)
  if (threads_per_token % 160 == 0 && 160 <= max_threads_per_block && 160 >= 128) {
    block_size = 160;
    cluster_size = threads_per_token / 160;
    if (cluster_size > 8) cluster_size = 8;
  }
  // Fallback: try 192, 128 if 160 doesn't work
  else if (threads_per_token % 192 == 0 && 192 <= max_threads_per_block && 192 >= 128) {
    block_size = 192;
    cluster_size = threads_per_token / 192;
    if (cluster_size > 8) cluster_size = 8;
  } else if (threads_per_token % 128 == 0 && 128 <= max_threads_per_block) {
    block_size = 128;
    cluster_size = threads_per_token / 128;
    if (cluster_size > 8) cluster_size = 8;
  }
  // Update threads_per_block to match block_size for SM count check
  threads_per_block = block_size;
}
```
The logic for selecting a special block size for FP4 kernels is a bit repetitive and contains some redundant conditions. It can be refactored to be more concise and easier to read.

Specifically:

- The condition `160 >= 128` (and similarly `192 >= 128`) is always true and can be removed.
- The block of code that updates `cluster_size` and caps it at 8 is repeated in each branch.

I've suggested a refactoring that consolidates this logic, making it cleaner and more maintainable without changing the functionality.
```cpp
if constexpr (GetQuantType<Pattern> == QuantType::kFP4) {
  int new_block_size = 0;
  // Try to use 160 as block_size if possible (better occupancy for FP4)
  if (threads_per_token % 160 == 0 && 160 <= max_threads_per_block) {
    new_block_size = 160;
  } else if (threads_per_token % 192 == 0 && 192 <= max_threads_per_block) {
    // Fallback: try 192, 128 if 160 doesn't work
    new_block_size = 192;
  } else if (threads_per_token % 128 == 0 && 128 <= max_threads_per_block) {
    new_block_size = 128;
  }
  if (new_block_size > 0) {
    block_size = new_block_size;
    cluster_size = threads_per_token / block_size;
    if (cluster_size > 8) {
      cluster_size = 8;
    }
    // Update threads_per_block to match block_size for SM count check
    threads_per_block = block_size;
  }
}
```
gemini's suggestion looks reasonable.
Actionable comments posted: 1
📒 Files selected for processing (1)
include/flashinfer/comm/trtllm_allreduce_fusion.cuh
🔇 Additional comments (5)
include/flashinfer/comm/trtllm_allreduce_fusion.cuh (5)
534-552: LGTM - New helper function for pipelined e2m1 conversion.

The function correctly uses PTX `cvt.rn.satfinite.e2m1x2.f32` for sm_100+ and falls back to returning 0 for older architectures, consistent with other conversion functions in this file. The operand ordering (`%2`, `%1` → `pair.y`, `pair.x`) matches the convention used in `fp32_vec_to_e2m1`.
674-694: Good optimization - Reduced register pressure in quantization loop.

Processing one `float2` pair at a time instead of collecting into a 4-element array reduces register usage from 32 bytes to 8 bytes. The bit-packing with `(i * 8)` shifts correctly places each byte in the appropriate position of the `uint32_t` result.
1132-1157: Good optimization - Per-element accumulation reduces register footprint.

The refactored loop processes each element position independently, accumulating across all ranks before immediately casting back to type `T`. This eliminates the intermediate `float acc[VEC_SIZE]` array, reducing register usage. The result is mathematically equivalent (within FP associativity tolerances).
1456-1469: SM count adjustment loop looks correct.

The loop properly adjusts both `threads_per_block` and `block_size` (for FP4) when scaling down `cluster_size` due to SM count constraints. The separation of concerns between FP4 and non-FP4 paths is appropriate.
663-667: The outputScale formula is mathematically correct.

Both the old and new implementations compute the same result: `SFScaleVal / quantized_sf`, where `quantized_sf = fp8(SFScaleVal * vecMax / 6.0)`. The recipe comment notation is admittedly confusing, but algebraically it simplifies to the same formula in both versions.
```cpp
if constexpr (GetQuantType<Pattern> == QuantType::kFP4) {
  // Try to use 160 as block_size if possible (better occupancy for FP4)
  if (threads_per_token % 160 == 0 && 160 <= max_threads_per_block && 160 >= 128) {
    block_size = 160;
    cluster_size = threads_per_token / 160;
    if (cluster_size > 8) cluster_size = 8;
  }
  // Fallback: try 192, 128 if 160 doesn't work
  else if (threads_per_token % 192 == 0 && 192 <= max_threads_per_block && 192 >= 128) {
    block_size = 192;
    cluster_size = threads_per_token / 192;
    if (cluster_size > 8) cluster_size = 8;
  } else if (threads_per_token % 128 == 0 && 128 <= max_threads_per_block) {
    block_size = 128;
    cluster_size = threads_per_token / 128;
    if (cluster_size > 8) cluster_size = 8;
  }
  // Update threads_per_block to match block_size for SM count check
  threads_per_block = block_size;
}
```
Potential correctness issue: `threads_per_token` may not be fully covered when `cluster_size` is capped.

When `threads_per_token / block_size > 8`, `cluster_size` is capped at 8 but `block_size` is not adjusted to compensate. This breaks the invariant `threads_per_token == block_size * cluster_size`.

Example with `hidden_dim = 12800`, `VEC_SIZE = 8`:

- `threads_per_token = 1600`
- `block_size = 160`, `cluster_size = 10` → capped to `8`
- Effective threads = `160 * 8 = 1280 < 1600`

This would result in some elements not being processed.
🔎 Suggested fix: Recalculate block_size after capping cluster_size

```diff
 if (threads_per_token % 160 == 0 && 160 <= max_threads_per_block && 160 >= 128) {
   block_size = 160;
   cluster_size = threads_per_token / 160;
-  if (cluster_size > 8) cluster_size = 8;
+  if (cluster_size > 8) {
+    cluster_size = 8;
+    // Recalculate block_size to ensure full coverage
+    block_size = threads_per_token / cluster_size;
+  }
 }
```

Apply similar logic to the 192 and 128 fallback cases.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In include/flashinfer/comm/trtllm_allreduce_fusion.cuh around lines 1433-1452,
the current FP4 branch caps cluster_size at 8 without adjusting block_size,
breaking the invariant threads_per_token == block_size * cluster_size and
leaving some threads unprocessed; after capping cluster_size recompute
block_size = threads_per_token / cluster_size (and if needed enforce block_size
<= max_threads_per_block and >=128) for the 160, 192 and 128 branches so the
product exactly covers threads_per_token, then update threads_per_block =
block_size for the SM check.
/bot run
[FAILED] Pipeline #40815147: 11/20 passed
```diff
-  // Convert the input to float.
-  float2 fp2Vals[details::CVT_FP4_ELTS_PER_THREAD / 2];
+  // Convert the input to float and quantize (pipelined to reduce register usage).
+  // Optimization: use single float2 instead of array to reduce register pressure from 32 bytes to 8
```
Do you have any profiling results showing the register usage (e.g. from cuobjdump or ncu)?
oh, I will provide them in the next few days
```cpp
        : 0.0f;
// Optimization: mathematically equivalent to SFScaleVal / quantized_sf, but more efficient
// (reduces 1 reciprocal call and 1 multiply operation)
float outputScale = quantized_sf != 0 ? SFScaleVal / quantized_sf : 0.0f;
```
Interesting, I thought division would take a few times longer than reciprocal approximation; curious if you ablated this.

EDIT: I tried it and it doesn't make a noticeable difference.
```cpp
// FP4 optimization: apply BEFORE SM count check to avoid being overridden
// This allows FP4 to use smaller block_size even when cluster_num is large
if constexpr (GetQuantType<Pattern> == QuantType::kFP4) {
```
Hi @Bruce-x-1997, will this occupancy optimization be triggered in the DeepSeek case mentioned in this PR? 7168/8 = 896 is not divisible by either 160 or 192, so 128 will be used, which is typically a preferred block_size anyway. I have tried the other optimizations mentioned in this PR but did not find a noticeable difference (~0.05 us faster). Haven't checked the register usage though. Does the 10-15% improvement mentioned in the PR result from using this block size (and possibly a different cluster size) instead of the original heuristic?
📌 Description

I found that the nvfp4 implementation achieves only a 1.3-1.4x speedup compared to fp8 in the deepseek-v3-0324 model. Since the fp4 PFLOPS is twice that of fp8, I think there are points that could be optimized. After applying this PR, we get an extra 10-15% speedup on fp4:

1369.89 / 1192.91 = 1.148, i.e. roughly a 15% speedup
test cmd
accuracy
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [ ] I have installed `pre-commit` by running `pip install pre-commit` (or used my preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- [ ] Tests have been added or updated as needed (unit tests, etc.).

Reviewer Notes
Summary by CodeRabbit