
fix: update block size and tile size heuristic for 16+ num of token #2492

Open
nv-yunzheq wants to merge 1 commit into flashinfer-ai:main from nv-yunzheq:fix_2407

Conversation

@nv-yunzheq
Collaborator

@nv-yunzheq nv-yunzheq commented Feb 4, 2026

Update the block size and tile size heuristic for 16+ tokens; fixes issue #2407.

The regression was caused by PR #2268. The fix adjusts the heuristic choice based on experiments: when the number of tokens is larger than 16, the original heuristic works better.

Results on B200:

FP4 patterns with oneshot=False (the affected cases):

| Tokens | Before 4de2a45 | After 4de2a45 (buggy) | With fix | Result |
|-------:|---------------:|----------------------:|---------:|--------|
| 4      | ~198 us        | ~124 us               | ~124 us  | Preserves 37% improvement |
| 8      | ~104 us        | ~69 us                | ~70 us   | Preserves 33% improvement |
| 16     | ~60 us         | ~44 us                | ~44 us   | Preserves 27% improvement |
| 32     | ~37 us         | ~41 us                | ~37 us   | Regression fixed |
| 64     | ~27 us         | ~34 us                | ~27 us   | Regression fixed |
| 128    | ~29 us         | ~38 us                | ~29 us   | Regression fixed |

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Performance Improvements
    • Refined FP4 optimization to apply conditionally based on token count and operation pattern: the optimization now activates only for oneshot operations or for small token counts, improving efficiency in those cases.

@gemini-code-assist
Contributor

Summary of Changes

Hello @nv-yunzheq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a performance regression observed in FP4 operations with higher token counts by refining the block size heuristic. The change ensures that the optimal block size is selected based on whether the operation is in oneshot mode or if the token count exceeds a specific threshold, thereby restoring and improving performance for affected scenarios.

Highlights

  • Performance Regression Fix: This pull request addresses and fixes a performance regression introduced in PR "[performance] optimize for nvfp4" (#2268), specifically impacting FP4 patterns when oneshot=False and the number of tokens is 32 or greater.
  • Conditional FP4 Optimization: The FP4 block size optimization, which previously applied broadly, is now conditionally applied. It will only be active for oneshot mode or when the number of tokens is less than or equal to 16. For larger token counts (e.g., 32+ tokens) with oneshot=False, the original heuristic is now used, which has been shown to perform better.
  • Heuristic Adjustment: A new constant, kFP4OptimizationTokenThreshold (set to 16), has been introduced to define the boundary for applying the FP4 optimization, ensuring that the most performant heuristic is chosen based on token count.


Changelog
  • include/flashinfer/comm/trtllm_allreduce_fusion.cuh
    • Modified the logic for applying the FP4 block size optimization.
    • Introduced kFP4OptimizationTokenThreshold (set to 16) to control when the FP4 optimization is active.
    • Updated the if constexpr conditions to use a new boolean variable apply_fp4_optimization for clarity and correctness.
Activity
  • The author has provided a detailed description of the problem, the proposed solution, and benchmark results demonstrating the fix.
  • Pre-commit checks and test status are pending completion by the author.

@coderabbitai
Contributor

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

The PR adds conditional gating to FP4 optimization in AllReduce fusion, making it active only for FP4 patterns with oneshot mode or small token counts (≤16), replacing the previous unconditional approach and updating downstream block_size and cluster_size logic accordingly.

Changes

  • FP4 Optimization Gating — include/flashinfer/comm/trtllm_allreduce_fusion.cuh: introduces a threshold-based conditional gate for the FP4 optimization. It now runs only when the pattern is FP4 AND (oneshot OR token_num ≤ 16), and the block_size and cluster_size logic is updated to respect whether the FP4 optimization was applied.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • djmmoss
  • yongwww
  • nvmbreughe
  • cyx-6
  • yzh119


🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Title check ⚠️ Warning — The title mentions "16+ num of token" but the PR objectives indicate the fix applies to 32+ tokens, creating a discrepancy between the title and the actual implementation. Resolution: update the title to "fix: update block size and tile size heuristic for 32+ num of tokens" to accurately reflect the threshold documented in the PR objectives.

✅ Passed checks (2 passed)

  • Description check ✅ Passed — The PR description includes the key information: issue reference (#2407), root cause (PR #2268), the fix rationale, and detailed benchmark results on B200 showing performance impact across token counts.
  • Docstring Coverage ✅ Passed — No functions found in the changed files; docstring coverage check skipped.


@nv-yunzheq nv-yunzheq changed the title fix: update block size and tile size heuristic for 32+ num of token fix: update block size and tile size heuristic for 16+ num of token Feb 4, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly fixes a performance regression for FP4 patterns with 32 or more tokens by adjusting the heuristic for applying a specific FP4 optimization. The change is well-contained and the logic is sound. I've included a suggestion to simplify the code for updating the block_size, which improves maintainability without altering the core logic.

Comment on lines +1468 to 1470
if (apply_fp4_optimization) {
  block_size = threads_per_block;
}

medium

For simplification, this conditional update inside the loop can be removed. block_size is not read within the loop, so it can be updated once after the loop finishes. This simplifies the logic when combined with the change I'm suggesting below.

 

Comment on lines +1474 to 1476
if (!apply_fp4_optimization) {
  block_size = threads_per_block;
}

medium

This logic can be simplified by making the assignment unconditional. The block_size should be updated to the final threads_per_block value after the loop, regardless of whether the FP4 optimization was applied. This makes the code cleaner.

  block_size = threads_per_block;

@nv-yunzheq
Collaborator Author

@timlee0212 I saw you left a comment in the original PR that caused the regression. Do you want to take a look at this one?
