
fix: update block size and tile size heuristic for 16+ num of token #2492

Open
nv-yunzheq wants to merge 1 commit into flashinfer-ai:main from nv-yunzheq:fix_2407

Conversation

@nv-yunzheq
Collaborator

@nv-yunzheq nv-yunzheq commented Feb 4, 2026

Update the block size and tile size heuristic for 16+ tokens; fixes issue #2407.

The regression was caused by PR #2268. The fix adjusts the heuristic choice based on experiments: when the number of tokens is larger than 16, the original heuristic works better.

Results on B200:

FP4 patterns with oneshot=False (the affected cases):

| Tokens | Before 4de2a45 | After 4de2a45 (buggy) | With fix | Result |
|-------:|---------------:|----------------------:|---------:|--------|
| 4      | ~198 us        | ~124 us               | ~124 us  | Preserves 37% improvement |
| 8      | ~104 us        | ~69 us                | ~70 us   | Preserves 33% improvement |
| 16     | ~60 us         | ~44 us                | ~44 us   | Preserves 27% improvement |
| 32     | ~37 us         | ~41 us                | ~37 us   | Regression fixed |
| 64     | ~27 us         | ~34 us                | ~27 us   | Regression fixed |
| 128    | ~29 us         | ~38 us                | ~29 us   | Regression fixed |

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Performance Improvements
    • Refined FP4 optimization to apply conditionally based on token count and operation pattern: the optimization now activates only for oneshot operations or for small token counts, improving efficiency in those cases.

@gemini-code-assist
Contributor

Summary of Changes

Hello @nv-yunzheq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a performance regression observed in FP4 operations with higher token counts by refining the block size heuristic. The change ensures that the optimal block size is selected based on whether the operation is in oneshot mode or if the token count exceeds a specific threshold, thereby restoring and improving performance for affected scenarios.

Highlights

  • Performance Regression Fix: This pull request addresses and fixes a performance regression introduced in PR "[performance] optimize for nvfp4" (#2268), specifically impacting FP4 patterns when oneshot=False and the number of tokens is 32 or greater.
  • Conditional FP4 Optimization: The FP4 block size optimization, which previously applied broadly, is now conditionally applied. It will only be active for oneshot mode or when the number of tokens is less than or equal to 16. For larger token counts (e.g., 32+ tokens) with oneshot=False, the original heuristic is now used, which has been shown to perform better.
  • Heuristic Adjustment: A new constant, kFP4OptimizationTokenThreshold (set to 16), has been introduced to define the boundary for applying the FP4 optimization, ensuring that the most performant heuristic is chosen based on token count.


Changelog
  • include/flashinfer/comm/trtllm_allreduce_fusion.cuh
    • Modified the logic for applying the FP4 block size optimization.
    • Introduced kFP4OptimizationTokenThreshold (set to 16) to control when the FP4 optimization is active.
    • Updated the if constexpr conditions to use a new boolean variable apply_fp4_optimization for clarity and correctness.
Activity
  • The author has provided a detailed description of the problem, the proposed solution, and benchmark results demonstrating the fix.
  • Pre-commit checks and test status are pending completion by the author.

@coderabbitai
Contributor

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

The PR adds conditional gating to FP4 optimization in AllReduce fusion, making it active only for FP4 patterns with oneshot mode or small token counts (≤16), replacing the previous unconditional approach and updating downstream block_size and cluster_size logic accordingly.

Changes

  • FP4 Optimization Gating — include/flashinfer/comm/trtllm_allreduce_fusion.cuh: introduces a threshold-based conditional gate for the FP4 optimization. It now runs only when the pattern is FP4 AND (oneshot OR token_num ≤ 16), and the block_size and cluster_size logic is updated to respect whether the FP4 optimization was applied.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • djmmoss
  • yongwww
  • nvmbreughe
  • cyx-6
  • yzh119


🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Title check ⚠️ Warning — The title mentions "16+ num of token" but the PR objectives indicate the fix applies to 32+ tokens, creating a discrepancy between the title and the actual implementation. Resolution: update the title to "fix: update block size and tile size heuristic for 32+ num of tokens" to accurately reflect the threshold documented in the PR objectives.

✅ Passed checks (2 passed)

  • Description check ✅ Passed — The PR description includes the key information: issue reference (#2407), root cause (PR #2268), the fix rationale, and detailed benchmark results on B200 showing performance impact across token counts.
  • Docstring Coverage ✅ Passed — No functions found in the changed files; docstring coverage check skipped.


@nv-yunzheq nv-yunzheq changed the title fix: update block size and tile size heuristic for 32+ num of token fix: update block size and tile size heuristic for 16+ num of token Feb 4, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly fixes a performance regression for FP4 patterns with 32 or more tokens by adjusting the heuristic for applying a specific FP4 optimization. The change is well-contained and the logic is sound. I've included a suggestion to simplify the code for updating the block_size, which improves maintainability without altering the core logic.

Comment on lines +1468 to 1470
if (apply_fp4_optimization) {
  block_size = threads_per_block;
}

medium

For simplification, this conditional update inside the loop can be removed. block_size is not read within the loop, so it can be updated once after the loop finishes. This simplifies the logic when combined with the change I'm suggesting below.

 

Comment on lines +1474 to 1476
if (!apply_fp4_optimization) {
  block_size = threads_per_block;
}

medium

This logic can be simplified by making the assignment unconditional. The block_size should be updated to the final threads_per_block value after the loop, regardless of whether the FP4 optimization was applied. This makes the code cleaner.

  block_size = threads_per_block;

@nv-yunzheq
Collaborator Author

@timlee0212 I saw you left a comment in the original PR that caused the regression. Do you want to take a look at this one?
