fix: use type-specific FP8 max value for clamping in RMSNorm quantization kernels #2612
Bias92 wants to merge 2 commits into flashinfer-ai:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request addresses a bug in the FP8 quantization kernels where a hardcoded E4M3 clamp value was applied universally, leading to significant data truncation for E5M2 output types. By introducing a type-aware trait, the correct maximum clamp value is now applied based on the specific FP8 format, improving numerical precision and correctness in RMSNorm and FusedAddRMSNorm quantization.
No actionable comments were generated in the recent review.
📝 Walkthrough: Introduces a templated FP8 clamp bound, `fp8_clamp_max`, used by the RMSNorm and FusedAddRMSNorm quantization kernels in place of the hardcoded E4M3 maximum.
Code Review
The pull request effectively addresses the hardcoded FP8 max value issue by introducing a type-aware `fp8_clamp_max<O>` trait. This change correctly handles both E4M3 and E5M2 FP8 types, preventing incorrect truncation of the representable range. The implementation is clean and directly resolves the identified problem.
🧹 Nitpick comments (1)
include/flashinfer/norm.cuh (1)
148-159: Dispatch coverage verified; code is correct. The `fp8_clamp_max` trait values (448.0f for E4M3, 57344.0f for E5M2) are correct. The incomplete primary template compiles safely because the `DISPATCH_DLPACK_DTYPE_TO_CTYPE_FP8` macro exhaustively restricts the output type `O` to only `__nv_fp8_e4m3` and `__nv_fp8_e5m2` before kernel instantiation; any other type fails at runtime with an explicit `TVM_FFI_ICHECK` error, not a cryptic compile-time incomplete-type message. Optional improvements (not required):
- Diagnostic clarity: adding a dependent-false `static_assert` in the primary template would make unsupported types more explicit, though the practical risk is low given the dispatch guards. ♻️ Optional diff:

```diff
 template <typename T>
-struct fp8_clamp_max;
+struct fp8_clamp_max {
+  static_assert(sizeof(T) == 0,
+                "fp8_clamp_max: unsupported FP8 type; add a specialization for this type.");
+};
```
- `cuda::std::numeric_limits` alternative: CCCL issue #3349 tracks extending `cuda::std::numeric_limits` for FP8 types. If the CUDA toolkit in use supports it, the hardcoded constants can be replaced with `cuda::std::numeric_limits<O>::max()`. Verify toolkit compatibility before switching.

🤖 Prompt for AI Agents:
Verify each finding against the current code and only fix it if needed. In `include/flashinfer/norm.cuh` around lines 148-159, add an explicit compile-time diagnostic for unsupported FP8 types by updating the primary template `fp8_clamp_max` to contain a dependent-false `static_assert` that triggers when instantiated with any type other than the specialized `__nv_fp8_e4m3` and `__nv_fp8_e5m2`; keep the existing specializations as-is and do not change the dispatch logic (`DISPATCH_DLPACK_DTYPE_TO_CTYPE_FP8`). This will make errors clearer if someone tries to instantiate `fp8_clamp_max` with an unsupported type.
Force-pushed from febb4f4 to 11088ee.
yzh119 left a comment:
LGTM, thanks for the fix.
/bot run
[FAILED] Pipeline #44589115: 13/20 passed
The 3 failing checks appear unrelated to this change: `remove-label` is a permissions issue for external contributors, and the JIT Unittest failures on T4/A10G were cancelled due to infrastructure timeouts before any tests ran.
Hi @jiahanc, @kahyunnam, @lwakuraRein, @nv-yunzheq, I hope you're all doing well! I wanted to send a gentle ping on this PR, as @yzh119 has kindly approved it. Regarding the failing CI checks, these appear to be unrelated to the actual change.
Summary
Replace the hardcoded FP8 E4M3 clamp value (448.0) with a type-aware `fp8_clamp_max<O>` trait in `RMSNormQuantKernel` and `FusedAddRMSNormQuantKernel`.

Problem
Both kernels hardcode the E4M3 max value (448.0f) for output clamping.
However, the output type `O` is dispatched via `DISPATCH_DLPACK_DTYPE_TO_CTYPE_FP8` in `csrc/norm.cu`, which handles both E4M3 (max = 448) and E5M2 (max = 57344). When the output dtype is E5M2, this incorrectly truncates ~99% of the representable range.

Fix
Added a `fp8_clamp_max<T>` trait with the correct max values:

- `__nv_fp8_e4m3`: 448.0f
- `__nv_fp8_e5m2`: 57344.0f
RMSNormQuantKernelandFusedAddRMSNormQuantKernel.Summary by CodeRabbit