
fix: use type-specific FP8 max value for clamping in RMSNorm quantization kernels#2612

Open
Bias92 wants to merge 2 commits into flashinfer-ai:main from Bias92:fix/fp8-e5m2-clamp-range-norm

Conversation


@Bias92 Bias92 commented Feb 21, 2026

Summary

Replace hardcoded FP8 E4M3 clamp value (448.0) with a type-aware fp8_clamp_max<O> trait in RMSNormQuantKernel and FusedAddRMSNormQuantKernel.

Problem

Both kernels hardcode the E4M3 max value for output clamping:

output_vec[j] = fmaxf(-448.0f, fminf(output_vec[j], 448.0f));

However, the output type O is dispatched via DISPATCH_DLPACK_DTYPE_TO_CTYPE_FP8 in csrc/norm.cu, which handles both E4M3 (max=448) and E5M2 (max=57344). When the output dtype is E5M2, this incorrectly truncates ~99% of the representable range.

Fix

Added fp8_clamp_max<T> trait with correct max values:

  • __nv_fp8_e4m3: 448.0f
  • __nv_fp8_e5m2: 57344.0f

Applied to both RMSNormQuantKernel and FusedAddRMSNormQuantKernel.

Summary by CodeRabbit

  • Refactor
    • Replaced fixed ±448.0 clamping with type-specific clamp values for FP8-like outputs.
    • Updated quantized RMS normalization paths to use the new per-output clamping constants, improving numerical consistency across precisions.
    • Clamping is now driven by output precision so bounds vary appropriately; no other algorithmic behavior was changed.

@gemini-code-assist

Summary of Changes

Hello @Bias92, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in FP8 quantization kernels where a hardcoded clamp value for E4M3 was being applied universally, leading to significant data truncation for E5M2 output types. By introducing a type-aware trait, the system now dynamically applies the correct maximum clamp value based on the specific FP8 format, thereby improving numerical precision and correctness in RMSNorm and FusedAddRMSNorm quantization.

Highlights

  • FP8 Clamping Correction: Replaced a hardcoded FP8 E4M3 clamp value (448.0f) with a type-aware fp8_clamp_max<O> trait in RMSNormQuantKernel and FusedAddRMSNormQuantKernel.
  • Type-Specific Max Values: Introduced fp8_clamp_max trait to provide correct maximum clamp values for both __nv_fp8_e4m3 (448.0f) and __nv_fp8_e5m2 (57344.0f) types.
  • Prevented Truncation: Ensured that E5M2 output types are no longer incorrectly truncated, preserving their full representable range during clamping operations.


Changelog
  • include/flashinfer/norm.cuh
    • Added fp8_clamp_max template struct with specializations for __nv_fp8_e4m3 and __nv_fp8_e5m2 to define type-specific maximum clamp values.
    • Modified RMSNormQuantKernel to use fp8_clamp_max<O>::value for output clamping instead of the hardcoded 448.0f.
    • Modified FusedAddRMSNormQuantKernel to use fp8_clamp_max<O>::value for output clamping instead of the hardcoded 448.0f.


coderabbitai bot commented Feb 21, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 11088ee and a45c497.

📒 Files selected for processing (1)
  • include/flashinfer/norm.cuh

📝 Walkthrough

Walkthrough

Introduces a templated FP8 clamp bound (fp8_clamp_max) with specializations for __nv_fp8_e4m3 and __nv_fp8_e5m2, and replaces hard-coded ±448.0f clamps in RMSNormQuantKernel and FusedAddRMSNormQuantKernel with fp8_clamp_max<O>::value.

Changes

  • FP8 clamping and kernels (include/flashinfer/norm.cuh): Added template <typename T> struct fp8_clamp_max and specializations (__nv_fp8_e4m3 -> 448.0f, __nv_fp8_e5m2 -> 57344.0f). Replaced hard-coded per-element clamp bounds in RMSNormQuantKernel and FusedAddRMSNormQuantKernel with fp8_clamp_max<O>::value.


Suggested reviewers

  • IwakuraRein
  • kahyunnam
  • jiahanc
  • nv-yunzheq


🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check: ❓ Inconclusive. The description provides context, a problem statement, and the fix applied, but omits required sections from the template (checklist items for pre-commit and tests). Resolution: complete the PR template sections, verify pre-commit checks ran, confirm tests were added or updated, and mark the checklist items as required by the template.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly and specifically describes the main change: replacing hardcoded FP8 clamping values with type-aware bounds in RMSNorm kernels.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request effectively addresses the hardcoded FP8 max value issue by introducing a type-aware fp8_clamp_max<O> trait. This change correctly handles both E4M3 and E5M2 FP8 types, preventing incorrect truncation of the representable range. The implementation is clean and directly resolves the identified problem.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
include/flashinfer/norm.cuh (1)

148-159: Dispatch coverage verified; code is correct.

The fp8_clamp_max trait values (448.0f for E4M3, 57344.0f for E5M2) are correct. The incomplete primary template compiles safely because the DISPATCH_DLPACK_DTYPE_TO_CTYPE_FP8 macro exhaustively restricts the output type O to only __nv_fp8_e4m3 and __nv_fp8_e5m2 before kernel instantiation; any other type fails at runtime with an explicit TVM_FFI_ICHECK error, not a cryptic compile-time incomplete-type message.

Optional improvements (not required):

  1. Diagnostic clarity — Adding a dependent-false static_assert in the primary template would make unsupported types more explicit, though the practical risk is low given the dispatch guards:
♻️ Optional: improve diagnostic
 template <typename T>
-struct fp8_clamp_max;
+struct fp8_clamp_max {
+  static_assert(sizeof(T) == 0,
+                "fp8_clamp_max: unsupported FP8 type; add a specialization for this type.");
+};
  2. cuda::std::numeric_limits alternative — CCCL issue #3349 tracks extending cuda::std::numeric_limits for FP8 types. If the CUDA toolkit in use supports it, the hardcoded constants can be replaced with cuda::std::numeric_limits<O>::max(). Verify toolkit compatibility before switching.

@Bias92 Bias92 force-pushed the fix/fp8-e5m2-clamp-range-norm branch from febb4f4 to 11088ee on February 21, 2026 at 17:07

@yzh119 yzh119 left a comment


LGTM, thanks for the fix.


yzh119 commented Feb 23, 2026

/bot run

@yzh119 yzh119 added the run-ci label Feb 23, 2026
@flashinfer-bot

GitLab MR !338 has been created, and the CI pipeline #44589115 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot

[FAILED] Pipeline #44589115: 13/20 passed


Bias92 commented Feb 28, 2026

The three failing checks appear unrelated to this change: remove-label is a permissions issue for external contributors, and the JIT Unittest failures on T4/A10G were cancelled due to infrastructure timeouts before any tests ran.


Bias92 commented Mar 4, 2026

Hi @jiahanc, @kahyunnam, @IwakuraRein, @nv-yunzheq — I hope you're all doing well! I wanted to send a gentle ping on this PR, as @yzh119 has kindly approved it.

Regarding the failing CI checks — these appear to be unrelated to the actual change:

  • PR Label Cleanup / remove-label: a permissions issue for external contributors
  • JIT Unittest (T4/A10G): cancelled due to infrastructure timeouts before any tests ran

I'd greatly appreciate a review when you have a moment. Thank you so much for your time!
