feat: enable deepgemm jit for fp8 block-scale on SM90 (#1969)

yzh119 merged 3 commits into flashinfer-ai:main
Conversation
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Walkthrough

The pull request modifies the JIT compilation configuration for the TensorRT LLM DeepGEMM module. The include-directory discovery mechanism now uses the flashinfer-python package instead of tensorrt_llm, with updated path resolution. Additionally, the NVCC usage logic is simplified to always use NVCC instead of reading an environment variable.
Summary of Changes

Hello @djmmoss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates JIT compilation capabilities for FP8 DeepGEMM kernels into the FlashInfer project. The primary goal is to enhance performance by allowing dynamic compilation of these kernels. This is achieved by ensuring that NVCC is always used for JIT compilation and by correctly configuring the include paths to resolve dependencies within the flashinfer-python package.
Code Review
This pull request enables JIT compilation for FP8 DeepGEMM kernels, defaulting to NVCC. The changes correctly update package paths from tensorrt_llm to flashinfer-python to locate necessary headers. My review focuses on improving code clarity and maintainability. I've suggested updating a log message to reflect the package change and cleaning up commented-out code while preserving important context in a TODO comment.
```diff
 static bool kJitUseNvcc = []() {
-  char const* env_var = getenv("TRTLLM_DG_JIT_USE_NVCC");
-  return env_var && (std::string(env_var) == "1" || std::string(env_var) == "true");
+  // char const* env_var = getenv("TRTLLM_DG_JIT_USE_NVCC");
+  // return env_var && (std::string(env_var) == "1" || std::string(env_var) == "true");
+  // always use nvcc
+  // TODO: Enable nvrtc -- need these headers:
+  // [TensorRT-LLM][INFO] Compilation log:
+  // kernel.cu(16): catastrophic error: cannot open source file "cuda_bf16.h"
+  return true;
 }();
```
The kJitUseNvcc variable is now hardcoded to true, and the previous implementation that used an environment variable is commented out. While the TODO comment explains why NVRTC is currently disabled, leaving commented-out code can reduce readability and maintainability. It would be cleaner to remove the commented-out lines and refine the comments to keep only the essential context.
```cpp
static bool kJitUseNvcc = []() {
  // NVRTC is currently disabled due to include issues. Using NVCC by default.
  // TODO: Enable NVRTC. It fails with errors like:
  // [TensorRT-LLM][INFO] Compilation log:
  // kernel.cu(16): catastrophic error: cannot open source file "cuda_bf16.h"
  return true;
}();
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
csrc/nv_internal/tensorrt_llm/deep_gemm/compiler.cuh (2)
154-179: Update the outdated comment to reflect the new package name.

The comment at line 155 still references `tensorrt_llm`, but the code now uses `flashinfer-python`. Update the comment for consistency. Apply this diff to update the comment:

```diff
 // Parse the location using regex
-// `pip show tensorrt_llm` will output something like:
+// `pip show flashinfer-python` will output something like:
 // Location: /usr/local/lib/python3.12/dist-packages
 // Editable project location: /code
```
180-182: Update the error message to reflect the new package dependency.

The error message references "TensorRT LLM installation" but the code now searches for `flashinfer-python`. Update the message for consistency. Apply this diff to update the error message:

```diff
 } else {
-  TLLM_LOG_WARNING("Failed to find TensorRT LLM installation, DeepGEMM will be disabled.");
+  TLLM_LOG_WARNING("Failed to find flashinfer-python installation, DeepGEMM will be disabled.");
 }
```
📒 Files selected for processing (2)
- csrc/nv_internal/tensorrt_llm/deep_gemm/compiler.cuh (2 hunks)
- csrc/nv_internal/tensorrt_llm/deep_gemm/runtime.cuh (1 hunks)
🔇 Additional comments (1)
csrc/nv_internal/tensorrt_llm/deep_gemm/runtime.cuh (1)
38-46: LGTM! Appropriate simplification to force NVCC usage.

The hardcoded `return true` with commented-out environment-variable logic is appropriate given the TODO note about missing headers for NVRTC. This ensures consistent behavior until NVRTC support is fully enabled.
```diff
 if (includeDirs.empty()) {
   // Command to execute
-  char const* cmd = "pip show tensorrt_llm 2>/dev/null";
+  char const* cmd = "pip show flashinfer-python 2>/dev/null";
```
What's the purpose of this command?
For the DeepGEMM JIT, the compiler needs the header files in `deep_gemm/`; this command finds the installation path, which is then used further down to add `deep_gemm/` to the `-I` include flags.
I'd tend to move the logic to Python; `pip show flashinfer-python` doesn't necessarily show the correct package information (e.g. at AOT time, when the package is not installed yet).
Or we can obtain the include path from python and pass the value to C++.
I think this is where a refactor might be necessary, unfortunately these deep_gemm kernels aren't captured as part of AOT.
```diff
   }
 } else {
-  TLLM_LOG_WARNING("Failed to find TensorRT LLM installation, DeepGEMM will be disabled.");
+  TLLM_LOG_WARNING("Failed to find FlashInfer installation, DeepGEMM will be disabled.");
```
I guess we can safely assume flashinfer is installed if this function is called?
Follow-up PR (#2090): pass the DeepGEMM JIT include directory through Python APIs

## 📌 Description

This PR implements the refactor mentioned in https://github.com/flashinfer-ai/flashinfer/pull/1969/files#r2461856020

In our current design we rely on calling `pip show flashinfer-python 2>/dev/null || uv pip show flashinfer-python 2>/dev/null` to obtain the DeepGEMM JIT include directory, which is error-prone (e.g. it will fail if the user does not have `pip` available in the environment). In this PR we pass the DeepGEMM JIT include directory through Python APIs.

## 🔍 Related Issues

#1969

## 🚀 Pull Request Checklist

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

cc @djmmoss @jiahanc @nvmbreughe

## Summary by CodeRabbit

* **New Features**
  * Modules now set DeepGEMM JIT include directories at runtime so fused MoE modules have correct JIT include paths during initialization.
* **Chores**
  * JIT compiler API and module build updated to accept and propagate externally provided include directories.
  * Minor header/build adjustments to support the new initialization flow.
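The refactor described in #2090 could look roughly like the following on the Python side — a hedged sketch only, where the directory layout (`data/include`) and the binding name (`set_include_dirs`) are hypothetical:

```python
from pathlib import Path

def deepgemm_include_dirs(package_dir: Path) -> list[str]:
    """Assemble the -I directories the DeepGEMM JIT needs.

    `package_dir` is the installed package root; the subdirectory
    layout used here is illustrative, not the actual FlashInfer layout.
    """
    include_root = package_dir / "data" / "include"
    return [str(include_root), str(include_root / "deep_gemm")]

# The computed list would then be handed to the C++ side once at module
# init, e.g. via a binding such as `module.set_include_dirs(dirs)`
# (name hypothetical).
dirs = deepgemm_include_dirs(Path("/opt/flashinfer"))
print(dirs)
```

Computing the paths in Python and passing them down removes the subprocess call entirely, which is the design the reviewers converged on.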
Merged commit (#1969): feat: enable deepgemm jit for fp8 block-scale on SM90

## 📌 Description

Enable JIT compilation for the FP8 DeepGEMM kernels. NVRTC is currently disabled; NVCC is used by default.

## Summary by CodeRabbit

* **Refactor**
  * JIT include directory discovery now uses the flashinfer-python package instead of the previous package.
  * Updated resolved include path to the flashinfer data location.
  * Runtime compilation now consistently uses NVCC; the prior environment-variable toggle was removed.
  * Updated warning text when the expected package installation cannot be found.

Signed-off-by: Duncan Moss <djm.moss@gmail.com>
📌 Description
Enable JIT compilation for the FP8 DeepGEMM kernels. NVRTC is currently disabled; NVCC is used by default.
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Summary by CodeRabbit