
[Perf][Feature] Add SM103-specific schedulers for NVFP4 CUTLASS kernels#2303

Merged
bkryu merged 4 commits into flashinfer-ai:main from LopezCastroRoberto:feature/sm103_specific_schedulers
Feb 3, 2026

Conversation

@LopezCastroRoberto
Contributor

@LopezCastroRoberto LopezCastroRoberto commented Jan 7, 2026

Summary

This PR adds new template specializations for SM103 NVFP4 CUTLASS GEMM kernels using architecture-specific tile shapes, cluster shapes, and schedulers.

Motivation

SM103 specifications show a higher NVFP4-over-BF16 speedup ratio than B200 (6× vs. 4×), but current kernels remain far from this limit.
This PR introduces SM103-optimized templates to improve the achieved performance on this architecture.

The performance gains are more pronounced at larger batch sizes, while the previous SM100 configurations remain preferable in other cases.
For this reason, SM103-specific configurations were added alongside the existing ones rather than replacing them, and the optimal configuration is automatically selected as part of the autotuning process.
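
The autotuning selection described above can be sketched as a simple time-and-pick loop. This is a minimal illustration with hypothetical names; the real tuner lives in flashinfer's autotuning infrastructure and times actual kernel launches rather than Python callables.

```python
import time

def pick_best_tactic(run_gemm, tactics, warmup=3, iters=10):
    """Time each candidate kernel configuration and keep the fastest.

    run_gemm(tactic) launches the FP4 GEMM with the given tactic. Because
    SM100 and SM103 tactics sit in the same candidate list, the
    SM103-specific schedulers are chosen only where they actually win.
    """
    best_tactic, best_time = None, float("inf")
    for tactic in tactics:
        for _ in range(warmup):          # warm up caches / JIT state
            run_gemm(tactic)
        start = time.perf_counter()
        for _ in range(iters):
            run_gemm(tactic)
        elapsed = (time.perf_counter() - start) / iters
        if elapsed < best_time:
            best_tactic, best_time = tactic, elapsed
    return best_tactic
```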

Performance results examples

Llama-3.1-70B, N=8192 K=28672, BF16 vs NVFP4 GEMMs TFLOP/s:

| Batch Size | Torch BF16 | NVFP4 Before | NVFP4 After |
|-----------:|-----------:|-------------:|------------:|
| 8 | 50.418336 | 110.598008 | 124.005817 |
| 16 | 99.350151 | 219.649654 | 260.502226 |
| 32 | 193.884850 | 445.840601 | 519.291059 |
| 64 | 385.790757 | 978.451544 | 1011.614080 |
| 128 | 692.915989 | 2072.797941 | 2076.017433 |
| 256 | 1211.413202 | 3817.738538 | 3868.924511 |
| 512 | 1464.015616 | 5141.532768 | 5503.664311 |
| 1024 | 1600.983748 | 5659.831320 | 6341.013002 |
| 2048 | 1625.639619 | 5991.840134 | 6630.757403 |
| 4096 | 1602.978834 | 6160.806595 | 6898.878407 |
| 8192 | 1691.174722 | 5939.220913 | 6653.915111 |
| 16384 | 1688.224044 | 5926.519222 | 6595.387600 |
| 24576 | 1706.774619 | 5905.301100 | 6617.486211 |
| 32768 | 1678.225402 | 5913.806010 | 6592.762922 |
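
From the table above, the achieved NVFP4-over-BF16 ratio at large batch sizes can be checked directly (values copied from the batch-2048 row):

```python
bf16 = 1625.639619          # Torch BF16 TFLOP/s at batch size 2048
nvfp4_before = 5991.840134  # NVFP4 TFLOP/s before this PR
nvfp4_after = 6630.757403   # NVFP4 TFLOP/s after this PR

speedup_before = nvfp4_before / bf16  # ~3.69x
speedup_after = nvfp4_after / bf16    # ~4.08x
print(f"{speedup_before:.2f}x -> {speedup_after:.2f}x (SM103 spec limit ~6x)")
```

The gap to the quoted 6x specification ratio narrows but, as the Motivation section notes, is not closed.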

Llama-3.1-70B, N=8192 K=8192, BF16 vs NVFP4 GEMMs TFLOP/s:

| Batch Size | Torch BF16 | NVFP4 Before | NVFP4 After |
|-----------:|-----------:|-------------:|------------:|
| 8 | 47.780647 | 124.774241 | 124.760324 |
| 16 | 95.671633 | 249.502165 | 249.131125 |
| 32 | 189.224266 | 497.991489 | 497.277802 |
| 64 | 373.320912 | 993.731451 | 989.446041 |
| 128 | 707.096994 | 1959.258553 | 1970.430179 |
| 256 | 1126.908748 | 4037.558967 | 4159.515720 |
| 512 | 1407.884777 | 5045.981883 | 4958.698763 |
| 1024 | 1491.747576 | 5654.694949 | 5614.133004 |
| 2048 | 1546.322959 | 5898.291400 | 6204.813491 |
| 4096 | 1610.656216 | 6312.498418 | 6605.534723 |
| 8192 | 1623.748353 | 6392.424296 | 6803.660138 |
| 16384 | 1627.947338 | 6438.789701 | 6947.466217 |
| 24576 | 1614.582791 | 6469.307368 | 6991.331576 |
| 32768 | 1617.601164 | 6515.312895 | 7010.746651 |

Summary by CodeRabbit

  • New Features
    • Added support for NVIDIA SM103 GPU architecture in FP4 operations with specialized kernel configurations and optimized launcher implementations, extending hardware compatibility and enabling efficient computation on additional GPU variants.


Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com>
@coderabbitai
Contributor

coderabbitai bot commented Jan 7, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

This PR adds FP4 GEMM support for SM103 GPUs by introducing SM103-specific kernel implementations, tile configurations, type adapters, and a JIT module generator. Changes span CUDA kernels, C++ templates, configuration headers, and Python integration for runtime dispatch.

Changes

SM103 CUDA Implementation
csrc/fp4_gemm_cutlass_sm103.cu, csrc/fp4_gemm_cutlass_sm103.jinja, csrc/fp4_gemm_cutlass.jinja
New SM103 FP4 GEMM kernels with TVM FFI entry points (fp4_gemm, fp4_gemm_tactic_num). Includes workspace sizing, input validation, batch/matrix dimension inference, and tactic-based kernel dispatch. Adds kernel launcher instantiation with cta_m=4, cta_n=1, cta_k=1 parameters.

SM103 Header Templates
include/flashinfer/gemm/fp4_gemm_template_sm103.h, include/flashinfer/gemm/fp4_gemm_cutlass_gemm_configs.h
New SM103-specific FP4 GEMM paths with type adapters (SMTypeAdapter_sm103 for _1SM_sm103, _2SM_sm103), kernel launchers, argument preparation, and workspace management. Introduces cluster-shape and CTA-shape dispatch with robust error handling.

Configuration Extensions
include/flashinfer/gemm/cutlass_gemm_configs.h
Expands tile configurations for SM103: adds TileShape_128x128x768, TileShape_128x192x768, TileShape_128x256x768 to both CutlassTileConfigSM100 and TileShape enums. Adds ClusterShape_4x1x1 cluster shape variant. Updates get_tile_shape, get_tile_shape_name, get_cluster_shape_name, and get_cluster_shape to handle new shapes.

Python Integration
flashinfer/gemm/gemm_base.py, flashinfer/jit/gemm/core.py, flashinfer/jit/gemm/__init__.py
New gen_gemm_sm103_module_cutlass_fp4() generator for JIT compilation of SM103 kernels with BF16/FP16 types and three tile configurations. Updates get_cutlass_fp4_gemm_module to accept (sm_major, sm_minor) and dispatch to SM103 path when sm_minor==3. Extends mm_fp4 to extract and propagate minor compute capability.
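
The compute-capability dispatch summarized above can be sketched as follows. Function and return names here are illustrative; the real logic lives in get_cutlass_fp4_gemm_module.

```python
def select_fp4_gemm_module(sm_major: int, sm_minor: int) -> str:
    """Pick the CUTLASS FP4 GEMM module generator for the detected
    compute capability.

    SM103 (compute capability 10.3) gets the specialized module, which
    also carries the SM100 base configurations so autotuning can choose
    between them; other parts fall back to the plain SM100 path.
    """
    if sm_major == 10 and sm_minor == 3:
        return "gen_gemm_sm103_module_cutlass_fp4"
    return "gen_gemm_sm100_module_cutlass_fp4"
```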

Sequence Diagrams

sequenceDiagram
    participant PythonAPI as Python API<br/>(mm_fp4)
    participant Dispatcher as Compute Capability<br/>Dispatcher
    participant SM103Gen as SM103 Module<br/>Generator
    participant JITCompiler as JIT Compiler
    participant CUDARuntime as CUDA Runtime

    PythonAPI->>Dispatcher: Extract (major, minor) capability
    alt sm_minor == 3
        Dispatcher->>SM103Gen: get_gemm_sm103_module_cutlass_fp4()
        SM103Gen->>SM103Gen: Generate fp4_gemm_cutlass_sm103.cu<br/>with tile configs<br/>(128x128x768, 128x192x768,<br/>128x256x768)
        SM103Gen->>JITCompiler: Render templates &<br/>compile sources
        JITCompiler->>JITCompiler: Build SM103 kernels<br/>with ENABLE_FP4,<br/>ENABLE_BF16 flags
        JITCompiler-->>SM103Gen: Compiled module
        SM103Gen-->>Dispatcher: cutlass_fp4_gemm_runner()
    else sm_minor != 3
        Dispatcher->>Dispatcher: Fallback to SM100 path
    end
    Dispatcher->>PythonAPI: Return kernel runner
    PythonAPI->>CUDARuntime: Dispatch FP4 GEMM<br/>via tactic selection

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • aleozlx
  • djmmoss
  • yongwww
  • cyx-6
  • nvmbreughe
  • bkryu
  • ttyio

Poem

🐰 Whiskers twitch with joy so bright,
SM103's FP4 kernels take flight!
Tile shapes hop from SM100 to new,
H100 hardware, we've got you too!
With BF16 and dispatch so neat,
GEMM ops make the code complete! 🚀

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

Docstring Coverage (⚠️ Warning): Docstring coverage is 16.22%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

Description check (❓ Inconclusive): The PR description provides comprehensive motivation, performance results, and technical details, but the description field is missing required template sections such as the checklist items. Resolution: consider adding the pre-commit checks and tests sections from the template to document completion status of these validation steps.

✅ Passed checks (1 passed)

Title check (✅ Passed): The title clearly and specifically describes the main change, adding SM103-specific schedulers for NVFP4 CUTLASS kernels, directly matching the PR's core objective.




@LopezCastroRoberto LopezCastroRoberto marked this pull request as draft January 7, 2026 17:41
@gemini-code-assist
Contributor

Summary of Changes

Hello @LopezCastroRoberto, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of NVFP4 General Matrix Multiplication (GEMM) operations on NVIDIA's SM103 architecture (B300 GPUs). By introducing specialized CUTLASS kernel schedulers tailored for SM103, the changes aim to unlock greater efficiency and speedup, especially for larger batch sizes in deep learning workloads. The integration ensures that the system intelligently selects the most performant kernel for the given hardware, without replacing existing configurations for other architectures.

Highlights

  • SM103-Specific Scheduler Integration: Introduced new template specializations for SM103 NVFP4 CUTLASS GEMM kernels, leveraging architecture-specific schedulers to optimize performance on B300 GPUs.
  • Performance Improvement: Achieved significant performance gains for NVFP4 GEMMs on SM103, particularly at larger batch sizes, moving closer to the theoretical 6x NVFP4-over-BF16 speedup ratio compared to the previous 4x on B200.
  • Dynamic Configuration Selection: The new SM103-specific configurations are added alongside existing ones, and the optimal configuration is automatically selected through an autotuning process, ensuring the best performance across various workloads.
  • New Kernel Definitions: Added new CUDA source and Jinja templates (fp4_gemm_cutlass_sm103.cu, fp4_gemm_cutlass_sm103.jinja, fp4_gemm_template_sm103.h) to define and instantiate the specialized SM103 kernels.
  • Python Integration: Updated the Python GEMM module loading logic to dynamically select the appropriate CUTLASS FP4 GEMM module (SM100 or SM103) based on the detected GPU compute capability (sm_major and sm_minor).


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
|---------|---------|-------------|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces SM103-specific schedulers for NVFP4 CUTLASS kernels to enhance performance, particularly for larger batch sizes. The changes are well-structured, adding new kernel configurations and the necessary C++ and Python logic to dispatch to them based on the GPU architecture. The overall approach is sound. My review has identified a high-severity issue in the build scripts that could cause file conflicts, along with a couple of medium-severity issues related to misleading documentation and error messages. Addressing these points will improve the correctness and maintainability of the code.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
include/flashinfer/gemm/cutlass_gemm_configs.h (2)

284-301: Missing get_cluster_shape_name() cases for several cluster shapes.

The ClusterShape enum includes ClusterShape_1x4x1, ClusterShape_4x2x1, ClusterShape_2x4x1, and ClusterShape_4x4x1, but get_cluster_shape_name() does not handle these cases, returning "Unknown shape" for them. This may cause confusion during debugging or logging.

Consider adding the missing cases for completeness:

Proposed fix
 static auto get_cluster_shape_name(ClusterShape Shape_MNK) {
   if (Shape_MNK == ClusterShape::ClusterShape_1x1x1) {
     return "1x1x1";
   } else if (Shape_MNK == ClusterShape::ClusterShape_2x1x1) {
     return "2x1x1";
   } else if (Shape_MNK == ClusterShape::ClusterShape_1x2x1) {
     return "1x2x1";
   } else if (Shape_MNK == ClusterShape::ClusterShape_2x2x1) {
     return "2x2x1";
+  } else if (Shape_MNK == ClusterShape::ClusterShape_1x4x1) {
+    return "1x4x1";
+  } else if (Shape_MNK == ClusterShape::ClusterShape_4x2x1) {
+    return "4x2x1";
+  } else if (Shape_MNK == ClusterShape::ClusterShape_2x4x1) {
+    return "2x4x1";
+  } else if (Shape_MNK == ClusterShape::ClusterShape_4x4x1) {
+    return "4x4x1";
   } else if (Shape_MNK == ClusterShape::ClusterShape_1x8x1) {
     return "1x8x1";
   } else if (Shape_MNK == ClusterShape::ClusterShape_8x1x1) {
     return "8x1x1";
   } else if (Shape_MNK == ClusterShape::ClusterShape_4x1x1) {
     return "4x1x1";
   }
   return "Unknown shape";
 }

303-321: Missing get_cluster_shape() cases cause undefined behavior.

The template function get_cluster_shape() does not handle ClusterShape_1x4x1, ClusterShape_4x2x1, ClusterShape_2x4x1, and ClusterShape_4x4x1. For unmatched cases, the function has no return statement, resulting in undefined behavior.

Proposed fix
 template <ClusterShape Shape_MNK>
 constexpr auto get_cluster_shape() {
   using namespace cute;
   if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x1x1) {
     return cute::Shape<_1, _1, _1>{};
   } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_2x1x1) {
     return cute::Shape<_2, _1, _1>{};
   } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x2x1) {
     return cute::Shape<_1, _2, _1>{};
   } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_2x2x1) {
     return cute::Shape<_2, _2, _1>{};
+  } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x4x1) {
+    return cute::Shape<_1, _4, _1>{};
+  } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_4x2x1) {
+    return cute::Shape<_4, _2, _1>{};
+  } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_2x4x1) {
+    return cute::Shape<_2, _4, _1>{};
+  } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_4x4x1) {
+    return cute::Shape<_4, _4, _1>{};
   } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x8x1) {
     return cute::Shape<_1, _8, _1>{};
   } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_8x1x1) {
     return cute::Shape<_8, _1, _1>{};
   } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_4x1x1) {
     return cute::Shape<_4, _1, _1>{};
+  } else {
+    static_assert(sizeof(Shape_MNK) == 0, "Unsupported ClusterShape");
   }
 }
🤖 Fix all issues with AI agents
In @flashinfer/gemm/gemm_base.py:
- Around line 525-531: The docstring for get_gemm_sm103_module_cutlass_fp4() is
incorrect (it references SM100/103/110); update it to accurately describe this
function as returning the SM103 FP4 GEMM module (e.g., "Get the SM103 FP4 GEMM
module.") so it matches the function name and behavior in
gen_gemm_sm103_module_cutlass_fp4() and the _create_cutlass_fp4_gemm_module
call.

In @flashinfer/jit/gemm/cutlass/cutlass_library.py:
- Line 627: Remove the personal annotation "#RLC:" from the
KernelScheduleType.Nvf4TmaWarpSpecialized2SmSm103 mapping in the cutlass mapping
table (the entry that maps to the long cutlass::gemm class name) and add a
corresponding suffix entry to the KernelScheduleSuffixes dictionary for
KernelScheduleType.Nvf4TmaWarpSpecialized2SmSm103 with the value
"_o_vs16_2sm_sm103" so the suffix map includes this schedule type.

In @include/flashinfer/gemm/fp4_gemm_template_sm103.h:
- Around line 270-281: Error messages reference the wrong architecture string;
update the messages constructed after the gemm.initialize (initStatus) and
gemm.run (runStatus) checks to say "sm103" instead of "sm100". Locate the blocks
using gemm.initialize(args, workspace, stream) and gemm.run(args, workspace,
stream, nullptr, /*enablePDL=*/true) and change the human-readable text in the
std::string errMsg concatenations that include "Failed to initialize/run cutlass
FP4 gemm on sm100" to "Failed to initialize/run cutlass FP4 gemm on sm103" while
keeping the rest of the error handling (cutlassGetStatusString, throwing
std::runtime_error) unchanged.
🧹 Nitpick comments (6)
csrc/fp4_gemm_cutlass.jinja (1)

29-29: LGTM! New cluster configuration correctly instantiated.

The new (4,1,1) cluster configuration with _2SM scheduler is correctly instantiated and complements the existing configurations. This aligns with the PR objective to improve SM103 NVFP4 performance.

♻️ Optional: Consider reordering for better organization

For improved readability, you might place the (4,1,1) configuration before (4,2,1) to maintain a consistent ordering pattern (cluster_m=4, then cluster_n in ascending order: 1, 2, 4).

 INSTANTIATE_FP4_GEMM_KERNEL_LAUNCHER({{ type }}, {{ cta_m }}, {{ cta_n }}, {{ cta_k }}, 2, 4, 1, _2SM)
+INSTANTIATE_FP4_GEMM_KERNEL_LAUNCHER({{ type }}, {{ cta_m }}, {{ cta_n }}, {{ cta_k }}, 4, 1, 1, _2SM)
 INSTANTIATE_FP4_GEMM_KERNEL_LAUNCHER({{ type }}, {{ cta_m }}, {{ cta_n }}, {{ cta_k }}, 4, 2, 1, _2SM)
 INSTANTIATE_FP4_GEMM_KERNEL_LAUNCHER({{ type }}, {{ cta_m }}, {{ cta_n }}, {{ cta_k }}, 4, 4, 1, _2SM)
-INSTANTIATE_FP4_GEMM_KERNEL_LAUNCHER({{ type }}, {{ cta_m }}, {{ cta_n }}, {{ cta_k }}, 4, 1, 1, _2SM)

This is purely cosmetic and doesn't affect functionality.

include/flashinfer/gemm/fp4_gemm_cutlass_template_sm103.h (3)

17-18: Include guard name may conflict with other headers.

The include guard FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_H_ is generic and doesn't include "SM103". If there's another fp4_gemm_cutlass_template.h (e.g., for SM100), this could cause include guard collisions.

Suggested fix
-#ifndef FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_H_
-#define FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_H_
+#ifndef FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_SM103_H_
+#define FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_SM103_H_

And at the end of the file:

-#endif  // FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_H_
+#endif  // FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_SM103_H_

357-364: Weak hash function with high collision probability.

The hash function XORs all four values directly without bit shifting, which leads to poor distribution. For example, (1,2,3,4) and (2,1,4,3) would produce the same hash.

Proposed fix using a better hash combination
   struct MNKHash {
     size_t operator()(const MNK& mnk) const {
       auto h1 = std::hash<int>{}(std::get<0>(mnk));
       auto h2 = std::hash<int>{}(std::get<1>(mnk));
       auto h3 = std::hash<int>{}(std::get<2>(mnk));
       auto h4 = std::hash<int>{}(std::get<3>(mnk));
-      return h1 ^ h2 ^ h3 ^ h4;
+      // Combine hashes with bit shifting to reduce collisions
+      size_t seed = h1;
+      seed ^= h2 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
+      seed ^= h3 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
+      seed ^= h4 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
+      return seed;
     }
   };

287-329: Review the getConfigs() tactic ordering.

The best_tactics_index list {22, 20, 29, 4, 18} references specific indices in candidateConfigs. This assumes the configuration list order is stable. Any changes to tilesSm100 or clusterShapes vectors will invalidate these indices, leading to incorrect tactic prioritization.

Consider using a more robust approach, such as storing the actual configuration tuples rather than indices.
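
As a sketch of the suggested alternative, preferred tactics could be keyed by their configuration parameters rather than by list positions. This is a hypothetical Python analogue of the C++ config list, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GemmConfig:
    tile: str      # e.g. "128x256x768"
    cluster: str   # e.g. "4x1x1"

def order_configs(candidates, preferred):
    """Return candidates with the preferred configs first, in preference
    order, followed by the rest.

    Unlike hard-coded indices such as {22, 20, 29, 4, 18}, this stays
    correct if the candidate list is reordered or extended: a preferred
    config that is absent from candidates is simply skipped.
    """
    candidate_set = set(candidates)
    head = [c for c in preferred if c in candidate_set]
    head_set = set(head)
    tail = [c for c in candidates if c not in head_set]
    return head + tail
```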

flashinfer/jit/gemm/cutlass/generate_kernels.py (1)

22-22: Unused import.

The logger is imported but does not appear to be used anywhere in this file.

Proposed fix
-from ...core import logger
csrc/fp4_gemm_cutlass_sm103.cu (1)

103-103: Consider removing or documenting the unused variable.

mat2_k_scale is set to 1 and used in dimension checks, but its purpose isn't clear. If it's a placeholder for future scaling functionality, a comment explaining this would help. If it's truly unused, consider removing it.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df8015c and dd8061a.

📒 Files selected for processing (11)
  • csrc/fp4_gemm_cutlass.jinja
  • csrc/fp4_gemm_cutlass_sm103.cu
  • csrc/fp4_gemm_cutlass_sm103.jinja
  • flashinfer/gemm/gemm_base.py
  • flashinfer/jit/gemm/__init__.py
  • flashinfer/jit/gemm/core.py
  • flashinfer/jit/gemm/cutlass/cutlass_library.py
  • flashinfer/jit/gemm/cutlass/generate_kernels.py
  • include/flashinfer/gemm/cutlass_gemm_configs.h
  • include/flashinfer/gemm/fp4_gemm_cutlass_template_sm103.h
  • include/flashinfer/gemm/fp4_gemm_template_sm103.h
🧰 Additional context used
📓 Path-based instructions (4)
flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

  • flashinfer/jit/gemm/cutlass/generate_kernels.py
  • flashinfer/jit/gemm/core.py
  • flashinfer/jit/gemm/cutlass/cutlass_library.py
  • flashinfer/gemm/gemm_base.py
  • flashinfer/jit/gemm/__init__.py
flashinfer/jit/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/jit/**/*.py: JIT module generators in flashinfer/jit/ must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec
Use gen_jit_spec() function to return a properly configured JitSpec from module generators with appropriate sources and extra_cuda_cflags
Specify supported_major_versions in JitSpec to restrict kernel compilation to supported GPU architectures (e.g., SM versions 9, 10, 11, 12 for Hopper/newer)

Files:

  • flashinfer/jit/gemm/cutlass/generate_kernels.py
  • flashinfer/jit/gemm/core.py
  • flashinfer/jit/gemm/cutlass/cutlass_library.py
  • flashinfer/jit/gemm/__init__.py
csrc/**/*.jinja

📄 CodeRabbit inference engine (CLAUDE.md)

csrc/**/*.jinja: Use dispatch macros (e.g., DISPATCH_DTYPE, DISPATCH_BLOCK_SIZE) in .jinja template files to handle combinatorial parameter spaces in CUDA kernels
Use DISPATCH_DTYPE, DISPATCH_BLOCK_SIZE, and similar macros to reduce code duplication when handling multiple dtype and template parameter combinations

Files:

  • csrc/fp4_gemm_cutlass.jinja
  • csrc/fp4_gemm_cutlass_sm103.jinja
csrc/**/*.cu

📄 CodeRabbit inference engine (CLAUDE.md)

Framework bindings and PyTorch tensor handling should be implemented in csrc/ via TVM-FFI, not in include/ headers

Files:

  • csrc/fp4_gemm_cutlass_sm103.cu
🧠 Learnings (12)
📓 Common learnings
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : Specify `supported_major_versions` in JitSpec to restrict kernel compilation to supported GPU architectures (e.g., SM versions 9, 10, 11, 12 for Hopper/newer)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.

Applied to files:

  • csrc/fp4_gemm_cutlass.jinja
  • csrc/fp4_gemm_cutlass_sm103.cu
  • csrc/fp4_gemm_cutlass_sm103.jinja
  • include/flashinfer/gemm/fp4_gemm_template_sm103.h
  • include/flashinfer/gemm/fp4_gemm_cutlass_template_sm103.h
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Use `FLASHINFER_CUDA_ARCH_LIST` environment variable to specify target GPU architectures (e.g., '8.0 9.0a') and `FLASHINFER_NVCC_THREADS` to control parallel compilation threads

Applied to files:

  • csrc/fp4_gemm_cutlass.jinja
  • csrc/fp4_gemm_cutlass_sm103.jinja
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : Use `gen_jit_spec()` function to return a properly configured JitSpec from module generators with appropriate `sources` and `extra_cuda_cflags`

Applied to files:

  • flashinfer/jit/gemm/core.py
  • flashinfer/gemm/gemm_base.py
  • flashinfer/jit/gemm/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : JIT module generators in `flashinfer/jit/` must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec

Applied to files:

  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : Specify `supported_major_versions` in JitSpec to restrict kernel compilation to supported GPU architectures (e.g., SM versions 9, 10, 11, 12 for Hopper/newer)

Applied to files:

  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/aot.py : Register new operations in `flashinfer/aot.py` by calling the `gen_*_module()` function for AOT (Ahead-Of-Time) pre-compilation support

Applied to files:

  • flashinfer/jit/gemm/core.py
  • flashinfer/jit/gemm/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed

Applied to files:

  • flashinfer/jit/gemm/core.py
  • csrc/fp4_gemm_cutlass_sm103.jinja
  • flashinfer/jit/gemm/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to csrc/**/*_jit_binding.cu : Create TVM-FFI bindings in files matching the pattern `csrc/*_jit_binding.cu` using the `TVM_FFI_DLL_EXPORT_TYPED_FUNC(name, func)` macro to expose C++ functions

Applied to files:

  • csrc/fp4_gemm_cutlass_sm103.cu
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to csrc/**/*.cu : Framework bindings and PyTorch tensor handling should be implemented in `csrc/` via TVM-FFI, not in `include/` headers

Applied to files:

  • csrc/fp4_gemm_cutlass_sm103.cu
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Consult the PTX ISA documentation (https://docs.nvidia.com/cuda/parallel-thread-execution/) for low-level instruction details and new GPU architecture features when writing inline PTX assembly

Applied to files:

  • csrc/fp4_gemm_cutlass_sm103.jinja
  • include/flashinfer/gemm/fp4_gemm_template_sm103.h
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/__init__.py : Export new operations in `flashinfer/__init__.py` to make them available as public API

Applied to files:

  • flashinfer/jit/gemm/__init__.py
🧬 Code graph analysis (4)
flashinfer/jit/gemm/core.py (2)
flashinfer/jit/core.py (2)
  • JitSpec (216-397)
  • gen_jit_spec (400-466)
flashinfer/compilation_context.py (1)
  • get_nvcc_flags_list (50-68)
flashinfer/jit/gemm/__init__.py (1)
flashinfer/jit/gemm/core.py (1)
  • gen_gemm_sm103_module_cutlass_fp4 (97-165)
include/flashinfer/gemm/fp4_gemm_template_sm103.h (2)
include/flashinfer/gemm/fp4_gemm_cutlass_template_sm103.h (1)
  • gemm (42-381)
include/flashinfer/gemm/cutlass_gemm_configs.h (1)
  • ClusterShape (270-412)
include/flashinfer/gemm/fp4_gemm_cutlass_template_sm103.h (2)
include/flashinfer/gemm/fp4_gemm_template_sm103.h (4)
  • gemm (38-288)
  • void (151-283)
  • _1SM_sm103 (55-60)
  • _2SM_sm103 (63-68)
include/flashinfer/gemm/fp4_gemm_cutlass.h (1)
  • FP4GemmType (59-88)
🪛 Ruff (0.14.10)
flashinfer/jit/gemm/core.py

157-161: Consider [*nvcc_flags, "-DENABLE_BF16", "-DENABLE_FP4"] instead of concatenation

Replace with [*nvcc_flags, "-DENABLE_BF16", "-DENABLE_FP4"]

(RUF005)
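
For reference, RUF005 suggests Python's iterable-unpacking list literal in place of concatenation; both forms produce the same list (flag values here are illustrative, not the actual nvcc_flags contents):

```python
nvcc_flags = ["-O3", "--use_fast_math"]

# Concatenation (what RUF005 flags):
flags_concat = nvcc_flags + ["-DENABLE_BF16", "-DENABLE_FP4"]

# Unpacking literal (what Ruff suggests):
flags_unpack = [*nvcc_flags, "-DENABLE_BF16", "-DENABLE_FP4"]

assert flags_concat == flags_unpack
```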

🔇 Additional comments (17)
flashinfer/jit/gemm/__init__.py (1)

20-20: LGTM!

The new gen_gemm_sm103_module_cutlass_fp4 symbol is correctly imported and exported, following the established pattern for other SM-specific module generators.

Also applies to: 36-36

include/flashinfer/gemm/cutlass_gemm_configs.h (1)

136-140: LGTM!

The SM103-specific tile configurations (128x128x768, 128x192x768, 128x256x768) are correctly added to the CutlassTileConfigSM100 enum, TileShape enum, and the corresponding get_tile_shape() and get_tile_shape_name() functions.

Also applies to: 196-200, 228-233, 260-265

include/flashinfer/gemm/fp4_gemm_cutlass_template_sm103.h (1)

45-112: Missing cluster shape cases in dispatch functions.

Both dispatchNVFP4xNVFP4GemmClusterShapeSm100 and dispatchNVFP4xNVFP4GemmClusterShapeSm103 handle most cluster shapes but miss ClusterShape::ClusterShape_1x8x1 and ClusterShape::ClusterShape_8x1x1. These are present in the ClusterShape enum and used in getConfigs(). If these shapes are selected during autotuning, the dispatch will throw a runtime error.

Please verify whether ClusterShape_1x8x1 and ClusterShape_8x1x1 should be supported for SM103 FP4 GEMM, or if they should be excluded from the config list at lines 300-306.

Also applies to: 114-181
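The failure mode described above can be modeled with a toy dispatcher. The shape sets below are illustrative, not the actual contents of the SM103 switch; the point is that any shape present in the config list but absent from the dispatch raises at runtime instead of launching a kernel:

```python
# Hypothetical mirror of the dispatch gap: shapes handled by the switch
# vs. shapes offered to the autotuner via getConfigs().
HANDLED_SHAPES = {"1x1x1", "2x1x1", "1x2x1", "2x2x1", "1x4x1", "4x1x1"}
CONFIG_SHAPES = HANDLED_SHAPES | {"1x8x1", "8x1x1"}  # the two flagged shapes

def dispatch_cluster_shape(shape: str) -> str:
    # Mirrors the switch's default case: unhandled shapes throw.
    if shape not in HANDLED_SHAPES:
        raise RuntimeError(f"unsupported cluster shape {shape}")
    return f"launcher_{shape}"
```

If the autotuner ever selects `1x8x1` or `8x1x1`, the call fails rather than falling back, which is why the review asks for the two lists to be reconciled.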

csrc/fp4_gemm_cutlass_sm103.jinja (1)

1-32: LGTM!

The Jinja template correctly instantiates SM103 FP4 Ultra GEMM kernel launchers for the supported cluster shape configurations, with appropriate SM type suffixes (_1SM_sm103, _2SM_sm103).

flashinfer/jit/gemm/core.py (3)

97-99: Shared generation directory may cause confusion.

The gen_directory is set to "gen_gemm_sm100_cutlass_fp4", same as gen_gemm_sm100_module_cutlass_fp4(). While this may be intentional (the SM103 module includes SM100 configurations), it could lead to file collisions or confusion during incremental builds. Consider using a distinct directory like "gen_gemm_sm103_cutlass_fp4".


127-149: SM103 module includes SM100 kernel configurations.

The SM103 module generator also renders kernels using fp4_gemm_cutlass.jinja with SM100 tile configurations. This creates a superset module containing both SM100 and SM103 kernels.

Please confirm this is the intended design - the SM103 module should support both SM100 base configurations and SM103-specific optimized configurations for autotuning to select the best one.


151-165: LGTM - follows established JIT module pattern.

The function correctly:

  • Specifies supported_major_versions=[10, 11, 12] per coding guidelines
  • Uses gen_jit_spec() to return a properly configured JitSpec
  • Includes appropriate CUDA flags for BF16 and FP4 support

Based on learnings, the supported_major_versions specification aligns with JIT module conventions.

csrc/fp4_gemm_cutlass_sm103.cu (4)

1-43: LGTM - File structure and template instantiations are correct.

The file correctly implements TVM-FFI bindings for SM103 FP4 GEMM as per the coding guidelines. Template instantiations for both __nv_bfloat16 and half types are properly declared.


49-58: LGTM - Config retrieval with proper bounds checking.

The static config caching and bounds validation are correctly implemented.


176-193: LGTM - Public API functions and TVM FFI exports are correct.

The fp4_gemm wrapper and fp4_gemm_tactic_num functions are cleanly implemented. The TVM FFI exports follow the correct pattern.


78-84: Verify if ffi::Tensor reference counting prevents premature deallocation of temporary workspace.

The async GEMM kernel receives a pointer to new_workspace, which goes out of scope before the kernel completes. This is safe only if TVM's ffi::Tensor uses reference counting or environment-managed memory that extends the tensor's lifetime beyond the local scope. Verify against TVM's FFI documentation or implementation to confirm the memory lifetime guarantees, or add explicit stream synchronization as a safeguard.

flashinfer/gemm/gemm_base.py (2)

542-554: LGTM - SM103 routing logic is correct.

The routing correctly identifies SM103 (major=10, minor=3) and routes to the specialized module. Other SM10x/SM11x variants correctly fall back to the SM100 path.
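The routing described above can be sketched as a small selector. Function and module names here are hypothetical stand-ins for the actual FlashInfer getters, but the branch structure matches the review: compute capability 10.3 gets the dedicated SM103 module, other SM10x/SM11x fall back to the SM100/110 path, and SM12x uses its own path:

```python
def select_fp4_gemm_module(sm_major: int, sm_minor: int) -> str:
    # SM103 (compute capability 10.3) routes to the specialized module.
    if sm_major == 10 and sm_minor == 3:
        return "fp4_gemm_cutlass_sm103"
    # Other SM10x/SM11x variants fall back to the SM100/110 path.
    if sm_major in (10, 11):
        return "fp4_gemm_cutlass_sm100"
    # SM12x variants use the SM120 path.
    if sm_major == 12:
        return "fp4_gemm_cutlass_sm120"
    raise ValueError(f"unsupported compute capability {sm_major}.{sm_minor}")
```

Because the SM103 module is a superset (SM100 base configs plus SM103-specific ones), this routing keeps autotuning free to pick either family on 10.3 hardware.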


2288-2295: LGTM - Compute capability extraction updated correctly.

The change properly extracts both major and minor versions and passes them to enable SM103-specific module selection.

include/flashinfer/gemm/fp4_gemm_template_sm103.h (4)

1-46: LGTM - Header structure and type definitions are correct.

The header guard, includes, namespace structure, and SafeBF16_sm103 definition follow established patterns. The conditional BF16 handling is appropriate.


47-68: LGTM - SM type adapters correctly specialized for SM103.

The 1SM and 2SM configurations properly define their respective scales, thread shapes, and SM103-specific schedule types.


148-163: LGTM - Architecture guard correctly enforces SM103 execution.

The Sm103Only wrapper provides a safety mechanism to prevent execution on incompatible architectures. The use of is_match_v<103> and __trap() follows established patterns for architecture enforcement.


254-257: LGTM - Workspace size query pattern is correct.

The null-pointer check pattern for querying workspace size without running the kernel is a standard CUTLASS convention.

@LopezCastroRoberto LopezCastroRoberto changed the title from "[Perf][Feature] Add SM103-specific schedulers for B300 NVFP4 CUTLASS kernels" to "[Perf][Feature] Add SM103-specific schedulers for NVFP4 CUTLASS kernels" Jan 7, 2026
Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com>
@LopezCastroRoberto LopezCastroRoberto marked this pull request as ready for review January 7, 2026 19:29

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @flashinfer/gemm/gemm_base.py:
- Line 519: Update the docstring for the function that currently reads "Get the
SM100/103/110 FP4 GEMM module." to reflect that SM103 is now separate; change
the text to "Get the SM100/110 FP4 GEMM module." so it matches the dedicated
get_gemm_sm103_module_cutlass_fp4() handler and avoids confusion when locating
get_gemm_sm100_110_module_cutlass_fp4().
🧹 Nitpick comments (1)
flashinfer/jit/gemm/core.py (1)

157-161: Consider using spread operator for list concatenation.

The static analysis tool suggests using the spread operator for cleaner list concatenation.

♻️ Suggested refactor
     return gen_jit_spec(
         "fp4_gemm_cutlass_sm103",
         source_paths,
-        extra_cuda_cflags=nvcc_flags
-        + [
-            "-DENABLE_BF16",
-            "-DENABLE_FP4",
-        ],
+        extra_cuda_cflags=[
+            *nvcc_flags,
+            "-DENABLE_BF16",
+            "-DENABLE_FP4",
+        ],
         extra_cflags=[
             "-DFAST_BUILD",
         ],
     )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd8061a and 8cd1d62.

📒 Files selected for processing (3)
  • flashinfer/gemm/gemm_base.py
  • flashinfer/jit/gemm/core.py
  • include/flashinfer/gemm/fp4_gemm_template_sm103.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • include/flashinfer/gemm/fp4_gemm_template_sm103.h
🧰 Additional context used
📓 Path-based instructions (2)
flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

  • flashinfer/gemm/gemm_base.py
  • flashinfer/jit/gemm/core.py
flashinfer/jit/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/jit/**/*.py: JIT module generators in flashinfer/jit/ must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec
Use gen_jit_spec() function to return a properly configured JitSpec from module generators with appropriate sources and extra_cuda_cflags
Specify supported_major_versions in JitSpec to restrict kernel compilation to supported GPU architectures (e.g., SM versions 9, 10, 11, 12 for Hopper/newer)

Files:

  • flashinfer/jit/gemm/core.py
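The generator pattern named above (compute URI → create directory → render template → return spec) can be sketched end to end. Everything below is illustrative scaffolding, not the real `gen_jit_spec()` API: it just shows the shape of the flow a generator like `gen_gemm_sm103_module_cutlass_fp4` follows.

```python
import tempfile
from pathlib import Path

def gen_fake_fp4_module(arch: str) -> dict:
    # 1. Compute a stable URI for this module variant.
    uri = f"fp4_gemm_cutlass_sm{arch}"
    # 2. Create the generation directory (a temp dir stands in here).
    gen_dir = Path(tempfile.mkdtemp()) / uri
    gen_dir.mkdir(parents=True, exist_ok=True)
    # 3. "Render" a template into a source file.
    source = gen_dir / "kernel.cu"
    source.write_text(f"// instantiations rendered for sm{arch}\n")
    # 4. Return a spec-like record (the real code returns a JitSpec).
    return {"uri": uri, "sources": [str(source)]}
```

In the real generator, step 3 renders `fp4_gemm_cutlass_sm103.jinja` (plus the SM100 template) and step 4 passes `supported_major_versions` so compilation is restricted to the right architectures.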
🧠 Learnings (6)
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : Use `gen_jit_spec()` function to return a properly configured JitSpec from module generators with appropriate `sources` and `extra_cuda_cflags`

Applied to files:

  • flashinfer/gemm/gemm_base.py
  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : JIT module generators in `flashinfer/jit/` must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec

Applied to files:

  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : Specify `supported_major_versions` in JitSpec to restrict kernel compilation to supported GPU architectures (e.g., SM versions 9, 10, 11, 12 for Hopper/newer)

Applied to files:

  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/aot.py : Register new operations in `flashinfer/aot.py` by calling the `gen_*_module()` function for AOT (Ahead-Of-Time) pre-compilation support

Applied to files:

  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed

Applied to files:

  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.

Applied to files:

  • flashinfer/jit/gemm/core.py
🪛 Ruff (0.14.10)
flashinfer/jit/gemm/core.py

157-161: Consider [*nvcc_flags, "-DENABLE_BF16", "-DENABLE_FP4"] instead of concatenation

Replace with [*nvcc_flags, "-DENABLE_BF16", "-DENABLE_FP4"]

(RUF005)

🔇 Additional comments (4)
flashinfer/jit/gemm/core.py (1)

97-166: LGTM! SM103 module generator correctly implements dual-configuration strategy.

The implementation properly generates SM103-specific optimizations alongside fallback configurations by using both fp4_gemm_cutlass_sm103.jinja (with larger K-dimension tiles: 768) and fp4_gemm_cutlass.jinja (standard tiles). This approach aligns with the PR objectives of providing SM103-specific schedulers while maintaining compatibility.

The separate directory gen_gemm_sm103_cutlass_fp4 correctly addresses the previous review concern about file collisions.

flashinfer/gemm/gemm_base.py (3)

525-531: LGTM! SM103 module accessor correctly implemented.

The function properly builds and loads the SM103-specific FP4 GEMM module with correct docstring and caching. Implementation follows the established pattern from SM100 and SM120 variants.


542-554: LGTM! SM103 routing logic correctly implemented.

The updated function properly routes to the SM103-specific module when sm_minor == 3 (compute capability 10.3), while maintaining backward compatibility for SM100/110. The conditional logic clearly separates the three variants (SM10x with/without SM103, SM12x).


2288-2295: LGTM! Compute capability extraction correctly updated.

The code now properly extracts both major and minor compute capability values and passes them to the module selector, enabling correct routing to SM103-specific kernels when minor == 3.

Collaborator

@IwakuraRein IwakuraRein left a comment


LGTM. Thanks for the contributions!

@aleozlx
Collaborator

aleozlx commented Jan 17, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !247 has been created, and the CI pipeline #41923518 is currently running. I'll report back once the pipeline job completes.

@aleozlx
Collaborator

aleozlx commented Jan 17, 2026

Thanks! I'll also review but today might be hard

@flashinfer-bot
Collaborator

[FAILED] Pipeline #41923518: 14/20 passed

Collaborator

@aleozlx aleozlx left a comment


LGTM as well, but I want to give some time for the other comments to be resolved.

@LopezCastroRoberto
Contributor Author

Any updates regarding this PR? I saw v0.6.2 was released last week, but the changes here were not merged. Thanks!

cc: @yzh119 @aleozlx

@bkryu
Collaborator

bkryu commented Feb 2, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !247 has been updated with latest changes, and the CI pipeline #43135280 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[CANCELING] Pipeline #43135280: canceled

Collaborator

@bkryu bkryu left a comment


LGTM. Unit tests are also coming back as passing.

@bkryu bkryu merged commit c7761ad into flashinfer-ai:main Feb 3, 2026
49 of 58 checks passed
raayandhar pushed a commit to raayandhar/flashinfer that referenced this pull request Feb 5, 2026
…ls (flashinfer-ai#2303)

## Summary

This PR adds new template specializations for SM103 NVFP4 CUTLASS GEMM
kernels using architecture-specific tile shapes, cluster shapes, and
schedulers.

## Motivation

SM103 specifications show a higher NVFP4-over-BF16 speedup ratio than
B200 (6× vs. 4×), but current kernels remain far from this limit.
This PR introduces SM103-optimized templates to improve the achieved
performance on this architecture.

The performance gains are more pronounced at larger batch sizes, while
the previous SM100 configurations remain preferable in other cases.
For this reason, SM103-specific configurations were added alongside the
existing ones rather than replacing them, and the optimal configuration
is automatically selected as part of the autotuning process.

## Performance results examples

Llama-3.1-70B, N=8192 K=28672, BF16 vs NVFP4 GEMMs TFLOP/s:

| Batch Size | Torch BF16 | NVFP4 Before | NVFP4 After |
|-----------:|-----------:|-------------:|------------:|
| 8          | 50.418336  | 110.598008   | 124.005817  |
| 16         | 99.350151  | 219.649654   | 260.502226  |
| 32         | 193.884850 | 445.840601   | 519.291059  |
| 64         | 385.790757 | 978.451544   | 1011.614080 |
| 128        | 692.915989 | 2072.797941  | 2076.017433 |
| 256        | 1211.413202| 3817.738538  | 3868.924511 |
| 512        | 1464.015616| 5141.532768  | 5503.664311 |
| 1024       | 1600.983748| 5659.831320  | 6341.013002 |
| 2048       | 1625.639619| 5991.840134  | 6630.757403 |
| 4096       | 1602.978834| 6160.806595  | 6898.878407 |
| 8192       | 1691.174722| 5939.220913  | 6653.915111 |
| 16384      | 1688.224044| 5926.519222  | 6595.387600 |
| 24576      | 1706.774619| 5905.301100  | 6617.486211 |
| 32768      | 1678.225402| 5913.806010  | 6592.762922 |

---

Llama-3.1-70B, N=8192 K=8192, BF16 vs NVFP4 GEMMs TFLOP/s:

| Batch Size | Torch BF16 | NVFP4 Before | NVFP4 After |
|-----------:|-----------:|-------------:|------------:|
| 8          | 47.780647  | 124.774241   | 124.760324  |
| 16         | 95.671633  | 249.502165   | 249.131125  |
| 32         | 189.224266 | 497.991489   | 497.277802  |
| 64         | 373.320912 | 993.731451   | 989.446041  |
| 128        | 707.096994 | 1959.258553  | 1970.430179 |
| 256        | 1126.908748| 4037.558967  | 4159.515720 |
| 512        | 1407.884777| 5045.981883  | 4958.698763 |
| 1024       | 1491.747576| 5654.694949  | 5614.133004 |
| 2048       | 1546.322959| 5898.291400  | 6204.813491 |
| 4096       | 1610.656216| 6312.498418  | 6605.534723 |
| 8192       | 1623.748353| 6392.424296  | 6803.660138 |
| 16384      | 1627.947338| 6438.789701  | 6947.466217 |
| 24576      | 1614.582791| 6469.307368  | 6991.331576 |
| 32768      | 1617.601164| 6515.312895  | 7010.746651 |

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added support for NVIDIA SM103 GPU architecture in FP4 operations with
specialized kernel configurations and optimized launcher
implementations, extending hardware compatibility and enabling efficient
computation on additional GPU variants.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com>
Co-authored-by: yzh119 <zihaoy@nvidia.com>
