
[Perf][Feature] Add SM103-specific schedulers for NVFP4 CUTLASS kernels#2303

Merged
bkryu merged 4 commits into flashinfer-ai:main from LopezCastroRoberto:feature/sm103_specific_schedulers
Feb 3, 2026

Conversation

@LopezCastroRoberto
Contributor

@LopezCastroRoberto LopezCastroRoberto commented Jan 7, 2026

Summary

This PR adds new template specializations for SM103 NVFP4 CUTLASS GEMM kernels using architecture-specific tile shapes, cluster shapes, and schedulers.

Motivation

SM103 specifications show a higher NVFP4-over-BF16 speedup ratio than B200 (6× vs. 4×), but current kernels remain far from this limit.
This PR introduces SM103-optimized templates to improve the achieved performance on this architecture.

The performance gains are more pronounced at larger batch sizes, while the previous SM100 configurations remain preferable in other cases.
For this reason, SM103-specific configurations were added alongside the existing ones rather than replacing them, and the optimal configuration is automatically selected as part of the autotuning process.
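
The autotuning selection described above can be sketched as a simple time-and-pick loop. This is a minimal illustration with hypothetical names; the real tuner lives in flashinfer's autotuning infrastructure and times actual kernel launches rather than Python callables.

```python
import time

def pick_best_tactic(run_gemm, tactics, warmup=3, iters=10):
    """Time each candidate kernel configuration and keep the fastest.

    run_gemm(tactic) launches the FP4 GEMM with the given tactic. Because
    SM100 and SM103 tactics sit in the same candidate list, the
    SM103-specific schedulers are chosen only where they actually win.
    """
    best_tactic, best_time = None, float("inf")
    for tactic in tactics:
        for _ in range(warmup):          # warm up caches / JIT state
            run_gemm(tactic)
        start = time.perf_counter()
        for _ in range(iters):
            run_gemm(tactic)
        elapsed = (time.perf_counter() - start) / iters
        if elapsed < best_time:
            best_tactic, best_time = tactic, elapsed
    return best_tactic
```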

Performance results examples

Llama-3.1-70B, N=8192 K=28672, BF16 vs NVFP4 GEMMs TFLOP/s:

| Batch Size | Torch BF16 | NVFP4 Before | NVFP4 After |
|-----------:|-----------:|-------------:|------------:|
| 8 | 50.418336 | 110.598008 | 124.005817 |
| 16 | 99.350151 | 219.649654 | 260.502226 |
| 32 | 193.884850 | 445.840601 | 519.291059 |
| 64 | 385.790757 | 978.451544 | 1011.614080 |
| 128 | 692.915989 | 2072.797941 | 2076.017433 |
| 256 | 1211.413202 | 3817.738538 | 3868.924511 |
| 512 | 1464.015616 | 5141.532768 | 5503.664311 |
| 1024 | 1600.983748 | 5659.831320 | 6341.013002 |
| 2048 | 1625.639619 | 5991.840134 | 6630.757403 |
| 4096 | 1602.978834 | 6160.806595 | 6898.878407 |
| 8192 | 1691.174722 | 5939.220913 | 6653.915111 |
| 16384 | 1688.224044 | 5926.519222 | 6595.387600 |
| 24576 | 1706.774619 | 5905.301100 | 6617.486211 |
| 32768 | 1678.225402 | 5913.806010 | 6592.762922 |
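
From the table above, the achieved NVFP4-over-BF16 ratio at large batch sizes can be checked directly (values copied from the batch-2048 row):

```python
bf16 = 1625.639619          # Torch BF16 TFLOP/s at batch size 2048
nvfp4_before = 5991.840134  # NVFP4 TFLOP/s before this PR
nvfp4_after = 6630.757403   # NVFP4 TFLOP/s after this PR

speedup_before = nvfp4_before / bf16  # ~3.69x
speedup_after = nvfp4_after / bf16    # ~4.08x
print(f"{speedup_before:.2f}x -> {speedup_after:.2f}x (SM103 spec limit ~6x)")
```

The gap to the quoted 6x specification ratio narrows but, as the Motivation section notes, is not closed.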

Llama-3.1-70B, N=8192 K=8192, BF16 vs NVFP4 GEMMs TFLOP/s:

| Batch Size | Torch BF16 | NVFP4 Before | NVFP4 After |
|-----------:|-----------:|-------------:|------------:|
| 8 | 47.780647 | 124.774241 | 124.760324 |
| 16 | 95.671633 | 249.502165 | 249.131125 |
| 32 | 189.224266 | 497.991489 | 497.277802 |
| 64 | 373.320912 | 993.731451 | 989.446041 |
| 128 | 707.096994 | 1959.258553 | 1970.430179 |
| 256 | 1126.908748 | 4037.558967 | 4159.515720 |
| 512 | 1407.884777 | 5045.981883 | 4958.698763 |
| 1024 | 1491.747576 | 5654.694949 | 5614.133004 |
| 2048 | 1546.322959 | 5898.291400 | 6204.813491 |
| 4096 | 1610.656216 | 6312.498418 | 6605.534723 |
| 8192 | 1623.748353 | 6392.424296 | 6803.660138 |
| 16384 | 1627.947338 | 6438.789701 | 6947.466217 |
| 24576 | 1614.582791 | 6469.307368 | 6991.331576 |
| 32768 | 1617.601164 | 6515.312895 | 7010.746651 |

Summary by CodeRabbit

  • New Features
    • Added support for NVIDIA SM103 GPU architecture in FP4 operations with specialized kernel configurations and optimized launcher implementations, extending hardware compatibility and enabling efficient computation on additional GPU variants.


Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com>
@coderabbitai
Contributor

coderabbitai bot commented Jan 7, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

This PR adds FP4 GEMM support for SM103 GPUs by introducing SM103-specific kernel implementations, tile configurations, type adapters, and a JIT module generator. Changes span CUDA kernels, C++ templates, configuration headers, and Python integration for runtime dispatch.

Changes

SM103 CUDA Implementation
csrc/fp4_gemm_cutlass_sm103.cu, csrc/fp4_gemm_cutlass_sm103.jinja, csrc/fp4_gemm_cutlass.jinja
New SM103 FP4 GEMM kernels with TVM FFI entry points (fp4_gemm, fp4_gemm_tactic_num). Includes workspace sizing, input validation, batch/matrix dimension inference, and tactic-based kernel dispatch. Adds kernel launcher instantiation with cta_m=4, cta_n=1, cta_k=1 parameters.

SM103 Header Templates
include/flashinfer/gemm/fp4_gemm_template_sm103.h, include/flashinfer/gemm/fp4_gemm_cutlass_gemm_configs.h
New SM103-specific FP4 GEMM paths with type adapters (SMTypeAdapter_sm103 for _1SM_sm103, _2SM_sm103), kernel launchers, argument preparation, and workspace management. Introduces cluster-shape and CTA-shape dispatch with robust error handling.

Configuration Extensions
include/flashinfer/gemm/cutlass_gemm_configs.h
Expands tile configurations for SM103: adds TileShape_128x128x768, TileShape_128x192x768, TileShape_128x256x768 to both CutlassTileConfigSM100 and TileShape enums. Adds ClusterShape_4x1x1 cluster shape variant. Updates get_tile_shape, get_tile_shape_name, get_cluster_shape_name, and get_cluster_shape to handle new shapes.

Python Integration
flashinfer/gemm/gemm_base.py, flashinfer/jit/gemm/core.py, flashinfer/jit/gemm/__init__.py
New gen_gemm_sm103_module_cutlass_fp4() generator for JIT compilation of SM103 kernels with BF16/FP16 types and three tile configurations. Updates get_cutlass_fp4_gemm_module to accept (sm_major, sm_minor) and dispatch to SM103 path when sm_minor==3. Extends mm_fp4 to extract and propagate minor compute capability.
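
The compute-capability dispatch summarized above can be sketched as follows. Function and return names here are illustrative; the real logic lives in get_cutlass_fp4_gemm_module.

```python
def select_fp4_gemm_module(sm_major: int, sm_minor: int) -> str:
    """Pick the CUTLASS FP4 GEMM module generator for the detected
    compute capability.

    SM103 (compute capability 10.3) gets the specialized module, which
    also carries the SM100 base configurations so autotuning can choose
    between them; other parts fall back to the plain SM100 path.
    """
    if sm_major == 10 and sm_minor == 3:
        return "gen_gemm_sm103_module_cutlass_fp4"
    return "gen_gemm_sm100_module_cutlass_fp4"
```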

Sequence Diagrams

sequenceDiagram
    participant PythonAPI as Python API<br/>(mm_fp4)
    participant Dispatcher as Compute Capability<br/>Dispatcher
    participant SM103Gen as SM103 Module<br/>Generator
    participant JITCompiler as JIT Compiler
    participant CUDARuntime as CUDA Runtime

    PythonAPI->>Dispatcher: Extract (major, minor) capability
    alt sm_minor == 3
        Dispatcher->>SM103Gen: get_gemm_sm103_module_cutlass_fp4()
        SM103Gen->>SM103Gen: Generate fp4_gemm_cutlass_sm103.cu<br/>with tile configs<br/>(128x128x768, 128x192x768,<br/>128x256x768)
        SM103Gen->>JITCompiler: Render templates &<br/>compile sources
        JITCompiler->>JITCompiler: Build SM103 kernels<br/>with ENABLE_FP4,<br/>ENABLE_BF16 flags
        JITCompiler-->>SM103Gen: Compiled module
        SM103Gen-->>Dispatcher: cutlass_fp4_gemm_runner()
    else sm_minor != 3
        Dispatcher->>Dispatcher: Fallback to SM100 path
    end
    Dispatcher->>PythonAPI: Return kernel runner
    PythonAPI->>CUDARuntime: Dispatch FP4 GEMM<br/>via tactic selection

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • aleozlx
  • djmmoss
  • yongwww
  • cyx-6
  • nvmbreughe
  • bkryu
  • ttyio

Poem

🐰 Whiskers twitch with joy so bright,
SM103's FP4 kernels take flight!
Tile shapes hop from SM100 to new,
H100 hardware, we've got you too!
With BF16 and dispatch so neat,
GEMM ops make the code complete! 🚀

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

Docstring Coverage (⚠️ Warning): Docstring coverage is 16.22%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

Description check (❓ Inconclusive): The PR description provides comprehensive motivation, performance results, and technical details, but the description field is missing required template sections such as the checklist items. Resolution: consider adding the pre-commit checks and tests sections from the template to document completion status of these validation steps.

✅ Passed checks (1 passed)

Title check (✅ Passed): The title clearly and specifically describes the main change, adding SM103-specific schedulers for NVFP4 CUTLASS kernels, directly matching the PR's core objective.




@LopezCastroRoberto LopezCastroRoberto marked this pull request as draft January 7, 2026 17:41
@gemini-code-assist
Contributor

Summary of Changes

Hello @LopezCastroRoberto, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of NVFP4 General Matrix Multiplication (GEMM) operations on NVIDIA's SM103 architecture (B300 GPUs). By introducing specialized CUTLASS kernel schedulers tailored for SM103, the changes aim to unlock greater efficiency and speedup, especially for larger batch sizes in deep learning workloads. The integration ensures that the system intelligently selects the most performant kernel for the given hardware, without replacing existing configurations for other architectures.

Highlights

  • SM103-Specific Scheduler Integration: Introduced new template specializations for SM103 NVFP4 CUTLASS GEMM kernels, leveraging architecture-specific schedulers to optimize performance on B300 GPUs.
  • Performance Improvement: Achieved significant performance gains for NVFP4 GEMMs on SM103, particularly at larger batch sizes, moving closer to the theoretical 6x NVFP4-over-BF16 speedup ratio compared to the previous 4x on B200.
  • Dynamic Configuration Selection: The new SM103-specific configurations are added alongside existing ones, and the optimal configuration is automatically selected through an autotuning process, ensuring the best performance across various workloads.
  • New Kernel Definitions: Added new CUDA source and Jinja templates (fp4_gemm_cutlass_sm103.cu, fp4_gemm_cutlass_sm103.jinja, fp4_gemm_template_sm103.h) to define and instantiate the specialized SM103 kernels.
  • Python Integration: Updated the Python GEMM module loading logic to dynamically select the appropriate CUTLASS FP4 GEMM module (SM100 or SM103) based on the detected GPU compute capability (sm_major and sm_minor).


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
|---------|---------|-------------|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces SM103-specific schedulers for NVFP4 CUTLASS kernels to enhance performance, particularly for larger batch sizes. The changes are well-structured, adding new kernel configurations and the necessary C++ and Python logic to dispatch to them based on the GPU architecture. The overall approach is sound. My review has identified a high-severity issue in the build scripts that could cause file conflicts, along with a couple of medium-severity issues related to misleading documentation and error messages. Addressing these points will improve the correctness and maintainability of the code.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
include/flashinfer/gemm/cutlass_gemm_configs.h (2)

284-301: Missing get_cluster_shape_name() cases for several cluster shapes.

The ClusterShape enum includes ClusterShape_1x4x1, ClusterShape_4x2x1, ClusterShape_2x4x1, and ClusterShape_4x4x1, but get_cluster_shape_name() does not handle these cases, returning "Unknown shape" for them. This may cause confusion during debugging or logging.

Consider adding the missing cases for completeness:

Proposed fix
 static auto get_cluster_shape_name(ClusterShape Shape_MNK) {
   if (Shape_MNK == ClusterShape::ClusterShape_1x1x1) {
     return "1x1x1";
   } else if (Shape_MNK == ClusterShape::ClusterShape_2x1x1) {
     return "2x1x1";
   } else if (Shape_MNK == ClusterShape::ClusterShape_1x2x1) {
     return "1x2x1";
   } else if (Shape_MNK == ClusterShape::ClusterShape_2x2x1) {
     return "2x2x1";
+  } else if (Shape_MNK == ClusterShape::ClusterShape_1x4x1) {
+    return "1x4x1";
+  } else if (Shape_MNK == ClusterShape::ClusterShape_4x2x1) {
+    return "4x2x1";
+  } else if (Shape_MNK == ClusterShape::ClusterShape_2x4x1) {
+    return "2x4x1";
+  } else if (Shape_MNK == ClusterShape::ClusterShape_4x4x1) {
+    return "4x4x1";
   } else if (Shape_MNK == ClusterShape::ClusterShape_1x8x1) {
     return "1x8x1";
   } else if (Shape_MNK == ClusterShape::ClusterShape_8x1x1) {
     return "8x1x1";
   } else if (Shape_MNK == ClusterShape::ClusterShape_4x1x1) {
     return "4x1x1";
   }
   return "Unknown shape";
 }

303-321: Missing get_cluster_shape() cases cause undefined behavior.

The template function get_cluster_shape() does not handle ClusterShape_1x4x1, ClusterShape_4x2x1, ClusterShape_2x4x1, and ClusterShape_4x4x1. For unmatched cases, the function has no return statement, resulting in undefined behavior.

Proposed fix
 template <ClusterShape Shape_MNK>
 constexpr auto get_cluster_shape() {
   using namespace cute;
   if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x1x1) {
     return cute::Shape<_1, _1, _1>{};
   } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_2x1x1) {
     return cute::Shape<_2, _1, _1>{};
   } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x2x1) {
     return cute::Shape<_1, _2, _1>{};
   } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_2x2x1) {
     return cute::Shape<_2, _2, _1>{};
+  } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x4x1) {
+    return cute::Shape<_1, _4, _1>{};
+  } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_4x2x1) {
+    return cute::Shape<_4, _2, _1>{};
+  } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_2x4x1) {
+    return cute::Shape<_2, _4, _1>{};
+  } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_4x4x1) {
+    return cute::Shape<_4, _4, _1>{};
   } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x8x1) {
     return cute::Shape<_1, _8, _1>{};
   } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_8x1x1) {
     return cute::Shape<_8, _1, _1>{};
   } else if constexpr (Shape_MNK == ClusterShape::ClusterShape_4x1x1) {
     return cute::Shape<_4, _1, _1>{};
+  } else {
+    static_assert(sizeof(Shape_MNK) == 0, "Unsupported ClusterShape");
   }
 }
🤖 Fix all issues with AI agents
In @flashinfer/gemm/gemm_base.py:
- Around line 525-531: The docstring for get_gemm_sm103_module_cutlass_fp4() is
incorrect (it references SM100/103/110); update it to accurately describe this
function as returning the SM103 FP4 GEMM module (e.g., "Get the SM103 FP4 GEMM
module.") so it matches the function name and behavior in
gen_gemm_sm103_module_cutlass_fp4() and the _create_cutlass_fp4_gemm_module
call.

In @flashinfer/jit/gemm/cutlass/cutlass_library.py:
- Line 627: Remove the personal annotation "#RLC:" from the
KernelScheduleType.Nvf4TmaWarpSpecialized2SmSm103 mapping in the cutlass mapping
table (the entry that maps to the long cutlass::gemm class name) and add a
corresponding suffix entry to the KernelScheduleSuffixes dictionary for
KernelScheduleType.Nvf4TmaWarpSpecialized2SmSm103 with the value
"_o_vs16_2sm_sm103" so the suffix map includes this schedule type.

In @include/flashinfer/gemm/fp4_gemm_template_sm103.h:
- Around line 270-281: Error messages reference the wrong architecture string;
update the messages constructed after the gemm.initialize (initStatus) and
gemm.run (runStatus) checks to say "sm103" instead of "sm100". Locate the blocks
using gemm.initialize(args, workspace, stream) and gemm.run(args, workspace,
stream, nullptr, /*enablePDL=*/true) and change the human-readable text in the
std::string errMsg concatenations that include "Failed to initialize/run cutlass
FP4 gemm on sm100" to "Failed to initialize/run cutlass FP4 gemm on sm103" while
keeping the rest of the error handling (cutlassGetStatusString, throwing
std::runtime_error) unchanged.
🧹 Nitpick comments (6)
csrc/fp4_gemm_cutlass.jinja (1)

29-29: LGTM! New cluster configuration correctly instantiated.

The new (4,1,1) cluster configuration with _2SM scheduler is correctly instantiated and complements the existing configurations. This aligns with the PR objective to improve SM103 NVFP4 performance.

♻️ Optional: Consider reordering for better organization

For improved readability, you might place the (4,1,1) configuration before (4,2,1) to maintain a consistent ordering pattern (cluster_m=4, then cluster_n in ascending order: 1, 2, 4).

 INSTANTIATE_FP4_GEMM_KERNEL_LAUNCHER({{ type }}, {{ cta_m }}, {{ cta_n }}, {{ cta_k }}, 2, 4, 1, _2SM)
+INSTANTIATE_FP4_GEMM_KERNEL_LAUNCHER({{ type }}, {{ cta_m }}, {{ cta_n }}, {{ cta_k }}, 4, 1, 1, _2SM)
 INSTANTIATE_FP4_GEMM_KERNEL_LAUNCHER({{ type }}, {{ cta_m }}, {{ cta_n }}, {{ cta_k }}, 4, 2, 1, _2SM)
 INSTANTIATE_FP4_GEMM_KERNEL_LAUNCHER({{ type }}, {{ cta_m }}, {{ cta_n }}, {{ cta_k }}, 4, 4, 1, _2SM)
-INSTANTIATE_FP4_GEMM_KERNEL_LAUNCHER({{ type }}, {{ cta_m }}, {{ cta_n }}, {{ cta_k }}, 4, 1, 1, _2SM)

This is purely cosmetic and doesn't affect functionality.

include/flashinfer/gemm/fp4_gemm_cutlass_template_sm103.h (3)

17-18: Include guard name may conflict with other headers.

The include guard FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_H_ is generic and doesn't include "SM103". If there's another fp4_gemm_cutlass_template.h (e.g., for SM100), this could cause include guard collisions.

Suggested fix
-#ifndef FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_H_
-#define FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_H_
+#ifndef FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_SM103_H_
+#define FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_SM103_H_

And at the end of the file:

-#endif  // FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_H_
+#endif  // FLASHINFER_FP4_GEMM_CUTLASS_TEMPLATE_SM103_H_

357-364: Weak hash function with high collision probability.

The hash function XORs all four values directly without bit shifting, which leads to poor distribution. For example, (1,2,3,4) and (2,1,4,3) would produce the same hash.

Proposed fix using a better hash combination
   struct MNKHash {
     size_t operator()(const MNK& mnk) const {
       auto h1 = std::hash<int>{}(std::get<0>(mnk));
       auto h2 = std::hash<int>{}(std::get<1>(mnk));
       auto h3 = std::hash<int>{}(std::get<2>(mnk));
       auto h4 = std::hash<int>{}(std::get<3>(mnk));
-      return h1 ^ h2 ^ h3 ^ h4;
+      // Combine hashes with bit shifting to reduce collisions
+      size_t seed = h1;
+      seed ^= h2 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
+      seed ^= h3 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
+      seed ^= h4 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
+      return seed;
     }
   };

287-329: Review the getConfigs() tactic ordering.

The best_tactics_index list {22, 20, 29, 4, 18} references specific indices in candidateConfigs. This assumes the configuration list order is stable. Any changes to tilesSm100 or clusterShapes vectors will invalidate these indices, leading to incorrect tactic prioritization.

Consider using a more robust approach, such as storing the actual configuration tuples rather than indices.
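
As a sketch of the suggested alternative, preferred tactics could be keyed by their configuration parameters rather than by list positions. This is a hypothetical Python analogue of the C++ config list, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GemmConfig:
    tile: str      # e.g. "128x256x768"
    cluster: str   # e.g. "4x1x1"

def order_configs(candidates, preferred):
    """Return candidates with the preferred configs first, in preference
    order, followed by the rest.

    Unlike hard-coded indices such as {22, 20, 29, 4, 18}, this stays
    correct if the candidate list is reordered or extended: a preferred
    config that is absent from candidates is simply skipped.
    """
    candidate_set = set(candidates)
    head = [c for c in preferred if c in candidate_set]
    head_set = set(head)
    tail = [c for c in candidates if c not in head_set]
    return head + tail
```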

flashinfer/jit/gemm/cutlass/generate_kernels.py (1)

22-22: Unused import.

The logger is imported but does not appear to be used anywhere in this file.

Proposed fix
-from ...core import logger
csrc/fp4_gemm_cutlass_sm103.cu (1)

103-103: Consider removing or documenting the unused variable.

mat2_k_scale is set to 1 and used in dimension checks, but its purpose isn't clear. If it's a placeholder for future scaling functionality, a comment explaining this would help. If it's truly unused, consider removing it.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df8015c and dd8061a.

📒 Files selected for processing (11)
  • csrc/fp4_gemm_cutlass.jinja
  • csrc/fp4_gemm_cutlass_sm103.cu
  • csrc/fp4_gemm_cutlass_sm103.jinja
  • flashinfer/gemm/gemm_base.py
  • flashinfer/jit/gemm/__init__.py
  • flashinfer/jit/gemm/core.py
  • flashinfer/jit/gemm/cutlass/cutlass_library.py
  • flashinfer/jit/gemm/cutlass/generate_kernels.py
  • include/flashinfer/gemm/cutlass_gemm_configs.h
  • include/flashinfer/gemm/fp4_gemm_cutlass_template_sm103.h
  • include/flashinfer/gemm/fp4_gemm_template_sm103.h
🧰 Additional context used
📓 Path-based instructions (4)
flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

  • flashinfer/jit/gemm/cutlass/generate_kernels.py
  • flashinfer/jit/gemm/core.py
  • flashinfer/jit/gemm/cutlass/cutlass_library.py
  • flashinfer/gemm/gemm_base.py
  • flashinfer/jit/gemm/__init__.py
flashinfer/jit/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/jit/**/*.py: JIT module generators in flashinfer/jit/ must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec
Use gen_jit_spec() function to return a properly configured JitSpec from module generators with appropriate sources and extra_cuda_cflags
Specify supported_major_versions in JitSpec to restrict kernel compilation to supported GPU architectures (e.g., SM versions 9, 10, 11, 12 for Hopper/newer)

Files:

  • flashinfer/jit/gemm/cutlass/generate_kernels.py
  • flashinfer/jit/gemm/core.py
  • flashinfer/jit/gemm/cutlass/cutlass_library.py
  • flashinfer/jit/gemm/__init__.py
csrc/**/*.jinja

📄 CodeRabbit inference engine (CLAUDE.md)

csrc/**/*.jinja: Use dispatch macros (e.g., DISPATCH_DTYPE, DISPATCH_BLOCK_SIZE) in .jinja template files to handle combinatorial parameter spaces in CUDA kernels
Use DISPATCH_DTYPE, DISPATCH_BLOCK_SIZE, and similar macros to reduce code duplication when handling multiple dtype and template parameter combinations

Files:

  • csrc/fp4_gemm_cutlass.jinja
  • csrc/fp4_gemm_cutlass_sm103.jinja
csrc/**/*.cu

📄 CodeRabbit inference engine (CLAUDE.md)

Framework bindings and PyTorch tensor handling should be implemented in csrc/ via TVM-FFI, not in include/ headers

Files:

  • csrc/fp4_gemm_cutlass_sm103.cu
🧠 Learnings (12)
📓 Common learnings
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : Specify `supported_major_versions` in JitSpec to restrict kernel compilation to supported GPU architectures (e.g., SM versions 9, 10, 11, 12 for Hopper/newer)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.

Applied to files:

  • csrc/fp4_gemm_cutlass.jinja
  • csrc/fp4_gemm_cutlass_sm103.cu
  • csrc/fp4_gemm_cutlass_sm103.jinja
  • include/flashinfer/gemm/fp4_gemm_template_sm103.h
  • include/flashinfer/gemm/fp4_gemm_cutlass_template_sm103.h
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Use `FLASHINFER_CUDA_ARCH_LIST` environment variable to specify target GPU architectures (e.g., '8.0 9.0a') and `FLASHINFER_NVCC_THREADS` to control parallel compilation threads

Applied to files:

  • csrc/fp4_gemm_cutlass.jinja
  • csrc/fp4_gemm_cutlass_sm103.jinja
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : Use `gen_jit_spec()` function to return a properly configured JitSpec from module generators with appropriate `sources` and `extra_cuda_cflags`

Applied to files:

  • flashinfer/jit/gemm/core.py
  • flashinfer/gemm/gemm_base.py
  • flashinfer/jit/gemm/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : JIT module generators in `flashinfer/jit/` must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec

Applied to files:

  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : Specify `supported_major_versions` in JitSpec to restrict kernel compilation to supported GPU architectures (e.g., SM versions 9, 10, 11, 12 for Hopper/newer)

Applied to files:

  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/aot.py : Register new operations in `flashinfer/aot.py` by calling the `gen_*_module()` function for AOT (Ahead-Of-Time) pre-compilation support

Applied to files:

  • flashinfer/jit/gemm/core.py
  • flashinfer/jit/gemm/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed

Applied to files:

  • flashinfer/jit/gemm/core.py
  • csrc/fp4_gemm_cutlass_sm103.jinja
  • flashinfer/jit/gemm/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to csrc/**/*_jit_binding.cu : Create TVM-FFI bindings in files matching the pattern `csrc/*_jit_binding.cu` using the `TVM_FFI_DLL_EXPORT_TYPED_FUNC(name, func)` macro to expose C++ functions

Applied to files:

  • csrc/fp4_gemm_cutlass_sm103.cu
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to csrc/**/*.cu : Framework bindings and PyTorch tensor handling should be implemented in `csrc/` via TVM-FFI, not in `include/` headers

Applied to files:

  • csrc/fp4_gemm_cutlass_sm103.cu
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Consult the PTX ISA documentation (https://docs.nvidia.com/cuda/parallel-thread-execution/) for low-level instruction details and new GPU architecture features when writing inline PTX assembly

Applied to files:

  • csrc/fp4_gemm_cutlass_sm103.jinja
  • include/flashinfer/gemm/fp4_gemm_template_sm103.h
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/__init__.py : Export new operations in `flashinfer/__init__.py` to make them available as public API

Applied to files:

  • flashinfer/jit/gemm/__init__.py
🧬 Code graph analysis (4)
flashinfer/jit/gemm/core.py (2)
flashinfer/jit/core.py (2)
  • JitSpec (216-397)
  • gen_jit_spec (400-466)
flashinfer/compilation_context.py (1)
  • get_nvcc_flags_list (50-68)
flashinfer/jit/gemm/__init__.py (1)
flashinfer/jit/gemm/core.py (1)
  • gen_gemm_sm103_module_cutlass_fp4 (97-165)
include/flashinfer/gemm/fp4_gemm_template_sm103.h (2)
include/flashinfer/gemm/fp4_gemm_cutlass_template_sm103.h (1)
  • gemm (42-381)
include/flashinfer/gemm/cutlass_gemm_configs.h (1)
  • ClusterShape (270-412)
include/flashinfer/gemm/fp4_gemm_cutlass_template_sm103.h (2)
include/flashinfer/gemm/fp4_gemm_template_sm103.h (4)
  • gemm (38-288)
  • void (151-283)
  • _1SM_sm103 (55-60)
  • _2SM_sm103 (63-68)
include/flashinfer/gemm/fp4_gemm_cutlass.h (1)
  • FP4GemmType (59-88)
🪛 Ruff (0.14.10)
flashinfer/jit/gemm/core.py

157-161: Consider [*nvcc_flags, "-DENABLE_BF16", "-DENABLE_FP4"] instead of concatenation

Replace with [*nvcc_flags, "-DENABLE_BF16", "-DENABLE_FP4"]

(RUF005)
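
For reference, RUF005 suggests Python's iterable-unpacking list literal in place of concatenation; both forms produce the same list (flag values here are illustrative, not the actual nvcc_flags contents):

```python
nvcc_flags = ["-O3", "--use_fast_math"]

# Concatenation (what RUF005 flags):
flags_concat = nvcc_flags + ["-DENABLE_BF16", "-DENABLE_FP4"]

# Unpacking literal (what Ruff suggests):
flags_unpack = [*nvcc_flags, "-DENABLE_BF16", "-DENABLE_FP4"]

assert flags_concat == flags_unpack
```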

🔇 Additional comments (17)
flashinfer/jit/gemm/__init__.py (1)

20-20: LGTM!

The new gen_gemm_sm103_module_cutlass_fp4 symbol is correctly imported and exported, following the established pattern for other SM-specific module generators.

Also applies to: 36-36

include/flashinfer/gemm/cutlass_gemm_configs.h (1)

136-140: LGTM!

The SM103-specific tile configurations (128x128x768, 128x192x768, 128x256x768) are correctly added to the CutlassTileConfigSM100 enum, TileShape enum, and the corresponding get_tile_shape() and get_tile_shape_name() functions.

Also applies to: 196-200, 228-233, 260-265

include/flashinfer/gemm/fp4_gemm_cutlass_template_sm103.h (1)

45-112: Missing cluster shape cases in dispatch functions.

Both dispatchNVFP4xNVFP4GemmClusterShapeSm100 and dispatchNVFP4xNVFP4GemmClusterShapeSm103 handle most cluster shapes but miss ClusterShape::ClusterShape_1x8x1 and ClusterShape::ClusterShape_8x1x1. These are present in the ClusterShape enum and used in getConfigs(). If these shapes are selected during autotuning, the dispatch will throw a runtime error.

Please verify whether ClusterShape_1x8x1 and ClusterShape_8x1x1 should be supported for SM103 FP4 GEMM, or if they should be excluded from the config list at lines 300-306.

Also applies to: 114-181
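The failure mode described above can be modeled with a toy dispatcher. The shape sets below are illustrative, not the actual contents of the SM103 switch; the point is that any shape present in the config list but absent from the dispatch raises at runtime instead of launching a kernel:

```python
# Hypothetical mirror of the dispatch gap: shapes handled by the switch
# vs. shapes offered to the autotuner via getConfigs().
HANDLED_SHAPES = {"1x1x1", "2x1x1", "1x2x1", "2x2x1", "1x4x1", "4x1x1"}
CONFIG_SHAPES = HANDLED_SHAPES | {"1x8x1", "8x1x1"}  # the two flagged shapes

def dispatch_cluster_shape(shape: str) -> str:
    # Mirrors the switch's default case: unhandled shapes throw.
    if shape not in HANDLED_SHAPES:
        raise RuntimeError(f"unsupported cluster shape {shape}")
    return f"launcher_{shape}"
```

If the autotuner ever selects `1x8x1` or `8x1x1`, the call fails rather than falling back, which is why the review asks for the two lists to be reconciled.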

csrc/fp4_gemm_cutlass_sm103.jinja (1)

1-32: LGTM!

The Jinja template correctly instantiates SM103 FP4 Ultra GEMM kernel launchers for the supported cluster shape configurations, with appropriate SM type suffixes (_1SM_sm103, _2SM_sm103).

flashinfer/jit/gemm/core.py (3)

97-99: Shared generation directory may cause confusion.

The gen_directory is set to "gen_gemm_sm100_cutlass_fp4", same as gen_gemm_sm100_module_cutlass_fp4(). While this may be intentional (the SM103 module includes SM100 configurations), it could lead to file collisions or confusion during incremental builds. Consider using a distinct directory like "gen_gemm_sm103_cutlass_fp4".


127-149: SM103 module includes SM100 kernel configurations.

The SM103 module generator also renders kernels using fp4_gemm_cutlass.jinja with SM100 tile configurations. This creates a superset module containing both SM100 and SM103 kernels.

Please confirm this is the intended design - the SM103 module should support both SM100 base configurations and SM103-specific optimized configurations for autotuning to select the best one.


151-165: LGTM - follows established JIT module pattern.

The function correctly:

  • Specifies supported_major_versions=[10, 11, 12] per coding guidelines
  • Uses gen_jit_spec() to return a properly configured JitSpec
  • Includes appropriate CUDA flags for BF16 and FP4 support

Based on learnings, the supported_major_versions specification aligns with JIT module conventions.

csrc/fp4_gemm_cutlass_sm103.cu (4)

1-43: LGTM - File structure and template instantiations are correct.

The file correctly implements TVM-FFI bindings for SM103 FP4 GEMM as per the coding guidelines. Template instantiations for both __nv_bfloat16 and half types are properly declared.


49-58: LGTM - Config retrieval with proper bounds checking.

The static config caching and bounds validation are correctly implemented.


176-193: LGTM - Public API functions and TVM FFI exports are correct.

The fp4_gemm wrapper and fp4_gemm_tactic_num functions are cleanly implemented. The TVM FFI exports follow the correct pattern.


78-84: Verify if ffi::Tensor reference counting prevents premature deallocation of temporary workspace.

The async GEMM kernel receives a pointer to new_workspace, which goes out of scope before the kernel completes. This is safe only if TVM's ffi::Tensor uses reference counting or environment-managed memory that extends the tensor's lifetime beyond the local scope. Verify against TVM's FFI documentation or implementation to confirm the memory lifetime guarantees, or add explicit stream synchronization as a safeguard.

flashinfer/gemm/gemm_base.py (2)

542-554: LGTM - SM103 routing logic is correct.

The routing correctly identifies SM103 (major=10, minor=3) and routes to the specialized module. Other SM10x/SM11x variants correctly fall back to the SM100 path.
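The routing described above can be sketched as a small selector. Function and module names here are hypothetical stand-ins for the actual FlashInfer getters, but the branch structure matches the review: compute capability 10.3 gets the dedicated SM103 module, other SM10x/SM11x fall back to the SM100/110 path, and SM12x uses its own path:

```python
def select_fp4_gemm_module(sm_major: int, sm_minor: int) -> str:
    # SM103 (compute capability 10.3) routes to the specialized module.
    if sm_major == 10 and sm_minor == 3:
        return "fp4_gemm_cutlass_sm103"
    # Other SM10x/SM11x variants fall back to the SM100/110 path.
    if sm_major in (10, 11):
        return "fp4_gemm_cutlass_sm100"
    # SM12x variants use the SM120 path.
    if sm_major == 12:
        return "fp4_gemm_cutlass_sm120"
    raise ValueError(f"unsupported compute capability {sm_major}.{sm_minor}")
```

Because the SM103 module is a superset (SM100 base configs plus SM103-specific ones), this routing keeps autotuning free to pick either family on 10.3 hardware.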


2288-2295: LGTM - Compute capability extraction updated correctly.

The change properly extracts both major and minor versions and passes them to enable SM103-specific module selection.

include/flashinfer/gemm/fp4_gemm_template_sm103.h (4)

1-46: LGTM - Header structure and type definitions are correct.

The header guard, includes, namespace structure, and SafeBF16_sm103 definition follow established patterns. The conditional BF16 handling is appropriate.


47-68: LGTM - SM type adapters correctly specialized for SM103.

The 1SM and 2SM configurations properly define their respective scales, thread shapes, and SM103-specific schedule types.


148-163: LGTM - Architecture guard correctly enforces SM103 execution.

The Sm103Only wrapper provides a safety mechanism to prevent execution on incompatible architectures. The use of is_match_v<103> and __trap() follows established patterns for architecture enforcement.


254-257: LGTM - Workspace size query pattern is correct.

The null-pointer check pattern for querying workspace size without running the kernel is a standard CUTLASS convention.

@LopezCastroRoberto LopezCastroRoberto changed the title from "[Perf][Feature] Add SM103-specific schedulers for B300 NVFP4 CUTLASS kernels" to "[Perf][Feature] Add SM103-specific schedulers for NVFP4 CUTLASS kernels" Jan 7, 2026
Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com>
@LopezCastroRoberto LopezCastroRoberto marked this pull request as ready for review January 7, 2026 19:29

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @flashinfer/gemm/gemm_base.py:
- Line 519: Update the docstring for the function that currently reads "Get the
SM100/103/110 FP4 GEMM module." to reflect that SM103 is now separate; change
the text to "Get the SM100/110 FP4 GEMM module." so it matches the dedicated
get_gemm_sm103_module_cutlass_fp4() handler and avoids confusion when locating
get_gemm_sm100_110_module_cutlass_fp4().
🧹 Nitpick comments (1)
flashinfer/jit/gemm/core.py (1)

157-161: Consider using spread operator for list concatenation.

The static analysis tool suggests using the spread operator for cleaner list concatenation.

♻️ Suggested refactor
     return gen_jit_spec(
         "fp4_gemm_cutlass_sm103",
         source_paths,
-        extra_cuda_cflags=nvcc_flags
-        + [
-            "-DENABLE_BF16",
-            "-DENABLE_FP4",
-        ],
+        extra_cuda_cflags=[
+            *nvcc_flags,
+            "-DENABLE_BF16",
+            "-DENABLE_FP4",
+        ],
         extra_cflags=[
             "-DFAST_BUILD",
         ],
     )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd8061a and 8cd1d62.

📒 Files selected for processing (3)
  • flashinfer/gemm/gemm_base.py
  • flashinfer/jit/gemm/core.py
  • include/flashinfer/gemm/fp4_gemm_template_sm103.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • include/flashinfer/gemm/fp4_gemm_template_sm103.h
🧰 Additional context used
📓 Path-based instructions (2)
flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

  • flashinfer/gemm/gemm_base.py
  • flashinfer/jit/gemm/core.py
flashinfer/jit/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/jit/**/*.py: JIT module generators in flashinfer/jit/ must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec
Use gen_jit_spec() function to return a properly configured JitSpec from module generators with appropriate sources and extra_cuda_cflags
Specify supported_major_versions in JitSpec to restrict kernel compilation to supported GPU architectures (e.g., SM versions 9, 10, 11, 12 for Hopper/newer)

Files:

  • flashinfer/jit/gemm/core.py
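The generator pattern named above (compute URI → create directory → render template → return spec) can be sketched end to end. Everything below is illustrative scaffolding, not the real `gen_jit_spec()` API: it just shows the shape of the flow a generator like `gen_gemm_sm103_module_cutlass_fp4` follows.

```python
import tempfile
from pathlib import Path

def gen_fake_fp4_module(arch: str) -> dict:
    # 1. Compute a stable URI for this module variant.
    uri = f"fp4_gemm_cutlass_sm{arch}"
    # 2. Create the generation directory (a temp dir stands in here).
    gen_dir = Path(tempfile.mkdtemp()) / uri
    gen_dir.mkdir(parents=True, exist_ok=True)
    # 3. "Render" a template into a source file.
    source = gen_dir / "kernel.cu"
    source.write_text(f"// instantiations rendered for sm{arch}\n")
    # 4. Return a spec-like record (the real code returns a JitSpec).
    return {"uri": uri, "sources": [str(source)]}
```

In the real generator, step 3 renders `fp4_gemm_cutlass_sm103.jinja` (plus the SM100 template) and step 4 passes `supported_major_versions` so compilation is restricted to the right architectures.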
🧠 Learnings (6)
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : Use `gen_jit_spec()` function to return a properly configured JitSpec from module generators with appropriate `sources` and `extra_cuda_cflags`

Applied to files:

  • flashinfer/gemm/gemm_base.py
  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : JIT module generators in `flashinfer/jit/` must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec

Applied to files:

  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : Specify `supported_major_versions` in JitSpec to restrict kernel compilation to supported GPU architectures (e.g., SM versions 9, 10, 11, 12 for Hopper/newer)

Applied to files:

  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/aot.py : Register new operations in `flashinfer/aot.py` by calling the `gen_*_module()` function for AOT (Ahead-Of-Time) pre-compilation support

Applied to files:

  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed

Applied to files:

  • flashinfer/jit/gemm/core.py
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.

Applied to files:

  • flashinfer/jit/gemm/core.py
🪛 Ruff (0.14.10)
flashinfer/jit/gemm/core.py

157-161: Consider [*nvcc_flags, "-DENABLE_BF16", "-DENABLE_FP4"] instead of concatenation

Replace with [*nvcc_flags, "-DENABLE_BF16", "-DENABLE_FP4"]

(RUF005)

🔇 Additional comments (4)
flashinfer/jit/gemm/core.py (1)

97-166: LGTM! SM103 module generator correctly implements dual-configuration strategy.

The implementation properly generates SM103-specific optimizations alongside fallback configurations by using both fp4_gemm_cutlass_sm103.jinja (with larger K-dimension tiles: 768) and fp4_gemm_cutlass.jinja (standard tiles). This approach aligns with the PR objectives of providing SM103-specific schedulers while maintaining compatibility.

The separate directory gen_gemm_sm103_cutlass_fp4 correctly addresses the previous review concern about file collisions.

flashinfer/gemm/gemm_base.py (3)

525-531: LGTM! SM103 module accessor correctly implemented.

The function properly builds and loads the SM103-specific FP4 GEMM module with correct docstring and caching. Implementation follows the established pattern from SM100 and SM120 variants.


542-554: LGTM! SM103 routing logic correctly implemented.

The updated function properly routes to the SM103-specific module when sm_minor == 3 (compute capability 10.3), while maintaining backward compatibility for SM100/110. The conditional logic clearly separates the three variants (SM10x with/without SM103, SM12x).


2288-2295: LGTM! Compute capability extraction correctly updated.

The code now properly extracts both major and minor compute capability values and passes them to the module selector, enabling correct routing to SM103-specific kernels when minor == 3.

Collaborator

@IwakuraRein IwakuraRein left a comment


LGTM. Thanks for the contributions!

@aleozlx
Collaborator

aleozlx commented Jan 17, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !247 has been created, and the CI pipeline #41923518 is currently running. I'll report back once the pipeline job completes.

@aleozlx
Collaborator

aleozlx commented Jan 17, 2026

Thanks! I'll also review but today might be hard

@flashinfer-bot
Collaborator

[FAILED] Pipeline #41923518: 14/20 passed

Collaborator

@aleozlx aleozlx left a comment


LGTM as well, but I want to give some time for the other comments to be resolved.

@LopezCastroRoberto
Contributor Author

Any updates regarding this PR? I saw v0.6.2 was released last week, but the changes here were not merged. Thanks!

cc: @yzh119 @aleozlx

@bkryu
Collaborator

bkryu commented Feb 2, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !247 has been updated with latest changes, and the CI pipeline #43135280 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[CANCELING] Pipeline #43135280: canceled

Collaborator

@bkryu bkryu left a comment


LGTM. Unit tests are also coming back as passing.

@bkryu bkryu merged commit c7761ad into flashinfer-ai:main Feb 3, 2026
49 of 58 checks passed
raayandhar pushed a commit to raayandhar/flashinfer that referenced this pull request Feb 5, 2026
…ls (flashinfer-ai#2303)

## Summary

This PR adds new template specializations for SM103 NVFP4 CUTLASS GEMM
kernels using architecture-specific tile shapes, cluster shapes, and
schedulers.

## Motivation

SM103 specifications show a higher NVFP4-over-BF16 speedup ratio than
B200 (6× vs. 4×), but current kernels remain far from this limit.
This PR introduces SM103-optimized templates to improve the achieved
performance on this architecture.

The performance gains are more pronounced at larger batch sizes, while
the previous SM100 configurations remain preferable in other cases.
For this reason, SM103-specific configurations were added alongside the
existing ones rather than replacing them, and the optimal configuration
is automatically selected as part of the autotuning process.

## Performance results examples

Llama-3.1-70B, N=8192 K=28672, BF16 vs NVFP4 GEMMs TFLOP/s:

| Batch Size | Torch BF16 | NVFP4 Before | NVFP4 After |
|-----------:|-----------:|-------------:|------------:|
| 8          | 50.418336  | 110.598008   | 124.005817  |
| 16         | 99.350151  | 219.649654   | 260.502226  |
| 32         | 193.884850 | 445.840601   | 519.291059  |
| 64         | 385.790757 | 978.451544   | 1011.614080 |
| 128        | 692.915989 | 2072.797941  | 2076.017433 |
| 256        | 1211.413202| 3817.738538  | 3868.924511 |
| 512        | 1464.015616| 5141.532768  | 5503.664311 |
| 1024       | 1600.983748| 5659.831320  | 6341.013002 |
| 2048       | 1625.639619| 5991.840134  | 6630.757403 |
| 4096       | 1602.978834| 6160.806595  | 6898.878407 |
| 8192       | 1691.174722| 5939.220913  | 6653.915111 |
| 16384      | 1688.224044| 5926.519222  | 6595.387600 |
| 24576      | 1706.774619| 5905.301100  | 6617.486211 |
| 32768      | 1678.225402| 5913.806010  | 6592.762922 |

---

Llama-3.1-70B, N=8192 K=8192, BF16 vs NVFP4 GEMMs TFLOP/s:

| Batch Size | Torch BF16 | NVFP4 Before | NVFP4 After |
|-----------:|-----------:|-------------:|------------:|
| 8          | 47.780647  | 124.774241   | 124.760324  |
| 16         | 95.671633  | 249.502165   | 249.131125  |
| 32         | 189.224266 | 497.991489   | 497.277802  |
| 64         | 373.320912 | 993.731451   | 989.446041  |
| 128        | 707.096994 | 1959.258553  | 1970.430179 |
| 256        | 1126.908748| 4037.558967  | 4159.515720 |
| 512        | 1407.884777| 5045.981883  | 4958.698763 |
| 1024       | 1491.747576| 5654.694949  | 5614.133004 |
| 2048       | 1546.322959| 5898.291400  | 6204.813491 |
| 4096       | 1610.656216| 6312.498418  | 6605.534723 |
| 8192       | 1623.748353| 6392.424296  | 6803.660138 |
| 16384      | 1627.947338| 6438.789701  | 6947.466217 |
| 24576      | 1614.582791| 6469.307368  | 6991.331576 |
| 32768      | 1617.601164| 6515.312895  | 7010.746651 |

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added support for NVIDIA SM103 GPU architecture in FP4 operations with
specialized kernel configurations and optimized launcher
implementations, extending hardware compatibility and enabling efficient
computation on additional GPU variants.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com>
Co-authored-by: yzh119 <zihaoy@nvidia.com>
