
perf: mm_fp4 heuristic prioritizes CUTLASS over cuDNN on SM103 #2404

Merged
yzh119 merged 1 commit into main from claude/issue-2375-20260122-2147 on Jan 26, 2026

Conversation

@bkryu
Collaborator

@bkryu bkryu commented Jan 22, 2026

On SM103 (B300), CUTLASS outperforms cuDNN for FP4 GEMM operations, while on SM100 (B200), cuDNN is faster. This PR updates the backend selection heuristic to check compute capability and prefer CUTLASS on SM103 even with CUDA 13 and cuDNN 9.15+.

Fixes #2375

Generated with Claude Code

Summary by CodeRabbit

  • Performance
  • Enhanced backend selection for FP4 matrix operations on compatible GPU architectures when using CUDA 13+ and cuDNN 9.15 or later, improving performance on supported configurations.


On SM103 (B300), CUTLASS outperforms cuDNN for FP4 GEMM operations,
while on SM100 (B200), cuDNN is faster. This change updates the
backend selection heuristic to check compute capability and prefer
CUTLASS on SM103 even with CUDA 13 and cuDNN 9.15+.

Benchmark results show:
- SM103: CUTLASS ~10-15% faster than cuDNN
- SM100: cuDNN ~10-20% faster than CUTLASS

Fixes #2375

Co-authored-by: Brian K. Ryu <bkryu@users.noreply.github.com>
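
The described heuristic can be sketched as a small standalone function. This is a minimal sketch, not the actual flashinfer implementation: the function name, the flattened version/capability parameters, and the returned tuples are illustrative stand-ins for the real `_heuristic_func_mm_fp4` logic in flashinfer/gemm/gemm_base.py.

```python
def choose_fp4_backends(cc_major, cc_minor, cuda_major, cudnn_version, cudnn_available):
    """Sketch of the SM103-aware mm_fp4 backend ordering (illustrative, not the real API).

    cc_major/cc_minor: GPU compute capability (10.0 = SM100/B200, 10.3 = SM103/B300).
    cudnn_version: cuDNN backend version as an integer (9.15.0 -> 91500).
    """
    is_sm103 = cc_major == 10 and cc_minor == 3
    if cudnn_available and cuda_major >= 13 and cudnn_version >= 91500 and not is_sm103:
        # SM100 (B200) with CUDA 13+ and cuDNN 9.15+: cuDNN is faster, try it first
        return ("cudnn", "cutlass")
    # SM103 (B300), or older CUDA/cuDNN stacks: prefer CUTLASS
    return ("cutlass", "cudnn")
```

For example, a B300 (10.3) keeps CUTLASS first even on CUDA 13 with cuDNN 9.15, while a B200 (10.0) on the same stack gets cuDNN first.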
@gemini-code-assist
Contributor

Summary of Changes

Hello @bkryu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the FP4 GEMM backend selection logic to ensure optimal performance across different NVIDIA GPU architectures. It introduces a compute capability check to intelligently choose between CUTLASS and cuDNN, addressing performance discrepancies observed on SM103 (B300) and SM100 (B200) devices, particularly with newer CUDA and cuDNN versions. This change aims to maximize efficiency by leveraging the best-performing library for each specific hardware configuration.

Highlights

  • Backend Selection Heuristic Update: The _heuristic_func_mm_fp4 function has been updated to dynamically select the optimal GEMM backend (CUTLASS or cuDNN) for FP4 operations based on the GPU's compute capability.
  • Performance Optimization for SM103 (B300): On SM103 (B300) GPUs, CUTLASS is now prioritized over cuDNN for FP4 GEMM, as benchmarks indicate superior performance for this architecture, even with CUDA 13 and cuDNN 9.15+.
  • Performance Optimization for SM100 (B200): For SM100 (B200) GPUs, cuDNN remains the preferred backend for FP4 GEMM, ensuring continued optimal performance for that specific architecture.


@coderabbitai
Contributor

coderabbitai bot commented Jan 22, 2026

📝 Walkthrough

Walkthrough

The _heuristic_func_mm_fp4 backend selection heuristic in flashinfer/gemm/gemm_base.py now distinguishes between SM103 and SM100 GPUs when CUDA 13+ and cuDNN 9.15+ are detected, prioritizing CUTLASS on SM103 and cuDNN on SM100, and returns only supported backends instead of a fixed order.
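
The "returns only supported backends" part of the walkthrough can be sketched as a simple order-preserving filter. This is a hypothetical helper for illustration; the name `filter_supported` and its arguments are not the actual flashinfer API.

```python
def filter_supported(candidate_backends, supported):
    """Keep the heuristic's preference order but drop backends that are
    unavailable in this build (illustrative sketch, not the real helper)."""
    return tuple(b for b in candidate_backends if b in supported)
```

So if the heuristic prefers ("cutlass", "cudnn") but cuDNN is the only backend available, the caller still gets a usable fallback rather than a fixed, possibly unsupported ordering.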

Changes

flashinfer/gemm/gemm_base.py
  Modified _heuristic_func_mm_fp4 to distinguish compute capabilities (SM103 vs SM100) and adjust backend preference ordering based on GPU type when CUDA 13+ and cuDNN 9.15+ are detected; filters returned candidates to supported backends only.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes


Suggested reviewers

  • yongwww
  • nvmbreughe
  • jimmyzho


🚥 Pre-merge checks: 5 passed
  • Title check — Passed: The title accurately describes the main change: updating the mm_fp4 heuristic to prioritize CUTLASS over cuDNN specifically on SM103.
  • Description check — Passed: The description includes the problem context, affected hardware, the solution approach, and a link to the related issue, though it lacks explicit testing details.
  • Linked Issues check — Passed: The PR addresses the core objective from issue #2375 by implementing compute-capability-aware backend selection that prioritizes CUTLASS on SM103 for mm_fp4 operations.
  • Out of Scope Changes check — Passed: All changes are directly related to fixing the mm_fp4 backend selection heuristic for SM103, with no extraneous modifications.
  • Docstring Coverage — Passed: Docstring coverage is 100.00%, above the required threshold of 80.00%.



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the backend selection heuristic for FP4 GEMM operations to prioritize CUTLASS on SM103 GPUs and cuDNN on other modern GPUs (like SM100) when using recent CUDA and cuDNN versions. The changes correctly implement the desired performance optimization. I've included a suggestion to refactor the conditional logic for improved readability and conciseness.

Comment on lines 2559 to 2566
    if CUDNN_AVAILABLE and cuda_major >= 13 and cudnn.backend_version() >= 91500:
        if is_sm103:
            candidate_backends = ("cutlass", "cudnn")
        else:
            candidate_backends = ("cudnn", "cutlass")
    # Otherwise, prioritize cutlass
    else:
        candidate_backends = ("cutlass", "cudnn")
Contributor


Severity: medium

The conditional logic for selecting the candidate backends can be simplified. The current implementation has a nested if/else and an outer else where two branches produce the same result (("cutlass", "cudnn")). This can be refactored into a single if/else statement, making the condition for prioritizing cudnn more explicit and the code more concise.

    if CUDNN_AVAILABLE and cuda_major >= 13 and cudnn.backend_version() >= 91500 and not is_sm103:
        candidate_backends = ("cudnn", "cutlass")
    else:
        candidate_backends = ("cutlass", "cudnn")

@claude

claude bot commented Jan 22, 2026

Code Review for PR #2404

Summary

This PR updates the backend selection heuristic for mm_fp4 to prefer CUTLASS over cuDNN on SM103 (B300) GPUs, based on benchmark results showing CUTLASS is 10-15% faster on SM103 while cuDNN remains faster on SM100 (B200).

✅ Positive Aspects

  1. Well-motivated change: The PR addresses a real performance issue (#2375: Suboptimal mm_fp4 backend selection) with benchmark data to support the decision
  2. Clear documentation: The updated docstring clearly explains the logic and rationale
  3. Minimal scope: The change is focused and doesn't introduce unnecessary complexity
  4. Consistent pattern: Uses get_compute_capability() similar to other parts of the codebase
  5. Backward compatible: Doesn't change behavior for existing SM100 or pre-CUDA 13 environments

🔍 Observations and Suggestions

1. Consider other Blackwell variants (SM110, SM120, SM121)

The change only checks for SM103, but there are other Blackwell-family architectures in the codebase:

  • SM110 (mentioned in flashinfer/gemm/gemm_base.py:1039)
  • SM120/SM121 (extensively used throughout the file)

Question: Have these architectures been benchmarked for mm_fp4? If not benchmarked yet, the current code will treat them as SM100 (preferring cuDNN), which may or may not be optimal.

Suggestion: Consider adding a comment about untested architectures, e.g.:

    # Get compute capability to distinguish between SM100 (10.0) and SM103 (10.3)
    # Note: SM110/SM120/SM121 behavior not yet benchmarked, currently treated as SM100
    major, minor = get_compute_capability(a.device)
    is_sm103 = major == 10 and minor == 3

2. Alternative: Use _match_sm_version() helper

For consistency with other heuristics like _heuristic_func_bmm_fp8 (line 2841), consider:

    is_sm103 = _match_sm_version(a.device, ["103"])

This is slightly more concise and matches the pattern used elsewhere in the file.

3. Test coverage

There don't appear to be any tests for the heuristic function itself. While this is consistent with other heuristic functions in the codebase, consider:

  • Adding a unit test for _heuristic_func_mm_fp4 that verifies correct backend ordering for different SM versions
  • This would prevent regressions and document expected behavior

Example test structure:

    def test_mm_fp4_heuristic_sm103():
        if not is_sm103_available():
            pytest.skip("Requires SM103")
        # Test that cutlass comes before cudnn on SM103
        ...

4. Minor: Consider extracting compute capability check

The compute capability is retrieved but only used to check SM103. If you anticipate adding more SM-specific logic in the future, consider extracting it:

    major, minor = get_compute_capability(a.device)
    sm_version = major * 10 + minor  # 103 for SM103, 100 for SM100

    if sm_version == 103:
        candidate_backends = ("cutlass", "cudnn")
This mirrors the pattern in _match_sm_version() and makes future additions easier.

🔒 Security & Correctness

  • No security concerns: This is a performance optimization that doesn't affect correctness
  • Safe fallback: The heuristic returns a list that gets filtered by suitable_backends, so if CUTLASS isn't available, it will still fall back to cuDNN
  • No breaking changes: Existing behavior preserved for all other configurations

🎯 Performance Considerations

  • The change adds a minimal runtime check (compute capability lookup + comparison)
  • get_compute_capability() should be fast (device property query)
  • The performance impact is negligible compared to the GEMM operation itself

📝 Documentation

The docstring update is clear and includes the rationale. Consider also:

  • Adding the benchmark command/script to the commit message or issue for reproducibility
  • Documenting in CLAUDE.md if this represents a pattern for future architecture-specific optimizations

✅ Verdict

This is a solid PR that addresses a real performance issue with a targeted fix. The code is correct, well-documented, and follows existing patterns.

Recommendation: Approve with minor suggestions for future improvement (SM110/SM120/SM121 consideration and test coverage).


Generated with Claude Code

@bkryu
Collaborator Author

bkryu commented Jan 22, 2026

/bot run

@bkryu bkryu self-assigned this Jan 22, 2026
@flashinfer-bot
Collaborator

GitLab MR !258 has been created, and the CI pipeline #42305431 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #42305431: 9/20 passed

@yzh119 yzh119 merged commit bd0b27b into main on Jan 26, 2026
35 of 39 checks passed
@yzh119 yzh119 deleted the claude/issue-2375-20260122-2147 branch January 26, 2026 07:34


Development

Successfully merging this pull request may close these issues.

Suboptimal mm_fp4 backend selection

3 participants