
perf: mm_fp4 heuristic prioritizes CUTLASS over cuDNN on SM103 #2404

Merged
yzh119 merged 1 commit into main from claude/issue-2375-20260122-2147 on Jan 26, 2026

Conversation

@bkryu
Collaborator

@bkryu bkryu commented Jan 22, 2026

On SM103 (B300), CUTLASS outperforms cuDNN for FP4 GEMM operations, while on SM100 (B200), cuDNN is faster. This PR updates the backend selection heuristic to check compute capability and prefer CUTLASS on SM103 even with CUDA 13 and cuDNN 9.15+.

Fixes #2375

Generated with Claude Code

Summary by CodeRabbit

  • Performance
  • Enhanced backend selection for FP4 matrix operations on compatible GPU architectures when using CUDA 13+ and cuDNN 9.15 or later, improving performance on supported configurations.


On SM103 (B300), CUTLASS outperforms cuDNN for FP4 GEMM operations,
while on SM100 (B200), cuDNN is faster. This change updates the
backend selection heuristic to check compute capability and prefer
CUTLASS on SM103 even with CUDA 13 and cuDNN 9.15+.

Benchmark results show:
- SM103: CUTLASS ~10-15% faster than cuDNN
- SM100: cuDNN ~10-20% faster than CUTLASS

Fixes #2375

Co-authored-by: Brian K. Ryu <bkryu@users.noreply.github.com>
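
The described heuristic can be sketched as a small standalone function. This is a minimal sketch, not the actual flashinfer implementation: the function name, the flattened version/capability parameters, and the returned tuples are illustrative stand-ins for the real `_heuristic_func_mm_fp4` logic in flashinfer/gemm/gemm_base.py.

```python
def choose_fp4_backends(cc_major, cc_minor, cuda_major, cudnn_version, cudnn_available):
    """Sketch of the SM103-aware mm_fp4 backend ordering (illustrative, not the real API).

    cc_major/cc_minor: GPU compute capability (10.0 = SM100/B200, 10.3 = SM103/B300).
    cudnn_version: cuDNN backend version as an integer (9.15.0 -> 91500).
    """
    is_sm103 = cc_major == 10 and cc_minor == 3
    if cudnn_available and cuda_major >= 13 and cudnn_version >= 91500 and not is_sm103:
        # SM100 (B200) with CUDA 13+ and cuDNN 9.15+: cuDNN is faster, try it first
        return ("cudnn", "cutlass")
    # SM103 (B300), or older CUDA/cuDNN stacks: prefer CUTLASS
    return ("cutlass", "cudnn")
```

For example, a B300 (10.3) keeps CUTLASS first even on CUDA 13 with cuDNN 9.15, while a B200 (10.0) on the same stack gets cuDNN first.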
@gemini-code-assist
Contributor

Summary of Changes

Hello @bkryu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the FP4 GEMM backend selection logic to ensure optimal performance across different NVIDIA GPU architectures. It introduces a compute capability check to intelligently choose between CUTLASS and cuDNN, addressing performance discrepancies observed on SM103 (B300) and SM100 (B200) devices, particularly with newer CUDA and cuDNN versions. This change aims to maximize efficiency by leveraging the best-performing library for each specific hardware configuration.

Highlights

  • Backend Selection Heuristic Update: The _heuristic_func_mm_fp4 function has been updated to dynamically select the optimal GEMM backend (CUTLASS or cuDNN) for FP4 operations based on the GPU's compute capability.
  • Performance Optimization for SM103 (B300): On SM103 (B300) GPUs, CUTLASS is now prioritized over cuDNN for FP4 GEMM, as benchmarks indicate superior performance for this architecture, even with CUDA 13 and cuDNN 9.15+.
  • Performance Optimization for SM100 (B200): For SM100 (B200) GPUs, cuDNN remains the preferred backend for FP4 GEMM, ensuring continued optimal performance for that specific architecture.


@coderabbitai
Contributor

coderabbitai bot commented Jan 22, 2026

📝 Walkthrough

Walkthrough

The _heuristic_func_mm_fp4 backend selection heuristic in flashinfer/gemm/gemm_base.py now distinguishes between SM103 and SM100 GPUs when CUDA 13+ and cuDNN 9.15+ are detected, prioritizing CUTLASS on SM103 and cuDNN on SM100, and returns only supported backends instead of a fixed order.
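
The "returns only supported backends" part of the walkthrough can be sketched as a simple order-preserving filter. This is a hypothetical helper for illustration; the name `filter_supported` and its arguments are not the actual flashinfer API.

```python
def filter_supported(candidate_backends, supported):
    """Keep the heuristic's preference order but drop backends that are
    unavailable in this build (illustrative sketch, not the real helper)."""
    return tuple(b for b in candidate_backends if b in supported)
```

So if the heuristic prefers ("cutlass", "cudnn") but cuDNN is the only backend available, the caller still gets a usable fallback rather than a fixed, possibly unsupported ordering.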

Changes

flashinfer/gemm/gemm_base.py
  Modified _heuristic_func_mm_fp4 to distinguish compute capabilities (SM103 vs SM100) and adjust backend preference ordering based on GPU type when CUDA 13+ and cuDNN 9.15+ are detected; filters returned candidates to supported backends only.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes


Suggested reviewers

  • yongwww
  • nvmbreughe
  • jimmyzho


🚥 Pre-merge checks: 5 passed
  • Title check — Passed: The title accurately describes the main change: updating the mm_fp4 heuristic to prioritize CUTLASS over cuDNN specifically on SM103.
  • Description check — Passed: The description includes the problem context, affected hardware, the solution approach, and a link to the related issue, though it lacks explicit testing details.
  • Linked Issues check — Passed: The PR addresses the core objective from issue #2375 by implementing compute-capability-aware backend selection that prioritizes CUTLASS on SM103 for mm_fp4 operations.
  • Out of Scope Changes check — Passed: All changes are directly related to fixing the mm_fp4 backend selection heuristic for SM103, with no extraneous modifications.
  • Docstring Coverage — Passed: Docstring coverage is 100.00%, above the required threshold of 80.00%.



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the backend selection heuristic for FP4 GEMM operations to prioritize CUTLASS on SM103 GPUs and cuDNN on other modern GPUs (like SM100) when using recent CUDA and cuDNN versions. The changes correctly implement the desired performance optimization. I've included a suggestion to refactor the conditional logic for improved readability and conciseness.

Comment on lines 2559 to 2566
    if CUDNN_AVAILABLE and cuda_major >= 13 and cudnn.backend_version() >= 91500:
        if is_sm103:
            candidate_backends = ("cutlass", "cudnn")
        else:
            candidate_backends = ("cudnn", "cutlass")
    # Otherwise, prioritize cutlass
    else:
        candidate_backends = ("cutlass", "cudnn")
Contributor


Severity: medium

The conditional logic for selecting the candidate backends can be simplified. The current implementation has a nested if/else and an outer else where two branches produce the same result (("cutlass", "cudnn")). This can be refactored into a single if/else statement, making the condition for prioritizing cudnn more explicit and the code more concise.

    if CUDNN_AVAILABLE and cuda_major >= 13 and cudnn.backend_version() >= 91500 and not is_sm103:
        candidate_backends = ("cudnn", "cutlass")
    else:
        candidate_backends = ("cutlass", "cudnn")

@claude

claude bot commented Jan 22, 2026

Code Review for PR #2404

Summary

This PR updates the backend selection heuristic for mm_fp4 to prefer CUTLASS over cuDNN on SM103 (B300) GPUs, based on benchmark results showing CUTLASS is 10-15% faster on SM103 while cuDNN remains faster on SM100 (B200).

✅ Positive Aspects

  1. Well-motivated change: The PR addresses a real performance issue (#2375: Suboptimal mm_fp4 backend selection) with benchmark data to support the decision
  2. Clear documentation: The updated docstring clearly explains the logic and rationale
  3. Minimal scope: The change is focused and doesn't introduce unnecessary complexity
  4. Consistent pattern: Uses get_compute_capability() similar to other parts of the codebase
  5. Backward compatible: Doesn't change behavior for existing SM100 or pre-CUDA 13 environments

🔍 Observations and Suggestions

1. Consider other Blackwell variants (SM110, SM120, SM121)

The change only checks for SM103, but there are other Blackwell-family architectures in the codebase:

  • SM110 (mentioned in flashinfer/gemm/gemm_base.py:1039)
  • SM120/SM121 (extensively used throughout the file)

Question: Have these architectures been benchmarked for mm_fp4? If not benchmarked yet, the current code will treat them as SM100 (preferring cuDNN), which may or may not be optimal.

Suggestion: Consider adding a comment about untested architectures, e.g.:

    # Get compute capability to distinguish between SM100 (10.0) and SM103 (10.3)
    # Note: SM110/SM120/SM121 behavior not yet benchmarked, currently treated as SM100
    major, minor = get_compute_capability(a.device)
    is_sm103 = major == 10 and minor == 3

2. Alternative: Use _match_sm_version() helper

For consistency with other heuristics like _heuristic_func_bmm_fp8 (line 2841), consider:

    is_sm103 = _match_sm_version(a.device, ["103"])

This is slightly more concise and matches the pattern used elsewhere in the file.

3. Test coverage

There don't appear to be any tests for the heuristic function itself. While this is consistent with other heuristic functions in the codebase, consider:

  • Adding a unit test for _heuristic_func_mm_fp4 that verifies correct backend ordering for different SM versions
  • This would prevent regressions and document expected behavior

Example test structure:

    def test_mm_fp4_heuristic_sm103():
        if not is_sm103_available():
            pytest.skip("Requires SM103")
        # Test that cutlass comes before cudnn on SM103
        ...

4. Minor: Consider extracting compute capability check

The compute capability is retrieved but only used to check SM103. If you anticipate adding more SM-specific logic in the future, consider extracting it:

    major, minor = get_compute_capability(a.device)
    sm_version = major * 10 + minor  # 103 for SM103, 100 for SM100

    if sm_version == 103:
        candidate_backends = ("cutlass", "cudnn")
This mirrors the pattern in _match_sm_version() and makes future additions easier.

🔒 Security & Correctness

  • No security concerns: This is a performance optimization that doesn't affect correctness
  • Safe fallback: The heuristic returns a list that gets filtered by suitable_backends, so if CUTLASS isn't available, it will still fall back to cuDNN
  • No breaking changes: Existing behavior preserved for all other configurations

🎯 Performance Considerations

  • The change adds a minimal runtime check (compute capability lookup + comparison)
  • get_compute_capability() should be fast (device property query)
  • The performance impact is negligible compared to the GEMM operation itself

📝 Documentation

The docstring update is clear and includes the rationale. Consider also:

  • Adding the benchmark command/script to the commit message or issue for reproducibility
  • Documenting in CLAUDE.md if this represents a pattern for future architecture-specific optimizations

✅ Verdict

This is a solid PR that addresses a real performance issue with a targeted fix. The code is correct, well-documented, and follows existing patterns.

Recommendation: Approve with minor suggestions for future improvement (SM110/SM120/SM121 consideration and test coverage).


Generated with Claude Code

@bkryu
Collaborator Author

bkryu commented Jan 22, 2026

/bot run

@bkryu bkryu self-assigned this Jan 22, 2026
@flashinfer-bot
Collaborator

GitLab MR !258 has been created, and the CI pipeline #42305431 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #42305431: 9/20 passed

@yzh119 yzh119 merged commit bd0b27b into main on Jan 26, 2026
35 of 39 checks passed
@yzh119 yzh119 deleted the claude/issue-2375-20260122-2147 branch January 26, 2026 07:34


Development

Successfully merging this pull request may close these issues.

Suboptimal mm_fp4 backend selection

3 participants