perf: mm_fp4 heuristic prioritizes CUTLASS over cuDNN on SM103#2404
Conversation
On SM103 (B300), CUTLASS outperforms cuDNN for FP4 GEMM operations, while on SM100 (B200), cuDNN is faster. This change updates the backend selection heuristic to check compute capability and prefer CUTLASS on SM103 even with CUDA 13 and cuDNN 9.15+.

Benchmark results show:
- SM103: CUTLASS ~10-15% faster than cuDNN
- SM100: cuDNN ~10-20% faster than CUTLASS

Fixes #2375

Co-authored-by: Brian K. Ryu <bkryu@users.noreply.github.com>
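As a hedged sketch of the heuristic described above (the function name and argument shapes are illustrative, not the PR's actual signature; only `candidate_backends` and the threshold values come from the diff), the selection logic amounts to:

```python
def choose_fp4_backends(cuda_major, cudnn_version, cc_major, cc_minor,
                        cudnn_available=True):
    """Return FP4 GEMM backend candidates in priority order (illustrative)."""
    # SM103 (B300) is compute capability 10.3; SM100 (B200) is 10.0
    is_sm103 = cc_major == 10 and cc_minor == 3
    # cuDNN is preferred only on non-SM103 parts with CUDA 13+ and cuDNN 9.15+
    if cudnn_available and cuda_major >= 13 and cudnn_version >= 91500 and not is_sm103:
        return ("cudnn", "cutlass")
    # Otherwise CUTLASS leads, including on SM103 where it benchmarks faster
    return ("cutlass", "cudnn")
```

The tuple order matters: the first entry is tried first, with the second as fallback.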
Summary of Changes

This pull request refines the FP4 GEMM backend selection logic to ensure optimal performance across different NVIDIA GPU architectures. It introduces a compute capability check to choose between CUTLASS and cuDNN, addressing performance discrepancies observed on SM103 (B300) and SM100 (B200) devices, particularly with newer CUDA and cuDNN versions. This change leverages the best-performing library for each specific hardware configuration.
🚥 Pre-merge checks: 5 passed
Code Review
This pull request updates the backend selection heuristic for FP4 GEMM operations to prioritize CUTLASS on SM103 GPUs and cuDNN on other modern GPUs (like SM100) when using recent CUDA and cuDNN versions. The changes correctly implement the desired performance optimization. I've included a suggestion to refactor the conditional logic for improved readability and conciseness.
```diff
 if CUDNN_AVAILABLE and cuda_major >= 13 and cudnn.backend_version() >= 91500:
-    candidate_backends = ("cudnn", "cutlass")
+    if is_sm103:
+        candidate_backends = ("cutlass", "cudnn")
+    else:
+        candidate_backends = ("cudnn", "cutlass")
 # Otherwise, prioritize cutlass
 else:
     candidate_backends = ("cutlass", "cudnn")
```
The conditional logic for selecting the candidate backends can be simplified. The current implementation has a nested if/else and an outer else where two branches produce the same result (("cutlass", "cudnn")). This can be refactored into a single if/else statement, making the condition for prioritizing cudnn more explicit and the code more concise.
```python
if CUDNN_AVAILABLE and cuda_major >= 13 and cudnn.backend_version() >= 91500 and not is_sm103:
    candidate_backends = ("cudnn", "cutlass")
else:
    candidate_backends = ("cutlass", "cudnn")
```
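One way to convince yourself the flattened form is safe: exhaustively compare both variants over their two boolean inputs. The `cudnn_ok` flag below is a hypothetical stand-in for the full `CUDNN_AVAILABLE and cuda_major >= 13 and cudnn.backend_version() >= 91500` predicate:

```python
from itertools import product

def nested(cudnn_ok, is_sm103):
    # Shape of the code in the diff: nested if/else
    if cudnn_ok:
        if is_sm103:
            return ("cutlass", "cudnn")
        else:
            return ("cudnn", "cutlass")
    else:
        return ("cutlass", "cudnn")

def flattened(cudnn_ok, is_sm103):
    # Shape of the suggested refactor: single condition
    if cudnn_ok and not is_sm103:
        return ("cudnn", "cutlass")
    return ("cutlass", "cudnn")

# Both forms agree on all four input combinations
assert all(nested(a, b) == flattened(a, b)
           for a, b in product((True, False), repeat=2))
```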
Code Review for PR #2404

Summary

This PR updates the backend selection heuristic for FP4 GEMM to prefer CUTLASS on SM103.

✅ Positive Aspects
🔍 Observations and Suggestions

1. Consider other Blackwell variants (SM110, SM120, SM121)

The change only checks for SM103, but there are other Blackwell-family architectures in the codebase.
Question: Have these architectures been benchmarked for FP4 GEMM?

Suggestion: Consider adding a comment about untested architectures, e.g.:

```python
# Get compute capability to distinguish between SM100 (10.0) and SM103 (10.3)
# Note: SM110/SM120/SM121 behavior not yet benchmarked, currently treated as SM100
major, minor = get_compute_capability(a.device)
is_sm103 = major == 10 and minor == 3
```

2. Alternative: Use
/bot run |
[FAILED] Pipeline #42305431: 9/20 passed |