fix: compile flags for trtllm fmha_v2 #2175
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Suppressed two runtime prints in the FMHA code generator, tightened NVCC supported major versions to only 12 and added an NVCC flag, added an early device-capability ValueError guard to prefill, and added a test runtime skip when SM120a is unsupported.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20–30 minutes
Summary of Changes

Hello @jimmyzho, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the compatibility of the `trtllm_fmha_v2` module with Ampere and Hopper architectures.
Code Review
This pull request adds support for Ampere and Hopper architectures for the trtllm_fmha_v2 module by updating the supported major versions for nvcc and adding a flag to suppress warnings about deprecated GPU targets. The changes look correct and align with the goal of extending hardware support. I've also noticed that some print statements in the kernel generation utility have been commented out. While this reduces verbosity, I've suggested using Python's logging module as a more maintainable and flexible approach for controlling debug output.
```diff
  # print('Running command "{}" to build "bin/print_traits.exe":'.format(" ".join(cmd)))
  process = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
  output, error = process.communicate()
- print('Running "bin/print_traits.exe":')
+ # print('Running "bin/print_traits.exe":')
```
Instead of commenting out these print statements, consider using the logging module. This allows for more flexible control over verbosity (e.g., via log levels like INFO or DEBUG) and is a better practice for maintainability. The information about the commands being run is valuable for debugging the kernel generation process.
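For illustration, a minimal sketch of that suggestion; the `run_print_traits` wrapper is hypothetical and only mirrors the snippet above:

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

def run_print_traits(cmd):
    # DEBUG level keeps routine builds quiet while preserving the exact
    # command line for anyone diagnosing kernel-generation problems.
    logger.debug('Running command "%s" to build "bin/print_traits.exe"', " ".join(cmd))
    process = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    output, error = process.communicate()
    logger.debug('Running "bin/print_traits.exe"')
    return output, error
```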
Actionable comments posted: 0
🧹 Nitpick comments (2)
flashinfer/jit/attention/fmha_v2/generator_utils.py (1)
3714-3718: Silencing debug prints in `generate_files` is reasonable. Commenting out these two debug prints keeps the generator's stdout clean without changing behavior.
flashinfer/jit/attention/modules.py (1)
1726-1728: Verify NVCC compatibility after widening supported majors and adding `-Wno-deprecated-gpu-targets`

Both `gen_fmha_cutlass_sm100a_module` and `gen_trtllm_fmha_v2_module` now accept NVCC major versions 8–12, and the TRT-LLM FMHA path additionally always appends `-Wno-deprecated-gpu-targets`. That’s aligned with the goal of broader support and silencing deprecation noise on newer toolchains, but it’s worth double-checking that:

- `current_compilation_context.get_nvcc_flags_list(supported_major_versions=[8, 9, 10, 11, 12])` will never select an NVCC that (a) doesn’t understand `-Wno-deprecated-gpu-targets` or (b) can’t compile the targeted SMs (Ampere/Hopper/Blackwell) without failing the build, and
- If very old NVCCs (8/9) are still in play for some users, you either gate adding `-Wno-deprecated-gpu-targets` based on the detected NVCC major or constrain `supported_major_versions` here to the range you’ve actually validated for these kernels.

If `current_compilation_context` already enforces a minimum NVCC version for these architectures, then this change looks fine as-is; otherwise, a small guard around the warning-suppression flag would make this more robust.

Also applies to: 1903-1908
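If such a guard is wanted, it could stay small. This is only a sketch; the `nvcc_major` parameter is a stand-in for however `current_compilation_context` exposes the detected toolchain version, and the cutoff is an assumption:

```python
def nvcc_flags_with_optional_suppression(nvcc_major, base_flags):
    """Append -Wno-deprecated-gpu-targets only for toolchains validated to accept it."""
    flags = list(base_flags)
    # Assumed cutoff for illustration; the real minimum should match whatever
    # NVCC majors the project has actually tested with this flag.
    if nvcc_major >= 11:
        flags.append("-Wno-deprecated-gpu-targets")
    return flags
```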
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- flashinfer/jit/attention/fmha_v2/generator_utils.py (1 hunks)
- flashinfer/jit/attention/modules.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.7)
flashinfer/jit/attention/fmha_v2/generator_utils.py
3715-3715: subprocess call: check for execution of untrusted input
(S603)
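S603 is mostly informational here, since `cmd` is assembled from constants plus `CUDA_PATH`. Still, a hedged sketch of resolving the compiler path explicitly, using a hypothetical `resolve_nvcc` helper, would address the lint:

```python
import os
import shutil

def resolve_nvcc():
    # Prefer an explicit CUDA_PATH and verify the binary exists, so a bogus
    # environment value fails loudly here rather than inside Popen.
    if "CUDA_PATH" in os.environ:
        candidate = os.path.join(os.environ["CUDA_PATH"], "bin", "nvcc")
        if os.path.isfile(candidate):
            return candidate
    found = shutil.which("nvcc")
    if found is None:
        raise FileNotFoundError("nvcc not found via CUDA_PATH or PATH")
    return found
```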
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
/bot run
| if "CUDA_PATH" in os.environ: | ||
| cmd[0] = os.environ["CUDA_PATH"] + "/bin/" + cmd[0] | ||
| print('Running command "{}" to build "bin/print_traits.exe":'.format(" ".join(cmd))) | ||
| # print('Running command "{}" to build "bin/print_traits.exe":'.format(" ".join(cmd))) |
Are these changes relevant to the PR?
not really, just commenting these out from the original trtllm script to clean up the stdout
Actionable comments posted: 0
🧹 Nitpick comments (2)
tests/attention/test_fmha_v2_prefill_deepseek.py (1)
60-61: Good defensive test skip for unsupported hardware. The test correctly skips on devices that don't support SM120a. The skip message could optionally be more specific about the requirements (SM 12.0 + CUDA >= 12.8), but the current message is acceptable.
If you want to be more specific, consider:

```diff
- pytest.skip("fmha_v2_prefill_deepseek is only supported on SM120 GPUs.")
+ pytest.skip("fmha_v2_prefill_deepseek requires SM 12.0 GPU with CUDA >= 12.8")
```

flashinfer/prefill.py (1)
3606-3607: Good early validation for device capability. The device check correctly prevents execution on unsupported hardware. The error message could optionally be more specific about the full requirements.
Consider making the error message more informative about both the GPU architecture and CUDA version requirements:

```diff
- raise ValueError("fmha_v2_prefill_deepseek is only supported on SM120 GPUs.")
+ raise ValueError(
+     "fmha_v2_prefill_deepseek requires SM 12.0 GPU with CUDA >= 12.8. "
+     f"Current device: {query.device}"
+ )
```

Note: The static analysis hint about using custom exception classes (Ruff TRY003) is a style preference and not critical for this use case.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- flashinfer/jit/attention/modules.py (1 hunks)
- flashinfer/prefill.py (1 hunks)
- tests/attention/test_fmha_v2_prefill_deepseek.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- flashinfer/jit/attention/modules.py
🧰 Additional context used
🧬 Code graph analysis (2)
flashinfer/prefill.py (3)
- flashinfer/utils.py (1)
  - `is_sm120a_supported` (546-548)
- include/flashinfer/trtllm/common.h (1)
  - `device` (83-90)
- flashinfer/logits_processor/types.py (1)
  - `device` (119-123)
tests/attention/test_fmha_v2_prefill_deepseek.py (1)
- flashinfer/utils.py (1)
  - `is_sm120a_supported` (546-548)
🪛 Ruff (0.14.7)
flashinfer/prefill.py
3607-3607: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (2)
tests/attention/test_fmha_v2_prefill_deepseek.py (1)
8-8: LGTM! The import is correctly added and used in the test guard below.
flashinfer/prefill.py (1)
60-60: LGTM! The import is correctly added and used in the device capability check.
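Putting the two LGTM'd pieces together, the guarded call site in `flashinfer/prefill.py` presumably looks roughly like the sketch below; the surrounding signature is an assumption based on the review context, while the error message is taken from the diff above:

```python
from flashinfer.utils import is_sm120a_supported

def fmha_v2_prefill_deepseek(query, *args, **kwargs):
    # Fail fast on unsupported hardware instead of surfacing a confusing
    # JIT/compile error deep inside kernel generation.
    if not is_sm120a_supported(query.device):
        raise ValueError("fmha_v2_prefill_deepseek is only supported on SM120 GPUs.")
    ...
```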
/bot run |
📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

Reviewer Notes

Summary by CodeRabbit

Chores
- Removed noisy runtime console prints during build/generation.
- Updated CUDA compiler requirements to target CUDA 12 and added a new compiler flag for compatibility.

Bug Fixes
- Added an early check that raises a clear error on unsupported GPU devices (SM120a), preventing misruns.

Tests
- Test now skips automatically when the required SM120a GPU support is not present.