[CUDA] Introduce simulated load/store 256bits access for CUDA compatibility #1656
LeiWang1999 merged 2 commits into tile-ai:main
Conversation
…ty (tile-ai#1652) Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA versions above 12.9 and earlier versions. This change ensures that 256-bit loads and stores are handled correctly across different CUDA versions, improving performance and compatibility. The implementation now uses two 128-bit loads/stores for older versions, enhancing the robustness of the codebase.
Clarified comments in `ld_global_256` and `st_global_256` functions to indicate that the fallback for CUDA versions below 12.9 may have performance regressions. This change enhances code readability and provides better context for developers working with different CUDA versions.
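For context, here is a minimal sketch of the pattern these commits describe. The function name `ld_global_256_sim`, its signature, and the exact PTX mnemonic are assumptions for illustration, not the literal code from `copy_sm100.h`; native 256-bit PTX loads additionally require a recent PTX ISA and an SM100-class target.

```cpp
#include <cstdint>

// Illustrative version-gated 256-bit global load (hypothetical name and
// signature). Assumes src/dst are at least 16-byte aligned.
__device__ __forceinline__ void ld_global_256_sim(void *dst,
                                                  const void *src) {
#if (__CUDACC_VER_MAJOR__ > 12) ||                                          \
    (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 9)
  // CUDA 12.9+: one native 256-bit access; the walkthrough describes this
  // as a 4x64-bit vector instruction (assumed mnemonic: ld.global.v4.b64).
  uint64_t v0, v1, v2, v3;
  asm volatile("ld.global.v4.b64 {%0, %1, %2, %3}, [%4];"
               : "=l"(v0), "=l"(v1), "=l"(v2), "=l"(v3)
               : "l"(src));
  uint64_t *out = static_cast<uint64_t *>(dst);
  out[0] = v0;
  out[1] = v1;
  out[2] = v2;
  out[3] = v3;
#else
  // Pre-12.9 fallback: simulate the 256-bit load with two 128-bit vector
  // loads at byte offsets 0 and 16.
  const uint4 *in = static_cast<const uint4 *>(src);
  uint4 *out = static_cast<uint4 *>(dst);
  out[0] = in[0]; // bytes 0..15
  out[1] = in[1]; // bytes 16..31
#endif
}
```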
📝 Walkthrough

This change adds CUDA version-conditional compilation to the 256-bit load/store operations in the SM100 copy template. For CUDA 12.9+, optimized 4×64-bit vector operations are used; for earlier versions, the code falls back to 2×128-bit operations, maintaining backward compatibility while enabling performance improvements on newer toolchains.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/tl_templates/cuda/copy_sm100.h (1)
11-12: Consider extracting the version check into a reusable macro.

The same version check pattern is repeated 8 times throughout the file. While this works correctly, a single macro definition would reduce the maintenance burden and the potential for inconsistency if the version threshold changes.
♻️ Suggested refactor
Add at the top of the file (after includes):
```cpp
// CUDA 12.9+ supports native 256-bit vector load/store instructions
#define TL_CUDA_HAS_256BIT_LDST                                        \
  ((__CUDACC_VER_MAJOR__ > 12) ||                                      \
   (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 9))
```

Then replace each occurrence with:
```diff
-#if (__CUDACC_VER_MAJOR__ > 12) ||                                    \
-    (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 9)
+#if TL_CUDA_HAS_256BIT_LDST
```

Also applies to: 32-33, 54-55, 74-75, 94-95, 116-117, 136-137, 157-158
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/tl_templates/cuda/copy_sm100.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (1)
src/tl_templates/cuda/copy_sm100.h (1)
9-27: LGTM!

The version check and fallback implementation are correct. The use of two 128-bit vector loads at appropriate byte offsets (0 and 16) correctly emulates the 256-bit load for older CUDA versions.
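To make those byte offsets concrete, a store-side emulation could look something like the sketch below. The name and signature are assumptions, not the code under review; `st.global.v4.b32` is the standard 128-bit PTX vector store.

```cpp
// Hypothetical pre-12.9 emulation of a 256-bit global store: two 128-bit
// PTX stores at byte offsets 0 and 16 of the destination.
__device__ __forceinline__ void st_global_256_sim(void *gmem,
                                                  const uint4 &lo,
                                                  const uint4 &hi) {
  asm volatile("st.global.v4.b32 [%0], {%1, %2, %3, %4};"
               :
               : "l"(gmem), "r"(lo.x), "r"(lo.y), "r"(lo.z), "r"(lo.w)
               : "memory");
  asm volatile("st.global.v4.b32 [%0], {%1, %2, %3, %4};"
               :
               : "l"(static_cast<char *>(gmem) + 16), // second 128-bit half
                 "r"(hi.x), "r"(hi.y), "r"(hi.z), "r"(hi.w)
               : "memory");
}
```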
This pull request adds compatibility with CUDA versions earlier than 12.9 to the 256-bit load and store functions in copy_sm100.h. The code now uses two 128-bit operations as a fallback on older CUDA versions, ensuring broader compatibility at the potential cost of some performance.

CUDA Version Compatibility:
- Updated the `ld_global_256` and `st_global_256` functions to detect the CUDA version and choose the appropriate assembly instructions: for CUDA 12.9 and above, use native 256-bit operations; for earlier versions, fall back to two 128-bit operations.
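As a quick way to see which branch a given toolchain selects, a small probe like the following (illustrative only, compiled with nvcc) prints the decision:

```cpp
// Compile with nvcc, which defines __CUDACC_VER_MAJOR__/__CUDACC_VER_MINOR__
// for both host and device compilation of a .cu file.
#include <cstdio>

int main() {
  const bool native_256 =
      (__CUDACC_VER_MAJOR__ > 12) ||
      (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 9);
  printf("nvcc %d.%d -> %s\n", __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__,
         native_256 ? "native 256-bit path" : "2x128-bit fallback");
  return 0;
}
```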
Summary by CodeRabbit

Release Notes