
[CUDA] Introduce simulated load/store 256bits access for CUDA compatibility #1656

Merged
LeiWang1999 merged 2 commits into tile-ai:main from LeiWang1999:fallback_0112
Jan 12, 2026

Conversation

@LeiWang1999
Member

@LeiWang1999 LeiWang1999 commented Jan 12, 2026

This pull request makes the 256-bit load and store functions in copy_sm100.h compatible with CUDA versions earlier than 12.9. On older toolkits the code now falls back to two 128-bit operations, ensuring broader compatibility at the potential cost of some performance.

CUDA Version Compatibility:

  • Added preprocessor checks to all ld_global_256 and st_global_256 functions so the appropriate assembly instructions are selected by CUDA version: 12.9 and above use the native 256-bit operations, while earlier versions fall back to two 128-bit operations.
  • Provided explicit comments and code paths for the fallback, clarifying that this may cause a performance regression on older CUDA versions.

Summary by CodeRabbit

Release Notes

  • Improvements
    • Added optimized global memory operations support for CUDA 12.9+ with backward compatibility for earlier CUDA toolkit versions.


…ty (tile-ai#1652)

Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA 12.9+ and earlier toolkit versions. 256-bit loads and stores are now handled correctly across CUDA versions: the implementation falls back to two 128-bit loads/stores on toolkits older than 12.9.
Clarified comments in `ld_global_256` and `st_global_256` to note that the pre-12.9 fallback may regress performance, giving developers better context when targeting different CUDA versions.
@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Jan 12, 2026

📝 Walkthrough

Walkthrough

This change adds CUDA version-conditional compilation to 256-bit load/store operations in the SM100 copy template. For CUDA 12.9+, optimized 4×64-bit vector operations are used; for earlier versions, the code falls back to 2×128-bit operations, maintaining backward compatibility while enabling performance improvements on newer toolchains.

Changes

Cohort / File(s) Summary
CUDA 12.9+ version gating for 256-bit operations
src/tl_templates/cuda/copy_sm100.h
Added conditional compilation branches selecting between optimized 4×64-bit vector load/store (CUDA ≥12.9) and fallback 2×128-bit operations (earlier versions). Pattern applied to ld_global_256 and st_global_256 for longlong4, ulonglong4, and generic 256-bit specializations. Includes explanatory comments for fallback behavior.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A hop through CUDA lanes so wide,
Twelve-nine or bust—we pick our stride!
Four sixty-fours dance in the sun,
Two one-twenty-eights when times are one.
Version gates keep all at play,
Backwards, forwards, come what may!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title directly and accurately describes the main change: introducing fallback 256-bit load/store operations for CUDA compatibility with older versions.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/tl_templates/cuda/copy_sm100.h (1)

11-12: Consider extracting the version check into a reusable macro.

The same version check pattern is repeated 8 times throughout the file. While this works correctly, a single macro definition would reduce maintenance burden and potential for inconsistency if the version threshold changes.

♻️ Suggested refactor

Add at the top of the file (after includes):

// CUDA 12.9+ supports native 256-bit vector load/store instructions
#define TL_CUDA_HAS_256BIT_LDST                                                \
  ((__CUDACC_VER_MAJOR__ > 12) ||                                              \
   (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 9))

Then replace each occurrence with:

-#if (__CUDACC_VER_MAJOR__ > 12) ||                                             \
-    (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 9)
+#if TL_CUDA_HAS_256BIT_LDST

Also applies to: 32-33, 54-55, 74-75, 94-95, 116-117, 136-137, 157-158

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e347e3 and 1e83b10.

📒 Files selected for processing (1)
  • src/tl_templates/cuda/copy_sm100.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (1)
src/tl_templates/cuda/copy_sm100.h (1)

9-27: LGTM!

The version check and fallback implementation are correct. The use of two 128-bit vector loads at appropriate byte offsets (0 and 16) correctly emulates the 256-bit load for older CUDA versions.

@LeiWang1999 LeiWang1999 merged commit fd260e3 into tile-ai:main Jan 12, 2026
6 of 7 checks passed
