
[CUDA] Introduce simulated load/store 256bits access for CUDA compatibility #1656

Merged
LeiWang1999 merged 2 commits into tile-ai:main from LeiWang1999:fallback_0112
Jan 12, 2026

Conversation

@LeiWang1999
Member

@LeiWang1999 LeiWang1999 commented Jan 12, 2026

This pull request makes the 256-bit load and store functions in copy_sm100.h compatible with CUDA versions earlier than 12.9. On older toolkits the code now falls back to two 128-bit operations, ensuring broader compatibility at the potential cost of some performance.

CUDA Version Compatibility:

  • Added preprocessor checks to all ld_global_256 and st_global_256 functions so the appropriate assembly instructions are selected by CUDA version: 12.9 and above use the native 256-bit operations, while earlier versions fall back to two 128-bit operations.
  • Provided explicit comments and code paths for the fallback, clarifying that this may cause a performance regression on older CUDA versions.

Summary by CodeRabbit

Release Notes

  • Improvements
    • Added optimized global memory operations support for CUDA 12.9+ with backward compatibility for earlier CUDA toolkit versions.


…ty (tile-ai#1652)

Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA 12.9+ and earlier toolkit versions. 256-bit loads and stores are now handled correctly across CUDA versions: the implementation falls back to two 128-bit loads/stores on toolkits older than 12.9.
Clarified comments in `ld_global_256` and `st_global_256` to note that the pre-12.9 fallback may regress performance, giving developers better context when targeting different CUDA versions.
@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Jan 12, 2026

📝 Walkthrough

Walkthrough

This change adds CUDA version-conditional compilation to 256-bit load/store operations in the SM100 copy template. For CUDA 12.9+, optimized 4×64-bit vector operations are used; for earlier versions, the code falls back to 2×128-bit operations, maintaining backward compatibility while enabling performance improvements on newer toolchains.

Changes

Cohort / File(s) Summary
CUDA 12.9+ version gating for 256-bit operations
src/tl_templates/cuda/copy_sm100.h
Added conditional compilation branches selecting between optimized 4×64-bit vector load/store (CUDA ≥12.9) and fallback 2×128-bit operations (earlier versions). Pattern applied to ld_global_256 and st_global_256 for longlong4, ulonglong4, and generic 256-bit specializations. Includes explanatory comments for fallback behavior.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A hop through CUDA lanes so wide,
Twelve-nine or bust—we pick our stride!
Four sixty-fours dance in the sun,
Two one-twenty-eights when times are one.
Version gates keep all at play,
Backwards, forwards, come what may!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title directly and accurately describes the main change: introducing fallback 256-bit load/store operations for CUDA compatibility with older versions.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/tl_templates/cuda/copy_sm100.h (1)

11-12: Consider extracting the version check into a reusable macro.

The same version check pattern is repeated 8 times throughout the file. While this works correctly, a single macro definition would reduce maintenance burden and potential for inconsistency if the version threshold changes.

♻️ Suggested refactor

Add at the top of the file (after includes):

// CUDA 12.9+ supports native 256-bit vector load/store instructions
#define TL_CUDA_HAS_256BIT_LDST                                                \
  ((__CUDACC_VER_MAJOR__ > 12) ||                                              \
   (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 9))

Then replace each occurrence with:

-#if (__CUDACC_VER_MAJOR__ > 12) ||                                             \
-    (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 9)
+#if TL_CUDA_HAS_256BIT_LDST

Also applies to: 32-33, 54-55, 74-75, 94-95, 116-117, 136-137, 157-158

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e347e3 and 1e83b10.

📒 Files selected for processing (1)
  • src/tl_templates/cuda/copy_sm100.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (1)
src/tl_templates/cuda/copy_sm100.h (1)

9-27: LGTM!

The version check and fallback implementation are correct. The use of two 128-bit vector loads at appropriate byte offsets (0 and 16) correctly emulates the 256-bit load for older CUDA versions.

@LeiWang1999 LeiWang1999 merged commit fd260e3 into tile-ai:main Jan 12, 2026
6 of 7 checks passed
