[BugFix] Complete vectorized loading for common dtypes #1536
Conversation
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run […]. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
📝 Walkthrough

The changes enable 8-lane CUDA vectorization by removing a 4-lane assertion limit in code generation, adding helper functions for packing unsigned values into vectorized types, and introducing parametrized tests that validate vectorization across multiple data types and vectorization factors.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Pull request overview
This PR adds comprehensive testing for vectorized memory loading operations across common data types and implements the necessary packing functions to support these operations. The changes ensure that vectorized loading works correctly for various integer and float8 data types with different vectorization widths.
- Adds parametrized tests covering 11 common dtypes (uint8-64, int8-64, float8 variants) with vectorization factors of 1, 2, 4, and 8
- Implements new packing functions (make_uint, make_uint2, make_uint4 overloads) to handle vectorized operations for unsigned integer types (an illustrative sketch follows this list)
- Relaxes the ramp node lane limit check to accommodate wider vectorization (temporarily commented out with a TODO for future safety improvements)
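The actual helper signatures live in src/tl_templates/cuda/common.h; as a rough illustration of the packing pattern described above, here is a hypothetical 8-lane byte-packing helper. The name `pack_u8x8` and the `__device__ __forceinline__` qualifiers are illustrative assumptions, not the PR's actual code.

```cpp
#include <cuda_runtime.h>  // for uint2 and make_uint2

// Hypothetical sketch: pack eight uint8 lanes into a uint2 (64 bits) in
// little-endian lane order, so an 8-lane vectorized access can be issued as
// a single 64-bit transaction. Each byte is widened to unsigned int before
// shifting to keep the arithmetic well defined.
__device__ __forceinline__ uint2 pack_u8x8(unsigned char x0, unsigned char x1,
                                           unsigned char x2, unsigned char x3,
                                           unsigned char x4, unsigned char x5,
                                           unsigned char x6, unsigned char x7) {
  unsigned int lo = (static_cast<unsigned int>(x3) << 24) |
                    (static_cast<unsigned int>(x2) << 16) |
                    (static_cast<unsigned int>(x1) << 8) |
                    static_cast<unsigned int>(x0);
  unsigned int hi = (static_cast<unsigned int>(x7) << 24) |
                    (static_cast<unsigned int>(x6) << 16) |
                    (static_cast<unsigned int>(x5) << 8) |
                    static_cast<unsigned int>(x4);
  return make_uint2(lo, hi);
}
```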
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| testing/python/language/test_tilelang_language_vectorize.py | Adds comprehensive parametrized test for vectorized loading across common dtypes with different vectorization factors |
| src/tl_templates/cuda/common.h | Implements packing functions for unsigned char and unsigned short types to support vectorized operations |
| src/target/codegen_cuda.cc | Relaxes ramp node lane check by commenting out the limit, allowing wider vectorization with a TODO for future constraint implementation |
```python
def test_vectorize_all_dtypes(dtype, vec_num):
    x = torch.empty((64,), dtype=dtype, device="cuda")
    kernel = vectorize_test_all_dtypes(dtype, vec_num)
    kernel(x)
```
Copilot AI · Dec 25, 2025
The test doesn't verify that the kernel executed correctly. It should check that the tensor values match the expected output (i + 1 for each index i). Currently, the test only verifies that the kernel runs without error, but doesn't validate the results.
```cpp
// TODO(chaofan): Comment the ramp lanes limit for now since we have
// LegalizeVectorizedLoop to automatically legalize vectorized loop whose
// width exceeds the limit. But we should add check here for safety in the
// future. The check should be aligned to certain bit width like 128bits or
// 256bits.

// CHECK_LE(lanes, 8) << "Translate Ramp Node " << tvm::ffi::GetRef<Ramp>(op)
//                    << "error: " << lanes << " exceeds max ramp lanes 8.";
```
Copilot AI · Dec 25, 2025
This TODO comment indicates that the commented-out CHECK is temporarily removed, which creates a safety issue. While the LegalizeVectorizedLoop pass may handle oversized vectorized loops, removing runtime checks without proper validation could allow invalid code to pass through. The TODO should either be addressed in this PR or tracked in a separate issue with specific bit-width constraints (128/256 bits) documented.
Suggested change:

```diff
-// TODO(chaofan): Comment the ramp lanes limit for now since we have
-// LegalizeVectorizedLoop to automatically legalize vectorized loop whose
-// width exceeds the limit. But we should add check here for safety in the
-// future. The check should be aligned to certain bit width like 128bits or
-// 256bits.
-// CHECK_LE(lanes, 8) << "Translate Ramp Node " << tvm::ffi::GetRef<Ramp>(op)
-//                    << "error: " << lanes << " exceeds max ramp lanes 8.";
+// Enforce a maximum total vector width for safety. Even though
+// LegalizeVectorizedLoop is expected to legalize oversized vectorized loops,
+// we keep this runtime check to prevent emitting vectors wider than the
+// supported bit-width.
+//
+// Here we conservatively cap the total vector width at 256 bits.
+int max_vector_bits = 256;
+int dtype_bits = op->dtype.bits();
+if (dtype_bits > 0) {
+  int max_lanes = max_vector_bits / dtype_bits;
+  CHECK_LE(lanes, max_lanes)
+      << "Translate Ramp Node " << tvm::ffi::GetRef<Ramp>(op)
+      << " error: " << lanes << " exceeds max ramp lanes " << max_lanes
+      << " for element bit-width " << dtype_bits
+      << " (max total vector width " << max_vector_bits << " bits).";
+}
```
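As a quick illustration of the lane arithmetic in that proposed check (this assumes the 256-bit cap chosen in the suggestion above, which is the reviewer's choice rather than something the PR establishes), a compile-time sanity check:

```cpp
// Assumes the 256-bit cap from the suggestion above; the real limit should
// match whatever the target hardware and LegalizeVectorizedLoop support.
constexpr int MaxLanes(int dtype_bits, int max_vector_bits = 256) {
  return dtype_bits > 0 ? max_vector_bits / dtype_bits : 0;
}
static_assert(MaxLanes(8) == 32, "uint8/int8: up to 32 lanes");
static_assert(MaxLanes(16) == 16, "uint16/int16: up to 16 lanes");
static_assert(MaxLanes(32) == 8, "uint32/int32: up to 8 lanes");
static_assert(MaxLanes(64) == 4, "uint64/int64: up to 4 lanes");
```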
Actionable comments posted: 2
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/target/codegen_cuda.cc
- src/tl_templates/cuda/common.h
- testing/python/language/test_tilelang_language_vectorize.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/language/test_tilelang_language_vectorize.py
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_vectorize.py (2)
- tilelang/language/kernel.py (1): threads (214-218)
- tilelang/language/v2/dtypes.py (2): uint8 (271-271), int8 (243-243)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Agent
- GitHub Check: CodeQL analysis (python)
🔇 Additional comments (2)
src/tl_templates/cuda/common.h (1)
140-181: LGTM! Well-structured packing functions for unsigned types. The new `make_uint` overloads correctly mirror the signed variants and implement proper little-endian packing. The implementation is consistent with the existing codebase patterns.
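For readers without the diff open, here is a minimal sketch of what such an unsigned-short overload typically looks like. This is an illustrative shape only, not the verbatim common.h code; the name `pack_u16x2` and the `__device__ __forceinline__` qualifiers (standing in for the repo's TL_DEVICE macro) are assumptions.

```cpp
// Illustrative little-endian packing of two uint16 lanes into one 32-bit
// value: lane 0 goes to the low half-word, lane 1 to the high half-word.
__device__ __forceinline__ unsigned int pack_u16x2(unsigned short x0,
                                                   unsigned short x1) {
  return (static_cast<unsigned int>(x1) << 16) | static_cast<unsigned int>(x0);
}
```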
testing/python/language/test_tilelang_language_vectorize.py (1)

127-142: Consider consistency with hardware requirements for float8 types. The test includes `float8_e4m3fn`, `float8_e5m2`, and `float8_e8m0fnu` dtypes without hardware capability checks. While other tests in the codebase (e.g., in testing/python/jit/) explicitly check compute capability for Hopper GPUs, float8 tests throughout the suite (testing/python/kernel/, testing/python/tilelibrary/) lack explicit skipif markers. Either:

- Add a `pytest.mark.skipif` decorator checking GPU compute capability for float8 types (matching patterns in jit tests), or
- Document that these tests require Ada/Hopper hardware and rely on runtime errors for detection.
```python
@tilelang.jit
def vectorize_test_all_dtypes(dtype, vec_num):
    @T.prim_func
    def main(A: T.Tensor[(64,), dtype]):
        with T.Kernel(1, threads=256):
            for i in T.vectorized(vec_num):
                A[i] = T.cast(i + 1, dtype)

    return main


@pytest.mark.parametrize(
    "dtype",
    [
        torch.uint8,
        torch.uint16,
        torch.uint32,
        torch.uint64,
        torch.int8,
        torch.int16,
        torch.int32,
        torch.int64,
        torch.float8_e4m3fn,
        torch.float8_e5m2,
        torch.float8_e8m0fnu,
    ],
)
@pytest.mark.parametrize("vec_num", [1, 2, 4, 8])
def test_vectorize_all_dtypes(dtype, vec_num):
    x = torch.empty((64,), dtype=dtype, device="cuda")
    kernel = vectorize_test_all_dtypes(dtype, vec_num)
    kernel(x)
```
🛠️ Refactor suggestion | 🟠 Major
Add output verification to the vectorization test.
The test currently only verifies that the kernel executes without crashing, but doesn't validate the correctness of the vectorized memory operations. This significantly reduces the test's effectiveness.
🔎 Proposed enhancement to verify output correctness
```diff
 @pytest.mark.parametrize("vec_num", [1, 2, 4, 8])
 def test_vectorize_all_dtypes(dtype, vec_num):
     x = torch.empty((64,), dtype=dtype, device="cuda")
     kernel = vectorize_test_all_dtypes(dtype, vec_num)
     kernel(x)
+
+    # Verify the kernel wrote the expected values
+    expected = torch.arange(1, vec_num + 1, dtype=dtype, device="cuda")
+    torch.testing.assert_close(x[:vec_num], expected, atol=0, rtol=0)
```

This ensures the vectorized writes are actually working correctly for all data types and vectorization widths.
🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_vectorize.py around lines 116-148, the test calls the kernel but does not assert that the buffer contents are correct; after the kernel(x) run, construct the expected tensor (values 1..64 cast to dtype on CUDA), then verify x matches expected: for integer dtypes use exact equality, for floating dtypes use torch.testing.assert_close with small atol/rtol; ensure comparisons run on the same device and handle any dtype-specific quirks (use torch.arange(1, 65, dtype=dtype, device='cuda') and cast as needed) so the test fails if vectorized writes produced incorrect results.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```cpp
TL_DEVICE unsigned int make_uint(unsigned char x0, unsigned char x1,
                                 unsigned char x2, unsigned char x3) {
  return (x3 << 24) | (x2 << 16) | (x1 << 8) | x0;
}
```
Prevent UB when packing uint8 bytes into uint
The shifts in make_uint operate on int because unsigned char is promoted, so when any input byte is ≥128 the x3 << 24 (or other shifts) overflows a signed 32‑bit int, which is undefined behavior in C++. That can mispack vectorized uint8 data on device for values with the high bit set. Cast each operand to unsigned int (or uint32_t) before shifting to make the packing defined.
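A minimal sketch of the kind of fix Codex suggests, keeping the quoted signature and TL_DEVICE qualifier from common.h; whether the PR adopts exactly this form is up to the authors.

```cpp
// Widen each byte to unsigned int before shifting so the packing is done in
// unsigned 32-bit arithmetic, which stays well defined even when x3 >= 128.
TL_DEVICE unsigned int make_uint(unsigned char x0, unsigned char x1,
                                 unsigned char x2, unsigned char x3) {
  return (static_cast<unsigned int>(x3) << 24) |
         (static_cast<unsigned int>(x2) << 16) |
         (static_cast<unsigned int>(x1) << 8) |
         static_cast<unsigned int>(x0);
}
```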
This PR adds complete testing for vectorized memory loading across common dtypes and vectorization widths, and it adds some packing functions in `tl_templates` so that all these cases can pass the test.

Summary by CodeRabbit
New Features
Bug Fixes
Tests