Conversation

@SiriusNEO (Collaborator) commented Dec 25, 2025

This PR adds complete test coverage for vectorized memory loading over the following space:

common_dtypes x vec_bits
where
common_dtypes = [uint8, uint16, uint32, uint64, int8, int16, int32, int64, float8_e4m3fn, float8_e5m2, float8_e8m0fnu]
vec_bits = [1, 2, 4, 8]

It also adds packing functions in tl_templates so that all of these cases pass the test.

Summary by CodeRabbit

  • New Features

    • Added utility functions for packing values into vectorized types, enabling more flexible data structure construction on CUDA.
  • Bug Fixes

    • Removed excessive lane limitations on CUDA operations, allowing higher vectorization factors and improved performance potential.
  • Tests

    • Added comprehensive parametric tests for vectorization across multiple data types and vectorization factors.


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project so your changes are properly linted and formatted; this helps your contribution pass the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai bot (Contributor) commented Dec 25, 2025

📝 Walkthrough

The changes enable 8-lane CUDA vectorization by removing a 4-lane assertion limit in code generation, adding helper functions for packing unsigned values into vectorized types, and introducing parametric tests validating vectorization across multiple data types and vectorization factors.

Changes

  • CUDA Code Generation (src/target/codegen_cuda.cc): Removed the active runtime assertion enforcing ≤4 lanes for Ramp nodes; the commented-out alternative allows up to 8 lanes with deferred safety validation. Functional output generation is unchanged.
  • CUDA Vector Helpers (src/tl_templates/cuda/common.h): Added four overloaded make_uint/make_uint2/make_uint4 functions to pack 8-bit and 16-bit unsigned values into composite vectorized types via existing helpers and pointer casts.
  • Vectorization Tests (testing/python/language/test_tilelang_language_vectorize.py): Added a parametric test kernel and test function covering integer and float8 dtypes with vectorization factors [1, 2, 4, 8]; introduced a pytest import for parameterization.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested reviewers

  • LeiWang1999
  • kurisu6912

Poem

🐰 Eight lanes now flow where four once stood,
Pack those bytes as vectorized types should,
Helper functions weave with tested grace,
CUDA kernels zoom at quickened pace!

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 36.36%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check (❓ Inconclusive): The title references 'vectorized loading', but the main changes involve removing a runtime assertion, adding packing helper functions, and introducing comprehensive tests, so the PR is broader than just completing vectorized loading. Resolution: clarify whether the title should be more specific (e.g., mentioning both the assertion removal and the new packing functions) or whether it accurately represents the primary objective from the author's perspective.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.


Copilot AI (Contributor) left a comment


Pull request overview

This PR adds comprehensive testing for vectorized memory loading operations across common data types and implements the necessary packing functions to support these operations. The changes ensure that vectorized loading works correctly for various integer and float8 data types with different vectorization widths.

  • Adds parametrized tests covering 11 common dtypes (uint8-64, int8-64, float8 variants) with vectorization factors of 1, 2, 4, and 8
  • Implements new packing functions (make_uint, make_uint2, make_uint4 overloads) to handle vectorized operations for unsigned integer types
  • Relaxes the ramp node lane limit check to accommodate wider vectorization (temporarily commented out with a TODO for future safety improvements)

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • testing/python/language/test_tilelang_language_vectorize.py: Adds a comprehensive parametrized test for vectorized loading across common dtypes with different vectorization factors.
  • src/tl_templates/cuda/common.h: Implements packing functions for unsigned char and unsigned short types to support vectorized operations.
  • src/target/codegen_cuda.cc: Relaxes the ramp node lane check by commenting out the limit, allowing wider vectorization, with a TODO for a future constraint implementation.


Comment on lines +144 to +147
def test_vectorize_all_dtypes(dtype, vec_num):
    x = torch.empty((64,), dtype=dtype, device="cuda")
    kernel = vectorize_test_all_dtypes(dtype, vec_num)
    kernel(x)

Copilot AI Dec 25, 2025


The test doesn't verify that the kernel executed correctly. It should check that the tensor values match the expected output (i + 1 for each index i). Currently, the test only verifies that the kernel runs without error, but doesn't validate the results.

Comment on lines +2833 to +2840
// TODO(chaofan): Comment the ramp lanes limit for now since we have
// LegalizeVectorizedLoop to automatically legalize vectorized loop whose
// width exceeds the limit. But we should add check here for safety in the
// future. The check should be aligned to certain bit width like 128bits or
// 256bits.

// CHECK_LE(lanes, 8) << "Translate Ramp Node " << tvm::ffi::GetRef<Ramp>(op)
// << "error: " << lanes << " exceeds max ramp lanes 8.";

Copilot AI Dec 25, 2025


This TODO comment indicates that the commented-out CHECK is temporarily removed, which creates a safety issue. While the LegalizeVectorizedLoop pass may handle oversized vectorized loops, removing runtime checks without proper validation could allow invalid code to pass through. The TODO should either be addressed in this PR or tracked in a separate issue with specific bit-width constraints (128/256 bits) documented.

Suggested change
// TODO(chaofan): Comment the ramp lanes limit for now since we have
// LegalizeVectorizedLoop to automatically legalize vectorized loop whose
// width exceeds the limit. But we should add check here for safety in the
// future. The check should be aligned to certain bit width like 128bits or
// 256bits.
// CHECK_LE(lanes, 8) << "Translate Ramp Node " << tvm::ffi::GetRef<Ramp>(op)
// << "error: " << lanes << " exceeds max ramp lanes 8.";
// Enforce a maximum total vector width for safety. Even though
// LegalizeVectorizedLoop is expected to legalize oversized vectorized loops,
// we keep this runtime check to prevent emitting vectors wider than the
// supported bit-width.
//
// Here we conservatively cap the total vector width at 256 bits.
int max_vector_bits = 256;
int dtype_bits = op->dtype.bits();
if (dtype_bits > 0) {
  int max_lanes = max_vector_bits / dtype_bits;
  CHECK_LE(lanes, max_lanes)
      << "Translate Ramp Node " << tvm::ffi::GetRef<Ramp>(op)
      << " error: " << lanes << " exceeds max ramp lanes " << max_lanes
      << " for element bit-width " << dtype_bits
      << " (max total vector width " << max_vector_bits << " bits).";
}

coderabbitai bot (Contributor) left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3ce8ac9 and b8ec2dd.

📒 Files selected for processing (3)
  • src/target/codegen_cuda.cc
  • src/tl_templates/cuda/common.h
  • testing/python/language/test_tilelang_language_vectorize.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_vectorize.py
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_vectorize.py (2)
tilelang/language/kernel.py (1)
  • threads (214-218)
tilelang/language/v2/dtypes.py (2)
  • uint8 (271-271)
  • int8 (243-243)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Agent
  • GitHub Check: CodeQL analysis (python)
🔇 Additional comments (2)
src/tl_templates/cuda/common.h (1)

140-181: LGTM! Well-structured packing functions for unsigned types.

The new make_uint overloads correctly mirror the signed variants and implement proper little-endian packing. The implementation is consistent with the existing codebase patterns.
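As a side note on the little-endian claim above: a standalone host-side check (not part of this PR, C++ shown purely for illustration) demonstrates that packing bytes low-to-high with shifts reproduces the raw memory layout of the source bytes on a little-endian target.

#include <cassert>
#include <cstdint>
#include <cstring>

// Illustration only: the shift-based packing order matches the byte layout
// that a raw copy of four consecutive uint8 values produces on little-endian
// machines.
int main() {
  uint8_t bytes[4] = {0x11, 0x22, 0x33, 0x44};
  uint32_t shifted = (uint32_t(bytes[3]) << 24) | (uint32_t(bytes[2]) << 16) |
                     (uint32_t(bytes[1]) << 8) | uint32_t(bytes[0]);
  uint32_t copied;
  std::memcpy(&copied, bytes, sizeof(copied));
  assert(shifted == copied);  // holds on little-endian targets
  return 0;
}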

testing/python/language/test_tilelang_language_vectorize.py (1)

127-142: Consider consistency with hardware requirements for float8 types.

The test includes float8_e4m3fn, float8_e5m2, and float8_e8m0fnu dtypes without hardware capability checks. While other tests in the codebase (e.g., in testing/python/jit/) explicitly check compute capability for Hopper GPUs, float8 tests throughout the suite (testing/python/kernel/, testing/python/tilelibrary/) lack explicit skipif markers. Either:

  1. Add a pytest.mark.skipif decorator checking GPU compute capability for float8 types (matching patterns in jit tests), or
  2. Document that these tests require Ada/Hopper hardware and rely on runtime errors for detection

Comment on lines +116 to +148
@tilelang.jit
def vectorize_test_all_dtypes(dtype, vec_num):
    @T.prim_func
    def main(A: T.Tensor[(64,), dtype]):
        with T.Kernel(1, threads=256):
            for i in T.vectorized(vec_num):
                A[i] = T.cast(i + 1, dtype)

    return main


@pytest.mark.parametrize(
    "dtype",
    [
        torch.uint8,
        torch.uint16,
        torch.uint32,
        torch.uint64,
        torch.int8,
        torch.int16,
        torch.int32,
        torch.int64,
        torch.float8_e4m3fn,
        torch.float8_e5m2,
        torch.float8_e8m0fnu,
    ],
)
@pytest.mark.parametrize("vec_num", [1, 2, 4, 8])
def test_vectorize_all_dtypes(dtype, vec_num):
    x = torch.empty((64,), dtype=dtype, device="cuda")
    kernel = vectorize_test_all_dtypes(dtype, vec_num)
    kernel(x)


🛠️ Refactor suggestion | 🟠 Major

Add output verification to the vectorization test.

The test currently only verifies that the kernel executes without crashing, but doesn't validate the correctness of the vectorized memory operations. This significantly reduces the test's effectiveness.

🔎 Proposed enhancement to verify output correctness
 @pytest.mark.parametrize("vec_num", [1, 2, 4, 8])
 def test_vectorize_all_dtypes(dtype, vec_num):
     x = torch.empty((64,), dtype=dtype, device="cuda")
     kernel = vectorize_test_all_dtypes(dtype, vec_num)
     kernel(x)
+    
+    # Verify the kernel wrote the expected values
+    expected = torch.arange(1, vec_num + 1, dtype=dtype, device="cuda")
+    torch.testing.assert_close(x[:vec_num], expected, atol=0, rtol=0)

This ensures the vectorized writes are actually working correctly for all data types and vectorization widths.

🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_vectorize.py around lines
116-148, the test calls the kernel but does not assert that the buffer contents
are correct; after kernel(x) runs, construct the expected tensor (values 1..64
cast to dtype on CUDA), then verify x matches expected: for integer dtypes use
exact equality, for floating dtypes use torch.testing.assert_close with small
atol/rtol; ensure comparisons run on the same device and handle any
dtype-specific quirks (use torch.arange(1,65, dtype=dtype, device='cuda') and
cast as needed) so the test fails if vectorized writes produced incorrect
results.

@LeiWang1999 (Member) commented:

@codex review

@LeiWang1999 merged commit dff10e5 into tile-ai:main on Dec 25, 2025 (13 checks passed).
chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +141 to +143
TL_DEVICE unsigned int make_uint(unsigned char x0, unsigned char x1,
                                 unsigned char x2, unsigned char x3) {
  return (x3 << 24) | (x2 << 16) | (x1 << 8) | x0;


P1: Prevent UB when packing uint8 bytes into uint

The shifts in make_uint operate on int because unsigned char is promoted, so when x3 is ≥ 128 the x3 << 24 shift overflows a signed 32-bit int, which is undefined behavior in C++. That can mispack vectorized uint8 data on device for values with the high bit set. Cast each operand to unsigned int (or uint32_t) before shifting to make the packing defined.
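
A minimal sketch of the cast-before-shift fix described above, keeping the signature from the quoted snippet (the code actually merged may differ):

// Widen each byte to unsigned int before shifting so the arithmetic is
// unsigned and cannot overflow a signed 32-bit int.
TL_DEVICE unsigned int make_uint(unsigned char x0, unsigned char x1,
                                 unsigned char x2, unsigned char x3) {
  return (static_cast<unsigned int>(x3) << 24) |
         (static_cast<unsigned int>(x2) << 16) |
         (static_cast<unsigned int>(x1) << 8) |
         static_cast<unsigned int>(x0);
}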

