Conversation

@SiriusNEO (Collaborator) commented Dec 25, 2025

This PR adds complete test coverage for vectorized memory loading over the following space:

common_dtypes x vec_bits
where
common_dtypes = [uint8, uint16, uint32, uint64, int8, int16, int32, int64, float8_e4m3fn, float8_e5m2, float8_e8m0fnu]
vec_bits = [1, 2, 4, 8]

It also adds packing functions in tl_templates so that all of these cases pass the test.

Summary by CodeRabbit

  • New Features

    • Added utility functions for packing values into vectorized types, enabling more flexible data structure construction on CUDA.
  • Bug Fixes

    • Removed excessive lane limitations on CUDA operations, allowing higher vectorization factors and improved performance potential.
  • Tests

    • Added comprehensive parametric tests for vectorization across multiple data types and vectorization factors.


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project so your changes are properly linted and formatted; this helps your contribution pass the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai bot (Contributor) commented Dec 25, 2025

📝 Walkthrough

The changes enable 8-lane CUDA vectorization by removing a 4-lane assertion limit in code generation, adding helper functions for packing unsigned values into vectorized types, and introducing parametric tests validating vectorization across multiple data types and vectorization factors.

Changes

  • CUDA Code Generation (src/target/codegen_cuda.cc): Removed the active runtime assertion enforcing ≤4 lanes for Ramp nodes; the commented-out alternative allows up to 8 lanes with deferred safety validation. Functional output generation is unchanged.
  • CUDA Vector Helpers (src/tl_templates/cuda/common.h): Added four overloaded make_uint/make_uint2/make_uint4 functions to pack 8-bit and 16-bit unsigned values into composite vectorized types via existing helpers and pointer casts.
  • Vectorization Tests (testing/python/language/test_tilelang_language_vectorize.py): Added a parametric test kernel and test function covering integer and float8 dtypes with vectorization factors [1, 2, 4, 8]; introduced a pytest import for parameterization.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested reviewers

  • LeiWang1999
  • kurisu6912

Poem

🐰 Eight lanes now flow where four once stood,
Pack those bytes as vectorized types should,
Helper functions weave with tested grace,
CUDA kernels zoom at quickened pace!

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 36.36%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check (❓ Inconclusive): The title references 'vectorized loading', but the main changes involve removing a runtime assertion, adding packing helper functions, and introducing comprehensive tests, so the PR is broader than just completing vectorized loading. Resolution: clarify whether the title should be more specific (e.g., mentioning both the assertion removal and the new packing functions) or whether it accurately represents the primary objective from the author's perspective.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.


Copilot AI (Contributor) left a comment


Pull request overview

This PR adds comprehensive testing for vectorized memory loading operations across common data types and implements the necessary packing functions to support these operations. The changes ensure that vectorized loading works correctly for various integer and float8 data types with different vectorization widths.

  • Adds parametrized tests covering 11 common dtypes (uint8-64, int8-64, float8 variants) with vectorization factors of 1, 2, 4, and 8
  • Implements new packing functions (make_uint, make_uint2, make_uint4 overloads) to handle vectorized operations for unsigned integer types
  • Relaxes the ramp node lane limit check to accommodate wider vectorization (temporarily commented out with a TODO for future safety improvements)

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • testing/python/language/test_tilelang_language_vectorize.py: Adds a comprehensive parametrized test for vectorized loading across common dtypes with different vectorization factors.
  • src/tl_templates/cuda/common.h: Implements packing functions for unsigned char and unsigned short types to support vectorized operations.
  • src/target/codegen_cuda.cc: Relaxes the ramp node lane check by commenting out the limit, allowing wider vectorization, with a TODO for a future constraint implementation.


Comment on lines +144 to +147
def test_vectorize_all_dtypes(dtype, vec_num):
    x = torch.empty((64,), dtype=dtype, device="cuda")
    kernel = vectorize_test_all_dtypes(dtype, vec_num)
    kernel(x)

Copilot AI Dec 25, 2025


The test doesn't verify that the kernel executed correctly. It should check that the tensor values match the expected output (i + 1 for each index i). Currently, the test only verifies that the kernel runs without error, but doesn't validate the results.

Comment on lines +2833 to +2840
// TODO(chaofan): Comment the ramp lanes limit for now since we have
// LegalizeVectorizedLoop to automatically legalize vectorized loop whose
// width exceeds the limit. But we should add check here for safety in the
// future. The check should be aligned to certain bit width like 128bits or
// 256bits.

// CHECK_LE(lanes, 8) << "Translate Ramp Node " << tvm::ffi::GetRef<Ramp>(op)
// << "error: " << lanes << " exceeds max ramp lanes 8.";

Copilot AI Dec 25, 2025


This TODO comment indicates that the commented-out CHECK is temporarily removed, which creates a safety issue. While the LegalizeVectorizedLoop pass may handle oversized vectorized loops, removing runtime checks without proper validation could allow invalid code to pass through. The TODO should either be addressed in this PR or tracked in a separate issue with specific bit-width constraints (128/256 bits) documented.

Suggested change
// TODO(chaofan): Comment the ramp lanes limit for now since we have
// LegalizeVectorizedLoop to automatically legalize vectorized loop whose
// width exceeds the limit. But we should add check here for safety in the
// future. The check should be aligned to certain bit width like 128bits or
// 256bits.
// CHECK_LE(lanes, 8) << "Translate Ramp Node " << tvm::ffi::GetRef<Ramp>(op)
// << "error: " << lanes << " exceeds max ramp lanes 8.";
// Enforce a maximum total vector width for safety. Even though
// LegalizeVectorizedLoop is expected to legalize oversized vectorized loops,
// we keep this runtime check to prevent emitting vectors wider than the
// supported bit-width.
//
// Here we conservatively cap the total vector width at 256 bits.
int max_vector_bits = 256;
int dtype_bits = op->dtype.bits();
if (dtype_bits > 0) {
  int max_lanes = max_vector_bits / dtype_bits;
  CHECK_LE(lanes, max_lanes)
      << "Translate Ramp Node " << tvm::ffi::GetRef<Ramp>(op)
      << " error: " << lanes << " exceeds max ramp lanes " << max_lanes
      << " for element bit-width " << dtype_bits
      << " (max total vector width " << max_vector_bits << " bits).";
}

coderabbitai bot (Contributor) left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3ce8ac9 and b8ec2dd.

📒 Files selected for processing (3)
  • src/target/codegen_cuda.cc
  • src/tl_templates/cuda/common.h
  • testing/python/language/test_tilelang_language_vectorize.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_vectorize.py
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_vectorize.py (2)
tilelang/language/kernel.py (1)
  • threads (214-218)
tilelang/language/v2/dtypes.py (2)
  • uint8 (271-271)
  • int8 (243-243)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Agent
  • GitHub Check: CodeQL analysis (python)
🔇 Additional comments (2)
src/tl_templates/cuda/common.h (1)

140-181: LGTM! Well-structured packing functions for unsigned types.

The new make_uint overloads correctly mirror the signed variants and implement proper little-endian packing. The implementation is consistent with the existing codebase patterns.
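As a side note on the little-endian claim above: a standalone host-side check (not part of this PR, C++ shown purely for illustration) demonstrates that packing bytes low-to-high with shifts reproduces the raw memory layout of the source bytes on a little-endian target.

#include <cassert>
#include <cstdint>
#include <cstring>

// Illustration only: the shift-based packing order matches the byte layout
// that a raw copy of four consecutive uint8 values produces on little-endian
// machines.
int main() {
  uint8_t bytes[4] = {0x11, 0x22, 0x33, 0x44};
  uint32_t shifted = (uint32_t(bytes[3]) << 24) | (uint32_t(bytes[2]) << 16) |
                     (uint32_t(bytes[1]) << 8) | uint32_t(bytes[0]);
  uint32_t copied;
  std::memcpy(&copied, bytes, sizeof(copied));
  assert(shifted == copied);  // holds on little-endian targets
  return 0;
}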

testing/python/language/test_tilelang_language_vectorize.py (1)

127-142: Consider consistency with hardware requirements for float8 types.

The test includes float8_e4m3fn, float8_e5m2, and float8_e8m0fnu dtypes without hardware capability checks. While other tests in the codebase (e.g., in testing/python/jit/) explicitly check compute capability for Hopper GPUs, float8 tests throughout the suite (testing/python/kernel/, testing/python/tilelibrary/) lack explicit skipif markers. Either:

  1. Add a pytest.mark.skipif decorator checking GPU compute capability for float8 types (matching patterns in jit tests), or
  2. Document that these tests require Ada/Hopper hardware and rely on runtime errors for detection

Comment on lines +116 to +148
@tilelang.jit
def vectorize_test_all_dtypes(dtype, vec_num):
    @T.prim_func
    def main(A: T.Tensor[(64,), dtype]):
        with T.Kernel(1, threads=256):
            for i in T.vectorized(vec_num):
                A[i] = T.cast(i + 1, dtype)

    return main


@pytest.mark.parametrize(
    "dtype",
    [
        torch.uint8,
        torch.uint16,
        torch.uint32,
        torch.uint64,
        torch.int8,
        torch.int16,
        torch.int32,
        torch.int64,
        torch.float8_e4m3fn,
        torch.float8_e5m2,
        torch.float8_e8m0fnu,
    ],
)
@pytest.mark.parametrize("vec_num", [1, 2, 4, 8])
def test_vectorize_all_dtypes(dtype, vec_num):
    x = torch.empty((64,), dtype=dtype, device="cuda")
    kernel = vectorize_test_all_dtypes(dtype, vec_num)
    kernel(x)


🛠️ Refactor suggestion | 🟠 Major

Add output verification to the vectorization test.

The test currently only verifies that the kernel executes without crashing, but doesn't validate the correctness of the vectorized memory operations. This significantly reduces the test's effectiveness.

🔎 Proposed enhancement to verify output correctness
 @pytest.mark.parametrize("vec_num", [1, 2, 4, 8])
 def test_vectorize_all_dtypes(dtype, vec_num):
     x = torch.empty((64,), dtype=dtype, device="cuda")
     kernel = vectorize_test_all_dtypes(dtype, vec_num)
     kernel(x)
+    
+    # Verify the kernel wrote the expected values
+    expected = torch.arange(1, vec_num + 1, dtype=dtype, device="cuda")
+    torch.testing.assert_close(x[:vec_num], expected, atol=0, rtol=0)

This ensures the vectorized writes are actually working correctly for all data types and vectorization widths.

🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_vectorize.py around lines
116-148, the test calls the kernel but does not assert that the buffer contents
are correct; after kernel(x) runs, construct the expected tensor (values 1..64
cast to dtype on CUDA), then verify x matches expected: for integer dtypes use
exact equality, for floating dtypes use torch.testing.assert_close with small
atol/rtol; ensure comparisons run on the same device and handle any
dtype-specific quirks (use torch.arange(1,65, dtype=dtype, device='cuda') and
cast as needed) so the test fails if vectorized writes produced incorrect
results.

@LeiWang1999 (Member) commented:

@codex review

@LeiWang1999 merged commit dff10e5 into tile-ai:main on Dec 25, 2025 (13 checks passed).
chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +141 to +143
TL_DEVICE unsigned int make_uint(unsigned char x0, unsigned char x1,
                                 unsigned char x2, unsigned char x3) {
  return (x3 << 24) | (x2 << 16) | (x1 << 8) | x0;


P1: Prevent UB when packing uint8 bytes into uint

The shifts in make_uint operate on int because unsigned char is promoted, so when x3 is ≥ 128 the x3 << 24 shift overflows a signed 32-bit int, which is undefined behavior in C++. That can mispack vectorized uint8 data on device for values with the high bit set. Cast each operand to unsigned int (or uint32_t) before shifting to make the packing defined.
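
A minimal sketch of the cast-before-shift fix described above, keeping the signature from the quoted snippet (the code actually merged may differ):

// Widen each byte to unsigned int before shifting so the arithmetic is
// unsigned and cannot overflow a signed 32-bit int.
TL_DEVICE unsigned int make_uint(unsigned char x0, unsigned char x1,
                                 unsigned char x2, unsigned char x3) {
  return (static_cast<unsigned int>(x3) << 24) |
         (static_cast<unsigned int>(x2) << 16) |
         (static_cast<unsigned int>(x1) << 8) |
         static_cast<unsigned int>(x0);
}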

