Enhance loop vectorization logic for CallNode handling #1696
Conversation
- Refactored vectorization logic to utilize configuration settings for enabling/disabling vectorization.
- Introduced new checks for buffer access invariance in call nodes to improve vectorization accuracy.
- Updated GCD calculations for vector sizes to include call node constraints, enhancing overall performance.
- Added verbose logging for better insights during vectorization planning.
- Introduced tests for vectorization of infinity and bitwise logical operations to validate new functionality.
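The GCD calculation mentioned above can be sketched as follows. This is a hypothetical simplification in Python (the actual pass operates on TIR nodes in C++, and the helper name is invented for illustration):

```python
from functools import reduce
from math import gcd

def merge_vector_sizes(per_access_sizes):
    """Combine the maximum safe vector width of each buffer access
    (loads, stores, and now call-node accesses) into a single loop-level
    vector size. The final width must divide every per-access width,
    so the combined result is their GCD."""
    return reduce(gcd, per_access_sizes)
```

For example, if one access tolerates width 8 but another only width 2, `merge_vector_sizes([8, 4, 2])` yields 2, so the loop is vectorized at width 2.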
📝 Walkthrough

Adds PassContext-backed TileLang config accessors and extends the loop vectorization planner with call/cast-aware per-run state, index transformation helpers, expanded buffer classification, revised buffer vector-size computation, and two new JIT tests that validate call/cast vectorization output.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: 2 passed, 1 failed (warning).
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/transform/loop_vectorize.cc (1)
523-533: Store transformed indices for re-validation.
`ComputeBufferVectorSize` now uses transformed indices, but `BufferVectorInfo` still records the original indices. Later re-validation (lines 285-292) uses `info.indices`, which can be inconsistent when a layout map is present and may allow unsafe vectorization. Store the transformed indices (or re-transform before re-validation).

🛠️ Suggested fix (store transformed indices)
```diff
 void UpdateVectorSize(const Array<PrimExpr> &indices, const Buffer &buffer,
                       bool is_store) {
-  int buffer_vec_size = ComputeBufferVectorSize(indices, buffer, is_store);
-  buffer_vector_infos_.push_back(
-      {buffer, buffer_vec_size, is_store, indices});
+  auto transformed_indices = TransformIndices(indices, buffer);
+  int buffer_vec_size = ComputeBufferVectorSize(indices, buffer, is_store);
+  buffer_vector_infos_.push_back(
+      {buffer, buffer_vec_size, is_store, transformed_indices});
 }
```
@regression-perf

Performance Regression Test Report
Triggered by: @LeiWang1999
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py`:
- Line 230: Remove the unconditional debug print of the kernel source causing
noisy CI logs; replace the direct call to print(kernel_e4m3.get_kernel_source())
with either removal or a conditional/log-level gated output (e.g., only print
when a VERBOSE env var or a debug/verbose flag is set, or emit via a logger at
DEBUG level). Locate the call referencing kernel_e4m3.get_kernel_source() in the
regression performance function and ensure the kernel source is only output when
explicitly requested.
```python
    out_dtype, accum_dtype = "float32", "float32"
    in_dtype = T.float8_e4m3fn
    kernel_e4m3 = tl_matmul(M, N, K, in_dtype, out_dtype, accum_dtype)
    print(kernel_e4m3.get_kernel_source())
```
Debug print statement in regression performance function.

The `print(kernel_e4m3.get_kernel_source())` call will output the full kernel source on every regression run, which can clutter CI logs. If this was added for debugging during development, consider removing it or making it conditional (e.g., via an environment variable or verbose flag).
Suggested fix

```diff
 def run_regression_perf():
     M, N, K = 4096, 4096, 4096
     out_dtype, accum_dtype = "float32", "float32"
     in_dtype = T.float8_e4m3fn
     kernel_e4m3 = tl_matmul(M, N, K, in_dtype, out_dtype, accum_dtype)
-    print(kernel_e4m3.get_kernel_source())
     profiler_e4m3 = kernel_e4m3.get_profiler(tilelang.TensorSupplyType.Integer)
     latency_e4m3 = profiler_e4m3.do_bench(backend="cupti")
     return latency_e4m3
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def run_regression_perf():
    M, N, K = 4096, 4096, 4096
    out_dtype, accum_dtype = "float32", "float32"
    in_dtype = T.float8_e4m3fn
    kernel_e4m3 = tl_matmul(M, N, K, in_dtype, out_dtype, accum_dtype)
    profiler_e4m3 = kernel_e4m3.get_profiler(tilelang.TensorSupplyType.Integer)
    latency_e4m3 = profiler_e4m3.do_bench(backend="cupti")
    return latency_e4m3
```
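If the kernel dump is still occasionally useful, the conditional output the review suggests could look like this. The environment variable name and helper are hypothetical, not part of the tilelang API; only `get_kernel_source()` comes from the snippet above:

```python
import os

def maybe_dump_kernel_source(kernel):
    """Print the generated kernel source only when explicitly requested
    (hypothetical TILELANG_PRINT_KERNEL env var), keeping routine
    CI/regression logs clean."""
    if os.environ.get("TILELANG_PRINT_KERNEL", "0") == "1":
        print(kernel.get_kernel_source())
```

A logger call at DEBUG level would serve the same purpose if the project prefers logging over environment variables.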
|
@regression-perf |
Performance Regression Test ReportTriggered by: @LeiWang1999 Results
Artifacts
|
Summary

- Introduce `src/config.h`: Centralize PassContext configuration utilities with `VectorizePlannerVerboseEnabled()` and `Vectorize256Disabled()` helper functions for cleaner code reuse.
- Improve CallNode vectorization in `loop_vectorize.cc`: Previously, most CallNodes were conservatively assigned `vector_size=1`, limiting vectorization opportunities. This PR introduces buffer access invariant analysis for CallNodes, finding the maximum vector size where all buffer accesses remain invariant within vector boundaries.
- Refactor buffer categorization: Separate buffers into three categories (local/fragment, memory, call/cast) with dedicated minimum tracking for more precise vectorization decisions.
- Extract `TransformIndices` helper: Factor out index transformation logic for reuse across buffer load/store analysis and the new CallNode invariant checking.

Test plan

- `test_vectorize_call_infinity`: Verifies that `T.infinity()` calls can be vectorized (expects `float4` in output)
- `test_vectorize_call_bitwise_logical`: Verifies vectorization works correctly with swizzle layouts and parallel loops

🤖 Generated with Claude Code
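The "invariant within vector boundaries" criterion from the summary can be illustrated with a small numeric sketch. This is a hypothetical helper that tests concrete index values; the real analysis in `loop_vectorize.cc` reasons symbolically over TIR expressions:

```python
def is_invariant_within_vector(index_fn, vec, num_vectors=16):
    """Check that a flattened buffer index, given as a function of the
    loop variable, is constant across all lanes of each vec-wide vector.
    A call node whose buffer accesses all pass this check can safely be
    vectorized at width `vec`."""
    for v in range(num_vectors):
        base = v * vec
        # All lanes of one vector must read the same buffer element.
        lane_offsets = {index_fn(base + lane) for lane in range(vec)}
        if len(lane_offsets) != 1:
            return False
    return True
```

For instance, an index of the form `i // 4` is invariant at width 4 (every group of 4 lanes maps to one element) but not at width 8, so such an access would cap the call node's vector size at 4.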