
Conversation

LeiWang1999 (Member) commented on Jan 19, 2026

Summary

  • Introduce src/config.h: Centralize PassContext configuration utilities with VectorizePlannerVerboseEnabled() and Vectorize256Disabled() helper functions for cleaner code reuse.

  • Improve CallNode vectorization in loop_vectorize.cc: Previously, most CallNodes were conservatively assigned vector_size=1, limiting vectorization opportunities. This PR introduces buffer access invariant analysis for CallNodes, finding the maximum vector size at which all buffer accesses remain invariant within vector boundaries (a standalone sketch of this search appears after this list).

  • Refactor buffer categorization: Separate buffers into three categories (local/fragment, memory, call/cast) with dedicated minimum tracking for more precise vectorization decisions.

  • Extract TransformIndices helper: Factor out index transformation logic for reuse across buffer load/store analysis and the new CallNode invariant checking.
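
A minimal, self-contained sketch of that invariance-driven search is shown below. It is plain Python for illustration only: the actual planner operates on TIR expressions and buffer layouts inside loop_vectorize.cc, and the function and parameter names here are made up.

from math import gcd

def max_invariant_vector_size(access_index, loop_extent, max_vec=8):
    # Largest vector size (halving from the initial candidate) such that the
    # buffer access index stays constant within each group of consecutive
    # loop iterations, i.e. the access is invariant inside one vector.
    vec = gcd(max_vec, loop_extent)
    while vec > 1:
        invariant = all(
            access_index(base + lane) == access_index(base)
            for base in range(0, loop_extent, vec)
            for lane in range(vec)
        )
        if invariant:
            return vec
        vec //= 2
    return 1

# Example: an index such as i // 4 is constant within every group of 4
# consecutive iterations, so vector size 4 is legal while 8 is not.
print(max_invariant_vector_size(lambda i: i // 4, loop_extent=16))  # -> 4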

Test plan

  • Added test_vectorize_call_infinity: Verifies that T.infinity() calls can be vectorized (expects float4 in output)
  • Added test_vectorize_call_bitwise_logical: Verifies vectorization works correctly with swizzle layouts and parallel loops

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added config toggles to control vectorization diagnostics and 256-wide vectorization.
    • Improved loop vectorization to better handle call/cast paths and more buffer access patterns with richer logging and per-run vector-size decisions.
  • Tests

    • Added tests validating vectorized codegen for call sites and bitwise/logical patterns (checks for packed float outputs).
  • Examples

    • Simplified gemm FP8 example to a single path and added kernel source debug printing.


- Refactored vectorization logic to utilize configuration settings for enabling/disabling vectorization.
- Introduced new checks for buffer access invariance in call nodes to improve vectorization accuracy.
- Updated the GCD calculation of vector sizes to also include call-node constraints, so the final vector size respects CallNode invariance (a rough illustration follows this list).
- Added verbose logging for better insights during vectorization planning.
- Introduced tests for vectorization of infinity and bitwise logical operations to validate new functionality.
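
As a rough, made-up illustration of that GCD combination (the real planner derives each category minimum from TIR buffer accesses; the category names come from this PR, the numbers below are arbitrary):

from math import gcd
from functools import reduce

local_fragment_min = 8   # minimum from local/fragment buffer accesses
memory_min = 4           # minimum from global/shared memory accesses
call_node_min = 4        # new: widest size keeping CallNode buffer accesses invariant

vector_size = reduce(gcd, [local_fragment_min, memory_min, call_node_min])
print(vector_size)  # -> 4
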
github-actions bot commented

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai bot (Contributor) commented on Jan 19, 2026

📝 Walkthrough

Walkthrough

Adds PassContext-backed TileLang config accessors and extends the loop vectorization planner with call/cast-aware per-run state, index transformation helpers, expanded buffer classification, revised buffer vector-size computation, and two new JIT tests that validate call/cast vectorization output.

Changes

  • Configuration utilities (src/config.h): Added tvm::tl::tl_config::VectorizePlannerVerboseEnabled() and tvm::tl::tl_config::Vectorize256Disabled() to read PassContext boolean configs (default false); a hedged usage sketch follows this list.
  • Vectorization planning logic (src/transform/loop_vectorize.cc): Replaced direct PassContext queries with tl_config accessors; introduced a per-run call_node_min; added a TransformIndices helper; expanded buffer classification to include a call/cast category; revised ComputeBufferVectorSize to use transformed indices, dynamic-stride offsets, invariant checks, a loop-halving search for call-node vector sizes, and additional logging.
  • Vectorization tests (testing/python/language/test_tilelang_language_vectorize.py): Added vectorize_test_call_infinity and vectorize_test_call_bitwise_logical JIT tests; one enables the verbose planner config and uses a swizzle layout to validate that the generated vectorized code contains float4.
  • Example update (examples/gemm_fp8/...): Removed the secondary e5m2 latency path and debugging; the example now prints the generated kernel source for e4m3 and returns only the e4m3 latency.
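
A hedged usage sketch for those config toggles, assuming they are exposed as registered PassContext options; the key string below is an assumption for illustration only (the PR does not state the actual key names), and PassContext rejects keys that are not registered on the TileLang side.

import tvm

# "tl.vectorize_planner_verbose" is a hypothetical key; substitute whatever key
# TileLang actually registers for VectorizePlannerVerboseEnabled().
with tvm.transform.PassContext(config={"tl.vectorize_planner_verbose": True}):
    pass  # compile the TileLang kernel inside this context so the planner logs its decisions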

Sequence Diagram(s)

(Skipped)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • kurisu6912
  • tzj-fxz

Poem

🐰✨ I hop through loops with nimble paws,
Transforming indices, learning new laws,
Calls and casts I count with care,
Four-wide floats now dance in air,
A little rabbit, planner-bright and fair.

🚥 Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 16.67%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check ✅ Passed: The title accurately describes the primary change: enhancing loop vectorization logic to improve CallNode handling, which is the main focus across the modified files.



coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/transform/loop_vectorize.cc (1)

523-533: Store transformed indices for re-validation.
ComputeBufferVectorSize now uses transformed indices, but BufferVectorInfo still records the original indices. Later re-validation (lines 285-292) uses info.indices, which can be inconsistent when a layout map is present and may allow unsafe vectorization. Store the transformed indices (or re-transform before re-validation).

🛠️ Suggested fix (store transformed indices)
void UpdateVectorSize(const Array<PrimExpr> &indices, const Buffer &buffer,
                      bool is_store) {
-  int buffer_vec_size = ComputeBufferVectorSize(indices, buffer, is_store);
-  buffer_vector_infos_.push_back(
-      {buffer, buffer_vec_size, is_store, indices});
+  auto transformed_indices = TransformIndices(indices, buffer);
+  int buffer_vec_size = ComputeBufferVectorSize(indices, buffer, is_store);
+  buffer_vector_infos_.push_back(
+      {buffer, buffer_vec_size, is_store, transformed_indices});
}

LeiWang1999 (Member, Author) commented:

@regression-perf

github-actions bot commented

Performance Regression Test Report

Triggered by: @LeiWang1999
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/21141562085

Results

File Original Latency Current Latency Speedup
example_tilelang_gemm_fp8_intrinsic 0.468471 0.933556 0.501814
example_warp_specialize_gemm_barrierpipe_stage2 0.038337 0.039425 0.972403
sparse_mla_fwd_pipelined 0.0945838 0.0959446 0.985817
example_tilelang_gemm_fp8_2xAcc 0.188226 0.190899 0.985993
example_mha_inference 0.07981 0.080706 0.988898
example_mha_sink_fwd_bhsd 0.015623 0.0157185 0.993923
sparse_mla_fwd 0.131223 0.131973 0.99432
example_dequant_gemm_bf16_fp4_hopper 0.577452 0.580493 0.994761
sparse_mla_bwd 0.38412 0.386068 0.994953
example_mha_bwd_bshd_wgmma_pipelined 0.026264 0.0263831 0.995486
example_mha_fwd_varlen 0.0455202 0.0456686 0.996751
example_mha_bwd_bshd 0.0412876 0.0414136 0.996958
example_topk 0.010945 0.010977 0.997085
example_mha_bwd_bhsd 0.0406608 0.0407727 0.997257
topk_selector 0.0539579 0.0540991 0.997391
example_convolution_autotune 0.993616 0.995667 0.99794
example_gqa_bwd_wgmma_pipelined 0.0698241 0.0699523 0.998168
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0155034 0.0155279 0.998421
example_dequant_gemv_fp16xint4 0.0284105 0.0284432 0.99885
example_mla_decode 0.461034 0.46145 0.999098
example_gemm_schedule 0.0325873 0.0326157 0.99913
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0146073 0.0146176 0.999293
example_tilelang_block_sparse_attn 0.0101784 0.0101855 0.999309
fp8_lighting_indexer 0.0359653 0.035987 0.999398
example_elementwise_add 0.297518 0.297672 0.999485
example_mha_sink_fwd_bhsd_sliding_window 0.0156848 0.015692 0.999542
example_blocksparse_gemm 0.0226864 0.0226962 0.999566
example_gqa_bwd 0.0498306 0.0498485 0.999641
block_sparse_attn_tilelang 0.0103002 0.0103033 0.999697
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.0144703 0.0144734 0.999787
example_gqa_sink_bwd_bhsd 0.0417102 0.041716 0.99986
example_tilelang_sparse_gqa_decode_varlen_indice 0.0172 0.0172024 0.999861
example_mha_sink_bwd_bhsd 0.0624402 0.0624446 0.999929
example_gemv 0.288795 0.288812 0.999943
example_linear_attn_fwd 0.037008 0.0370097 0.999953
example_vertical_slash_sparse_attn 0.23754 0.23754 1
example_warp_specialize_gemm_copy_1_gemm_0 0.038209 0.038209 1
example_gemm_intrinsics 0.035105 0.035104 1.00003
tilelang_example_sparse_tensorcore 0.0150932 0.0150902 1.0002
example_gqa_sink_bwd_bhsd_sliding_window 0.0256295 0.0256226 1.00027
example_linear_attn_bwd 0.153621 0.15357 1.00033
example_group_per_split_token_cast_to_fp8 0.0103962 0.0103921 1.00039
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.0154692 0.0154631 1.0004
example_tilelang_gemm_splitk 1.42228 1.42161 1.00047
example_tilelang_gemm_splitk_vectorize_atomicadd 1.42342 1.42269 1.00051
example_gqa_bwd_tma_reduce_varlen 0.0523587 0.0523207 1.00073
example_dequant_gemm_w4a8 5.39851 5.39402 1.00083
example_tilelang_sparse_gqa_decode_varlen_mask 0.0240634 0.0240403 1.00096
example_dequant_gemm_bf16_mxfp4_hopper 0.509097 0.508558 1.00106
example_per_token_cast_to_fp8 0.00737664 0.00736681 1.00133
example_dynamic 0.657772 0.656876 1.00136
example_gemm_autotune 0.022369 0.022336 1.00148
example_tilelang_nsa_fwd 0.00701738 0.00700464 1.00182
example_tilelang_nsa_decode 0.00738676 0.007373 1.00187
example_convolution 1.33545 1.33152 1.00295
example_mha_sink_bwd_bhsd_sliding_window 0.044948 0.0447641 1.00411
example_dequant_groupedgemm_bf16_mxfp4_hopper 3.48179 3.46514 1.00481
example_warp_specialize_gemm_softpipe_stage2 0.039585 0.039393 1.00487
example_tilelang_gemm_fp8 0.322648 0.320487 1.00674
example_warp_specialize_gemm_copy_0_gemm_1 0.038976 0.038592 1.00995
example_gqa_decode 0.049185 0.048609 1.01185
example_gemm 0.022849 0.022432 1.01859
example_dequant_gemm_fp4_hopper 1.07467 1.0463 1.02712

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py`:
- Line 230: Remove the unconditional debug print of the kernel source causing
noisy CI logs; replace the direct call to print(kernel_e4m3.get_kernel_source())
with either removal or a conditional/log-level gated output (e.g., only print
when a VERBOSE env var or a debug/verbose flag is set, or emit via a logger at
DEBUG level). Locate the call referencing kernel_e4m3.get_kernel_source() in the
regression performance function and ensure the kernel source is only output when
explicitly requested.

out_dtype, accum_dtype = "float32", "float32"
in_dtype = T.float8_e4m3fn
kernel_e4m3 = tl_matmul(M, N, K, in_dtype, out_dtype, accum_dtype)
print(kernel_e4m3.get_kernel_source())

⚠️ Potential issue | 🟡 Minor

Debug print statement in regression performance function.

The print(kernel_e4m3.get_kernel_source()) will output the full kernel source on every regression run, which can clutter CI logs. If this was added for debugging during development, consider removing it or making it conditional (e.g., via an environment variable or verbose flag).

Suggested fix
 def run_regression_perf():
     M, N, K = 4096, 4096, 4096
     out_dtype, accum_dtype = "float32", "float32"
     in_dtype = T.float8_e4m3fn
     kernel_e4m3 = tl_matmul(M, N, K, in_dtype, out_dtype, accum_dtype)
-    print(kernel_e4m3.get_kernel_source())
     profiler_e4m3 = kernel_e4m3.get_profiler(tilelang.TensorSupplyType.Integer)
     latency_e4m3 = profiler_e4m3.do_bench(backend="cupti")
     return latency_e4m3

LeiWang1999 (Member, Author) commented:

@regression-perf

github-actions bot commented

Performance Regression Test Report

Triggered by: @LeiWang1999
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/21147861495

Results

File Original Latency Current Latency Speedup
example_mha_inference 0.078722 0.080866 0.973487
example_warp_specialize_gemm_barrierpipe_stage2 0.038209 0.039072 0.977913
example_gqa_fwd_bshd_wgmma_pipelined 0.055938 0.056481 0.990386
example_gemm 0.022625 0.022721 0.995775
example_warp_specialize_gemm_softpipe_stage2 0.037728 0.037857 0.996592
example_gemm_autotune 0.022336 0.0224 0.997143
example_dynamic 0.656078 0.657675 0.997572
example_gqa_bwd_wgmma_pipelined 0.0699414 0.0700803 0.998019
example_gemm_intrinsics 0.035073 0.035136 0.998207
sparse_mla_fwd_pipelined 0.0948755 0.0950365 0.998306
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.014465 0.014489 0.998345
example_tilelang_gemm_fp8_intrinsic 0.933623 0.934882 0.998654
example_per_token_cast_to_fp8 0.00736767 0.0073748 0.999034
example_mha_sink_fwd_bhsd 0.0156779 0.0156926 0.999062
example_group_per_split_token_cast_to_fp8 0.0103884 0.0103981 0.999071
example_gemv 0.288769 0.288999 0.999203
example_convolution 1.33401 1.33503 0.999234
example_gqa_sink_bwd_bhsd_sliding_window 0.0256211 0.0256374 0.999365
example_tilelang_sparse_gqa_decode_varlen_indice 0.0172052 0.0172145 0.999457
example_tilelang_sparse_gqa_decode_varlen_mask 0.024047 0.0240595 0.99948
example_tilelang_nsa_fwd 0.00700945 0.00701241 0.999578
example_mha_sink_bwd_bhsd 0.0625175 0.0625428 0.999596
example_mha_bwd_bshd_wgmma_pipelined 0.0263605 0.0263703 0.999627
example_vertical_slash_sparse_attn 0.237528 0.237608 0.999664
fp8_lighting_indexer 0.0359855 0.0359964 0.999697
sparse_mla_fwd 0.131963 0.132001 0.999712
example_tilelang_block_sparse_attn 0.010186 0.0101888 0.999726
example_tilelang_gemm_splitk_vectorize_atomicadd 1.42315 1.42335 0.999857
example_gqa_sink_bwd_bhsd 0.0417009 0.0417063 0.999872
example_linear_attn_bwd 0.15359 0.153607 0.999889
topk_selector 0.0540955 0.0541011 0.999896
example_dequant_gemv_fp16xint4 0.0284228 0.0284241 0.999953
example_dequant_gemm_w4a8 5.39676 5.39661 1.00003
example_elementwise_add 0.29763 0.29762 1.00003
example_tilelang_nsa_decode 0.00738098 0.00738056 1.00006
example_dequant_gemm_bf16_mxfp4_hopper 0.50865 0.508619 1.00006
example_linear_attn_fwd 0.0370211 0.0370172 1.0001
tilelang_example_sparse_tensorcore 0.0150836 0.015082 1.00011
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0146253 0.0146234 1.00013
example_mha_bwd_bshd 0.0414286 0.0414204 1.0002
sparse_mla_bwd 0.386345 0.386231 1.0003
example_gqa_bwd 0.0498339 0.0498182 1.00031
example_gqa_bwd_tma_reduce_varlen 0.0523476 0.0523282 1.00037
example_blocksparse_gemm 0.0227387 0.0227267 1.00053
block_sparse_attn_tilelang 0.010298 0.010292 1.00058
example_tilelang_gemm_splitk 1.42247 1.42159 1.00062
example_gemm_schedule 0.0326079 0.0325864 1.00066
example_mha_bwd_bhsd 0.0407741 0.0407461 1.00069
example_gqa_fwd_bshd 0.071617 0.071554 1.00088
example_mha_fwd_varlen 0.045736 0.0456889 1.00103
example_mla_decode 0.460999 0.460426 1.00124
example_topk 0.010945 0.010912 1.00302
example_warp_specialize_gemm_copy_1_gemm_0 0.038721 0.038593 1.00332
example_mha_sink_bwd_bhsd_sliding_window 0.0449739 0.0447978 1.00393
example_dequant_gemm_fp4_hopper 1.07794 1.0735 1.00414
example_tilelang_gemm_fp8 0.322744 0.321367 1.00428
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0155659 0.0154901 1.00489
example_mha_sink_fwd_bhsd_sliding_window 0.015761 0.0156827 1.00499
example_convolution_autotune 1.00041 0.995063 1.00537
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.0155152 0.0154308 1.00547
example_tilelang_gemm_fp8_2xAcc 0.187684 0.186467 1.00652
example_gqa_decode 0.04848 0.047969 1.01065
example_dequant_gemm_bf16_fp4_hopper 0.579181 0.568103 1.0195
example_warp_specialize_gemm_copy_0_gemm_1 0.040321 0.039521 1.02024
example_dequant_groupedgemm_bf16_mxfp4_hopper 3.59564 3.49689 1.02824

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

LeiWang1999 merged commit 9a9255f into tile-ai:main on Jan 19, 2026; 6 checks passed.