
Conversation


@LeiWang1999 LeiWang1999 commented Feb 4, 2026

This pull request enhances the vectorization planning logic in loop_vectorize.cc, improving how buffer indices are validated for vectorization at different vector sizes. The planner becomes more robust by recognizing additional cases where indices can be safely vectorized: not only when they are invariant within the vector boundary, but also when they satisfy the vectorizability check.

Improvements to vectorization boundary checks:

  • The loop that checks buffer indices for invariance at the new vector size now also considers whether indices are vectorizable using the IndiceCanVectorize method, expanding the set of cases where vectorization is allowed.
  • Added a clarifying comment to indicate that the validation checks for both invariance and vectorizability at the new vector size boundary.
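The broadened check can be sketched as follows. This is a simplified standalone model, not the TVM-based implementation in loop_vectorize.cc: the helper names mirror the PR (IsInvariantInVectorBoundary, IndicesCanVectorize), but the affine index model (index = coeff * loop_var + base) is purely illustrative.

```cpp
#include <cstdint>

// Illustrative index model: index = coeff * loop_var + base.
struct AffineIndex {
  int64_t coeff;
  int64_t base;
};

// Invariant within one vector of `vec` lanes: the index does not depend
// on the loop variable at all, so every lane reads the same element.
bool IsInvariantInVectorBoundary(const AffineIndex& idx, int64_t vec) {
  (void)vec;
  return idx.coeff == 0;
}

// Vectorizable at width `vec` (toy criterion): lanes within a vector are
// contiguous (coeff == 1) and each vector starts at an aligned offset.
bool IndicesCanVectorize(const AffineIndex& idx, int64_t vec) {
  return idx.coeff == 1 && idx.base % vec == 0;
}

// The PR widens the old "invariant only" test to "invariant OR vectorizable".
bool OkAtVectorSize(const AffineIndex& idx, int64_t vec) {
  return IsInvariantInVectorBoundary(idx, vec) ||
         IndicesCanVectorize(idx, vec);
}
```

With this widened predicate, a broadcast-style index (coeff 0) and a contiguous aligned index (coeff 1, base divisible by the vector width) both pass, whereas the old check accepted only the former.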

Code cleanup:

  • Removed unnecessary blank lines in the buffer stride and offset calculation logic for better readability.

Summary by CodeRabbit

  • Refactor
    • Improved loop-vectorization revalidation so buffer vector-size feasibility is assessed with a broader feasibility check, enabling more loops to be safely vectorized.
  • Bug Fixes
    • Fixed edge cases where vector-size adjustments were overly strict; the planner now safely reduces vector sizes when needed and logs adjustments to preserve correctness.
  • Chores
    • Renamed an internal helper used by the vectorization planner (internal-only).


github-actions bot commented Feb 4, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀


coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

Renamed IndiceCanVectorize to IndicesCanVectorize and replaced strict invariance checks with IndicesCanVectorize(...) calls in memory-buffer revalidation paths; when the check fails, the code recomputes vector_size_ via GCD with the buffer's vector size and logs the adjustment.

Changes

  • Loop vectorize implementation (src/transform/loop_vectorize.cc): Replaced strict invariance prerequisites with IndicesCanVectorize(...) in multiple memory-buffer revalidation branches and the buffer refinement loop; when infeasible, recompute vector_size_ using GCD with the buffer vector size and log the change.
  • Public header (src/transform/loop_vectorize.h): Renamed exported function IndiceCanVectorize(...) to IndicesCanVectorize(...) without signature changes.
  • Copy node usage (src/op/copy.cc): Updated two call sites in CopyNode::LowerLDSMCopy to call IndicesCanVectorize(...) instead of IndiceCanVectorize(...) (no other logic changes).
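The GCD fallback mentioned in the summary can be sketched as below. This is a minimal illustration, not the planner's actual API: when a buffer cannot be vectorized at the currently planned width, the plan is shrunk to the GCD of the planned width and the buffer's own feasible width, so the result divides both.

```cpp
#include <iostream>
#include <numeric>  // std::gcd (C++17)

// Illustrative fallback: shrink the planned vector size so it divides
// both the current plan and the buffer's feasible vector width, logging
// the adjustment (mirroring the "logs the adjustment" behavior above).
int AdjustVectorSize(int planned, int buffer_vector_size) {
  int adjusted = std::gcd(planned, buffer_vector_size);
  if (adjusted != planned) {
    std::cerr << "vector_size adjusted " << planned << " -> " << adjusted
              << " to fit buffer (feasible width " << buffer_vector_size
              << ")\n";
  }
  return adjusted;
}
```

For example, a planned width of 8 against a buffer that only supports 4 yields 4, while a planned 8 against a buffer width of 6 yields 2, the largest width that divides both.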

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested reviewers

  • kurisu6912

Poem

🐰 I nibble code in moonlit rows,
I hop through loops where vector grows,
I check the indices, twist and scan,
If bounds don’t match, I GCD my plan,
Now buffers stride in tidy rows!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 14.29%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title '[Enhancement] Enhance register vectorize inference' is directly related to the main change: improving vectorization planning logic by enhancing buffer index validation across vector sizes.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

In `src/transform/loop_vectorize.cc`:
- Around lines 293-298: the check passes loop_extent_vector_size_ (a GCD) to IndiceCanVectorize, which can under-report the loop extent and yield false positives. Pass the full loop extent used by ComputeBufferVectorSize instead, so IndiceCanVectorize validates all vector blocks; update the conditional around IsExprInvariantInVectorBoundary and IndiceCanVectorize (for elem_offset, inner_for_->loop_var, vector_size_) to use the real extent value, consistent with ComputeBufferVectorSize.
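The reviewer's concern can be illustrated with a toy example (not the planner's real index analysis): validating index contiguity over a reduced extent can accept a pattern that only breaks in a later vector block. The index function below is contiguous on the first block but has a gap inside the second, so checking a reduced extent of 4 passes while checking the full extent of 8 correctly fails.

```cpp
#include <cstdint>

// Toy index mapping: contiguous for i < 6, then a gap of 2 elements.
int64_t Index(int64_t i) { return i < 6 ? i : i + 2; }

// True if every vector block of `vec` lanes within `extent` iterations
// maps to addresses contiguous with the block's first lane.
bool BlocksAreVectorizable(int64_t extent, int64_t vec) {
  for (int64_t block = 0; block < extent; block += vec) {
    for (int64_t lane = 1; lane < vec; ++lane) {
      if (Index(block + lane) != Index(block) + lane) return false;
    }
  }
  return true;
}
```

Checking only extent 4 (e.g. a GCD of the true extent 8) sees one contiguous block and accepts; checking the full extent 8 inspects the second block (indices 4, 5, 8, 9) and rejects, which is why the review asks for the full loop extent used by ComputeBufferVectorSize.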

@SiriusNEO
Collaborator

@regression-perf


github-actions bot commented Feb 4, 2026

Performance Regression Test Report

Triggered by: @SiriusNEO
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/21666714087

Results

File Original Latency Current Latency Speedup
example_dequant_groupedgemm_bf16_mxfp4_hopper 3.4452 3.51106 0.981241
example_gqa_decode 0.047969 0.048513 0.988787
example_warp_specialize_gemm_softpipe_stage2 0.038817 0.039105 0.992635
example_topk 0.010784 0.010816 0.997041
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.0154061 0.0154489 0.997228
sparse_mla_fwd_pipelined 0.0955275 0.095627 0.998959
example_gqa_bwd 0.0497783 0.0498294 0.998973
example_dequant_gemm_w4a8 5.39271 5.39823 0.998978
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.0144383 0.0144505 0.999157
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0145684 0.0145792 0.999263
example_dynamic 0.656651 0.657037 0.999413
example_tilelang_sparse_gqa_decode_varlen_indice 0.0170519 0.0170618 0.999424
example_dequant_gemm_bf16_mxfp4_hopper 0.508197 0.508462 0.999479
example_gqa_sink_bwd_bhsd 0.041637 0.0416563 0.999537
example_tilelang_gemm_splitk 1.42298 1.4236 0.999562
example_tilelang_sparse_gqa_decode_varlen_mask 0.0234171 0.0234273 0.999566
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.015483 0.015489 0.999611
example_tilelang_nsa_decode 0.00734897 0.00735157 0.999646
example_tilelang_gemm_splitk_vectorize_atomicadd 1.4223 1.42262 0.999771
example_linear_attn_fwd 0.0368961 0.0369039 0.999791
example_mha_bwd_bhsd 0.0406662 0.0406743 0.999802
example_elementwise_add 0.295876 0.29593 0.999819
block_sparse_attn_tilelang 0.0102512 0.0102529 0.999832
example_vertical_slash_sparse_attn 0.237236 0.237273 0.999845
example_mla_decode 0.461257 0.461287 0.999935
example_gemm_schedule 0.0325868 0.0325886 0.999943
example_gemm_intrinsics 0.035072 0.035073 0.999971
example_mha_bwd_bshd 0.0412493 0.0412501 0.99998
example_tilelang_block_sparse_attn 0.0101599 0.0101589 1.0001
example_gqa_bwd_tma_reduce_varlen 0.0521638 0.0521577 1.00012
example_gemv 0.284859 0.284821 1.00013
example_linear_attn_bwd 0.153149 0.153115 1.00023
topk_selector 0.0535497 0.0535362 1.00025
example_group_per_split_token_cast_to_fp8 0.010356 0.0103531 1.00029
example_gqa_sink_bwd_bhsd_sliding_window 0.0255729 0.0255654 1.00029
fp8_lighting_indexer 0.0357631 0.0357491 1.00039
sparse_mla_bwd 0.383107 0.382953 1.0004
example_tilelang_gemm_fp8_intrinsic 0.933258 0.932883 1.0004
tilelang_example_sparse_tensorcore 0.0150401 0.0150339 1.00041
example_mha_sink_fwd_bhsd 0.0158734 0.0158658 1.00048
example_gqa_bwd_wgmma_pipelined 0.0696336 0.0695949 1.00056
example_mha_fwd_varlen 0.0454677 0.0454417 1.00057
example_blocksparse_gemm 0.0226856 0.0226717 1.00061
example_mha_sink_bwd_bhsd 0.0624088 0.0623702 1.00062
example_dequant_gemv_fp16xint4 0.0284425 0.0284243 1.00064
example_tilelang_nsa_fwd 0.00686032 0.00685593 1.00064
example_mha_sink_fwd_bhsd_sliding_window 0.015721 0.0157105 1.00067
example_per_token_cast_to_fp8 0.00740982 0.00740455 1.00071
example_mha_bwd_bshd_wgmma_pipelined 0.0256794 0.0256606 1.00073
example_convolution 1.33427 1.33299 1.00096
sparse_mla_fwd 0.130791 0.130647 1.0011
example_convolution_autotune 0.995691 0.993111 1.0026
example_tilelang_gemm_fp8 0.322486 0.32162 1.00269
example_mha_sink_bwd_bhsd_sliding_window 0.0448627 0.0446871 1.00393
example_gemm_autotune 0.022432 0.022336 1.0043
example_dequant_gemm_fp4_hopper 1.06159 1.05646 1.00486
example_warp_specialize_gemm_copy_1_gemm_0 0.038081 0.037856 1.00594
example_mha_inference 0.079328 0.078817 1.00648
example_dequant_gemm_bf16_fp4_hopper 0.577068 0.573161 1.00682
example_tilelang_gemm_fp8_2xAcc 0.189843 0.188131 1.0091
example_warp_specialize_gemm_barrierpipe_stage2 0.039009 0.038593 1.01078
example_gemm 0.023009 0.022561 1.01986
example_warp_specialize_gemm_copy_0_gemm_1 0.039424 0.038049 1.03614

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@SiriusNEO SiriusNEO merged commit df87c56 into tile-ai:main Feb 4, 2026
6 checks passed