
[Refactor] Treat local.var as local buffers when deciding vectorization for stable actions #1835

Merged
LeiWang1999 merged 1 commit into tile-ai:main from LeiWang1999:refactor_0211 on Feb 11, 2026
Conversation

@LeiWang1999
Member

LeiWang1999 commented Feb 11, 2026

Summary by CodeRabbit

  • Refactor

    • Improved buffer detection logic to better handle local memory optimization.
  • Tests

    • Added test coverage for vectorization behavior with local variable handling.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Feb 11, 2026

📝 Walkthrough


The PR modifies buffer locality detection in the tile operation lowering pass to treat var-mapped buffers as local, and adds a test case verifying vectorization behavior with var buffers during parallel execution.

Changes

Compiler: Buffer Locality Check (src/transform/lower_tile_op.cc)
Tightens non-local buffer detection by changing the IsLocalBuffer call to pass allow_var=true, treating var-mapped locals as local while preserving the fragment-buffer exclusion.

Test: Parallel Vectorization (testing/python/language/test_tilelang_language_parallel.py)
Introduces a _parallel_vectorize_local_and_var kernel with fragment and var buffers, plus a test_parallel_vectorize_var test that verifies vectorization is not applied (it asserts the absence of "float2" in the kernel source).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes


Poem

🐰 Twitches whiskers with delight
A buffer here, a var there,
Locals nested with such care,
Vectorize when none are near,
This change makes logic crystalline and bright! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title clearly and specifically describes the main change (treating local.var buffers as local when deciding vectorization for stable actions) and directly relates to the primary modifications in both files.


@LeiWang1999
Member Author

@regression-perf

@coderabbitai
Contributor

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@testing/python/language/test_tilelang_language_parallel.py`:
- Around line 69-77: The test uses the low-level T.parallel() in
_parallel_vectorize_local_and_var while other tests use the TileLang high-level
T.Parallel(); update the for loop in _parallel_vectorize_local_and_var to use
T.Parallel() instead of T.parallel() so the file remains consistent with the
TileLang API (search for the function _parallel_vectorize_local_and_var and
replace the T.parallel(...) usage with T.Parallel(...)).
🧹 Nitpick comments (1)
testing/python/language/test_tilelang_language_parallel.py (1)

79-82: Consider adding a CUDA-availability guard.

Other tests in this file skip gracefully when CUDA is unavailable (_require_cuda_tensor). If get_kernel_source() requires a CUDA-capable environment for compilation, this test will hard-fail in CPU-only CI. Consider wrapping it with a skip check similar to the other tests.

Comment on lines +69 to +77

@tilelang.jit
def _parallel_vectorize_local_and_var():
    with T.Kernel(1) as _:
        x = T.alloc_fragment([256], T.float32)
        y = T.alloc_fragment([256], T.float32)
        z = T.alloc_var(T.float32)
        for i in T.parallel(256):
            y[i] = x[i] * z


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts executed (in the tile-ai/tilelang repository; outputs omitted):

# First, find and read the test file to verify the actual casing
fd test_tilelang_language_parallel.py --type f

# Read the specific lines mentioned in the review
if [ -f "testing/python/language/test_tilelang_language_parallel.py" ]; then
  echo "=== Lines 69-77 (the disputed code) ===" && \
  sed -n '69,77p' testing/python/language/test_tilelang_language_parallel.py && \
  echo -e "\n=== Lines 18, 33, 36 (for comparison) ===" && \
  sed -n '18p;33p;36p' testing/python/language/test_tilelang_language_parallel.py
fi

# Search for T.parallel and T.Parallel definitions in the codebase
rg "def parallel|def Parallel" --type py -C 2

# Check how T is imported and what it refers to
rg "^import T|^from.*import.*T|^T = " testing/python/language/test_tilelang_language_parallel.py -C 1

# Check the imports at the top of the test file
head -30 testing/python/language/test_tilelang_language_parallel.py

# Search for T usage and imports more broadly
rg "^import|^from" testing/python/language/test_tilelang_language_parallel.py | head -20

# Check what the T module contains and where Parallel is accessible from
cat -n tilelang/language/loop.py | head -50

# Check if parallel is aliased to Parallel or available in the same module
rg "parallel|Parallel" tilelang/language/__init__.py -C 3

# Check what else is exported from __init__.py, especially any lowercase parallel
rg "parallel" tilelang/language/__init__.py

# Check if there's an alias or if parallel is defined elsewhere
rg "^parallel" tilelang/language/ -A 2

# Check loop.py more thoroughly for any lowercase parallel
rg -i "parallel" tilelang/language/loop.py | grep -v "^\s*#"

# Check if parallel exists in tir/ir.py
cat -n tilelang/language/tir/ir.py | grep -A 10 "^.*def parallel"

# Check if tir/ir.py or parallel (lowercase) is imported in __init__.py
cat tilelang/language/__init__.py

# Verify the full tir/ir.py file to understand its structure
wc -l tilelang/language/tir/ir.py
head -100 tilelang/language/tir/ir.py

# Check if parallel is explicitly exported from the tir module
rg "from.*tir" tilelang/language/__init__.py -C 3

# Verify the function signatures one more time
grep -A 5 "def parallel(" tilelang/language/tir/ir.py | head -10
grep -A 5 "def Parallel(" tilelang/language/loop.py | head -10

# Check other uses of T.parallel vs T.Parallel in the test file
rg "T\.parallel|T\.Parallel" testing/python/language/test_tilelang_language_parallel.py -n


Use T.Parallel() for consistency with the rest of the file.

Line 75 uses T.parallel() (the low-level TVM IR API), while lines 18, 33, and 36 use T.Parallel() (the TileLang high-level API). Both are valid, but the file should use T.Parallel() consistently for readability and to maintain the TileLang abstraction level.


@github-actions

Performance Regression Test Report

Triggered by: @LeiWang1999
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/21894068859

Results

File Original Latency Current Latency Speedup
example_mha_fwd_bhsd_wgmma_pipelined 0.014214 0.0142283 0.998996
example_mha_fwd_bshd_wgmma_pipelined 0.0145731 0.0145682 1.00034
example_per_token_cast_to_fp8 0.00743505 0.00743009 1.00067
example_dequant_gemv_fp16xint4 0.0284735 0.0284125 1.00215
example_convolution_autotune 0.997592 0.994989 1.00262
sparse_mla_fwd 0.132359 0.131918 1.00334
example_mha_fwd_bhsd 0.0111702 0.0111233 1.00422
example_mha_sink_fwd_bhsd_sliding_window 0.0158059 0.015736 1.00445
example_group_per_split_token_cast_to_fp8 0.010437 0.0103889 1.00462
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0155948 0.0155218 1.00471
example_gemm 0.0228896 0.0227737 1.00509
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0147006 0.0146233 1.00529
sparse_mla_fwd_pipelined 0.0970263 0.0964933 1.00552
example_mha_sink_fwd_bhsd 0.0160206 0.0159322 1.00555
example_tilelang_nsa_decode 0.00742407 0.00738251 1.00563
sparse_mla_bwd 0.426907 0.424478 1.00572
example_mha_sink_bwd_bhsd_sliding_window 0.0452462 0.0449876 1.00575
topk_selector 0.0544493 0.0541261 1.00597
example_tilelang_nsa_fwd 0.00704928 0.00700675 1.00607
example_tilelang_block_sparse_attn 0.0102338 0.0101715 1.00612
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.0155616 0.0154569 1.00677
fp8_lighting_indexer 0.0360557 0.0358128 1.00678
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.0145782 0.0144773 1.00697
example_warp_specialize_gemm_barrierpipe_stage2 0.0398239 0.0395465 1.00701
example_tilelang_sparse_gqa_decode_varlen_indice 0.0172615 0.0171324 1.00754
example_mha_fwd_varlen 0.0454873 0.0451415 1.00766
example_mha_bwd_bshd_wgmma_pipelined 0.025725 0.0255245 1.00786
example_mha_fwd_bshd 0.0261507 0.025938 1.0082
example_mha_sink_bwd_bhsd 0.0629183 0.062401 1.00829
example_blocksparse_gemm 0.0228686 0.0226796 1.00833
example_tilelang_sparse_gqa_decode_varlen_mask 0.0237069 0.023504 1.00863
example_dynamic 0.65774 0.651626 1.00938
example_gemm_schedule 0.032584 0.032269 1.00976
example_gqa_sink_bwd_bhsd_sliding_window 0.0258904 0.0256192 1.01059
example_dequant_gemm_bf16_mxfp4_hopper 0.516452 0.51099 1.01069
example_gemm_autotune 0.0226019 0.0223607 1.01079
example_convolution 1.34841 1.33393 1.01085
example_dequant_gemm_bf16_fp4_hopper 0.582536 0.576019 1.01131
example_gqa_decode 0.0482775 0.0477303 1.01147
example_warp_specialize_gemm_softpipe_stage2 0.0387425 0.0383031 1.01147
example_tilelang_gemm_fp8 0.322473 0.318774 1.0116
example_gqa_sink_bwd_bhsd 0.0421709 0.041687 1.01161
example_gqa_fwd_bshd_wgmma_pipelined 0.0559108 0.055266 1.01167
tilelang_example_sparse_tensorcore 0.0150815 0.0149051 1.01183
example_warp_specialize_gemm_copy_1_gemm_0 0.0387523 0.0382955 1.01193
example_elementwise_add 0.297675 0.29411 1.01212
example_mha_bwd_bshd 0.0412775 0.0407748 1.01233
example_warp_specialize_gemm_copy_0_gemm_1 0.0392552 0.0387737 1.01242
block_sparse_attn_tilelang 0.0102939 0.0101658 1.01261
example_linear_attn_fwd 0.0370073 0.036543 1.01271
example_gqa_bwd_wgmma_pipelined 0.0698217 0.0689052 1.0133
example_mla_decode 0.467796 0.461514 1.01361
example_linear_attn_bwd 0.153644 0.151418 1.0147
example_mha_bwd_bhsd 0.0406717 0.0400818 1.01472
example_gqa_fwd_bshd 0.0720494 0.0709887 1.01494
example_tilelang_gemm_splitk 1.42296 1.40165 1.0152
example_tilelang_gemm_splitk_vectorize_atomicadd 1.42245 1.40098 1.01533
example_fusedmoe_tilelang 0.13349 0.131435 1.01564
example_mha_inference 0.0802538 0.078978 1.01615
example_dequant_groupedgemm_bf16_mxfp4_hopper 3.41103 3.3523 1.01752
example_dequant_gemm_w4a8 5.39797 5.30447 1.01763
example_gqa_bwd 0.0498308 0.0489481 1.01803
example_gemm_intrinsics 0.0354379 0.0348004 1.01832
example_topk 0.0111042 0.0108973 1.01899
example_gqa_bwd_tma_reduce_varlen 0.0523578 0.0513782 1.01907
example_vertical_slash_sparse_attn 0.236832 0.231708 1.02212
example_gemv 0.288407 0.281803 1.02343
example_tilelang_gemm_fp8_2xAcc 0.188315 0.18397 1.02362
example_tilelang_gemm_fp8_intrinsic 0.934314 0.911959 1.02451
example_dequant_gemm_fp4_hopper 1.06555 1.03482 1.0297
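The Speedup column is the original latency divided by the current latency, so values above 1.0 mean the current commit is faster. A minimal sketch reproducing the first row of the table:

```python
def speedup(original_latency: float, current_latency: float) -> float:
    # Speedup > 1.0: the current build is faster than the original.
    return original_latency / current_latency


# First row: example_mha_fwd_bhsd_wgmma_pipelined
s = speedup(0.014214, 0.0142283)
```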

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@LeiWang1999 LeiWang1999 merged commit e666d2d into tile-ai:main Feb 11, 2026
7 checks passed