
[Refactor] Treat local.var as local buffers when deciding vectorization for stable actions #1835

Merged
LeiWang1999 merged 1 commit into tile-ai:main from LeiWang1999:refactor_0211 on Feb 11, 2026
Conversation

@LeiWang1999
Member

LeiWang1999 commented Feb 11, 2026

Summary by CodeRabbit

  • Refactor

    • Improved buffer detection logic to better handle local memory optimization.
  • Tests

    • Added test coverage for vectorization behavior with local variable handling.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Feb 11, 2026

📝 Walkthrough


The PR modifies buffer locality detection in the tile operation lowering pass to treat var-mapped buffers as local, and adds a test case verifying vectorization behavior with var buffers during parallel execution.

Changes

Compiler: Buffer Locality Check (src/transform/lower_tile_op.cc)
Tightens non-local buffer detection by changing the IsLocalBuffer call to pass allow_var=true, treating var-mapped locals as local while preserving the fragment-buffer exclusion.

Test: Parallel Vectorization (testing/python/language/test_tilelang_language_parallel.py)
Introduces a _parallel_vectorize_local_and_var kernel with fragment and var buffers, plus a test_parallel_vectorize_var test that verifies vectorization is not applied (it asserts the absence of "float2" in the kernel source).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes


Poem

🐰 Twitches whiskers with delight
A buffer here, a var there,
Locals nested with such care,
Vectorize when none are near,
This change makes logic crystalline and bright! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title clearly and specifically describes the main change (treating local.var buffers as local when deciding vectorization for stable actions) and directly relates to the primary modifications in both files.


@LeiWang1999
Member Author

@regression-perf

@coderabbitai
Contributor

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@testing/python/language/test_tilelang_language_parallel.py`:
- Around line 69-77: The test uses the low-level T.parallel() in
_parallel_vectorize_local_and_var while other tests use the TileLang high-level
T.Parallel(); update the for loop in _parallel_vectorize_local_and_var to use
T.Parallel() instead of T.parallel() so the file remains consistent with the
TileLang API (search for the function _parallel_vectorize_local_and_var and
replace the T.parallel(...) usage with T.Parallel(...)).
🧹 Nitpick comments (1)
testing/python/language/test_tilelang_language_parallel.py (1)

79-82: Consider adding a CUDA-availability guard.

Other tests in this file skip gracefully when CUDA is unavailable (_require_cuda_tensor). If get_kernel_source() requires a CUDA-capable environment for compilation, this test will hard-fail in CPU-only CI. Consider wrapping it with a skip check similar to the other tests.

Comment on lines +69 to +77

@tilelang.jit
def _parallel_vectorize_local_and_var():
    with T.Kernel(1) as _:
        x = T.alloc_fragment([256], T.float32)
        y = T.alloc_fragment([256], T.float32)
        z = T.alloc_var(T.float32)
        for i in T.parallel(256):
            y[i] = x[i] * z


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts executed (in the tile-ai/tilelang repository; outputs omitted):

# First, find and read the test file to verify the actual casing
fd test_tilelang_language_parallel.py --type f

# Read the specific lines mentioned in the review
if [ -f "testing/python/language/test_tilelang_language_parallel.py" ]; then
  echo "=== Lines 69-77 (the disputed code) ===" && \
  sed -n '69,77p' testing/python/language/test_tilelang_language_parallel.py && \
  echo -e "\n=== Lines 18, 33, 36 (for comparison) ===" && \
  sed -n '18p;33p;36p' testing/python/language/test_tilelang_language_parallel.py
fi

# Search for T.parallel and T.Parallel definitions in the codebase
rg "def parallel|def Parallel" --type py -C 2

# Check how T is imported and what it refers to
rg "^import T|^from.*import.*T|^T = " testing/python/language/test_tilelang_language_parallel.py -C 1

# Check the imports at the top of the test file
head -30 testing/python/language/test_tilelang_language_parallel.py

# Search for T usage and imports more broadly
rg "^import|^from" testing/python/language/test_tilelang_language_parallel.py | head -20

# Check what the T module contains and where Parallel is accessible from
cat -n tilelang/language/loop.py | head -50

# Check if parallel is aliased to Parallel or available in the same module
rg "parallel|Parallel" tilelang/language/__init__.py -C 3

# Check what else is exported from __init__.py, especially any lowercase parallel
rg "parallel" tilelang/language/__init__.py

# Check if there's an alias or if parallel is defined elsewhere
rg "^parallel" tilelang/language/ -A 2

# Check loop.py more thoroughly for any lowercase parallel
rg -i "parallel" tilelang/language/loop.py | grep -v "^\s*#"

# Check if parallel exists in tir/ir.py
cat -n tilelang/language/tir/ir.py | grep -A 10 "^.*def parallel"

# Check if tir/ir.py or parallel (lowercase) is imported in __init__.py
cat tilelang/language/__init__.py

# Verify the full tir/ir.py file to understand its structure
wc -l tilelang/language/tir/ir.py
head -100 tilelang/language/tir/ir.py

# Check if parallel is explicitly exported from the tir module
rg "from.*tir" tilelang/language/__init__.py -C 3

# Verify the function signatures one more time
grep -A 5 "def parallel(" tilelang/language/tir/ir.py | head -10
grep -A 5 "def Parallel(" tilelang/language/loop.py | head -10

# Check other uses of T.parallel vs T.Parallel in the test file
rg "T\.parallel|T\.Parallel" testing/python/language/test_tilelang_language_parallel.py -n


Use T.Parallel() for consistency with the rest of the file.

Line 75 uses T.parallel() (the low-level TVM IR API), while lines 18, 33, and 36 use T.Parallel() (the TileLang high-level API). Both are valid, but the file should use T.Parallel() consistently for readability and to maintain the TileLang abstraction level.


@github-actions

Performance Regression Test Report

Triggered by: @LeiWang1999
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/21894068859

Results

File Original Latency Current Latency Speedup
example_mha_fwd_bhsd_wgmma_pipelined 0.014214 0.0142283 0.998996
example_mha_fwd_bshd_wgmma_pipelined 0.0145731 0.0145682 1.00034
example_per_token_cast_to_fp8 0.00743505 0.00743009 1.00067
example_dequant_gemv_fp16xint4 0.0284735 0.0284125 1.00215
example_convolution_autotune 0.997592 0.994989 1.00262
sparse_mla_fwd 0.132359 0.131918 1.00334
example_mha_fwd_bhsd 0.0111702 0.0111233 1.00422
example_mha_sink_fwd_bhsd_sliding_window 0.0158059 0.015736 1.00445
example_group_per_split_token_cast_to_fp8 0.010437 0.0103889 1.00462
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0155948 0.0155218 1.00471
example_gemm 0.0228896 0.0227737 1.00509
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0147006 0.0146233 1.00529
sparse_mla_fwd_pipelined 0.0970263 0.0964933 1.00552
example_mha_sink_fwd_bhsd 0.0160206 0.0159322 1.00555
example_tilelang_nsa_decode 0.00742407 0.00738251 1.00563
sparse_mla_bwd 0.426907 0.424478 1.00572
example_mha_sink_bwd_bhsd_sliding_window 0.0452462 0.0449876 1.00575
topk_selector 0.0544493 0.0541261 1.00597
example_tilelang_nsa_fwd 0.00704928 0.00700675 1.00607
example_tilelang_block_sparse_attn 0.0102338 0.0101715 1.00612
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.0155616 0.0154569 1.00677
fp8_lighting_indexer 0.0360557 0.0358128 1.00678
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.0145782 0.0144773 1.00697
example_warp_specialize_gemm_barrierpipe_stage2 0.0398239 0.0395465 1.00701
example_tilelang_sparse_gqa_decode_varlen_indice 0.0172615 0.0171324 1.00754
example_mha_fwd_varlen 0.0454873 0.0451415 1.00766
example_mha_bwd_bshd_wgmma_pipelined 0.025725 0.0255245 1.00786
example_mha_fwd_bshd 0.0261507 0.025938 1.0082
example_mha_sink_bwd_bhsd 0.0629183 0.062401 1.00829
example_blocksparse_gemm 0.0228686 0.0226796 1.00833
example_tilelang_sparse_gqa_decode_varlen_mask 0.0237069 0.023504 1.00863
example_dynamic 0.65774 0.651626 1.00938
example_gemm_schedule 0.032584 0.032269 1.00976
example_gqa_sink_bwd_bhsd_sliding_window 0.0258904 0.0256192 1.01059
example_dequant_gemm_bf16_mxfp4_hopper 0.516452 0.51099 1.01069
example_gemm_autotune 0.0226019 0.0223607 1.01079
example_convolution 1.34841 1.33393 1.01085
example_dequant_gemm_bf16_fp4_hopper 0.582536 0.576019 1.01131
example_gqa_decode 0.0482775 0.0477303 1.01147
example_warp_specialize_gemm_softpipe_stage2 0.0387425 0.0383031 1.01147
example_tilelang_gemm_fp8 0.322473 0.318774 1.0116
example_gqa_sink_bwd_bhsd 0.0421709 0.041687 1.01161
example_gqa_fwd_bshd_wgmma_pipelined 0.0559108 0.055266 1.01167
tilelang_example_sparse_tensorcore 0.0150815 0.0149051 1.01183
example_warp_specialize_gemm_copy_1_gemm_0 0.0387523 0.0382955 1.01193
example_elementwise_add 0.297675 0.29411 1.01212
example_mha_bwd_bshd 0.0412775 0.0407748 1.01233
example_warp_specialize_gemm_copy_0_gemm_1 0.0392552 0.0387737 1.01242
block_sparse_attn_tilelang 0.0102939 0.0101658 1.01261
example_linear_attn_fwd 0.0370073 0.036543 1.01271
example_gqa_bwd_wgmma_pipelined 0.0698217 0.0689052 1.0133
example_mla_decode 0.467796 0.461514 1.01361
example_linear_attn_bwd 0.153644 0.151418 1.0147
example_mha_bwd_bhsd 0.0406717 0.0400818 1.01472
example_gqa_fwd_bshd 0.0720494 0.0709887 1.01494
example_tilelang_gemm_splitk 1.42296 1.40165 1.0152
example_tilelang_gemm_splitk_vectorize_atomicadd 1.42245 1.40098 1.01533
example_fusedmoe_tilelang 0.13349 0.131435 1.01564
example_mha_inference 0.0802538 0.078978 1.01615
example_dequant_groupedgemm_bf16_mxfp4_hopper 3.41103 3.3523 1.01752
example_dequant_gemm_w4a8 5.39797 5.30447 1.01763
example_gqa_bwd 0.0498308 0.0489481 1.01803
example_gemm_intrinsics 0.0354379 0.0348004 1.01832
example_topk 0.0111042 0.0108973 1.01899
example_gqa_bwd_tma_reduce_varlen 0.0523578 0.0513782 1.01907
example_vertical_slash_sparse_attn 0.236832 0.231708 1.02212
example_gemv 0.288407 0.281803 1.02343
example_tilelang_gemm_fp8_2xAcc 0.188315 0.18397 1.02362
example_tilelang_gemm_fp8_intrinsic 0.934314 0.911959 1.02451
example_dequant_gemm_fp4_hopper 1.06555 1.03482 1.0297
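The Speedup column is the original latency divided by the current latency, so values above 1.0 mean the current commit is faster. A minimal sketch reproducing the first row of the table:

```python
def speedup(original_latency: float, current_latency: float) -> float:
    # Speedup > 1.0: the current build is faster than the original.
    return original_latency / current_latency


# First row: example_mha_fwd_bhsd_wgmma_pipelined
s = speedup(0.014214, 0.0142283)
```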

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@LeiWang1999 LeiWang1999 merged commit e666d2d into tile-ai:main Feb 11, 2026
7 checks passed