[Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies#1657
[Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies#1657LeiWang1999 merged 9 commits intomainfrom
Conversation
…ty (#1652) Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA versions above 12.9 and earlier versions. This change ensures that 256-bit loads and stores are handled correctly across different CUDA versions, improving performance and compatibility. The implementation now uses two 128-bit loads/stores for older versions, enhancing the robustness of the codebase.
Clarified comments in `ld_global_256` and `st_global_256` functions to indicate that the fallback for CUDA versions below 12.9 may have performance regressions. This change enhances code readability and provides better context for developers working with different CUDA versions.
- Updated the TVM submodule to the latest commit. - Improved the handling of LetStmt in the inject_pipeline.cc file to account for transitive dependencies on loop variables, ensuring correct variable substitution in rewritten blocks. - Adjusted test_tilelang_issue_1263.py to remove unnecessary jit decorator and updated the kernel compilation process with specific pass configurations.
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
📝 WalkthroughWalkthroughThis PR updates a TVM submodule reference, removes autotuning infrastructure from an elementwise example, and refines pipeline injection logic to correctly handle transitive LetStmt dependencies by iterating wrappers in reverse order for proper scoping. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR fixes a variable scoping bug in the InjectSoftwarePipeline transformation pass that occurred when LetStmt variables had transitive dependencies on loop variables through other LetStmt definitions.
Changes:
- Track transitive dependencies by maintaining a set of all variables dependent on the loop variable, not just direct dependencies
- Reverse the iteration order when wrapping LetStmts to ensure proper variable scoping (earlier definitions scope over later ones)
- Add comprehensive test case with multiple compilation configurations to verify the fix
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
testing/python/issue/test_tilelang_issue_1263.py |
Updated test to use tilelang.compile() with multiple configurations instead of @tilelang.jit() decorator |
src/transform/inject_pipeline.cc |
Fixed variable scoping by tracking transitive LetStmt dependencies and reversing LetStmt wrapping order |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| tilelang.compile(test_kernel(1024, 1024)) | ||
| tilelang.compile( | ||
| test_kernel(1024, 1024), |
There was a problem hiding this comment.
The test case validates compilation succeeds but doesn't verify the correctness of the variable scoping fix. Consider adding assertions to check that the IR transformation produces the expected LetStmt nesting structure, or add a runtime test that would fail with incorrect scoping.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @3rdparty/tvm:
- Line 1: The PR references the wrong commit for the InjectSoftwarePipeline fix;
find the actual commit that modifies inject_pipeline.cc and addresses transitive
LetStmt dependencies (the change should touch
InjectSoftwarePipeline/inject_pipeline.cc), then update the referenced commit
hash and message wherever it is documented or cited in the PR/changelog to point
to that correct commit and ensure the description matches the actual diff (i.e.,
confirm the commit modifies inject_pipeline.cc and fixes transitive LetStmt
dependency handling in InjectSoftwarePipeline).
🧹 Nitpick comments (1)
examples/elementwise/example_elementwise_add.py (1)
31-31: Remove unuseduse_autotuneparameter.The
use_autotuneparameter is no longer used in the function body after removing the autotuning functionality. This dead parameter should be removed for clarity.♻️ Suggested fix
-def main(M=1024, N=1024, use_autotune=False): +def main(M=1024, N=1024):
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
3rdparty/tvmexamples/elementwise/example_elementwise_add.pyexamples/elementwise/test_example_elementwise.pysrc/transform/inject_pipeline.cctesting/python/issue/test_tilelang_issue_1263.py
💤 Files with no reviewable changes (1)
- examples/elementwise/test_example_elementwise.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/issue/test_tilelang_issue_1263.py
📚 Learning: 2026-01-06T05:20:51.649Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1606
File: testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py:30-30
Timestamp: 2026-01-06T05:20:51.649Z
Learning: In `testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py`, the test validates that the `hoist_broadcast_values` transformation pass correctly identifies and hoists broadcast operations by checking for patterns in the generated kernel source code. The specific literal values used (e.g., 430) are not important for the test's purpose, as it does not validate numerical precision or actual stored tensor values.
Applied to files:
testing/python/issue/test_tilelang_issue_1263.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/issue/test_tilelang_issue_1263.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (4)
testing/python/issue/test_tilelang_issue_1263.py (1)
27-34: LGTM! Good test coverage for the transitive dependency fix.The test correctly exercises the transitive LetStmt dependency pattern (
id = ids[i]→id2 = ids2[id]) that this PR fixes. Testing with both default settings and with warp specialization/TMA lowering disabled provides good coverage of different code paths.src/transform/inject_pipeline.cc (2)
841-861: LGTM! Correct reverse iteration for proper LetStmt scoping.The reverse iteration ensures that earlier LetStmt definitions (e.g.,
id = ids[i]) correctly scope over later ones that depend on them (e.g.,id2 = ids2[id]). The explanatory comments clearly document the expected nesting behavior.
1089-1118: LGTM! Proper transitive dependency tracking for LetStmts.The implementation correctly handles transitive dependencies by building a
dependent_varsset that includes both the loop variable and all previously recorded loop-var-dependent variables. This ensures chains likeid = ids[i]→id2 = ids2[id]are both captured and later substituted correctly.examples/elementwise/example_elementwise_add.py (1)
35-38: LGTM!The simplified kernel invocation with fixed configuration values is appropriate for an example file.
…nsitive LetStmt dependencies (tile-ai#1657) * [Enhancement] Update global load/store functions for CUDA compatibility (tile-ai#1652) Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA versions above 12.9 and earlier versions. This change ensures that 256-bit loads and stores are handled correctly across different CUDA versions, improving performance and compatibility. The implementation now uses two 128-bit loads/stores for older versions, enhancing the robustness of the codebase. * Update comments in global load/store functions for CUDA compatibility Clarified comments in `ld_global_256` and `st_global_256` functions to indicate that the fallback for CUDA versions below 12.9 may have performance regressions. This change enhances code readability and provides better context for developers working with different CUDA versions. * Update submodule and enhance LetStmt handling in inject_pipeline.cc - Updated the TVM submodule to the latest commit. - Improved the handling of LetStmt in the inject_pipeline.cc file to account for transitive dependencies on loop variables, ensuring correct variable substitution in rewritten blocks. - Adjusted test_tilelang_issue_1263.py to remove unnecessary jit decorator and updated the kernel compilation process with specific pass configurations. * lint fix * revert tvm * remove unused test * test fix
* finish KDA algorithm in tilelang * fix pre-commit.ci * fix pre-commit.ci * fix pre-commit local * [Style] Fix some code styles * [Refactor] Remove redundant swizzle for they can be automatically done * [Refactor] remove chunk_bwd_intra.py and rename chunk_bwd_intra_op.py and do some fix form coderabbitai * update ruff * update pre-commit * [Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case (#1654) * Add unroll loop functionality and corresponding test case - Introduced a new `UnrollLoop` function in the transform module to unroll loops based on various configuration options. - Added a test case in `test_tilelang_language_unroll.py` to validate the behavior of `T.unroll` with only the extent parameter, ensuring correct kernel generation with unroll pragmas. * Refactor unroll kernel implementation and update test case - Changed the kernel function in `test_tilelang_language_unroll.py` to use a new `unroll_kernel` function that compiles and returns the output tensor, improving clarity and structure. - Updated the `OptimizeForTarget` function in `phase.py` to ensure the `UnrollLoop` transformation is applied correctly, maintaining consistency in optimization phases. * lint fix * lint fix * [Bugfix] Fix missing annotations for default CallNode Visitor (#1659) tvm fix * [Clean] Remove unnecessary debug print (#1661) remove unnecessary debug print * [Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies (#1657) * [Enhancement] Update global load/store functions for CUDA compatibility (#1652) Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA versions above 12.9 and earlier versions. This change ensures that 256-bit loads and stores are handled correctly across different CUDA versions, improving performance and compatibility. The implementation now uses two 128-bit loads/stores for older versions, enhancing the robustness of the codebase. * Update comments in global load/store functions for CUDA compatibility Clarified comments in `ld_global_256` and `st_global_256` functions to indicate that the fallback for CUDA versions below 12.9 may have performance regressions. This change enhances code readability and provides better context for developers working with different CUDA versions. * Update submodule and enhance LetStmt handling in inject_pipeline.cc - Updated the TVM submodule to the latest commit. - Improved the handling of LetStmt in the inject_pipeline.cc file to account for transitive dependencies on loop variables, ensuring correct variable substitution in rewritten blocks. - Adjusted test_tilelang_issue_1263.py to remove unnecessary jit decorator and updated the kernel compilation process with specific pass configurations. * lint fix * revert tvm * remove unused test * test fix * [Refactor] Improve CallNode handling to include annotations in various operations (#1663) * [Enhancement] Update CallNode handling to include annotations in various operations - Modified CallNode invocations in multiple files to ensure that annotations are passed correctly, enhancing the consistency and functionality of the codebase. - Removed the "use_tma" annotation from AtomicAddNode and adjusted related calls to maintain expected behavior. - Updated CUDA intrinsic dispatch functions to include annotations, improving compatibility and correctness in CUDA operations. * lint fix * [EagerJIT] Add Support for Parameter Only Kernel Compilation (#1664) * [Fix] Refactor type hint extraction logic in DSLMutator for better clarity and handling of annotations * [Refactor] Remove redundant tensor creation in loop layout tests and update kernel compilation parameters * [AutoDD] Add Tilelang AutoDD to Reduce Buggy Program (#1639) * [Feat] Add tilelang autodd for delta debugging * fix typos * fix lint error * fix typos * fix lint error * fix bugs * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix codeview comments * [Refactor] Move AutoDD detection to env module and update import logic * Refactor: Relocate the _is_running_autodd function to the env module for better organization and encapsulation. * Update initialization logic to skip logger and heavy imports based on a new light import mode, enhancing flexibility in module usage. * Ensure consistent handling of environment variables across the package, improving overall code clarity and maintainability. * [Documentation] Add AutoDD section to debug_tools_for_tilelang.md * Introduced a comprehensive guide on AutoDD (Automatic Delta Debugging) for isolating bugs in TileLang programs. * Explained Delta Debugging methodology, usage, parameters, and provided examples for clarity. * Highlighted the benefits of using AutoDD for large codebases and hard-to-locate errors, emphasizing time-saving aspects. * Included tips for effective usage and a reference to a complete example in the documentation. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: kurisu6912 <227995639+kurisu6912@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * rebase origin * [Feature] Support `cp.reduce.async.bulk.tensor` (#1667) * support cp.reduce.async.bulk.tensor and add test * Refactor flash attention example by removing unnecessary layout annotations * support swizzle layout for tma reduce * auto swizzle for non-1d tma atomic add * upd example and test * lint * typo * add constraint for test * Refactor CUDA data type mapping by moving the to_CUtensorMapDataType function to utils.cc and utils.h, while removing redundant definitions from atomic_add.cc and copy.cc. * lint * rename basename according to CI * Update submodule TVM and remove deprecated KDA example files - Updated the TVM submodule to commit 354eef9a. - Removed several outdated KDA example files and utility scripts that are no longer in use, including chunk_bwd_dqkwg.py, chunk_bwd_dv.py, chunk_bwd_gla_dA.py, chunk_bwd_intra.py, chunk_delta_bwd.py, chunk_delta_h_fwd.py, chunk_inter_solve_fused.py, chunk_intra_token_parallel.py, chunk_o.py, README.md, test_utils_kda.py, wy_fast_bwd.py, wy_fast.py, and various FLA_KDA implementations. * lint fix --------- Co-authored-by: wufang <wufang@MBP-MK6VR66Y2M-2329.local> Co-authored-by: tzj-fxz <tzjfxz@gmail.com> Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: Kuris <227995639+kurisu6912@users.noreply.github.com> Co-authored-by: Kexing Zhou <KEKE_046@pku.edu.cn> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com> Co-authored-by: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Summary
InjectSoftwarePipelinepass where variables with transitive dependencies on loop variables were incorrectly scopedid = ids[i],id2 = ids2[id]) correctlyProblem
When using pipelined loops with chained variable definitions like:
After the
InjectSoftwarePipelinetransformation, the variableidwould become "floating" (undefined in scope) because:id2's definition was placed at an outer scopeid's definition (viaLetStmt) was inside an inner scopeid2to reference an undefined variableidRoot Cause
Two issues in
src/transform/inject_pipeline.cc:Only direct loop variable dependencies were tracked: The code only checked if a LetStmt's value directly used the loop variable, ignoring transitive dependencies through other LetStmts.
Wrong LetStmt wrapping order: When building the LetStmt chain, forward iteration produced incorrect nesting where inner variables were referenced by outer definitions.
Solution
Track transitive dependencies: Now collects all variables from previously recorded loop-var-dependent LetStmts and checks if current LetStmt depends on any of them.
Reverse iteration for LetStmt wrapping: Iterate in reverse order so that earlier definitions scope over later ones, producing correct nesting:
Test plan
test_tilelang_issue_1263.pythat reproduces the issue🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
Bug Fixes
Refactor
Chores
✏️ Tip: You can customize this high-level summary in your review settings.