Skip to content

[Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies#1657

Merged
LeiWang1999 merged 9 commits intomainfrom
fallback_0112
Jan 13, 2026
Merged

[Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies#1657
LeiWang1999 merged 9 commits intomainfrom
fallback_0112

Conversation

@LeiWang1999
Copy link
Member

@LeiWang1999 LeiWang1999 commented Jan 12, 2026

Summary

  • Fix variable scoping bug in InjectSoftwarePipeline pass where variables with transitive dependencies on loop variables were incorrectly scoped
  • Handle chained LetStmt dependencies (e.g., id = ids[i], id2 = ids2[id]) correctly

Problem

When using pipelined loops with chained variable definitions like:

for i in T.Pipelined(4, num_stages=1):
    id = ids[i]        # depends on loop var i
    id2 = ids2[id]     # depends on id (transitively on loop var)
    T.copy(KV[id2, :], A)

After the InjectSoftwarePipeline transformation, the variable id would become "floating" (undefined in scope) because:

  1. id2's definition was placed at an outer scope
  2. id's definition (via LetStmt) was inside an inner scope
  3. This caused id2 to reference an undefined variable id

Root Cause

Two issues in src/transform/inject_pipeline.cc:

  1. Only direct loop variable dependencies were tracked: The code only checked if a LetStmt's value directly used the loop variable, ignoring transitive dependencies through other LetStmts.

  2. Wrong LetStmt wrapping order: When building the LetStmt chain, forward iteration produced incorrect nesting where inner variables were referenced by outer definitions.

Solution

  1. Track transitive dependencies: Now collects all variables from previously recorded loop-var-dependent LetStmts and checks if current LetStmt depends on any of them.

  2. Reverse iteration for LetStmt wrapping: Iterate in reverse order so that earlier definitions scope over later ones, producing correct nesting:

    LetStmt(id, ids[...],
      LetStmt(id2, ids2[id],
        body))
    

Test plan

  • Added test case test_tilelang_issue_1263.py that reproduces the issue
  • Verified the IR transformation produces correct variable scoping

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed variable dependency handling in pipeline injection for improved correctness.
  • Refactor

    • Simplified elementwise example by removing autotuning capability.
  • Chores

    • Updated submodule dependencies.

✏️ Tip: You can customize this high-level summary in your review settings.

…ty (#1652)

Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA versions above 12.9 and earlier versions. This change ensures that 256-bit loads and stores are handled correctly across different CUDA versions, improving performance and compatibility. The implementation now uses two 128-bit loads/stores for older versions, enhancing the robustness of the codebase.
Clarified comments in `ld_global_256` and `st_global_256` functions to indicate that the fallback for CUDA versions below 12.9 may have performance regressions. This change enhances code readability and provides better context for developers working with different CUDA versions.
- Updated the TVM submodule to the latest commit.
- Improved the handling of LetStmt in the inject_pipeline.cc file to account for transitive dependencies on loop variables, ensuring correct variable substitution in rewritten blocks.
- Adjusted test_tilelang_issue_1263.py to remove unnecessary jit decorator and updated the kernel compilation process with specific pass configurations.
@github-actions
Copy link

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 12, 2026

📝 Walkthrough

Walkthrough

This PR updates a TVM submodule reference, removes autotuning infrastructure from an elementwise example, and refines pipeline injection logic to correctly handle transitive LetStmt dependencies by iterating wrappers in reverse order for proper scoping.

Changes

Cohort / File(s) Summary
Submodule Update
3rdparty/tvm
Updated TVM submodule pointer to new commit da7f19b6908045a1f9bf94cb7e044beaa32421b6.
Autotuning Removal
examples/elementwise/example_elementwise_add.py, examples/elementwise/test_example_elementwise.py
Removed @tilelang.autotune decorator, eliminated get_configs() function, dropped use_autotune parameter from main(), and removed test_example_elementwise_add_autotune() test.
Pipeline Injection Logic
src/transform/inject_pipeline.cc
Reversed iteration order when wrapping rewritten blocks with LetStmt substitutions; enhanced LetStmt dependency detection to account for transitive variable dependencies beyond direct loop variable usage.
Test Configuration Update
testing/python/issue/test_tilelang_issue_1263.py
Replaced JIT decorator with explicit tilelang.compile() calls using and without specific pass configurations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • tile-ai/tilelang#1399: Modifies pipeline injection logic in src/transform/inject_pipeline.cc with LetStmt wrapping and dependency handling similar to this PR.
  • tile-ai/tilelang#1525: Refactors PipelineInjector/PipelineBodyRewriter behavior in src/transform/inject_pipeline.cc alongside buffer allocation changes.
  • tile-ai/tilelang#1445: Directly modifies examples/elementwise/example_elementwise_add.py autotuning infrastructure and decorator handling.

Suggested reviewers

  • kurisu6912
  • SiriusNEO

Poem

🐰 In pipelines deep where Let statements flow,
We chase dependencies, watching them grow,
Reverse the order, let scoping be right,
Autotuning fades—pure simplicity's light! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main change: fixing a variable scoping bug in InjectSoftwarePipeline for transitive LetStmt dependencies, which aligns with the primary modification in src/transform/inject_pipeline.cc and the new test case.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@LeiWang1999 LeiWang1999 requested a review from Copilot January 12, 2026 08:52
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a variable scoping bug in the InjectSoftwarePipeline transformation pass that occurred when LetStmt variables had transitive dependencies on loop variables through other LetStmt definitions.

Changes:

  • Track transitive dependencies by maintaining a set of all variables dependent on the loop variable, not just direct dependencies
  • Reverse the iteration order when wrapping LetStmts to ensure proper variable scoping (earlier definitions scope over later ones)
  • Add comprehensive test case with multiple compilation configurations to verify the fix

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
testing/python/issue/test_tilelang_issue_1263.py Updated test to use tilelang.compile() with multiple configurations instead of @tilelang.jit() decorator
src/transform/inject_pipeline.cc Fixed variable scoping by tracking transitive LetStmt dependencies and reversing LetStmt wrapping order

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +27 to +29
tilelang.compile(test_kernel(1024, 1024))
tilelang.compile(
test_kernel(1024, 1024),
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test case validates compilation succeeds but doesn't verify the correctness of the variable scoping fix. Consider adding assertions to check that the IR transformation produces the expected LetStmt nesting structure, or add a runtime test that would fail with incorrect scoping.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @3rdparty/tvm:
- Line 1: The PR references the wrong commit for the InjectSoftwarePipeline fix;
find the actual commit that modifies inject_pipeline.cc and addresses transitive
LetStmt dependencies (the change should touch
InjectSoftwarePipeline/inject_pipeline.cc), then update the referenced commit
hash and message wherever it is documented or cited in the PR/changelog to point
to that correct commit and ensure the description matches the actual diff (i.e.,
confirm the commit modifies inject_pipeline.cc and fixes transitive LetStmt
dependency handling in InjectSoftwarePipeline).
🧹 Nitpick comments (1)
examples/elementwise/example_elementwise_add.py (1)

31-31: Remove unused use_autotune parameter.

The use_autotune parameter is no longer used in the function body after removing the autotuning functionality. This dead parameter should be removed for clarity.

♻️ Suggested fix
-def main(M=1024, N=1024, use_autotune=False):
+def main(M=1024, N=1024):
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd260e3 and 08cc5bf.

📒 Files selected for processing (5)
  • 3rdparty/tvm
  • examples/elementwise/example_elementwise_add.py
  • examples/elementwise/test_example_elementwise.py
  • src/transform/inject_pipeline.cc
  • testing/python/issue/test_tilelang_issue_1263.py
💤 Files with no reviewable changes (1)
  • examples/elementwise/test_example_elementwise.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/issue/test_tilelang_issue_1263.py
📚 Learning: 2026-01-06T05:20:51.649Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1606
File: testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py:30-30
Timestamp: 2026-01-06T05:20:51.649Z
Learning: In `testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py`, the test validates that the `hoist_broadcast_values` transformation pass correctly identifies and hoists broadcast operations by checking for patterns in the generated kernel source code. The specific literal values used (e.g., 430) are not important for the test's purpose, as it does not validate numerical precision or actual stored tensor values.

Applied to files:

  • testing/python/issue/test_tilelang_issue_1263.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/issue/test_tilelang_issue_1263.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (4)
testing/python/issue/test_tilelang_issue_1263.py (1)

27-34: LGTM! Good test coverage for the transitive dependency fix.

The test correctly exercises the transitive LetStmt dependency pattern (id = ids[i]id2 = ids2[id]) that this PR fixes. Testing with both default settings and with warp specialization/TMA lowering disabled provides good coverage of different code paths.

src/transform/inject_pipeline.cc (2)

841-861: LGTM! Correct reverse iteration for proper LetStmt scoping.

The reverse iteration ensures that earlier LetStmt definitions (e.g., id = ids[i]) correctly scope over later ones that depend on them (e.g., id2 = ids2[id]). The explanatory comments clearly document the expected nesting behavior.


1089-1118: LGTM! Proper transitive dependency tracking for LetStmts.

The implementation correctly handles transitive dependencies by building a dependent_vars set that includes both the loop variable and all previously recorded loop-var-dependent variables. This ensures chains like id = ids[i]id2 = ids2[id] are both captured and later substituted correctly.

examples/elementwise/example_elementwise_add.py (1)

35-38: LGTM!

The simplified kernel invocation with fixed configuration values is appropriate for an example file.

@LeiWang1999 LeiWang1999 merged commit 4fb96a0 into main Jan 13, 2026
5 of 6 checks passed
tzj-fxz pushed a commit to wfloveiu/tilelang that referenced this pull request Jan 14, 2026
…nsitive LetStmt dependencies (tile-ai#1657)

* [Enhancement] Update global load/store functions for CUDA compatibility (tile-ai#1652)

Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA versions above 12.9 and earlier versions. This change ensures that 256-bit loads and stores are handled correctly across different CUDA versions, improving performance and compatibility. The implementation now uses two 128-bit loads/stores for older versions, enhancing the robustness of the codebase.

* Update comments in global load/store functions for CUDA compatibility

Clarified comments in `ld_global_256` and `st_global_256` functions to indicate that the fallback for CUDA versions below 12.9 may have performance regressions. This change enhances code readability and provides better context for developers working with different CUDA versions.

* Update submodule and enhance LetStmt handling in inject_pipeline.cc

- Updated the TVM submodule to the latest commit.
- Improved the handling of LetStmt in the inject_pipeline.cc file to account for transitive dependencies on loop variables, ensuring correct variable substitution in rewritten blocks.
- Adjusted test_tilelang_issue_1263.py to remove unnecessary jit decorator and updated the kernel compilation process with specific pass configurations.

* lint fix

* revert tvm

* remove unused test

* test fix
LeiWang1999 added a commit that referenced this pull request Jan 26, 2026
* finish KDA algorithm in tilelang

* fix pre-commit.ci

* fix pre-commit.ci

* fix pre-commit local

* [Style] Fix some code styles

* [Refactor] Remove redundant swizzle for they can be automatically done

* [Refactor] remove chunk_bwd_intra.py and rename chunk_bwd_intra_op.py and do some fix form coderabbitai

* update ruff

* update pre-commit

* [Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case (#1654)

* Add unroll loop functionality and corresponding test case

- Introduced a new `UnrollLoop` function in the transform module to unroll loops based on various configuration options.
- Added a test case in `test_tilelang_language_unroll.py` to validate the behavior of `T.unroll` with only the extent parameter, ensuring correct kernel generation with unroll pragmas.

* Refactor unroll kernel implementation and update test case

- Changed the kernel function in `test_tilelang_language_unroll.py` to use a new `unroll_kernel` function that compiles and returns the output tensor, improving clarity and structure.
- Updated the `OptimizeForTarget` function in `phase.py` to ensure the `UnrollLoop` transformation is applied correctly, maintaining consistency in optimization phases.

* lint fix

* lint fix

* [Bugfix] Fix missing annotations for default CallNode Visitor (#1659)

tvm fix

* [Clean] Remove unnecessary debug print (#1661)

remove unnecessary debug print

* [Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies (#1657)

* [Enhancement] Update global load/store functions for CUDA compatibility (#1652)

Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA versions above 12.9 and earlier versions. This change ensures that 256-bit loads and stores are handled correctly across different CUDA versions, improving performance and compatibility. The implementation now uses two 128-bit loads/stores for older versions, enhancing the robustness of the codebase.

* Update comments in global load/store functions for CUDA compatibility

Clarified comments in `ld_global_256` and `st_global_256` functions to indicate that the fallback for CUDA versions below 12.9 may have performance regressions. This change enhances code readability and provides better context for developers working with different CUDA versions.

* Update submodule and enhance LetStmt handling in inject_pipeline.cc

- Updated the TVM submodule to the latest commit.
- Improved the handling of LetStmt in the inject_pipeline.cc file to account for transitive dependencies on loop variables, ensuring correct variable substitution in rewritten blocks.
- Adjusted test_tilelang_issue_1263.py to remove unnecessary jit decorator and updated the kernel compilation process with specific pass configurations.

* lint fix

* revert tvm

* remove unused test

* test fix

* [Refactor] Improve CallNode handling to include annotations in various operations (#1663)

* [Enhancement] Update CallNode handling to include annotations in various operations

- Modified CallNode invocations in multiple files to ensure that annotations are passed correctly, enhancing the consistency and functionality of the codebase.
- Removed the "use_tma" annotation from AtomicAddNode and adjusted related calls to maintain expected behavior.
- Updated CUDA intrinsic dispatch functions to include annotations, improving compatibility and correctness in CUDA operations.

* lint fix

* [EagerJIT] Add Support for Parameter Only Kernel Compilation (#1664)

* [Fix] Refactor type hint extraction logic in DSLMutator for better clarity and handling of annotations

* [Refactor] Remove redundant tensor creation in loop layout tests and update kernel compilation parameters

* [AutoDD] Add Tilelang AutoDD to Reduce Buggy Program (#1639)

* [Feat] Add tilelang autodd for delta debugging

* fix typos

* fix lint error

* fix typos

* fix lint error

* fix bugs

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix codeview comments

* [Refactor] Move AutoDD detection to env module and update import logic

* Refactor: Relocate the _is_running_autodd function to the env module for better organization and encapsulation.
* Update initialization logic to skip logger and heavy imports based on a new light import mode, enhancing flexibility in module usage.
* Ensure consistent handling of environment variables across the package, improving overall code clarity and maintainability.

* [Documentation] Add AutoDD section to debug_tools_for_tilelang.md

* Introduced a comprehensive guide on AutoDD (Automatic Delta Debugging) for isolating bugs in TileLang programs.
* Explained Delta Debugging methodology, usage, parameters, and provided examples for clarity.
* Highlighted the benefits of using AutoDD for large codebases and hard-to-locate errors, emphasizing time-saving aspects.
* Included tips for effective usage and a reference to a complete example in the documentation.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: kurisu6912 <227995639+kurisu6912@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* rebase origin

* [Feature] Support `cp.reduce.async.bulk.tensor` (#1667)

* support cp.reduce.async.bulk.tensor and add test

* Refactor flash attention example by removing unnecessary layout annotations

* support swizzle layout for tma reduce

* auto swizzle for non-1d tma atomic add

* upd example and test

* lint

* typo

* add constraint for test

* Refactor CUDA data type mapping by moving the to_CUtensorMapDataType function to utils.cc and utils.h, while removing redundant definitions from atomic_add.cc and copy.cc.

* lint

* rename basename according to CI

* Update submodule TVM and remove deprecated KDA example files

- Updated the TVM submodule to commit 354eef9a.
- Removed several outdated KDA example files and utility scripts that are no longer in use, including chunk_bwd_dqkwg.py, chunk_bwd_dv.py, chunk_bwd_gla_dA.py, chunk_bwd_intra.py, chunk_delta_bwd.py, chunk_delta_h_fwd.py, chunk_inter_solve_fused.py, chunk_intra_token_parallel.py, chunk_o.py, README.md, test_utils_kda.py, wy_fast_bwd.py, wy_fast.py, and various FLA_KDA implementations.

* lint fix

---------

Co-authored-by: wufang <wufang@MBP-MK6VR66Y2M-2329.local>
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: Kuris <227995639+kurisu6912@users.noreply.github.com>
Co-authored-by: Kexing Zhou <KEKE_046@pku.edu.cn>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
Co-authored-by: Tong WU <109033598+Rachmanino@users.noreply.github.com>
@LeiWang1999 LeiWang1999 deleted the fallback_0112 branch February 3, 2026 08:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant