
[Refactor] Improve CallNode handling to include annotations in various operations #1663

Merged
LeiWang1999 merged 2 commits into tile-ai:main from LeiWang1999:enhance_0113 on Jan 13, 2026
Conversation


@LeiWang1999 LeiWang1999 commented Jan 13, 2026

as title.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed annotation handling in atomic operations to prevent unwanted metadata propagation.
    • Improved annotation preservation through CUDA dispatch and code transformation passes.
    • Added null-check for call validation in CUDA shuffle operations.
  • Chores

    • Updated third-party dependencies.


[Enhancement] Update CallNode handling to include annotations in various operations

- Modified CallNode invocations in multiple files to ensure that annotations are passed correctly, enhancing the consistency and functionality of the codebase.
- Removed the "use_tma" annotation from AtomicAddNode and adjusted related calls to maintain expected behavior.
- Updated CUDA intrinsic dispatch functions to include annotations, improving compatibility and correctness in CUDA operations.
@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai bot commented Jan 13, 2026

📝 Walkthrough

Updates annotation handling across TVM IR transformations to ensure Call node annotations are properly propagated through CUDA intrinsic dispatchers and barrier injection passes. Filters "use_tma" annotation from atomic operations while preserving it in node metadata. Includes TVM submodule reference update.
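To make the pattern concrete, here is a minimal, hypothetical sketch of what "propagating annotations through a Call reconstruction" looks like. It is not code from this PR, and it assumes the 4-argument Call constructor with an annotations map described in the changes table below; the exact signature in the updated TVM submodule may differ.

```cpp
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

// Hypothetical helper, not literal code from this PR: rebuild a call with new
// arguments while carrying the original node's annotations forward. Before
// this change, rewrites typically constructed Call(op->dtype, op->op, new_args),
// which dropped any annotations attached to the original CallNode.
PrimExpr RebuildWithAnnotations(const CallNode* op, Array<PrimExpr> new_args) {
  // Assumes a Call constructor overload whose fourth argument is the
  // annotations map, as used throughout this PR.
  return Call(op->dtype, op->op, new_args, op->annotations);
}
```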

Changes

  • TVM Submodule Update (3rdparty/tvm): Updated submodule commit reference from da7f19b6 to 0794c13a; no observable changes to code logic.
  • Call Annotation Propagation (src/target/intrin_rule_cuda.cc): DispatchCUDAWarpActiveMask and DispatchCUDAShuffle now propagate source call annotations to reconstructed Call nodes. Added a null-check for CallNode in DispatchCUDAShuffle.
  • TMA Barrier Annotation Propagation (src/transform/inject_tma_barrier.cc): All Call node reconstructions now include op->annotations as the fourth parameter across TmaExpectTxRewriter, BarrierCreationRewriter, and TmaBarrierRewriter, preserving metadata through rewrites.
  • Atomic Operation Annotation Filtering (src/op/atomic_add.cc): AtomicAddNode::MakeSIMTLoop and AtomicAddNode::Lower now copy annotations, remove the "use_tma" entry, and pass the filtered set to atomic/tma_store calls, preventing "use_tma" propagation to atomic operations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Poem

🐰 Annotations hop through IR trees,
From Call to Call, with practiced ease,
Filtering out what shouldn't be,
While metadata runs wild and free,
The CUDA code now knows its place!

🚥 Pre-merge checks: 2 passed, 1 warning

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 28.57%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately describes the main refactoring objective: improving CallNode handling to include annotations across various operations.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @src/target/intrin_rule_cuda.cc:
- Around lines 119-123: DispatchCUDAWarpActiveMask dereferences call without
checking the result of e.as<CallNode>(). Add the same null-check used in
DispatchCUDAShuffle by inserting ICHECK(call != nullptr) after obtaining
const CallNode *call = e.as<CallNode>(); this bails out early with a clear
invariant if e is not a CallNode, before call->dtype, call->args, or
call->annotations are used. Keep the rest of the function's Call(...)
construction intact.
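For illustration, a hedged sketch of how the suggested guard could sit in the dispatcher. The intrinsic construction shown here (a call_extern to __activemask) and the 4-argument Call constructor with annotations are assumptions; the actual body of DispatchCUDAWarpActiveMask in intrin_rule_cuda.cc may differ.

```cpp
#include <tvm/runtime/logging.h>
#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

// Illustrative sketch only; the intrinsic actually emitted by
// DispatchCUDAWarpActiveMask may differ from this call_extern example.
static PrimExpr DispatchCUDAWarpActiveMask(const PrimExpr& e) {
  const CallNode* call = e.as<CallNode>();
  // Bail out early with a clear invariant if `e` is not a CallNode,
  // mirroring the null-check already added to DispatchCUDAShuffle.
  ICHECK(call != nullptr);
  // Rebuild the call (here a hypothetical extern call to __activemask),
  // forwarding the original annotations as the assumed fourth argument.
  return Call(call->dtype, builtin::call_extern(),
              {StringImm("__activemask")}, call->annotations);
}
```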
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e90edc and d7e935c.

📒 Files selected for processing (4)
  • 3rdparty/tvm
  • src/op/atomic_add.cc
  • src/target/intrin_rule_cuda.cc
  • src/transform/inject_tma_barrier.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/inject_tma_barrier.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (7)
3rdparty/tvm (1)

1-1: Verify TVM API compatibility—internal code review passed, external verification required.

Internal review confirms consistent annotation handling across all modified files (src/op/atomic_add.cc, src/target/intrin_rule_cuda.cc, src/transform/inject_tma_barrier.cc). All Call constructors properly use the 4-argument form with annotations parameter. No security advisories found for the TVM commit.

However, external verification of TVM API changes between the old and new commits is still required to ensure the CallNode construction patterns remain compatible with the updated TVM version.

src/op/atomic_add.cc (2)

283-287: LGTM! Annotation filtering is correctly implemented.

The pattern of copying the annotations map, erasing "use_tma", and passing the filtered annotations to the atomicadd_elem_op call is correct. This ensures the use_tma annotation remains on the AtomicAddNode for decision-making (e.g., in GetUseTMA()) while not being forwarded to the underlying atomic operation.
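A minimal sketch of the copy-and-erase pattern described above. The helper name, the annotation map's value type (ObjectRef), and the argument list are assumptions for illustration; the real MakeSIMTLoop body differs.

```cpp
#include <tvm/ir/op.h>
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

// Hypothetical helper illustrating the filtering pattern: build an
// element-wise atomic-add call that receives every annotation from the
// parent node except "use_tma".
PrimExpr MakeFilteredAtomicAdd(const Op& atomicadd_elem_op, DataType dtype,
                               Array<PrimExpr> args,
                               Map<String, ObjectRef> annotations) {
  // Copy the node's annotations and drop "use_tma" so the flag stays on the
  // AtomicAddNode (e.g. for GetUseTMA()) without reaching the atomic call.
  Map<String, ObjectRef> filtered = annotations;
  filtered.erase("use_tma");
  // Assumes the 4-argument Call constructor with an annotations map.
  return Call(dtype, atomicadd_elem_op, args, filtered);
}
```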


396-403: LGTM! Consistent annotation handling in the TMA store path.

The same filtering pattern is correctly applied here: copying annotations, erasing "use_tma", and passing the filtered map to the tma_store call. This maintains consistency with MakeSIMTLoop and ensures proper annotation propagation throughout the lowering pipeline.

src/target/intrin_rule_cuda.cc (1)

125-133: LGTM! Proper null-check and annotation propagation.

The added ICHECK(call != nullptr) provides a clear diagnostic if the input is malformed, and the annotations are correctly forwarded to the new Call node. This aligns with the PR's goal of consistent annotation handling across CUDA intrinsic dispatchers.

src/transform/inject_tma_barrier.cc (3)

165-179: LGTM! Annotations preserved in TMA load rewrite.

The op->annotations are correctly propagated when reconstructing the Call node with the updated barrier argument. This ensures any metadata attached to the original TMA load call survives the rewrite.
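For reference, a minimal sketch of a mutator that swaps one argument while forwarding the original annotations. The class name, the matching logic (here: the last argument of every visited call), and the 4-argument Call constructor are assumptions; the actual TmaExpectTxRewriter is more selective about which intrinsics it rewrites.

```cpp
#include <tvm/tir/expr.h>
#include <tvm/tir/stmt_functor.h>

using namespace tvm;
using namespace tvm::tir;

// Illustrative mutator, not the literal TmaExpectTxRewriter: it shows the
// structural pattern of replacing one argument while keeping all other
// fields and the original annotations intact.
class BarrierArgRewriter : public StmtExprMutator {
 public:
  explicit BarrierArgRewriter(PrimExpr new_barrier) : new_barrier_(new_barrier) {}

  PrimExpr VisitExpr_(const CallNode* op) final {
    // Visit children first, then rebuild this call with the new barrier
    // argument (assumed, for illustration, to be the last argument).
    Call call = Downcast<Call>(StmtExprMutator::VisitExpr_(op));
    Array<PrimExpr> new_args = call->args;
    if (!new_args.empty()) {
      new_args.Set(new_args.size() - 1, new_barrier_);
    }
    // Forwarding call->annotations keeps metadata attached to the original
    // call alive through the rewrite (assumed 4-argument constructor).
    return Call(call->dtype, call->op, new_args, call->annotations);
  }

 private:
  PrimExpr new_barrier_;
};
```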


350-389: LGTM! Annotations preserved in barrier creation rewrite.

The op->annotations are correctly forwarded when constructing the new Call node with the modified barrier argument list. This maintains any metadata attached to the original create_list_of_mbarrier call.


507-568: LGTM! Consistent annotation propagation across all TMA barrier rewrite paths.

All Call node reconstructions in TmaBarrierRewriter::VisitExpr_ correctly include op->annotations, ensuring metadata is preserved through:

  • 1D TMA load promotion (line 524)
  • TMA load barrier ID assignment (line 540)
  • mbarrier_expect_tx to ptx_arrive_barrier_expect_tx conversion (line 555)
  • mbarrier_expect_tx argument update (line 557)
  • ptx_arrive_barrier passthrough (line 565)

@LeiWang1999 LeiWang1999 merged commit 29ece98 into tile-ai:main Jan 13, 2026
5 of 7 checks passed
tzj-fxz pushed a commit to wfloveiu/tilelang that referenced this pull request Jan 14, 2026
[Refactor] Improve CallNode handling to include annotations in various operations (tile-ai#1663)

* [Enhancement] Update CallNode handling to include annotations in various operations

- Modified CallNode invocations in multiple files to ensure that annotations are passed correctly, enhancing the consistency and functionality of the codebase.
- Removed the "use_tma" annotation from AtomicAddNode and adjusted related calls to maintain expected behavior.
- Updated CUDA intrinsic dispatch functions to include annotations, improving compatibility and correctness in CUDA operations.

* lint fix
LeiWang1999 added a commit that referenced this pull request Jan 26, 2026
* finish KDA algorithm in tilelang

* fix pre-commit.ci

* fix pre-commit.ci

* fix pre-commit local

* [Style] Fix some code styles

* [Refactor] Remove redundant swizzle for they can be automatically done

* [Refactor] remove chunk_bwd_intra.py and rename chunk_bwd_intra_op.py and do some fix form coderabbitai

* update ruff

* update pre-commit

* [Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case (#1654)

* Add unroll loop functionality and corresponding test case

- Introduced a new `UnrollLoop` function in the transform module to unroll loops based on various configuration options.
- Added a test case in `test_tilelang_language_unroll.py` to validate the behavior of `T.unroll` with only the extent parameter, ensuring correct kernel generation with unroll pragmas.

* Refactor unroll kernel implementation and update test case

- Changed the kernel function in `test_tilelang_language_unroll.py` to use a new `unroll_kernel` function that compiles and returns the output tensor, improving clarity and structure.
- Updated the `OptimizeForTarget` function in `phase.py` to ensure the `UnrollLoop` transformation is applied correctly, maintaining consistency in optimization phases.

* lint fix

* lint fix

* [Bugfix] Fix missing annotations for default CallNode Visitor (#1659)

tvm fix

* [Clean] Remove unnecessary debug print (#1661)

remove unnecessary debug print

* [Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies (#1657)

* [Enhancement] Update global load/store functions for CUDA compatibility (#1652)

Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA versions above 12.9 and earlier versions. This change ensures that 256-bit loads and stores are handled correctly across different CUDA versions, improving performance and compatibility. The implementation now uses two 128-bit loads/stores for older versions, enhancing the robustness of the codebase.

* Update comments in global load/store functions for CUDA compatibility

Clarified comments in `ld_global_256` and `st_global_256` functions to indicate that the fallback for CUDA versions below 12.9 may have performance regressions. This change enhances code readability and provides better context for developers working with different CUDA versions.

* Update submodule and enhance LetStmt handling in inject_pipeline.cc

- Updated the TVM submodule to the latest commit.
- Improved the handling of LetStmt in the inject_pipeline.cc file to account for transitive dependencies on loop variables, ensuring correct variable substitution in rewritten blocks.
- Adjusted test_tilelang_issue_1263.py to remove unnecessary jit decorator and updated the kernel compilation process with specific pass configurations.

* lint fix

* revert tvm

* remove unused test

* test fix

* [Refactor] Improve CallNode handling to include annotations in various operations (#1663)

* [Enhancement] Update CallNode handling to include annotations in various operations

- Modified CallNode invocations in multiple files to ensure that annotations are passed correctly, enhancing the consistency and functionality of the codebase.
- Removed the "use_tma" annotation from AtomicAddNode and adjusted related calls to maintain expected behavior.
- Updated CUDA intrinsic dispatch functions to include annotations, improving compatibility and correctness in CUDA operations.

* lint fix

* [EagerJIT] Add Support for Parameter Only Kernel Compilation (#1664)

* [Fix] Refactor type hint extraction logic in DSLMutator for better clarity and handling of annotations

* [Refactor] Remove redundant tensor creation in loop layout tests and update kernel compilation parameters

* [AutoDD] Add Tilelang AutoDD to Reduce Buggy Program (#1639)

* [Feat] Add tilelang autodd for delta debugging

* fix typos

* fix lint error

* fix typos

* fix lint error

* fix bugs

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix codeview comments

* [Refactor] Move AutoDD detection to env module and update import logic

* Refactor: Relocate the _is_running_autodd function to the env module for better organization and encapsulation.
* Update initialization logic to skip logger and heavy imports based on a new light import mode, enhancing flexibility in module usage.
* Ensure consistent handling of environment variables across the package, improving overall code clarity and maintainability.

* [Documentation] Add AutoDD section to debug_tools_for_tilelang.md

* Introduced a comprehensive guide on AutoDD (Automatic Delta Debugging) for isolating bugs in TileLang programs.
* Explained Delta Debugging methodology, usage, parameters, and provided examples for clarity.
* Highlighted the benefits of using AutoDD for large codebases and hard-to-locate errors, emphasizing time-saving aspects.
* Included tips for effective usage and a reference to a complete example in the documentation.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: kurisu6912 <227995639+kurisu6912@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* rebase origin

* [Feature] Support `cp.reduce.async.bulk.tensor` (#1667)

* support cp.reduce.async.bulk.tensor and add test

* Refactor flash attention example by removing unnecessary layout annotations

* support swizzle layout for tma reduce

* auto swizzle for non-1d tma atomic add

* upd example and test

* lint

* typo

* add constraint for test

* Refactor CUDA data type mapping by moving the to_CUtensorMapDataType function to utils.cc and utils.h, while removing redundant definitions from atomic_add.cc and copy.cc.

* lint

* rename basename according to CI

* Update submodule TVM and remove deprecated KDA example files

- Updated the TVM submodule to commit 354eef9a.
- Removed several outdated KDA example files and utility scripts that are no longer in use, including chunk_bwd_dqkwg.py, chunk_bwd_dv.py, chunk_bwd_gla_dA.py, chunk_bwd_intra.py, chunk_delta_bwd.py, chunk_delta_h_fwd.py, chunk_inter_solve_fused.py, chunk_intra_token_parallel.py, chunk_o.py, README.md, test_utils_kda.py, wy_fast_bwd.py, wy_fast.py, and various FLA_KDA implementations.

* lint fix

---------

Co-authored-by: wufang <wufang@MBP-MK6VR66Y2M-2329.local>
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: Kuris <227995639+kurisu6912@users.noreply.github.com>
Co-authored-by: Kexing Zhou <KEKE_046@pku.edu.cn>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
Co-authored-by: Tong WU <109033598+Rachmanino@users.noreply.github.com>