
[Refactor] Improve CallNode handling to include annotations in various operations #1663

Merged
LeiWang1999 merged 2 commits into tile-ai:main from LeiWang1999:enhance_0113 on Jan 13, 2026
Conversation


@LeiWang1999 LeiWang1999 commented Jan 13, 2026

as title.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed annotation handling in atomic operations to prevent unwanted metadata propagation.
    • Improved annotation preservation through CUDA dispatch and code transformation passes.
    • Added null-check for call validation in CUDA shuffle operations.
  • Chores

    • Updated third-party dependencies.


[Enhancement] Update CallNode handling to include annotations in various operations

- Modified CallNode invocations in multiple files to ensure that annotations are passed correctly, enhancing the consistency and functionality of the codebase.
- Removed the "use_tma" annotation from AtomicAddNode and adjusted related calls to maintain expected behavior.
- Updated CUDA intrinsic dispatch functions to include annotations, improving compatibility and correctness in CUDA operations.
@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai bot commented Jan 13, 2026

📝 Walkthrough

Updates annotation handling across TVM IR transformations to ensure Call node annotations are properly propagated through CUDA intrinsic dispatchers and barrier injection passes. Filters "use_tma" annotation from atomic operations while preserving it in node metadata. Includes TVM submodule reference update.
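To make the pattern concrete, here is a minimal, hypothetical sketch of what "propagating annotations through a Call reconstruction" looks like. It is not code from this PR, and it assumes the 4-argument Call constructor with an annotations map described in the changes table below; the exact signature in the updated TVM submodule may differ.

```cpp
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

// Hypothetical helper, not literal code from this PR: rebuild a call with new
// arguments while carrying the original node's annotations forward. Before
// this change, rewrites typically constructed Call(op->dtype, op->op, new_args),
// which dropped any annotations attached to the original CallNode.
PrimExpr RebuildWithAnnotations(const CallNode* op, Array<PrimExpr> new_args) {
  // Assumes a Call constructor overload whose fourth argument is the
  // annotations map, as used throughout this PR.
  return Call(op->dtype, op->op, new_args, op->annotations);
}
```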

Changes

  • TVM Submodule Update (3rdparty/tvm): Updated submodule commit reference from da7f19b6 to 0794c13a; no observable changes to code logic.
  • Call Annotation Propagation (src/target/intrin_rule_cuda.cc): DispatchCUDAWarpActiveMask and DispatchCUDAShuffle now propagate source call annotations to reconstructed Call nodes. Added a null-check for CallNode in DispatchCUDAShuffle.
  • TMA Barrier Annotation Propagation (src/transform/inject_tma_barrier.cc): All Call node reconstructions now include op->annotations as the fourth parameter across TmaExpectTxRewriter, BarrierCreationRewriter, and TmaBarrierRewriter, preserving metadata through rewrites.
  • Atomic Operation Annotation Filtering (src/op/atomic_add.cc): AtomicAddNode::MakeSIMTLoop and AtomicAddNode::Lower now copy annotations, remove the "use_tma" entry, and pass the filtered set to atomic/tma_store calls, preventing "use_tma" propagation to atomic operations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Poem

🐰 Annotations hop through IR trees,
From Call to Call, with practiced ease,
Filtering out what shouldn't be,
While metadata runs wild and free,
The CUDA code now knows its place!

🚥 Pre-merge checks: 2 passed, 1 warning

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 28.57%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately describes the main refactoring objective: improving CallNode handling to include annotations across various operations.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @src/target/intrin_rule_cuda.cc:
- Around lines 119-123: DispatchCUDAWarpActiveMask dereferences call without
checking the result of e.as<CallNode>(). Add the same null-check used in
DispatchCUDAShuffle by inserting ICHECK(call != nullptr) after obtaining
const CallNode *call = e.as<CallNode>(); this bails out early with a clear
invariant if e is not a CallNode, before call->dtype, call->args, or
call->annotations are used. Keep the rest of the function's Call(...)
construction intact.
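For illustration, a hedged sketch of how the suggested guard could sit in the dispatcher. The intrinsic construction shown here (a call_extern to __activemask) and the 4-argument Call constructor with annotations are assumptions; the actual body of DispatchCUDAWarpActiveMask in intrin_rule_cuda.cc may differ.

```cpp
#include <tvm/runtime/logging.h>
#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

// Illustrative sketch only; the intrinsic actually emitted by
// DispatchCUDAWarpActiveMask may differ from this call_extern example.
static PrimExpr DispatchCUDAWarpActiveMask(const PrimExpr& e) {
  const CallNode* call = e.as<CallNode>();
  // Bail out early with a clear invariant if `e` is not a CallNode,
  // mirroring the null-check already added to DispatchCUDAShuffle.
  ICHECK(call != nullptr);
  // Rebuild the call (here a hypothetical extern call to __activemask),
  // forwarding the original annotations as the assumed fourth argument.
  return Call(call->dtype, builtin::call_extern(),
              {StringImm("__activemask")}, call->annotations);
}
```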
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e90edc and d7e935c.

📒 Files selected for processing (4)
  • 3rdparty/tvm
  • src/op/atomic_add.cc
  • src/target/intrin_rule_cuda.cc
  • src/transform/inject_tma_barrier.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/inject_tma_barrier.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (7)
3rdparty/tvm (1)

1-1: Verify TVM API compatibility—internal code review passed, external verification required.

Internal review confirms consistent annotation handling across all modified files (src/op/atomic_add.cc, src/target/intrin_rule_cuda.cc, src/transform/inject_tma_barrier.cc). All Call constructors properly use the 4-argument form with annotations parameter. No security advisories found for the TVM commit.

However, external verification of TVM API changes between the old and new commits is still required to ensure the CallNode construction patterns remain compatible with the updated TVM version.

src/op/atomic_add.cc (2)

283-287: LGTM! Annotation filtering is correctly implemented.

The pattern of copying the annotations map, erasing "use_tma", and passing the filtered annotations to the atomicadd_elem_op call is correct. This ensures the use_tma annotation remains on the AtomicAddNode for decision-making (e.g., in GetUseTMA()) while not being forwarded to the underlying atomic operation.
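A minimal sketch of the copy-and-erase pattern described above. The helper name, the annotation map's value type (ObjectRef), and the argument list are assumptions for illustration; the real MakeSIMTLoop body differs.

```cpp
#include <tvm/ir/op.h>
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

// Hypothetical helper illustrating the filtering pattern: build an
// element-wise atomic-add call that receives every annotation from the
// parent node except "use_tma".
PrimExpr MakeFilteredAtomicAdd(const Op& atomicadd_elem_op, DataType dtype,
                               Array<PrimExpr> args,
                               Map<String, ObjectRef> annotations) {
  // Copy the node's annotations and drop "use_tma" so the flag stays on the
  // AtomicAddNode (e.g. for GetUseTMA()) without reaching the atomic call.
  Map<String, ObjectRef> filtered = annotations;
  filtered.erase("use_tma");
  // Assumes the 4-argument Call constructor with an annotations map.
  return Call(dtype, atomicadd_elem_op, args, filtered);
}
```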


396-403: LGTM! Consistent annotation handling in the TMA store path.

The same filtering pattern is correctly applied here: copying annotations, erasing "use_tma", and passing the filtered map to the tma_store call. This maintains consistency with MakeSIMTLoop and ensures proper annotation propagation throughout the lowering pipeline.

src/target/intrin_rule_cuda.cc (1)

125-133: LGTM! Proper null-check and annotation propagation.

The added ICHECK(call != nullptr) provides a clear diagnostic if the input is malformed, and the annotations are correctly forwarded to the new Call node. This aligns with the PR's goal of consistent annotation handling across CUDA intrinsic dispatchers.

src/transform/inject_tma_barrier.cc (3)

165-179: LGTM! Annotations preserved in TMA load rewrite.

The op->annotations are correctly propagated when reconstructing the Call node with the updated barrier argument. This ensures any metadata attached to the original TMA load call survives the rewrite.
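For reference, a minimal sketch of a mutator that swaps one argument while forwarding the original annotations. The class name, the matching logic (here: the last argument of every visited call), and the 4-argument Call constructor are assumptions; the actual TmaExpectTxRewriter is more selective about which intrinsics it rewrites.

```cpp
#include <tvm/tir/expr.h>
#include <tvm/tir/stmt_functor.h>

using namespace tvm;
using namespace tvm::tir;

// Illustrative mutator, not the literal TmaExpectTxRewriter: it shows the
// structural pattern of replacing one argument while keeping all other
// fields and the original annotations intact.
class BarrierArgRewriter : public StmtExprMutator {
 public:
  explicit BarrierArgRewriter(PrimExpr new_barrier) : new_barrier_(new_barrier) {}

  PrimExpr VisitExpr_(const CallNode* op) final {
    // Visit children first, then rebuild this call with the new barrier
    // argument (assumed, for illustration, to be the last argument).
    Call call = Downcast<Call>(StmtExprMutator::VisitExpr_(op));
    Array<PrimExpr> new_args = call->args;
    if (!new_args.empty()) {
      new_args.Set(new_args.size() - 1, new_barrier_);
    }
    // Forwarding call->annotations keeps metadata attached to the original
    // call alive through the rewrite (assumed 4-argument constructor).
    return Call(call->dtype, call->op, new_args, call->annotations);
  }

 private:
  PrimExpr new_barrier_;
};
```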


350-389: LGTM! Annotations preserved in barrier creation rewrite.

The op->annotations are correctly forwarded when constructing the new Call node with the modified barrier argument list. This maintains any metadata attached to the original create_list_of_mbarrier call.


507-568: LGTM! Consistent annotation propagation across all TMA barrier rewrite paths.

All Call node reconstructions in TmaBarrierRewriter::VisitExpr_ correctly include op->annotations, ensuring metadata is preserved through:

  • 1D TMA load promotion (line 524)
  • TMA load barrier ID assignment (line 540)
  • mbarrier_expect_tx to ptx_arrive_barrier_expect_tx conversion (line 555)
  • mbarrier_expect_tx argument update (line 557)
  • ptx_arrive_barrier passthrough (line 565)

@LeiWang1999 LeiWang1999 merged commit 29ece98 into tile-ai:main Jan 13, 2026
5 of 7 checks passed
tzj-fxz pushed a commit to wfloveiu/tilelang that referenced this pull request Jan 14, 2026
[Refactor] Improve CallNode handling to include annotations in various operations (tile-ai#1663)

* [Enhancement] Update CallNode handling to include annotations in various operations

- Modified CallNode invocations in multiple files to ensure that annotations are passed correctly, enhancing the consistency and functionality of the codebase.
- Removed the "use_tma" annotation from AtomicAddNode and adjusted related calls to maintain expected behavior.
- Updated CUDA intrinsic dispatch functions to include annotations, improving compatibility and correctness in CUDA operations.

* lint fix
LeiWang1999 added a commit that referenced this pull request Jan 26, 2026
* finish KDA algorithm in tilelang

* fix pre-commit.ci

* fix pre-commit.ci

* fix pre-commit local

* [Style] Fix some code styles

* [Refactor] Remove redundant swizzle for they can be automatically done

* [Refactor] remove chunk_bwd_intra.py and rename chunk_bwd_intra_op.py and do some fix form coderabbitai

* update ruff

* update pre-commit

* [Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case (#1654)

* Add unroll loop functionality and corresponding test case

- Introduced a new `UnrollLoop` function in the transform module to unroll loops based on various configuration options.
- Added a test case in `test_tilelang_language_unroll.py` to validate the behavior of `T.unroll` with only the extent parameter, ensuring correct kernel generation with unroll pragmas.

* Refactor unroll kernel implementation and update test case

- Changed the kernel function in `test_tilelang_language_unroll.py` to use a new `unroll_kernel` function that compiles and returns the output tensor, improving clarity and structure.
- Updated the `OptimizeForTarget` function in `phase.py` to ensure the `UnrollLoop` transformation is applied correctly, maintaining consistency in optimization phases.

* lint fix

* lint fix

* [Bugfix] Fix missing annotations for default CallNode Visitor (#1659)

tvm fix

* [Clean] Remove unnecessary debug print (#1661)

remove unnecessary debug print

* [Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies (#1657)

* [Enhancement] Update global load/store functions for CUDA compatibility (#1652)

Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA versions above 12.9 and earlier versions. This change ensures that 256-bit loads and stores are handled correctly across different CUDA versions, improving performance and compatibility. The implementation now uses two 128-bit loads/stores for older versions, enhancing the robustness of the codebase.

* Update comments in global load/store functions for CUDA compatibility

Clarified comments in `ld_global_256` and `st_global_256` functions to indicate that the fallback for CUDA versions below 12.9 may have performance regressions. This change enhances code readability and provides better context for developers working with different CUDA versions.

* Update submodule and enhance LetStmt handling in inject_pipeline.cc

- Updated the TVM submodule to the latest commit.
- Improved the handling of LetStmt in the inject_pipeline.cc file to account for transitive dependencies on loop variables, ensuring correct variable substitution in rewritten blocks.
- Adjusted test_tilelang_issue_1263.py to remove unnecessary jit decorator and updated the kernel compilation process with specific pass configurations.

* lint fix

* revert tvm

* remove unused test

* test fix

* [Refactor] Improve CallNode handling to include annotations in various operations (#1663)

* [Enhancement] Update CallNode handling to include annotations in various operations

- Modified CallNode invocations in multiple files to ensure that annotations are passed correctly, enhancing the consistency and functionality of the codebase.
- Removed the "use_tma" annotation from AtomicAddNode and adjusted related calls to maintain expected behavior.
- Updated CUDA intrinsic dispatch functions to include annotations, improving compatibility and correctness in CUDA operations.

* lint fix

* [EagerJIT] Add Support for Parameter Only Kernel Compilation (#1664)

* [Fix] Refactor type hint extraction logic in DSLMutator for better clarity and handling of annotations

* [Refactor] Remove redundant tensor creation in loop layout tests and update kernel compilation parameters

* [AutoDD] Add Tilelang AutoDD to Reduce Buggy Program (#1639)

* [Feat] Add tilelang autodd for delta debugging

* fix typos

* fix lint error

* fix typos

* fix lint error

* fix bugs

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix codeview comments

* [Refactor] Move AutoDD detection to env module and update import logic

* Refactor: Relocate the _is_running_autodd function to the env module for better organization and encapsulation.
* Update initialization logic to skip logger and heavy imports based on a new light import mode, enhancing flexibility in module usage.
* Ensure consistent handling of environment variables across the package, improving overall code clarity and maintainability.

* [Documentation] Add AutoDD section to debug_tools_for_tilelang.md

* Introduced a comprehensive guide on AutoDD (Automatic Delta Debugging) for isolating bugs in TileLang programs.
* Explained Delta Debugging methodology, usage, parameters, and provided examples for clarity.
* Highlighted the benefits of using AutoDD for large codebases and hard-to-locate errors, emphasizing time-saving aspects.
* Included tips for effective usage and a reference to a complete example in the documentation.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: kurisu6912 <227995639+kurisu6912@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* rebase origin

* [Feature] Support `cp.reduce.async.bulk.tensor` (#1667)

* support cp.reduce.async.bulk.tensor and add test

* Refactor flash attention example by removing unnecessary layout annotations

* support swizzle layout for tma reduce

* auto swizzle for non-1d tma atomic add

* upd example and test

* lint

* typo

* add constraint for test

* Refactor CUDA data type mapping by moving the to_CUtensorMapDataType function to utils.cc and utils.h, while removing redundant definitions from atomic_add.cc and copy.cc.

* lint

* rename basename according to CI

* Update submodule TVM and remove deprecated KDA example files

- Updated the TVM submodule to commit 354eef9a.
- Removed several outdated KDA example files and utility scripts that are no longer in use, including chunk_bwd_dqkwg.py, chunk_bwd_dv.py, chunk_bwd_gla_dA.py, chunk_bwd_intra.py, chunk_delta_bwd.py, chunk_delta_h_fwd.py, chunk_inter_solve_fused.py, chunk_intra_token_parallel.py, chunk_o.py, README.md, test_utils_kda.py, wy_fast_bwd.py, wy_fast.py, and various FLA_KDA implementations.

* lint fix

---------

Co-authored-by: wufang <wufang@MBP-MK6VR66Y2M-2329.local>
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: Kuris <227995639+kurisu6912@users.noreply.github.com>
Co-authored-by: Kexing Zhou <KEKE_046@pku.edu.cn>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
Co-authored-by: Tong WU <109033598+Rachmanino@users.noreply.github.com>