[Refactor] Improve CallNode handling to include annotations in various operations #1663
LeiWang1999 merged 2 commits into tile-ai:main
Conversation
[Enhancement] Update CallNode handling to include annotations in various operations

- Modified CallNode invocations in multiple files to ensure that annotations are passed correctly, enhancing the consistency and functionality of the codebase.
- Removed the "use_tma" annotation from AtomicAddNode and adjusted related calls to maintain expected behavior.
- Updated CUDA intrinsic dispatch functions to include annotations, improving compatibility and correctness in CUDA operations.
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run […]. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
📝 Walkthrough

Updates annotation handling across TVM IR transformations to ensure Call node annotations are properly propagated through CUDA intrinsic dispatchers and barrier injection passes. Filters the "use_tma" annotation from atomic operations while preserving it in node metadata. Includes a TVM submodule reference update.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @src/target/intrin_rule_cuda.cc:
- Around lines 119-123: DispatchCUDAWarpActiveMask dereferences call without
checking the result of e.as<CallNode>(); add the same null-check used in DispatchCUDAShuffle
by inserting ICHECK(call != nullptr) after obtaining const CallNode *call =
e.as<CallNode>(); so you bail early with a clear invariant if e is not a
CallNode before using call->dtype, call->args, or call->annotations; keep the
rest of the function constructing the Call(...) intact.
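A minimal sketch of the suggested guard, assuming the dispatcher keeps roughly the shape upstream TVM uses for this rule; the `__activemask` extern name and the fork's 4-argument `Call` form (with annotations) are assumptions drawn from the review text, not verified against the actual file:

```cpp
#include <tvm/runtime/logging.h>
#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

// Hypothetical reconstruction of DispatchCUDAWarpActiveMask with the
// suggested guard; the extern name and overall shape are assumed.
static PrimExpr DispatchCUDAWarpActiveMask(const PrimExpr& e) {
  const CallNode* call = e.as<CallNode>();
  // Bail out early with a clear invariant if e is not a CallNode,
  // mirroring the null-check already present in DispatchCUDAShuffle.
  ICHECK(call != nullptr);
  // The rest of the dispatcher stays intact: dtype, args, and annotations
  // are forwarded into the rebuilt Call (the fork's 4-argument form).
  return Call(call->dtype, builtin::call_pure_extern(),
              {StringImm("__activemask")}, call->annotations);
}
```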
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- 3rdparty/tvm
- src/op/atomic_add.cc
- src/target/intrin_rule_cuda.cc
- src/transform/inject_tma_barrier.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.
Applied to files:
src/transform/inject_tma_barrier.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (7)
3rdparty/tvm (1)
1-1: Verify TVM API compatibility: internal code review passed, external verification required.

Internal review confirms consistent annotation handling across all modified files (src/op/atomic_add.cc, src/target/intrin_rule_cuda.cc, src/transform/inject_tma_barrier.cc). All Call constructors properly use the 4-argument form with the annotations parameter. No security advisories found for the TVM commit.
However, external verification of TVM API changes between the old and new commits is still required to ensure the CallNode construction patterns remain compatible with the updated TVM version.
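For reference, the compatibility question reduces to whether the pinned TVM fork still exposes the 4-argument Call constructor that carries an annotations map; a hypothetical helper illustrating the pattern the review checked for (names are placeholders, not code from the PR):

```cpp
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

// Hypothetical helper: rebuild a call with new arguments while carrying the
// annotations map along, relying on the fork's 4-argument Call constructor.
static PrimExpr RebuildWithAnnotations(const CallNode* op,
                                       Array<PrimExpr> new_args) {
  return Call(op->dtype, op->op, new_args, op->annotations);
}
```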
src/op/atomic_add.cc (2)
283-287: LGTM! Annotation filtering is correctly implemented.

The pattern of copying the annotations map, erasing "use_tma", and passing the filtered annotations to the `atomicadd_elem_op` call is correct. This ensures the `use_tma` annotation remains on the `AtomicAddNode` for decision-making (e.g., in `GetUseTMA()`) while not being forwarded to the underlying atomic operation.
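A sketch of that copy-and-erase pattern with hypothetical names; the map value type and the intrinsic handle are assumptions, and the real code lives in AtomicAddNode's lowering rather than in a free function like this:

```cpp
#include <tvm/ir/op.h>
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

// Sketch of "copy the annotations, drop use_tma, forward the rest".
// The map value type and the op handle are placeholders for whatever the
// fork's CallNode::annotations and atomicadd_elem_op actually use.
static PrimExpr MakeFilteredAtomicCall(const Op& atomicadd_elem_op,
                                       Array<PrimExpr> args,
                                       Map<String, ObjectRef> annotations) {
  // Copy so the AtomicAddNode itself keeps "use_tma" for GetUseTMA().
  Map<String, ObjectRef> filtered = annotations;
  filtered.erase("use_tma");
  // Only the filtered map travels with the underlying atomic element op.
  return Call(DataType::Handle(), atomicadd_elem_op, args, filtered);
}
```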
396-403: LGTM! Consistent annotation handling in the TMA store path.

The same filtering pattern is correctly applied here: copying annotations, erasing "use_tma", and passing the filtered map to the `tma_store` call. This maintains consistency with `MakeSIMTLoop` and ensures proper annotation propagation throughout the lowering pipeline.

src/target/intrin_rule_cuda.cc (1)
125-133: LGTM! Proper null-check and annotation propagation.

The added `ICHECK(call != nullptr)` provides a clear diagnostic if the input is malformed, and the annotations are correctly forwarded to the new `Call` node. This aligns with the PR's goal of consistent annotation handling across CUDA intrinsic dispatchers.

src/transform/inject_tma_barrier.cc (3)
165-179: LGTM! Annotations preserved in TMA load rewrite.

The `op->annotations` are correctly propagated when reconstructing the `Call` node with the updated barrier argument. This ensures any metadata attached to the original TMA load call survives the rewrite.
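A minimal sketch of that rewrite shape, assuming a StmtExprMutator-style pass; the match predicate, argument index, and barrier expression are placeholders, and only the annotation forwarding mirrors what the review describes:

```cpp
#include <utility>

#include <tvm/tir/expr.h>
#include <tvm/tir/stmt_functor.h>

using namespace tvm;
using namespace tvm::tir;

// Hypothetical rewriter: swap the barrier argument of a matched TMA load
// call while keeping op->annotations attached to the rebuilt Call.
class TmaLoadBarrierRewriter : public StmtExprMutator {
 public:
  explicit TmaLoadBarrierRewriter(PrimExpr new_barrier)
      : new_barrier_(std::move(new_barrier)) {}

 private:
  PrimExpr VisitExpr_(const CallNode* op) final {
    if (IsTmaLoad(op)) {
      Array<PrimExpr> new_args = op->args;
      new_args.Set(kBarrierArgIndex, new_barrier_);
      // Reconstruct with the updated barrier; annotations ride along via
      // the fork's 4-argument Call form, so metadata survives the rewrite.
      return Call(op->dtype, op->op, new_args, op->annotations);
    }
    return StmtExprMutator::VisitExpr_(op);
  }

  // Placeholder: the real pass matches the TMA load intrinsic.
  static bool IsTmaLoad(const CallNode* op) { return false; }
  // Placeholder: position of the barrier argument in the call.
  static constexpr size_t kBarrierArgIndex = 0;

  PrimExpr new_barrier_;
};
```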
350-389: LGTM! Annotations preserved in barrier creation rewrite.

The `op->annotations` are correctly forwarded when constructing the new `Call` node with the modified barrier argument list. This maintains any metadata attached to the original `create_list_of_mbarrier` call.
507-568: LGTM! Consistent annotation propagation across all TMA barrier rewrite paths.

All `Call` node reconstructions in `TmaBarrierRewriter::VisitExpr_` correctly include `op->annotations`, ensuring metadata is preserved through:
- 1D TMA load promotion (line 524)
- TMA load barrier ID assignment (line 540)
- `mbarrier_expect_tx` → `ptx_arrive_barrier_expect_tx` conversion (line 555)
- `mbarrier_expect_tx` argument update (line 557)
- `ptx_arrive_barrier` passthrough (line 565)
[Refactor] Improve CallNode handling to include annotations in various operations (tile-ai#1663)

* [Enhancement] Update CallNode handling to include annotations in various operations
  - Modified CallNode invocations in multiple files to ensure that annotations are passed correctly, enhancing the consistency and functionality of the codebase.
  - Removed the "use_tma" annotation from AtomicAddNode and adjusted related calls to maintain expected behavior.
  - Updated CUDA intrinsic dispatch functions to include annotations, improving compatibility and correctness in CUDA operations.
* lint fix
* finish KDA algorithm in tilelang
* fix pre-commit.ci
* fix pre-commit.ci
* fix pre-commit local
* [Style] Fix some code styles
* [Refactor] Remove redundant swizzle for they can be automatically done
* [Refactor] remove chunk_bwd_intra.py and rename chunk_bwd_intra_op.py and do some fix form coderabbitai
* update ruff
* update pre-commit
* [Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case (#1654)
* Add unroll loop functionality and corresponding test case
  - Introduced a new `UnrollLoop` function in the transform module to unroll loops based on various configuration options.
  - Added a test case in `test_tilelang_language_unroll.py` to validate the behavior of `T.unroll` with only the extent parameter, ensuring correct kernel generation with unroll pragmas.
* Refactor unroll kernel implementation and update test case
  - Changed the kernel function in `test_tilelang_language_unroll.py` to use a new `unroll_kernel` function that compiles and returns the output tensor, improving clarity and structure.
  - Updated the `OptimizeForTarget` function in `phase.py` to ensure the `UnrollLoop` transformation is applied correctly, maintaining consistency in optimization phases.
* lint fix
* lint fix
* [Bugfix] Fix missing annotations for default CallNode Visitor (#1659)
  tvm fix
* [Clean] Remove unnecessary debug print (#1661)
  remove unnecessary debug print
* [Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies (#1657)
* [Enhancement] Update global load/store functions for CUDA compatibility (#1652)
  Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA versions above 12.9 and earlier versions. This change ensures that 256-bit loads and stores are handled correctly across different CUDA versions, improving performance and compatibility. The implementation now uses two 128-bit loads/stores for older versions, enhancing the robustness of the codebase.
* Update comments in global load/store functions for CUDA compatibility
  Clarified comments in `ld_global_256` and `st_global_256` functions to indicate that the fallback for CUDA versions below 12.9 may have performance regressions. This change enhances code readability and provides better context for developers working with different CUDA versions.
* Update submodule and enhance LetStmt handling in inject_pipeline.cc
  - Updated the TVM submodule to the latest commit.
  - Improved the handling of LetStmt in the inject_pipeline.cc file to account for transitive dependencies on loop variables, ensuring correct variable substitution in rewritten blocks.
  - Adjusted test_tilelang_issue_1263.py to remove unnecessary jit decorator and updated the kernel compilation process with specific pass configurations.
* lint fix
* revert tvm
* remove unused test
* test fix
* [Refactor] Improve CallNode handling to include annotations in various operations (#1663)
* [Enhancement] Update CallNode handling to include annotations in various operations
  - Modified CallNode invocations in multiple files to ensure that annotations are passed correctly, enhancing the consistency and functionality of the codebase.
  - Removed the "use_tma" annotation from AtomicAddNode and adjusted related calls to maintain expected behavior.
  - Updated CUDA intrinsic dispatch functions to include annotations, improving compatibility and correctness in CUDA operations.
* lint fix
* [EagerJIT] Add Support for Parameter Only Kernel Compilation (#1664)
* [Fix] Refactor type hint extraction logic in DSLMutator for better clarity and handling of annotations
* [Refactor] Remove redundant tensor creation in loop layout tests and update kernel compilation parameters
* [AutoDD] Add Tilelang AutoDD to Reduce Buggy Program (#1639)
* [Feat] Add tilelang autodd for delta debugging
* fix typos
* fix lint error
* fix typos
* fix lint error
* fix bugs
* Apply suggestions from code review
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix codeview comments
* [Refactor] Move AutoDD detection to env module and update import logic
  - Refactor: Relocate the _is_running_autodd function to the env module for better organization and encapsulation.
  - Update initialization logic to skip logger and heavy imports based on a new light import mode, enhancing flexibility in module usage.
  - Ensure consistent handling of environment variables across the package, improving overall code clarity and maintainability.
* [Documentation] Add AutoDD section to debug_tools_for_tilelang.md
  - Introduced a comprehensive guide on AutoDD (Automatic Delta Debugging) for isolating bugs in TileLang programs.
  - Explained Delta Debugging methodology, usage, parameters, and provided examples for clarity.
  - Highlighted the benefits of using AutoDD for large codebases and hard-to-locate errors, emphasizing time-saving aspects.
  - Included tips for effective usage and a reference to a complete example in the documentation.

  ---------
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
  Co-authored-by: kurisu6912 <227995639+kurisu6912@users.noreply.github.com>
  Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* rebase origin
* [Feature] Support `cp.reduce.async.bulk.tensor` (#1667)
* support cp.reduce.async.bulk.tensor and add test
* Refactor flash attention example by removing unnecessary layout annotations
* support swizzle layout for tma reduce
* auto swizzle for non-1d tma atomic add
* upd example and test
* lint
* typo
* add constraint for test
* Refactor CUDA data type mapping by moving the to_CUtensorMapDataType function to utils.cc and utils.h, while removing redundant definitions from atomic_add.cc and copy.cc.
* lint
* rename basename according to CI
* Update submodule TVM and remove deprecated KDA example files
  - Updated the TVM submodule to commit 354eef9a.
  - Removed several outdated KDA example files and utility scripts that are no longer in use, including chunk_bwd_dqkwg.py, chunk_bwd_dv.py, chunk_bwd_gla_dA.py, chunk_bwd_intra.py, chunk_delta_bwd.py, chunk_delta_h_fwd.py, chunk_inter_solve_fused.py, chunk_intra_token_parallel.py, chunk_o.py, README.md, test_utils_kda.py, wy_fast_bwd.py, wy_fast.py, and various FLA_KDA implementations.
* lint fix

---------
Co-authored-by: wufang <wufang@MBP-MK6VR66Y2M-2329.local>
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: Kuris <227995639+kurisu6912@users.noreply.github.com>
Co-authored-by: Kexing Zhou <KEKE_046@pku.edu.cn>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
Co-authored-by: Tong WU <109033598+Rachmanino@users.noreply.github.com>
as title.
Summary by CodeRabbit
Bug Fixes
Chores