[Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case#1654
Conversation
- Introduced a new `UnrollLoop` function in the transform module to unroll loops based on various configuration options. - Added a test case in `test_tilelang_language_unroll.py` to validate the behavior of `T.unroll` with only the extent parameter, ensuring correct kernel generation with unroll pragmas.
📝 WalkthroughWalkthroughAdds a new C++ UnrollLoop transform and its config, exposes it via Changes
Sequence Diagram(s)sequenceDiagram
participant Python as Python caller (tilelang.transform.UnrollLoop)
participant FFI as FFI dispatch (_ffi_api)
participant Cpp as C++ transform (UnrollLoop)
participant IR as PrimFunc/IR body
Python->>FFI: call UnrollLoop(config?)
FFI->>Cpp: invoke UnrollLoop pass with config
Cpp->>IR: analyze loops (extents, steps, pragmas, local access)
Cpp->>IR: transform eligible For loops -> unrolled bodies / mark kUnrolled
Cpp-->>FFI: return transformed IR
FFI-->>Python: return transformed PrimFunc/stmt
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/transform/unroll_loop.cc (3)
111-117: Remove or reduce debug logging verbosity.This
LOG(INFO)statement will print to logs every time thepragma_unroll_explicitattribute is encountered, which could be noisy in production. Consider usingDLOG(INFO)for debug builds only, or remove the log statement entirely.♻️ Suggested fix
} else if (op->attr_key == "pragma_unroll_explicit") { bool explicit_unroll = Downcast<Integer>(op->value)->value; - LOG(INFO) << "explicit_unroll: " << explicit_unroll; std::swap(explicit_unroll, explicit_unroll_); Stmt ret = this->VisitStmt(op->body); std::swap(explicit_unroll, explicit_unroll_);
123-123: Addfinalkeyword for consistency.This method overrides a virtual function from
StmtExprMutatorbut is missing theoverrideorfinalkeyword, unlike other overridden methods in this class (lines 104, 171, 185, 200, 205).♻️ Suggested fix
- Stmt VisitStmt_(const ForNode* op) { + Stmt VisitStmt_(const ForNode* op) final {
59-60: Minor typo in docstring.There's an extra backtick character in the description string.
📝 Suggested fix
.def_ro("auto_max_extent", &UnrollLoopConfigNode::auto_max_extent, - "The maximum extent` of loop that will be unrolled.", refl::DefaultValue(0)) + "The maximum extent of loop that will be unrolled.", refl::DefaultValue(0))
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/transform/unroll_loop.cctesting/python/language/test_tilelang_language_unroll.pytilelang/transform/__init__.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2026-01-06T05:20:51.649Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1606
File: testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py:30-30
Timestamp: 2026-01-06T05:20:51.649Z
Learning: In `testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py`, the test validates that the `hoist_broadcast_values` transformation pass correctly identifies and hoists broadcast operations by checking for patterns in the generated kernel source code. The specific literal values used (e.g., 430) are not important for the test's purpose, as it does not validate numerical precision or actual stored tensor values.
Applied to files:
testing/python/language/test_tilelang_language_unroll.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/language/test_tilelang_language_unroll.py
🧬 Code graph analysis (3)
tilelang/transform/__init__.py (1)
src/transform/unroll_loop.cc (4)
UnrollLoop(274-282)UnrollLoop(274-274)UnrollLoop(288-299)UnrollLoop(288-288)
src/transform/unroll_loop.cc (2)
tilelang/language/loop.py (1)
Unroll(254-265)tilelang/transform/__init__.py (1)
UnrollLoop(515-530)
testing/python/language/test_tilelang_language_unroll.py (3)
tilelang/language/kernel.py (1)
threads(215-219)tilelang/intrinsics/mfma_macro_generator.py (1)
get_thread_binding(234-240)tilelang/language/loop.py (1)
unroll(174-236)
🪛 Cppcheck (2.19.0)
src/transform/unroll_loop.cc
[error] 76-76: syntax error
(syntaxError)
🔇 Additional comments (3)
tilelang/transform/__init__.py (1)
515-530: LGTM!The
UnrollLoopfunction follows the established pattern of other pass accessors in this module, with a comprehensive docstring documenting the configuration options and proper FFI delegation.testing/python/language/test_tilelang_language_unroll.py (1)
37-53: LGTM!The test effectively validates the
T.unrollbehavior with extent-only parameter using a dynamic extent (tid % 32). The test correctly verifies that the#pragma unrolldirective is present in the generated kernel source without needing to execute the kernel.src/transform/unroll_loop.cc (1)
274-298: Implementation looks solid.The
UnrollLoopfunction correctly applies the transformation with SSA conversion when changes occur. The pass registration follows standard TVM patterns with proper config handling and default values fallback.Note: The static analysis hint about a syntax error on line 76 is a false positive—
TVM_FFI_STATIC_INIT_BLOCK()is a valid TVM macro.
- Changed the kernel function in `test_tilelang_language_unroll.py` to use a new `unroll_kernel` function that compiles and returns the output tensor, improving clarity and structure. - Updated the `OptimizeForTarget` function in `phase.py` to ensure the `UnrollLoop` transformation is applied correctly, maintaining consistency in optimization phases.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @src/transform/unroll_loop.cc:
- Around line 62-64: Fix the docstring for the
UnrollLoopConfigNode::auto_max_extent binding by removing the stray backtick:
update the string passed to .def_ro("auto_max_extent",
&UnrollLoopConfigNode::auto_max_extent, ...) so it reads "The maximum extent of
loop that will be unrolled." (replace `"The maximum extent\` of loop..."` with
the corrected text).
- Around line 119-125: The code leaves a debug logging call LOG(INFO) inside the
pragma_unroll_explicit handling; remove this production log (or replace with
VLOG if optional debug is desired) so normal passes don't emit info-level logs:
locate the block handling op->attr_key == "pragma_unroll_explicit" (variables
explicit_unroll and member explicit_unroll_, and the VisitStmt(op->body) call)
and delete the LOG(INFO) << "explicit_unroll: " << explicit_unroll; line (or
change it to VLOG(<appropriate_verbosity>) << ...).
🧹 Nitpick comments (1)
src/transform/unroll_loop.cc (1)
131-131: Addfinaloverride specifier for consistency.Other
VisitStmt_overrides in this class usefinal, but this one is missing it. Addingfinalensures consistency and prevents accidental further overriding in subclasses.✏️ Suggested fix
- Stmt VisitStmt_(const ForNode *op) { + Stmt VisitStmt_(const ForNode *op) final {
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/transform/unroll_loop.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/unroll_loop.cc (2)
tilelang/language/loop.py (1)
Unroll(254-265)tilelang/transform/__init__.py (1)
UnrollLoop(515-530)
🪛 Cppcheck (2.19.0)
src/transform/unroll_loop.cc
[error] 83-83: syntax error
(syntaxError)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Quick Lint
🔇 Additional comments (4)
src/transform/unroll_loop.cc (4)
77-85: Configuration registration looks correct.The
UnrollLoopConfigwrapper and registration pattern follows TVM conventions. The Cppcheck syntax error at line 83 is a false positive —TVM_FFI_STATIC_INIT_BLOCKis a valid TVM macro for static initialization that the static analyzer doesn't recognize.
87-98: LGTM!Simple and correct visitor implementation for marking variables accessed in local memory contexts.
289-298: LGTM!The function correctly applies the unroller and conditionally converts to SSA form only when changes occur.
304-315: LGTM!The pass factory correctly retrieves configuration from the pass context with a fallback to defaults, and the registration follows TVM conventions.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @src/transform/unroll_loop.cc:
- Around line 275-276: Fix the typo in the comment above the
unroll_local_access_ field: change "Wether to unroll loops to local access." to
"Whether to unroll loops to local access." so the comment correctly reads
"Whether to unroll loops to local access." and remains immediately above the
bool unroll_local_access_{false}; declaration.
- Around line 62-64: The docstring for the UnrollLoopConfigNode::auto_max_extent
binding contains an extraneous backtick; update the .def_ro call that binds
UnrollLoopConfigNode::auto_max_extent to remove the backtick so the description
reads "The maximum extent of loop that will be unrolled." (i.e., edit the string
argument in the .def_ro for auto_max_extent to delete the stray ` character).
- Around line 119-125: Remove the leftover debug log inside the
pragma_unroll_explicit handling: delete or replace the LOG(INFO) <<
"explicit_unroll: " << explicit_unroll; statement found in the branch where
op->attr_key == "pragma_unroll_explicit" (around the code that reads bool
explicit_unroll = Downcast<Integer>(op->value)->value and uses explicit_unroll_
and VisitStmt(op->body)); if you still want conditional debug output, change it
to DLOG(INFO) instead of LOG(INFO).
🧹 Nitpick comments (1)
src/transform/unroll_loop.cc (1)
131-131: Missingoverridespecifier on virtual function.The
VisitStmt_method overrides a virtual function fromStmtExprMutatorbut lacks theoverride(orfinal) specifier, which helps catch signature mismatches at compile time.♻️ Suggested fix
- Stmt VisitStmt_(const ForNode *op) { + Stmt VisitStmt_(const ForNode *op) final {
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/transform/unroll_loop.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/unroll_loop.cc (2)
tilelang/language/loop.py (1)
Unroll(254-265)tilelang/transform/__init__.py (1)
UnrollLoop(515-530)
🪛 Cppcheck (2.19.0)
src/transform/unroll_loop.cc
[error] 83-83: syntax error
(syntaxError)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (4)
src/transform/unroll_loop.cc (4)
87-98: LGTM!The
VarLocalAccessMarkervisitor is a clean, focused implementation for collecting variables from expressions.
218-233: LGTM!The
SeqStmtNodevisitor correctly isolates counter state per sequence element and aggregates results appropriately—accumulating step counts and taking the maximum of depths.
235-266: LGTM!The
Unrollmethod correctly substitutes the loop variable and flattens into a sequence. TheGetExtenthelper properly handles non-constant extents and guards against integer overflow when casting toint.
289-321: LGTM!The
UnrollLoopfunction and pass registration follow standard TVM patterns. The config handling with fallback to default values and the SSA conversion for modified statements are correctly implemented.
| .def_ro("auto_max_extent", &UnrollLoopConfigNode::auto_max_extent, | ||
| "The maximum extent` of loop that will be unrolled.", | ||
| refl::DefaultValue(0)) |
There was a problem hiding this comment.
Minor typo in documentation string.
There's an extra backtick character in the description string: "The maximum extent\ of loop..."should be"The maximum extent of loop..."`.
📝 Suggested fix
.def_ro("auto_max_extent", &UnrollLoopConfigNode::auto_max_extent,
- "The maximum extent` of loop that will be unrolled.",
+ "The maximum extent of loop that will be unrolled.",
refl::DefaultValue(0))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| .def_ro("auto_max_extent", &UnrollLoopConfigNode::auto_max_extent, | |
| "The maximum extent` of loop that will be unrolled.", | |
| refl::DefaultValue(0)) | |
| .def_ro("auto_max_extent", &UnrollLoopConfigNode::auto_max_extent, | |
| "The maximum extent of loop that will be unrolled.", | |
| refl::DefaultValue(0)) |
🤖 Prompt for AI Agents
In @src/transform/unroll_loop.cc around lines 62 - 64, The docstring for the
UnrollLoopConfigNode::auto_max_extent binding contains an extraneous backtick;
update the .def_ro call that binds UnrollLoopConfigNode::auto_max_extent to
remove the backtick so the description reads "The maximum extent of loop that
will be unrolled." (i.e., edit the string argument in the .def_ro for
auto_max_extent to delete the stray ` character).
| } else if (op->attr_key == "pragma_unroll_explicit") { | ||
| bool explicit_unroll = Downcast<Integer>(op->value)->value; | ||
| LOG(INFO) << "explicit_unroll: " << explicit_unroll; | ||
| std::swap(explicit_unroll, explicit_unroll_); | ||
| Stmt ret = this->VisitStmt(op->body); | ||
| std::swap(explicit_unroll, explicit_unroll_); | ||
| return ret; |
There was a problem hiding this comment.
Remove debug logging statement.
LOG(INFO) << "explicit_unroll: " << explicit_unroll; appears to be leftover debug logging that will generate noise in production builds. Consider removing it or using DLOG if debugging output is needed only in debug builds.
🧹 Suggested fix
} else if (op->attr_key == "pragma_unroll_explicit") {
bool explicit_unroll = Downcast<Integer>(op->value)->value;
- LOG(INFO) << "explicit_unroll: " << explicit_unroll;
std::swap(explicit_unroll, explicit_unroll_);
Stmt ret = this->VisitStmt(op->body);
std::swap(explicit_unroll, explicit_unroll_);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } else if (op->attr_key == "pragma_unroll_explicit") { | |
| bool explicit_unroll = Downcast<Integer>(op->value)->value; | |
| LOG(INFO) << "explicit_unroll: " << explicit_unroll; | |
| std::swap(explicit_unroll, explicit_unroll_); | |
| Stmt ret = this->VisitStmt(op->body); | |
| std::swap(explicit_unroll, explicit_unroll_); | |
| return ret; | |
| } else if (op->attr_key == "pragma_unroll_explicit") { | |
| bool explicit_unroll = Downcast<Integer>(op->value)->value; | |
| std::swap(explicit_unroll, explicit_unroll_); | |
| Stmt ret = this->VisitStmt(op->body); | |
| std::swap(explicit_unroll, explicit_unroll_); | |
| return ret; |
🤖 Prompt for AI Agents
In @src/transform/unroll_loop.cc around lines 119 - 125, Remove the leftover
debug log inside the pragma_unroll_explicit handling: delete or replace the
LOG(INFO) << "explicit_unroll: " << explicit_unroll; statement found in the
branch where op->attr_key == "pragma_unroll_explicit" (around the code that
reads bool explicit_unroll = Downcast<Integer>(op->value)->value and uses
explicit_unroll_ and VisitStmt(op->body)); if you still want conditional debug
output, change it to DLOG(INFO) instead of LOG(INFO).
| ICHECK_GE(value, 0) | ||
| << "Cannot unroll non-constant loop " << explicit_unroll_; | ||
| } |
There was a problem hiding this comment.
Confusing error message.
The error message appends explicit_unroll_ (a boolean flag) which doesn't provide useful context. Consider including the loop variable or extent instead.
📝 Suggested fix
if (explicit_unroll_) {
ICHECK_GE(value, 0)
- << "Cannot unroll non-constant loop " << explicit_unroll_;
+ << "Cannot unroll non-constant loop with variable: " << op->loop_var;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ICHECK_GE(value, 0) | |
| << "Cannot unroll non-constant loop " << explicit_unroll_; | |
| } | |
| ICHECK_GE(value, 0) | |
| << "Cannot unroll non-constant loop with variable: " << op->loop_var; | |
| } |
| // Wether to unroll loops to local access. | ||
| bool unroll_local_access_{false}; |
There was a problem hiding this comment.
Minor typo in comment.
"Wether" should be "Whether".
📝 Suggested fix
- // Wether to unroll loops to local access.
+ // Whether to unroll loops to local access.
bool unroll_local_access_{false};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Wether to unroll loops to local access. | |
| bool unroll_local_access_{false}; | |
| // Whether to unroll loops to local access. | |
| bool unroll_local_access_{false}; |
🤖 Prompt for AI Agents
In @src/transform/unroll_loop.cc around lines 275 - 276, Fix the typo in the
comment above the unroll_local_access_ field: change "Wether to unroll loops to
local access." to "Whether to unroll loops to local access." so the comment
correctly reads "Whether to unroll loops to local access." and remains
immediately above the bool unroll_local_access_{false}; declaration.
…d corresponding test case (tile-ai#1654) * Add unroll loop functionality and corresponding test case - Introduced a new `UnrollLoop` function in the transform module to unroll loops based on various configuration options. - Added a test case in `test_tilelang_language_unroll.py` to validate the behavior of `T.unroll` with only the extent parameter, ensuring correct kernel generation with unroll pragmas. * Refactor unroll kernel implementation and update test case - Changed the kernel function in `test_tilelang_language_unroll.py` to use a new `unroll_kernel` function that compiles and returns the output tensor, improving clarity and structure. - Updated the `OptimizeForTarget` function in `phase.py` to ensure the `UnrollLoop` transformation is applied correctly, maintaining consistency in optimization phases. * lint fix * lint fix
* finish KDA algorithm in tilelang * fix pre-commit.ci * fix pre-commit.ci * fix pre-commit local * [Style] Fix some code styles * [Refactor] Remove redundant swizzle for they can be automatically done * [Refactor] remove chunk_bwd_intra.py and rename chunk_bwd_intra_op.py and do some fix form coderabbitai * update ruff * update pre-commit * [Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case (#1654) * Add unroll loop functionality and corresponding test case - Introduced a new `UnrollLoop` function in the transform module to unroll loops based on various configuration options. - Added a test case in `test_tilelang_language_unroll.py` to validate the behavior of `T.unroll` with only the extent parameter, ensuring correct kernel generation with unroll pragmas. * Refactor unroll kernel implementation and update test case - Changed the kernel function in `test_tilelang_language_unroll.py` to use a new `unroll_kernel` function that compiles and returns the output tensor, improving clarity and structure. - Updated the `OptimizeForTarget` function in `phase.py` to ensure the `UnrollLoop` transformation is applied correctly, maintaining consistency in optimization phases. * lint fix * lint fix * [Bugfix] Fix missing annotations for default CallNode Visitor (#1659) tvm fix * [Clean] Remove unnecessary debug print (#1661) remove unnecessary debug print * [Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies (#1657) * [Enhancement] Update global load/store functions for CUDA compatibility (#1652) Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA versions above 12.9 and earlier versions. This change ensures that 256-bit loads and stores are handled correctly across different CUDA versions, improving performance and compatibility. The implementation now uses two 128-bit loads/stores for older versions, enhancing the robustness of the codebase. * Update comments in global load/store functions for CUDA compatibility Clarified comments in `ld_global_256` and `st_global_256` functions to indicate that the fallback for CUDA versions below 12.9 may have performance regressions. This change enhances code readability and provides better context for developers working with different CUDA versions. * Update submodule and enhance LetStmt handling in inject_pipeline.cc - Updated the TVM submodule to the latest commit. - Improved the handling of LetStmt in the inject_pipeline.cc file to account for transitive dependencies on loop variables, ensuring correct variable substitution in rewritten blocks. - Adjusted test_tilelang_issue_1263.py to remove unnecessary jit decorator and updated the kernel compilation process with specific pass configurations. * lint fix * revert tvm * remove unused test * test fix * [Refactor] Improve CallNode handling to include annotations in various operations (#1663) * [Enhancement] Update CallNode handling to include annotations in various operations - Modified CallNode invocations in multiple files to ensure that annotations are passed correctly, enhancing the consistency and functionality of the codebase. - Removed the "use_tma" annotation from AtomicAddNode and adjusted related calls to maintain expected behavior. - Updated CUDA intrinsic dispatch functions to include annotations, improving compatibility and correctness in CUDA operations. * lint fix * [EagerJIT] Add Support for Parameter Only Kernel Compilation (#1664) * [Fix] Refactor type hint extraction logic in DSLMutator for better clarity and handling of annotations * [Refactor] Remove redundant tensor creation in loop layout tests and update kernel compilation parameters * [AutoDD] Add Tilelang AutoDD to Reduce Buggy Program (#1639) * [Feat] Add tilelang autodd for delta debugging * fix typos * fix lint error * fix typos * fix lint error * fix bugs * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix codeview comments * [Refactor] Move AutoDD detection to env module and update import logic * Refactor: Relocate the _is_running_autodd function to the env module for better organization and encapsulation. * Update initialization logic to skip logger and heavy imports based on a new light import mode, enhancing flexibility in module usage. * Ensure consistent handling of environment variables across the package, improving overall code clarity and maintainability. * [Documentation] Add AutoDD section to debug_tools_for_tilelang.md * Introduced a comprehensive guide on AutoDD (Automatic Delta Debugging) for isolating bugs in TileLang programs. * Explained Delta Debugging methodology, usage, parameters, and provided examples for clarity. * Highlighted the benefits of using AutoDD for large codebases and hard-to-locate errors, emphasizing time-saving aspects. * Included tips for effective usage and a reference to a complete example in the documentation. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: kurisu6912 <227995639+kurisu6912@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * rebase origin * [Feature] Support `cp.reduce.async.bulk.tensor` (#1667) * support cp.reduce.async.bulk.tensor and add test * Refactor flash attention example by removing unnecessary layout annotations * support swizzle layout for tma reduce * auto swizzle for non-1d tma atomic add * upd example and test * lint * typo * add constraint for test * Refactor CUDA data type mapping by moving the to_CUtensorMapDataType function to utils.cc and utils.h, while removing redundant definitions from atomic_add.cc and copy.cc. * lint * rename basename according to CI * Update submodule TVM and remove deprecated KDA example files - Updated the TVM submodule to commit 354eef9a. - Removed several outdated KDA example files and utility scripts that are no longer in use, including chunk_bwd_dqkwg.py, chunk_bwd_dv.py, chunk_bwd_gla_dA.py, chunk_bwd_intra.py, chunk_delta_bwd.py, chunk_delta_h_fwd.py, chunk_inter_solve_fused.py, chunk_intra_token_parallel.py, chunk_o.py, README.md, test_utils_kda.py, wy_fast_bwd.py, wy_fast.py, and various FLA_KDA implementations. * lint fix --------- Co-authored-by: wufang <wufang@MBP-MK6VR66Y2M-2329.local> Co-authored-by: tzj-fxz <tzjfxz@gmail.com> Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: Kuris <227995639+kurisu6912@users.noreply.github.com> Co-authored-by: Kexing Zhou <KEKE_046@pku.edu.cn> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com> Co-authored-by: Tong WU <109033598+Rachmanino@users.noreply.github.com>
UnrollLoopfunction in the transform module to unroll loops based on various configuration options.test_tilelang_language_unroll.pyto validate the behavior ofT.unrollwith only the extent parameter, ensuring correct kernel generation with unroll pragmas.Summary by CodeRabbit
New Features
Tests
✏️ Tip: You can customize this high-level summary in your review settings.