Skip to content

[Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case#1654

Merged
LeiWang1999 merged 4 commits intotile-ai:mainfrom
LeiWang1999:unroll_0112
Jan 12, 2026
Merged

[Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case#1654
LeiWang1999 merged 4 commits intotile-ai:mainfrom
LeiWang1999:unroll_0112

Conversation

@LeiWang1999
Copy link
Member

@LeiWang1999 LeiWang1999 commented Jan 12, 2026

  • Introduced a new UnrollLoop function in the transform module to unroll loops based on various configuration options.
  • Added a test case in test_tilelang_language_unroll.py to validate the behavior of T.unroll with only the extent parameter, ensuring correct kernel generation with unroll pragmas.

Summary by CodeRabbit

  • New Features

    • Added an automatic loop-unrolling transform with configurable policies (extent, nesting, explicit hints, and local-memory awareness), exposed to Python and integrated into the optimization pipeline.
  • Tests

    • Added tests validating unroll behavior (including extent-only usage) and updated test imports.

✏️ Tip: You can customize this high-level summary in your review settings.

- Introduced a new `UnrollLoop` function in the transform module to unroll loops based on various configuration options.
- Added a test case in `test_tilelang_language_unroll.py` to validate the behavior of `T.unroll` with only the extent parameter, ensuring correct kernel generation with unroll pragmas.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 12, 2026

📝 Walkthrough

Walkthrough

Adds a new C++ UnrollLoop transform and its config, exposes it via tilelang.transform.UnrollLoop, switches the optimization pipeline to use it, and adds a unit test validating extent-only unroll emission.

Changes

Cohort / File(s) Summary
Core Implementation
src/transform/unroll_loop.cc
New C++ transform UnrollLoop and UnrollLoopConfig with LoopUnroller and VarLocalAccessMarker; handles pragma attributes, auto-unroll heuristics (extent/step/depth), local-memory enforcement, index substitution, optional SSA conversion, and registers tl.transform.UnrollLoop.
Python Bindings
tilelang/transform/__init__.py
Added UnrollLoop() accessor that forwards to the FFI backend (_ffi_api.UnrollLoop()) with docstring.
Engine / Pipeline
tilelang/engine/phase.py
Replaced use of tir.transform.UnrollLoop with tilelang.transform.UnrollLoop in OptimizeForTarget pipeline.
Tests
testing/python/language/test_tilelang_language_unroll.py
Added test_unroll_with_extent_only and initial import tilelang; asserts generated kernel contains #pragma unroll for extent-only unroll.

Sequence Diagram(s)

sequenceDiagram
  participant Python as Python caller (tilelang.transform.UnrollLoop)
  participant FFI as FFI dispatch (_ffi_api)
  participant Cpp as C++ transform (UnrollLoop)
  participant IR as PrimFunc/IR body

  Python->>FFI: call UnrollLoop(config?)
  FFI->>Cpp: invoke UnrollLoop pass with config
  Cpp->>IR: analyze loops (extents, steps, pragmas, local access)
  Cpp->>IR: transform eligible For loops -> unrolled bodies / mark kUnrolled
  Cpp-->>FFI: return transformed IR
  FFI-->>Python: return transformed PrimFunc/stmt
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • kurisu6912

Poem

🐰 I hop through loops and peel each tiny thread,

I heed the pragmas and count each step and depth,
I mark the locals, scatter indices wide,
and lay bare for-loops till no bounds hide.
A carrot cheer for passes now sped! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: introducing enhanced unroll loop functionality for dynamic extents and adding corresponding test cases.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
src/transform/unroll_loop.cc (3)

111-117: Remove or reduce debug logging verbosity.

This LOG(INFO) statement will print to logs every time the pragma_unroll_explicit attribute is encountered, which could be noisy in production. Consider using DLOG(INFO) for debug builds only, or remove the log statement entirely.

♻️ Suggested fix
     } else if (op->attr_key == "pragma_unroll_explicit") {
       bool explicit_unroll = Downcast<Integer>(op->value)->value;
-      LOG(INFO) << "explicit_unroll: " << explicit_unroll;
       std::swap(explicit_unroll, explicit_unroll_);
       Stmt ret = this->VisitStmt(op->body);
       std::swap(explicit_unroll, explicit_unroll_);

123-123: Add final keyword for consistency.

This method overrides a virtual function from StmtExprMutator but is missing the override or final keyword, unlike other overridden methods in this class (lines 104, 171, 185, 200, 205).

♻️ Suggested fix
-  Stmt VisitStmt_(const ForNode* op) {
+  Stmt VisitStmt_(const ForNode* op) final {

59-60: Minor typo in docstring.

There's an extra backtick character in the description string.

📝 Suggested fix
        .def_ro("auto_max_extent", &UnrollLoopConfigNode::auto_max_extent,
-               "The maximum extent` of loop that will be unrolled.", refl::DefaultValue(0))
+               "The maximum extent of loop that will be unrolled.", refl::DefaultValue(0))
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e347e3 and e6e819a.

📒 Files selected for processing (3)
  • src/transform/unroll_loop.cc
  • testing/python/language/test_tilelang_language_unroll.py
  • tilelang/transform/__init__.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2026-01-06T05:20:51.649Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1606
File: testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py:30-30
Timestamp: 2026-01-06T05:20:51.649Z
Learning: In `testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py`, the test validates that the `hoist_broadcast_values` transformation pass correctly identifies and hoists broadcast operations by checking for patterns in the generated kernel source code. The specific literal values used (e.g., 430) are not important for the test's purpose, as it does not validate numerical precision or actual stored tensor values.

Applied to files:

  • testing/python/language/test_tilelang_language_unroll.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_unroll.py
🧬 Code graph analysis (3)
tilelang/transform/__init__.py (1)
src/transform/unroll_loop.cc (4)
  • UnrollLoop (274-282)
  • UnrollLoop (274-274)
  • UnrollLoop (288-299)
  • UnrollLoop (288-288)
src/transform/unroll_loop.cc (2)
tilelang/language/loop.py (1)
  • Unroll (254-265)
tilelang/transform/__init__.py (1)
  • UnrollLoop (515-530)
testing/python/language/test_tilelang_language_unroll.py (3)
tilelang/language/kernel.py (1)
  • threads (215-219)
tilelang/intrinsics/mfma_macro_generator.py (1)
  • get_thread_binding (234-240)
tilelang/language/loop.py (1)
  • unroll (174-236)
🪛 Cppcheck (2.19.0)
src/transform/unroll_loop.cc

[error] 76-76: syntax error

(syntaxError)

🔇 Additional comments (3)
tilelang/transform/__init__.py (1)

515-530: LGTM!

The UnrollLoop function follows the established pattern of other pass accessors in this module, with a comprehensive docstring documenting the configuration options and proper FFI delegation.

testing/python/language/test_tilelang_language_unroll.py (1)

37-53: LGTM!

The test effectively validates the T.unroll behavior with extent-only parameter using a dynamic extent (tid % 32). The test correctly verifies that the #pragma unroll directive is present in the generated kernel source without needing to execute the kernel.

src/transform/unroll_loop.cc (1)

274-298: Implementation looks solid.

The UnrollLoop function correctly applies the transformation with SSA conversion when changes occur. The pass registration follows standard TVM patterns with proper config handling and default values fallback.

Note: The static analysis hint about a syntax error on line 76 is a false positive—TVM_FFI_STATIC_INIT_BLOCK() is a valid TVM macro.

- Changed the kernel function in `test_tilelang_language_unroll.py` to use a new `unroll_kernel` function that compiles and returns the output tensor, improving clarity and structure.
- Updated the `OptimizeForTarget` function in `phase.py` to ensure the `UnrollLoop` transformation is applied correctly, maintaining consistency in optimization phases.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @src/transform/unroll_loop.cc:
- Around line 62-64: Fix the docstring for the
UnrollLoopConfigNode::auto_max_extent binding by removing the stray backtick:
update the string passed to .def_ro("auto_max_extent",
&UnrollLoopConfigNode::auto_max_extent, ...) so it reads "The maximum extent of
loop that will be unrolled." (replace `"The maximum extent\` of loop..."` with
the corrected text).
- Around line 119-125: The code leaves a debug logging call LOG(INFO) inside the
pragma_unroll_explicit handling; remove this production log (or replace with
VLOG if optional debug is desired) so normal passes don't emit info-level logs:
locate the block handling op->attr_key == "pragma_unroll_explicit" (variables
explicit_unroll and member explicit_unroll_, and the VisitStmt(op->body) call)
and delete the LOG(INFO) << "explicit_unroll: " << explicit_unroll; line (or
change it to VLOG(<appropriate_verbosity>) << ...).
🧹 Nitpick comments (1)
src/transform/unroll_loop.cc (1)

131-131: Add final override specifier for consistency.

Other VisitStmt_ overrides in this class use final, but this one is missing it. Adding final ensures consistency and prevents accidental further overriding in subclasses.

✏️ Suggested fix
-  Stmt VisitStmt_(const ForNode *op) {
+  Stmt VisitStmt_(const ForNode *op) final {
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63aa83b and 50afb54.

📒 Files selected for processing (1)
  • src/transform/unroll_loop.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/unroll_loop.cc (2)
tilelang/language/loop.py (1)
  • Unroll (254-265)
tilelang/transform/__init__.py (1)
  • UnrollLoop (515-530)
🪛 Cppcheck (2.19.0)
src/transform/unroll_loop.cc

[error] 83-83: syntax error

(syntaxError)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Quick Lint
🔇 Additional comments (4)
src/transform/unroll_loop.cc (4)

77-85: Configuration registration looks correct.

The UnrollLoopConfig wrapper and registration pattern follows TVM conventions. The Cppcheck syntax error at line 83 is a false positive — TVM_FFI_STATIC_INIT_BLOCK is a valid TVM macro for static initialization that the static analyzer doesn't recognize.


87-98: LGTM!

Simple and correct visitor implementation for marking variables accessed in local memory contexts.


289-298: LGTM!

The function correctly applies the unroller and conditionally converts to SSA form only when changes occur.


304-315: LGTM!

The pass factory correctly retrieves configuration from the pass context with a fallback to defaults, and the registration follows TVM conventions.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @src/transform/unroll_loop.cc:
- Around line 275-276: Fix the typo in the comment above the
unroll_local_access_ field: change "Wether to unroll loops to local access." to
"Whether to unroll loops to local access." so the comment correctly reads
"Whether to unroll loops to local access." and remains immediately above the
bool unroll_local_access_{false}; declaration.
- Around line 62-64: The docstring for the UnrollLoopConfigNode::auto_max_extent
binding contains an extraneous backtick; update the .def_ro call that binds
UnrollLoopConfigNode::auto_max_extent to remove the backtick so the description
reads "The maximum extent of loop that will be unrolled." (i.e., edit the string
argument in the .def_ro for auto_max_extent to delete the stray ` character).
- Around line 119-125: Remove the leftover debug log inside the
pragma_unroll_explicit handling: delete or replace the LOG(INFO) <<
"explicit_unroll: " << explicit_unroll; statement found in the branch where
op->attr_key == "pragma_unroll_explicit" (around the code that reads bool
explicit_unroll = Downcast<Integer>(op->value)->value and uses explicit_unroll_
and VisitStmt(op->body)); if you still want conditional debug output, change it
to DLOG(INFO) instead of LOG(INFO).
🧹 Nitpick comments (1)
src/transform/unroll_loop.cc (1)

131-131: Missing override specifier on virtual function.

The VisitStmt_ method overrides a virtual function from StmtExprMutator but lacks the override (or final) specifier, which helps catch signature mismatches at compile time.

♻️ Suggested fix
-  Stmt VisitStmt_(const ForNode *op) {
+  Stmt VisitStmt_(const ForNode *op) final {
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50afb54 and 3f6a8ec.

📒 Files selected for processing (1)
  • src/transform/unroll_loop.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/unroll_loop.cc (2)
tilelang/language/loop.py (1)
  • Unroll (254-265)
tilelang/transform/__init__.py (1)
  • UnrollLoop (515-530)
🪛 Cppcheck (2.19.0)
src/transform/unroll_loop.cc

[error] 83-83: syntax error

(syntaxError)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (4)
src/transform/unroll_loop.cc (4)

87-98: LGTM!

The VarLocalAccessMarker visitor is a clean, focused implementation for collecting variables from expressions.


218-233: LGTM!

The SeqStmtNode visitor correctly isolates counter state per sequence element and aggregates results appropriately—accumulating step counts and taking the maximum of depths.


235-266: LGTM!

The Unroll method correctly substitutes the loop variable and flattens into a sequence. The GetExtent helper properly handles non-constant extents and guards against integer overflow when casting to int.


289-321: LGTM!

The UnrollLoop function and pass registration follow standard TVM patterns. The config handling with fallback to default values and the SSA conversion for modified statements are correctly implemented.

Comment on lines +62 to +64
.def_ro("auto_max_extent", &UnrollLoopConfigNode::auto_max_extent,
"The maximum extent` of loop that will be unrolled.",
refl::DefaultValue(0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Minor typo in documentation string.

There's an extra backtick character in the description string: "The maximum extent\ of loop..."should be"The maximum extent of loop..."`.

📝 Suggested fix
         .def_ro("auto_max_extent", &UnrollLoopConfigNode::auto_max_extent,
-                "The maximum extent` of loop that will be unrolled.",
+                "The maximum extent of loop that will be unrolled.",
                 refl::DefaultValue(0))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
.def_ro("auto_max_extent", &UnrollLoopConfigNode::auto_max_extent,
"The maximum extent` of loop that will be unrolled.",
refl::DefaultValue(0))
.def_ro("auto_max_extent", &UnrollLoopConfigNode::auto_max_extent,
"The maximum extent of loop that will be unrolled.",
refl::DefaultValue(0))
🤖 Prompt for AI Agents
In @src/transform/unroll_loop.cc around lines 62 - 64, The docstring for the
UnrollLoopConfigNode::auto_max_extent binding contains an extraneous backtick;
update the .def_ro call that binds UnrollLoopConfigNode::auto_max_extent to
remove the backtick so the description reads "The maximum extent of loop that
will be unrolled." (i.e., edit the string argument in the .def_ro for
auto_max_extent to delete the stray ` character).

Comment on lines +119 to +125
} else if (op->attr_key == "pragma_unroll_explicit") {
bool explicit_unroll = Downcast<Integer>(op->value)->value;
LOG(INFO) << "explicit_unroll: " << explicit_unroll;
std::swap(explicit_unroll, explicit_unroll_);
Stmt ret = this->VisitStmt(op->body);
std::swap(explicit_unroll, explicit_unroll_);
return ret;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove debug logging statement.

LOG(INFO) << "explicit_unroll: " << explicit_unroll; appears to be leftover debug logging that will generate noise in production builds. Consider removing it or using DLOG if debugging output is needed only in debug builds.

🧹 Suggested fix
     } else if (op->attr_key == "pragma_unroll_explicit") {
       bool explicit_unroll = Downcast<Integer>(op->value)->value;
-      LOG(INFO) << "explicit_unroll: " << explicit_unroll;
       std::swap(explicit_unroll, explicit_unroll_);
       Stmt ret = this->VisitStmt(op->body);
       std::swap(explicit_unroll, explicit_unroll_);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
} else if (op->attr_key == "pragma_unroll_explicit") {
bool explicit_unroll = Downcast<Integer>(op->value)->value;
LOG(INFO) << "explicit_unroll: " << explicit_unroll;
std::swap(explicit_unroll, explicit_unroll_);
Stmt ret = this->VisitStmt(op->body);
std::swap(explicit_unroll, explicit_unroll_);
return ret;
} else if (op->attr_key == "pragma_unroll_explicit") {
bool explicit_unroll = Downcast<Integer>(op->value)->value;
std::swap(explicit_unroll, explicit_unroll_);
Stmt ret = this->VisitStmt(op->body);
std::swap(explicit_unroll, explicit_unroll_);
return ret;
🤖 Prompt for AI Agents
In @src/transform/unroll_loop.cc around lines 119 - 125, Remove the leftover
debug log inside the pragma_unroll_explicit handling: delete or replace the
LOG(INFO) << "explicit_unroll: " << explicit_unroll; statement found in the
branch where op->attr_key == "pragma_unroll_explicit" (around the code that
reads bool explicit_unroll = Downcast<Integer>(op->value)->value and uses
explicit_unroll_ and VisitStmt(op->body)); if you still want conditional debug
output, change it to DLOG(INFO) instead of LOG(INFO).

Comment on lines +146 to +148
ICHECK_GE(value, 0)
<< "Cannot unroll non-constant loop " << explicit_unroll_;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Confusing error message.

The error message appends explicit_unroll_ (a boolean flag) which doesn't provide useful context. Consider including the loop variable or extent instead.

📝 Suggested fix
       if (explicit_unroll_) {
         ICHECK_GE(value, 0)
-            << "Cannot unroll non-constant loop " << explicit_unroll_;
+            << "Cannot unroll non-constant loop with variable: " << op->loop_var;
       }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ICHECK_GE(value, 0)
<< "Cannot unroll non-constant loop " << explicit_unroll_;
}
ICHECK_GE(value, 0)
<< "Cannot unroll non-constant loop with variable: " << op->loop_var;
}

Comment on lines +275 to +276
// Wether to unroll loops to local access.
bool unroll_local_access_{false};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Minor typo in comment.

"Wether" should be "Whether".

📝 Suggested fix
-  // Wether to unroll loops to local access.
+  // Whether to unroll loops to local access.
   bool unroll_local_access_{false};
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Wether to unroll loops to local access.
bool unroll_local_access_{false};
// Whether to unroll loops to local access.
bool unroll_local_access_{false};
🤖 Prompt for AI Agents
In @src/transform/unroll_loop.cc around lines 275 - 276, Fix the typo in the
comment above the unroll_local_access_ field: change "Wether to unroll loops to
local access." to "Whether to unroll loops to local access." so the comment
correctly reads "Whether to unroll loops to local access." and remains
immediately above the bool unroll_local_access_{false}; declaration.

@LeiWang1999 LeiWang1999 merged commit 9936636 into tile-ai:main Jan 12, 2026
4 of 6 checks passed
tzj-fxz pushed a commit to wfloveiu/tilelang that referenced this pull request Jan 14, 2026
…d corresponding test case (tile-ai#1654)

* Add unroll loop functionality and corresponding test case

- Introduced a new `UnrollLoop` function in the transform module to unroll loops based on various configuration options.
- Added a test case in `test_tilelang_language_unroll.py` to validate the behavior of `T.unroll` with only the extent parameter, ensuring correct kernel generation with unroll pragmas.

* Refactor unroll kernel implementation and update test case

- Changed the kernel function in `test_tilelang_language_unroll.py` to use a new `unroll_kernel` function that compiles and returns the output tensor, improving clarity and structure.
- Updated the `OptimizeForTarget` function in `phase.py` to ensure the `UnrollLoop` transformation is applied correctly, maintaining consistency in optimization phases.

* lint fix

* lint fix
LeiWang1999 added a commit that referenced this pull request Jan 26, 2026
* finish KDA algorithm in tilelang

* fix pre-commit.ci

* fix pre-commit.ci

* fix pre-commit local

* [Style] Fix some code styles

* [Refactor] Remove redundant swizzle for they can be automatically done

* [Refactor] remove chunk_bwd_intra.py and rename chunk_bwd_intra_op.py and do some fix form coderabbitai

* update ruff

* update pre-commit

* [Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case (#1654)

* Add unroll loop functionality and corresponding test case

- Introduced a new `UnrollLoop` function in the transform module to unroll loops based on various configuration options.
- Added a test case in `test_tilelang_language_unroll.py` to validate the behavior of `T.unroll` with only the extent parameter, ensuring correct kernel generation with unroll pragmas.

* Refactor unroll kernel implementation and update test case

- Changed the kernel function in `test_tilelang_language_unroll.py` to use a new `unroll_kernel` function that compiles and returns the output tensor, improving clarity and structure.
- Updated the `OptimizeForTarget` function in `phase.py` to ensure the `UnrollLoop` transformation is applied correctly, maintaining consistency in optimization phases.

* lint fix

* lint fix

* [Bugfix] Fix missing annotations for default CallNode Visitor (#1659)

tvm fix

* [Clean] Remove unnecessary debug print (#1661)

remove unnecessary debug print

* [Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies (#1657)

* [Enhancement] Update global load/store functions for CUDA compatibility (#1652)

Refactor the `ld_global_256` and `st_global_256` functions to support both CUDA versions above 12.9 and earlier versions. This change ensures that 256-bit loads and stores are handled correctly across different CUDA versions, improving performance and compatibility. The implementation now uses two 128-bit loads/stores for older versions, enhancing the robustness of the codebase.

* Update comments in global load/store functions for CUDA compatibility

Clarified comments in `ld_global_256` and `st_global_256` functions to indicate that the fallback for CUDA versions below 12.9 may have performance regressions. This change enhances code readability and provides better context for developers working with different CUDA versions.

* Update submodule and enhance LetStmt handling in inject_pipeline.cc

- Updated the TVM submodule to the latest commit.
- Improved the handling of LetStmt in the inject_pipeline.cc file to account for transitive dependencies on loop variables, ensuring correct variable substitution in rewritten blocks.
- Adjusted test_tilelang_issue_1263.py to remove unnecessary jit decorator and updated the kernel compilation process with specific pass configurations.

* lint fix

* revert tvm

* remove unused test

* test fix

* [Refactor] Improve CallNode handling to include annotations in various operations (#1663)

* [Enhancement] Update CallNode handling to include annotations in various operations

- Modified CallNode invocations in multiple files to ensure that annotations are passed correctly, enhancing the consistency and functionality of the codebase.
- Removed the "use_tma" annotation from AtomicAddNode and adjusted related calls to maintain expected behavior.
- Updated CUDA intrinsic dispatch functions to include annotations, improving compatibility and correctness in CUDA operations.

* lint fix

* [EagerJIT] Add Support for Parameter Only Kernel Compilation (#1664)

* [Fix] Refactor type hint extraction logic in DSLMutator for better clarity and handling of annotations

* [Refactor] Remove redundant tensor creation in loop layout tests and update kernel compilation parameters

* [AutoDD] Add Tilelang AutoDD to Reduce Buggy Program (#1639)

* [Feat] Add tilelang autodd for delta debugging

* fix typos

* fix lint error

* fix typos

* fix lint error

* fix bugs

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix codeview comments

* [Refactor] Move AutoDD detection to env module and update import logic

* Refactor: Relocate the _is_running_autodd function to the env module for better organization and encapsulation.
* Update initialization logic to skip logger and heavy imports based on a new light import mode, enhancing flexibility in module usage.
* Ensure consistent handling of environment variables across the package, improving overall code clarity and maintainability.

* [Documentation] Add AutoDD section to debug_tools_for_tilelang.md

* Introduced a comprehensive guide on AutoDD (Automatic Delta Debugging) for isolating bugs in TileLang programs.
* Explained Delta Debugging methodology, usage, parameters, and provided examples for clarity.
* Highlighted the benefits of using AutoDD for large codebases and hard-to-locate errors, emphasizing time-saving aspects.
* Included tips for effective usage and a reference to a complete example in the documentation.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: kurisu6912 <227995639+kurisu6912@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* rebase origin

* [Feature] Support `cp.reduce.async.bulk.tensor` (#1667)

* support cp.reduce.async.bulk.tensor and add test

* Refactor flash attention example by removing unnecessary layout annotations

* support swizzle layout for tma reduce

* auto swizzle for non-1d tma atomic add

* upd example and test

* lint

* typo

* add constraint for test

* Refactor CUDA data type mapping by moving the to_CUtensorMapDataType function to utils.cc and utils.h, while removing redundant definitions from atomic_add.cc and copy.cc.

* lint

* rename basename according to CI

* Update submodule TVM and remove deprecated KDA example files

- Updated the TVM submodule to commit 354eef9a.
- Removed several outdated KDA example files and utility scripts that are no longer in use, including chunk_bwd_dqkwg.py, chunk_bwd_dv.py, chunk_bwd_gla_dA.py, chunk_bwd_intra.py, chunk_delta_bwd.py, chunk_delta_h_fwd.py, chunk_inter_solve_fused.py, chunk_intra_token_parallel.py, chunk_o.py, README.md, test_utils_kda.py, wy_fast_bwd.py, wy_fast.py, and various FLA_KDA implementations.

* lint fix

---------

Co-authored-by: wufang <wufang@MBP-MK6VR66Y2M-2329.local>
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: Kuris <227995639+kurisu6912@users.noreply.github.com>
Co-authored-by: Kexing Zhou <KEKE_046@pku.edu.cn>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
Co-authored-by: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant