[Enhancement] Improve iterator handling in layout utilities and parallel operations #1221
Conversation
…lel operations

* Added a new function, `DivideUnusedIterators`, to detect per-iterator gaps in fused index expressions, enhancing the accuracy of unused iterator detection.
* Updated `CompleteBufferFragment` to prefer direct inversion for bijective index mappings and introduced a fallback mechanism for non-bijective cases, improving layout inversion robustness.
* Added a new test for layout inference in fused kernels to ensure correct compilation and execution without layout inversion failures.
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run `…`. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Walkthrough

This PR refactors iterator split computation and optimizes buffer fragment completion for bijective mappings. It introduces a new `DivideUnusedIterators` function that detects per-iterator gaps in fused index expressions.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Caller
participant CBF as CompleteBufferFragment
participant Check as BijectiveCheck
participant FwdPath as FastBijectivePath
participant RepBSetup as RepB Setup
participant CardCheck as CardinalityCheck
participant FallBack as Fallback Path
participant Output as Fragment
Caller->>CBF: CompleteBufferFragment(buffer)
rect rgb(230, 240, 255)
Note over Check,FwdPath: Fast Bijective Path (new)
CBF->>Check: Check 2D bijective mapping
alt 2D indices form bijection
Check->>FwdPath: Invert 2D mapping directly
FwdPath->>Output: Return condensed Fragment
end
end
rect rgb(240, 230, 255)
Note over RepBSetup,CardCheck: Extended Path with RepB (new)
CBF->>RepBSetup: Create rep_b from unused iterators
RepBSetup->>RepBSetup: Flatten and extend indice_map
RepBSetup->>CardCheck: Check cardinality (in_prod vs out_prod)
alt Bijectivity holds after RepB
CardCheck->>CBF: Compute ind_inv, use ForwardThread
CBF->>Output: Return replication-aware Fragment
else Non-bijective after RepB
CardCheck->>FallBack: Compute non-replicated inverse
FallBack->>Output: Return CondenseReplicateVar Fragment
end
    end
```
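To make the two branches concrete, here is a minimal, self-contained Python sketch of the decision the diagram describes (illustrative only: `invert_if_bijective`, the brute-force enumeration, and the example extents are stand-ins, not the C++ implementation). Direct inversion is attempted only when the forward index map is a bijection from the input iteration space onto the output space; anything non-bijective goes to the replication-aware fallback.

```python
from itertools import product

def invert_if_bijective(fwd, in_extents, out_extent):
    """Try to invert an integer index map by enumeration.

    Returns a lookup table (output index -> input point) when `fwd` is a
    bijection from the input iteration space onto range(out_extent),
    else None, the analogue of falling back to the replication path.
    """
    image = {}
    for pt in product(*(range(e) for e in in_extents)):
        out = fwd(*pt)
        if out in image:              # two inputs collide: not injective
            return None
        image[out] = pt
    if len(image) != out_extent:      # outputs missed: not surjective
        return None
    return image

# Fused map (i, j) -> i*8 + j over a 16x8 space: bijective onto 128 slots,
# so the fast path can invert it directly as (idx // 8, idx % 8).
inv = invert_if_bijective(lambda i, j: i * 8 + j, (16, 8), 128)
assert inv is not None and inv[19] == (2, 3)

# Replicated map (i, j) -> i: every j collides, so the fallback fires.
assert invert_if_bijective(lambda i, j: i, (16, 8), 16) is None
```

The real code decides this symbolically, via the cardinality comparison shown in the diagram (in_prod vs out_prod), rather than by enumeration.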
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Poem
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/layout/utils.cc (1)
133-151: Don't throw on fused IterMark sources
`NormalizeToIterSum` legitimately produces `IterMark` nodes whose source is an `IterSumExpr` (e.g., fused iterators). The new blanket check now throws a `NormalizeIterException` for those marks, so any fused layout that previously worked will immediately fail. Instead of rejecting them, skip non-Var sources when merging splits and keep the existing behavior for Var-backed marks. Apply this diff to keep fused marks legal:
```diff
-  for (const IterMark &mark : collector.visited_) {
-    if (!mark->source.as<Var>()) {
-      std::ostringstream oss;
-      oss << "Not a normalized iterator: " << mark;
-      throw NormalizeIterException(oss.str());
-    }
-  }
-
   for (const IterVar &iter : input_iters) {
     // Merge splits from all IterMark that share the same source Var as `iter`.
     std::vector<IterSplitExpr> merged_splits;
     for (const IterMark &mark : collector.visited_) {
-      auto vexpr = mark->source.as<Var>();
-      if (vexpr && vexpr.value().same_as(iter->var)) {
+      auto vexpr = mark->source.as<Var>();
+      if (!vexpr)
+        continue;
+      if (vexpr.value().same_as(iter->var)) {
         auto it = collector.mark2splits_.find(mark);
         if (it != collector.mark2splits_.end()) {
           const auto &vec = it->second;
           merged_splits.insert(merged_splits.end(), vec.begin(), vec.end());
         }
```
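As background on why these marks matter: per the PR description, `DivideUnusedIterators` finds iterators that never reach the fused index expression, and such iterators must become replication dimensions. A brute-force Python stand-in for that idea (the function name, enumeration strategy, and example extents below are illustrative assumptions, not the actual algorithm):

```python
from itertools import product

def unused_iterators(index_fn, extents):
    """Brute-force stand-in for the per-iterator gap check: iterator k is
    "unused" if the fused index never depends on coordinate k."""
    unused = []
    for k, ext in enumerate(extents):
        def depends_on_k(pt):
            base = list(pt)
            ref = None
            for v in range(ext):
                base[k] = v
                cur = index_fn(*base)
                if ref is None:
                    ref = cur
                elif cur != ref:      # varying k changed the index
                    return True
            return False
        others = [range(e) if axis != k else range(1)
                  for axis, e in enumerate(extents)]
        if not any(depends_on_k(pt) for pt in product(*others)):
            unused.append(k)
    return unused

# j feeds the fused index but i does not: i must become a replication dim.
assert unused_iterators(lambda i, j: j, (4, 8)) == [0]
# A fully fused index uses both iterators: nothing is unused.
assert unused_iterators(lambda i, j: i * 8 + j, (4, 8)) == []
```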
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
* src/layout/utils.cc (2 hunks)
* src/op/parallel.cc (1 hunks)
* testing/python/layout/test_tilelang_layout_fused_replicate.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/op/parallel.cc (2)
src/layout/utils.cc (6)
* ToVMap (268-274), ToVMap (268-268)
* MakeFlattenedExpression (170-180), MakeFlattenedExpression (170-170)
* DivideUnusedIterators (122-168), DivideUnusedIterators (122-124)

src/layout/layout.cc (6)

* Layout (57-69), Layout (71-74)
* InputPlaceholder (30-32), InputPlaceholder (30-30)
* Fragment (318-340), Fragment (342-352)
testing/python/layout/test_tilelang_layout_fused_replicate.py (4)
tilelang/testing/__init__.py (1)
* set_random_seed (30-35)

tilelang/language/allocate.py (1)

* alloc_fragment (59-70)

tilelang/language/loop.py (1)

* Parallel (12-32)

tilelang/language/v2/dtypes.py (2)

* bfloat16 (297-297), float32 (200-200)
🪛 Ruff (0.14.3)
testing/python/layout/test_tilelang_layout_fused_replicate.py
18-18: Unused function argument: a
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Quick Lint
```python
for i, j in T.Parallel(BLOCK_MN, BLOCK_K):
    idx = i * BLOCK_K + j
    a_out[pid_b, offs_m + i, offs_n + j] = a_fp32_local[idx // VEC_SIZE, idx % VEC_SIZE]
```
Initialize the fragment before reading from it
`a_out` is filled from `a_fp32_local`, but that fragment is never written; `a` is ignored entirely. This leaves the store sourcing undefined data from uninitialized memory. Please load (or otherwise initialize) the fragment before using it and consume the input tensor.

Apply this diff to populate the fragment from `a`:
```diff
 for i, j in T.Parallel(BLOCK_MN, BLOCK_K):
     idx = i * BLOCK_K + j
-    a_out[pid_b, offs_m + i, offs_n + j] = a_fp32_local[idx // VEC_SIZE, idx % VEC_SIZE]
+    a_fp32_local[idx // VEC_SIZE, idx % VEC_SIZE] = T.cast(
+        a[pid_b, offs_m + i, offs_n + j], "float32"
+    )
+    a_out[pid_b, offs_m + i, offs_n + j] = a_fp32_local[idx // VEC_SIZE, idx % VEC_SIZE]
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-for i, j in T.Parallel(BLOCK_MN, BLOCK_K):
-    idx = i * BLOCK_K + j
-    a_out[pid_b, offs_m + i, offs_n + j] = a_fp32_local[idx // VEC_SIZE, idx % VEC_SIZE]
+for i, j in T.Parallel(BLOCK_MN, BLOCK_K):
+    idx = i * BLOCK_K + j
+    a_fp32_local[idx // VEC_SIZE, idx % VEC_SIZE] = T.cast(
+        a[pid_b, offs_m + i, offs_n + j], "float32"
+    )
+    a_out[pid_b, offs_m + i, offs_n + j] = a_fp32_local[idx // VEC_SIZE, idx % VEC_SIZE]
```
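Incidentally, the indexing in this suggestion is the same fused pattern the PR targets: `idx = i * BLOCK_K + j` flattens the loop pair and is then re-split by `VEC_SIZE`. A quick standalone sanity check in plain Python (the concrete extents `BLOCK_MN = 4`, `BLOCK_K = 8`, `VEC_SIZE = 8` are assumed for illustration):

```python
BLOCK_MN, BLOCK_K, VEC_SIZE = 4, 8, 8  # assumed example extents

seen = set()
for i in range(BLOCK_MN):
    for j in range(BLOCK_K):
        idx = i * BLOCK_K + j
        seen.add((idx // VEC_SIZE, idx % VEC_SIZE))

# Every (row, lane) slot of the fragment is hit exactly once, so the
# fused map is bijective and the fast inversion path can apply.
assert len(seen) == BLOCK_MN * BLOCK_K
```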
🤖 Prompt for AI Agents
testing/python/layout/test_tilelang_layout_fused_replicate.py around lines 31 to
34: the code stores from a_fp32_local into a_out but never initializes
a_fp32_local from the input tensor a, so stores read uninitialized memory; fix
by loading/initializing the fragment before the Parallel store — perform the
corresponding read from the input tensor a into a_fp32_local (e.g., a.load or an
explicit loop that reads a into the fragment using the same indexing/vec layout)
so the fragment is consumed and then use that initialized fragment when writing
to a_out.
…llel operations (tile-ai#1221)

* [Enhancement] Improve iterator handling in layout utilities and parallel operations
* Added a new function, DivideUnusedIterators, to detect per-iterator gaps in fused index expressions, enhancing the accuracy of unused iterator detection.
* Updated CompleteBufferFragment to prefer direct inversion for bijective index mappings and introduced a fallback mechanism for non-bijective cases, improving layout inversion robustness.
* Added a new test for layout inference in fused kernels to ensure correct compilation and execution without layout inversion failures.
* lint fix
Summary by CodeRabbit
New Features
Tests