[Parallel][Infer] Free-mode chooses minimal replication between buffer-based and PlanLoopPartition #1559
Conversation
* Enhanced the layout inference mechanism in ParallelOpNode to utilize two strategies, compute_loop_layout_from_buffer and PlanLoopPartition, selecting the one that minimizes replication while ensuring compatibility.
* Updated the logic to choose the best candidate layout based on replication size and containment checks.
* Refactored the HasKnownLayoutAnchor function to clarify its purpose in prioritizing buffer layouts.
* Added a new test case to validate the layout inference behavior, ensuring the correct fragments are generated in the output.

This update aims to optimize layout inference for parallel operations, improving performance and resource utilization.
👋 Hi! Thank you for contributing to the TileLang project. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
📝 Walkthrough

Adds a dual-candidate free-layout inference path in ParallelOpNode: derives one candidate from source buffers and another from the plan, validates both against fragment buffers, inserts replication guards when needed, and deterministically selects the preferred candidate by validation, containment, and replication heuristics. Also adjusts prioritization logic, adds a unit test, and enforces fragment layout annotations.
Sequence Diagram(s): (mermaid diagram not reproduced here)

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches: ❌ failed checks (1 warning) | ✅ passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
testing/python/issue/test_tilelang_issue_layout.py (1)
27-32: Effective source-level verification of fragment sizes.

The assertions checking for `float S_frag[4];`, `float A_frag[4];`, and `float B_frag[4];` directly validate that the layout inference chose the minimal replication (4 elements per thread from a 128×4 tensor with 128 threads).

Consider adding a negative assertion or a comment explaining what the incorrect output would look like (e.g., `float S_frag[512];` if over-replicated) to make the test's purpose clearer to future maintainers.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- src/op/parallel.cc
- src/transform/layout_inference.cc
- testing/python/issue/test_tilelang_issue_layout.py
- tilelang/language/annotations.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/issue/test_tilelang_issue_layout.py
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/issue/test_tilelang_issue_layout.py
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.
Applied to files:
src/transform/layout_inference.cc
🧬 Code graph analysis (3)
src/op/parallel.cc (1)
src/transform/loop_partition.cc (4)
`PlanLoopPartition` (253-257), `PlanLoopPartition` (253-254), `PlanLoopPartition` (259-265), `PlanLoopPartition` (259-260)
testing/python/issue/test_tilelang_issue_layout.py (3)
tilelang/language/allocate.py (1): `alloc_fragment` (72-85)
tilelang/language/annotations.py (1): `annotate_layout` (27-40)
tilelang/language/v2/builder.py (1): `source` (690-691)
tilelang/language/annotations.py (4)
tilelang/layout/fragment.py (1): `Fragment` (13-205)
tilelang/utils/language.py (1): `is_fragment` (104-115)
src/transform/layout_inference.cc (4): `buffer` (590-608), `buffer` (590-590), `buffer` (916-935), `buffer` (916-916)
tilelang/language/v2/builder.py (1): `buffer` (125-126)
🪛 Ruff (0.14.10)
testing/python/issue/test_tilelang_issue_layout.py
9-9: Unused function argument: A
(ARG001)
9-9: Unused function argument: B
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (9)
tilelang/language/annotations.py (2)
5-6: LGTM!

The imports correctly bring in the `Fragment` class and `is_fragment` utility needed for the new runtime validation of fragment buffer layouts.
31-32: LGTM! Good runtime guard for fragment-layout type consistency.

This assertion ensures that fragment buffers receive proper `Fragment` layouts, catching type mismatches early. The placement before the `isinstance(layout, Layout)` check is correct since `Fragment` is a subclass of `Layout`.

Consider wrapping this line for readability if the project enforces line length limits.
src/transform/layout_inference.cc (3)
450-461: Semantic change from "all buffers known" to "any buffer known".

The renamed `HasKnownLayoutAnchor` now returns `true` if any touched buffer has a known layout (previously all were required). This makes prioritization more aggressive: ops get enqueued at the front as soon as they have at least one layout anchor.

This aligns with the dual-candidate inference strategy, allowing earlier propagation from partial layout knowledge.
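For readers skimming the diff, the behavioral difference boils down to `std::all_of` versus `std::any_of`. A self-contained illustration; the `Buffer`/`LayoutMap` types here are simplified stand-ins, not the TVM classes:

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

using Buffer = std::string;                        // stand-in for TVM's Buffer
using LayoutMap = std::unordered_map<Buffer, int>; // value type is irrelevant here

// Old behavior: only prioritize when every touched buffer has a known layout.
bool AllBuffersKnown(const std::vector<Buffer>& touched, const LayoutMap& known) {
  return std::all_of(touched.begin(), touched.end(),
                     [&](const Buffer& b) { return known.count(b) > 0; });
}

// New behavior: a single known layout is enough to act as an inference anchor.
bool HasKnownLayoutAnchor(const std::vector<Buffer>& touched, const LayoutMap& known) {
  return std::any_of(touched.begin(), touched.end(),
                     [&](const Buffer& b) { return known.count(b) > 0; });
}
```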
475-479: LGTM!

The call site correctly uses the renamed `HasKnownLayoutAnchor` function.
604-607: Good refactor using `std::none_of`.

The use of `std::none_of` with `same_as` comparison is cleaner and more idiomatic than manual iteration for duplicate checking.
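For reference, the pattern reduced to a standalone sketch; `same_as` reference identity is emulated with pointer equality, and the types are placeholders:

```cpp
#include <algorithm>
#include <vector>

struct BufferNode {};
using Buffer = BufferNode*; // stand-in for TVM's reference type; same_as ~ pointer equality

// Append `buf` only if no existing element refers to the same object.
void PushIfAbsent(std::vector<Buffer>& buffers, Buffer buf) {
  if (std::none_of(buffers.begin(), buffers.end(),
                   [&](Buffer b) { return b == buf; })) {
    buffers.push_back(buf);
  }
}
```

src/op/parallel.cc (3)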
486-543: LGTM! Clean dual-candidate inference approach.

The code now computes two candidate layouts in free inference mode:

- `candidate_from_buffer`: derived from the existing buffer layout (always correct but may over-replicate)
- `candidate_from_plan`: derived from `PlanLoopPartition` (often smaller replication)

This sets up the selection logic that follows to minimize replication while preserving correctness.
546-588: Well-structured selection heuristic for choosing the optimal layout.

The containment-based selection logic is sound:

- If the buffer-based layout contains the plan-based one, the plan is smaller/more optimal: choose it.
- If the plan-based layout contains the buffer-based one, the buffer is more constrained: choose it.
- If neither contains the other, compare replication extents and prefer the smaller.
- Fall back to the buffer-based layout when uncertain (safety).

One minor observation: when both `buf_contains_plan` and `plan_contains_buf` are true (mutual containment, i.e., equivalent layouts), the code falls through to the replication comparison at line 571, which is correct behavior.
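A compact, self-contained sketch of that selection order, under the assumption that containment is available as a predicate; `Fragment` and `Contains` below are simplified stand-ins, not the TileLang types:

```cpp
#include <cstdint>

// Stand-in: only the replication extent matters for this sketch.
struct Fragment {
  int64_t replicate_extent; // how many threads observe the same elements (>= 1)
};

// Placeholder for the real ProveFragmentContains proof: treat `big` as
// containing `small` when its replication extent is a multiple of `small`'s.
bool Contains(const Fragment& big, const Fragment& small) {
  return big.replicate_extent >= small.replicate_extent &&
         big.replicate_extent % small.replicate_extent == 0;
}

// Mirrors the selection order described in the review comment above.
const Fragment& ChooseBestCandidate(const Fragment& from_buffer,
                                    const Fragment& from_plan) {
  const bool buf_contains_plan = Contains(from_buffer, from_plan);
  const bool plan_contains_buf = Contains(from_plan, from_buffer);
  if (buf_contains_plan && !plan_contains_buf) return from_plan;   // plan strictly smaller
  if (plan_contains_buf && !buf_contains_plan) return from_buffer; // buffer more constrained
  // Mutual containment (equivalent layouts) or incomparable: compare replication.
  if (from_plan.replicate_extent < from_buffer.replicate_extent) return from_plan;
  return from_buffer; // safe fallback: the buffer-based candidate is always correct
}
```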
656-658: The early return is intentional and safe when both candidates are undefined.

When `read_source_buffer` is undefined and `allow_layout_propgate` is false, neither candidate is computed. The code then returns `{}` at line 657, which aligns with the documented behavior ("Returns empty layout map if loop_layout_ is not already defined").

This early return is protective: it exits before downstream code (line 660 onwards) that calls `loop_layout_->DetectInjective()` and other methods requiring `loop_layout_` to be defined. The function documentation explicitly acknowledges this case, confirming it is intentional.

testing/python/issue/test_tilelang_issue_layout.py (1)
6-24: Good test coverage for dual-candidate layout inference.

This test effectively validates that free inference selects the layout with minimal replication. The kernel structure with an annotated `A_frag` and inferred `B_frag`/`S_frag` exercises the selection heuristic.

Regarding the static analysis hint about unused `A` and `B` parameters: these are intentionally unused since the test focuses on fragment allocation behavior rather than actual computation. Consider adding a brief comment or using an underscore prefix (`_A`, `_B`) to signal intent if your style guide requires it.
Actionable comments posted: 1
🧹 Nitpick comments (4)
src/op/parallel.cc (4)
393-393: Consider removing this comment.

This comment doesn't add meaningful context since the refactoring is evident from the code structure. Removing it would clean up the codebase.
681-693: Consider whether the coalesced_width check should be fatal.

The code enforces that `vector_size % coalesced_width == 0` with `LOG(FATAL)` (lines 686-687). This is a strict requirement that will crash if violated. Consider whether:

- this should be a validation error with better diagnostics about why the constraint failed,
- the code could adjust `vector_size` to be exactly `coalesced_width` when they don't divide evenly (as you already do on line 689), or
- the current behavior is intentional for correctness.

If the strict check is intentional for correctness, consider adding a comment explaining why violating this constraint would cause incorrect behavior.
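For illustration, here is one way the non-fatal alternative could look as a standalone sketch; the function name and the fall-back policy are assumptions, not the current implementation:

```cpp
#include <iostream>

// Reconcile the auto-detected vector size with a user-annotated coalesced
// width. Rather than aborting on a mismatch, warn and fall back to the
// annotated width so compilation proceeds with a correct (if slower) layout.
int ReconcileVectorSize(int vector_size, int coalesced_width) {
  if (coalesced_width <= 0) return vector_size; // no annotation present
  if (vector_size % coalesced_width != 0) {
    std::cerr << "warning: coalesced_width " << coalesced_width
              << " does not divide the detected vector size " << vector_size
              << "; using the annotated width directly\n";
  }
  return coalesced_width; // the annotation wins either way, as in the current code
}
```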
738-740: Consider improving the error message for users.

The `ICHECK` will terminate execution with a somewhat generic message. Consider:

- adding more context about why this combination is invalid,
- suggesting what the user should do to fix their code (e.g., "move shared/global stores outside the parallel loop" or "ensure fragment stores don't have cross-thread replication"), and
- including information about the specific buffers involved to help debugging.
768-772: Consider clarifying the containment check lambda.

The `contains` lambda comment states "contains(A, B) means: for any loop index, the threads that access B's elements are a subset of those that access A's elements." However, the implementation calls `ProveFragmentContains(small, big, ...)` where the first argument is `small` and the second is `big`, which can be confusing given that the parameter names are reversed (`big`, `small`).

Consider either:

- reversing the parameter order to match the call: `contains(small, big)` with `ProveFragmentContains(small, big, ...)`, or
- making the comment even more explicit about the parameter/argument mapping.

This would reduce cognitive load when understanding the containment checks at lines 793-794 and 801-804.
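A sketch of the first option with placeholder types; only the parameter ordering is the point here, everything else is illustrative:

```cpp
struct Fragment {};
struct Analyzer {};

// Stand-in for the real helper: true if every thread touching `small`'s
// elements also touches `big`'s elements.
bool ProveFragmentContains(const Fragment& small, const Fragment& big,
                           Analyzer& analyzer) {
  (void)small; (void)big; (void)analyzer;
  return true; // placeholder body
}

void Example(const Fragment& candidate_from_plan,
             const Fragment& candidate_from_buffer, Analyzer& analyzer) {
  // The lambda's parameter order now mirrors ProveFragmentContains(small, big, ...),
  // so the declaration and its call sites read in the same direction.
  auto contains = [&](const Fragment& small, const Fragment& big) {
    return ProveFragmentContains(small, big, analyzer);
  };
  if (contains(candidate_from_plan, candidate_from_buffer)) {
    // the plan-based layout is the contained (smaller) one; prefer it
  }
}
```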
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/op/parallel.cc
- src/op/parallel.h
🧰 Additional context used
🧬 Code graph analysis (1)
src/op/parallel.h (1)
src/op/parallel.cc (10)
`ValidateCandidateAgainstFragments` (597-611), (597-598); `ChooseBestCandidate` (752-813), (753-755); `ComputeLoopLayoutFromBuffer` (613-658), (614-615); `ComputePlanCandidate` (660-701), (660-660); `BuildReplicationGuardsIfNeeded` (703-751), (703-708)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (6)
src/op/parallel.h (1)
104-126: LGTM! Well-structured private helper declarations.The new private method declarations are well-documented, appropriately scoped, and maintain const correctness. The method names clearly convey their purpose in the two-candidate inference flow, and keeping them private preserves the public API surface.
src/op/parallel.cc (5)
401-402: Good refactoring!

Extracting the buffer-based layout computation into a named helper method improves readability and maintainability while preserving the original behavior.
597-611: LGTM! Clean validation logic.

The validation method correctly checks candidate compatibility against all known fragments, using early returns for efficiency and avoiding exceptions for control flow as documented.
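For readers unfamiliar with the file, the described shape is roughly the following; a standalone sketch with placeholder types, not the actual method:

```cpp
#include <vector>

struct Fragment { int tag = 0; };

// Placeholder for the real compatibility proof between a candidate loop
// layout and a fragment buffer's known layout.
bool CompatibleWith(const Fragment& candidate, const Fragment& fragment) {
  return candidate.tag == fragment.tag; // stand-in criterion
}

// Early-return validation; no exceptions used for control flow.
bool ValidateCandidateAgainstFragments(const Fragment& candidate,
                                       const std::vector<Fragment>& fragments) {
  for (const Fragment& frag : fragments) {
    if (!CompatibleWith(candidate, frag)) {
      return false; // the first incompatibility rejects the candidate
    }
  }
  return true;
}
```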
613-658: Excellent refactoring with robust error handling.

The buffer-based candidate computation is well-structured, with:
- Clear separation of common vs. non-common access patterns
- Validation against inner variables to catch invalid layouts early
- Try-catch block that enriches TVM errors with actionable context
- Good diagnostic logging
752-813: Well-designed candidate selection logic.

The selection strategy is thoughtful and well-documented:
- Validates both candidates first
- Uses containment relationships to prefer the more specific layout
- Falls back to replication extent comparison
- Maintains deterministic behavior
The logic correctly handles the goal of minimizing replication while ensuring correctness.
787-791: Error handling is present, but clarify that it occurs through injectivity validation rather than explicit candidate validity checks.

When both candidates are invalid (lines 787-791), the function returns `candidate_from_buffer`. The calling code does validate this at lines 479-486 via an injectivity check: if `loop_layout_->DetectInjective()` finds errors, it throws `LoopLayoutInjectiveException` with a detailed message including the layout output and AST. However, this check validates the selected candidate's injectivity, not whether both internal validations failed; consider whether the error message should explicitly indicate when both candidates were rejected internally, for better debugging.
The diff hunk under review (src/op/parallel.cc; removed and newly added lines are interleaved as rendered by the diff view):

```cpp
// In free inference, try two mechanisms and prefer the one that
// minimizes replication while remaining compatible:
// 1) compute_loop_layout_from_buffer (always correct but may
//    over-replicate)
// 2) PlanLoopPartition (often smaller replication)
Fragment candidate_from_buffer;
Fragment candidate_from_plan;

if (read_source_buffer.defined() && allow_layout_propgate) {
  loop_layout_ = compute_loop_layout_from_buffer(read_source_buffer);
  candidate_from_buffer =
      ComputeLoopLayoutFromBuffer(read_source_buffer, T);
}

if (!loop_layout_.defined()) {
  // No source buffer available, use free mode inference
  // Vectorize Size must be aware of the buffer_remap
  // As the pass will do post processing to the layout
  auto maybe_remapped_root_ =
      IfBufferRemapLoopGenerator::run(root_, T.buffer_remap, T.layout_map);
  int vector_size = GetVectorizeSize(maybe_remapped_root_, T.analyzer);
  DLOG(INFO) << "[PlanLoopPartition] vector_size = " << vector_size << '\n';

  PrimExpr loop_total_size = 1;
  for (Stmt l = root_; l.as<For>().has_value();
       l = l.as<For>().value()->body)
    loop_total_size = loop_total_size * l.as<For>().value()->extent;
  DLOG(INFO) << "[PlanLoopPartition] loop_total_size = " << loop_total_size
             << '\n';
  while (!analyzer_.CanProve(
             floormod(loop_total_size,
                      T.thread_bounds->extent * vector_size) == 0) &&
         vector_size > 1)
    vector_size /= 2;
  DLOG(INFO) << "[PlanLoopPartition] after adjust: vector_size = "
             << vector_size << '\n';

  // Check if coalesced_width is defined
  if (auto coalesced_width =
          root_->annotations.Get(attr::kCoalescedWidth)) {
    if (const auto *imm = coalesced_width->as<IntImmNode>()) {
      int expected = imm->value;
      // Verify that vector_size is divisible by expected
      if (vector_size % expected != 0) {
        LOG(FATAL) << "Vector size " << vector_size
                   << " is not divisible by coalesced width " << expected;
      }
      vector_size = expected;
    } else {
      LOG(FATAL) << "coalesced_width should be an IntImmNode.";
    }
  }
  DLOG(INFO) << "[PlanLoopPartition] root_ = " << root_
             << " ############# vector_size = " << vector_size
             << ", thread_bounds = " << T.thread_bounds << '\n';
  loop_layout_ = PlanLoopPartition(root_, vector_size, T.thread_bounds);
  DLOG(INFO) << "[PlanLoopPartition] loop_layout_ = "
             << loop_layout_->DebugOutput() << '\n';
// try to infer loop layout with two mechanisms and choose the best one
{
  candidate_from_plan = ComputePlanCandidate(T);
}

// Lambda that guards replicated accesses:
// - When a loop layout replicates a fragment buffer (rep > 1), each thread
//   observes the same fragment elements. Blindly storing to shared/global
//   memory in that case would add the same value multiple times.
// - We therefore restrict the store so that only the replica with rep == 0
//   performs the update (e.g. global[i] += fragment[i] only fires once).
// Trigger conditions for this guard:
// 1) There are cross-thread stores targeting shared/global memory (no
//    fragment stores in this branch; atomic_add and similar remain TODO).
// 2) The loop layout replicate extent is greater than 1, inferred from the
//    thread bounds captured in the layout.

[this, &store_shared_global_buffers, &store_fragment_buffers,
 &has_cross_thread_access, &const_index_fragment_buffer, &T]() {
  if (is_one(loop_layout_->ReplicateExtent()))
    return;
  if (!has_cross_thread_access)
    return;

  if (!store_fragment_buffers.empty()) {
    // Iterate replicated fragment stores: when the fragment index is a
    // constant (e.g. fragment[0]), every thread touches the same slot, so
    // the rep == 0 predicate is unnecessary. Example:
    //   for i in T.Parallel(...):
    //     shared[i] = ...
    //     fragment[0] = ...
    bool replicate_is_from_dynamic_index_fragment = false;
    for (const auto &fragment : store_fragment_buffers) {
      if (!T.layout_map.count(fragment)) {
        continue;
      }

      auto fragment_layout = T.layout_map[fragment].as<Fragment>().value();
      if (is_one(fragment_layout->ReplicateExtent()))
        continue;

      if (analyzer_.CanProveEqual(fragment_layout->ReplicateExtent(),
                                  loop_layout_->ReplicateExtent()))
        continue;
      if (std::find(const_index_fragment_buffer.begin(),
                    const_index_fragment_buffer.end(),
                    fragment) == const_index_fragment_buffer.end()) {
        replicate_is_from_dynamic_index_fragment = true;
      }
    }

    if (!replicate_is_from_dynamic_index_fragment)
      return;
// Choose the best candidate:
if (candidate_from_buffer.defined() && candidate_from_plan.defined()) {
  loop_layout_ =
      ChooseBestCandidate(candidate_from_buffer, candidate_from_plan, T);
} else if (candidate_from_plan.defined()) {
  loop_layout_ = candidate_from_plan;
  DLOG(INFO) << "[FreeInfer] only PlanLoopPartition available, choose it.";
} else if (candidate_from_buffer.defined()) {
  loop_layout_ = candidate_from_buffer;
  DLOG(INFO)
      << "[FreeInfer] only compute_from_buffer available, choose it.";
}

    ICHECK(store_shared_global_buffers.empty())
        << "Invalid layout: cannot have both fragment and shared store "
           "buffers "
           "in replicated loop layout.";
    return;
  } else {
    // Now, store is global or shared
    // or T.call_extern or T.call_intrin ...
    auto inv = loop_layout_->Inverse();
    Array<PrimExpr> fwd;
    for (size_t i = 0; i < loop_layout_->OutputDim(); i++)
      fwd.push_back(0);
    fwd.push_back(InputPlaceholder(0) - T.thread_bounds->min);
    auto rep = inv->Forward(fwd).back();
    AddPredicate(EQ(rep, 0));
  }
}();
BuildReplicationGuardsIfNeeded(
    T, store_shared_global_buffers, store_fragment_buffers,
    has_cross_thread_access, const_index_fragment_buffer);
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that loop_layout_ is always defined before the injectivity check
rg -n -A5 -B5 "loop_layout_.*DetectInjective" src/op/parallel.ccRepository: tile-ai/tilelang
Length of output: 505
🏁 Script executed:
#!/bin/bash
# Check all assignments to loop_layout_ in the function
rg -n "loop_layout_\s*=" src/op/parallel.cc | head -20Repository: tile-ai/tilelang
Length of output: 248
🏁 Script executed:
#!/bin/bash
# Check if loop_layout_ is initialized/declared at function start
# Search for the function containing lines 443-475
rg -n -B50 "ComputeLoopLayoutFromBuffer" src/op/parallel.cc | grep -E "(void|Fragment|auto).*loop_layout_|loop_layout_\s*=" | head -10Repository: tile-ai/tilelang
Length of output: 129
🏁 Script executed:
#!/bin/bash
# Look at the broader context around lines 443-475 to understand function scope
sed -n '400,500p' src/op/parallel.cc | cat -nRepository: tile-ai/tilelang
Length of output: 4834
Critical issue: `loop_layout_` may be uninitialized when both candidates are undefined.

In the free inference branch, if `read_source_buffer` is undefined or `allow_layout_propgate` is false, and `ComputePlanCandidate(T)` returns an undefined fragment, then `loop_layout_` is never assigned. The code then performs a direct pointer dereference at line 480 (`loop_layout_->DetectInjective()`) without checking whether `loop_layout_` is defined, resulting in undefined behavior.

Add a fallback case (e.g., an error or default assignment) when both candidates are undefined, or verify that `ComputePlanCandidate()` is guaranteed to return a defined fragment and add an assertion.
🤖 Prompt for AI Agents
In src/op/parallel.cc around lines 443-475, loop_layout_ can remain
uninitialized when both candidate_from_buffer and candidate_from_plan are
undefined, leading to a crash when the code later dereferences loop_layout_; fix
by adding a fallback branch for the case where neither candidate is defined:
either assign a safe default Fragment to loop_layout_, or log an error and
early-return/throw so no dereference occurs, or add an explicit assertion that
ComputePlanCandidate must return a defined fragment; implement one of these
options and ensure subsequent code checks loop_layout_ before dereference.
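One possible shape for that fallback, sketched against the surrounding code rather than as a standalone program; the message text and the choice of `LOG(FATAL)` are assumptions, not the actual fix:

```cpp
// Ensure loop_layout_ is always assigned before the later
// loop_layout_->DetectInjective() call dereferences it.
if (candidate_from_buffer.defined() && candidate_from_plan.defined()) {
  loop_layout_ =
      ChooseBestCandidate(candidate_from_buffer, candidate_from_plan, T);
} else if (candidate_from_plan.defined()) {
  loop_layout_ = candidate_from_plan;
} else if (candidate_from_buffer.defined()) {
  loop_layout_ = candidate_from_buffer;
} else {
  // Fallback: fail loudly instead of dereferencing an undefined layout later.
  LOG(FATAL) << "InferLayout(free): no loop layout candidate could be derived";
}
ICHECK(loop_layout_.defined());
```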
@regression-perf

Performance Regression Test Report

Triggered by: @LeiWang1999

Results

Artifacts
Title

Summary

In `ParallelOpNode::InferLayout` (free mode), generate two loop layout candidates:

- Run `ProveFragmentContains` on loop vars to check mutual containment.
- Compare `ReplicateExtent`: pick the smaller one when provable; otherwise fall back to buffer-based (safest).
- Validate each candidate with `ProveFragmentContains` against all fragments in the loop.

Why

`compute_loop_layout_from_buffer` is guaranteed correct but can inflate `replicate` and hurt performance. `PlanLoopPartition` can often produce a layout with fewer replicas without sacrificing correctness. This change automatically prefers that smaller replication when containment holds.

Changes

- `annotate_layout` must use `Fragment` objects (type guard)

Impact
Notes
Summary by CodeRabbit

- New Features
- Bug Fixes
- Tests
- Chores