
[Parallel][Infer] Free-mode chooses minimal replication between buffer-based and PlanLoopPartition#1559

Merged
LeiWang1999 merged 5 commits into tile-ai:main from LeiWang1999:layout_1229
Dec 30, 2025

Conversation

@LeiWang1999 (Member) commented Dec 29, 2025

Title

  • Minimize replication in free layout inference by comparing compute-from-buffer vs PlanLoopPartition and choosing the smallest valid replication.

Summary

  • In ParallelOpNode::InferLayout (free mode), generate two loop layout candidates:
    • compute_loop_layout_from_buffer: always correct; may over-replicate
    • PlanLoopPartition: often yields smaller replication
  • Choose the layout that minimizes replication while ensuring correctness:
    • Use ProveFragmentContains on loop vars to check mutual containment
    • If buffer-based contains plan-based → pick PlanLoopPartition (smaller rep)
    • If plan-based contains buffer-based → pick buffer-based
    • If neither contains the other, compare ReplicateExtent; pick smaller provably; otherwise fall back to buffer-based (safest)
  • Preserve all existing guards/checks:
    • Injectivity check on chosen loop layout
    • ProveFragmentContains against all fragments in the loop
    • Predicates for replicated cross-thread stores to shared/global
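The selection rule above can be sketched in Python (illustrative only; `contains` and `rep_extent` are hypothetical stand-ins for the C++ `ProveFragmentContains` and `Fragment::ReplicateExtent`):

```python
def choose_best_candidate(buffer_cand, plan_cand, contains, rep_extent):
    """Pick the candidate with the smallest valid replication.

    contains(a, b): True when the threads covering `a` are a subset of
    those covering `b` (models ProveFragmentContains).
    rep_extent(c): replication extent of candidate `c`.
    """
    buf_contains_plan = contains(plan_cand, buffer_cand)
    plan_contains_buf = contains(buffer_cand, plan_cand)
    if buf_contains_plan and not plan_contains_buf:
        return plan_cand        # plan-based is smaller but still covered
    if plan_contains_buf and not buf_contains_plan:
        return buffer_cand      # buffer-based is the tighter layout
    # Mutual containment or no containment: compare replication extents.
    if rep_extent(plan_cand) < rep_extent(buffer_cand):
        return plan_cand        # provably smaller replication
    return buffer_cand          # safest fallback
```

This is only a model of the decision order described in the summary, not the actual implementation in src/op/parallel.cc.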

Why

  • compute_loop_layout_from_buffer is guaranteed correct but can inflate replication and hurt performance.
  • Some cases can use PlanLoopPartition with fewer replicas without sacrificing correctness. This change automatically prefers that smaller replication when containment holds.

Changes

  • src/op/parallel.cc:
    • Free-mode: try both candidates, then select by containment and replicate extent
    • Keep vectorization + coalesced width handling and predicate guard logic unchanged
  • src/transform/layout_inference.cc:
    • Prioritize queueing ops that touch any buffer with known layout (previously required all) to improve propagation order
    • Simplify touched-buffer dedup logic
  • tilelang/language/annotations.py:
    • Enforce that fragment buffers annotated via annotate_layout must use Fragment objects (type guard)
  • testing/python/issue/test_tilelang_issue_layout.py:
    • Add test ensuring free inference picks the smallest replication when feasible

Impact

  • Only affects free layout inference path for parallel loops
  • Expected to reduce redundant replication and improve performance in qualifying cases
  • Keeps strict/common paths and all safety checks intact

Notes

  • Logged selection decisions to help diagnose layout choices
  • Happy to add more targeted tests if desired

Summary by CodeRabbit

  • New Features

    • Layout inference now evaluates two candidate layouts and deterministically selects the best (preferring contained candidates or those with smaller replication), improving memory efficiency.
  • Bug Fixes

    • Prioritization logic refined to trigger sooner when any known layout anchor exists; improved candidate validation and replication-guard insertion for safer parallel access.
  • Tests

    • New test covering the free-layout inference selection behavior.
  • Chores

    • Added runtime validation to ensure fragment buffers carry fragment layouts.


* Enhanced the layout inference mechanism in ParallelOpNode to utilize two strategies: compute_loop_layout_from_buffer and PlanLoopPartition, selecting the one that minimizes replication while ensuring compatibility.
* Updated the logic to choose the best candidate layout based on replication size and containment checks.
* Refactored the HasKnownLayoutAnchor function to clarify its purpose in prioritizing buffer layouts.
* Added a new test case to validate the layout inference behavior, ensuring the correct fragments are generated in the output.

This update aims to optimize layout inference for parallel operations, improving performance and resource utilization.
@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai coderabbitai bot (Contributor) commented Dec 29, 2025

📝 Walkthrough

Adds a dual-candidate free-layout inference path in ParallelOpNode: derives one candidate from source buffers and another from the plan, validates both against fragment buffers, inserts replication guards when needed, and deterministically selects the preferred candidate by validation, containment, and replication heuristics. Also adjusts prioritization logic, adds a unit test, and enforces fragment layout annotations.

Changes

  • Parallel op inference (src/op/parallel.cc, src/op/parallel.h):
    Introduces private helpers to compute a buffer-derived fragment (ComputeLoopLayoutFromBuffer), compute a plan-derived fragment (ComputePlanCandidate), validate a candidate against known fragment buffers (ValidateCandidateAgainstFragments), insert replication guards (BuildReplicationGuardsIfNeeded), and choose between candidates (ChooseBestCandidate). Reworks free-mode inference to compute both candidates and select deterministically via validation, containment, and replication heuristics.
  • Layout inference prioritization (src/transform/layout_inference.cc):
    Replaces ShouldPrioritize with HasKnownLayoutAnchor (now true if any touched buffer has a known layout) and updates enqueue logic to use front-priority when an anchor exists; deduplicates touched-buffer recording with std::none_of.
  • TileLang annotations (tilelang/language/annotations.py):
    Imports Fragment and is_fragment; asserts that buffers marked as fragments are given Fragment layouts in annotate_layout.
  • Tests (testing/python/issue/test_tilelang_issue_layout.py):
    Adds a JITted TileLang kernel exercising free-inference candidate selection and a test asserting the presence of expected fragment buffer declarations in generated kernel source.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Scheduler as LayoutInfer Scheduler
    participant Parallel as ParallelOpNode
    participant Buffer as Source Buffers
    participant Planner as PlanPartitioner
    participant Validator as Fragment Validator
    participant Chooser as Candidate Chooser
    participant GuardBuilder as Replication Guard Inserter

    Note over Scheduler,Parallel: Free-mode layout inference start
    Scheduler->>Parallel: infer_layout(T)
    Parallel->>Buffer: ComputeLoopLayoutFromBuffer(buffer, T)
    Buffer-->>Parallel: candidate_from_buffer
    Parallel->>Planner: ComputePlanCandidate(T)
    Planner-->>Parallel: candidate_from_plan
    Parallel->>Validator: ValidateCandidateAgainstFragments(candidate_from_buffer, T)
    Validator-->>Parallel: valid_buffer_candidate?
    Parallel->>Validator: ValidateCandidateAgainstFragments(candidate_from_plan, T)
    Validator-->>Parallel: valid_plan_candidate?
    Parallel->>Chooser: ChooseBestCandidate(buffer_candidate, plan_candidate, T)
    Chooser-->>Parallel: selected_candidate
    Parallel->>GuardBuilder: BuildReplicationGuardsIfNeeded(T, ...)
    GuardBuilder-->>Parallel: guards_added?
    Parallel->>Scheduler: return selected layout / guards
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰 Two fragments hop, two paths to try,
I sniff the buffers and consult the sky,
Validate, contain, and replication small—
I pick the best, guards rise if they sprawl,
Kernel sings, the layout stands tall. 🎉

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title clearly and accurately describes the main change: free-mode layout inference now chooses between two candidates based on minimal replication.


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
testing/python/issue/test_tilelang_issue_layout.py (1)

27-32: Effective source-level verification of fragment sizes.

The assertions checking for float S_frag[4];, float A_frag[4];, and float B_frag[4]; directly validate that the layout inference chose the minimal replication (4 elements per thread from a 128×4 tensor with 128 threads).

Consider adding a negative assertion or comment explaining what the incorrect output would look like (e.g., float S_frag[512]; if over-replicated) to make the test's purpose clearer to future maintainers.
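The fragment-size arithmetic behind these assertions can be checked directly: a 128×4 tile distributed over 128 threads with no replication leaves 4 elements per thread, while a fully replicated layout would give each thread the whole tile.

```python
rows, cols, threads = 128, 4, 128
total = rows * cols                    # 512 elements in the tile
per_thread_minimal = total // threads  # no replication: 4 per thread
per_thread_replicated = total          # full replication: each thread holds all 512

assert per_thread_minimal == 4         # matches `float S_frag[4];`
assert per_thread_replicated == 512    # the over-replicated `float S_frag[512];`
```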

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 193eff1 and fed3df1.

📒 Files selected for processing (4)
  • src/op/parallel.cc
  • src/transform/layout_inference.cc
  • testing/python/issue/test_tilelang_issue_layout.py
  • tilelang/language/annotations.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/issue/test_tilelang_issue_layout.py
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/issue/test_tilelang_issue_layout.py
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/layout_inference.cc
🧬 Code graph analysis (3)
src/op/parallel.cc (1)
src/transform/loop_partition.cc (4)
  • PlanLoopPartition (253-257)
  • PlanLoopPartition (253-254)
  • PlanLoopPartition (259-265)
  • PlanLoopPartition (259-260)
testing/python/issue/test_tilelang_issue_layout.py (3)
tilelang/language/allocate.py (1)
  • alloc_fragment (72-85)
tilelang/language/annotations.py (1)
  • annotate_layout (27-40)
tilelang/language/v2/builder.py (1)
  • source (690-691)
tilelang/language/annotations.py (4)
tilelang/layout/fragment.py (1)
  • Fragment (13-205)
tilelang/utils/language.py (1)
  • is_fragment (104-115)
src/transform/layout_inference.cc (4)
  • buffer (590-608)
  • buffer (590-590)
  • buffer (916-935)
  • buffer (916-916)
tilelang/language/v2/builder.py (1)
  • buffer (125-126)
🪛 Ruff (0.14.10)
testing/python/issue/test_tilelang_issue_layout.py

9-9: Unused function argument: A

(ARG001)


9-9: Unused function argument: B

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (9)
tilelang/language/annotations.py (2)

5-6: LGTM!

The imports correctly bring in the Fragment class and is_fragment utility needed for the new runtime validation of fragment buffer layouts.


31-32: LGTM! Good runtime guard for fragment-layout type consistency.

This assertion ensures that fragment buffers receive proper Fragment layouts, catching type mismatches early. The placement before the isinstance(layout, Layout) check is correct since Fragment is a subclass of Layout.

Consider wrapping this line for readability if the project enforces line length limits.
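A minimal sketch of this kind of guard, with stand-in `Layout`/`Fragment` classes rather than the real tilelang types:

```python
class Layout:
    pass

class Fragment(Layout):  # Fragment subclasses Layout, as noted above
    pass

def annotate_layout_guard(buffer_is_fragment: bool, layout: Layout) -> None:
    # Fragment buffers must be given Fragment layouts; a plain Layout
    # would silently bypass fragment-specific inference.
    if buffer_is_fragment:
        assert isinstance(layout, Fragment), (
            "fragment buffers must be annotated with a Fragment layout")
```

Because `Fragment` is a subclass of `Layout`, a later `isinstance(layout, Layout)` check still passes for fragments, which is why this assertion must run first.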

src/transform/layout_inference.cc (3)

450-461: Semantic change from "all buffers known" to "any buffer known".

The renamed HasKnownLayoutAnchor now returns true if any touched buffer has a known layout (previously required all). This makes prioritization more aggressive—ops get enqueued at the front as soon as they have at least one layout anchor.

This aligns with the dual-candidate inference strategy, allowing earlier propagation from partial layout knowledge.
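The prioritization change amounts to switching the quantifier, roughly (hypothetical Python model, not the C++ implementation):

```python
def should_prioritize_old(touched, known):
    # Previous behavior: every touched buffer needed a known layout.
    return all(b in known for b in touched)

def has_known_layout_anchor(touched, known):
    # New behavior: one known layout is enough to front-queue the op.
    return any(b in known for b in touched)
```

With `touched = ["A", "B"]` and only `A` known, the old predicate is false while the new one is true, so the op is enqueued at the front earlier.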


475-479: LGTM!

The call site correctly uses the renamed HasKnownLayoutAnchor function.


604-607: Good refactor using std::none_of.

The use of std::none_of with same_as comparison is cleaner and more idiomatic than manual iteration for duplicate checking.

src/op/parallel.cc (3)

486-543: LGTM! Clean dual-candidate inference approach.

The code now computes two candidate layouts in free inference mode:

  1. candidate_from_buffer — derived from existing buffer layout (always correct but may over-replicate)
  2. candidate_from_plan — derived from PlanLoopPartition (often smaller replication)

This sets up the selection logic that follows to minimize replication while preserving correctness.


546-588: Well-structured selection heuristic for choosing optimal layout.

The containment-based selection logic is sound:

  • If buffer-based contains plan-based → plan is smaller/more optimal, choose it
  • If plan-based contains buffer-based → buffer is more constrained, choose it
  • If neither contains the other → compare replication extents, prefer smaller
  • Fallback to buffer-based when uncertain (safety)

One minor observation: when both buf_contains_plan and plan_contains_buf are true (mutual containment, i.e., equivalent layouts), the code falls through to the replication comparison at line 571, which is correct behavior.


656-658: The early return is intentional and safe when both candidates are undefined.

When read_source_buffer is undefined and allow_layout_propgate is false, neither candidate is computed. The code then returns {} at line 657, which aligns with the documented behavior ("Returns empty layout map if loop_layout_ is not already defined").

This early return is protective—it exits before downstream code (line 660 onwards) that calls loop_layout_->DetectInjective() and other methods requiring loop_layout_ to be defined. The function documentation explicitly acknowledges this case, confirming it is intentional.

testing/python/issue/test_tilelang_issue_layout.py (1)

6-24: Good test coverage for dual-candidate layout inference.

This test effectively validates that free inference selects the layout with minimal replication. The kernel structure with annotated A_frag and inferred B_frag/S_frag exercises the selection heuristic.

Regarding the static analysis hint about unused A and B parameters: these are intentionally unused since the test focuses on fragment allocation behavior rather than actual computation. Consider adding a brief comment or using underscore prefix (_A, _B) to signal intent if your style guide requires it.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
src/op/parallel.cc (4)

393-393: Consider removing this comment.

This comment doesn't add meaningful context since the refactoring is evident from the code structure. Removing it would clean up the codebase.


681-693: Consider whether the coalesced_width check should be fatal.

The code enforces that vector_size % coalesced_width == 0 with LOG(FATAL) (line 686-687). This is a strict requirement that will crash if violated. Consider whether:

  1. This should be a validation error with better diagnostics about why the constraint failed
  2. The code could adjust vector_size to be exactly coalesced_width when they don't divide evenly (as you already do on line 689)
  3. The current behavior is intentional for correctness

If the strict check is intentional for correctness, consider adding a comment explaining why violating this constraint would cause incorrect behavior.
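The adjustment loop and the strict divisibility check can be modeled as follows (hypothetical Python sketch of the free-mode logic; names are illustrative):

```python
def adjust_vector_size(vector_size, loop_total_size, threads, coalesced_width=None):
    # Halve vector_size until the loop splits evenly into
    # threads * vector_size chunks (mirrors the adjustment loop).
    while vector_size > 1 and loop_total_size % (threads * vector_size) != 0:
        vector_size //= 2
    if coalesced_width is not None:
        # The strict divisibility requirement discussed above.
        if vector_size % coalesced_width != 0:
            raise ValueError(
                f"vector size {vector_size} not divisible by "
                f"coalesced width {coalesced_width}")
        vector_size = coalesced_width
    return vector_size
```

For example, a 512-element loop over 128 threads cannot sustain vector width 8 (128 × 8 = 1024 > 512), so the width halves to 4.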


738-740: Consider improving the error message for users.

The ICHECK will terminate execution with a somewhat generic message. Consider:

  1. Adding more context about why this combination is invalid
  2. Suggesting what the user should do to fix their code (e.g., "move shared/global stores outside the parallel loop" or "ensure fragment stores don't have cross-thread replication")
  3. Including information about the specific buffers involved to help debugging

768-772: Consider clarifying the containment check lambda.

The contains lambda comment states "contains(A, B) means: for any loop index, the threads that access B's elements are a subset of those that access A's elements." However, the implementation calls ProveFragmentContains(small, big, ...) where the first argument is small and second is big, which can be confusing given the parameter names are reversed (big, small).

Consider either:

  1. Reversing the parameter order to match the call: contains(small, big) with ProveFragmentContains(small, big, ...)
  2. Making the comment even more explicit about the parameter/argument mapping

This would reduce cognitive load when understanding the containment checks at lines 793-794 and 801-804.
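The direction can be pinned down with a toy model where a fragment is just the set of threads that touch it (hypothetical; `ProveFragmentContains` is modeled as subset inclusion):

```python
def prove_fragment_contains(small, big):
    # Models ProveFragmentContains(small, big): every thread accessing
    # `small` also accesses `big`, i.e. small's thread set is a subset
    # of big's.
    return small <= big

def contains(big, small):
    # The reviewed lambda: contains(A, B) means "A covers B", so the
    # arguments are passed to the prover in reversed order.
    return prove_fragment_contains(small, big)
```

Renaming the lambda parameters to match the prover's `(small, big)` order, as suggested, would remove this reversal entirely.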

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fed3df1 and 9f3af2e.

📒 Files selected for processing (2)
  • src/op/parallel.cc
  • src/op/parallel.h
🧰 Additional context used
🧬 Code graph analysis (1)
src/op/parallel.h (1)
src/op/parallel.cc (10)
  • ValidateCandidateAgainstFragments (597-611)
  • ValidateCandidateAgainstFragments (597-598)
  • ChooseBestCandidate (752-813)
  • ChooseBestCandidate (753-755)
  • ComputeLoopLayoutFromBuffer (613-658)
  • ComputeLoopLayoutFromBuffer (614-615)
  • ComputePlanCandidate (660-701)
  • ComputePlanCandidate (660-660)
  • BuildReplicationGuardsIfNeeded (703-751)
  • BuildReplicationGuardsIfNeeded (703-708)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (6)
src/op/parallel.h (1)

104-126: LGTM! Well-structured private helper declarations.

The new private method declarations are well-documented, appropriately scoped, and maintain const correctness. The method names clearly convey their purpose in the two-candidate inference flow, and keeping them private preserves the public API surface.

src/op/parallel.cc (5)

401-402: Good refactoring!

Extracting the buffer-based layout computation into a named helper method improves readability and maintainability while preserving the original behavior.


597-611: LGTM! Clean validation logic.

The validation method correctly checks candidate compatibility against all known fragments, using early returns for efficiency and avoiding exceptions for control flow as documented.


613-658: Excellent refactoring with robust error handling.

The buffer-based candidate computation is well-structured with:

  • Clear separation of common vs. non-common access patterns
  • Validation against inner variables to catch invalid layouts early
  • Try-catch block that enriches TVM errors with actionable context
  • Good diagnostic logging

752-813: Well-designed candidate selection logic.

The selection strategy is thoughtful and well-documented:

  • Validates both candidates first
  • Uses containment relationships to prefer the more specific layout
  • Falls back to replication extent comparison
  • Maintains deterministic behavior

The logic correctly handles the goal of minimizing replication while ensuring correctness.


787-791: Error handling is present, but clarify that it occurs through injectivity validation rather than explicit candidate validity checks.

When both candidates are invalid (lines 787-791), the function returns candidate_from_buffer. The calling code does validate this at lines 479-486 via an injectivity check: if loop_layout_->DetectInjective() finds errors, it throws LoopLayoutInjectiveException with a detailed message including the layout output and AST. However, this check validates the selected candidate's injectivity, not whether both internal validations failed—consider whether the error message should explicitly indicate when both candidates were rejected internally for better debugging.

Comment on lines +443 to +475
```cpp
// In free inference, try two mechanisms and prefer the one that
// minimizes replication while remaining compatible:
// 1) compute_loop_layout_from_buffer (always correct but may
// over-replicate) 2) PlanLoopPartition (often smaller replication)
Fragment candidate_from_buffer;
Fragment candidate_from_plan;

if (read_source_buffer.defined() && allow_layout_propgate) {
loop_layout_ = compute_loop_layout_from_buffer(read_source_buffer);
candidate_from_buffer =
ComputeLoopLayoutFromBuffer(read_source_buffer, T);
}

if (!loop_layout_.defined()) {
// No source buffer available, use free mode inference
// Vectorize Size must be aware of the buffer_remap
// As the pass will do post processing to the layout
auto maybe_remapped_root_ =
IfBufferRemapLoopGenerator::run(root_, T.buffer_remap, T.layout_map);
int vector_size = GetVectorizeSize(maybe_remapped_root_, T.analyzer);
DLOG(INFO) << "[PlanLoopPartition] vector_size = " << vector_size << '\n';

PrimExpr loop_total_size = 1;
for (Stmt l = root_; l.as<For>().has_value();
l = l.as<For>().value()->body)
loop_total_size = loop_total_size * l.as<For>().value()->extent;
DLOG(INFO) << "[PlanLoopPartition] loop_total_size = " << loop_total_size
<< '\n';
while (!analyzer_.CanProve(
floormod(loop_total_size,
T.thread_bounds->extent * vector_size) == 0) &&
vector_size > 1)
vector_size /= 2;
DLOG(INFO) << "[PlanLoopPartition] after adjust: vector_size = "
<< vector_size << '\n';

// Check if coalesced_width is defined
if (auto coalesced_width =
root_->annotations.Get(attr::kCoalescedWidth)) {
if (const auto *imm = coalesced_width->as<IntImmNode>()) {
int expected = imm->value;
// Verify that vector_size is divisible by expected
if (vector_size % expected != 0) {
LOG(FATAL) << "Vector size " << vector_size
<< " is not divisible by coalesced width " << expected;
}
vector_size = expected;
} else {
LOG(FATAL) << "coalesced_width should be an IntImmNode.";
}
}
DLOG(INFO) << "[PlanLoopPartition] root_ = " << root_
<< " ############# vector_size = " << vector_size
<< ", thread_bounds = " << T.thread_bounds << '\n';
loop_layout_ = PlanLoopPartition(root_, vector_size, T.thread_bounds);
DLOG(INFO) << "[PlanLoopPartition] loop_layout_ = "
<< loop_layout_->DebugOutput() << '\n';
// try to infer loop layout with two mechanisms and choose the best one
{
candidate_from_plan = ComputePlanCandidate(T);
}

// Lambda that guards replicated accesses:
// - When a loop layout replicates a fragment buffer (rep > 1), each thread
// observes the same fragment elements. Blindly storing to shared/global
// memory in that case would add the same value multiple times.
// - We therefore restrict the store so that only the replica with rep == 0
// performs the update (e.g. global[i] += fragment[i] only fires once).
// Trigger conditions for this guard:
// 1) There are cross-thread stores targeting shared/global memory (no
// fragment stores in this branch; atomic_add and similar remain TODO).
// 2) The loop layout replicate extent is greater than 1, inferred from the
// thread bounds captured in the layout.

[this, &store_shared_global_buffers, &store_fragment_buffers,
&has_cross_thread_access, &const_index_fragment_buffer, &T]() {
if (is_one(loop_layout_->ReplicateExtent()))
return;
if (!has_cross_thread_access)
return;

if (!store_fragment_buffers.empty()) {
// Iterate replicated fragment stores: when the fragment index is a
// constant (e.g. fragment[0]), every thread touches the same slot, so
// the rep == 0 predicate is unnecessary. Example: for i in
// T.Parallel(...):
// shared[i] = ...
// fragment[0] = ...
bool replicate_is_from_dynamic_index_fragment = false;
for (const auto &fragment : store_fragment_buffers) {
if (!T.layout_map.count(fragment)) {
continue;
}

auto fragment_layout = T.layout_map[fragment].as<Fragment>().value();
if (is_one(fragment_layout->ReplicateExtent()))
continue;

if (analyzer_.CanProveEqual(fragment_layout->ReplicateExtent(),
loop_layout_->ReplicateExtent()))
continue;
if (std::find(const_index_fragment_buffer.begin(),
const_index_fragment_buffer.end(),
fragment) == const_index_fragment_buffer.end()) {
replicate_is_from_dynamic_index_fragment = true;
}
}

if (!replicate_is_from_dynamic_index_fragment)
return;
// Choose the best candidate:
if (candidate_from_buffer.defined() && candidate_from_plan.defined()) {
loop_layout_ =
ChooseBestCandidate(candidate_from_buffer, candidate_from_plan, T);
} else if (candidate_from_plan.defined()) {
loop_layout_ = candidate_from_plan;
DLOG(INFO) << "[FreeInfer] only PlanLoopPartition available, choose it.";
} else if (candidate_from_buffer.defined()) {
loop_layout_ = candidate_from_buffer;
DLOG(INFO)
<< "[FreeInfer] only compute_from_buffer available, choose it.";
}

ICHECK(store_shared_global_buffers.empty())
<< "Invalid layout: cannot have both fragment and shared store "
"buffers "
"in replicated loop layout.";
return;
} else {
// Now, store is global or shared
// or T.call_extern or T.call_intrin ...
auto inv = loop_layout_->Inverse();
Array<PrimExpr> fwd;
for (size_t i = 0; i < loop_layout_->OutputDim(); i++)
fwd.push_back(0);
fwd.push_back(InputPlaceholder(0) - T.thread_bounds->min);
auto rep = inv->Forward(fwd).back();
AddPredicate(EQ(rep, 0));
}
}();
BuildReplicationGuardsIfNeeded(
T, store_shared_global_buffers, store_fragment_buffers,
has_cross_thread_access, const_index_fragment_buffer);
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Verify that loop_layout_ is always defined before the injectivity check
rg -n -A5 -B5 "loop_layout_.*DetectInjective" src/op/parallel.cc
```

Repository: tile-ai/tilelang

Length of output: 505


🏁 Script executed:

```bash
#!/bin/bash
# Check all assignments to loop_layout_ in the function
rg -n "loop_layout_\s*=" src/op/parallel.cc | head -20
```

Repository: tile-ai/tilelang

Length of output: 248


🏁 Script executed:

```bash
#!/bin/bash
# Check if loop_layout_ is initialized/declared at function start
# Search for the function containing lines 443-475
rg -n -B50 "ComputeLoopLayoutFromBuffer" src/op/parallel.cc | grep -E "(void|Fragment|auto).*loop_layout_|loop_layout_\s*=" | head -10
```

Repository: tile-ai/tilelang

Length of output: 129


🏁 Script executed:

```bash
#!/bin/bash
# Look at the broader context around lines 443-475 to understand function scope
sed -n '400,500p' src/op/parallel.cc | cat -n
```

Repository: tile-ai/tilelang

Length of output: 4834


Critical issue: loop_layout_ may be uninitialized when both candidates are undefined.

In the free inference branch, if read_source_buffer is undefined or allow_layout_propgate is false, and ComputePlanCandidate(T) returns an undefined fragment, then loop_layout_ is never assigned. The code then performs a direct pointer dereference at line 480 (loop_layout_->DetectInjective()) without checking if loop_layout_ is defined, resulting in undefined behavior.

Add a fallback case (e.g., an error or default assignment) when both candidates are undefined, or verify that ComputePlanCandidate() is guaranteed to return a defined fragment and add an assertion.

🤖 Prompt for AI Agents
In src/op/parallel.cc around lines 443-475, loop_layout_ can remain
uninitialized when both candidate_from_buffer and candidate_from_plan are
undefined, leading to a crash when the code later dereferences loop_layout_; fix
by adding a fallback branch for the case where neither candidate is defined:
either assign a safe default Fragment to loop_layout_, or log an error and
early-return/throw so no dereference occurs, or add an explicit assertion that
ComputePlanCandidate must return a defined fragment; implement one of these
options and ensure subsequent code checks loop_layout_ before dereference.
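One way to realize the suggested fix is a selection wrapper that can never leave the result undefined (hypothetical pseudocode, not the actual patch):

```python
class LayoutError(RuntimeError):
    pass

def select_loop_layout(candidate_from_buffer, candidate_from_plan, choose_best):
    # Mirrors the selection block, with the missing fallback added:
    # never leave loop_layout_ unset before the injectivity check.
    if candidate_from_buffer is not None and candidate_from_plan is not None:
        return choose_best(candidate_from_buffer, candidate_from_plan)
    if candidate_from_plan is not None:
        return candidate_from_plan
    if candidate_from_buffer is not None:
        return candidate_from_buffer
    raise LayoutError("free-mode inference produced no loop layout candidate")
```

Raising here converts a silent undefined dereference into a diagnosable error; an early return or a safe default fragment would serve the same purpose.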

@LeiWang1999
Member Author

@regression-perf

@github-actions

Performance Regression Test Report

Triggered by: @LeiWang1999
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/20579537457

Results

File Original Latency Current Latency Speedup
example_warp_specialize_gemm_copy_1_gemm_0 0.036929 0.038912 0.949039
example_vertical_slash_sparse_attn 0.237968 0.241746 0.98437
example_warp_specialize_gemm_copy_0_gemm_1 0.038945 0.039232 0.992685
example_tilelang_gemm_fp8 0.322648 0.323656 0.996887
example_topk 0.011008 0.01104 0.997101
example_gqa_decode 0.048417 0.048545 0.997363
example_dynamic 0.657295 0.658318 0.998446
example_gemm_autotune 0.022368 0.022401 0.998527
example_tilelang_gemm_fp8_intrinsic 0.466349 0.466973 0.998663
example_gemv 0.275797 0.276085 0.998958
example_mha_bwd_bhsd 0.0400216 0.0400532 0.999211
example_mha_bwd_bshd_wgmma_pipelined 0.0258325 0.0258474 0.999424
example_gqa_bwd_tma_reduce_varlen 0.063536 0.0635571 0.999669
example_gqa_bwd_wgmma_pipelined 0.073712 0.0737351 0.999687
example_mha_fwd_varlen 0.0455452 0.0455582 0.999715
example_elementwise_add 0.295718 0.29579 0.999757
example_linear_attn_fwd 0.0363374 0.0363432 0.999841
example_fusedmoe_tilelang 0.130257 0.130263 0.999953
block_sparse_attn_tilelang 0.0102471 0.010246 1.00011
example_mha_bwd_bshd 0.0405539 0.0405429 1.00027
example_tilelang_gemm_splitk_vectorize_atomicadd 1.40772 1.40726 1.00033
example_gqa_bwd 0.0496566 0.0496304 1.00053
example_linear_attn_bwd 0.151654 0.151538 1.00077
example_gemm_intrinsics 0.035073 0.035041 1.00091
example_gemm_schedule 0.0324312 0.0324 1.00096
example_tilelang_gemm_splitk 1.40915 1.40778 1.00097
tilelang_example_sparse_tensorcore 0.0150451 0.0150304 1.00097
example_gemm 0.022945 0.022912 1.00144
example_convolution_autotune 1.00124 0.996214 1.00505
example_dequant_gemv_fp16xint4 0.0288485 0.0286355 1.00744
example_dequant_groupedgemm_bf16_mxfp4_hopper 3.53478 3.49481 1.01143
example_per_token_cast_to_fp8 0.00743965 0.00732482 1.01568
example_mha_inference 0.073539 0.072384 1.01596
example_dequant_gemm_w4a8 5.51091 5.39941 1.02065
example_tilelang_nsa_fwd 0.0072155 0.0070359 1.02553
example_tilelang_nsa_decode 0.00689747 0.00672424 1.02576
sparse_mla_fwd_pipelined 0.0985678 0.0960827 1.02586
example_warp_specialize_gemm_softpipe_stage2 0.039584 0.038497 1.02824
sparse_mla_fwd 0.143766 0.139727 1.02891
sparse_mla_bwd 0.39456 0.381855 1.03327
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0159741 0.0154593 1.0333
example_tilelang_block_sparse_attn 0.0104882 0.0101496 1.03336
example_dequant_gemm_fp4_hopper 1.10262 1.067 1.03338
example_mha_sink_fwd_bhsd_sliding_window 0.0161785 0.0156526 1.0336
topk_selector 0.055533 0.0536471 1.03515
example_mha_sink_fwd_bhsd 0.016417 0.0158537 1.03553
fp8_lighting_indexer 0.037222 0.0359062 1.03665
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.0159506 0.0153824 1.03694
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.0149818 0.01442 1.03896
example_blocksparse_gemm 0.0235058 0.0226231 1.03902
example_tilelang_gemm_fp8_2xAcc 0.194582 0.187236 1.03923
example_dequant_gemm_bf16_fp4_hopper 0.59406 0.571598 1.0393
example_gqa_sink_bwd_bhsd_sliding_window 0.0265329 0.0255232 1.03956
example_dequant_gemm_bf16_mxfp4_hopper 0.528875 0.508109 1.04087
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0151531 0.0145578 1.04089
example_mha_sink_bwd_bhsd 0.0644031 0.0618229 1.04173
example_tilelang_sparse_gqa_decode_varlen_indice 0.0178444 0.0171283 1.04181
example_mha_sink_bwd_bhsd_sliding_window 0.0461765 0.0442942 1.0425
example_group_per_split_token_cast_to_fp8 0.0107118 0.0102503 1.04503
example_tilelang_sparse_gqa_decode_varlen_mask 0.0244034 0.0233494 1.04514
example_warp_specialize_gemm_barrierpipe_stage2 0.039424 0.037569 1.04938
example_gqa_sink_bwd_bhsd 0.0438625 0.0417107 1.05159
example_convolution 1.40563 1.33414 1.05358
example_mla_decode 0.487436 0.460938 1.05749

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@LeiWang1999 LeiWang1999 merged commit 0f9bbd7 into tile-ai:main Dec 30, 2025
6 checks passed