[Parallel][Infer] Free-mode chooses minimal replication between buffer-based and PlanLoopPartition #1559
Conversation
* Enhanced the layout inference mechanism in ParallelOpNode to utilize two strategies, compute_loop_layout_from_buffer and PlanLoopPartition, selecting the one that minimizes replication while ensuring compatibility.
* Updated the logic to choose the best candidate layout based on replication size and containment checks.
* Refactored the HasKnownLayoutAnchor function to clarify its purpose in prioritizing buffer layouts.
* Added a new test case to validate the layout inference behavior, ensuring the correct fragments are generated in the output.

This update aims to optimize layout inference for parallel operations, improving performance and resource utilization.
👋 Hi! Thank you for contributing to the TileLang project. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
📝 Walkthrough

Adds a dual-candidate free-layout inference path in ParallelOpNode: derives one candidate from source buffers and another from the plan, validates both against fragment buffers, inserts replication guards when needed, and deterministically selects the preferred candidate by validation, containment, and replication heuristics. Also adjusts prioritization logic, adds a unit test, and enforces fragment layout annotations.
Sequence Diagram(s): (mermaid diagram not reproduced here)

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches: ❌ failed checks (1 warning) | ✅ passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
testing/python/issue/test_tilelang_issue_layout.py (1)
27-32: Effective source-level verification of fragment sizes.

The assertions checking for `float S_frag[4];`, `float A_frag[4];`, and `float B_frag[4];` directly validate that the layout inference chose the minimal replication (4 elements per thread from a 128×4 tensor with 128 threads).

Consider adding a negative assertion or a comment explaining what the incorrect output would look like (e.g., `float S_frag[512];` if over-replicated) to make the test's purpose clearer to future maintainers.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- src/op/parallel.cc
- src/transform/layout_inference.cc
- testing/python/issue/test_tilelang_issue_layout.py
- tilelang/language/annotations.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/issue/test_tilelang_issue_layout.py
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/issue/test_tilelang_issue_layout.py
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.
Applied to files:
src/transform/layout_inference.cc
🧬 Code graph analysis (3)
src/op/parallel.cc (1)
src/transform/loop_partition.cc (4)
`PlanLoopPartition` (253-257), `PlanLoopPartition` (253-254), `PlanLoopPartition` (259-265), `PlanLoopPartition` (259-260)
testing/python/issue/test_tilelang_issue_layout.py (3)
tilelang/language/allocate.py (1): `alloc_fragment` (72-85)
tilelang/language/annotations.py (1): `annotate_layout` (27-40)
tilelang/language/v2/builder.py (1): `source` (690-691)
tilelang/language/annotations.py (4)
tilelang/layout/fragment.py (1): `Fragment` (13-205)
tilelang/utils/language.py (1): `is_fragment` (104-115)
src/transform/layout_inference.cc (4): `buffer` (590-608), `buffer` (590-590), `buffer` (916-935), `buffer` (916-916)
tilelang/language/v2/builder.py (1): `buffer` (125-126)
🪛 Ruff (0.14.10)
testing/python/issue/test_tilelang_issue_layout.py
9-9: Unused function argument: A
(ARG001)
9-9: Unused function argument: B
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (9)
tilelang/language/annotations.py (2)
5-6: LGTM!

The imports correctly bring in the `Fragment` class and `is_fragment` utility needed for the new runtime validation of fragment buffer layouts.
31-32: LGTM! Good runtime guard for fragment-layout type consistency.

This assertion ensures that fragment buffers receive proper `Fragment` layouts, catching type mismatches early. The placement before the `isinstance(layout, Layout)` check is correct since `Fragment` is a subclass of `Layout`.

Consider wrapping this line for readability if the project enforces line length limits.
src/transform/layout_inference.cc (3)
450-461: Semantic change from "all buffers known" to "any buffer known".

The renamed `HasKnownLayoutAnchor` now returns `true` if any touched buffer has a known layout (previously all were required). This makes prioritization more aggressive: ops get enqueued at the front as soon as they have at least one layout anchor.

This aligns with the dual-candidate inference strategy, allowing earlier propagation from partial layout knowledge.
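For readers skimming the diff, the behavioral difference boils down to `std::all_of` versus `std::any_of`. A self-contained illustration; the `Buffer`/`LayoutMap` types here are simplified stand-ins, not the TVM classes:

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

using Buffer = std::string;                        // stand-in for TVM's Buffer
using LayoutMap = std::unordered_map<Buffer, int>; // value type is irrelevant here

// Old behavior: only prioritize when every touched buffer has a known layout.
bool AllBuffersKnown(const std::vector<Buffer>& touched, const LayoutMap& known) {
  return std::all_of(touched.begin(), touched.end(),
                     [&](const Buffer& b) { return known.count(b) > 0; });
}

// New behavior: a single known layout is enough to act as an inference anchor.
bool HasKnownLayoutAnchor(const std::vector<Buffer>& touched, const LayoutMap& known) {
  return std::any_of(touched.begin(), touched.end(),
                     [&](const Buffer& b) { return known.count(b) > 0; });
}
```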
475-479: LGTM!

The call site correctly uses the renamed `HasKnownLayoutAnchor` function.
604-607: Good refactor using `std::none_of`.

The use of `std::none_of` with `same_as` comparison is cleaner and more idiomatic than manual iteration for duplicate checking.
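For reference, the pattern reduced to a standalone sketch; `same_as` reference identity is emulated with pointer equality, and the types are placeholders:

```cpp
#include <algorithm>
#include <vector>

struct BufferNode {};
using Buffer = BufferNode*; // stand-in for TVM's reference type; same_as ~ pointer equality

// Append `buf` only if no existing element refers to the same object.
void PushIfAbsent(std::vector<Buffer>& buffers, Buffer buf) {
  if (std::none_of(buffers.begin(), buffers.end(),
                   [&](Buffer b) { return b == buf; })) {
    buffers.push_back(buf);
  }
}
```

src/op/parallel.cc (3)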
486-543: LGTM! Clean dual-candidate inference approach.

The code now computes two candidate layouts in free inference mode:

- `candidate_from_buffer`: derived from the existing buffer layout (always correct but may over-replicate)
- `candidate_from_plan`: derived from `PlanLoopPartition` (often smaller replication)

This sets up the selection logic that follows to minimize replication while preserving correctness.
546-588: Well-structured selection heuristic for choosing the optimal layout.

The containment-based selection logic is sound:

- If the buffer-based layout contains the plan-based one, the plan is smaller/more optimal: choose it.
- If the plan-based layout contains the buffer-based one, the buffer is more constrained: choose it.
- If neither contains the other, compare replication extents and prefer the smaller.
- Fall back to the buffer-based layout when uncertain (safety).

One minor observation: when both `buf_contains_plan` and `plan_contains_buf` are true (mutual containment, i.e., equivalent layouts), the code falls through to the replication comparison at line 571, which is correct behavior.
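A compact, self-contained sketch of that selection order, under the assumption that containment is available as a predicate; `Fragment` and `Contains` below are simplified stand-ins, not the TileLang types:

```cpp
#include <cstdint>

// Stand-in: only the replication extent matters for this sketch.
struct Fragment {
  int64_t replicate_extent; // how many threads observe the same elements (>= 1)
};

// Placeholder for the real ProveFragmentContains proof: treat `big` as
// containing `small` when its replication extent is a multiple of `small`'s.
bool Contains(const Fragment& big, const Fragment& small) {
  return big.replicate_extent >= small.replicate_extent &&
         big.replicate_extent % small.replicate_extent == 0;
}

// Mirrors the selection order described in the review comment above.
const Fragment& ChooseBestCandidate(const Fragment& from_buffer,
                                    const Fragment& from_plan) {
  const bool buf_contains_plan = Contains(from_buffer, from_plan);
  const bool plan_contains_buf = Contains(from_plan, from_buffer);
  if (buf_contains_plan && !plan_contains_buf) return from_plan;   // plan strictly smaller
  if (plan_contains_buf && !buf_contains_plan) return from_buffer; // buffer more constrained
  // Mutual containment (equivalent layouts) or incomparable: compare replication.
  if (from_plan.replicate_extent < from_buffer.replicate_extent) return from_plan;
  return from_buffer; // safe fallback: the buffer-based candidate is always correct
}
```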
656-658: The early return is intentional and safe when both candidates are undefined.

When `read_source_buffer` is undefined and `allow_layout_propgate` is false, neither candidate is computed. The code then returns `{}` at line 657, which aligns with the documented behavior ("Returns empty layout map if loop_layout_ is not already defined").

This early return is protective: it exits before downstream code (line 660 onwards) that calls `loop_layout_->DetectInjective()` and other methods requiring `loop_layout_` to be defined. The function documentation explicitly acknowledges this case, confirming it is intentional.

testing/python/issue/test_tilelang_issue_layout.py (1)
6-24: Good test coverage for dual-candidate layout inference.

This test effectively validates that free inference selects the layout with minimal replication. The kernel structure with an annotated `A_frag` and inferred `B_frag`/`S_frag` exercises the selection heuristic.

Regarding the static analysis hint about unused `A` and `B` parameters: these are intentionally unused since the test focuses on fragment allocation behavior rather than actual computation. Consider adding a brief comment or using an underscore prefix (`_A`, `_B`) to signal intent if your style guide requires it.
Actionable comments posted: 1
🧹 Nitpick comments (4)
src/op/parallel.cc (4)
393-393: Consider removing this comment.

This comment doesn't add meaningful context since the refactoring is evident from the code structure. Removing it would clean up the codebase.
681-693: Consider whether the coalesced_width check should be fatal.

The code enforces that `vector_size % coalesced_width == 0` with `LOG(FATAL)` (lines 686-687). This is a strict requirement that will crash if violated. Consider whether:

- this should be a validation error with better diagnostics about why the constraint failed,
- the code could adjust `vector_size` to be exactly `coalesced_width` when they don't divide evenly (as you already do on line 689), or
- the current behavior is intentional for correctness.

If the strict check is intentional for correctness, consider adding a comment explaining why violating this constraint would cause incorrect behavior.
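For illustration, here is one way the non-fatal alternative could look as a standalone sketch; the function name and the fall-back policy are assumptions, not the current implementation:

```cpp
#include <iostream>

// Reconcile the auto-detected vector size with a user-annotated coalesced
// width. Rather than aborting on a mismatch, warn and fall back to the
// annotated width so compilation proceeds with a correct (if slower) layout.
int ReconcileVectorSize(int vector_size, int coalesced_width) {
  if (coalesced_width <= 0) return vector_size; // no annotation present
  if (vector_size % coalesced_width != 0) {
    std::cerr << "warning: coalesced_width " << coalesced_width
              << " does not divide the detected vector size " << vector_size
              << "; using the annotated width directly\n";
  }
  return coalesced_width; // the annotation wins either way, as in the current code
}
```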
738-740: Consider improving the error message for users.

The `ICHECK` will terminate execution with a somewhat generic message. Consider:

- adding more context about why this combination is invalid,
- suggesting what the user should do to fix their code (e.g., "move shared/global stores outside the parallel loop" or "ensure fragment stores don't have cross-thread replication"), and
- including information about the specific buffers involved to help debugging.
768-772: Consider clarifying the containment check lambda.

The `contains` lambda comment states "contains(A, B) means: for any loop index, the threads that access B's elements are a subset of those that access A's elements." However, the implementation calls `ProveFragmentContains(small, big, ...)` where the first argument is `small` and the second is `big`, which can be confusing given that the parameter names are reversed (`big`, `small`).

Consider either:

- reversing the parameter order to match the call: `contains(small, big)` with `ProveFragmentContains(small, big, ...)`, or
- making the comment even more explicit about the parameter/argument mapping.

This would reduce cognitive load when understanding the containment checks at lines 793-794 and 801-804.
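A sketch of the first option with placeholder types; only the parameter ordering is the point here, everything else is illustrative:

```cpp
struct Fragment {};
struct Analyzer {};

// Stand-in for the real helper: true if every thread touching `small`'s
// elements also touches `big`'s elements.
bool ProveFragmentContains(const Fragment& small, const Fragment& big,
                           Analyzer& analyzer) {
  (void)small; (void)big; (void)analyzer;
  return true; // placeholder body
}

void Example(const Fragment& candidate_from_plan,
             const Fragment& candidate_from_buffer, Analyzer& analyzer) {
  // The lambda's parameter order now mirrors ProveFragmentContains(small, big, ...),
  // so the declaration and its call sites read in the same direction.
  auto contains = [&](const Fragment& small, const Fragment& big) {
    return ProveFragmentContains(small, big, analyzer);
  };
  if (contains(candidate_from_plan, candidate_from_buffer)) {
    // the plan-based layout is the contained (smaller) one; prefer it
  }
}
```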
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/op/parallel.cc
- src/op/parallel.h
🧰 Additional context used
🧬 Code graph analysis (1)
src/op/parallel.h (1)
src/op/parallel.cc (10)
`ValidateCandidateAgainstFragments` (597-611), (597-598); `ChooseBestCandidate` (752-813), (753-755); `ComputeLoopLayoutFromBuffer` (613-658), (614-615); `ComputePlanCandidate` (660-701), (660-660); `BuildReplicationGuardsIfNeeded` (703-751), (703-708)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (6)
src/op/parallel.h (1)
104-126: LGTM! Well-structured private helper declarations.The new private method declarations are well-documented, appropriately scoped, and maintain const correctness. The method names clearly convey their purpose in the two-candidate inference flow, and keeping them private preserves the public API surface.
src/op/parallel.cc (5)
401-402: Good refactoring!

Extracting the buffer-based layout computation into a named helper method improves readability and maintainability while preserving the original behavior.
597-611: LGTM! Clean validation logic.

The validation method correctly checks candidate compatibility against all known fragments, using early returns for efficiency and avoiding exceptions for control flow as documented.
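For readers unfamiliar with the file, the described shape is roughly the following; a standalone sketch with placeholder types, not the actual method:

```cpp
#include <vector>

struct Fragment { int tag = 0; };

// Placeholder for the real compatibility proof between a candidate loop
// layout and a fragment buffer's known layout.
bool CompatibleWith(const Fragment& candidate, const Fragment& fragment) {
  return candidate.tag == fragment.tag; // stand-in criterion
}

// Early-return validation; no exceptions used for control flow.
bool ValidateCandidateAgainstFragments(const Fragment& candidate,
                                       const std::vector<Fragment>& fragments) {
  for (const Fragment& frag : fragments) {
    if (!CompatibleWith(candidate, frag)) {
      return false; // the first incompatibility rejects the candidate
    }
  }
  return true;
}
```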
613-658: Excellent refactoring with robust error handling.

The buffer-based candidate computation is well-structured, with:
- Clear separation of common vs. non-common access patterns
- Validation against inner variables to catch invalid layouts early
- Try-catch block that enriches TVM errors with actionable context
- Good diagnostic logging
752-813: Well-designed candidate selection logic.

The selection strategy is thoughtful and well-documented:
- Validates both candidates first
- Uses containment relationships to prefer the more specific layout
- Falls back to replication extent comparison
- Maintains deterministic behavior
The logic correctly handles the goal of minimizing replication while ensuring correctness.
787-791: Error handling is present, but clarify that it occurs through injectivity validation rather than explicit candidate validity checks.

When both candidates are invalid (lines 787-791), the function returns `candidate_from_buffer`. The calling code does validate this at lines 479-486 via an injectivity check: if `loop_layout_->DetectInjective()` finds errors, it throws `LoopLayoutInjectiveException` with a detailed message including the layout output and AST. However, this check validates the selected candidate's injectivity, not whether both internal validations failed; consider whether the error message should explicitly indicate when both candidates were rejected internally, for better debugging.
The diff hunk under review (src/op/parallel.cc; removed and newly added lines are interleaved as rendered by the diff view):

```cpp
// In free inference, try two mechanisms and prefer the one that
// minimizes replication while remaining compatible:
// 1) compute_loop_layout_from_buffer (always correct but may
//    over-replicate)
// 2) PlanLoopPartition (often smaller replication)
Fragment candidate_from_buffer;
Fragment candidate_from_plan;

if (read_source_buffer.defined() && allow_layout_propgate) {
  loop_layout_ = compute_loop_layout_from_buffer(read_source_buffer);
  candidate_from_buffer =
      ComputeLoopLayoutFromBuffer(read_source_buffer, T);
}

if (!loop_layout_.defined()) {
  // No source buffer available, use free mode inference
  // Vectorize Size must be aware of the buffer_remap
  // As the pass will do post processing to the layout
  auto maybe_remapped_root_ =
      IfBufferRemapLoopGenerator::run(root_, T.buffer_remap, T.layout_map);
  int vector_size = GetVectorizeSize(maybe_remapped_root_, T.analyzer);
  DLOG(INFO) << "[PlanLoopPartition] vector_size = " << vector_size << '\n';

  PrimExpr loop_total_size = 1;
  for (Stmt l = root_; l.as<For>().has_value();
       l = l.as<For>().value()->body)
    loop_total_size = loop_total_size * l.as<For>().value()->extent;
  DLOG(INFO) << "[PlanLoopPartition] loop_total_size = " << loop_total_size
             << '\n';
  while (!analyzer_.CanProve(
             floormod(loop_total_size,
                      T.thread_bounds->extent * vector_size) == 0) &&
         vector_size > 1)
    vector_size /= 2;
  DLOG(INFO) << "[PlanLoopPartition] after adjust: vector_size = "
             << vector_size << '\n';

  // Check if coalesced_width is defined
  if (auto coalesced_width =
          root_->annotations.Get(attr::kCoalescedWidth)) {
    if (const auto *imm = coalesced_width->as<IntImmNode>()) {
      int expected = imm->value;
      // Verify that vector_size is divisible by expected
      if (vector_size % expected != 0) {
        LOG(FATAL) << "Vector size " << vector_size
                   << " is not divisible by coalesced width " << expected;
      }
      vector_size = expected;
    } else {
      LOG(FATAL) << "coalesced_width should be an IntImmNode.";
    }
  }
  DLOG(INFO) << "[PlanLoopPartition] root_ = " << root_
             << " ############# vector_size = " << vector_size
             << ", thread_bounds = " << T.thread_bounds << '\n';
  loop_layout_ = PlanLoopPartition(root_, vector_size, T.thread_bounds);
  DLOG(INFO) << "[PlanLoopPartition] loop_layout_ = "
             << loop_layout_->DebugOutput() << '\n';
// try to infer loop layout with two mechanisms and choose the best one
{
  candidate_from_plan = ComputePlanCandidate(T);
}

// Lambda that guards replicated accesses:
// - When a loop layout replicates a fragment buffer (rep > 1), each thread
//   observes the same fragment elements. Blindly storing to shared/global
//   memory in that case would add the same value multiple times.
// - We therefore restrict the store so that only the replica with rep == 0
//   performs the update (e.g. global[i] += fragment[i] only fires once).
// Trigger conditions for this guard:
// 1) There are cross-thread stores targeting shared/global memory (no
//    fragment stores in this branch; atomic_add and similar remain TODO).
// 2) The loop layout replicate extent is greater than 1, inferred from the
//    thread bounds captured in the layout.

[this, &store_shared_global_buffers, &store_fragment_buffers,
 &has_cross_thread_access, &const_index_fragment_buffer, &T]() {
  if (is_one(loop_layout_->ReplicateExtent()))
    return;
  if (!has_cross_thread_access)
    return;

  if (!store_fragment_buffers.empty()) {
    // Iterate replicated fragment stores: when the fragment index is a
    // constant (e.g. fragment[0]), every thread touches the same slot, so
    // the rep == 0 predicate is unnecessary. Example:
    //   for i in T.Parallel(...):
    //     shared[i] = ...
    //     fragment[0] = ...
    bool replicate_is_from_dynamic_index_fragment = false;
    for (const auto &fragment : store_fragment_buffers) {
      if (!T.layout_map.count(fragment)) {
        continue;
      }

      auto fragment_layout = T.layout_map[fragment].as<Fragment>().value();
      if (is_one(fragment_layout->ReplicateExtent()))
        continue;

      if (analyzer_.CanProveEqual(fragment_layout->ReplicateExtent(),
                                  loop_layout_->ReplicateExtent()))
        continue;
      if (std::find(const_index_fragment_buffer.begin(),
                    const_index_fragment_buffer.end(),
                    fragment) == const_index_fragment_buffer.end()) {
        replicate_is_from_dynamic_index_fragment = true;
      }
    }

    if (!replicate_is_from_dynamic_index_fragment)
      return;
// Choose the best candidate:
if (candidate_from_buffer.defined() && candidate_from_plan.defined()) {
  loop_layout_ =
      ChooseBestCandidate(candidate_from_buffer, candidate_from_plan, T);
} else if (candidate_from_plan.defined()) {
  loop_layout_ = candidate_from_plan;
  DLOG(INFO) << "[FreeInfer] only PlanLoopPartition available, choose it.";
} else if (candidate_from_buffer.defined()) {
  loop_layout_ = candidate_from_buffer;
  DLOG(INFO)
      << "[FreeInfer] only compute_from_buffer available, choose it.";
}

    ICHECK(store_shared_global_buffers.empty())
        << "Invalid layout: cannot have both fragment and shared store "
           "buffers "
           "in replicated loop layout.";
    return;
  } else {
    // Now, store is global or shared
    // or T.call_extern or T.call_intrin ...
    auto inv = loop_layout_->Inverse();
    Array<PrimExpr> fwd;
    for (size_t i = 0; i < loop_layout_->OutputDim(); i++)
      fwd.push_back(0);
    fwd.push_back(InputPlaceholder(0) - T.thread_bounds->min);
    auto rep = inv->Forward(fwd).back();
    AddPredicate(EQ(rep, 0));
  }
}();
BuildReplicationGuardsIfNeeded(
    T, store_shared_global_buffers, store_fragment_buffers,
    has_cross_thread_access, const_index_fragment_buffer);
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that loop_layout_ is always defined before the injectivity check
rg -n -A5 -B5 "loop_layout_.*DetectInjective" src/op/parallel.ccRepository: tile-ai/tilelang
Length of output: 505
🏁 Script executed:
#!/bin/bash
# Check all assignments to loop_layout_ in the function
rg -n "loop_layout_\s*=" src/op/parallel.cc | head -20Repository: tile-ai/tilelang
Length of output: 248
🏁 Script executed:
#!/bin/bash
# Check if loop_layout_ is initialized/declared at function start
# Search for the function containing lines 443-475
rg -n -B50 "ComputeLoopLayoutFromBuffer" src/op/parallel.cc | grep -E "(void|Fragment|auto).*loop_layout_|loop_layout_\s*=" | head -10Repository: tile-ai/tilelang
Length of output: 129
🏁 Script executed:
#!/bin/bash
# Look at the broader context around lines 443-475 to understand function scope
sed -n '400,500p' src/op/parallel.cc | cat -nRepository: tile-ai/tilelang
Length of output: 4834
Critical issue: `loop_layout_` may be uninitialized when both candidates are undefined.

In the free inference branch, if `read_source_buffer` is undefined or `allow_layout_propgate` is false, and `ComputePlanCandidate(T)` returns an undefined fragment, then `loop_layout_` is never assigned. The code then performs a direct pointer dereference at line 480 (`loop_layout_->DetectInjective()`) without checking whether `loop_layout_` is defined, resulting in undefined behavior.

Add a fallback case (e.g., an error or default assignment) when both candidates are undefined, or verify that `ComputePlanCandidate()` is guaranteed to return a defined fragment and add an assertion.
🤖 Prompt for AI Agents
In src/op/parallel.cc around lines 443-475, loop_layout_ can remain
uninitialized when both candidate_from_buffer and candidate_from_plan are
undefined, leading to a crash when the code later dereferences loop_layout_; fix
by adding a fallback branch for the case where neither candidate is defined:
either assign a safe default Fragment to loop_layout_, or log an error and
early-return/throw so no dereference occurs, or add an explicit assertion that
ComputePlanCandidate must return a defined fragment; implement one of these
options and ensure subsequent code checks loop_layout_ before dereference.
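One possible shape for that fallback, sketched against the surrounding code rather than as a standalone program; the message text and the choice of `LOG(FATAL)` are assumptions, not the actual fix:

```cpp
// Ensure loop_layout_ is always assigned before the later
// loop_layout_->DetectInjective() call dereferences it.
if (candidate_from_buffer.defined() && candidate_from_plan.defined()) {
  loop_layout_ =
      ChooseBestCandidate(candidate_from_buffer, candidate_from_plan, T);
} else if (candidate_from_plan.defined()) {
  loop_layout_ = candidate_from_plan;
} else if (candidate_from_buffer.defined()) {
  loop_layout_ = candidate_from_buffer;
} else {
  // Fallback: fail loudly instead of dereferencing an undefined layout later.
  LOG(FATAL) << "InferLayout(free): no loop layout candidate could be derived";
}
ICHECK(loop_layout_.defined());
```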
@regression-perf

Performance Regression Test Report

Triggered by: @LeiWang1999

Results

Artifacts
Title

Summary

In `ParallelOpNode::InferLayout` (free mode), generate two loop layout candidates:

- Run `ProveFragmentContains` on loop vars to check mutual containment.
- Compare `ReplicateExtent`: pick the smaller one when provable; otherwise fall back to buffer-based (safest).
- Validate each candidate with `ProveFragmentContains` against all fragments in the loop.

Why

`compute_loop_layout_from_buffer` is guaranteed correct but can inflate `replicate` and hurt performance. `PlanLoopPartition` can often produce a layout with fewer replicas without sacrificing correctness. This change automatically prefers that smaller replication when containment holds.

Changes

- `annotate_layout` must use `Fragment` objects (type guard)

Impact
Notes
Summary by CodeRabbit

- New Features
- Bug Fixes
- Tests
- Chores