[Layout] Fix Layout Bugs in Parallel and Reduce#1713
[Layout] Fix Layout Bugs in Parallel and Reduce#1713kurisu6912 merged 15 commits intotile-ai:mainfrom
Conversation
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
📝 WalkthroughWalkthroughCentralizes per-buffer access tracking in ParallelOpNode via a new BufferAccessInfo and RecordBufferAccess/GetAccessInfo APIs; refactors layout inference and fragment validation to use per-buffer indices plus read/write flags; adds reducer-aware layout logic in ReduceOp; and adds seven TileLang regression tests for issue 1719. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant ParallelOpNode
participant Buffer
participant LayoutInferencer
participant FragmentValidator
Caller->>ParallelOpNode: RecordBufferAccess(buffer, indices, is_write)
Note over ParallelOpNode: Store BufferAccessInfo { indices, is_read, is_write }
Caller->>ParallelOpNode: Trigger InferLayout / ComputePlan
ParallelOpNode->>ParallelOpNode: For each (buffer, access) -> GetAccessInfo(buffer)
ParallelOpNode->>LayoutInferencer: Provide buffer access infos (access.indices, access.is_read/is_write)
LayoutInferencer->>FragmentValidator: Validate candidate fragments using access.indices & flags
FragmentValidator-->>LayoutInferencer: validation result
LayoutInferencer-->>ParallelOpNode: chosen layout/plan (may include red_layout)
ParallelOpNode-->>Caller: return plan
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/op/reduce.cc`:
- Around line 475-481: Summary: Fix the minor typo in a comment in
src/op/reduce.cc. Locate the comment block inside reduce.cc that reads "We
shouldn't widen the layout here, because is may be written by other parallel for
op" and change "is" to "it" so it reads "because it may be written by other
parallel for op"; no logic change required—only update the comment text in that
block (the commented-out if (dst_rep > src_rep) { ... } section).
🧹 Nitpick comments (1)
testing/python/issue/test_tilelang_issue_1719.py (1)
12-19: Consider initializingtmp1or adding a clarifying comment.
tmp1is allocated but never initialized before being used as a divisor at line 19. While the test only inspects the generated source (never executes the kernel), using uninitialized memory can be confusing for future maintainers who might try to run the kernel.Consider either initializing
tmp1or adding a comment clarifying that this test only validates code generation, not runtime behavior.💡 Option 1: Initialize tmp1
tmp1 = T.alloc_fragment((N, M), T.float32) tmp2 = T.alloc_fragment((N, M), T.float32) tmp3 = T.alloc_fragment((N, M, M), T.float32) + for i, k in T.Parallel(N, M): + tmp1[i, k] = 1 for i, j, k in T.Parallel(N, M, M): tmp3[i, j, k] = 1💡 Option 2: Add clarifying comment
def test_tilelang_issue_1719(): + # This test validates code generation only; the kernel is not executed. + # tmp1 is intentionally left uninitialized as we only check the source. `@tilelang.jit`() def _buggy_kernel(M: int, N: int) -> tilelang.JITKernel:
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
|
@regression-perf |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/op/reduce.cc`:
- Around line 412-437: The proven-path currently stores BufferLoad(clear_buffer,
red_indices) while the unproven path stores the computed update (which may
include accumulation), causing inconsistent semantics when need_update is true;
change the analyzer->CanProve(predicate) branch in the block handling
need_duplicate so that it pushes BufferStore(dst_buffer, update, dst_indices)
(i.e., reuse the previously computed update value) when need_update is true (and
still push the plain clear_buffer load only when need_update is false), ensuring
both the proven and unproven branches store the same value; update references:
analyzer->CanProve(predicate), need_update, update, BufferStore, clear_buffer,
dst_buffer, dst_indices, BufferLoad.
🧹 Nitpick comments (4)
src/op/reduce.cc (1)
244-245: Consider null check foras_const_intresults.The
as_const_intfunction returns a pointer that could be null if the expression isn't a constant integer. While the context likely guarantees constant extents, a defensive check would prevent potential crashes on unexpected inputs.💡 Suggested defensive pattern
- auto dst_rep = *as_const_int(dst_layout->ReplicateExtent()); - auto red_rep = *as_const_int(red_layout->ReplicateExtent()); + auto dst_rep_ptr = as_const_int(dst_layout->ReplicateExtent()); + auto red_rep_ptr = as_const_int(red_layout->ReplicateExtent()); + ICHECK(dst_rep_ptr && red_rep_ptr) + << "ReplicateExtent must be constant for reduce layout inference"; + auto dst_rep = *dst_rep_ptr; + auto red_rep = *red_rep_ptr;testing/python/issue/test_tilelang_issue_1719.py (3)
7-26: Test only prints kernel source without validation.
test_issue_1719_layout_1compiles a kernel but only prints the source without any assertions. Consider adding validation to ensure the kernel compiles correctly and the layout is as expected.💡 Suggested improvement
kernel = _buggy_kernel() print(kernel.get_kernel_source()) + # Verify kernel compiles without error + assert kernel.get_kernel_source() is not None
122-124: Expression statementtmp2[0]has no effect.Line 124 loads
tmp2[0]but doesn't do anything with the result. This appears to be either incomplete or intended to verify the buffer is accessible. Consider making the intent clearer:💡 Clarify intent
T.reduce_sum(tmp1, tmp2) - tmp2[0] + # Verify buffer is accessible after reduction + _ = tmp2[0]
77-150: Consider adding more assertions to remaining tests.Tests 4-7 are smoke tests that verify compilation succeeds, which is valuable. However, consider adding at least minimal assertions to catch regressions:
- Verify
get_kernel_source()returns non-empty output- For tests that instantiate without parameters (like 4, 6, 7), ensure the kernel object is valid
This would make the tests more robust without significantly increasing complexity.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/op/reduce.cc`:
- Around line 244-245: The code dereferences the result of as_const_int on
dst_layout->ReplicateExtent() and red_layout->ReplicateExtent() into dst_rep and
red_rep without checking for nullptr; update the code in the reduction routine
to call as_const_int for both ReplicateExtent() results, check each pointer for
nullptr before dereferencing, and handle the symbolic (nullptr) case
appropriately (e.g., return an error/status, throw an exception, or fall back to
a runtime-handled path) so you avoid null pointer dereference when
ReplicateExtent is not a compile-time constant.
🧹 Nitpick comments (1)
src/op/reduce.cc (1)
163-186: Consider adding defensive bounds check fordimparameter.The function accesses
src_layout->InputShape()[dim]without validating thatdimis within bounds. While the caller (LowerandInferLayout) validates dimensions elsewhere, adding a defensive check here would make the function safer for potential future use.♻️ Suggested defensive check
static Fragment ComputeReducerLayout(const Fragment &src_layout, int dim) { + ICHECK(dim >= 0 && static_cast<size_t>(dim) < src_layout->InputDim()) + << "Reduction dimension out of bounds: " << dim; PrimExpr src_rep_extent = src_layout->ReplicateExtent(); PrimExpr indice_rep_extent = src_layout->InputShape()[dim];
Performance Regression Test ReportTriggered by: @kurisu6912 Results
Artifacts
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@testing/python/issue/test_tilelang_issue_1719.py`:
- Around line 50-75: The test test_issue_1719_layout_3 creates CUDA tensors and
must be skipped on CPU-only runners; add the `@tilelang.testing.requires_cuda`
decorator immediately above the test_issue_1719_layout_3 definition to ensure it
only runs when CUDA is available, leaving the rest of the function
(buggy_kernel, kernel invocation, and assertions) unchanged.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/op/reduce.cc`:
- Around line 389-414: The code unconditionally calls inv.pop_back() after
dst_layout->Inverse()->Forward(dst_th_indices), which breaks when the replicate
component was condensed (i.e. Forward did not append an extra element); guard
the pop by checking the returned inv size against the expected size (e.g. only
call inv.pop_back() if inv.size() > dst_layout->InputDim() or inv.size() ==
dst_layout->InputDim() + 1), and handle the mismatch case appropriately (skip
the pop and/or set predicate to false or bail out) before using inv in the loop
that builds the predicate; update the logic around inv.pop_back(), the
subsequent for-loop comparing inv[i] to dst_vars[i]->var, and keep the
analyzer->Simplify(predicate) call.
- Around line 277-321: When a layout mismatch causes creation of clear_buffer
(need_duplicate=true) but require_init is false (e.g., max/min with
clear=false), the duplicate buffer remains uninitialized; seed it from the
original dst_buffer after decl_buffer. After creating clear_buffer (the
decl_buffer call), if need_duplicate && !require_init, emit elementwise copies
that read from dst_buffer (BufferLoad(dst_buffer, ...)) and write into
clear_buffer (BufferStore(clear_buffer, ...)) using the appropriate index
mapping between dst_layout and red_layout (use the same index expressions you
use elsewhere, e.g., dst/red indices), so the accumulator starts with the
original dst values; alternatively, set require_init=true and initialize from
MakeInitValue if you prefer forcing clear. Ensure you reference decl_buffer,
clear_buffer, dst_buffer, need_duplicate, require_init, BufferLoad and
BufferStore when implementing the fix.
🧹 Nitpick comments (1)
testing/python/issue/test_tilelang_issue_1719.py (1)
96-106: Unused expression on line 104.
tmp2[0]is a standalone expression that reads the value but discards it. If this is intentional to test that the read compiles correctly afterreduce_sum, consider adding a brief comment explaining the intent. Otherwise, if this was meant to be an assignment or assertion, please fix accordingly.💡 If intentional, add a clarifying comment
T.reduce_sum(tmp1, tmp2) - tmp2[0] + # Access tmp2[0] to verify layout inference handles post-reduction reads + _ = tmp2[0]
T.Parallel: When a loop read a buffer, the buffer layout contains the loop layout (every read can hit the buffer); and when the loop write a buffer, loop layout contains the buffer layout (every place in the buffer is written)T.reduce_xx: First infer a reducer layout, then use the reducer layout to do reduce, and finally copy the reducer to target buffer.Summary by CodeRabbit
Refactor
Behavior Change
Tests
✏️ Tip: You can customize this high-level summary in your review settings.