-
Notifications
You must be signed in to change notification settings - Fork 333
[Enhancement] Support Copy for Buffer Load witih scalar indices #946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…n tilelang - Introduced new functions for buffer load copy with stride and parallel execution. - Enhanced the copy logic in `copy.py` to simplify nested if statements for BufferLoad nodes. - Added corresponding test cases for the new buffer load functionalities.
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
|
also fix for issue #837 |
WalkthroughAdds early-lowering logic in tilelang copy to emit a BufferStore when both operands are BufferLoad with undetermined extents, altering control flow. Introduces new tests validating buffer load copy, including a parallel grid-tiling variant, with compilation and runtime checks. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Caller
participant Copy as copy(...)
participant Lower as Lowering
participant Store as BufferStore
participant Generic as GenericPath
Caller->>Copy: invoke copy(src, dst)
Copy->>Lower: analyze operands and extents
alt both operands are BufferLoad AND extents are None
note over Lower: New early-lowering path
Lower-->>Store: emit BufferStore(dst, src)
Store-->>Caller: return
else
Lower-->>Generic: proceed with generic flow (assertions/extent deduction)
Generic-->>Caller: continue existing lowering
end
sequenceDiagram
autonumber
participant Test as test_tilelang_copy_buffer_load_with_parallel
participant Build as T.Kernel build
participant CUDA as CUDA runtime
participant Run as Kernel launch
participant Assert as NumPy/Torch assert
Test->>Build: construct parallel tiled kernel (M,N, block_M, block_N)
Build-->>CUDA: compile for CUDA
Test->>Run: execute with random A
Run-->>Test: produce B
Test->>Assert: compare B ≈ A within tol
Assert-->>Test: pass/fail
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tilelang/language/copy.py (1)
48-54: Scalar copy lowering looks correct.The early return for BufferLoad-to-BufferLoad with undetermined extents correctly generates a direct BufferStore, implementing the scalar copy optimization described in the PR.
A few observations:
- The logic correctly checks both extents are None and both operands are BufferLoad before returning
- The BufferStore signature
(dst.buffer, src, dst.indices)is correct- The
coalesced_width,disable_tma, andeviction_policyparameters are bypassed in this path, which is appropriate for scalar operations where bulk-transfer semantics don't applyHowever, note that the function's docstring states
Returns: tir.Call, but this new path returnstir.BufferStore. Consider updating the docstring to reflect this conditional return behavior.Apply this diff to update the docstring:
Returns: - tir.Call: A handle to the copy operation + Union[tir.Call, tir.BufferStore]: A handle to the copy operation (tir.BufferStore for scalar copies, tir.Call otherwise)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
testing/python/language/test_tilelang_language_copy.py(1 hunks)tilelang/language/copy.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_copy.py (4)
tilelang/language/allocate.py (1)
alloc_local(39-50)tilelang/language/copy.py (1)
copy(10-86)tilelang/jit/__init__.py (1)
compile(33-86)tilelang/transform/pass_config.py (1)
PassConfigKey(6-101)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-amd
- GitHub Check: format-check
🔇 Additional comments (2)
testing/python/language/test_tilelang_language_copy.py (2)
89-113: Compilation-only test for indirection pattern.This test validates compilation for the scalar copy case with indirection (
indices[pid]→idx[0]), but doesn't execute or verify runtime behavior.Is runtime validation intentionally skipped? If the indirection pattern is challenging to test end-to-end, consider documenting why (e.g., requires special setup, known limitation, etc.).
120-155: LGTM! Parallel copy test matches PR example.The test correctly validates the scalar copy lowering inside a parallel loop, matching the exact pattern shown in the PR description:
T.copy(A[by * block_M + i, bx * block_N + j], B[by * block_M + i, bx * block_N + j])The runtime assertion confirms output matches input within tolerances.
…n tilelang (tile-ai#946) - Introduced new functions for buffer load copy with stride and parallel execution. - Enhanced the copy logic in `copy.py` to simplify nested if statements for BufferLoad nodes. - Added corresponding test cases for the new buffer load functionalities.
will be lowered into
Summary by CodeRabbit
Bug Fixes
Tests