Linalgtiling #974
base: main
Conversation
Just some fly-by comments
As discussed, this is a temporary pass to expose the behaviour we want and to compare, in the future, against an upstream implementation of proper brgemm tiling for vectorization into XSMM-like loops.
It currently works on memrefs because it's easier to reason about pointers, but linalg tiling is done on tensors, and there's already a mechanism for that upstream (TilingInterface and tileLinalgOpImpl) which needs to be updated accordingly.
But this is a good comparison for the future upstream patches, both in implementation and performance, so it's good to have it in tpp-mlir.
Before we merge, here are a few requests:
- Add a benchmark for the tiling/vectorization by duplicating one of the GEMM mlir-gen benchmarks and adding tpp-opt with your new options to it. It should be slower than XSMM, but faster than plain vectorization.
- Rename the pass to brgemm-tiling, since this will not be expanded to all linalg ops and actual linalg tiling already exists upstream, at tensor level.
- Expand comments for most of the code; it is too dense to read through and follow the core logic.
- Rename variables and test files to brgemm instead of linalgOp or matmul, to make it clear what the pass is supposed to do.
Thanks!
lib/TPP/Transforms/LinalgTiling.cpp
Outdated
for (auto itrShapeM = finaltile.begin(); itrShapeM != finaltile.end();
     itrShapeM++, i++) {
  int index = swap_i[i] / boundariesOne[swap_i[i]];
  int offset = swap_i[i] / (finaltile.size() - 1);
It's unclear what index, offset, and indexTwo are here. Can you elaborate in a comment and perhaps name them better?
lib/TPP/Transforms/LinalgTiling.cpp
Outdated
    std::prev(innermostForLoop.getBody()->end(), 1));
auto clone = rewriter.clone(*linalgOp);
linalgOp.replaceAllUsesWith(clone);
if (linalgOp->use_empty())
After replacing all uses, why would the use list not be empty?
We get the error operand #0 does not dominate this use if we don't call use_empty().
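For context, here is a minimal C++ sketch of the clone/replace/erase idiom being discussed; the names mirror the excerpt above, but the erase step is an assumption about what the elided if-body does, not necessarily this PR's actual code:

#include "mlir/IR/PatternMatch.h"

// Clone the op at the rewriter's current insertion point (e.g. inside the
// new innermost loop), redirect all uses of the original op's results to the
// clone, and erase the original only once nothing refers to it anymore.
static void replaceWithClone(mlir::RewriterBase &rewriter,
                             mlir::Operation *linalgOp) {
  mlir::Operation *clone = rewriter.clone(*linalgOp);
  linalgOp->replaceAllUsesWith(clone->getResults());
  if (linalgOp->use_empty())
    rewriter.eraseOp(linalgOp);
}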
The code is too dense to review properly. This cannot be upstreamed anyway, so let's just go with it, see what IR we get for the next stage, and iterate.
Remember, when we want to upstream this, it will have to go in as an implementation of the tiling interface for these operations (batch/reduce/matmul), so we'll have to re-implement it from scratch in a completely different mindset.
But if this works out, it will serve as a baseline, so let's get going and reach that baseline.
This PR adds a linalg tiling pass to the pipeline.
The linalg tiling pass tiles the innermost dimensions of the batch-reduce matmul. After tiling, it swaps the BRGEMM loop and the K-loop, so the loop structure becomes M-loop -> N-loop -> BRGEMM-loop -> K-loop (batch-reduce matmul). The tiled code is then lowered for vectorization using the vector contract pass.
The goal is to perform vectorization in tpp-mlir itself without relying on libxsmm.
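To make the intended result concrete, below is a minimal plain-C++ sketch of that loop structure; the tile sizes MB/NB/KB, the row-major layouts, and all names are illustrative assumptions, not the pass's actual output:

// Batch-reduce GEMM: C[M][N] += sum over b of A[b][M][K] * B[b][K][N],
// tiled as M-loop -> N-loop -> BRGEMM-loop -> K-loop (assumes M % MB == 0, etc.).
void tiledBrgemm(int BATCH, int M, int N, int K, int MB, int NB, int KB,
                 const float *A, const float *B, float *C) {
  for (int m = 0; m < M; m += MB)           // M-loop over row tiles of C
    for (int n = 0; n < N; n += NB)         // N-loop over column tiles of C
      for (int b = 0; b < BATCH; ++b)       // BRGEMM (batch-reduce) loop
        for (int k = 0; k < K; k += KB)     // K-loop over the reduction dim
          // (MB x KB) * (KB x NB) micro-kernel accumulating into the C tile;
          // this innermost part is what the vector contract pass vectorizes.
          for (int mi = m; mi < m + MB; ++mi)
            for (int ni = n; ni < n + NB; ++ni)
              for (int ki = k; ki < k + KB; ++ki)
                C[mi * N + ni] +=
                    A[(b * M + mi) * K + ki] * B[(b * K + ki) * N + ni];
}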