[AMD][Backend] Add OptimizeDescriptorEncoding pass for AMDGPU #9792
Conversation
- Introduces the pass, re-using `AssignDescriptorMemoryLayouts`
- Better handling of descriptor lowerings
- Moves shared encoding assignment for descriptor loads with dot operands from LowerLoops to this pass
- Fixes rank-reducing descriptor loads
- Fixes descriptors passed as kernel args
```cpp
auto encoding = type.getEncoding();
assert(isPaddedEncoding(encoding) &&
       "expected padded encoding or partitioned wrapping padded");
if (auto padded = dyn_cast<PaddedSharedEncodingAttr>(encoding)) {
```
how can a tensor have a padded layout?
We introduced this function to handle rank-reducing loads with descriptors. When lowering the TDM copy in LoadStoreOpToLLVM, we need the layout of the descriptor's block type to properly configure the TDM copy operation. This matters especially for rank-reducing loads, since it is the block type, not the result tensor, that carries the layout information TDM copy requires, and this function extracts its linear component. The padded encoding comes from the descriptor's block type.
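For concreteness, here is a minimal sketch of the helper being discussed (the name `getBlockLinearLayout` is hypothetical; `isPaddedEncoding` is the local check from the diff above, and `toLinearLayout` is the conversion entry point in `LinearLayoutConversions.h`):

```cpp
#include "triton/Dialect/TritonGPU/IR/LinearLayoutConversions.h"

using namespace mlir;

// Hypothetical helper: recover the linear layout of a descriptor's block
// type so it can drive the TDM copy configuration. For rank-reducing loads
// it is the block type, not the result tensor, that carries the full-rank
// shape and the padded shared encoding.
triton::LinearLayout getBlockLinearLayout(triton::TensorDescType descTy) {
  RankedTensorType blockTy = descTy.getBlockType();
  Attribute encoding = blockTy.getEncoding();
  assert(isPaddedEncoding(encoding) &&
         "expected padded encoding or partitioned wrapping padded");
  return triton::gpu::toLinearLayout(blockTy.getShape(), encoding);
}
```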
a tensor should not have a shared layout. We should check this in the verifier.
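A sketch of what that check could look like (assuming shared layouts implement a common `SharedEncodingTrait` attribute interface):

```cpp
// Hypothetical verifier snippet: tensor *values* must not carry a shared
// layout; shared layouts belong on memory descriptors such as the
// descriptor's block type.
if (auto tensorTy = dyn_cast<RankedTensorType>(value.getType()))
  if (isa_and_nonnull<triton::gpu::SharedEncodingTrait>(
          tensorTy.getEncoding()))
    return op->emitOpError("tensor values may not use a shared layout");
```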
The descriptor can have a shared layout to represent the destination layout, but surely the tensor wouldn't. I'm confused how this works.
AFAICT, the descriptor type just wraps a RankedTensorType inside, and we attach the planned shared memory encoding to that inner ranked tensor type; so seemingly we have a shared layout on a tensor type here. This is how it's done on the NVIDIA side too, as shown in the tests. For AMD we are just using `#ttg.padded_shared` instead of `#ttg.nvmma_shared`.
@sriakrish is OOO right now. I changed the API here to take the shape and encoding separately to make this a bit less confusing, given `LinearLayoutConversions.h` is more generic.
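Roughly, the call site goes from passing the encoded tensor type to passing the pieces separately (the old single-type overload shown for contrast is an assumption):

```cpp
// Before (assumed): the block type, a RankedTensorType carrying the shared
// encoding, was passed in wholesale.
triton::LinearLayout before = triton::gpu::toLinearLayout(blockTy);

// After: shape and encoding are separate arguments, so nothing has to read
// as "a tensor with a shared layout".
triton::LinearLayout after =
    triton::gpu::toLinearLayout(blockTy.getShape(), blockTy.getEncoding());
```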
Right, I think we may want to move the shared encoding attribute to be directly attached to the descriptor type itself to better disambiguate? I can create another refactoring pull request if we agree that's better.
I think maybe there is a misunderstanding: the `RankedTensorType` with a shared encoding is never actually the type of an IR value; it's just a component of the `TensorDescType`. I did it this way simply because `RankedTensorType` already has all the attributes as well as the printing and parsing logic. We can clean this up if you guys prefer, but it's purely cosmetic IMO.
E.g. currently the IR looks like `!tt.tensordesc<tensor<16x128xf32, #shared>>`, but instead it could be `!tt.tensordesc<16x128xf32, #shared>` and not involve `RankedTensorType` at all.
+1, sounds much better. I didn't realize we had the ranked tensor in there right now.
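For reference, a minimal sketch of how the wrapping works today per the discussion (the builder calls follow standard MLIR conventions; the exact `TensorDescType::get` signature is an assumption):

```cpp
// The inner RankedTensorType never types an IR value; it only bundles shape,
// element type, and shared encoding so the descriptor type can reuse its
// storage and printing/parsing logic.
auto blockTy = RankedTensorType::get({16, 128}, builder.getF32Type(),
                                     paddedSharedEncoding);
auto descTy = triton::TensorDescType::get(builder.getContext(), blockTy);
```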
antiagainst left a comment:
The major concern is peeled out to #9851, which will be resolved separately. I have reviewed the rest internally earlier, so I'll land this for now to move forward. If there are any further comments we can address them post-commit. :)
…-lang#9792)

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
New contributor declaration
- I am not making a trivial change, such as fixing a typo in a comment.
- I have written a PR description following these rules.
- I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - This PR does not need a test because `FILL THIS IN`.
- Select one of the following.
  - I have not added any `lit` tests.
  - The `lit` tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)