Granularize toMemoryConfigOp #840
Conversation
Trying to understand this better. Is it true that the TTNN to_device op can only move a tensor from host to DRAM, or can it also put the tensor directly into an L1 sharded configuration?
/////////////////////////////////////////
// Unsupported eltwise ops with sharding
// * Concat: Sharded concat requires ROW_MAJOR layout
// * Softmax: Sharded softmax produces incorrect values
Do we have an issue on tt-metal to track this?
AFAIK we need to pass in a specific parameter SoftmaxShardedMultiCoreProgramConfig for this to work with sharded data.
Can you add an issue to our gitlab to track this, with all the details, and put a TODO with bug id in the comment?
It can put it into an L1 sharded configuration directly; the toDeviceOp accepts a memory config as a parameter. However, if the tensor is already on device and we want to move between L1/DRAM or sharded/interleaved, then we want to use the toMemoryConfig op.
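For illustration, a minimal sketch of the two paths described above against the TTNN C++ API; the tensor, device, and memory-config variables are hypothetical placeholders and the exact overloads may differ.

```cpp
// Hedged sketch, not the actual runtime code: hostTensor, device, and the two
// memory configs are hypothetical placeholders.

// Host -> device, landing directly in an L1 sharded configuration:
::ttnn::Tensor onDevice =
    ::ttnn::to_device(hostTensor, device, l1ShardedMemoryConfig);

// Tensor already on device: relayout between DRAM/L1 or sharded/interleaved:
::ttnn::Tensor relaid =
    ::ttnn::to_memory_config(onDevice, dramInterleavedMemoryConfig);
```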
Hey @jnie-TT, I'm not sure I understand what the intent of this PR is. There are several changes that I think are not the way we want to go - I'd like to zoom over this to better understand what we're trying to accomplish here.
In general, we should strive to model what TTNN does today - those are the best practices that will save us a lot of headaches going forward.
It handles:
- ROW_MAJOR to TILE
- TILE to ROW_MAJOR
Does this mean that it will error out if called for ROW_MAJOR -> ROW_MAJOR?
A toLayoutOp will not be created for ROW_MAJOR -> ROW_MAJOR; a toLayoutOp is only created when the input and output layouts differ.
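A minimal sketch of that guard, using illustrative helper and builder names rather than the exact tt-mlir API:

```cpp
// Sketch only: the op/attr builder names below are assumptions, not the
// verbatim tt-mlir API.
#include "mlir/Transforms/DialectConversion.h"

static mlir::Value maybeCreateToLayoutOp(mlir::ConversionPatternRewriter &rewriter,
                                         mlir::Location loc, mlir::Value input,
                                         mlir::Type outputType,
                                         ttnn::Layout inputLayout,
                                         ttnn::Layout outputLayout) {
  if (inputLayout == outputLayout) {
    // ROW_MAJOR -> ROW_MAJOR (or TILE -> TILE): no ttnn.to_layout op is created.
    return input;
  }
  return rewriter.create<ttnn::ToLayoutOp>(
      loc, outputType, input,
      ttnn::LayoutAttr::get(rewriter.getContext(), outputLayout));
}
```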
  }];

  let arguments = (ins AnyRankedTensor:$input,
                       TT_Device:$device,
Why the removal? TTNN lib has this arg, and it is used in some places.
When moving a tensor to device we should use the toDeviceOp; toLayoutOp should be used solely to tilize/untilize tensors. If there's too much overlap between the functionality of ops, it becomes hard to model in the compiler, so these boundaries should be set between ops.
I understand your reasoning, and from a high level it just makes so much sense. Similarly, I wanted to use ttnn::to_layout() to "set" all the tensor-related things in one go: dtype, memcfg, layout, device. However, it turned out not to work: when I called the op with all the params, I got an error along the lines of "dtype cannot be specified when converting to row_major".
It's not ideal that the APIs have hidden issues like that. We should ask for fixes, but I'm pretty sure they won't all get fixed relatively fast, so I don't want us to bet on that. What we can do today is copy their behaviour in order to minimize the risk of hitting similar issues.
This is exactly what I tried to do in #624 and #747. I went and talked to metal folks to understand how they do it for their scenarios. These codepaths are the most tested, and there's less chance that something will silently break on their end. If we resort to doing things the way we think they should be done, I'm afraid we'll end up hitting random errors, as those paths are not as well tested (if at all).
In #624, we copy the behaviour of the from_torch method, as proposed by the metal folks. In that method specifically, the to_layout function is called with the device parameter supplied each time. While the device object might be None in some cases, it's not always going to be, and we should not model the to_layout API so that we exclude the parameter.
In general, as dialect writers, we want to model their APIs as closely as possible, and let the users of the dialect decide how to use the APIs (which is usually us, and the top half of the comment talks about that).
def TTNN_TensorMemoryLayout_HeightSharded : I32EnumAttrCase<"HeightSharded", 2, "height_sharded">;
def TTNN_TensorMemoryLayout_WidthSharded : I32EnumAttrCase<"WidthSharded", 3, "width_sharded">;
def TTNN_TensorMemoryLayout_BlockSharded : I32EnumAttrCase<"BlockSharded", 4, "block_sharded">;
def TTNN_TensorMemoryLayout_None : I32EnumAttrCase<"None", 0, "none">;
In the TTNN lib, the enum TensorMemoryLayout has no value None.
The idea behind the TTNN dialect is to model the TTNN lib as faithfully as possible. Adding an enum value in the dialect that doesn't exist in the lib will create issues down the road - ttrt and emitc have to implement special logic to check whether the value is None and decide what to do with it.
With this change, all host tensors should have a None tensor memory layout. It wouldn't make sense for host tensors to have other memory layouts. If you look at the get_memory_config API in ttnn, it models host tensors with a std::nullopt tensor memory layout (wrapped in the memory config). This mimics that behaviour:
inline std::optional<ttnn::MemoryConfig> get_memory_config(const ttnn::Tensor& tensor) {
if (not tensor.is_allocated() or not is_tensor_on_device_or_multidevice(tensor)) {
return std::nullopt;
}
return tensor.memory_config();
}
Once we have the TTNN-specific tensor type, we can model this better by, say, making the memConfigAttr of the tensor optional or setting some default.
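A small usage sketch of the behaviour being mimicked; the `tensor` variable here is hypothetical.

```cpp
// Host tensors report no memory config; device tensors report their actual one.
// `tensor` is a hypothetical ttnn::Tensor.
std::optional<ttnn::MemoryConfig> memCfg = get_memory_config(tensor);
if (!memCfg.has_value()) {
  // Host tensor: no tensor memory layout applies -- this is what the dialect's
  // None value currently stands in for.
} else {
  // Device tensor: the memory config carries the real interleaved/sharded layout.
}
```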
We should be able to do that - remove memCfg from the layout attribute of the tensor. @mtopalovicTT is actively working on adding a TTNN-specific layout attr (currently we only have TT_LayoutAttr). #701
@mtopalovicTT can you give any updates/timelines on this? Do you see any issues in supporting this case?
@@ -49,6 +49,18 @@ static Value getOrInsertDevice(ConversionPatternRewriter &rewriter,
  return deviceOp.getResult();
}

static DataType getDataTypeFromMemRef(mlir::MemRefType memref) {
Good candidate for include/ttmlir/Dialect/TTNN/Utils/Utils.h / lib/Dialect/TTNN/Utils/Utils.cpp.
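A rough sketch of what that shared helper could look like once moved; the tile-type branch and the elementTypeToDataType helper are assumptions about the surrounding tt-mlir code rather than the verified API.

```cpp
// Sketch of the helper after the suggested move to TTNN/Utils; TileType and
// elementTypeToDataType usage are assumptions, not verified tt-mlir API.
#include "mlir/IR/BuiltinTypes.h"

inline mlir::tt::DataType getDataTypeFromMemRef(mlir::MemRefType memref) {
  mlir::Type elementType = memref.getElementType();
  if (auto tileType = mlir::dyn_cast<mlir::tt::TileType>(elementType)) {
    // Tiled memrefs carry their data type on the tile element type.
    return tileType.getDataType();
  }
  // Plain row-major memrefs: map the MLIR element type to a TT DataType.
  return mlir::tt::elementTypeToDataType(elementType);
}
```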
// are required to achieve an arbitrary layout, namely:
// - ToLayoutOp: to set the layout (ROW_MAJOR, TILE) of the tensor
// - TypecastOp: to change the data type of the tensor
// - ToDeviceOp: to move the tensor to a specific device
// - FromDeviceOp: to move the tensor from a specific device to host
// - ToMemoryConfigOp: to set the memory configuration (dram, l1, interleaved,
//   sharded) of the tensor
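For context, a hedged sketch of how these granular ops roughly correspond to TTNN library calls at runtime; the variables are placeholders and the exact overloads (especially for the experimental typecast) may differ.

```cpp
// Hedged sketch, not generated runtime code; hostTensor, device, and the memory
// configs are hypothetical placeholders, and exact signatures may differ.
::ttnn::Tensor t = hostTensor;
t = ::ttnn::to_device(t, device, dramInterleavedConfig);            // ToDeviceOp
t = ::ttnn::to_layout(t, ::ttnn::TILE_LAYOUT, std::nullopt,         // ToLayoutOp
                      std::nullopt, device);
t = ::ttnn::experimental::typecast(t, ::ttnn::DataType::BFLOAT16);  // TypecastOp
t = ::ttnn::to_memory_config(t, l1ShardedConfig);                   // ToMemoryConfigOp
t = ::ttnn::from_device(t);                                         // FromDeviceOp
```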
Let's zoom about this, I'd like to understand what the intent here is.
Sure I'd be happy to zoom. In the meantime hopefully this comment clarifies things a bit.
I don't want us to granularize this way, for reasons mentioned in one of the comments above - we should model the from_torch approach, which is what the current code tries to do. Additionally, this piece of the change introduces a lot more code, which I find hard to read and maintain.
Can you post a list of specific scenarios that we want to support but are not currently supported?
@@ -376,7 +454,7 @@ class ToLayoutOpConversionPattern
  attrs.push_back(convertLayoutAttr(rewriter, srcOp.getLayoutAttr()));
  attrs.push_back(createStdNullopt(rewriter));
  attrs.push_back(createStdNullopt(rewriter));
- attrs.push_back(mlir::IntegerAttr::get(rewriter.getIndexType(), 1));
+ attrs.push_back(createNullDevicePointer(rewriter));
Why this change?
Because toLayoutOps shouldn't be moving tensors to device - that's the responsibility of toDeviceOps. Therefore we set the device pointer to null.
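In other words, the lowered call is intended to be a pure tilize/untilize. A hedged sketch of the shape of that call (the exact emitted code differs):

```cpp
// Tilize only: no dtype change, no memory-config change, and crucially no
// device move -- the device pointer is null, matching the attribute change above.
::ttnn::Tensor tilized = ::ttnn::to_layout(
    input, ::ttnn::TILE_LAYOUT,
    /*dtype=*/std::nullopt,
    /*memory_config=*/std::nullopt,
    /*device=*/static_cast<::ttnn::Device *>(nullptr));
```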
@svuckovicTT this essentially builds on top of the changes added in #624 that replaced toMemoryConfigOp with more granular ops. To summarize, the current state/issues of the memory ops are:
This change attempts to fix the above issues:
The way I see it, these issues should be solved eventually, and should be solved before any further multi-chip changes go in. Otherwise we would need to update a fromDeviceOp that isn't being used (and therefore cannot be tested easily), a compound toMemoryConfigOp that checks everything (so tons of updates would need to go in, which would probably get removed anyway when we granularize it), and a toDeviceOp + toLayoutOp that have mixed boundaries and are also not properly checked.
const bool transposeA = false;
const bool transposeB = false;
const std::optional<const ::tt::tt_metal::MemoryConfig> memoryConfig =
    std::make_optional(outputMemoryConfig);
const std::optional<const ::ttnn::DataType> dtype =
    std::make_optional(outputDataType);
const std::optional<const ::ttnn::operations::matmul::MatmulProgramConfig>
    programConfig = std::nullopt;
const std::optional<const std::string> activation = std::nullopt;
const std::optional<const ::ttnn::DeviceComputeKernelConfig>
    computeKernelConfig = std::nullopt;
const std::optional<const ::ttnn::CoreGrid> coreGrid = std::nullopt;

::ttnn::Tensor out =
    ::ttnn::matmul(lhs, rhs, transposeA, transposeB, memoryConfig, dtype,
                   programConfig, activation, computeKernelConfig, coreGrid);
If we need to do this for matmul (let's first prove that we do), we should reflect it in the dialect itself. That way, EmitC can work as well.
                     TTNN_LayoutAttr:$layout);
  let results = (outs AnyRankedTensor:$result);
}

def TTNN_TypecastOp : TTNN_Op<"typecast"> {
The ttnn::experimental::typecast op is a DPS op; we should adjust for that.
@@ -355,6 +360,79 @@ class FromDeviceOpConversionPattern
  }
};

// TypecastOp conversion pattern
//
class TypecastOpConversionPattern
The op lives in the ttnn::experimental namespace; you can override getPrefixSwapPattern(). GetDeviceOpConversionPattern has an example.
class ToLayoutOpConversionPattern
    : public OpConversionPattern<ttir::ToLayoutOp> {
public:
  using OpConversionPattern<ttir::ToLayoutOp>::OpConversionPattern;

  LogicalResult
  matchAndRewrite(ttir::ToLayoutOp op, OpAdaptor adaptor,
My ask here would be to try to migrate this decomposition of ToLayoutOp into a separate TTNN pass that runs after lowering, keeping the lowering simple and clean. This would also let the optimizer work with a single memory-layout-manipulation op (for example ttnn::ToLayoutOp), while a separate pass properly decomposes it.
Added a separate pass after the optimizer that decomposes the composite toLayout op, instead of doing the decomposition during the lowering stage. Updated the optimizer accordingly.
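A minimal sketch of what such a post-optimizer decomposition pass could look like, assuming an illustrative pass and pattern name rather than the actual tt-mlir ones:

```cpp
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/Pass.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;

namespace {
// Illustrative only: runs after the optimizer and rewrites each composite
// to-layout op into the granular to_device/to_layout/typecast/to_memory_config ops.
struct DecomposeCompositeToLayoutPass
    : public PassWrapper<DecomposeCompositeToLayoutPass, OperationPass<ModuleOp>> {
  StringRef getArgument() const override {
    return "ttnn-decompose-composite-to-layout"; // hypothetical pipeline name
  }
  void runOnOperation() override {
    RewritePatternSet patterns(&getContext());
    // patterns.add<CompositeToLayoutDecompositionPattern>(&getContext());
    if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))
      signalPassFailure();
  }
};
} // namespace
```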
@jnie-TT Looks great on a first pass, will take a more detailed look on Monday morning.
Looks good, please proceed.
@@ -27,13 +27,36 @@ def TTNN_GetDeviceOp : TTNN_Op<"get_device"> {
  let results = (outs TT_Device:$device);
}

def TTNN_CompositeToLayoutOp : TTNN_Op<"composite_to_layout"> {
Is this going to be permanent? Perhaps we can replace this with ToLayout? As I understand it, the TTNN folks plan to make ttnn::to_layout flexible enough to support all kinds of layout transitions. In the meantime, we can specialize ToLayout to the respective lowered op.
@nsmithtt that's a good point. I guess I can rename this op to ToLayout, and then rename the downstream decomposed ToLayout to something like TTNN_TilizeUntilize. Later, when ttnn::to_layout supports all combinations of parameters, we can simply remove the downstream decomposition pass and use this op instead.
Yeah, if you're going to change the name, it's a good idea to rename the downstream decomposed op to something else.
Or we just leave it? The initial conversion / optimizer just targets:
- ToLayout

Then, in the short term, we might have to lower ToLayout to one of:
- ToMemoryConfig
- Typecast
- etc.
- Or just leave it as ToLayout, if it's supported.

Over time, we can just erase the specialized lowering for cases where ToLayout has added support. I don't think we should introduce dummy ops if we don't need to.
I think I see what you're saying. So we have a ToLayout op that captures all the parameters up front, effectively doing what the current CompositeToLayoutOp does (maybe with a couple of parameters made optional, since ttnn::to_layout might not support them yet), and we would just use the same op for tilize/untilize. Eventually, we can merge the decomposed ops back into this op as ttnn supports them, and make the corresponding arguments required instead of optional. Does that sound right?
I don't have a hard preference here, but leaving the same name is, in my opinion, bug-prone. Reordering passes may create hard-to-debug issues with an op whose meaning differs depending on where it's used.
@jnie-TT yeah that's right.
@nobradovictt, I agree it's not ideal, but neither is introducing a dummy op, especially if the eventual semantics of the ToLayout op are exactly how we want to use it. Think of it as a workaround: we lower to ToLayout, and workaround passes at the end get applied that specialize out all the cases ToLayout doesn't support yet.
+1 on having the same name for the op. One of the invariants of the TTNN dialect that we want to adhere to is that it should map to the TTNN lib as closely as possible. Having an op that doesn't exist in the TTNN lib breaks this invariant.
For practicality reasons, we could have a flag on the op itself that lets us know whether the op has been "decomposed" or not. If anyone feels there's a need for it today, feel free to open an issue and assign it to me.
@svuckovicTT I could make the same argument against having an op with the same name as in the TTNN lib that carries a completely different meaning (it pushes the TTNN dialect further from the TTNN lib). I personally find it even worse - it's a "wrong result", but insidious. It's easy to assert that a dummy op is not expected in the final IR, but if I show you IR with a toLayout op, will you be able to tell me whether it went through the decomposition pass?
But as this is a temp workaround, and more of you seem to prefer the same-name option, I'm fine with that.
I'll create a separate issue to track this.
bool isCompositeToLayoutOp(mlir::Operation *op) {
  return isa<ttnn::CompositeToLayoutOp>(op);
}
No need for this util function anymore since it is a single op again.
@@ -98,7 +121,7 @@ void registerTTNNPipelines() {
  mlir::PassPipelineRegistration<
      mlir::tt::ttnn::TTIRToTTNNBackendPipelineOptions>(
      "ttir-to-ttnn-backend-pipeline",
-     "Pipeline lowering ttir to ttmetal backend.",
+     "Pipeline lowering ttir to ttnn backend.",
Nice catch :)
Hey @jnie-TT, thanks for taking the time to rework the ToLayout conversion, I think this is much cleaner now :)
Looks good from the dialect's perspective; can you please make sure to add the missing issues? (layout=none, typecast to DPS, and dtype=none)
Great work!
UInt32 = 9,
UInt16 = 10,
UInt8 = 11,
None = 0,
Similar to TTNN_TensorMemoryLayout_None, we should create an issue to track the removal of this.
        ttnn::BufferTypeAttr::get(op.getContext(), bufferType));
if (newOutputLayoutEnum == ttnn::Layout::Tile) {
  TileType tileType =
      TileType::get(rewriter.getContext(), {32, 32}, outputDtype);
{32, 32} - can you use mlir::tt::ttnn::TTNN_TILE_HEIGHT and mlir::tt::ttnn::TTNN_TILE_WIDTH instead?
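For reference, a sketch of the suggested change, assuming TTNN_TILE_HEIGHT/TTNN_TILE_WIDTH are the dialect's named tile-dimension constants as the comment indicates:

```cpp
// Named tile-dimension constants instead of the magic {32, 32}.
TileType tileType = TileType::get(
    rewriter.getContext(),
    {mlir::tt::ttnn::TTNN_TILE_HEIGHT, mlir::tt::ttnn::TTNN_TILE_WIDTH},
    outputDtype);
```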
TileType tileType =
    TileType::get(rewriter.getContext(), {32, 32}, outputDtype);
llvm::SmallVector<int64_t> newShardShape =
    tileType.getTiledShape(llvm::SmallVector<int64_t>(
llvm::SmallVector<int64_t> - is the shard shape always (or most of the time) 2 values? If so, we should use llvm::SmallVector<int64_t, 2>.
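A sketch of that suggestion; `shardShape` stands in for the argument elided in the snippet above, and getTiledShape is assumed to accept an ArrayRef-compatible argument.

```cpp
// Shard shapes are (height, width) pairs, so an inline capacity of 2 avoids a
// heap allocation in the common case. `shardShape` is a hypothetical placeholder
// for the elided argument above.
llvm::SmallVector<int64_t, 2> newShardShape =
    tileType.getTiledShape(llvm::SmallVector<int64_t, 2>(shardShape.begin(),
                                                         shardShape.end()));
```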
+1 on having the same name for the op. One of the invariants of the TTNN dialect that we want to adhere to is that it should map to the TTNN lib as closely as possible. Having an op that doesn't exist in the TTNN lib breaks this invariant.
For practicality reasons, we could have a flag on the op itself that lets us know whether the op has been "decomposed" or not. If anyone feels there's a need for it today, feel free to open an issue and assign it to me.
For sure, I'll create separate issues for this to track. Thanks!
Introduce 5 ops that cover the functionality of the previous toMemoryConfigOp: ToLayoutOp, TypecastOp, ToDeviceOp, FromDeviceOp, and ToMemoryConfigOp.
Updated downstream consumers (TTNN to flatbuffer, TTNN Runtime, TTNNToEmitC)
Creating these ops in combination is a challenge, as there are several (known) constraints:
This PR also fixes the tensor type when forcing row_major/tile layouts. Previously, when forcing a tile/row_major layout, the output tensor's layout/shard shape/element type were not changed, so the runtime had to hack tilize/untilize logic instead of trusting the tile shape.
Also added a shardSpec attribute to the memoryConfigAttr, because the optimizer needs it for the reshard implementation.
Another goal here is to make adding multi-chip changes easier: without this change we are currently missing a couple of ops and hacking toMemoryConfig ops in the compiler, so I would need to update the compound toMemoryConfig op to support multi-chip, only for that work to be removed anyway.
Here's a simple example to illustrate a couple of issues and how they are fixed with this PR.
On main we have:
Notice we are hard-coding to_layout to tilize on host (to_layout always comes before to_device), and we are hacking in a ttnn.to_memory_config at the end to move data from device to host and do all the conversions necessary at the end (typecasting the data type, untilizing, to_cpu, etc.). There's currently no ability to typecast or convert the memory config in the middle of the program - everything happens at the end when we're done.
Also notice the to_layout op that produces tensor %1:
%1 = "ttnn.to_layout"(%arg0, %0) <{layout = #ttnn.layout<tile>}> : (tensor<512x1024xbf16, #layout>, !tt.device<#device>) -> tensor<512x1024xbf16, #layout1>
In the parameters of the op we see ttnn.layout<tile>, which means this op is trying to tilize the tensor. However, notice the layout of the output tensor #layout1... memref<512x1024xbf16, #dram>: the element type is not tilized; it is still row_major. This causes a mismatch between the op parameter and the output tensor type, and further downstream all consumers of this tensor will assume the tensor is not tilized when inspecting its element type.
With the fixes in this PR we have:
Now, notice the first to_layout op is inserted after the to_device op, so we are indeed tilizing on device, because the data format is bf16. Also notice there is no hacked to_memory_config at the end when the ops are finished. Instead, we call a from_device op that reads from device and a to_layout op that untilizes on host, and finally return the tensor. Also notice the first to_layout op:
%2 = "ttnn.to_layout"(%1) <{layout = #ttnn.layout<tile>}> : (tensor<512x1024xbf16, #layout1>) -> tensor<512x1024xbf16, #layout1>
The output tensor layout is #layout1, which is now indeed in tile format: memref<1x1x!tt.tile<32x32, bf16>, which is aligned with the parameter/goal of the op. Now all downstream consumers will see a tilized tensor.
Misc:
Synced with @svuckovicTT, we concluded that we should:
- Keep TensorMemoryLayout::None for now. This won't be needed once we have the TTNN-specific tensor type in our dialect, as we could just make the memory config optional.

Additional changes: