Granularize toMemoryConfigOp #840

Merged 1 commit on Oct 28, 2024

Conversation

@jnie-TT jnie-TT commented Sep 28, 2024

Introduce 5 ops that cover the functionality of the previous toMemoryConfigOp:

  • ToDeviceOp: Move data to device
  • FromDeviceOp: Move data from device to host
  • ToLayoutOp: Tilize/untilize data
  • TypecastOp: Typecast data
    • FYI @mmanzoorTT I updated the TTNN implementation of the typecast op, giving it its own conversion pattern. In TTNN, typecast does not belong to eltwise unary, which is why you had to add a separate method for running the op in the runtime. The StableHLO-to-TTIR interface remains unchanged.
  • ToMemoryConfigOp: Solely converts the memory config, including DRAM <-> L1 movement and interleaved <-> sharded conversion

Updated downstream consumers (TTNN to flatbuffer, TTNN Runtime, TTNNToEmitC)

Creating these ops in combination is a challenge, as there are several (known) constraints:

  • Tilizing on device requires bf16
  • Typecasting on device requires TILIZED data
  • Untilizing on device requires width % 2 == 0 and page size > sizeof(uint32_t).
    • Here it seems very hard to update the page size. Some approaches that might work are typecasting or padding; however, padding -> untilize -> unpad seems tricky, and typecasting requires tilized data.
    • Therefore, for now we will always untilize on host (which is what we do currently). Since we normally only untilize at the end of the program, when moving data from device to host, the perf hit should be acceptable; we rarely have device-to-device untilize cases. A rough sketch of these placement decisions follows this list.
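Below is a minimal sketch (not code from this PR) of the placement decisions these constraints imply; the helper names and the local DataType enum exist purely for illustration.

enum class DataType { BFloat16, Float32, UInt32 }; // illustrative subset only

// Where each conversion can legally run, per the constraints listed above.
bool canTilizeOnDevice(DataType dtype) {
  // Tilizing on device requires bf16 data.
  return dtype == DataType::BFloat16;
}

bool canTypecastOnDevice(bool isTilized) {
  // Typecasting on device requires tilized data.
  return isTilized;
}

bool shouldUntilizeOnHost() {
  // Device untilize has width/page-size restrictions, so for now we always
  // untilize on host; it is normally only needed at the end of the program.
  return true;
}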

This PR also fixes the tensor type when forcing row_major/tile layouts. Previously, when forcing a tile/row_major layout, the output tensor's layout, shard shape, and element type were not changed, so the runtime had to hack in tilize/untilize logic instead of trusting the tile shape. The sketch below shows the idea.
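The core of the fix can be sketched as follows, built around the TileType::get / getTiledShape calls that appear in the diff snippets quoted later in this thread; the wrapper function itself, its signature, and the includes are assumptions, not the PR's actual code.

#include <utility>
#include "llvm/ADT/SmallVector.h"
// (tt-mlir dialect headers for TileType/DataType omitted in this sketch)

std::pair<mlir::Type, llvm::SmallVector<int64_t>>
forceTileElementType(mlir::MLIRContext *ctx,
                     llvm::ArrayRef<int64_t> shardShape,
                     mlir::tt::DataType dtype) {
  // The element type becomes a 32x32 tile of the requested data type...
  mlir::tt::TileType tileType = mlir::tt::TileType::get(ctx, {32, 32}, dtype);
  // ...and the shard shape is re-expressed in tiles rather than scalars, so
  // downstream consumers can trust the tensor type instead of re-deriving
  // tilization in the runtime.
  llvm::SmallVector<int64_t> tiledShape = tileType.getTiledShape(
      llvm::SmallVector<int64_t>(shardShape.begin(), shardShape.end()));
  return {tileType, tiledShape};
}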

Also added a shardSpec attribute to the memoryConfigAttr because the optimizer needs it for its reshard implementation.

Another goal is to make the upcoming multi-chip changes easier: without this change we are missing a couple of ops and are hacking toMemoryConfig ops in the compiler, so I would need to update the compound toMemoryConfig op to support multi-chip, only for it to be removed later anyway.

Here's a simple example to illustrate a couple of issues and how they are fixed by this PR.
On main we have:

#layout = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<512x1024xbf16, #system>>
#layout1 = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<512x1024xbf16, #dram>, interleaved>
module attributes {tt.device = #device, tt.system_desc = #system_desc} {
  func.func @softmax(%arg0: tensor<512x1024xbf16, #layout>) -> tensor<512x1024xbf16, #layout> {
    %0 = "ttnn.get_device"() <{mesh_shape = #ttnn<mesh_shape 1x1>}> : () -> !tt.device<#device>
    %1 = "ttnn.to_layout"(%arg0, %0) <{layout = #ttnn.layout<tile>}> : (tensor<512x1024xbf16, #layout>, !tt.device<#device>) -> tensor<512x1024xbf16, #layout1>
    %2 = "ttnn.to_device"(%1, %0) <{memory_config = #ttnn.memory_config<<interleaved>, <dram>>}> : (tensor<512x1024xbf16, #layout1>, !tt.device<#device>) -> tensor<512x1024xbf16, #layout1>
    "ttnn.dealloc"(%1) : (tensor<512x1024xbf16, #layout1>) -> ()
    %3 = "ttnn.softmax"(%2) <{dimension = 1 : si32}> : (tensor<512x1024xbf16, #layout1>) -> tensor<512x1024xbf16, #layout1>
    "ttnn.dealloc"(%2) : (tensor<512x1024xbf16, #layout1>) -> ()
    %4 = "ttnn.softmax"(%3) <{dimension = -1 : si32}> : (tensor<512x1024xbf16, #layout1>) -> tensor<512x1024xbf16, #layout1>
    "ttnn.dealloc"(%3) : (tensor<512x1024xbf16, #layout1>) -> ()
    %5 = "ttnn.to_memory_config"(%4, %0) : (tensor<512x1024xbf16, #layout1>, !tt.device<#device>) -> tensor<512x1024xbf16, #layout>
    "ttnn.dealloc"(%4) : (tensor<512x1024xbf16, #layout1>) -> ()
    return %5 : tensor<512x1024xbf16, #layout>
  }
}

Notice that we hard-code to_layout to tilize on host (to_layout always comes before to_device), and we hack in a ttnn.to_memory_config at the end to move data from device to host and do all the necessary conversions there (typecasting the data type, untilizing, to_cpu, etc.). There is currently no way to typecast or convert the memory config in the middle of the program - everything happens at the end when we're done.
Also notice the to_layout op that produces tensor %1: %1 = "ttnn.to_layout"(%arg0, %0) <{layout = #ttnn.layout<tile>}> : (tensor<512x1024xbf16, #layout>, !tt.device<#device>) -> tensor<512x1024xbf16, #layout1>. In the op's parameters we see ttnn.layout<tile>, which means this op is trying to tilize the tensor. However, look at the layout of the output tensor #layout1... memref<512x1024xbf16, #dram>: the element type is not tilized, it is still row_major. This causes a mismatch between the op parameter and the output tensor type, and further downstream, all consumers of this tensor will assume the tensor is not tilized when inspecting its element type.

With the fixes in this PR we have:

#layout = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<512x1024xbf16, #system>>
#layout1 = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<1x1x!tt.tile<32x32, bf16>, #dram>, interleaved>
#layout2 = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<512x1024xbf16, #dram>, interleaved>
module attributes {tt.device = #device, tt.system_desc = #system_desc} {
  func.func @softmax(%arg0: tensor<512x1024xbf16, #layout>) -> tensor<512x1024xbf16, #layout> {
    %0 = "ttnn.get_device"() <{mesh_shape = #ttnn<mesh_shape 1x1>}> : () -> !tt.device<#device>
    %1 = "ttnn.to_device"(%arg0, %0) <{memory_config = #ttnn.memory_config<<interleaved>, <dram>, <<1x1>>>}> : (tensor<512x1024xbf16, #layout>, !tt.device<#device>) -> tensor<512x1024xbf16, #layout1>
    %2 = "ttnn.to_layout"(%1) <{layout = #ttnn.layout<tile>}> : (tensor<512x1024xbf16, #layout1>) -> tensor<512x1024xbf16, #layout1>
    "ttnn.dealloc"(%1) : (tensor<512x1024xbf16, #layout1>) -> ()
    %3 = "ttnn.softmax"(%2) <{dimension = 1 : si32}> : (tensor<512x1024xbf16, #layout1>) -> tensor<512x1024xbf16, #layout2>
    "ttnn.dealloc"(%2) : (tensor<512x1024xbf16, #layout1>) -> ()
    %4 = "ttnn.softmax"(%3) <{dimension = -1 : si32}> : (tensor<512x1024xbf16, #layout2>) -> tensor<512x1024xbf16, #layout2>
    "ttnn.dealloc"(%3) : (tensor<512x1024xbf16, #layout2>) -> ()
    %5 = "ttnn.from_device"(%4) : (tensor<512x1024xbf16, #layout2>) -> tensor<512x1024xbf16, #layout>
    "ttnn.dealloc"(%4) : (tensor<512x1024xbf16, #layout2>) -> ()
    %6 = "ttnn.to_layout"(%5) <{layout = #ttnn.layout<row_major>}> : (tensor<512x1024xbf16, #layout>) -> tensor<512x1024xbf16, #layout>
    "ttnn.dealloc"(%5) : (tensor<512x1024xbf16, #layout>) -> ()
    return %6 : tensor<512x1024xbf16, #layout>
  }
}

Now, notice that the first to_layout op is inserted after the to_device op, so we are indeed tilizing on device, because the data format is bf16. Also notice there is no hacked-in to_memory_config at the end of the program. Instead, we call a from_device op that reads from device and a to_layout op that untilizes on host, and finally return the tensor. Also look at the first to_layout op: %2 = "ttnn.to_layout"(%1) <{layout = #ttnn.layout<tile>}> : (tensor<512x1024xbf16, #layout1>) -> tensor<512x1024xbf16, #layout1>. The output tensor layout is #layout1, which is now indeed in tile format: memref<1x1x!tt.tile<32x32, bf16>, which is aligned with the parameter/goal of the op. Now all downstream consumers will see a tilized tensor.

Misc:
Synced with @svuckovicTT, we concluded that we should:

  • Break down the big if statement into separate methods (Done; each case now has its own function)
  • Model the to_layout op to include all TTNN parameters as optional (Done in this PR)
  • Convert the typecast op to DPS; a separate issue will be created for this so the PR doesn't bloat any further
  • Create an issue that tracks TensorMemoryLayout::None. This won't be needed once we have a TTNN-specific tensor type in our dialect, since we could then just make the memory config optional

Additional changes:

@odjuricicTT odjuricicTT left a comment

Trying to understand this better. Is it true that TTNN to_device op can only move from host to dram, or can it also put the tensor directly into an L1 sharded configuration?

/////////////////////////////////////////
// Unsupported eltwise ops with sharding
// * Concat: Sharded concat requires ROW MAJOR layout
// * Softmax: Sharded softmax produces incorrect values
Contributor

Do we have an issue on tt-metal to track this?

Contributor Author

AFAIK we need to pass in a specific parameter SoftmaxShardedMultiCoreProgramConfig for this to work with sharded data.

Contributor

Can you add an issue to our gitlab to track this, with all the details, and put a TODO with bug id in the comment?

lib/Conversion/TTIRToTTNN/TTIRToTTNN.cpp (outdated comment, resolved)

jnie-TT commented Sep 30, 2024

Trying to understand this better. Is it true that TTNN to_device op can only move from host to dram, or can it also put the tensor directly into an L1 sharded configuration?

It can put it in an L1 sharded configuration directly; the toDeviceOp accepts a memory config as a parameter. However, if the tensor is already on device and we want to move between L1/DRAM or convert between sharded/interleaved, then we use the toMemoryConfig op.
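A hedged runtime-side sketch of that split, assuming the ttnn C++ APIs keep roughly the shapes used elsewhere in this PR (to_device taking a device pointer plus a memory config, and to_memory_config reconfiguring an already-on-device tensor); the wrapper function names are hypothetical:

// Host -> device placement: can land directly in DRAM or an L1 sharded config.
::ttnn::Tensor putOnDevice(const ::ttnn::Tensor &hostTensor,
                           ::ttnn::Device *device,
                           const ::tt::tt_metal::MemoryConfig &memConfig) {
  return ::ttnn::to_device(hostTensor, device, memConfig);
}

// Device-resident moves (DRAM <-> L1, interleaved <-> sharded) go through
// to_memory_config instead of to_device.
::ttnn::Tensor reconfigureOnDevice(const ::ttnn::Tensor &deviceTensor,
                                   const ::tt::tt_metal::MemoryConfig &memConfig) {
  return ::ttnn::to_memory_config(deviceTensor, memConfig);
}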

@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch from b006356 to 37d60fd on September 30, 2024 15:51
@svuckovicTT svuckovicTT left a comment

Hey @jnie-TT, I'm not sure I understand what the intent of this PR is. There are several changes that I think are not the way we want to go - I'd like to zoom over this to better understand what it is we're trying to accomplish here.

In general, we should strive to model what TTNN does today - those are the best practices that will save us a lot of headaches going forward.

Comment on lines +53 to +71
It handles:
- ROW_MAJOR to TILE
- TILE to ROW_MAJOR
Contributor

Does this mean that it will error out if called for ROW_MAJOR -> ROW_MAJOR?

Contributor Author

A toLayoutOp will not be created for ROW_MAJOR -> ROW_MAJOR; a toLayoutOp is created only when the input and output layouts differ.

}];

let arguments = (ins AnyRankedTensor:$input,
TT_Device:$device,
Contributor

Why the removal? TTNN lib has this arg, and it is used in some places.

Contributor Author

When moving a tensor to device we should use the toDeviceOp; toLayoutOp should be used solely to tilize/untilize tensors. If there is too much overlap between the functionality of ops, it becomes hard to model in the compiler, so these boundaries should be set between ops.

Contributor

I understand your reasoning, and from a high level, it just makes so much sense. Similarly, I wanted to use ttnn::to_layout() to "set" all the things tensor related in one go: dtype, memcfg, layout, device. However, it turned out not to work: when I called the op with all the params, I got an error that was something like "dtype cannot be specified when converting to row_major".

It's not ideal that the APIs have hidden issues like that. We should ask for fixes, but I'm pretty sure that they won't all get fixed relatively fast, so I don't want us to bet on that. What we can do today is to copy their behaviour in order to minimize the risk of hitting similar issues.

This is exactly what I tried to do in #624 and #747. I went and talked to metal folks to understand how they do it for their scenarios. These codepaths are the most tested, and there's less chance that something will silently break on their end. If we resort to doing things the way we think they should be done, I'm afraid we'll end up hitting random errors, as those paths are not as well tested (if at all).

In #624, we copy the behaviour from the from_torch method, as proposed by metal folks. In that method specifically, the to_layout function is called with the device parameter supplied each time. While the device object might be None in some cases, it's not always going to be, and we should not model the to_layout API so that we exclude the parameter.

In general, as dialect writers, we want to model their APIs as close as possible, and let the users of the dialect decide on how to use the APIs (which is usually us, and the top half of the comment talks about that).

def TTNN_TensorMemoryLayout_HeightSharded : I32EnumAttrCase<"HeightSharded", 2, "height_sharded">;
def TTNN_TensorMemoryLayout_WidthSharded : I32EnumAttrCase<"WidthSharded", 3, "width_sharded">;
def TTNN_TensorMemoryLayout_BlockSharded : I32EnumAttrCase<"BlockSharded", 4, "block_sharded">;
def TTNN_TensorMemoryLayout_None : I32EnumAttrCase<"None", 0, "none">;
Contributor

In the TTNN lib, the enum TensorMemoryLayout has no value None.

The idea behind the TTNN dialect is to model the TTNN lib as faithfully as possible. Adding an enum value in the dialect that doesn't exist in the lib will create issues down the road - ttrt and emitc have to implement special logic to check whether the value is None and decide what to do with it.

@jnie-TT jnie-TT Oct 2, 2024

With this change, all host tensors should have a None tensor memory layout; it wouldn't make sense for host tensors to have other memory layouts. If you look at the get_memory_config API in ttnn, it models host tensors with a std::nullopt tensor memory layout (wrapped in the memory config). This mimics that behaviour:

inline std::optional<ttnn::MemoryConfig> get_memory_config(const ttnn::Tensor& tensor) {
    if (not tensor.is_allocated() or not is_tensor_on_device_or_multidevice(tensor)) {
        return std::nullopt;
    }
    return tensor.memory_config();
}

Once we have the TTNN-specific tensor type, we can model this better, for example by making the memConfigAttr of the tensor optional or setting some default.
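For illustration only, here is roughly how a downstream consumer (e.g. the runtime or flatbuffer path) could map the dialect's None value back to the std::nullopt convention shown above; the enum spellings on both sides and the helper name are assumptions, not code from this PR.

#include <optional>

// Placeholder mirror of the dialect/flatbuffer enum above (spelling assumed).
enum class DialectTensorMemoryLayout {
  None, Interleaved, HeightSharded, WidthSharded, BlockSharded
};

// None maps to "no device memory layout", i.e. a host tensor.
std::optional<::tt::tt_metal::TensorMemoryLayout>
toMetalMemoryLayout(DialectTensorMemoryLayout layout) {
  switch (layout) {
  case DialectTensorMemoryLayout::None:
    return std::nullopt;
  case DialectTensorMemoryLayout::Interleaved:
    return ::tt::tt_metal::TensorMemoryLayout::INTERLEAVED;
  case DialectTensorMemoryLayout::HeightSharded:
    return ::tt::tt_metal::TensorMemoryLayout::HEIGHT_SHARDED;
  case DialectTensorMemoryLayout::WidthSharded:
    return ::tt::tt_metal::TensorMemoryLayout::WIDTH_SHARDED;
  case DialectTensorMemoryLayout::BlockSharded:
    return ::tt::tt_metal::TensorMemoryLayout::BLOCK_SHARDED;
  }
  return std::nullopt;
}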

Contributor

We should be able to do that - remove memCfg from the layout attribute of the tensor. @mtopalovicTT is actively working on adding TTNN specific layout attr (currently we only have TT_LayoutAttr). #701

@mtopalovicTT can you give any updates/timelines on this? Do you see any issues in supporting this case?

@@ -49,6 +49,18 @@ static Value getOrInsertDevice(ConversionPatternRewriter &rewriter,
return deviceOp.getResult();
}

static DataType getDataTypeFromMemRef(mlir::MemRefType memref) {
Contributor

Good candidate for include/ttmlir/Dialect/TTNN/Utils/Utils.h / lib/Dialect/TTNN/Utils/Utils.cpp.
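A rough sketch of what that utility could look like once moved; the namespace, the TileType::getDataType accessor, and the elementTypeToDataType helper are assumptions for the non-tile case, not necessarily the project's actual API.

namespace mlir::tt::ttnn::utils {
// Extract the tt DataType carried by a memref's element type, handling both
// tiled (tt.tile element type) and row-major (scalar element type) memrefs.
inline DataType getDataTypeFromMemRef(mlir::MemRefType memref) {
  mlir::Type elementType = memref.getElementType();
  if (auto tileType = mlir::dyn_cast<mlir::tt::TileType>(elementType)) {
    return tileType.getDataType();
  }
  return elementTypeToDataType(elementType);
}
} // namespace mlir::tt::ttnn::utils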

Comment on lines 134 to 128
// are required to achieve an arbitrary layout, namely:
// - ToLayoutOp: to set the layout (ROW_MAJOR, TILE) of the tensor
// - TypecastOp: to change the data type of the tensor
// - ToDeviceOp: to move the tensor to a specific device
// - FromDeviceOp: to move the tensor from a specific device to host
// - ToMemoryConfigOp: to set the memory configuration (dram, l1, interleaved,
// sharded) of the tensor
Contributor

Let's zoom about this, I'd like to understand what the intent here is.

Contributor Author

Sure, I'd be happy to zoom. In the meantime, hopefully this comment clarifies things a bit.

Contributor

I don't want us to granularize this way for reasons mentioned in one of the comments above - we should model the from_torch approach, which is what the current code tries to do. Additionally, this piece of the change introduces a lot more code, which I find hard to read and maintain.

Can you post a list of specific scenarios we want to support, but are not currently supported?

@@ -376,7 +454,7 @@ class ToLayoutOpConversionPattern
attrs.push_back(convertLayoutAttr(rewriter, srcOp.getLayoutAttr()));
attrs.push_back(createStdNullopt(rewriter));
attrs.push_back(createStdNullopt(rewriter));
attrs.push_back(mlir::IntegerAttr::get(rewriter.getIndexType(), 1));
attrs.push_back(createNullDevicePointer(rewriter));
Contributor

Why this change?

Contributor Author

Because toLayoutOps shouldn't be moving tensors to device - that's the responsibility of toDeviceOps. Therefore we set the device pointer to null.


jnie-TT commented Oct 2, 2024

Hey @jnie-TT, I'm not sure I understand what the intent of this PR is. There are several changes that I think are not the way we want to go - I'd like to zoom over this to better understand what it is we're trying to accomplish here.

In general, we should strive to model what TTNN does today - those are the best practices that will save us a lot of headaches going forward.

@svuckovicTT this essentially builds on top of the changes added in #624 that replaces toMemoryConfigOp with more granular ops.

To summarize, the current state/issues of the memory ops are:

  • We're missing a couple ops (typecast, toMemoryConfig). Here toMemoryConfig solely converts the memory config of the tensor.
  • We're always tilizing on host (toLayout ops are always inserted before toDeviceOps). I was told by the TTNN folks that the host tilizer is very bad, and we should always tilize on device if possible.
  • We're always assuming toLayout should move tensors to device (because we're always passing in the device argument). Therefore we cannot untilize on host (which is fixed by the next bullet).
  • We're hacking in the old compound toMemoryConfigOp when reading tensors from device because there are issues with toLayout and fromDevice ops (Partial revert 19732d0b2a10b053109d41243a29f95acc9cbabe #789 ).
  • We have a compound toMemoryConfig op that checks everything, which eliminates granular control and is hard to update for specific use-cases. This op should solely convert memory configs of the device tensor.
  • toDeviceOps are created when tensors are already on device (no checks are in place), related PR: Reshape using row major layout #819
  • toLayoutOps are created even when the input and output tensors already have the same layout (no checks in place)

This change attempts to fix the above issues:

  • Added typecast and toMemoryConfig ops
  • We determine where to tilize (if the tensor can be tilized on device and the tensor will be moved to device/is on device, we tilize on device)
  • We only move tensors to device with toDeviceOps, clear boundaries are set
  • We use fromDevice ops to read tensors back from device, and we always untilize on host (because device untilize has many issues mentioned in the PR description)
  • We remove the compound toMemoryConfig op and make it lightweight, such that all it does is convert memory configs of device tensors
  • We check whether toDeviceOps should be created, and abide by the constraints (e.g. one should not be created in conjunction with a fromDeviceOp, should not be created if the tensor is already on device, and should not be created before an illegal tilize/typecast, etc.)
  • We check the input/output tensor layouts and determine whether a toLayoutOp should be created (a rough sketch of these checks follows this list)
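A minimal sketch of the kind of checks described above (helper names are hypothetical, not the PR's actual code; dialect headers omitted):

// Only create a ToLayoutOp when the input/output layouts actually differ
// (e.g. ROW_MAJOR -> ROW_MAJOR creates nothing).
bool needsToLayoutOp(ttnn::Layout inputLayout, ttnn::Layout outputLayout) {
  return inputLayout != outputLayout;
}

// Only create a ToDeviceOp for a genuine host -> device transition; tensors
// already on device are reconfigured with ToMemoryConfigOp instead.
bool needsToDeviceOp(bool inputOnDevice, bool outputOnDevice) {
  return !inputOnDevice && outputOnDevice;
}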

The way I see it, these issues should be solved eventually, and should be solved before any further multi-chip changes go in. Otherwise we would need to update a fromDeviceOp that's not being used (and therefore cannot be tested easily), a compound toMemoryConfigOp that's checking everything (so tons of updates would need to go in, which would probably get removed anyway when we granularize it), and a toDeviceOp + toLayoutOp that have mixed boundaries and are also not properly checked.

Comment on lines 20 to 35
const bool transposeA = false;
const bool transposeB = false;
const std::optional<const ::tt::tt_metal::MemoryConfig> memoryConfig =
std::make_optional(outputMemoryConfig);
const std::optional<const ::ttnn::DataType> dtype =
std::make_optional(outputDataType);
const std::optional<const ::ttnn::operations::matmul::MatmulProgramConfig>
programConfig = std::nullopt;
const std::optional<const std::string> activation = std::nullopt;
const std::optional<const ::ttnn::DeviceComputeKernelConfig>
computeKernelConfig = std::nullopt;
const std::optional<const ::ttnn::CoreGrid> coreGrid = std::nullopt;

::ttnn::Tensor out =
::ttnn::matmul(lhs, rhs, transposeA, transposeB, memoryConfig, dtype,
programConfig, activation, computeKernelConfig, coreGrid);
Contributor

If we need to do this for matmul (let's first prove that we do), we should reflect it in the dialect itself. That way, EmitC can work as well.

TTNN_LayoutAttr:$layout);
let results = (outs AnyRankedTensor:$result);
}

def TTNN_TypecastOp : TTNN_Op<"typecast"> {
Contributor

ttnn::experimental::typecast op is a DPS op, we should adjust for that.

@@ -355,6 +360,79 @@ class FromDeviceOpConversionPattern
}
};

// TypecastOp conversion pattern
//
class TypecastOpConversionPattern
Contributor

The op lives in the ttnn::experimental namespace; you can override getPrefixSwapPattern(). GetDeviceOpConversionPattern has an example.

@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch from 37d60fd to 376d9ff on October 21, 2024 04:36
@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch from 376d9ff to 77e7c35 on October 21, 2024 04:48
@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch 2 times, most recently from a47e773 to 2cd1a7d on October 21, 2024 13:54
class ToLayoutOpConversionPattern
: public OpConversionPattern<ttir::ToLayoutOp> {
public:
using OpConversionPattern<ttir::ToLayoutOp>::OpConversionPattern;

LogicalResult
matchAndRewrite(ttir::ToLayoutOp op, OpAdaptor adaptor,
Contributor

My ask here would be to try to migrate this decomposition of ToLayoutOp into a separate TTNN pass that runs after lowering, and thus keep the lowering simple and clean. This would also allow the optimizer to work with a single memory layout manipulation op (for example ttnn::ToLayoutOp) while a separate pass properly decomposes it.

Contributor Author

Added a separate pass after the optimizer that decomposes the composite toLayout op, instead of decomposing during the lowering stage. Updated the optimizer accordingly.
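For readers unfamiliar with the approach, a very rough shape of such a post-optimizer pass is sketched below; this is not the actual implementation from the PR, and the pass name is hypothetical, with includes and pass registration omitted.

struct DecomposeCompositeToLayout
    : public mlir::PassWrapper<DecomposeCompositeToLayout,
                               mlir::OperationPass<mlir::ModuleOp>> {
  void runOnOperation() override {
    mlir::IRRewriter rewriter(&getContext());
    // Walk every composite layout op produced by lowering/the optimizer.
    getOperation()->walk([&](ttnn::CompositeToLayoutOp op) {
      // Compare the input/output layout, data type, and memory config, and
      // emit only the granular ops whose fields actually change
      // (ToDeviceOp/FromDeviceOp, ToLayoutOp for tilize/untilize, TypecastOp,
      // ToMemoryConfigOp), respecting the device constraints from the PR
      // description; then replace all uses of `op` and erase it.
      rewriter.setInsertionPoint(op);
      // ... op creation elided in this sketch ...
    });
  }
};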

Contributor

@jnie-TT Looks great on a first pass, will take a more detailed look on Monday morning.

Contributor

Looks good, please proceed.

@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch 2 times, most recently from 213b2f5 to 57cf309 on October 25, 2024 18:57
@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch 4 times, most recently from 294d326 to 2a088ea on October 26, 2024 18:49
@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch from 2a088ea to e148f14 on October 27, 2024 03:43
@@ -27,13 +27,36 @@ def TTNN_GetDeviceOp : TTNN_Op<"get_device"> {
let results = (outs TT_Device:$device);
}

def TTNN_CompositeToLayoutOp : TTNN_Op<"composite_to_layout"> {
Contributor

Is this going to be permanent? Perhaps we can replace this with ToLayout? As I understand it, the TTNN folks plan to make ttnn::to_layout flexible enough to support all kinds of layout transitions. In the meantime, we can specialize ToLayout to the respective lowered op.

Contributor Author

@nsmithtt that's a good point. I guess I can rename this op to ToLayout, and rename the downstream decomposed ToLayout to something like TTNN_TilizeUntilize. Later, when ttnn::to_layout supports all combinations of parameters, we can simply remove the downstream decomposition pass and use this op instead.

Contributor

Yeah, if you are going to change the name, it's a good idea to change the downstream decomposed op to something else.

@nsmithtt nsmithtt Oct 28, 2024

Or we just leave it? At the initial conversion / optimizer just targets:

  • ToLayout

Then, in the short term, we might have to lower ToLayout to one of:

  • ToMemoryConfig
  • Typecast
  • etc.
  • Or just leave it, if it's supported, as ToLayout

Over time, we can just erase the specialized lowering for cases where ToLayout has added support for it. I don't think we should introduce dummy ops if we don't need to.

@jnie-TT jnie-TT Oct 28, 2024

I think I see what you're saying. So we have a ToLayout op that captures all the parameters up front, effectively doing what the current CompositeToLayoutOp does (maybe with a couple of parameters set as optional, since ttnn::to_layout might not support them yet), and we would just use the same op for tilize/untilize. Eventually we can merge the decomposed ops into this op as ttnn supports them, and make subsequent arguments required instead of optional. Does that sound right?

Contributor

I don't have a hard preference here, but leaving the same name is, in my opinion, bug prone. Reordering passes may create hard-to-debug issues with an op that has a different meaning depending on where it's used.

Contributor

@jnie-TT yeah that's right.

@nobradovictt, I agree it's not ideal, but neither is introducing a dummy op, especially if the eventual semantics of the ToLayout op are exactly how we want to use it. Think of it as a workaround: we lower to ToLayout, then workaround passes at the end specialize out all the cases that ToLayout doesn't support yet.

Contributor

+1 on having the same name for the op. One of the invariants of the TTNN dialect that we want to adhere to is that it should map to the TTNN lib as closely as possible. Having an op that doesn't exist in the TTNN lib breaks this invariant.

For practicality reasons, we could have a flag on the op itself that lets us know whether the op has been "decomposed" or not. If anyone feels that there's a need for it today, feel free to open an issue and assign it to me.

@nobradovictt nobradovictt Oct 28, 2024

@svuckovicTT I could make the same argument against having an op with the same name as in the TTNN lib that carries a completely different meaning (i.e. that it pushes the TTNN dialect further from the TTNN lib). I personally find it even worse: it's a "wrong result", but insidious. It's easy to assert that a dummy op is not expected in the final IR, but if I show you IR with a toLayout op, will you be able to tell me whether it went through the decomposition pass?
But as this is a temp workaround, and more of you seem to prefer the same-name option, I'm fine with that.

Contributor Author

I'll create a separate issue to track this.

Comment on lines +11 to 13
bool isCompositeToLayoutOp(mlir::Operation *op) {
return isa<ttnn::CompositeToLayoutOp>(op);
}
Contributor

No need for this util function anymore since it is a single op again.

@@ -98,7 +121,7 @@ void registerTTNNPipelines() {
mlir::PassPipelineRegistration<
mlir::tt::ttnn::TTIRToTTNNBackendPipelineOptions>(
"ttir-to-ttnn-backend-pipeline",
"Pipeline lowering ttir to ttmetal backend.",
"Pipeline lowering ttir to ttnn backend.",
Contributor

Nice catch :)

…essary and obey constraints in a separate pass, properly model toLayout op.
@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch from e148f14 to 4088fc9 on October 28, 2024 15:01
@svuckovicTT svuckovicTT left a comment

Hey @jnie-TT, thanks for taking the time to rework the ToLayout conversion, I think this is much cleaner now :)

Looks good from dialects perspective, can you please make sure to add the missing issues? (layout=none, typecast to dps, and dtype=none)

Great work!

UInt32 = 9,
UInt16 = 10,
UInt8 = 11,
None = 0,
Contributor

Similar to TTNN_TensorMemoryLayout_None, we should create an issue to track the removal of this.

ttnn::BufferTypeAttr::get(op.getContext(), bufferType));
if (newOutputLayoutEnum == ttnn::Layout::Tile) {
TileType tileType =
TileType::get(rewriter.getContext(), {32, 32}, outputDtype);
Contributor

{32, 32}

Can you use mlir::tt::ttnn::TTNN_TILE_HEIGHT and mlir::tt::ttnn::TTNN_TILE_WIDTH instead?

TileType tileType =
TileType::get(rewriter.getContext(), {32, 32}, outputDtype);
llvm::SmallVector<int64_t> newShardShape =
tileType.getTiledShape(llvm::SmallVector<int64_t>(
Contributor

llvm::SmallVector<int64_t>

Is shard shape always (or most of the time) 2 values? If so, we should use llvm::SmallVector<int64_t, 2>.


jnie-TT commented Oct 28, 2024

Hey @jnie-TT, thanks for taking the time to rework the ToLayout conversion, I think this is much cleaner now :)

Looks good from dialects perspective, can you please make sure to add the missing issues? (layout=none, typecast to dps, and dtype=none)

Great work!

For sure, I'll create separate issues for this to track. Thanks!
