Use tilized dram-interleaved as default input-output layout #1744

Open

jnie-TT wants to merge 1 commit into main from jnie/dram_interleaved_tiled_default_rebased

Conversation

@jnie-TT (Contributor) commented on Jan 10, 2025:

Description

Part of the runtime stitching effort #1743.

This PR updates the default input/output layout from row-major in system memory to tiled DRAM-interleaved.

Combined with the runtime stitching APIs, this enables users to pre-tilize and interleave tensors (such as weights) and reuse them across multiple programs, eliminating the ping-ponging between host and DRAM and between row-major and tile layouts.

IR Example

TTNN IR of simple_matmul test on main:

#system_memory = #ttnn.buffer_type<system_memory>
#ttnn_layout = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<64x128xbf16, #system_memory>>
#ttnn_layout1 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<128x96xbf16, #system_memory>>
#ttnn_layout2 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<64x96xbf16, #system_memory>>
#ttnn_layout3 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<2x4x!tt.tile<32x32, bf16>, #dram>, <interleaved>>
#ttnn_layout4 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<4x3x!tt.tile<32x32, bf16>, #dram>, <interleaved>>
#ttnn_layout5 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<2x3x!tt.tile<32x32, bf16>, #dram>, <interleaved>>
module attributes {tt.device = #device, tt.system_desc = #system_desc} {
  func.func @forward(%arg0: tensor<64x128xbf16, #ttnn_layout>, %arg1: tensor<128x96xbf16, #ttnn_layout1>) -> tensor<64x96xbf16, #ttnn_layout2> {
    %0 = "ttnn.get_device"() <{mesh_shape = #ttnn<mesh_shape 1x1>}> : () -> !tt.device<#device>
    %1 = "ttnn.to_device"(%arg0, %0) <{memory_config = #ttnn.memory_config<#dram, <<2x4>>, <interleaved>>}> : (tensor<64x128xbf16, #ttnn_layout>, !tt.device<#device>) -> tensor<64x128xbf16, #ttnn_layout3>
    %2 = "ttnn.to_layout"(%1) <{layout = #ttnn.layout<tile>}> : (tensor<64x128xbf16, #ttnn_layout3>) -> tensor<64x128xbf16, #ttnn_layout3>
    "ttnn.deallocate"(%1) <{force = false}> : (tensor<64x128xbf16, #ttnn_layout3>) -> ()
    %3 = "ttnn.to_device"(%arg1, %0) <{memory_config = #ttnn.memory_config<#dram, <<4x3>>, <interleaved>>}> : (tensor<128x96xbf16, #ttnn_layout1>, !tt.device<#device>) -> tensor<128x96xbf16, #ttnn_layout4>
    %4 = "ttnn.to_layout"(%3) <{layout = #ttnn.layout<tile>}> : (tensor<128x96xbf16, #ttnn_layout4>) -> tensor<128x96xbf16, #ttnn_layout4>
    "ttnn.deallocate"(%3) <{force = false}> : (tensor<128x96xbf16, #ttnn_layout4>) -> ()
    %5 = "ttnn.empty"(%0) <{dtype = #tt.supportedDataTypes<bf16>, layout = #ttnn.layout<tile>, memory_config = #ttnn.memory_config<#dram, <<2x3>>, <interleaved>>, shape = #ttnn.shape<64x96>}> : (!tt.device<#device>) -> tensor<64x96xbf16, #ttnn_layout5>
    %6 = "ttnn.matmul"(%2, %4, %5) : (tensor<64x128xbf16, #ttnn_layout3>, tensor<128x96xbf16, #ttnn_layout4>, tensor<64x96xbf16, #ttnn_layout5>) -> tensor<64x96xbf16, #ttnn_layout5>
    "ttnn.deallocate"(%4) <{force = false}> : (tensor<128x96xbf16, #ttnn_layout4>) -> ()
    "ttnn.deallocate"(%2) <{force = false}> : (tensor<64x128xbf16, #ttnn_layout3>) -> ()
    %7 = "ttnn.from_device"(%6) : (tensor<64x96xbf16, #ttnn_layout5>) -> tensor<64x96xbf16, #ttnn_layout2>
    "ttnn.deallocate"(%5) <{force = false}> : (tensor<64x96xbf16, #ttnn_layout5>) -> ()
    %8 = "ttnn.to_layout"(%7) <{layout = #ttnn.layout<row_major>}> : (tensor<64x96xbf16, #ttnn_layout2>) -> tensor<64x96xbf16, #ttnn_layout2>
    "ttnn.deallocate"(%7) <{force = false}> : (tensor<64x96xbf16, #ttnn_layout2>) -> ()
    return %8 : tensor<64x96xbf16, #ttnn_layout2>
  }
}

TTNN IR of simple_matmul test after this change:

#ttnn_layout = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<2x4x!tt.tile<32x32, bf16>, #dram>, <interleaved>>
#ttnn_layout1 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<4x3x!tt.tile<32x32, bf16>, #dram>, <interleaved>>
#ttnn_layout2 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<2x3x!tt.tile<32x32, bf16>, #dram>, <interleaved>>
module attributes {tt.device = #device, tt.system_desc = #system_desc} {
  func.func @forward(%arg0: tensor<64x128xbf16, #ttnn_layout>, %arg1: tensor<128x96xbf16, #ttnn_layout1>) -> tensor<64x96xbf16, #ttnn_layout2> {
    %0 = "ttnn.get_device"() <{mesh_shape = #ttnn<mesh_shape 1x1>}> : () -> !tt.device<#device>
    %1 = "ttnn.empty"(%0) <{dtype = #tt.supportedDataTypes<bf16>, layout = #ttnn.layout<tile>, memory_config = #ttnn.memory_config<#dram, <<2x3>>, <interleaved>>, shape = #ttnn.shape<64x96>}> : (!tt.device<#device>) -> tensor<64x96xbf16, #ttnn_layout2>
    %2 = "ttnn.matmul"(%arg0, %arg1, %1) : (tensor<64x128xbf16, #ttnn_layout>, tensor<128x96xbf16, #ttnn_layout1>, tensor<64x96xbf16, #ttnn_layout2>) -> tensor<64x96xbf16, #ttnn_layout2>
    return %2 : tensor<64x96xbf16, #ttnn_layout2>
  }
}

Changes

TTNNLayout

  • Updated the default memory space to DRAM, the tensor memory layout to interleaved, and the layout to tiled.
  • Moved the force-row-major logic from the TTIRToTTNN pass into this pass; it determines whether the tensor needs to be untilized. The problem with keeping this logic in a downstream pass was that a ToLayoutOp might never be created in the first place, since inputs now default to tile layout (so no tilization would be needed). A hedged sketch of the check is shown below.
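
A minimal sketch of that check, assuming the per-op special cases quoted in the review threads further down (the helper name and header path are illustrative, not the PR's literal code); Reshape is handled as a separate special case in the pass, since its input and output layouts must match:

#include "mlir/IR/Operation.h"
#include "ttmlir/Dialect/TTIR/IR/TTIROps.h" // assumed header path

// Sketch: decide whether an operand must stay row-major instead of the new
// tiled dram-interleaved default. Mirrors the isa<> checks quoted below.
static bool shouldForceRowMajor(mlir::Operation *operation,
                                unsigned operandNumber) {
  // Conv2d, Slice, and the first two operands of EmbeddingBackward currently
  // require row-major inputs in TTNN.
  return mlir::isa<ttir::Conv2dOp>(operation) ||
         mlir::isa<ttir::SliceOp>(operation) ||
         (mlir::isa<ttir::EmbeddingBackwardOp>(operation) &&
          operandNumber < 2);
}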

TTIRToTTNN

  • Moved the force-row-major logic up into the TTNNLayout pass.

Optimizer

  • Added a workaround that moves GetDeviceOps to the front of the op schedule (see the sketch after this list).

    • Hit an issue where GetDeviceOps were non-deterministically moved to the end of the schedule when running the mnist_sharded test.
    • I'll create a follow-up issue so this can be fixed properly.
  • Added a workaround that accounts for ReturnOps in the L1 usage calculation.

    • Return ops were not considered when calculating L1 usage. This was fine before because there was always a to_layout op at the end before returning, but now we can very likely return directly without any layout conversion.
    • I'll create a follow-up issue so this can be fixed properly.
  • Marked layout-forcing tests as XFail.

    • With this change, the layout-forcing tests appear to return incorrect results.
    • Marking these tests as XFail for now; I'll create a follow-up issue so this can be fixed properly.
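
A hedged sketch of the GetDeviceOp scheduling workaround, using the opSchedule container that appears in a review excerpt below (the surrounding function and the op-name check are illustrative, not the PR's literal code):

#include <utility>

#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/Operation.h"

// Sketch: hoist every "ttnn.get_device" op to the front of a function's op
// schedule (`schedule` stands in for opSchedule[func] from the excerpt below),
// so device acquisition is never scheduled after its users.
static void hoistGetDeviceOpsToFront(
    llvm::SmallVector<mlir::Operation *> &schedule) {
  llvm::SmallVector<mlir::Operation *> deviceOps, rest;
  for (mlir::Operation *op : schedule) {
    if (op->getName().getStringRef() == "ttnn.get_device")
      deviceOps.push_back(op);
    else
      rest.push_back(op);
  }
  // GetDeviceOps first, everything else after, relative order preserved.
  schedule = std::move(deviceOps);
  schedule.append(rest.begin(), rest.end());
}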

Runtime

  • Added a workaround that makes the runtime APIs assume the first device in the device mesh when sending tensors to device (a hedged illustration follows below).
    • Currently there's no device attribute in TTNNLayoutAttr, so the runtime can't know which device a tensor belongs to. This workaround configures the runtime to always assume the tensor belongs to the first device (device id 0) in the mesh.
    • The next task in line is to add the device attribute to TTNNLayoutAttr; once that's done we can remove the workaround.
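
Purely as an illustration of the behavior (not the actual workarounds.h contents; the names here are hypothetical):

#include <vector>

// Hypothetical illustration: with the workaround enabled, tensors are always
// sent to the first device (id 0) of the mesh, because TTNNLayoutAttr does not
// yet record which device a tensor belongs to.
struct DeviceHandle { int id; };

static DeviceHandle targetDeviceForTensor(const std::vector<DeviceHandle> &mesh) {
  // TODO(follow-up): once TTNNLayoutAttr carries a device attribute, look up
  // the tensor's actual device here instead of defaulting to the first one.
  return mesh.front();
}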

MLIR Tests

  • Updated FileCheck patterns to match the new IR (e.g. removed checks for ttnn.to_device and for now-redundant ttnn.to_layout ops).
  • Split simple_eltwise into individual per-op files.
    • Using one large file made it hard to isolate errors; this also matches what we do in Dialect and Perf.
    • Allows more complex and diverse testing per op.

TODOs Before Merging

  • Frontends need to add a runtime::toHost call before memcpying tensors (see the sketch after this list).

    • This is because tensors are now returned in tile layout; runtime::toHost accepts an untilize flag that will untilize the tensor.
  • Update TODO comments once proper issues are created (optimizer, runtime workaround).
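
A hedged sketch of that frontend-side change (the exact toHost return type and the surrounding copy-out code may differ):

#include "tt/runtime/runtime.h"

// Sketch: outputs now come back tilized, so untilize them on the way to host
// before copying bytes out.
void readBackOutput(::tt::runtime::Tensor deviceOutput) {
  // untilize=true restores row-major layout on the host side.
  auto hostOutput = ::tt::runtime::toHost(deviceOutput, /*untilize=*/true);
  // ... memcpy hostOutput's data into the frontend's buffer as before ...
  (void)hostOutput;
}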

@jnie-TT force-pushed the jnie/dram_interleaved_tiled_default_rebased branch 3 times, most recently from 1a31acb to d4c5383, on January 14, 2025 at 03:57.
@nsmithtt (Contributor) left a comment:

Reviewed the compiler portion, will take a look at the runtime part later today!

Review thread on lib/Conversion/TTIRToTTNN/TTIRToTTNN.cpp (outdated, resolved).
currentL1Usage -= currentL1UsagePerOp[op].l1MemUsagePerUser;
currentL1UsagePerOp.erase(op);
}

Contributor:
@fbajraktariTT, can you review this file?

Contributor:
FYI @odjuricicTT, since @fbajraktariTT recently completed his internship.

Contributor:
@jnie-TT I'm not sure that this extra logic is needed. Was a test failing without this temp fix? If so, can you provide more details?

Contributor Author (jnie-TT):
@odjuricicTT there's an assert below that checks that currentL1Usage is 0. This error only surfaces with my changes; without them it's fine because we always untilize (to_layout) before returning. With my changes, however, we may return directly with no intermediate ops between the current op and the return op, and that causes issues because currentL1Usage is never zeroed out.

Since this function doesn't decrement L1 usage on the return op, the assert fires and reports that the L1 usage is non-zero. My change basically adds a check so that if the consumer op is a return op, we decrement the L1 usage.
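
A hedged sketch of that check, reusing the names from the excerpt above and using func::ReturnOp as the return op for illustration (not the literal diff):

// Sketch: if an op's consumer is the function's return op, release its L1
// contribution here; otherwise the final currentL1Usage == 0 assert fires.
for (mlir::Operation *user : op->getUsers()) {
  if (mlir::isa<mlir::func::ReturnOp>(user)) {
    currentL1Usage -= currentL1UsagePerOp[op].l1MemUsagePerUser;
    currentL1UsagePerOp.erase(op);
    break;
  }
}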

Contributor:
@jnie-TT Thanks! Your solution is fine for now. Just please file the followup issue and reference it in the comment.

opSchedule[func].erase(it);
opSchedule[func].insert(opSchedule[func].begin(), deviceOp);
}

Contributor:
@odjuricicTT, can you review this file?

Contributor:
@jnie-TT The proper fix for this would be to add it here:
https://github.com/tenstorrent/tt-mlir/blob/main/lib/Dialect/TTNN/Analysis/DFShardingPolicy.cpp#L37

Try changing the if to check for GetDeviceOp as well as ToLayoutOp.
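
Something along these lines, presumably (a sketch only; the exact op classes, namespaces, and what the existing branch does live in that file):

// Treat GetDeviceOp the same way the existing check treats ToLayoutOp when
// building the op schedule / shard chain.
if (llvm::isa<ttnn::ToLayoutOp, ttnn::GetDeviceOp>(op)) {
  // ... existing ToLayoutOp handling ...
}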

Review thread on lib/Dialect/TTNN/Transforms/TTNNLayout.cpp (resolved).

// TTNN Reshape does not support implicit tilization/untilization
// Therefore input output layouts should be the same
if (mlir::isa<ttir::ReshapeOp>(operation) && operandNumber == 1) {
Contributor:
I feel like we should have attributes on the op that denote these kinds of capabilities instead of special-casing this code for specific ops. @sdjordjevicTT thoughts?

Contributor:
Perhaps we should add an interface to all TTNN ops called shouldTilize that defaults to true and that ops can specialize.
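
For illustration, a stand-in for what such a default-true query could look like on the C++ side until a proper OpInterface exists (the name and the op list are hypothetical):

#include "llvm/ADT/TypeSwitch.h"
#include "mlir/IR/Operation.h"
#include "ttmlir/Dialect/TTIR/IR/TTIROps.h" // assumed header path

// Hypothetical: default to tilizing, let specific ops opt out. A real version
// would likely be a TableGen-defined OpInterface that each op can specialize.
static bool shouldTilize(mlir::Operation *op) {
  return llvm::TypeSwitch<mlir::Operation *, bool>(op)
      .Case<ttir::ReshapeOp, ttir::Conv2dOp, ttir::SliceOp>(
          [](auto) { return false; }) // require row-major / matching layouts
      .Default([](auto) { return true; });
}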

Contributor Author (jnie-TT):
Yeah, that would be awesome to have. A lot of eltwise ops face a similar issue with data types: some ops can typecast implicitly while others cannot, which results in the IR being misaligned with the actual runtime output.

Contributor:
I am thinking about these scenarios; do we have some examples?

Contributor:
Aren't these examples, i.e. reshape, conv2d, slice, and embedding? Or do you mean something else?

@jnie-TT (Contributor Author) commented on Jan 15, 2025:
@sdjordjevicTT if you mean the implicit typecast ops, an example would be relational binary ops vs. unary ops.
Relational operations take an output_dtype that we set so the op typecasts implicitly:

template <BinaryOpType binary_op_type>
struct RelationalBinary {
    static Tensor invoke(
        uint8_t queue_id,
        const Tensor &input_tensor_a_arg,
        const Tensor &input_tensor_b_arg,
        const std::optional<const DataType> &output_dtype = std::nullopt,
        const std::optional<MemoryConfig> &memory_config = std::nullopt,
        std::optional<Tensor> optional_output_tensor = std::nullopt,
        std::optional<unary::FusedActivations> activations = std::nullopt,
        std::optional<unary::UnaryWithParam> input_tensor_a_activation = std::nullopt);

However unary ops do not:

template <UnaryOpType... unary_op_types>
Tensor ExecuteUnary<unary_op_types...>::invoke(
    const Tensor& input_tensor,
    const std::optional<MemoryConfig>& memory_config,
    const std::optional<Tensor>& optional_output_tensor) {

And our compiler doesn't distinguish between them, i.e. for unary ops it will still assume the output tensor is properly typecast to the desired data type.

As for ops that don't support implicit tilization/untilization, examples include reshape, concat, and transpose.

Contributor:
I believe there was a misunderstanding between us. :)

Regarding Conv, Slice, and Embedding, I'm aware that they require some inputs to be in a row-major layout. I'll address this by implementing the necessary layout workarounds. If the Metal developers decide not to support tile layout for them, then we can introduce a trait/interface to accommodate them.

Regarding the implicit conversions, I get it for the data_type, but how are we specifying whether the output is in tile or row-major layout? By defining the optional_output_tensor? I see what the issue can be: if you have a row-major input, you want to keep the output row-major for such ops. We can think about adding an op-level interface to support this.

I created issues for myself to follow up on this.

if (mlir::isa<ttir::Conv2dOp>(operation) ||
    mlir::isa<ttir::SliceOp>(operation) ||
    (mlir::isa<ttir::EmbeddingBackwardOp>(operation) &&
     operandNumber < 2)) {
Contributor:
Same as above.

Contributor:
This will be cleaned up with the workarounds. We have cleanup tasks for each of these.

Review thread on runtime/include/tt/runtime/detail/workarounds.h (outdated, resolved).
@jnie-TT force-pushed the jnie/dram_interleaved_tiled_default_rebased branch from d4c5383 to 676c714 on January 14, 2025 at 19:59.
@jnie-TT force-pushed the jnie/dram_interleaved_tiled_default_rebased branch from 676c714 to a9a8eff on January 15, 2025 at 22:09.
@odjuricicTT (Contributor) left a comment:

A few comments on Optimizer related changes, but looks good overall.

Requesting changes until optimizer layout overrides are fixed. I'll help with this.

@@ -4,12 +4,11 @@
// CHECK-DAG: #[[LOC_MATMUL_IN1:.*]] = loc("matmul_1_in_1_layout"(#loc3))
// CHECK-DAG: #[[LOC_MATMUL:.*]] = loc("matmul_1"(#loc3))
// CHECK-DAG: #[[IN_1_LAYOUT:.*]] = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<1x12x!tt.tile<32x32, bf16>, #l1_>, <interleaved>>

// XFAIL: *
Contributor:
Unfortunately, tt-explorer and some forge-fe tests depend on layout overrides working, so this cannot be tackled in a follow-up PR.

@jnie-TT Do you have more context on why this is not working? I will also take a deeper look later today.

Contributor Author (jnie-TT):
@odjuricicTT I haven't looked deeply into it. I suspect there may be some assumptions in the optimizer about the initial tensor location/layout that aren't valid anymore; with this change, basically all initial tensors will be in DRAM in tile layout.

Contributor:
@jnie-TT Here is the fix for one of the tests marked as XFAIL: 0bf4366

The other one can stay XFAIL for now; just file a follow-up issue.

// API can determine the correct devices. Enabling this workaround will assume
// that a device tensor will reside in the L1/Dram of the first device (device
// id 0) of the device grid. This should be removed once we add the device
// grid information to the tensorDesc.
Contributor:
So there is a strategy field on LayoutDesc that will be set to ::tt::target::DistributedTensorConfig::NONE for a single-chip setup, or to some kind of multi-device distribution otherwise. LMK if this doesn't resolve the issue.

table LayoutDesc {
  stride: [int];
  oob_val: OOBVal;
  core_range_set: [Dim2dRange];
  memory_desc: MemoryDesc;
  strategy: DistributionStrategy;
}

Contributor:
Reach out to @wooseokTT if you need help interpreting how it's programmed.

@jnie-TT (Contributor Author) commented on Jan 16, 2025:
@nsmithtt the strategy doesn't tell us which submesh a tensor belongs to though, right? I remember that when I added it, it was used to specify the tensor distribution method across multiple devices (replicate, shard, etc.).

I can use it to distinguish between single-chip and multi-chip, but I don't know the mesh shape or mesh offset that the tensor is mapped to if it's multi-device, and I need this info if I want to move a tensor to a multi-device mesh in the toLayout API.

Contributor:
It depends on the strategy, but e.g. ShardTensor2D.shard_mesh does tell you the mesh shape. I think the offset is always implicitly [0, 0] (@wooseokTT feel free to correct me if I'm wrong), which reflects the TTNN API, which doesn't support arbitrary mesh offsets.

@jnie-TT (Contributor Author) commented on Jan 16, 2025:
Yeah, you're right, ShardTensor2D has the shard_mesh, but it seems like the other ones don't... If all we're using is ShardTensor2D with offset [0, 0], then I guess I can just derive it from that, and maybe add an assert that checks that the strategy must be ShardTensor2D. Does doing it this way make sense with how we're performing multichip operations?
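
A hedged sketch of what deriving the mesh shape this way could look like, assuming ShardTensor2D is the only multi-device strategy in play and the offset is implicitly [0, 0]; the flatbuffer accessor names below follow the usual generated-code pattern and may not match the real headers exactly:

#include <cstdint>
#include <utility>

// Assumes the generated flatbuffer header for the target schema is available;
// accessor names (strategy_type, strategy_as_ShardTensor2D, shard_mesh) are
// assumptions based on the schema quoted above.
static std::pair<uint32_t, uint32_t>
inferMeshShape(const ::tt::target::DistributionStrategy *strategy) {
  if (strategy->strategy_type() ==
      ::tt::target::DistributedTensorConfig::ShardTensor2D) {
    const auto *shard2d = strategy->strategy_as_ShardTensor2D();
    // Mesh offset is taken to be [0, 0] implicitly.
    return {shard2d->shard_mesh()->y(), shard2d->shard_mesh()->x()};
  }
  // NONE or other strategies: treat as a single device (1x1 mesh).
  return {1, 1};
}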
