Granularize toMemoryConfigOp #840

Merged 1 commit on Oct 28, 2024

Conversation

@jnie-TT jnie-TT commented Sep 28, 2024

Introduce 5 ops that cover the functionality of the previous toMemoryConfigOp:

  • ToDeviceOp: Move data to device
  • FromDeviceOp: Move data from device to host
  • ToLayoutOp: Tilize/untilize data
  • TypecastOp: Typecast data
    • FYI @mmanzoorTT I updated the TTNN implementation of the typecast op, giving it its own conversion pattern. In TTNN, typecast does not belong to eltwise unary, which is why you had to add a separate method for running the op in the runtime. The StableHLO-to-TTIR interface remains unchanged.
  • ToMemoryConfigOp: Solely converts the memory config, including DRAM <-> L1 movement and interleaved <-> sharded conversion

Updated downstream consumers (TTNN to flatbuffer, TTNN Runtime, TTNNToEmitC)

Creating these ops in combination is a challenge, as there are several (known) constraints:

  • Tilizing on device requires bf16
  • Typecasting on device requires TILIZED data
  • Untilizing on device requires width % 2 == 0 and page size > sizeof(uint32_t).
    • Here it seems very hard to update the page size. Some approaches that might work are typecasting or padding; however, padding -> untilize -> unpad seems tricky, and typecasting requires tilized data.
    • Therefore, for now we will always untilize on host (which is what we do currently). Since we normally only untilize at the end of the program, when moving data from device to host, the perf hit should be acceptable; we rarely have device-to-device untilize cases. A rough sketch of these placement decisions follows this list.
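Below is a minimal sketch (not code from this PR) of the placement decisions these constraints imply; the helper names and the local DataType enum exist purely for illustration.

enum class DataType { BFloat16, Float32, UInt32 }; // illustrative subset only

// Where each conversion can legally run, per the constraints listed above.
bool canTilizeOnDevice(DataType dtype) {
  // Tilizing on device requires bf16 data.
  return dtype == DataType::BFloat16;
}

bool canTypecastOnDevice(bool isTilized) {
  // Typecasting on device requires tilized data.
  return isTilized;
}

bool shouldUntilizeOnHost() {
  // Device untilize has width/page-size restrictions, so for now we always
  // untilize on host; it is normally only needed at the end of the program.
  return true;
}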

This PR also fixes the tensor type when forcing row_major/tile layouts. Previously, when forcing a tile/row_major layout, the output tensor's layout, shard shape, and element type were not changed, so the runtime had to hack in tilize/untilize logic instead of trusting the tile shape. The sketch below shows the idea.
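The core of the fix can be sketched as follows, built around the TileType::get / getTiledShape calls that appear in the diff snippets quoted later in this thread; the wrapper function itself, its signature, and the includes are assumptions, not the PR's actual code.

#include <utility>
#include "llvm/ADT/SmallVector.h"
// (tt-mlir dialect headers for TileType/DataType omitted in this sketch)

std::pair<mlir::Type, llvm::SmallVector<int64_t>>
forceTileElementType(mlir::MLIRContext *ctx,
                     llvm::ArrayRef<int64_t> shardShape,
                     mlir::tt::DataType dtype) {
  // The element type becomes a 32x32 tile of the requested data type...
  mlir::tt::TileType tileType = mlir::tt::TileType::get(ctx, {32, 32}, dtype);
  // ...and the shard shape is re-expressed in tiles rather than scalars, so
  // downstream consumers can trust the tensor type instead of re-deriving
  // tilization in the runtime.
  llvm::SmallVector<int64_t> tiledShape = tileType.getTiledShape(
      llvm::SmallVector<int64_t>(shardShape.begin(), shardShape.end()));
  return {tileType, tiledShape};
}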

Also added a shardSpec attribute to the memoryConfigAttr because the optimizer needs it for its reshard implementation.

Another goal is to make the upcoming multi-chip changes easier: without this change we are missing a couple of ops and are hacking toMemoryConfig ops in the compiler, so I would need to update the compound toMemoryConfig op to support multi-chip, only for it to be removed later anyway.

Here's a simple example to illustrate a couple of issues and how they are fixed by this PR.
On main we have:

#layout = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<512x1024xbf16, #system>>
#layout1 = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<512x1024xbf16, #dram>, interleaved>
module attributes {tt.device = #device, tt.system_desc = #system_desc} {
  func.func @softmax(%arg0: tensor<512x1024xbf16, #layout>) -> tensor<512x1024xbf16, #layout> {
    %0 = "ttnn.get_device"() <{mesh_shape = #ttnn<mesh_shape 1x1>}> : () -> !tt.device<#device>
    %1 = "ttnn.to_layout"(%arg0, %0) <{layout = #ttnn.layout<tile>}> : (tensor<512x1024xbf16, #layout>, !tt.device<#device>) -> tensor<512x1024xbf16, #layout1>
    %2 = "ttnn.to_device"(%1, %0) <{memory_config = #ttnn.memory_config<<interleaved>, <dram>>}> : (tensor<512x1024xbf16, #layout1>, !tt.device<#device>) -> tensor<512x1024xbf16, #layout1>
    "ttnn.dealloc"(%1) : (tensor<512x1024xbf16, #layout1>) -> ()
    %3 = "ttnn.softmax"(%2) <{dimension = 1 : si32}> : (tensor<512x1024xbf16, #layout1>) -> tensor<512x1024xbf16, #layout1>
    "ttnn.dealloc"(%2) : (tensor<512x1024xbf16, #layout1>) -> ()
    %4 = "ttnn.softmax"(%3) <{dimension = -1 : si32}> : (tensor<512x1024xbf16, #layout1>) -> tensor<512x1024xbf16, #layout1>
    "ttnn.dealloc"(%3) : (tensor<512x1024xbf16, #layout1>) -> ()
    %5 = "ttnn.to_memory_config"(%4, %0) : (tensor<512x1024xbf16, #layout1>, !tt.device<#device>) -> tensor<512x1024xbf16, #layout>
    "ttnn.dealloc"(%4) : (tensor<512x1024xbf16, #layout1>) -> ()
    return %5 : tensor<512x1024xbf16, #layout>
  }
}

Notice that we hard-code to_layout to tilize on host (to_layout always comes before to_device), and we hack in a ttnn.to_memory_config at the end to move data from device to host and do all the necessary conversions there (typecasting the data type, untilizing, to_cpu, etc.). There is currently no way to typecast or convert the memory config in the middle of the program - everything happens at the end when we're done.
Also notice the to_layout op that produces tensor %1: %1 = "ttnn.to_layout"(%arg0, %0) <{layout = #ttnn.layout<tile>}> : (tensor<512x1024xbf16, #layout>, !tt.device<#device>) -> tensor<512x1024xbf16, #layout1>. In the op's parameters we see ttnn.layout<tile>, which means this op is trying to tilize the tensor. However, look at the layout of the output tensor #layout1... memref<512x1024xbf16, #dram>: the element type is not tilized, it is still row_major. This causes a mismatch between the op parameter and the output tensor type, and further downstream, all consumers of this tensor will assume the tensor is not tilized when inspecting its element type.

With the fixes in this PR we have:

#layout = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<512x1024xbf16, #system>>
#layout1 = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<1x1x!tt.tile<32x32, bf16>, #dram>, interleaved>
#layout2 = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<512x1024xbf16, #dram>, interleaved>
module attributes {tt.device = #device, tt.system_desc = #system_desc} {
  func.func @softmax(%arg0: tensor<512x1024xbf16, #layout>) -> tensor<512x1024xbf16, #layout> {
    %0 = "ttnn.get_device"() <{mesh_shape = #ttnn<mesh_shape 1x1>}> : () -> !tt.device<#device>
    %1 = "ttnn.to_device"(%arg0, %0) <{memory_config = #ttnn.memory_config<<interleaved>, <dram>, <<1x1>>>}> : (tensor<512x1024xbf16, #layout>, !tt.device<#device>) -> tensor<512x1024xbf16, #layout1>
    %2 = "ttnn.to_layout"(%1) <{layout = #ttnn.layout<tile>}> : (tensor<512x1024xbf16, #layout1>) -> tensor<512x1024xbf16, #layout1>
    "ttnn.dealloc"(%1) : (tensor<512x1024xbf16, #layout1>) -> ()
    %3 = "ttnn.softmax"(%2) <{dimension = 1 : si32}> : (tensor<512x1024xbf16, #layout1>) -> tensor<512x1024xbf16, #layout2>
    "ttnn.dealloc"(%2) : (tensor<512x1024xbf16, #layout1>) -> ()
    %4 = "ttnn.softmax"(%3) <{dimension = -1 : si32}> : (tensor<512x1024xbf16, #layout2>) -> tensor<512x1024xbf16, #layout2>
    "ttnn.dealloc"(%3) : (tensor<512x1024xbf16, #layout2>) -> ()
    %5 = "ttnn.from_device"(%4) : (tensor<512x1024xbf16, #layout2>) -> tensor<512x1024xbf16, #layout>
    "ttnn.dealloc"(%4) : (tensor<512x1024xbf16, #layout2>) -> ()
    %6 = "ttnn.to_layout"(%5) <{layout = #ttnn.layout<row_major>}> : (tensor<512x1024xbf16, #layout>) -> tensor<512x1024xbf16, #layout>
    "ttnn.dealloc"(%5) : (tensor<512x1024xbf16, #layout>) -> ()
    return %6 : tensor<512x1024xbf16, #layout>
  }
}

Now, notice that the first to_layout op is inserted after the to_device op, so we are indeed tilizing on device, because the data format is bf16. Also notice there is no hacked-in to_memory_config at the end of the program. Instead, we call a from_device op that reads from device and a to_layout op that untilizes on host, and finally return the tensor. Also look at the first to_layout op: %2 = "ttnn.to_layout"(%1) <{layout = #ttnn.layout<tile>}> : (tensor<512x1024xbf16, #layout1>) -> tensor<512x1024xbf16, #layout1>. The output tensor layout is #layout1, which is now indeed in tile format: memref<1x1x!tt.tile<32x32, bf16>, which is aligned with the parameter/goal of the op. Now all downstream consumers will see a tilized tensor.

Misc:
Synced with @svuckovicTT, we concluded that we should:

  • Break down the big if statement into separate methods (Done; each case now has its own function)
  • Model the to_layout op to include all TTNN parameters as optional (Done in this PR)
  • Convert the typecast op to DPS; a separate issue will be created for this so the PR doesn't bloat any further
  • Create an issue that tracks TensorMemoryLayout::None. This won't be needed once we have a TTNN-specific tensor type in our dialect, since we could then just make the memory config optional

Additional changes:

@odjuricicTT odjuricicTT left a comment

Trying to understand this better. Is it true that TTNN to_device op can only move from host to dram, or can it also put the tensor directly into an L1 sharded configuration?

/////////////////////////////////////////
// Unsupported eltwise ops with sharding
// * Concat: Sharded concat requires ROW MAJOR layout
// * Softmax: Sharded softmax produces incorrect values
Contributor

Do we have an issue on tt-metal to track this?

Contributor Author

AFAIK we need to pass in a specific parameter SoftmaxShardedMultiCoreProgramConfig for this to work with sharded data.

Contributor

Can you add an issue to our gitlab to track this, with all the details, and put a TODO with bug id in the comment?

lib/Conversion/TTIRToTTNN/TTIRToTTNN.cpp (outdated comment, resolved)

jnie-TT commented Sep 30, 2024

Trying to understand this better. Is it true that TTNN to_device op can only move from host to dram, or can it also put the tensor directly into an L1 sharded configuration?

It can put it in an L1 sharded configuration directly; the toDeviceOp accepts a memory config as a parameter. However, if the tensor is already on device and we want to move between L1/DRAM or convert between sharded/interleaved, then we use the toMemoryConfig op.
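A hedged runtime-side sketch of that split, assuming the ttnn C++ APIs keep roughly the shapes used elsewhere in this PR (to_device taking a device pointer plus a memory config, and to_memory_config reconfiguring an already-on-device tensor); the wrapper function names are hypothetical:

// Host -> device placement: can land directly in DRAM or an L1 sharded config.
::ttnn::Tensor putOnDevice(const ::ttnn::Tensor &hostTensor,
                           ::ttnn::Device *device,
                           const ::tt::tt_metal::MemoryConfig &memConfig) {
  return ::ttnn::to_device(hostTensor, device, memConfig);
}

// Device-resident moves (DRAM <-> L1, interleaved <-> sharded) go through
// to_memory_config instead of to_device.
::ttnn::Tensor reconfigureOnDevice(const ::ttnn::Tensor &deviceTensor,
                                   const ::tt::tt_metal::MemoryConfig &memConfig) {
  return ::ttnn::to_memory_config(deviceTensor, memConfig);
}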

@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch from b006356 to 37d60fd on September 30, 2024 15:51
@svuckovicTT svuckovicTT left a comment

Hey @jnie-TT, I'm not sure I understand what the intent of this PR is. There are several changes that I think are not the way we want to go - I'd like to zoom over this to better understand what it is we're trying to accomplish here.

In general, we should strive to model what TTNN does today - those are the best practices that will save us a lot of headaches going forward.

Comment on lines +53 to +71
It handles:
- ROW_MAJOR to TILE
- TILE to ROW_MAJOR
Contributor

Does this mean that it will error out if called for ROW_MAJOR -> ROW_MAJOR?

Contributor Author

A toLayoutOp will not be created for ROW_MAJOR -> ROW_MAJOR; a toLayoutOp is created only when the input and output layouts differ.

}];

let arguments = (ins AnyRankedTensor:$input,
TT_Device:$device,
Contributor

Why the removal? TTNN lib has this arg, and it is used in some places.

Contributor Author

When moving a tensor to device we should use the toDeviceOp; toLayoutOp should be used solely to tilize/untilize tensors. If there is too much overlap between the functionality of ops, it becomes hard to model in the compiler, so these boundaries should be set between ops.

Contributor

I understand your reasoning, and from a high level, it just makes so much sense. Similarly, I wanted to use ttnn::to_layout() to "set" all the things tensor related in one go: dtype, memcfg, layout, device. However, it turned out not to work: when I called the op with all the params, I got an error that was something like "dtype cannot be specified when converting to row_major".

It's not ideal that the APIs have hidden issues like that. We should ask for fixes, but I'm pretty sure that they won't all get fixed relatively fast, so I don't want us to bet on that. What we can do today is to copy their behaviour in order to minimize the risk of hitting similar issues.

This is exactly what I tried to do in #624 and #747. I went and talked to metal folks to understand how they do it for their scenarios. These codepaths are the most tested, and there's less chance that something will silently break on their end. If we resort to doing things the way we think they should be done, I'm afraid we'll end up hitting random errors, as those paths are not as well tested (if at all).

In #624, we copy the behaviour from the from_torch method, as proposed by metal folks. In that method specifically, the to_layout function is called with the device parameter supplied each time. While the device object might be None in some cases, it's not always going to be, and we should not model the to_layout API so that we exclude the parameter.

In general, as dialect writers, we want to model their APIs as close as possible, and let the users of the dialect decide on how to use the APIs (which is usually us, and the top half of the comment talks about that).

def TTNN_TensorMemoryLayout_HeightSharded : I32EnumAttrCase<"HeightSharded", 2, "height_sharded">;
def TTNN_TensorMemoryLayout_WidthSharded : I32EnumAttrCase<"WidthSharded", 3, "width_sharded">;
def TTNN_TensorMemoryLayout_BlockSharded : I32EnumAttrCase<"BlockSharded", 4, "block_sharded">;
def TTNN_TensorMemoryLayout_None : I32EnumAttrCase<"None", 0, "none">;
Contributor

In the TTNN lib, the enum TensorMemoryLayout has no value None.

The idea behind the TTNN dialect is to model the TTNN lib as faithfully as possible. Adding an enum value in the dialect that doesn't exist in the lib will create issues down the road - ttrt and emitc have to implement special logic to check whether the value is None and decide what to do with it.

@jnie-TT jnie-TT Oct 2, 2024

With this change, all host tensors should have a None tensor memory layout; it wouldn't make sense for host tensors to have other memory layouts. If you look at the get_memory_config API in ttnn, it models host tensors with a std::nullopt tensor memory layout (wrapped in the memory config). This mimics that behaviour:

inline std::optional<ttnn::MemoryConfig> get_memory_config(const ttnn::Tensor& tensor) {
    if (not tensor.is_allocated() or not is_tensor_on_device_or_multidevice(tensor)) {
        return std::nullopt;
    }
    return tensor.memory_config();
}

Once we have the TTNN-specific tensor type, we can model this better, for example by making the memConfigAttr of the tensor optional or setting some default.
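For illustration only, here is roughly how a downstream consumer (e.g. the runtime or flatbuffer path) could map the dialect's None value back to the std::nullopt convention shown above; the enum spellings on both sides and the helper name are assumptions, not code from this PR.

#include <optional>

// Placeholder mirror of the dialect/flatbuffer enum above (spelling assumed).
enum class DialectTensorMemoryLayout {
  None, Interleaved, HeightSharded, WidthSharded, BlockSharded
};

// None maps to "no device memory layout", i.e. a host tensor.
std::optional<::tt::tt_metal::TensorMemoryLayout>
toMetalMemoryLayout(DialectTensorMemoryLayout layout) {
  switch (layout) {
  case DialectTensorMemoryLayout::None:
    return std::nullopt;
  case DialectTensorMemoryLayout::Interleaved:
    return ::tt::tt_metal::TensorMemoryLayout::INTERLEAVED;
  case DialectTensorMemoryLayout::HeightSharded:
    return ::tt::tt_metal::TensorMemoryLayout::HEIGHT_SHARDED;
  case DialectTensorMemoryLayout::WidthSharded:
    return ::tt::tt_metal::TensorMemoryLayout::WIDTH_SHARDED;
  case DialectTensorMemoryLayout::BlockSharded:
    return ::tt::tt_metal::TensorMemoryLayout::BLOCK_SHARDED;
  }
  return std::nullopt;
}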

Contributor

We should be able to do that - remove memCfg from the layout attribute of the tensor. @mtopalovicTT is actively working on adding TTNN specific layout attr (currently we only have TT_LayoutAttr). #701

@mtopalovicTT can you give any updates/timelines on this? Do you see any issues in supporting this case?

@@ -49,6 +49,18 @@ static Value getOrInsertDevice(ConversionPatternRewriter &rewriter,
return deviceOp.getResult();
}

static DataType getDataTypeFromMemRef(mlir::MemRefType memref) {
Contributor

Good candidate for include/ttmlir/Dialect/TTNN/Utils/Utils.h / lib/Dialect/TTNN/Utils/Utils.cpp.
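A rough sketch of what that utility could look like once moved; the namespace, the TileType::getDataType accessor, and the elementTypeToDataType helper are assumptions for the non-tile case, not necessarily the project's actual API.

namespace mlir::tt::ttnn::utils {
// Extract the tt DataType carried by a memref's element type, handling both
// tiled (tt.tile element type) and row-major (scalar element type) memrefs.
inline DataType getDataTypeFromMemRef(mlir::MemRefType memref) {
  mlir::Type elementType = memref.getElementType();
  if (auto tileType = mlir::dyn_cast<mlir::tt::TileType>(elementType)) {
    return tileType.getDataType();
  }
  return elementTypeToDataType(elementType);
}
} // namespace mlir::tt::ttnn::utils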

Comment on lines 134 to 128
// are required to achieve an arbitrary layout, namely:
// - ToLayoutOp: to set the layout (ROW_MAJOR, TILE) of the tensor
// - TypecastOp: to change the data type of the tensor
// - ToDeviceOp: to move the tensor to a specific device
// - FromDeviceOp: to move the tensor from a specific device to host
// - ToMemoryConfigOp: to set the memory configuration (dram, l1, interleaved,
// sharded) of the tensor
Contributor

Let's zoom about this, I'd like to understand what the intent here is.

Contributor Author

Sure, I'd be happy to zoom. In the meantime, hopefully this comment clarifies things a bit.

Contributor

I don't want us to granularize this way for reasons mentioned in one of the comments above - we should model the from_torch approach, which is what the current code tries to do. Additionally, this piece of the change introduces a lot more code, which I find hard to read and maintain.

Can you post a list of specific scenarios we want to support, but are not currently supported?

@@ -376,7 +454,7 @@ class ToLayoutOpConversionPattern
attrs.push_back(convertLayoutAttr(rewriter, srcOp.getLayoutAttr()));
attrs.push_back(createStdNullopt(rewriter));
attrs.push_back(createStdNullopt(rewriter));
attrs.push_back(mlir::IntegerAttr::get(rewriter.getIndexType(), 1));
attrs.push_back(createNullDevicePointer(rewriter));
Contributor

Why this change?

Contributor Author

Because toLayoutOps shouldn't be moving tensors to device - that's the responsibility of toDeviceOps. Therefore we set the device pointer to null.


jnie-TT commented Oct 2, 2024

Hey @jnie-TT, I'm not sure I understand what the intent of this PR is. There are several changes that I think are not the way we want to go - I'd like to zoom over this to better understand what it is we're trying to accomplish here.

In general, we should strive to model what TTNN does today - those are the best practices that will save us a lot of headaches going forward.

@svuckovicTT this essentially builds on top of the changes added in #624 that replaces toMemoryConfigOp with more granular ops.

To summarize, the current state/issues of the memory ops are:

  • We're missing a couple ops (typecast, toMemoryConfig). Here toMemoryConfig solely converts the memory config of the tensor.
  • We're always tilizing on host (toLayout ops are always inserted before toDeviceOps). I was told by the TTNN folks that the host tilizer is very bad, and we should always tilize on device if possible.
  • We're always assuming toLayout should move tensors to device (because we're always passing in the device argument). Therefore we cannot untilize on host (which is fixed by the next bullet).
  • We're hacking in the old compound toMemoryConfigOp when reading tensors from device because there are issues with toLayout and fromDevice ops (Partial revert 19732d0b2a10b053109d41243a29f95acc9cbabe #789 ).
  • We have a compound toMemoryConfig op that checks everything, which eliminates granular control and is hard to update for specific use-cases. This op should solely convert memory configs of the device tensor.
  • toDeviceOps are created when tensors are already on device (no checks are in place), related PR: Reshape using row major layout #819
  • toLayoutOps are created even when the input and output tensors already have the same layout (no checks in place)

This change attempts to fix the above issues:

  • Added typecast and toMemoryConfig ops
  • We determine where to tilize (if the tensor can be tilized on device and the tensor will be moved to device/is on device, we tilize on device)
  • We only move tensors to device with toDeviceOps, clear boundaries are set
  • We use fromDevice ops to read tensors back from device, and we always untilize on host (because device untilize has many issues mentioned in the PR description)
  • We remove the compound toMemoryConfig op and make it lightweight, such that all it does is convert memory configs of device tensors
  • We check whether toDeviceOps should be created, and abide by the constraints (e.g. one should not be created in conjunction with a fromDeviceOp, should not be created if the tensor is already on device, and should not be created before an illegal tilize/typecast, etc.)
  • We check the input/output tensor layouts and determine whether a toLayoutOp should be created (a rough sketch of these checks follows this list)
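A minimal sketch of the kind of checks described above (helper names are hypothetical, not the PR's actual code; dialect headers omitted):

// Only create a ToLayoutOp when the input/output layouts actually differ
// (e.g. ROW_MAJOR -> ROW_MAJOR creates nothing).
bool needsToLayoutOp(ttnn::Layout inputLayout, ttnn::Layout outputLayout) {
  return inputLayout != outputLayout;
}

// Only create a ToDeviceOp for a genuine host -> device transition; tensors
// already on device are reconfigured with ToMemoryConfigOp instead.
bool needsToDeviceOp(bool inputOnDevice, bool outputOnDevice) {
  return !inputOnDevice && outputOnDevice;
}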

The way I see it, these issues should be solved eventually, and should be solved before any further multi-chip changes go in. Otherwise we would need to update a fromDeviceOp that's not being used (and therefore cannot be tested easily), a compound toMemoryConfigOp that's checking everything (so tons of updates would need to go in, which would probably get removed anyway when we granularize it), and a toDeviceOp + toLayoutOp that have mixed boundaries and are also not properly checked.

Comment on lines 20 to 35
const bool transposeA = false;
const bool transposeB = false;
const std::optional<const ::tt::tt_metal::MemoryConfig> memoryConfig =
std::make_optional(outputMemoryConfig);
const std::optional<const ::ttnn::DataType> dtype =
std::make_optional(outputDataType);
const std::optional<const ::ttnn::operations::matmul::MatmulProgramConfig>
programConfig = std::nullopt;
const std::optional<const std::string> activation = std::nullopt;
const std::optional<const ::ttnn::DeviceComputeKernelConfig>
computeKernelConfig = std::nullopt;
const std::optional<const ::ttnn::CoreGrid> coreGrid = std::nullopt;

::ttnn::Tensor out =
::ttnn::matmul(lhs, rhs, transposeA, transposeB, memoryConfig, dtype,
programConfig, activation, computeKernelConfig, coreGrid);
Contributor

If we need to do this for matmul (let's first prove that we do), we should reflect it in the dialect itself. That way, EmitC can work as well.

TTNN_LayoutAttr:$layout);
let results = (outs AnyRankedTensor:$result);
}

def TTNN_TypecastOp : TTNN_Op<"typecast"> {
Contributor

ttnn::experimental::typecast op is a DPS op, we should adjust for that.

@@ -355,6 +360,79 @@ class FromDeviceOpConversionPattern
}
};

// TypecastOp conversion pattern
//
class TypecastOpConversionPattern
Contributor

The op lives in the ttnn::experimental namespace; you can override getPrefixSwapPattern(). GetDeviceOpConversionPattern has an example.

@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch from 37d60fd to 376d9ff on October 21, 2024 04:36
@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch from 376d9ff to 77e7c35 on October 21, 2024 04:48
@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch 2 times, most recently from a47e773 to 2cd1a7d on October 21, 2024 13:54
class ToLayoutOpConversionPattern
: public OpConversionPattern<ttir::ToLayoutOp> {
public:
using OpConversionPattern<ttir::ToLayoutOp>::OpConversionPattern;

LogicalResult
matchAndRewrite(ttir::ToLayoutOp op, OpAdaptor adaptor,
Contributor

My ask here would be to try to migrate this decomposition of ToLayoutOp into a separate TTNN pass that runs after lowering, and thus keep the lowering simple and clean. This would also allow the optimizer to work with a single memory layout manipulation op (for example ttnn::ToLayoutOp) while a separate pass properly decomposes it.

Contributor Author

Added a separate pass after the optimizer that decomposes the composite toLayout op, instead of decomposing during the lowering stage. Updated the optimizer accordingly.
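For readers unfamiliar with the approach, a very rough shape of such a post-optimizer pass is sketched below; this is not the actual implementation from the PR, and the pass name is hypothetical, with includes and pass registration omitted.

struct DecomposeCompositeToLayout
    : public mlir::PassWrapper<DecomposeCompositeToLayout,
                               mlir::OperationPass<mlir::ModuleOp>> {
  void runOnOperation() override {
    mlir::IRRewriter rewriter(&getContext());
    // Walk every composite layout op produced by lowering/the optimizer.
    getOperation()->walk([&](ttnn::CompositeToLayoutOp op) {
      // Compare the input/output layout, data type, and memory config, and
      // emit only the granular ops whose fields actually change
      // (ToDeviceOp/FromDeviceOp, ToLayoutOp for tilize/untilize, TypecastOp,
      // ToMemoryConfigOp), respecting the device constraints from the PR
      // description; then replace all uses of `op` and erase it.
      rewriter.setInsertionPoint(op);
      // ... op creation elided in this sketch ...
    });
  }
};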

Contributor

@jnie-TT Looks great on a first pass, will take a more detailed look on Monday morning.

Contributor

Looks good, please proceed.

@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch 2 times, most recently from 213b2f5 to 57cf309 on October 25, 2024 18:57
@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch 4 times, most recently from 294d326 to 2a088ea on October 26, 2024 18:49
@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch from 2a088ea to e148f14 on October 27, 2024 03:43
@@ -27,13 +27,36 @@ def TTNN_GetDeviceOp : TTNN_Op<"get_device"> {
let results = (outs TT_Device:$device);
}

def TTNN_CompositeToLayoutOp : TTNN_Op<"composite_to_layout"> {
Contributor

Is this going to be permanent? Perhaps we can replace this with ToLayout? As I understand it, the TTNN folks plan to make ttnn::to_layout flexible enough to support all kinds of layout transitions. In the meantime, we can specialize ToLayout to the respective lowered op.

Contributor Author

@nsmithtt that's a good point. I guess I can rename this op to ToLayout, and rename the downstream decomposed ToLayout to something like TTNN_TilizeUntilize. Later, when ttnn::to_layout supports all combinations of parameters, we can simply remove the downstream decomposition pass and use this op instead.

Contributor

Yeah, if you are going to change the name, it's a good idea to change the downstream decomposed op to something else.

@nsmithtt nsmithtt Oct 28, 2024

Or we just leave it? At the initial conversion / optimizer just targets:

  • ToLayout

Then, in the short term, we might have to lower ToLayout to one of:

  • ToMemoryConfig
  • Typecast
  • etc.
  • Or just leave it, if it's supported, as ToLayout

Over time, we can just erase the specialized lowering for cases where ToLayout has added support for it. I don't think we should introduce dummy ops if we don't need to.

@jnie-TT jnie-TT Oct 28, 2024

I think I see what you're saying. So we have a ToLayout op that captures all the parameters up front, effectively doing what the current CompositeToLayoutOp does (maybe with a couple of parameters set as optional, since ttnn::to_layout might not support them yet), and we would just use the same op for tilize/untilize. Eventually we can merge the decomposed ops into this op as ttnn supports them, and make subsequent arguments required instead of optional. Does that sound right?

Contributor

I don't have a hard preference here, but leaving the same name is, in my opinion, bug prone. Reordering passes may create hard-to-debug issues with an op that has a different meaning depending on where it's used.

Contributor

@jnie-TT yeah that's right.

@nobradovictt, I agree it's not ideal, but neither is introducing a dummy op, especially if the eventual semantics of the ToLayout op are exactly how we want to use it. Think of it as a workaround: we lower to ToLayout, then workaround passes at the end specialize out all the cases that ToLayout doesn't support yet.

Contributor

+1 on having the same name for the op. One of the invariants of the TTNN dialect that we want to adhere to is that it should map to the TTNN lib as closely as possible. Having an op that doesn't exist in the TTNN lib breaks this invariant.

For practicality reasons, we could have a flag on the op itself that lets us know whether the op has been "decomposed" or not. If anyone feels that there's a need for it today, feel free to open an issue and assign it to me.

@nobradovictt nobradovictt Oct 28, 2024

@svuckovicTT I could make the same argument against having an op with the same name as in the TTNN lib that carries a completely different meaning (i.e. that it pushes the TTNN dialect further from the TTNN lib). I personally find it even worse: it's a "wrong result", but insidious. It's easy to assert that a dummy op is not expected in the final IR, but if I show you IR with a toLayout op, will you be able to tell me whether it went through the decomposition pass?
But as this is a temp workaround, and more of you seem to prefer the same-name option, I'm fine with that.

Contributor Author

I'll create a separate issue to track this.

Comment on lines +11 to 13
bool isCompositeToLayoutOp(mlir::Operation *op) {
return isa<ttnn::CompositeToLayoutOp>(op);
}
Contributor

No need for this util function anymore since it is a single op again.

@@ -98,7 +121,7 @@ void registerTTNNPipelines() {
mlir::PassPipelineRegistration<
mlir::tt::ttnn::TTIRToTTNNBackendPipelineOptions>(
"ttir-to-ttnn-backend-pipeline",
"Pipeline lowering ttir to ttmetal backend.",
"Pipeline lowering ttir to ttnn backend.",
Contributor

Nice catch :)

…essary and obey constraints in a separate pass, properly model toLayout op.
@jnie-TT jnie-TT force-pushed the jnie/mem_config_refactor_squashed branch from e148f14 to 4088fc9 on October 28, 2024 15:01
@svuckovicTT svuckovicTT left a comment

Hey @jnie-TT, thanks for taking the time to rework the ToLayout conversion, I think this is much cleaner now :)

Looks good from dialects perspective, can you please make sure to add the missing issues? (layout=none, typecast to dps, and dtype=none)

Great work!

UInt32 = 9,
UInt16 = 10,
UInt8 = 11,
None = 0,
Contributor

Similar to TTNN_TensorMemoryLayout_None, we should create an issue to track the removal of this.

ttnn::BufferTypeAttr::get(op.getContext(), bufferType));
if (newOutputLayoutEnum == ttnn::Layout::Tile) {
TileType tileType =
TileType::get(rewriter.getContext(), {32, 32}, outputDtype);
Contributor

{32, 32}

Can you use mlir::tt::ttnn::TTNN_TILE_HEIGHT and mlir::tt::ttnn::TTNN_TILE_WIDTH instead?

TileType tileType =
TileType::get(rewriter.getContext(), {32, 32}, outputDtype);
llvm::SmallVector<int64_t> newShardShape =
tileType.getTiledShape(llvm::SmallVector<int64_t>(
Contributor

llvm::SmallVector<int64_t>

Is shard shape always (or most of the time) 2 values? If so, we should use llvm::SmallVector<int64_t, 2>.


jnie-TT commented Oct 28, 2024

Hey @jnie-TT, thanks for taking the time to rework the ToLayout conversion, I think this is much cleaner now :)

Looks good from dialects perspective, can you please make sure to add the missing issues? (layout=none, typecast to dps, and dtype=none)

Great work!

For sure, I'll create separate issues for this to track. Thanks!
