
[Bug] BuiltinLower has hard-coded heuristic for alloca which is not appropriate for all kDLCPU target devices #9022

Closed
mbs-octoml opened this issue Sep 15, 2021 · 26 comments
Labels
tir:transform (TIR transforms: src/tir/transforms and python/tvm/tir/transforms), type: bug

Comments

@mbs-octoml
Contributor

mbs-octoml commented Sep 15, 2021

if (device_type_.defined()) {

This particular code has a complex history, but its purpose is to avoid pool allocations for small tensors on CPU targets (a big perf win). However, not all devices of type 'CPU' support alloca, since the stack may not be hardware addressable. In particular, the EthosU target requires all memory ops to be rewritten to the pool allocation primitives for later handling.

So the logic that fires this early-exit heuristic (so that a later pass will rewrite allocs to allocas) is too hard-coded.

An easy fix would be to make the kMaxStackAlloca threshold zero on targets that do not support alloca. That can be done by moving the value to a TargetKind attribute.
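As a rough sketch of that fix (names and the lookup table here are illustrative only; the real change would go through TVM's TargetKind attribute registry rather than a hard-coded map), the threshold could become target dependent:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch: kMaxStackAlloca as a per-TargetKind attribute instead
// of a global constant. The attribute values below are illustrative.
int64_t MaxStackAllocaBytes(const std::string& target_kind) {
  static const std::map<std::string, int64_t> kMaxStackAllocaAttr = {
      {"llvm", 1024},  // regular CPU: small allocations may go on the stack
      {"c", 1024},
      {"ethos-u", 0},  // stack not hardware addressable: never use alloca
  };
  auto it = kMaxStackAllocaAttr.find(target_kind);
  return it != kMaxStackAllocaAttr.end() ? it->second : 1024;  // current default
}

// The early-exit heuristic then becomes target dependent.
bool UseStackAlloca(const std::string& target_kind, int64_t size_bytes) {
  return size_bytes <= MaxStackAllocaBytes(target_kind);
}
```

Setting the attribute to zero for a target disables stack allocation entirely without touching the lowering logic itself.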

@tqchen
Member

tqchen commented Sep 15, 2021

@mbs-octoml To give a bit of context

In the context of CPU, we want to preserve small allocas until the code generation point, where the code generator will emit the stack alloca explicitly. Only when the memory is big enough (bigger than a constant threshold) do we use an opaque allocation instead.

Stack allocation is important for the performance of CPU code. In TVM, we do not have an explicit concept of registers in most cases. Instead we rely on LLVM's mem2reg pass to recognize constant-indexed accesses into a stack allocation and turn them into registers, so the code can run effectively. Removing this code path would therefore complicate the code-generator-side optimization quite a bit and slow down CPU code.

Of course this can be a target-specific thing. LowerTVMBuiltin currently assumes it only runs on host (CPU) code.

  • Allocate always prefers (native) stack allocation when possible, but also allows other means of opaque allocation (as long as the allocation is fulfilled)
  • There are, however, cases when stack allocation is not possible
    • When the size of the requested memory is too big, stack alloca would explode the stack space (that is why there is a size check in the CPU case, and the use of a global opaque allocation was meant as a fallback to avoid stack overflow in models with big intermediate temp space)
    • LowerTVMBuiltin was originally designed to run on the host side, which means that as soon as an allocation concerns device-side memory, it needs to call into a (host-side) device API to allocate the memory instead

So the rationales for the specific CPU-side logic:

  • We want stack alloca on the host when possible (to gain the mem2reg optimization)
  • When the requested size is too large, we fall back to opaque workspace allocation on the heap, so the code can safely handle big temp memory requests as well as dynamic-size allocation requests.
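The two rationales above can be sketched as a single decision (an illustrative toy, not TVM's actual code): constant sizes under the threshold become stack allocas so mem2reg can promote them; dynamic or large requests fall back to an opaque workspace allocation.

```cpp
#include <cstdint>

// Illustrative sketch of the CPU-side rationale: small constant-size requests
// go on the stack; large or dynamic-size requests use an opaque workspace.
enum class AllocKind { kStackAlloca, kOpaqueWorkspace };

constexpr int64_t kMaxStackAlloca = 1024;

AllocKind ChooseAllocation(bool size_is_constant, int64_t size_bytes) {
  if (size_is_constant && size_bytes <= kMaxStackAlloca) {
    return AllocKind::kStackAlloca;  // small and known at compile time
  }
  return AllocKind::kOpaqueWorkspace;  // too big, or size only known at runtime
}
```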

My guess is we need to look into why the VM cannot work with code that allocates on the stack in the multiple-target case.

@mbs-octoml
Contributor Author

mbs-octoml commented Sep 16, 2021

Thanks so much for the context. I'll try to capture that in a comment.
This is a 'bug' only in the sense that this heuristic is not working for the EthosU AOT codegen, I think because they expect to intercept the workspace pool ops downstream? But it does suggest a very simple way forward: make kMaxStackAlloca a TargetKind attribute so they can force it to zero.
@manupa-arm can you chime in here?

@tqchen
Member

tqchen commented Sep 16, 2021

Right, this then gets to the target-dependent generation regime, where a TargetKind attribute is indeed the right solution. We should also send a PR to add comments to that code block so we have more context in the future.

@manupak
Contributor

manupak commented Sep 16, 2021

@tqchen @mbs-octoml ,

This is not specific to the Arm(R) Ethos(TM)-U codegen; it is generally applicable to any microcontroller where we want to avoid allocating memory on the stack and instead service allocations via a platform abstraction handled through TVMBackendAllocWorkspace --> TVMPlatformAllocate.

This only showed up in the Arm(R) Ethos(TM)-U codegen because we use TVMPlatformAllocate to allocate memory from a buffer placed in memory that is accessible by both the CPU and the NPU. Thus, it is a functional bug.
However, with this change, current main produces code that has much higher stack usage for micros, which is not desired.

cc : @u99127 @areusch

Stack allocation is important for the performance of CPU code. In TVM, we do not have an explicit concept of registers in most cases. Instead we rely on LLVM's mem2reg pass to recognize constant-indexed accesses into a stack allocation and turn them into registers, so the code can run effectively. Removing this code path would therefore complicate the code-generator-side optimization quite a bit and slow down CPU code.

The correct way to represent this seems to be using tir.allocates with storage_scope="local" for device=CPU to go onto the stack. For targets that need this behavior, there should be an explicit pass that converts them to 'local' so they are placed on the stack, rather than assuming this default behaviour.

Of course this can be a target-specific thing. LowerTVMBuiltin currently assumes it only runs on host (CPU) code.

Allocate always prefers (native) stack allocation when possible, but also allows other means of opaque allocation (as long as the allocation is fulfilled)
There are, however, cases when stack allocation is not possible
When the size of the requested memory is too big, stack alloca would explode the stack space (that is why there is a size check in the CPU case, and the use of a global opaque allocation was meant as a fallback to avoid stack overflow in models with big intermediate temp space)
LowerTVMBuiltin was originally designed to run on the host side, which means that as soon as an allocation concerns device-side memory, it needs to call into a (host-side) device API to allocate the memory instead

Clearly, this definition of CPU leaves out micros.
It feels wrong to emit allocates with "global" storage_scope directly into a CPU PrimFunc where they end up as stack allocations; rather, they should be serviced via a TVMBAW call, which moves the responsibility to the runtime/application layer.

So the rationales for the specific CPU-side logic:

We want stack alloca on the host when possible (to gain the mem2reg optimization)
When the requested size is too large, we fall back to opaque workspace allocation on the heap, so the code can safely handle big temp memory requests as well as dynamic-size allocation requests.

This certainly sounds like we could use an optimization pass to convert the tir.allocate's storage_scope for targets that require it, rather than making that the default behaviour for tir.allocates with "global" storage scope.

cc : @tom-gall @mbaret @Mousius

@manupak
Contributor

manupak commented Sep 16, 2021

Our proposal is to add a check in that loop for 'local' storage_scope before we place allocations on the stack, as that is the solution that works for the wider definition of CPU, rather than performing a hidden optimization in the codegen that is applicable only to a subset of CPUs.

@tqchen
Member

tqchen commented Sep 16, 2021

@manupa-arm on CPU we do not necessarily differentiate local from global for now, as they come from the same namespace.

I can understand the need from the micro side, and I believe this can be resolved by making LowerTVMBuiltin target-dependent and querying the max stack alloca property from the target. We can set the value to zero for micro-related targets.

@manupak
Contributor

manupak commented Sep 16, 2021

I feel that is a workaround for an optimization that a certain set of CPUs requires.

Moreover, TVMBAW provides an abstraction that can use an arena placed in any memory for non-micro cases. Therefore it feels like the more general solution, unless a specific target/option requires placement on the stack.

Landing a change that provides an optimization for certain CPUs in certain cases while creating functional problems for other backends does not seem like the right thing to do. I'd rather revert the change for now and land the correct one that works for the general case.

@tqchen
Member

tqchen commented Sep 16, 2021

Thanks for the discussions. Before we suggest a resolution, it would be helpful
to summarize the discussions so far.

Semantics of Allocate and storage_scope

Allocate at the moment does not specify the way the memory is allocated.
The lowering phase is free to pick a runtime-dependent mechanism or a
more platform-native one (e.g. stack alloca or direct declaration of shared memory).

In practice, we see two different kinds of needs:

  • N0: The need for optimizing "device" level code.
  • N1: The need to be able to allocate memory from the host side, possibly
    in a runtime dependent way.

N0 is needed to get the best-performing kernel, since a native way of allocation
that the compiler can understand gives the best chance for follow-up
optimizations. This is the case for CPU-related optimizations. Note that such
optimization is also needed in the micro setting, when we use TIR to generate kernel
code that requests temp scratch memory for output tiling.

N1 is needed for "driving" the set of computations. In the case of GPU, we have
to rely on a host-side driver function. In the case of micro AOT, it is desirable
to allocate using platform-specific memory.

One possible approach is to ask the user to differentiate these two kinds of
allocations. However, to give the user maximum convenience, TVM allows a mixed host/device
programming model like CUDA and hides that complexity from the user. The boundary
between N0 and N1 can also be blurred, e.g. a kernel can request allocation from
platform-dependent memory.

In specific specialized target devices, we also see an additional need:

  • N2: Allocating memory with special semantics.

For example, a unified device pinned memory that is accessible from both NPU and CPU,
a specialized texture memory, or shared memory. The request in N2 is quite different
and brings additional requirements on how we allocate the memory and how the memory can be
used and represented in codegen.

Right now N2 can be covered by introducing a special memory tag -- "global.npu_unified".
This usually lets TVM know that the memory needs to be lowered by possibly
special target-dependent passes, and leave it out otherwise. It is important to do this because
the compiler makes no assumption that "global" can be accessed by other types of devices.

Possible Resolutions

Based on the discussions so far, there are several suggested resolutions.

  • R0: Separate out a "local" scope that carries the stack allocation behavior. (proposed by @manupa-arm )
  • R1: Keep the "global" scope as it is, and introduce a special tagged scope "global.workspace"
    that represents a global memory request specifically for workspace memory.
    And introduce lowerings for them. For specialized memory (e.g. NPU unified), introduce
    separate memory scopes.
  • R3: Introduce a target-specific attribute that marks the possible stack alloca size for
    lowering R0 ("local") or R1 ("global"). Note R3 can be done in addition to R0 or R1.
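For reference, a tagged scope such as "global.workspace" factors into a base scope plus a tag. A minimal sketch of that split (illustrative, not TVM's actual scope parser):

```cpp
#include <string>
#include <utility>

// Minimal sketch of splitting a scope string such as "global.workspace" into
// a base scope and an optional tag. Illustrative only.
std::pair<std::string, std::string> SplitStorageScope(const std::string& scope) {
  auto dot = scope.find('.');
  if (dot == std::string::npos) return {scope, ""};  // untagged scope
  return {scope.substr(0, dot), scope.substr(dot + 1)};
}
```

Lowerings for R1 would dispatch on the base scope and use the tag to pick a specialized behavior.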

@tqchen
Member

tqchen commented Sep 16, 2021

So in the above post I tried to summarize the state. Now let me try to share some of my thoughts based on the summary.

First of all, R0 and R1 are not that different in nature. Both try to introduce two separate scopes that bring different behavior. The main question boils down to how we name the "global" scope.

Per Allocate semantics, we treat "global" as normal CPU memory, which can come from the stack or a platform-specific allocation. The system can choose the best way to do such lowering. Always lowering to TVMBAW is indeed more general for the need of N1. However, need N0 favors stack allocation when possible. Note that we will likely need related behavior for micro devices as well when generating operator kernels.

While it is OK to differentiate stack-allocated memory from a platform-specific one, doing so would bring an additional burden to the user and would require a significant refactor of the operator implementations.

The main requests so far come from need N1. In that case, it would be easy for the AOT generator to allocate memory with a special tag ("global.workspace") that enforces workspace allocation, since in this setting there is a single expected behavior.

So my suggestion would be R1 + R3, as it helps resolve the need in a way that is compatible with the current semantics and use cases. It will also open the door to more future scope-dependent optimizations.

@manupak
Contributor

manupak commented Sep 16, 2021

Thanks @tqchen for summarizing the ideas and presenting possible resolutions.

The two needs seem very valid.

For N0, the operators should really be tagged with the 'local' storage scope, as these allocations are quite local to the operator PrimFunc and they benefit from further optimization within and beyond TVM -- i.e. the follow-up C compiler / LLVM.

For N1, we could use the 'global' tag to give the responsibility to the application/runtime layer to service the allocation.

Therefore, the actual fix should have been tagging the allocates that are expected to be optimized as 'local' to the PrimFunc, rather than treating 'global' allocates on CPU as local.

N0 is needed to get the best-performing kernel, since a native way of allocation
that the compiler can understand gives the best chance for follow-up
optimizations. This is the case for CPU-related optimizations. Note that such
optimization is also needed in the micro setting, when we use TIR to generate kernel
code that requests temp scratch memory for output tiling.

I feel we are incorrectly tagging the storage_scopes here. They should really be 'local' for this specific use case.

First of all, R0 and R1 are not that different in nature. Both try to introduce two separate scopes that bring different behavior. The main question boils down to how we name the "global" scope.

In the solution R1, I still see that as a workaround for the incorrect treatment of 'global'-scoped memories, where we create an override of an actual 'global' that we declare as 'global.workspace'. In shared-memory SoCs, it would be an un-scalable explosion of tags if we keep tagging memories for the devices that have access to them. I would think we want to treat memories and devices as having a many-to-many relationship.

The lowering we had until a few days ago was general enough that allocations were serviced by a runtime/application-layer routine, which was more aligned with what we call 'global' (with respect to codegen) scoped storage.

Per Allocate semantics, we treat "global" as normal CPU memory, which can come from the stack or a platform-specific allocation

Can you explain what you define as 'normal CPU memory'? A CPU can technically have access to many memories.

However, need N0 favors stack allocation when possible. Note that we will likely need related behavior for micro devices as well when generating operator kernels.

It would have been nice to have an RFC (apologies in advance if I missed it, if there was one already) to discuss this before we moved away from TVMBAW-style allocation, which I find more generic than just stack allocation. It almost feels like the schedules should have tagged these 'local' if this was the expectation, rather than assuming the combined logic of 'global' and CPU.

One possible approach is to ask the user to differentiate these two kinds of allocations.

Wouldn't it be simpler to tag allocations for N0 as 'local' and N1 as 'global'?

N2: Allocating memory with special semantics. For example, a unified device pinned memory that is accessible from both NPU and CPU, a specialized texture memory, or shared memory. The request in N2 is quite different and brings additional requirements on how we allocate the memory and how the memory can be used and represented in codegen.

It's a memory that multiple Targets have access to, which the runtime/application could provide via TVMBAW with a specialized global.<pool_name>.

It is important to do this because the compiler makes no assumption that "global" can be accessed by other types of devices.

Hmmmm, this argument seems counter-intuitive to me. I think we should assume 'global' to be accessible unless it is explicitly specified to be restricted, i.e. global.<pool_name>. Otherwise, the terminology is confusing.

R0: Separate out a "local" scope that carries the stack allocation behavior. (proposed by @manupa-arm )
R1: Keep the "global" scope as it is, and introduce a special tagged scope "global.workspace"
that represents a global memory request specifically for workspace memory.
And introduce lowerings for them. For specialized memory (e.g. NPU unified), introduce
separate memory scopes.
R3: Introduce a target-specific attribute that marks the possible stack alloca size for
lowering R0 ("local") or R1 ("global"). Note R3 can be done in addition to R0 or R1.

Ideally, the allocates destined to end up on the stack should have been 'local'. Moreover, at that point we can decide not to tag allocates that exceed the target-specific attribute, rather than dealing with this in the codegen or the LowerTVMBuiltin pass.

I would propose :

R4 : R0 + I think it's cleaner if we just introduce an optional pass to tag memories with 'local'. At that point, we should only tag them according to the target-specific max stack alloca size attribute. That could work until we fix the schedules that want allocations on the stack to use the 'local' storage scope for need N0 -- a conscious decision the schedule writer / autoscheduler takes.
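A minimal sketch of what such an optional tagging pass could do (hypothetical structures; a real pass would rewrite the Allocate nodes' pointer types): retag 'global' allocates as 'local' only while the running total stays within a target-specific stack budget.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of the optional pass proposed in R4: mark allocates as
// 'local' (stack-eligible) only while a target-specific stack budget holds.
struct AllocRecord {
  int64_t bytes;
  std::string scope;
};

void TagStackAllocates(std::vector<AllocRecord>* allocs, int64_t max_stack_bytes) {
  int64_t used = 0;
  for (AllocRecord& a : *allocs) {
    if (a.scope == "global" && used + a.bytes <= max_stack_bytes) {
      a.scope = "local";  // eligible for stack placement / mem2reg
      used += a.bytes;
    }
  }
}
```

Note that budgeting the cumulative total, rather than each allocate in isolation, also addresses the unboundedness concern raised later in the thread.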

@tqchen
Member

tqchen commented Sep 16, 2021

Thanks @manupa-arm . Trying to capture some of the discussions.

  • Right now the "global" scope translates to something that can be accessed by the CPU, and there was no requirement to allocate via TVMBAW. This is consistent with the Allocate node for other kinds of scopes as well, such as shared memory on a GPU.
  • I can understand the general proposal R4: changing "global" to "local" would introduce a semantic change of the "global" scope, and can cause backward-compatibility issues.

The main reason I would suggest R1 is that it is a backward-compatible change (no changes to topi and other libraries needed). Additionally, it reduces the user's overall mental cost of thinking about and choosing another kind of memory.
While I can totally get the other interpretation as well (defining "local" to mean stack allocation and "global" to mean absolutely not), it is not the current intended semantics.

Note that introducing an explicit tag for TVMBAW via "global.workspace" won't necessarily mean we need to introduce tags for all kinds of memory. It does place constraints on how this kind of memory can be allocated. So things can work out of the box.

@manupak
Contributor

manupak commented Sep 16, 2021

Thanks @tqchen .

In R4, I was not suggesting changing TOPI; I was saying we could just add a pass that changes the storage scope. What are your thoughts on having that pass produce the storage_scope 'global.stack'? This should be backward compatible, as it just performs an explicit specialization transparently in the IR. We could even do the max alloca size check in this particular pass.

This makes it much clearer and more specific. Moreover, 'global.stack' will not require further specialization, unlike 'global.workspace' where we would want to choose between multiple workspace buffers. The reason is that I'd like to keep the tag denoting which global memories the compute devices have access to.

@mbs-octoml
Contributor Author

mbs-octoml commented Sep 16, 2021

Plus @csullivan, who also needs finer-grained scopes.
Can someone explain how the current Stage scope, which ends up in ProducerRealize nodes, would be coherent with the Allocate scope?
@manupa-arm if you wanted to unblock EthosU today you could make the kMaxStackAlloca heuristic Target-defined. That would buy us time to tackle scopes properly.

@tqchen
Member

tqchen commented Sep 16, 2021

Thanks @manupa-arm . I understand that proposal R4 can also work by having a pass convert "global" to something more specialized (essentially R1 and R4 are not that different except for the choice of scope names).

The main question is the semantics of the scope "global". Each memory scope represents a "constraint" on what kind of memory it is.

Right now, when the device type is CPU, "global" means any memory that can be accessed by the host CPU. This means the actual implementation can include TVMBAW, memory from the stack, or memory allocated by other means. While memory allocated by TVMBAW can have other benefits (e.g. being accessible by other devices because it is pinned), that is not the constraint specified by the "global" scope.

We can of course further constrain the setting, say to "global.workspace", which reduces the possible ways to allocate the memory but still does not preclude choosing between multiple workspace buffers.

So from a semantics point of view, the pass can indeed choose to keep "global" or rewrite it to "global.stack" to ensure a stack allocation. But if the scope remains "global", we should not preclude downstream passes from allocating from the stack (the code generator should be able to choose any kind that satisfies the constraint). To say it another way, we cannot say that "global" definitely means no stack allocation.

If the code needs to impose an additional constraint that the memory must be accessible from a separate device (e.g. NPU), it certainly requires a more specialized constraint that is better spelled out explicitly.

As we can see, this is another kind of flexibility we want to enable here: the flexibility to pick possible backend allocation implementations without over-constraining the code generator to a backend-specific behavior that is platform dependent (like the case of pinned memory).

@masahi
Member

masahi commented Sep 17, 2021

Can someone explain how the current Stage scope, which ends up in ProducerRealize nodes, would be coherent with the Allocate scope?

@mbs-octoml The PR #8366 I did might answer your question. The storage scope attached to ProducerRealizeNode is used to create the PointerType and Buffer. The Allocate node queries the storage scope associated with its buffer variable's PointerType.

@manupak
Contributor

manupak commented Sep 17, 2021

@tqchen ,

To say it another way, we cannot say that "global" definitely means no stack allocation.

The current issue is that device "CPU" && 'global' certainly means definite stack allocation if the size is less than the heuristic threshold, not the other way around.

If the code needs to impose an additional constraint that the memory must be accessible from a separate device (e.g. NPU), it certainly requires a more specialized constraint that is better spelled out explicitly.

As we can see, this is another kind of flexibility we want to enable here: the flexibility to pick possible backend allocation implementations without over-constraining the code generator to a backend-specific behavior that is platform dependent (like the case of pinned memory).

Yes, this is something we want eventually, and we will be working towards achieving it with the USMP work.

Until we have that, the natural assumption in the absence of a 'constraint' should be that memories are more accessible rather than less accessible (e.g. the stack). It's unfortunate that the current design prefers the latter, especially in the absence of a constraint.

@mbs-octoml ,

Short term solution :

I think you are right; we might want to unblock this using a target-dependent kMaxStackAllocaSize.

May I ask why the default was chosen to be this?

/*! \brief Maximum size that can be allocated on stack */
constexpr int kMaxStackAlloca = 1024;

It's interesting because total stack usage can go beyond that size, since the check only looks at a single allocate at a time; i.e., you could have multiple allocates that are each less than 1024. So stack usage is not even bounded by the current approach.

Therefore, to both unblock us on Ethos-U and also somewhat solve the problem that current micro builds use the stack for tensors < 1024 instead of the provided workspace buffer, maybe we should just make kMaxStackAllocaSize=0 (a binary decision rather than a value range).
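To illustrate the unboundedness point with a toy model (not TVM code): every request passes the per-allocate check, yet the total can far exceed the threshold.

```cpp
#include <cstdint>
#include <vector>

// Toy model of the current per-allocate check: each request below the
// threshold is placed on the stack, so the *sum* of stack usage is unbounded.
constexpr int64_t kMaxStackAlloca = 1024;

int64_t TotalStackBytes(const std::vector<int64_t>& requests) {
  int64_t total = 0;
  for (int64_t r : requests) {
    if (r <= kMaxStackAlloca) total += r;  // each passes the single-allocate check
  }
  return total;
}
```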

@Mousius @leandron @areusch , this means there is going to be another argument for a simple micro deployment added to the already long list of arguments. Something like "--use-external-workspace"?

@tqchen , I still feel it would have been super helpful if kMaxStackAllocaSize were zero by default, with the option of going higher via a user argument, e.g. --max-stack-alloca-size=1024. It is not very convincing that we place tensors on the stack for the prospect of mem2reg optimization without any marking (i.e. a special storage scope).

@tqchen
Member

tqchen commented Sep 17, 2021

Please allow me to explain the overall rationale here, in particular over the term "constraint"

  • C0: On one hand, we want a "default" memory to be generically accessible (per @manupa-arm 's comment) in all cases, so runtime libraries can be built to leverage that generic property (e.g. access from the NPU).
  • C1: On the other hand, from the compiler's POV, we want to leave flexibility to the code optimization and codegen phases, and only constrain the properties we need (e.g. accessible from the CPU).

Two ways to see the constraints

The way C0 sees the constraint is about the possible accessors of the memory:

  • "global"=> memory can be accessed from {cpu, npu, other devices}
  • "global.stack" => memory can be accessed from { cpu }

The way C1 sees the constraint is about possible memories to choose from.

  • "global"(memory that is accessible from CPU) => can choose from {stack, TVMBAW, other allocators}
  • "global.workspace" => can choose from { TVMBAW }
  • "global.stack" => can choose from { stack }
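The C1 reading can be sketched as a mapping from scope to the set of allocators the lowering may pick from (scope and allocator names are illustrative):

```cpp
#include <set>
#include <string>

// Sketch of the C1 view: each scope constrains which allocators the lowering
// may choose from. Scope and allocator names are illustrative only.
std::set<std::string> AllowedAllocators(const std::string& scope) {
  if (scope == "global") return {"stack", "TVMBAW", "other"};
  if (scope == "global.workspace") return {"TVMBAW"};
  if (scope == "global.stack") return {"stack"};
  return {};  // unknown scope: no allocator may be assumed
}
```

A more constrained scope maps to a smaller set, which is exactly the point made below about constraints reducing transformation opportunities.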

Discussions

When we say one compiler IR is more constrained than another, we usually mean that fewer optimizations can be performed, because there is less flexibility in rewriting. For example, the volatile keyword puts additional constraints on memory access.

This makes C1 more aligned with common compiler IR design. Note that "memory that is accessible from all devices" is a term that depends on the specific runtime platform, and is not very well defined in a generic IR.
The more constraints we put on the memory itself, the smaller the set of candidate memories becomes. As a result, there are fewer opportunities
for code transformations and optimizations.

Under the current semantics, "CPU" && "global" can result in stack allocation. Note that this is one kind of flexibility we want to offer to later stages so that specializations can be made.

  • So yes, it is indeed OK for a pass to map "global" to TVMBAW; the resulting program will run slower, but still correctly, on CPU.
  • It also does not preclude LowerTVMBuiltin from taking advantage of the semantics to choose stack allocation, which usually benefits performance.

One thing we should keep in mind is that the codegen and AOT compiler should not rely on the behavior of LowerTVMBuiltin for correctness (since it can choose to do anything in the case of "global", including dispatching to another custom allocator). If a special kind of memory is needed, we should declare that constraint through the IR. Attaching a special scope is the best way to do so under the current semantics, regardless of the implementation of LowerTVMBuiltin.

LowerTVMBuiltin picks kMaxStackAllocaSize as a heuristic number that maximizes the benefit of stack allocation without blowing up the stack. Of course a better heuristic could be used; setting the default to 0 would bring down the performance of a lot of CPU code, so it is not desirable. We could certainly have a target-dependent property for micro targets and set it to 0. It should be pretty easy to do as well; see https://github.com/apache/tvm/blob/main/src/tir/transforms/lower_warp_memory.cc#L392, which obtains a target-dependent warp size.

@manupak
Contributor

manupak commented Sep 17, 2021

Thanks @tqchen for the explanation of the two viewpoints on how we can see the constraints.

I do agree that we should add more constraints progressively to guide the compiler in the lowering.

It is just that the treatment of TVMBAW as a peer to heap and stack seemed not right; in fact, one could use TVMBAW to perform heap allocation. My concern was that we are taking a decision way down in the compilation flow that could have been taken transparently in the IR a bit higher up.

I think we are moving there with scopes, but it would have been nicer to stick with TVMBAW for now, as it is the more general lowering for 'global', and I would not necessarily see it as an orthogonal choice to the list you mentioned. It just boils down to the fact that we want allocations on the stack for mem2reg optimizations. While I don't disagree with the logic of that argument, I wish it were more explicit higher up in the compilation flow. If it were not for mem2reg optimizations, one could simply provide an arena that supplies the workspace just as the stack would; thus TVMBAW seemed to me like a better interim solution until we specialize the scope in the lowering.

Yes, as you suggested, the implementation of the target-dependent query for the max alloca size is not particularly challenging; it is the API that we provide to the user that we were worried about. This is important especially since "micro" is not really a target in TVM -- so the short-term solution seems to require one of the following TargetKind attributes for the C and LLVM backends:

A1 : --max-stack-alloca-size
A2 : --max-stack-size
A3 : --use-backend-alloc-workspace

So ideally, from a UX point of view, it would be better to give the user the A2 argument; constraining the size of a single allocation seems like a proxy for controlling the behaviour of mem2reg transformations. However, A2 does not match the behavior controlled by kMaxStackAllocaSize.

A3, on the other hand, sets kMaxStackAllocaSize to zero and forces all allocates to be serviced by TVMBAW, which could be from the heap or an arena placed anywhere, as controlled by the runtime/application.

@tqchen
Member

tqchen commented Sep 17, 2021

Thanks @manupa-arm. I agree that putting TVMBAW as a peer to heap is not right (that was meant as an example to demonstrate the viewpoint; I do not necessarily want to enforce heap as a peer to TVMBAW).

For the case of AOT, however, I think it would still be useful to introduce a specialized scope eventually, so we can place the constraint that the memory is accessible by the NPU explicitly in the IR.

While there is no micro-specific target, we can introduce micro-specific tags that set these properties. See https://github.com/apache/tvm/blob/main/src/target/tag.cc#L84 as an example of CUDA tags.

I think all three attributes you proposed could work. A1 might be the simplest for now, and it won't preclude the A2/A3 options in the future.
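For illustration, here is a rough, hypothetical sketch of what such a tag table could look like, modeled loosely on the CUDA tags linked above; the tag names and attribute keys below are made up for this example and are not actual TVM registrations:

```python
# Hypothetical tag table: each tag bundles target attributes, including a
# made-up "max-stack-alloca-size" key in the spirit of option A1/A3.
MICRO_TAGS = {
    "micro.some-cortex-m-board": {
        "kind": "c",
        "max-stack-alloca-size": 0,     # A3-style: force TVMBAW everywhere
    },
    "micro.host": {
        "kind": "llvm",
        "max-stack-alloca-size": 1024,  # A1-style: small tensors on the stack
    },
}

def resolve_tag(tag):
    """Look up a tag and return a copy of its target attributes."""
    return dict(MICRO_TAGS[tag])

print(resolve_tag("micro.host")["max-stack-alloca-size"])  # 1024
```

A tag table like this would let each board ship a sensible default without the user having to know about the alloca heuristic at all.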

@mbs-octoml mbs-octoml changed the title [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices [Bug] BuiltinLower has hard-coded heuristic for alloca which is not appropriate for all kDLCPU target devices Sep 20, 2021
@areusch
Contributor

areusch commented Sep 20, 2021

Apologies for the delay in replying here.

I agree that allocating on the stack is, in the current GraphExecutor world, a codegen-level optimization. In my mind, the introduction of AOTExecutor (and actually also VMExecutor) means that this is moving one level up. Here is why:

  • previously, the maximum effective call depth of a TIR function was 1, meaning that codegen could decide then and there that it made sense to move a particular local tir.allocate node onto the stack.
  • with the introduction of AOT and the control-flow-capable VMExecutor, such a tir.allocate may now live while another TIR function is being invoked, e.g. with tir.call_packed.

Now that this is the case, it doesn't make sense to make stack allocation of tir.allocate a codegen-level optimization. It must be done prior to codegen using graph-level memory analysis, e.g. USMP.

The question then is how to model this in the program. I think using storage_scope does make sense. The definition I've been working with for a storage_scope is:

A contiguous region of memory, potentially with a maximum size, which can be used by operator implementations on a set of tvm.Device.

Under this definition, a special <context>.stack storage_scope could be created with a target-specific bound on the amount of memory available. I do not think we should create a default global scope anywhere, as it's unlikely a global scope is truly ever global except in the case of single-CPU homogeneous model execution. We just haven't modeled this yet in TVM, as we don't have an explicit mapping indicating which storage_scopes are accessible by which device. The closest thing to a global scope I can think of would be something like an executor scope, which could be used to place DLTensors used by the AOT executor for intermediate tensors.
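The definition above can be modeled with a small, hypothetical sketch (illustrative only, not a TVM API): a named region with an optional size bound and an explicit set of devices that can access it, which is exactly the mapping the comment says TVM is missing.

```python
# Hypothetical model of a storage_scope as defined above: a contiguous region
# of memory, optionally bounded in size, usable from a set of devices.
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class StorageScope:
    name: str                      # e.g. "always-on-cpu.stack", "executor"
    max_size_bytes: Optional[int]  # None means no modeled bound
    devices: FrozenSet[str]        # devices that can access this region

    def can_place(self, device: str, num_bytes: int) -> bool:
        """True if `num_bytes` requested by `device` fits in this scope."""
        if device not in self.devices:
            return False
        return self.max_size_bytes is None or num_bytes <= self.max_size_bytes

# A target-specific stack scope with an explicit bound, as proposed above.
stack = StorageScope("always-on-cpu.stack", 8 * 1024, frozenset({"cpu"}))
print(stack.can_place("cpu", 4096))  # True
print(stack.can_place("npu", 16))    # False: the NPU cannot address this stack
```

With an explicit device set per scope, "global" stops being special: it is just a scope whose device set happens to contain everything.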

Something we still have yet to address is how to configure the storage_scopes made available for a particular Target. I think Target-specific attributes may not quite cut it, as the llvm target obviously would have different amounts of stack RAM available based on the selected CPU. And the user will need to be able to override this as well, since sometimes linking with different software libraries reduces the stack RAM available (e.g. some libraries require more global memory and therefore take away from the RAM available for stack usage; or other times, some libraries require massive interrupt stacks and a larger budget must be reserved to mitigate the risk of overwriting global memory due to an interrupt).

It's likely that a scalable solution in micro-land is to have the Project API server able to provide a boilerplate set of storage_scope in an e.g. server_info_query call; and then it's likely we will need to have a way for users to override this.

Lastly, as to the <context> prefix mentioned above: there are many cases when multiple stacks exist on a device:

  • multi-core execution
  • parallel thread execution (e.g. stack viewed as a sort of thread-local storage)

in the former case, it's likely that <context> should be replaced with the name given to the TargetDevice, e.g. in #8892. For example, dsp.stack or always-on-cpu.stack.

in the latter case, we probably additionally need a thread identifier e.g. dsp.thread0.stack.

Until we have USMP, my thoughts are that the short-term solution should be to stick with a codegen-level optimization and add an attribute which can be used to disable the stack-allocation optimization. What do you think @tqchen @manupa-arm ?

@tqchen
Member

tqchen commented Sep 20, 2021

One thing I want to note here is that the prefixes (global/shared/warp/local) have to do with the thread execution hierarchy, so we cannot add arbitrary prefixes here. We can, however, add any tag that follows.

In most cases the storage is assumed local to the execution context (of the thread), so we do not need to annotate threading as part of the storage scope.

@manupak
Contributor

manupak commented Sep 21, 2021

So there are two sub-issues here, if you follow the discussion:

P1: The AoT executor codegen creates tir.allocates that, more generally, should be served by TVMBAW calls.
P2: The CPU PrimFuncs generate stack allocations for some allocates based on a per-allocate heuristic.

For P1,
We are working on a PR (to be published soon) to make the AoT executor codegen create tir.allocates with the 'global.workspace' storage scope (as discussed above), ensuring they get lowered to TVMBAWs, which are device/memory-agnostic calls from a codegen perspective.

For P2,
We might need to control the stack-alloca heuristic for each compilation.

I feel that should solve the issues in scope.

@areusch shall we move the discussion about promoting/introducing a storage_scope higher than the current definition of 'global' to another issue/discuss post?

I agree with you that the current definition of 'global' is a misnomer. There are certain memories (e.g. the stack) that could be 'owned' by a TargetDevice -- in which case we could use <TargetDevice>.stack. In a way, the stack is special in that it does not require an opaque call to get hold of the memory, so LLVM (or any other post-TVM compiler) can still perform optimizations. Hence my emphasis on treating it specially with the 'local' storage_scope. However, I think that would be too big a change given the historical precedent in TVM, as @tqchen points out.

However, in shared-memory SoCs, assigning ownership of memories to a TargetDevice will not scale very well. For example, if we have an SRAM that is accessed by the CPU, DSP, and NPU in the same SoC, what should the prefix be? The current approach we are taking (and discussed in the RFC) for USMP is to use the storage_scope global.<pool_name>, where pool_name points to a PoolInfo that describes which TargetDevices have access to it. Therefore, I believe we could make an Arena that is pinned to the memories in the runtime layer to serve each Pool via TVMBAWs. If USMP is invoked, the static-sized allocations will be translated to fixed offsets within a static buffer pinned to the same memories. I guess my point is that the provision of Arenas/buffers should be lifted out of the codegen to the application/runtime layer and interfaced with by generated code via TVMBAWs (for non-static-sized allocates) or offsets (for static-sized allocates).
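The global.<pool_name> convention above amounts to splitting a scope string into a prefix and a tag. Here is a minimal sketch of that split (illustrative Python, not TVM code; the function name is made up):

```python
# Minimal sketch of splitting a storage_scope string such as
# "global.workspace" or "global.sram_pool" into its prefix and tag,
# mirroring the "global.<pool_name>" convention discussed for USMP.

def split_scope(scope):
    """Return (prefix, tag); tag is "" when the scope has no '.' in it."""
    prefix, _sep, tag = scope.partition(".")
    return prefix, tag

print(split_scope("global.workspace"))  # ('global', 'workspace')
print(split_scope("global"))            # ('global', '')
```

Because the prefix stays "global", the thread-hierarchy semantics @tqchen described are preserved, while the tag carries the pool identity for USMP to resolve.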

@manupak
Contributor

manupak commented Sep 21, 2021

P1 : #9065

@areusch
Contributor

areusch commented Sep 22, 2021

Okay @manupa-arm, this short-term solution seems workable to unblock the Ethos-U merge. I agree we should revisit how to solve this more generally in AOT once we have USMP and TargetDevice merged. We can discuss this in a separate post.

@masahi
Member

masahi commented Jan 9, 2022

Can we close this?

@manupak
Contributor

manupak commented Jan 24, 2022

P2 : #9950
Thanks @SebastianBoblestETAS

@areusch areusch added the needs-triage PRs or issues that need to be investigated by maintainers to find the right assignees to address it label Oct 19, 2022
@driazati driazati added tir:transform TIR transforms: src/tir/transforms and python/tvm/tir/transforms and removed needs-triage PRs or issues that need to be investigated by maintainers to find the right assignees to address it labels Nov 28, 2022
@tqchen tqchen closed this as completed Sep 20, 2024