
Conversation

@MasterJH5574
Contributor

This PR enhances the static block memory planning pass. Prior to this PR, memory planning only worked on memory allocations that are not externally referenced; externally referenced allocations were left to dynamic allocation. In dynamic shape settings, such dynamic allocation is not fully static and may lead to memory fragmentation.

This PR enhances the behavior so that, for such memory allocations, we first allocate a storage sized to the estimated upper bound (when known), and then allocate the tensor with the actual dynamic shape out of that storage. This keeps the memory allocation static and avoids memory fragmentation.
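To illustrate the idea, here is a toy model only (numpy stands in for the VM's storage/tensor objects, and the 128k upper bound is an assumed example): one storage sized to the upper bound is reserved once, and each call materializes its dynamically shaped output inside that storage.

```python
# Toy model of upper-bound storage + dynamic-shape tensor allocation.
import numpy as np

UPPER_BOUND = 128 * 1024  # statically known upper bound on n

storage = np.empty(UPPER_BOUND, dtype=np.float32)  # one static, plannable storage

def alloc_output(n: int) -> np.ndarray:
    """'Allocate' the dynamically shaped (n,) output out of the static storage."""
    assert n <= UPPER_BOUND
    return storage[:n]  # a view; no new allocation at runtime

out = alloc_output(1024)  # actual runtime shape (1024,)
```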

@junrushao
Member

I was not 100% sure whether it is desirable to set the output buffer to its size upper bound. A relax function can, by default, be called as many times as the caller wants, and if the output of a relax function is never deallocated, multiple copies of upper-bound-sized memory will be retained for an indeterminate lifespan.

@MasterJH5574
Contributor Author

@junrushao Just want to make sure I understand correctly: are you saying that when the function is called multiple times, there will be multiple allocations of the output buffer? Yes, there will indeed be multiple allocations if we do not use a pool allocator or never release allocated buffers. But I feel this behavior already exists when we do not allocate the upper-bound size for the output tensor. The main purpose of this PR is to make the output tensor buffer size purely static, so that we can avoid memory fragmentation at runtime when using a pool allocator.

@junrushao
Member

junrushao commented Nov 13, 2023

I was thinking about the following case: allocating a tensor of shape (n,), where the upper bound of n is huge, e.g. 128k, while the actual runtime value of n is usually small, e.g. 1k. This could be the scenario when the context length becomes gigantic in LLMs. In this case, static planning will always return a tensor of length 128k no matter what n is.

As compiler infrastructure, we cannot assume runtime use cases when no specific hints/annotations are given; in our particular case, this means we cannot assume the lifespan of the returned value when always allocating return tensors to their upper-bound memory usage. This behavior could be suboptimal if the caller of the relax function indefinitely extends the lifespan of the returned value, for example, keeping many return tensors with small n around for an indefinitely long period of time, each of which is allocated to the 128k upper bound but actually needs only 1k of memory.

@junrushao
Member

there will be multiple allocations of the output buffer?

My primary concern is about over-allocated buffers being kept for an indefinite lifespan, not merely a function being called multiple times.

so that we can avoid memory fragmentation in runtime when using a pool allocator.

Yeah this is definitely good to reduce fragmentation, but it also leads to unnecessary memory over-allocation and potential waste, so there's definitely a trade-off here.

@MasterJH5574
Contributor Author

MasterJH5574 commented Nov 13, 2023

@junrushao Yes, there is some amount of waste in terms of storage size for sure. As long as the upper bound is as tight as possible, I feel the “over-allocation” is not going to be a severe issue. As for the number of allocations, the change in this PR does not increase the number of allocations compared to before. (If we don't use a pool allocator, the number of allocations remains the same. And if we use a pool allocator, only one allocation happens.)

For the example of batching, say we can analyze the maximum possible batch size ahead of time and annotate that value as the upper bound. In a serving engine, every integer in [1, max_batch_size) will effectively be used as the real batch size at some point. In this case, only one static storage is allocated with this PR (when the pool allocator is enabled), which I believe is a great thing. If we do not allocate the output storage statically, there will be max_batch_size allocations, each with a different storage size, which, in the worst case, will cause the total memory held in the pool allocator (when enabled) to be O(max_batch_size^2) times the size of a single storage.
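To make that worst case concrete, here is a hypothetical toy model (not TVM's actual allocator) of a pool that caches freed storages by exact byte size: with exact-size allocation every distinct batch size leaves one storage behind, whereas upper-bound allocation reuses a single storage.

```python
# Hypothetical toy model: a pool that caches freed storages by exact size,
# so each distinct dynamic size leaves one storage behind in the pool.

def pooled_bytes_exact(max_batch_size: int, unit: int = 1024) -> int:
    # Exact-size allocation: over a serving run, every batch size in
    # [1, max_batch_size] appears, each leaving an i*unit-byte storage cached.
    return sum(i * unit for i in range(1, max_batch_size + 1))

def pooled_bytes_upper_bound(max_batch_size: int, unit: int = 1024) -> int:
    # Upper-bound allocation: a single max_batch_size*unit-byte storage is reused.
    return max_batch_size * unit

print(pooled_bytes_exact(128))        # 8454144 bytes, i.e. 8256 KB
print(pooled_bytes_upper_bound(128))  # 131072 bytes, i.e. 128 KB
```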

@junrushao
Member

junrushao commented Nov 13, 2023

I would love to emphasize that a generic compiler infrastructure usually does not assume a single use case or depend on runtime behavior; for example, GCC cannot assume AVX-512 instructions always exist unless explicitly told so.

I feel the “over-allocation” is not going to be a severe issue.

There's definitely a difference between personal feelings and objective factors in design choices, and I'd love to focus the discussion on the objective factors. Let me expand a bit on the example I drew previously in the thread: consider the case where we call a main method that returns a dynamic-shape buffer with actual length 1k but stored in a pre-allocated 128k buffer, iterating 1024 times:

std::vector<NDArray> outputs;
for (int i = 0; i < 1024; ++i) {
  NDArray logits = mod["main"](...); // size = 1k, but storage = 128k
  outputs.push_back(logits);
}
this->outputs = outputs;

It effectively means the outputs vector takes 1024 * 128k of RAM instead of 1024 * 1k. And to make things even less controllable, the compiler cannot assume or control when those over-allocations are recycled, meaning it's possible to craft cases where over-allocations dominate RAM usage (e.g., with a huge vocab) if the engineers are not 100% familiar with every detail of the compiler passes involved.

in the worst case, will cause the total memory in the pool allocator (when enabled) to be O(max_batch_size^2) times of a single allocated storage

As an alternative, I'd love to point out that upper-bound allocation is not the only solution to de-fragmentation; there are indeed well-practiced solutions, for example bucketing, which places an allocation of size within (2^(i-1), 2^i] into bucket i of memory slots, as well as the buddy allocator as its straightforward generalization.
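For reference, a minimal sketch of the bucketing idea (not TVM's allocator): requests are rounded up to the next power of two, and freed storages are reused by any later request that falls in the same bucket.

```python
# Minimal sketch of power-of-two bucketing; bucket i serves request sizes in
# (2**(i-1), 2**i], so a freed storage can be reused by any request in its
# bucket rather than only by an exact-size match.
import math
from collections import defaultdict

class BucketedPool:
    def __init__(self):
        self.free = defaultdict(list)  # bucket index -> cached storage sizes

    @staticmethod
    def bucket(size: int) -> int:
        return math.ceil(math.log2(size)) if size > 1 else 0

    def alloc(self, size: int) -> int:
        b = self.bucket(size)
        if self.free[b]:
            return self.free[b].pop()  # reuse a cached storage from this bucket
        return 1 << b                  # otherwise allocate a new 2**b-byte storage

    def release(self, storage_size: int) -> None:
        self.free[self.bucket(storage_size)].append(storage_size)

pool = BucketedPool()
s = pool.alloc(100 * 1024)    # 100 KB request -> 128 KB storage
pool.release(s)
print(pool.alloc(70 * 1024))  # 131072: reuses the same 128 KB storage
```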

@MasterJH5574
Contributor Author

MasterJH5574 commented Nov 13, 2023

@junrushao Yes, I agree with you in general that a compiler is not supposed to make assumptions about the runtime. In terms of the current memory management in TVM Unity, the VM makes heavy use of the pool allocator, which, as you pointed out, is not clever enough. I would be more than happy to follow up in the future to make the pool allocator more intelligent about allocation management, so that we can reach a balance between fragmentation and over-allocation. The main purpose of this PR is, still, to manage memory fragmentation, which has proven to be an existing issue that can be severe when memory usage gets close to the memory limit. For now, I think it is acceptable to use the upper-bound strategy, given that we are making heavy use of the pool allocator. I agree that this is not optimal, and I am happy to revise the memory planning algorithm once we have a cleverer allocator.

@junrushao
Member

junrushao commented Nov 13, 2023

Thanks for your response. To summarize, at a high level we both agree that fragmentation is an issue to be resolved, but we differ in the approaches we believe are effective and sustainable.

I think it is acceptable to make use of the upper-bound strategy, due to the fact that we are making heavy use of the pool allocator

I believe the point you wanted to make here is that relying on the pool allocator, as you pointed out repeatedly in this thread, could alleviate/address the issue of over-allocation with unknown lifetime. This point is valid for temporary intermediate buffers, but it does not hold when it comes to over-allocated return tensors.

To help explain why, let me walk you through the example I gave in the previous response, pasted below:

1  std::vector<NDArray> outputs;
2  for (int i = 0; i < 1024; ++i) {
3    NDArray logits = mod["main"](...); // size = 1k, but storage = 128k
4    outputs.push_back(logits);
5  }
6  this->outputs = outputs;

As you may already tell, the over-allocation that happens on Line 3 is carried through into the vector outputs. No matter what the implementation of the underlying allocator is, it does not control when the vector outputs assigned on Line 6 is recycled. This means the RAM over-usage is propagated out to the caller rather than being contained inside the RelaxVM.

The main purpose of this PR is, still, to manage the memory fragmentation, which has proven to be an existing issue that can be severe when the memory usage gets close to the memory limit.

We both agree that memory fragmentation is a problem in general, and I believe we both want to take a stab at solving it without shooting ourselves and other developers in the foot for use cases in the very near future. More broadly, since static memory planning is a common pass used in every relax.build() call, we have to consider whether it will impact the entire Relax compilation flow and the end users. Meanwhile, I do believe concrete alternatives exist and am happy to help you understand how they work.

@MasterJH5574
Contributor Author

Thanks for the input! Your example is clear and demonstrates the weakness of upper-bound allocation. So far we have discussed three possible allocation strategies for the output tensor:

  • S1. exact-size allocation,
  • S2. upper-bound allocation,
  • S3. other alternatives such as bucketing.

And we discussed two runtime use cases of VM:

  • C1. Running repetitively and holding every output array:
    // Case 1.
    std::vector<NDArray> outputs;
    for (int i = 0; i < 1024; ++i) {
      NDArray logits = mod["main"](1k, ...); // size = 1k, but storage = 128k;
      outputs.push_back(logits);
    }
    this->outputs = outputs;
  • C2. Running repetitively with different output tensor sizes, not holding the output arrays:
    // Case 2.
    for (int i = 1; i <= 128; ++i) {
      NDArray logits = mod["main"](i * 1k + 1, ...);
      // post-processing of logits and then release.
    }

General cases may have more complicated use of output tensor which mix both cases above.

Based on our discussion, we agree that

  • S1 avoids over-allocation for both C1 and C2, but incurs fragmentation in C2. For C2, using S1 may hold at most ((1 + 128) * 128 / 2) * 1k = 8256k of memory in the VM when using the pool allocator.
  • S2 avoids fragmentation in both cases, but incurs over-allocation in C1 (when outputs are held). For C1, using S2 may waste 1023 * 128k of memory.
  • S3 might over-allocate in both cases. In the worst case of bucketing, in C1 each iteration can waste nearly as much memory as the output array itself. Similarly, for C2, 128k will be left unused in the pool in the end, which is much less than with S2, though.

Though a general compiler pass is not supposed to assume that the execution runtime follows certain behavior, we believe the runtime behavior (when we clearly know it) can be helpful information for the compiler. For example, in the use case of MLC LLM, we are sure that the output of VM functions (e.g., logits) will be released before the next invocation of the function. In this case, compiling the model with upper-bound allocation for output tensors is beneficial.

In consideration of this, one approach is to introduce a compilation flag, in the form of a compile-time function attribute, to suggest "whether to allocate output tensors statically with upper-bound estimation." For cases where we know how the output tensors are used at runtime (like in MLC LLM), we can enable this flag during model compilation, so that we can yield completely static runtime memory. The flag is disabled by default, and we keep exact-size allocation for general cases where the output tensors may be used arbitrarily.
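For concreteness, below is a minimal sketch of how such an attribute could be attached to every Relax function in an IRModule before compilation. The attribute name matches the `"relax.memory_plan_dynamic_func_output"` attribute introduced in the MLC LLM follow-up; the helper function itself is hypothetical and is not the actual compilation pass.

```python
# Sketch only: mark every Relax function so the memory planner may
# allocate its dynamic-shape outputs out of upper-bound-sized storage.
import tvm
from tvm import relax

def attach_memory_plan_attr(mod: tvm.IRModule) -> tvm.IRModule:
    updates = {}
    for gvar, func in mod.functions.items():
        if isinstance(func, relax.Function):
            updates[gvar] = func.with_attr(
                "relax.memory_plan_dynamic_func_output", True
            )
    for gvar, func in updates.items():
        mod[gvar] = func  # replace the function with the annotated one
    return mod
```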

@junrushao
Member

Thanks for getting back to me @MasterJH5574! I believe C1 and C2 make our points crystal clear: C1 is the case where over-allocated buffers are kept for an indefinite lifespan, and C2 is the case of immediate memory recycling with the pooled allocator.

Now moving to the discussion of S1, S2 and S3, where S1 and S2 are based on static analysis and S3 is a purely runtime technique. The point I'd love to make here is that static analysis may not be sufficient once dynamism is involved, and it might eventually be desirable to have a hybrid approach instead.

To give a specific example in LLMs:

  • The compiler and the runtime need to work together to find the optimal combination of max_sequence_length and prefill_chunk_size under a certain memory constraint.
  • The memory constraint, such as 6GB, is not known at compilation time;
  • Repetitive re-compilation is not desirable.

If the problem is scoped specifically to Llama2-7B only, it is relatively easy to resolve: static analysis gives a function f(max_sequence_length, prefill_chunk_size, ...) that returns the upper bound of RAM needed, and the runtime figures out the maximum max_sequence_length that satisfies the memory constraint. Even in this case, purely static analysis won't work out on its own, and a purely runtime approach would more or less lead to memory waste.
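As an illustration of that hybrid flow (all numbers and the shape of f below are made up for the sake of the example, not derived from an actual model): static analysis produces an upper-bound memory function, and the runtime searches for the largest max_sequence_length that fits the device budget.

```python
# Hybrid sketch: the compiler emits a static upper-bound function f,
# and the runtime picks max_sequence_length under a memory budget.

def f(max_sequence_length: int, prefill_chunk_size: int) -> int:
    # Hypothetical upper bound in bytes: weights + KV cache + activations.
    weights = 4 * 2**30                                  # 4 GiB of fp16 weights
    kv_cache = max_sequence_length * 32 * 4096 * 2 * 2   # layers*hidden*(k,v)*fp16
    activations = prefill_chunk_size * 4096 * 2 * 64     # rough activation bound
    return weights + kv_cache + activations

def max_seq_len_under_budget(budget: int, prefill_chunk_size: int,
                             lo: int = 1, hi: int = 1 << 20) -> int:
    # Binary search the largest value whose estimated footprint fits the budget,
    # assuming f is monotonically increasing in max_sequence_length.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if f(mid, prefill_chunk_size) <= budget:
            lo = mid
        else:
            hi = mid - 1
    return lo

print(max_seq_len_under_budget(6 * 2**30, prefill_chunk_size=2048))  # -> 2048
```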

Within the scope of this PR, things aren't so complicated that we have to find a perfect solution. Agreeing with your assessment: if the upstream framework can instruct the compiler to apply the upper-bound-based approach via function attributes/annotations, it should be sufficient for LLM serving so far.

@junrushao force-pushed the unity branch 2 times, most recently from c95d45f to 45eeb8c on December 18, 2023
@MasterJH5574 force-pushed the unity-dev/2023-11-11-memory-plan branch from 60cdee7 to 3cb7a0f on January 14, 2024
@MasterJH5574 force-pushed the unity-dev/2023-11-11-memory-plan branch from 3cb7a0f to 727575f on January 14, 2024
MasterJH5574 added a commit to MasterJH5574/mlc-llm that referenced this pull request Jan 14, 2024
This PR adds a pass into the model compilation pipeline, which attaches an attribute `"relax.memory_plan_dynamic_func_output"` to each Relax function in the IRModule. This attribute suggests that the Relax functions' output tensors, though having dynamic shapes, are statically plannable.

This enhancement makes sure that in serving scenarios, our memory allocation is completely static once it stabilizes. So we will not worry about continued memory usage growth, and can allocate more memory for the KV cache.

This PR can be merged early, but it will not take effect until apache/tvm#16111 is merged.
MasterJH5574 added a commit to MasterJH5574/mlc-llm that referenced this pull request Jan 14, 2024
@tqchen merged commit cf14edd into apache:unity on Jan 15, 2024
tqchen pushed a commit to mlc-ai/mlc-llm that referenced this pull request Jan 15, 2024
MasterJH5574 added a commit to MasterJH5574/mlc-llm that referenced this pull request Jan 15, 2024
MasterJH5574 added a commit to MasterJH5574/mlc-llm that referenced this pull request Jan 15, 2024
masahi pushed a commit to masahi/tvm that referenced this pull request Feb 20, 2024
elvin-n pushed a commit to Deelvin/tvm that referenced this pull request Mar 19, 2024
smickey040404 added a commit to smickey040404/mlc-llm that referenced this pull request Feb 11, 2025
tristankincaid added a commit to tristankincaid/mlc-llm that referenced this pull request Feb 16, 2025