
[Feature Request] Efficient model loading #15080

Open
elephantpanda opened this issue Mar 16, 2023 · 6 comments
Labels
feature request request for unsupported feature or enhancement

Comments

@elephantpanda

elephantpanda commented Mar 16, 2023

Describe the feature request

If you currently load a model of, say, 5 GB, it will first load the model into RAM, taking 5 GB, and then do some sort of duplication using another 5 GB of RAM, spiking at 10 GB. It then transfers 5 GB to the GPU and frees the 10 GB of RAM. (I am using C# and DirectML.)

This is extremely wasteful and unnecessary. You can see the short spike in the screenshot below (notice the 'church spire'):

[screenshot: task manager RAM graph showing a brief spike during model load]

It means you need double the RAM actually required to run certain models.

I'm sure this can be overcome by loading the model into RAM piecemeal, instead of inefficiently loading the whole model into RAM at once, doing some wasteful duplication, and then deleting the entire thing.

Alternatively some of that work could be shifted to the VRAM.

Either way, this spike in RAM is just a symptom of very inefficient model loading.

Basically, the model loading could be done more efficiently to avoid this spike. I'm sure there are ways to achieve this through clever optimisation tricks, quickly freeing unused memory, and sequential model loading.

Describe scenario use case

To load large models without having to buy 2x the RAM you should actually require. (Remember that the average amount of RAM on a typical user's PC is 8 GB or even 4 GB.)

@elephantpanda elephantpanda added the feature request request for unsupported feature or enhancement label Mar 16, 2023
@github-actions github-actions bot added the ep:DML issues related to the DirectML execution provider label Mar 16, 2023
@skottmckay
Contributor

It's not easily overcome with the current implementation.

ORT loads the model from a protobuf format .onnx file. It's up to protobuf to decide what arenas etc. are used to load that into memory.

  • The bulk of the memory is the initializer/weights data, which may be stored in a packed format.
    • i.e. we can't treat it as a simple array of the data type, so it needs to be converted to a different format to actually run the model.
    • a copy is required to do that, which leads to 2x the memory usage for each initializer.
  • We try to free individual pieces as we go along where possible, but that may not make a difference, given that protobuf controls which memory arenas etc. are used, and a chunk of memory it owns cannot be freed until everything using that chunk is no longer in use.

Theoretically, if you saved the model in some different format whose initializer data could be used directly once loaded, you could avoid this copy. However, a lot of the code is written to operate on the protobuf types, such as all the optimizers, and rewriting that would be a significant undertaking.
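
(For reference, ONNX's existing external-data mechanism can at least move the initializer bytes out of the protobuf file itself; whether that avoids the copy depends on how the session then consumes them. A minimal Python sketch, with illustrative file names:)

```python
import onnx

model = onnx.load("model.onnx")

# Re-save the model with initializer bytes stored in a separate file
# instead of being embedded in the protobuf.
onnx.save_model(model, "model_ext.onnx",
                save_as_external_data=True,
                all_tensors_to_one_file=True,
                location="weights.bin",
                size_threshold=1024)  # only tensors above this size are externalized
```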

@skottmckay skottmckay removed the ep:DML issues related to the DirectML execution provider label Mar 17, 2023
@elephantpanda
Author

elephantpanda commented Mar 17, 2023

Theoretically, if you saved the model in some different format whose initializer data could be used directly once loaded, you could avoid this copy. However, a lot of the code is written to operate on the protobuf types, such as all the optimizers, and rewriting that would be a significant undertaking.

Microsoft is a multi-billion trillion dollar company with very intelligent people. Just hire someone to do it.

What I'm hearing is "we agree Onnxruntime is inefficient and we don't want to fix it." Is that really the Microsoft attitude? 😁

I jest, of course. I'm sure you could all fix it in a weekend of hacking if you put your minds to it.

Using double the amount of RAM needed is the very definition of inefficient code.

Why not just load the ONNX file layer by layer and convert it into the desired type layer by layer, deleting from RAM as we go along? It doesn't seem impossible; it just seems like different departments need to work together.

On the other hand, it seems like you're saying ONNX itself is flawed because it uses the protobuf format. So maybe we shouldn't be using ONNX Runtime at all?

As I say, PyTorch manages not to use double the amount of memory or VRAM, so it's definitely possible.

"The bulk of the memory is the initializer/weights data which may be stored in a packed format". OK, well once a layer has been converted into the format needed to run the model, delete the packet. What's so hard about that? Do it layer by layer to avoid the spike.

@faxu
Contributor

faxu commented Mar 17, 2023

Hi @pauldog, please take a look at our Code of Conduct, which outlines our expectations for Microsoft open source community engagement.

We do our best to monitor and support community product feedback, and we expect community members to use respectful language when discussing issues.

@pranavsharma
Contributor

@pauldog If memory is a concern you can solve it in the following way.

  1. Create a session with the model by setting optimized_model_filepath to serialize the optimized model to a file.
  2. Externalize all weights from the optimized model.
  3. Create OrtValues for each of the weights.
  4. Feed them to ORT using this API. Even though this API was originally developed to share weights between multiple models (sessions), it can still be used with a single session. It'll ensure the weights are allocated only once (by the user). Here's a test that shows its usage.
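
A rough Python sketch of these four steps (the C# API has equivalent calls; file names such as model.onnx, model_opt.onnx and weights.bin are illustrative, and error handling is omitted):

```python
import onnx
from onnx import numpy_helper
import onnxruntime as ort

# 1. Create a session once with optimized_model_filepath set, so the optimized model is serialized.
so = ort.SessionOptions()
so.optimized_model_filepath = "model_opt.onnx"
ort.InferenceSession("model.onnx", so)

# 2. Externalize all weights from the optimized model.
opt_model = onnx.load("model_opt.onnx")
onnx.save_model(opt_model, "model_opt_ext.onnx", save_as_external_data=True,
                all_tensors_to_one_file=True, location="weights.bin")

# 3. Create OrtValues for each of the weights (allocated once, owned by the caller).
ext_model = onnx.load("model_opt_ext.onnx")
weights = {init.name: ort.OrtValue.ortvalue_from_numpy(numpy_helper.to_array(init))
           for init in ext_model.graph.initializer}

# 4. Feed them to ORT via SessionOptions so the session uses the caller-owned buffers.
so2 = ort.SessionOptions()
for name, value in weights.items():
    so2.add_initializer(name, value)
session = ort.InferenceSession("model_opt_ext.onnx", so2)
```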

@elephantpanda
Author

elephantpanda commented Mar 17, 2023

@pauldog If memory is a concern you can solve it in the following way.

  1. Create a session with the model by setting optimized_model_filepath to serialize the optimized model to a file.
  2. Externalize all weights from the optimized model.
  3. Create OrtValues for each of the weights.
  4. Feed them to ORT using this API. Even though this API was originally developed to share weights between multiple models (sessions), it can still be used with a single session. It'll ensure the weights are allocated only once (by the user). Here's a test that shows its usage.

Thanks, I'll give it a go! 😃 It looks a little tricky, but I'll give it a try.
Sorry if my previous words offended. 😔

I assume this is the C# version.

Will all this really load the model with less memory? If so, it would be great as a tutorial as I'm sure lots of people would find it useful.

This is really tricky :( I'm stuck on "externalize all weights".

On second thoughts, I don't think this is the problem, since the spike only occurs in DirectML mode, not CPU mode.

@elephantpanda
Author

@pauldog If memory is a concern you can solve it in the following way.

  1. Create a session with the model by setting optimized_model_filepath to serialize the optimized model to a file.
  2. Externalize all weights from the optimized model.
  3. Create OrtValues for each of the weights.
  4. Feed them to ORT using this API. Even though this API was originally developed to share weights between multiple models (sessions), it can still be used with a single session. It'll ensure the weights are allocated only once (by the user). Here's a test that shows its usage.

Hi, can you clarify what you mean by "externalize all weights"? I'm not sure I understand. Thanks.
