
[Feature Request] Efficient model loading #15080

Open
elephantpanda opened this issue Mar 16, 2023 · 6 comments
Labels
feature request request for unsupported feature or enhancement

Comments

@elephantpanda

elephantpanda commented Mar 16, 2023

Describe the feature request

If you currently load a model of, say, 5 GB, it will first load the model into RAM, taking 5 GB, and then do some sort of duplication using another 5 GB of RAM, spiking at 10 GB. It then transfers 5 GB to the GPU and frees the 10 GB of RAM. (I am using C# and DirectML.)

This is extremely wasteful and unnecessary. You can see the short spike in the screenshot below (notice the 'church spire'):

[screenshot: task manager RAM graph showing a brief spike during model load]

It means you need double the RAM actually required to run certain models.

I'm sure this can be overcome by loading the model into RAM piecemeal, instead of inefficiently loading the whole model into RAM at once, doing some wasteful duplication, and then deleting the entire thing.

Alternatively some of that work could be shifted to the VRAM.

Either way, this spike in RAM is just a symptom of very inefficient model loading.

Basically, the model loading could be done more efficiently to avoid this spike. I'm sure there are ways to achieve this through clever optimisation tricks, quickly freeing unused memory, and sequential model loading.

Describe scenario use case

To load large models without having to buy 2x the RAM you should actually require. (Remember that the average amount of RAM on a typical user's PC is 8 GB or even 4 GB.)

@elephantpanda elephantpanda added the feature request request for unsupported feature or enhancement label Mar 16, 2023
@github-actions github-actions bot added the ep:DML issues related to the DirectML execution provider label Mar 16, 2023
@skottmckay
Contributor

It's not easily overcome with the current implementation.

ORT loads the model from a protobuf format .onnx file. It's up to protobuf to decide what arenas etc. are used to load that into memory.

  • The bulk of the memory is the initializer/weights data, which may be stored in a packed format.
    • i.e. we can't treat it as a simple array of the data type, so it needs to be converted to a different format to actually run the model.
    • a copy is required to do that, which leads to 2x the memory usage for each initializer.
  • We try to free individual pieces as we go along where possible, but that may not make a difference, given that protobuf controls which memory arenas etc. are used, and a chunk of memory it owns cannot be freed until everything using that chunk is no longer in use.

Theoretically, if you saved the model in some different format whose initializer data could be used directly once loaded, you could avoid this copy. However, a lot of the code is written to operate on the protobuf types, such as all the optimizers, and rewriting that would be a significant undertaking.
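
(For reference, ONNX's existing external-data mechanism can at least move the initializer bytes out of the protobuf file itself; whether that avoids the copy depends on how the session then consumes them. A minimal Python sketch, with illustrative file names:)

```python
import onnx

model = onnx.load("model.onnx")

# Re-save the model with initializer bytes stored in a separate file
# instead of being embedded in the protobuf.
onnx.save_model(model, "model_ext.onnx",
                save_as_external_data=True,
                all_tensors_to_one_file=True,
                location="weights.bin",
                size_threshold=1024)  # only tensors above this size are externalized
```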

@skottmckay skottmckay removed the ep:DML issues related to the DirectML execution provider label Mar 17, 2023
@elephantpanda
Author

elephantpanda commented Mar 17, 2023

Theoretically, if you saved the model in some different format whose initializer data could be used directly once loaded, you could avoid this copy. However, a lot of the code is written to operate on the protobuf types, such as all the optimizers, and rewriting that would be a significant undertaking.

Microsoft is a multi-billion trillion dollar company with very intelligent people. Just hire someone to do it.

What I'm hearing is "we agree Onnxruntime is inefficient and we don't want to fix it." Is that really the Microsoft attitude? 😁

I jest, of course. I'm sure you could all fix it in a weekend of hacking if you put your minds to it.

Using double the amount of RAM needed is the very definition of inefficient code.

Why not just load the ONNX file layer by layer and convert it into the desired type layer by layer, deleting from RAM as we go along? It doesn't seem impossible; it just seems like different departments need to work together.

On the other hand, it seems like you're saying ONNX itself is flawed because it uses the protobuf format. So maybe we shouldn't be using ONNX Runtime at all?

As I say, PyTorch manages not to use double the amount of memory or VRAM, so it's definitely possible.

"The bulk of the memory is the initializer/weights data which may be stored in a packed format". OK, well once a layer has been converted into the format needed to run the model, delete the packet. What's so hard about that? Do it layer by layer to avoid the spike.

@faxu
Contributor

faxu commented Mar 17, 2023

Hi @pauldog, please take a look at our Code of Conduct, which outlines our expectations for Microsoft open source community engagement.

We do our best to monitor and support community product feedback, and we expect community members to use respectful language when discussing issues.

@pranavsharma
Contributor

@pauldog If memory is a concern you can solve it in the following way.

  1. Create a session with the model by setting optimized_model_filepath to serialize the optimized model to a file.
  2. Externalize all weights from the optimized model.
  3. Create OrtValues for each of the weights.
  4. Feed them to ORT using this API. Even though this API was originally developed to share weights between multiple models (sessions), it can still be used with a single session. It'll ensure the weights are allocated only once (by the user). Here's a test that shows its usage.
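
A rough Python sketch of these four steps (the C# API has equivalent calls; file names such as model.onnx, model_opt.onnx and weights.bin are illustrative, and error handling is omitted):

```python
import onnx
from onnx import numpy_helper
import onnxruntime as ort

# 1. Create a session once with optimized_model_filepath set, so the optimized model is serialized.
so = ort.SessionOptions()
so.optimized_model_filepath = "model_opt.onnx"
ort.InferenceSession("model.onnx", so)

# 2. Externalize all weights from the optimized model.
opt_model = onnx.load("model_opt.onnx")
onnx.save_model(opt_model, "model_opt_ext.onnx", save_as_external_data=True,
                all_tensors_to_one_file=True, location="weights.bin")

# 3. Create OrtValues for each of the weights (allocated once, owned by the caller).
ext_model = onnx.load("model_opt_ext.onnx")
weights = {init.name: ort.OrtValue.ortvalue_from_numpy(numpy_helper.to_array(init))
           for init in ext_model.graph.initializer}

# 4. Feed them to ORT via SessionOptions so the session uses the caller-owned buffers.
so2 = ort.SessionOptions()
for name, value in weights.items():
    so2.add_initializer(name, value)
session = ort.InferenceSession("model_opt_ext.onnx", so2)
```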

@elephantpanda
Author

elephantpanda commented Mar 17, 2023

@pauldog If memory is a concern you can solve it in the following way.

  1. Create a session with the model by setting optimized_model_filepath to serialize the optimized model to a file.
  2. Externalize all weights from the optimized model.
  3. Create OrtValues for each of the weights.
  4. Feed them to ORT using this API. Even though this API was originally developed to share weights between multiple models (sessions), it can still be used with a single session. It'll ensure the weights are allocated only once (by the user). Here's a test that shows its usage.

Thanks, I'll give it a go! 😃 It looks a little tricky, but I'll give it a try.
Sorry if my previous words offended. 😔

I assume this is the C# version.

Will all this really load the model with less memory? If so, it would be great as a tutorial as I'm sure lots of people would find it useful.

This is really tricky :( I'm stuck on "externalize all weights".

On second thoughts, I don't think this is the problem, since the spike only occurs in DirectML mode, not CPU mode.

@elephantpanda
Author

@pauldog If memory is a concern you can solve it in the following way.

  1. Create a session with the model by setting optimized_model_filepath to serialize the optimized model to a file.
  2. Externalize all weights from the optimized model.
  3. Create OrtValues for each of the weights.
  4. Feed them to ORT using this API. Even though this API was originally developed to share weights between multiple models (sessions), it can still be used with a single session. It'll ensure the weights are allocated only once (by the user). Here's a test that shows its usage.

Hi, can you clarify what you mean by "externalize all weights"? I'm not sure I understand. Thanks.
