[Feature Request] Efficient model loading #15080
Comments
It's not easily overcome with the current implementation. ORT loads the model from a protobuf-format .onnx file, and it's up to protobuf to decide what arenas etc. are used to load that into memory.
Theoretically, if the model were saved in some different format whose loaded representation could be used directly for the initializer data, you could avoid this copy. However, a lot of the code is written to operate on the protobuf types (such as all the optimizers), and rewriting that would be a significant undertaking.
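For illustration only, here is a minimal Python sketch (assuming the onnxruntime Python package and a placeholder "model.onnx" path; the original report uses the C# API, where the same two copies exist) of the two representations described above: the serialized protobuf bytes and the session's own in-memory copy of the graph and initializers.

```python
import onnxruntime as ort

# Read the serialized protobuf (.onnx) into memory ourselves.
with open("model.onnx", "rb") as f:
    model_bytes = f.read()  # copy #1: the protobuf-encoded model

# Creating the session parses the protobuf and builds ORT's own in-memory
# representation of the graph and initializers (copy #2), so both copies
# coexist for a while -- this overlap is where the peak comes from.
sess = ort.InferenceSession(model_bytes, providers=["CPUExecutionProvider"])

# Once the session exists, the serialized bytes are no longer needed,
# but releasing them does not remove the peak that already occurred.
del model_bytes
```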
What I'm hearing is "we agree ONNX Runtime is inefficient and we don't want to fix it." Is that really the Microsoft attitude? 😁 I jest, of course. I'm sure you could all fix it in a weekend of hacking if you put your minds to it.

Using double the amount of RAM needed is the very definition of inefficient code. Why not just load the ONNX file layer by layer and convert it into the desired type layer by layer, deleting from RAM as we go along? It doesn't seem impossible; it just seems like different departments need to work together.

On the other hand, it sounds like you're saying ONNX itself is flawed because it uses the protobuf format. So maybe we shouldn't be using ONNX Runtime at all? As I say, PyTorch manages not to use double the amount of RAM or VRAM, so it's definitely possible.

"The bulk of the memory is the initializer/weights data which may be stored in a packed format". OK, well once a layer has been converted into the format needed to run the model, delete the original copy. What's so hard about that? Do it layer by layer to avoid the spike.
Hi @pauldog, please take a look at our Code of Conduct, which outlines our expectations for Microsoft open source community engagement. We do our best to monitor and support community product feedback, and we expect community members to use respectful language when discussing issues.
@pauldog If memory is a concern, you can solve it the following way.
Thanks. I'll give it a go! 😃 I assume this is the C# version. Will all this really load the model with less memory? If so, it would be great as a tutorial, as I'm sure lots of people would find it useful. This is really tricky :( I'm stuck on "Externalize all weights". On second thought, I don't think this is the problem, since the spike only occurs when I use DirectML mode, not CPU mode.
Hi, can you clarify what you mean by "externalize all weights"? I'm not sure I understand. Thanks.
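In case it helps: "externalize all weights" most likely refers to ONNX's external-data feature, where initializers are stored in a separate binary file instead of inside the protobuf. Below is a minimal sketch using the onnx Python package (file names are placeholders; this is one common way to externalize weights, not necessarily the exact recipe intended above).

```python
import onnx

# Load the original model, with all weights embedded in the protobuf.
model = onnx.load("model.onnx")

# Re-save it with the initializers moved into a separate binary file,
# so the .onnx file itself only contains the graph structure.
onnx.save_model(
    model,
    "model_external.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model_weights.bin",
    size_threshold=0,  # externalize all initializers, not only large ones
)
```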
Describe the feature request
If you currently load a model of, say, 5 GB, it will first load the model into RAM, taking 5 GB, then it will do some sort of duplication, using another 5 GB of RAM, spiking at 10 GB. It then transfers 5 GB to the GPU and removes the 10 GB from RAM. (I am using C# and DirectML.)
This is extremely wasteful and unnecessary, as can be seen from the short spike in the memory usage graph (notice the 'church spire').
It means you need double the RAM you should actually require to run certain models.
I'm sure this can easily be overcome by loading the model piecemeal into RAM instead of inefficiently loading the whole model into RAM at once, doing some wasteful duplication, and then deleting the entire thing.
Alternatively some of that work could be shifted to the VRAM.
Either way, this spike in RAM is just a symptom of very inefficient model loading.
Basically, the model loading could be done more efficiently to avoid this spike in RAM. I'm sure there are ways to avoid it, such as clever optimisation tricks, quickly freeing unused RAM, and sequential model loading.
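To make the observation easier to reproduce, here is a minimal sketch (assuming the onnxruntime-directml and psutil packages and a placeholder model path; the original report uses the C# API) that brackets session creation with resident-memory readings. It only prints before/after values; catching the transient peak itself requires sampling from a background thread or watching a process monitor.

```python
import psutil
import onnxruntime as ort

proc = psutil.Process()

def rss_gb() -> float:
    # Resident set size of this process, in gigabytes.
    return proc.memory_info().rss / 1024**3

print(f"Before load: {rss_gb():.2f} GB")

# Session creation is where the temporary duplication is observed.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)

print(f"After load:  {rss_gb():.2f} GB")
```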
Describe scenario use case
To load large models without having to buy 2x the RAM you should actually require. (Remember that the average amount of RAM on a typical user's PC is 8 GB or even 4 GB.)