My experiment creating a more efficient loader. Uses 90% less RAM (while loading). 🚀 #15429
Comments
My experiment is not working so far because there appears to be a memory leak (DirectML, C#). The code pushes the float values onto the GPU but doesn't seem to release them from system memory. So I can push the weights onto the GPU but can't clear up the RAM afterwards.
Update: I solved that memory "leak" with:
So now my experiment can load in a 1.6GB onnx file with barely any RAM usage. Ideal for computers which are short on RAM. 😀🚀🌙 There is a tiny bit of RAM leakage, but that might be other parts of my code. So I have solved the RAM issue; now my speed issue is mainly how to get a load of ushorts from a file efficiently.
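For the file-reading part, a minimal sketch using only standard .NET calls (the file name is a made-up example, and the file is assumed to be a raw little-endian dump of 16-bit values):

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

class WeightFileReader
{
    // Reads a raw external-weight file into a ushort[] in a single pass.
    static ushort[] ReadUInt16s(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);                  // one sequential read
        ushort[] values = new ushort[bytes.Length / 2];
        // MemoryMarshal.Cast<byte, ushort>(bytes) avoids this extra copy
        // if a Span<ushort> is enough for the caller.
        Buffer.BlockCopy(bytes, 0, values, 0, values.Length * 2);
        return values;
    }

    static void Main()
    {
        // "weights/onnx__MatMul_123" is a made-up weight-file name.
        ushort[] float16Bits = ReadUInt16s("weights/onnx__MatMul_123");
        Console.WriteLine($"Loaded {float16Bits.Length} float16 values");
    }
}
```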
@pauldog You've externalized the weights, meaning you've arranged for the weights to be stored separately from the onnx model. ORT memory-maps external weights. I'm not fully sure about Windows, but on Linux this is definitely not counted as resident memory of the process. So, unless you intend to share these weights across many sessions or processes, this memory saving is a bit misleading.
Sorry, I think you misunderstood what I am doing. The point of all this is to load the model onto the GPU without a memory spike in RAM, which can lead to out-of-memory exceptions. Once the model is loaded it is using exactly the same RAM and VRAM. This means I can load a 12GB model into VRAM without the RAM spiking by 24GB and causing a memory exception. Without externalising the weights this was not possible. To reiterate, it is the loader that is using less RAM, not the model itself, which is very clear from my memory reading above, which shows no spike when I load it this way. I tried to follow your instructions in #15080, but I couldn't understand them as I am using C#, so I did it this way. Unless I misunderstood and there's an easier way to avoid this memory spike? (Maybe there is no spike on Linux, but there definitely is one on Windows, even with external weights.) I am using Windows because I am a game developer, so these things need to run on Windows. I can't write software with memory spikes because it will crash people's computers.
Closing as resolved.
More Efficient Onnx Loader
I am working on creating a more efficient loader for ONNX models to use less RAM in C#. Currently the session loads the whole file into RAM and decompresses it in RAM, meaning it can take 2x the size of the onnx file in RAM just to load it onto the GPU. This can lead to out-of-memory errors. Using this new method, you can load a 10GB onnx model onto the GPU with only 12GB of RAM.
Hopefully, by loading the weights sequentially from separate files, one by one, RAM usage should be reduced considerably when loading the models. I'm not sure why the default method is so inefficient, but it can be worked around this way.
Here are my steps, starting from a torch model (a rough C# sketch of steps 4-9 follows the list):

1. Export the torch model with `torch.onnx.export(..)` and the flag `export_params=True`. [This creates an onnx file with embedded weights: "model.onnx"]
2. Export the torch model with `torch.onnx.export(..)` and the flag `export_params=False`. [This creates a small onnx file without weights: "model_no_weights.onnx"]
3. Load model.onnx and then call `onnx.save(model, output_path, save_as_external_data=True, all_tensors_to_one_file=False)`. [This takes the large model.onnx file and separates all the weights into separate files: "model_separated.onnx" plus lots of weight files]
4. Create a session using "model_no_weights.onnx".
5. Iterate through all the 'inputs' (each weight becomes an input), get the name of the input, and use it to load the weight file from disk. Use IOBinding to bind these weights to that input.
6. Release the RAM from the IOBinding. (I haven't worked out how to do this yet!)
7. Bind the actual input data.
8. Bind the output.
9. Run the inference with `RunWithBindingAndNames()`.
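To make this concrete, here is a rough C# sketch of steps 4-9 under some simplifying assumptions: the weights are float32, each external weight file is a raw dump stored under `weights/` and named after the corresponding graph input, and `LoadWeightsAsFloats()`, the input name `"x"`, and all shapes are placeholders rather than real names from the model:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

class SequentialLoaderSketch
{
    // Hypothetical helper: reads one external weight file (assumed to be a raw
    // little-endian float32 dump named after the graph input) into a float[].
    static float[] LoadWeightsAsFloats(string inputName)
    {
        byte[] bytes = System.IO.File.ReadAllBytes("weights/" + inputName);
        float[] values = new float[bytes.Length / 4];
        Buffer.BlockCopy(bytes, 0, values, 0, bytes.Length);
        return values;
    }

    static void Main()
    {
        using var options = new SessionOptions();
        options.AppendExecutionProvider_DML(0);                       // DirectML, device 0

        // Step 4: create the session from the graph-only model.
        using var session = new InferenceSession("model_no_weights.onnx", options);
        using var binding = session.CreateIoBinding();

        var realInputs = new HashSet<string> { "x" };                 // the model's true inputs (placeholder name)
        var pinnedWeights = new List<FixedBufferOnnxValue>();

        // Step 5: every externalized weight shows up as a graph input; load and bind each one.
        foreach (var kv in session.InputMetadata.Where(kv => !realInputs.Contains(kv.Key)))
        {
            float[] data = LoadWeightsAsFloats(kv.Key);
            var tensor = new DenseTensor<float>(data, kv.Value.Dimensions);
            var pinned = FixedBufferOnnxValue.CreateFromTensor(tensor);
            binding.BindInput(kv.Key, pinned);
            pinnedWeights.Add(pinned);                                // must stay alive until the run
        }

        // Step 7: bind the actual input data (dummy shape for the sketch).
        var input = new DenseTensor<float>(new float[1 * 4], new[] { 1, 4 });
        using var inputValue = FixedBufferOnnxValue.CreateFromTensor(input);
        binding.BindInput("x", inputValue);

        // Step 8: bind the output to a preallocated CPU buffer (dummy shape for the sketch).
        string outputName = session.OutputMetadata.Keys.First();
        var output = new DenseTensor<float>(new float[1 * 4], new[] { 1, 4 });
        using var outputValue = FixedBufferOnnxValue.CreateFromTensor(output);
        binding.BindOutput(outputName, outputValue);

        // Step 9: run with the binding.
        using var runOptions = new RunOptions();
        using var results = session.RunWithBindingAndNames(runOptions, binding);

        // Step 6 (after the run): dispose the pinned values and drop the managed
        // references so the GC can reclaim the CPU-side copies of the weights.
        foreach (var pinned in pinnedWeights) pinned.Dispose();
        pinnedWeights.Clear();
    }
}
```

As far as I can tell, the CPU copies can only be handed back once the pinned values are disposed and nothing else references the arrays, which is presumably what step 6 comes down to.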
So far I have got it running on DirectML in C# for my experiments; `float16s` are the raw float16 values loaded from one of the separate weight files.
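Roughly, binding one such buffer looks like this (a simplified sketch rather than the exact code: the input name and shape are placeholders, and it assumes the `Float16` struct from `Microsoft.ML.OnnxRuntime.Tensors`, which wraps the raw 16-bit pattern):

```csharp
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

static class Fp16BindingSketch
{
    // `float16s` holds the raw 16-bit values read from one weight file.
    // The input name and shape are placeholders; the shape must match the data length.
    public static FixedBufferOnnxValue BindFloat16Weight(OrtIoBinding binding, ushort[] float16s)
    {
        // Re-wrap the raw bit patterns as Float16 so ORT tags the tensor as fp16.
        var asFloat16 = new Float16[float16s.Length];
        for (int i = 0; i < float16s.Length; i++)
            asFloat16[i] = new Float16(float16s[i]);

        var tensor = new DenseTensor<Float16>(asFloat16, new[] { 320, 1280 });
        var pinned = FixedBufferOnnxValue.CreateFromTensor(tensor);
        binding.BindInput("onnx__MatMul_123", pinned);
        return pinned;   // the caller must keep this alive until after the run
    }
}
```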
This is very convoluted, but if I eventually get it to work it should use practically zero RAM when loading a model onto the GPU.
If someone has a better way of doing the same thing, let me know.
What would make the steps easier might be a function like `LoadSessionWithoutWeights()` or `LoadSessionSequentiallyFromFiles()`. In the case where the onnx model is split into a load of separate weight files, the default behaviour should surely be to load it sequentially, freeing up RAM as it goes. This is my proof of concept; I don't know if all this binding will affect the speed of the inference.
(Is this the best way to do it, or should I be using `AddInitializer` and `PrePackedWeightsContainer`?)
Another problem I had was that, for the Stable Diffusion UNet, I created an onnx file with no weights and it took 40 seconds to load, whereas the UNet with weights took 8 seconds. I don't know what went wrong here!
To reproduce
as above
Urgency
No response
Platform
Windows
OS Version
Windows 10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.15
ONNX Runtime API
C#
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
No response