
My experiment creating a more efficient loader. Uses 90% less RAM (while loading). 🚀 #15429

Closed
elephantpanda opened this issue Apr 7, 2023 · 6 comments
Labels
ep:DML issues related to the DirectML execution provider platform:windows issues related to the Windows platform

Comments

@elephantpanda

elephantpanda commented Apr 7, 2023

More Efficient Onnx Loader

I am working on creating a more efficient loader for onnx to use less RAM in C#. Currently the session loads the whole file into RAM and then unpacks it in RAM, meaning it can take 2x the size of the onnx file in RAM just to load it onto the GPU. This can lead to out-of-memory errors. Using this new method, you can load a 10GB onnx onto the GPU with only 12GB of RAM.

Hopefully, by sequentially loading the weights from separate files one by one, RAM usage while loading the models should drop considerably. I'm not sure why the default method is so inefficient, but it can be worked around this way.

Here are my steps, starting from a torch model (a rough C# sketch of steps 4 to 9 follows the list):

  1. Export the torch model with torch.onnx.export(..) and the flag export_params=True. [This creates an onnx file with embedded weights, "model.onnx".]

  2. Export the torch model with torch.onnx.export(..) and the flag export_params=False. [This creates a small onnx file without weights, "model_no_weights.onnx".]

  3. Load model.onnx and then call onnx.save(model, output_path, save_as_external_data=True, all_tensors_to_one_file=False). [This takes the large model.onnx file and splits all the weights out into separate files: "model_separated.onnx" plus lots of weight files.]

  4. Create a session using "model_no_weights.onnx".

  5. Iterate through all the 'inputs' (each weight becomes an input), get the name of the input, and use it to load the corresponding weight file from disk. Use IOBinding to bind these weights to that input.

  6. Release the RAM from the IOBinding (I haven't worked out how to do this yet!)

  7. Bind the actual input data

  8. Bind the output

  9. Run the inference with RunWithBindingAndNames()
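
To make the flow concrete, here is a rough C# sketch of steps 4 to 9, assuming the Microsoft.ML.OnnxRuntime.DirectML package. The input/output names ("sample", "out_sample"), the shapes, the "weights/" folder layout and the LoadFloat16File helper are placeholders for illustration only, not part of the ORT API:

    // Rough sketch of steps 4 to 9. Input/output names, shapes and the weight-file
    // layout are placeholders; the ONNX Runtime calls are the DirectML C# API.
    using Microsoft.ML.OnnxRuntime;
    using Microsoft.ML.OnnxRuntime.Tensors;

    // Assumed layout: each weight stored as raw little-endian fp16 values in its own file.
    static Float16[] LoadFloat16File(string path) =>
        System.Runtime.InteropServices.MemoryMarshal
            .Cast<byte, Float16>(System.IO.File.ReadAllBytes(path)).ToArray();

    var options = new SessionOptions();
    options.AppendExecutionProvider_DML(0);                        // DirectML, device 0

    // Step 4: build the session from the weight-free graph.
    using var session = new InferenceSession("model_no_weights.onnx", options);
    using var binding = session.CreateIoBinding();

    // Step 5: every former initializer now appears as an input; bind each one from its own file.
    foreach (var kv in session.InputMetadata)
    {
        string name = kv.Key;
        if (name == "sample") continue;                            // skip the real model input (placeholder name)
        Float16[] float16s = LoadFloat16File("weights/" + name + ".bin");
        using (var value = FixedBufferOnnxValue.CreateFromTensor(
                   new DenseTensor<Float16>(float16s, kv.Value.Dimensions)))
        {
            binding.BindInput(name, value);
            // Step 6: freeing the CPU-side copy here is the part I haven't worked out yet.
        }
    }

    // Steps 7 to 9: bind the real input and the output, then run with the binding.
    var sample = new DenseTensor<Float16>(new[] { 1, 4, 64, 64 }); // placeholder input
    using (var input = FixedBufferOnnxValue.CreateFromTensor(sample))
    {
        binding.BindInput("sample", input);
        binding.BindOutputToDevice("out_sample", OrtMemoryInfo.DefaultInstance);
        using var results = session.RunWithBindingAndNames(new RunOptions(), binding);
    }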

So far I have got it running with DirectML in C#. For my experiments I bind each weight with:

    using (FixedBufferOnnxValue value = FixedBufferOnnxValue.CreateFromTensor(new DenseTensor<Float16>(float16s, dims)))
    {
        binding.BindInput(key, value);
    }

where float16s are the values loaded from one of the separate weight files.

This is very convoluted, but if I eventually get it to work it should use practically no extra RAM while loading a model onto the GPU.

If someone has a better way of doing the same thing, let me know.

What would make the steps easier might be a function like LoadSessionWithoutWeights() or LoadSessionSequentiallyFromFiles(). In the case where the onnx model is a load of separate weight files, the default behaviour should surely be to load it sequentially, freeing up RAM as it goes. This is my proof of concept. I don't know whether all this binding will affect the speed of the inference.

(Is this the best way to do it, or should I be using AddInitializer and PrePackedWeightsContainer?)

Another problem I had: for the stable diffusion Unet, the onnx file I created with no weights took 40 seconds to load, whereas the unet with weights took 8 seconds. I don't know what went wrong there!

To reproduce

as above

Urgency

No response

Platform

Windows

OS Version

Windows 10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.15

ONNX Runtime API

C#

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

No response

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider ep:DML issues related to the DirectML execution provider platform:windows issues related to the Windows platform labels Apr 7, 2023
@elephantpanda elephantpanda changed the title My experiement creating a more efficient loader My experiment creating a more efficient loader Apr 7, 2023
@github-actions github-actions bot removed the ep:CUDA issues related to the CUDA execution provider label Apr 8, 2023
@elephantpanda
Author

elephantpanda commented Apr 9, 2023

My experiment is not working so far because there appears to be a memory leak (DirectML, C#). The code:

    using (FixedBufferOnnxValue value = FixedBufferOnnxValue.CreateFromTensor(new DenseTensor<Float16>(float16s, dims)))
    {
        binding.BindInput(key, value);
    }

pushes the float values onto the GPU but doesn't seem to release them from system memory. So I can push the weights onto the GPU but can't clear up the RAM afterwards.

[image: memory usage screenshot]

@elephantpanda elephantpanda changed the title My experiment creating a more efficient loader My experiment creating a more efficient loader - memory leak? Apr 9, 2023
@elephantpanda
Author

elephantpanda commented Apr 9, 2023

Update: I solved that memory "leak" with:

    using (FixedBufferOnnxValue value = FixedBufferOnnxValue.CreateFromTensor(new DenseTensor<Float16>(float16s, dims)))
    {
        binding.BindInput(key, value);
        binding.SynchronizeBoundInputs();
    }

[image: memory usage screenshot]

So now my experiment can load a 1.6GB onnx file with barely any RAM usage. Ideal for computers that are short on RAM. 😀🚀🌙

There is a tiny bit of RAM leakage, but that might be other parts of my code.

So the RAM issue is solved; my speed issue now is mainly how to get a load of ushorts from a file into a Tensor<Float16> array as fast as possible.
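
One idea I might try, assuming Float16 (Microsoft.ML.OnnxRuntime.Tensors) is a blittable 2-byte struct wrapping a ushort (which it seems to be in 1.15), is to reinterpret the raw bytes in one go with MemoryMarshal instead of converting element by element; the path and dims below are placeholders:

    // Sketch: load raw little-endian fp16 bytes and reinterpret them as Float16 in one block copy.
    // Assumes Float16 is a blittable 2-byte struct, as in ORT 1.15.
    using System.IO;
    using System.Runtime.InteropServices;
    using Microsoft.ML.OnnxRuntime.Tensors;

    byte[] bytes = File.ReadAllBytes("weights/some_weight.bin");              // placeholder path
    Float16[] float16s = MemoryMarshal.Cast<byte, Float16>(bytes).ToArray();  // single copy, no per-element loop

    int[] dims = { 320, 4, 3, 3 };                                            // placeholder shape
    var tensor = new DenseTensor<Float16>(float16s, dims);

ToArray() still makes one copy of the span; avoiding even that would need a custom MemoryManager over the byte buffer.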

@elephantpanda elephantpanda changed the title My experiment creating a more efficient loader - memory leak? My experiment creating a more efficient loader. Uses 90% less RAM. Apr 9, 2023
@elephantpanda elephantpanda changed the title My experiment creating a more efficient loader. Uses 90% less RAM. My experiment creating a more efficient loader. Uses 90% less RAM. 🚀 Apr 9, 2023
@elephantpanda
Author

elephantpanda commented Apr 9, 2023

Here are the results using the standard method and the new RAM-saving method:

[image: RAM usage comparison between the two methods]

My next step might be to compress the weight files on the disk to also save on HD space.
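
A simple starting point might be System.IO.Compression.GZipStream over the raw weight files (paths are placeholders, and fp16 weights may not compress that well):

    // Sketch: gzip a raw weight file offline, then stream it back into a byte buffer at load time.
    using System.IO;
    using System.IO.Compression;

    // Compress once, ahead of time.
    using (FileStream src = File.OpenRead("weights/some_weight.bin"))
    using (FileStream dst = File.Create("weights/some_weight.bin.gz"))
    using (var gz = new GZipStream(dst, CompressionLevel.Optimal))
    {
        src.CopyTo(gz);
    }

    // Decompress while loading, straight into memory.
    byte[] bytes;
    using (FileStream src = File.OpenRead("weights/some_weight.bin.gz"))
    using (var gz = new GZipStream(src, CompressionMode.Decompress))
    using (var ms = new MemoryStream())
    {
        gz.CopyTo(ms);
        bytes = ms.ToArray();
    }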

@pranavsharma
Contributor

@pauldog You've externalized the weights meaning you've arranged for the weights to be stored separately from the onnx model. ORT memory maps external weights. Not fully sure about Windows but on Linux this is definitely not counted as resident memory of the process. So, unless you intend to share these weights across many sessions or processes, this memory saving is a bit misleading.

@elephantpanda
Author

elephantpanda commented Apr 13, 2023

> @pauldog You've externalized the weights meaning you've arranged for the weights to be stored separately from the onnx model. ORT memory maps external weights. Not fully sure about Windows but on Linux this is definitely not counted as resident memory of the process. So, unless you intend to share these weights across many sessions or processes, this memory saving is a bit misleading.

Sorry, I think you misunderstood what I am doing. The point of all this is to load the model onto the GPU without a memory spike in RAM, which can lead to out-of-memory exceptions. Once the model is loaded it uses exactly the same RAM and VRAM.

This means I can load a 12GB model into VRAM without the RAM spiking by 24GB and causing a memory exception. Without externalising the weights this was not possible.

To reiterate, it is the loader that is using less RAM, not the model itself, which is very clear from my memory readings above showing no spike when I load it this way.

I tried to follow your instructions in #15080, but I couldn't understand them as I am using C#, so I did it this way.

Unless I misunderstood and there's an easier way to avoid this memory spike? (Maybe there is no spike on Linux, but there definitely is one on Windows, even with external weights.)

I am using Windows because I am a game developer, so these things need to run on Windows. I can't ship software with memory spikes because it will crash people's computers.

@elephantpanda elephantpanda changed the title My experiment creating a more efficient loader. Uses 90% less RAM. 🚀 My experiment creating a more efficient loader. Uses 90% less RAM (while loading). 🚀 Apr 13, 2023
@nums11
Contributor

nums11 commented Sep 5, 2023

Closing as resolved.

@nums11 nums11 closed this as completed Sep 5, 2023