
My experiment creating a more efficient loader. Uses 90% less RAM (while loading). 🚀 #15429

Closed
elephantpanda opened this issue Apr 7, 2023 · 6 comments
Labels
ep:DML issues related to the DirectML execution provider platform:windows issues related to the Windows platform

Comments

@elephantpanda

elephantpanda commented Apr 7, 2023

More Efficient Onnx Loader

I am working on creating a more efficient loader for onnx to use less RAM in C#. Currently the session loads the whole file into RAM and then unpacks it in RAM, meaning it can take 2x the size of the onnx file in RAM just to load it onto the GPU. This can lead to out-of-memory errors. Using this new method, you can load a 10GB onnx onto the GPU with only 12GB of RAM.

Hopefully, by sequentially loading the weights from separate files one by one, RAM usage while loading the models should drop considerably. I'm not sure why the default method is so inefficient, but it can be worked around this way.

Here are my steps, starting from a torch model (a rough C# sketch of steps 4 to 9 follows the list):

  1. Export the torch model with torch.onnx.export(..) and the flag export_params=True. [This creates an onnx file with embedded weights, "model.onnx".]

  2. Export the torch model with torch.onnx.export(..) and the flag export_params=False. [This creates a small onnx file without weights, "model_no_weights.onnx".]

  3. Load model.onnx and then call onnx.save(model, output_path, save_as_external_data=True, all_tensors_to_one_file=False). [This takes the large model.onnx file and splits all the weights out into separate files: "model_separated.onnx" plus lots of weight files.]

  4. Create a session using "model_no_weights.onnx".

  5. Iterate through all the 'inputs' (each weight becomes an input), get the name of the input, and use it to load the corresponding weight file from disk. Use IOBinding to bind these weights to that input.

  6. Release the RAM from the IOBinding (I haven't worked out how to do this yet!)

  7. Bind the actual input data

  8. Bind the output

  9. Run the inference with RunWithBindingAndNames()
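
To make the flow concrete, here is a rough C# sketch of steps 4 to 9, assuming the Microsoft.ML.OnnxRuntime.DirectML package. The input/output names ("sample", "out_sample"), the shapes, the "weights/" folder layout and the LoadFloat16File helper are placeholders for illustration only, not part of the ORT API:

    // Rough sketch of steps 4 to 9. Input/output names, shapes and the weight-file
    // layout are placeholders; the ONNX Runtime calls are the DirectML C# API.
    using Microsoft.ML.OnnxRuntime;
    using Microsoft.ML.OnnxRuntime.Tensors;

    // Assumed layout: each weight stored as raw little-endian fp16 values in its own file.
    static Float16[] LoadFloat16File(string path) =>
        System.Runtime.InteropServices.MemoryMarshal
            .Cast<byte, Float16>(System.IO.File.ReadAllBytes(path)).ToArray();

    var options = new SessionOptions();
    options.AppendExecutionProvider_DML(0);                        // DirectML, device 0

    // Step 4: build the session from the weight-free graph.
    using var session = new InferenceSession("model_no_weights.onnx", options);
    using var binding = session.CreateIoBinding();

    // Step 5: every former initializer now appears as an input; bind each one from its own file.
    foreach (var kv in session.InputMetadata)
    {
        string name = kv.Key;
        if (name == "sample") continue;                            // skip the real model input (placeholder name)
        Float16[] float16s = LoadFloat16File("weights/" + name + ".bin");
        using (var value = FixedBufferOnnxValue.CreateFromTensor(
                   new DenseTensor<Float16>(float16s, kv.Value.Dimensions)))
        {
            binding.BindInput(name, value);
            // Step 6: freeing the CPU-side copy here is the part I haven't worked out yet.
        }
    }

    // Steps 7 to 9: bind the real input and the output, then run with the binding.
    var sample = new DenseTensor<Float16>(new[] { 1, 4, 64, 64 }); // placeholder input
    using (var input = FixedBufferOnnxValue.CreateFromTensor(sample))
    {
        binding.BindInput("sample", input);
        binding.BindOutputToDevice("out_sample", OrtMemoryInfo.DefaultInstance);
        using var results = session.RunWithBindingAndNames(new RunOptions(), binding);
    }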

So far I have got it running with DirectML in C#. For my experiments I bind each weight with:

    using (FixedBufferOnnxValue value = FixedBufferOnnxValue.CreateFromTensor(new DenseTensor<Float16>(float16s, dims)))
    {
        binding.BindInput(key, value);
    }

where float16s are the values loaded from one of the separate weight files.

This is very convoluted, but if I eventually get it to work it should use practically no extra RAM while loading a model onto the GPU.

If someone has a better way of doing the same thing, let me know.

What would make the steps easier might be a function like LoadSessionWithoutWeights() or LoadSessionSequentiallyFromFiles(). In the case where the onnx model is a load of separate weight files, the default behaviour should surely be to load it sequentially, freeing up RAM as it goes. This is my proof of concept. I don't know whether all this binding will affect the speed of the inference.

(Is this the best way to do it, or should I be using AddInitializer and PrePackedWeightsContainer?)

Another problem I had: for the stable diffusion Unet, the onnx file I created with no weights took 40 seconds to load, whereas the unet with weights took 8 seconds. I don't know what went wrong there!

To reproduce

as above

Urgency

No response

Platform

Windows

OS Version

Windows 10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.15

ONNX Runtime API

C#

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

No response

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider ep:DML issues related to the DirectML execution provider platform:windows issues related to the Windows platform labels Apr 7, 2023
@elephantpanda elephantpanda changed the title My experiement creating a more efficient loader My experiment creating a more efficient loader Apr 7, 2023
@github-actions github-actions bot removed the ep:CUDA issues related to the CUDA execution provider label Apr 8, 2023
@elephantpanda
Author

elephantpanda commented Apr 9, 2023

My experiment is not working so far because there appears to be a memory leak (DirectML, C#). The code:

    using (FixedBufferOnnxValue value = FixedBufferOnnxValue.CreateFromTensor(new DenseTensor<Float16>(float16s, dims)))
    {
        binding.BindInput(key, value);
    }

pushes the float values onto the GPU but doesn't seem to release them from system memory. So I can push the weights onto the GPU but can't clear up the RAM afterwards.

[image: memory usage screenshot]

@elephantpanda elephantpanda changed the title My experiment creating a more efficient loader My experiment creating a more efficient loader - memory leak? Apr 9, 2023
@elephantpanda
Author

elephantpanda commented Apr 9, 2023

Update: I solved that memory "leak" with:

    using (FixedBufferOnnxValue value = FixedBufferOnnxValue.CreateFromTensor(new DenseTensor<Float16>(float16s, dims)))
    {
        binding.BindInput(key, value);
        binding.SynchronizeBoundInputs();
    }

[image: memory usage screenshot]

So now my experiment can load a 1.6GB onnx file with barely any RAM usage. Ideal for computers that are short on RAM. 😀🚀🌙

There is a tiny bit of RAM leakage, but that might be other parts of my code.

So the RAM issue is solved; my speed issue now is mainly how to get a load of ushorts from a file into a Tensor<Float16> array as fast as possible.
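
One idea I might try, assuming Float16 (Microsoft.ML.OnnxRuntime.Tensors) is a blittable 2-byte struct wrapping a ushort (which it seems to be in 1.15), is to reinterpret the raw bytes in one go with MemoryMarshal instead of converting element by element; the path and dims below are placeholders:

    // Sketch: load raw little-endian fp16 bytes and reinterpret them as Float16 in one block copy.
    // Assumes Float16 is a blittable 2-byte struct, as in ORT 1.15.
    using System.IO;
    using System.Runtime.InteropServices;
    using Microsoft.ML.OnnxRuntime.Tensors;

    byte[] bytes = File.ReadAllBytes("weights/some_weight.bin");              // placeholder path
    Float16[] float16s = MemoryMarshal.Cast<byte, Float16>(bytes).ToArray();  // single copy, no per-element loop

    int[] dims = { 320, 4, 3, 3 };                                            // placeholder shape
    var tensor = new DenseTensor<Float16>(float16s, dims);

ToArray() still makes one copy of the span; avoiding even that would need a custom MemoryManager over the byte buffer.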

@elephantpanda elephantpanda changed the title My experiment creating a more efficient loader - memory leak? My experiment creating a more efficient loader. Uses 90% less RAM. Apr 9, 2023
@elephantpanda elephantpanda changed the title My experiment creating a more efficient loader. Uses 90% less RAM. My experiment creating a more efficient loader. Uses 90% less RAM. 🚀 Apr 9, 2023
@elephantpanda
Author

elephantpanda commented Apr 9, 2023

Here are the results using the standard method and the new RAM-saving method:

[image: RAM usage comparison between the two methods]

My next step might be to compress the weight files on the disk to also save on HD space.
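
A simple starting point might be System.IO.Compression.GZipStream over the raw weight files (paths are placeholders, and fp16 weights may not compress that well):

    // Sketch: gzip a raw weight file offline, then stream it back into a byte buffer at load time.
    using System.IO;
    using System.IO.Compression;

    // Compress once, ahead of time.
    using (FileStream src = File.OpenRead("weights/some_weight.bin"))
    using (FileStream dst = File.Create("weights/some_weight.bin.gz"))
    using (var gz = new GZipStream(dst, CompressionLevel.Optimal))
    {
        src.CopyTo(gz);
    }

    // Decompress while loading, straight into memory.
    byte[] bytes;
    using (FileStream src = File.OpenRead("weights/some_weight.bin.gz"))
    using (var gz = new GZipStream(src, CompressionMode.Decompress))
    using (var ms = new MemoryStream())
    {
        gz.CopyTo(ms);
        bytes = ms.ToArray();
    }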

@pranavsharma
Contributor

@pauldog You've externalized the weights meaning you've arranged for the weights to be stored separately from the onnx model. ORT memory maps external weights. Not fully sure about Windows but on Linux this is definitely not counted as resident memory of the process. So, unless you intend to share these weights across many sessions or processes, this memory saving is a bit misleading.

@elephantpanda
Author

elephantpanda commented Apr 13, 2023

> @pauldog You've externalized the weights meaning you've arranged for the weights to be stored separately from the onnx model. ORT memory maps external weights. Not fully sure about Windows but on Linux this is definitely not counted as resident memory of the process. So, unless you intend to share these weights across many sessions or processes, this memory saving is a bit misleading.

Sorry, I think you misunderstood what I am doing. The point of all this is to load the model onto the GPU without a memory spike in RAM, which can lead to out-of-memory exceptions. Once the model is loaded it uses exactly the same RAM and VRAM.

This means I can load a 12GB model into VRAM without the RAM spiking by 24GB and causing a memory exception. Without externalising the weights this was not possible.

To reiterate, it is the loader that is using less RAM, not the model itself, which is very clear from my memory readings above showing no spike when I load it this way.

I tried to follow your instructions in #15080, but I couldn't understand them as I am using C#, so I did it this way.

Unless I misunderstood and there's an easier way to avoid this memory spike? (Maybe there is no spike on Linux, but there definitely is one on Windows, even with external weights.)

I am using Windows because I am a game developer, so these things need to run on Windows. I can't ship software with memory spikes because it will crash people's computers.

@elephantpanda elephantpanda changed the title My experiment creating a more efficient loader. Uses 90% less RAM. 🚀 My experiment creating a more efficient loader. Uses 90% less RAM (while loading). 🚀 Apr 13, 2023
@nums11
Contributor

nums11 commented Sep 5, 2023

Closing as resolved.

@nums11 nums11 closed this as completed Sep 5, 2023