Conversation

@calledit commented Apr 6, 2025

This PR reduces CUDA memory use by:

  1. Not keeping all frames of the full-resolution depth estimation on the GPU.
  2. Moving the downscaled depth estimations to the GPU only when they are about to be used there.
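The two points above can be sketched as a small PyTorch helper. This is an illustrative sketch, not the PR's actual code: the names `process_depth_chunks` and `model_step` are hypothetical, and the only technique shown is keeping the full frame stack on the CPU while moving one chunk at a time to the compute device.

```python
import torch

def process_depth_chunks(depth_cpu, model_step, chunk_size=8,
                         device="cuda" if torch.cuda.is_available() else "cpu"):
    """Hypothetical sketch: keep the full depth stack on the CPU and move
    only the chunk currently being processed to the GPU."""
    outputs = []
    for start in range(0, depth_cpu.shape[0], chunk_size):
        # Move this chunk to the device only when it is about to be used.
        chunk = depth_cpu[start:start + chunk_size].to(device)
        out = model_step(chunk)
        outputs.append(out.cpu())  # return the result to the CPU immediately
        del chunk, out             # free device memory before the next chunk
    return torch.cat(outputs, dim=0)
```

With this pattern, peak GPU memory scales with `chunk_size` rather than with the total number of frames.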

@slothfulxtx (Collaborator) commented Apr 8, 2025

Thanks for your advice. Following your instructions, we also ran some experiments on reducing memory usage by moving as many tensors as possible to the CPU, including the decoded point-map results. However, this modification leads to only a minor change in memory usage. Testing with a 576*1024, 110-frame video, memory usage is still around 40G, with no obvious decline. Do you have any further suggestions on this problem? Maybe this modification only works for downsampled processing?

@calledit (Author) commented Apr 8, 2025

> We test with 576*1024 110-frame video, the memory usage is still around 40G, without obvious decline. Do you have any further suggestion on this problem? Maybe this modification only works for downsampled processing?

It only reduces CUDA memory use when the input is longer than 110 frames and/or there is downsampling.

I have only tested at `--height 384 --width 640` with an original input of 1080x1920.
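That downsampled case is where the savings come from: only the downscaled copy of each depth frame needs to survive, and it can live on the CPU. A minimal sketch, assuming a per-frame depth tensor of shape `(C, H, W)` (the function name `offload_downscaled` is hypothetical, not GeometryCrafter's API):

```python
import torch
import torch.nn.functional as F

def offload_downscaled(depth_full, height=384, width=640):
    """Hypothetical sketch: downscale one full-resolution depth frame and
    keep only the downscaled copy, stored on the CPU, instead of retaining
    every full-resolution frame on the GPU."""
    small = F.interpolate(depth_full.unsqueeze(0),  # add batch dim: (1, C, H, W)
                          size=(height, width),
                          mode="bilinear", align_corners=False)
    return small.squeeze(0).cpu()  # (C, height, width) on the CPU
```

For a 1080x1920 input downscaled to 384x640, each stored frame is roughly 8x smaller, which is why the gain only shows up when downsampling (or long sequences) are involved.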

@calledit (Author) commented Apr 8, 2025

With the latest changes, and using chunk size 6 instead of 8, I managed to get the model to process 1024 x 576 x 108 frames on an Nvidia 3090.

@slothfulxtx (Collaborator) commented Apr 9, 2025

Really appreciate your work on optimizing our implementation. Could you modify the corresponding part in the determ pipeline, so I can merge the PR in one go? By the way, remember to update to our latest version. Thanks again for your work.

@slothfulxtx (Collaborator) commented:

I've just checked your modification. Here is some advice:

  1. I'd prefer not to modify third_party/__init__.py; putting all device-moving operations in geometrycrafter/diff_ppl.py is friendlier for further improvement.
  2. Using `device` or `self._execution_device` instead of "cuda" follows the existing code style.

The frequent device-moving operations, and running some interpolation operations on the CPU, may affect inference speed; I'll look for a balance between inference speed and memory usage. Inspired by your modification, I'll update our implementation after merging your PR, which will make GeometryCrafter better.
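The style point about `self._execution_device` can be illustrated with a minimal sketch. `TinyPipeline` below is a made-up stand-in, not GeometryCrafter's pipeline class; it just shows the convention of resolving the target device from the module's own parameters instead of hard-coding `"cuda"`:

```python
import torch

class TinyPipeline:
    """Hypothetical minimal pipeline illustrating the device-style advice."""

    def __init__(self, module):
        self.module = module

    @property
    def _execution_device(self):
        # Ask the wrapped module where its weights live, rather than
        # assuming "cuda"; this also works on CPU-only machines.
        return next(self.module.parameters()).device

    def run(self, x_cpu):
        device = self._execution_device
        # Move the input to the resolved device, compute, return to CPU.
        return self.module(x_cpu.to(device)).cpu()
```

This keeps the pipeline portable: moving the module with `.to(...)` automatically changes where inputs are sent, with no string literals to update.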

@slothfulxtx (Collaborator) commented:

We've integrated this feature into the repo, thanks again for your helpful suggestions!
