Maximum frames/steps etc for 24GB card? Keep getting OOM #189

Open

shaun-ba opened this issue Dec 20, 2024 · 11 comments

Comments

@shaun-ba

shaun-ba commented Dec 20, 2024

As the title says, just wondering what we should be looking at; even at 720 I'm getting OOMs (sometimes it works if I restart Comfy), so maybe something isn't being released after generation.

edit 1:

I don't see how people are generating videos of decent size/length; I'm only able to get to 624x832 with 45 frames.

edit 2:

Best I can generate so far with a 3090 and Sage with block swap. Is this the best to be expected?

  • Swapping 20 double blocks and 0 single blocks
  • Sampling 97 frames in 25 latents at 544x960 with 30 inference steps
@kijai
Owner

kijai commented Dec 20, 2024

I only have experience with a 4090: 129 frames at 960x544 uses about 22GB with torch.compile; without torch.compile it will OOM, in both Comfy native and this wrapper. Compile seems to have a huge effect on VRAM use, and is about 30% faster, but from what I hear compiling isn't working at fp8 on a 3090 and requires a 40xx card.
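
For anyone unfamiliar with what the compile option does under the hood, here is a generic sketch of torch.compile on a toy transformer in plain PyTorch. The module and names are placeholders for illustration, not the wrapper's actual code or node settings.

```python
# Generic illustration of torch.compile on a transformer-style module.
# Placeholder model for illustration only, not the wrapper's actual code.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)

model = nn.Sequential(*[Block() for _ in range(4)]).cuda().half()

# Compilation lets Inductor fuse kernels, which is where the reported speed
# and VRAM savings come from. The first call is slow (it compiles); later
# calls reuse the compiled graph.
compiled = torch.compile(model)

x = torch.randn(1, 256, 1024, device="cuda", dtype=torch.half)
with torch.no_grad():
    out = compiled(x)
```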

With the wrapper, sage/flash have proper memory use; the sdpa implementation is highly inefficient and much better in Comfy native.

You can additionally enable swapping for up to 40 single blocks too.

As for releasing the VRAM, it's always done when force_offload is enabled in the node, but it is NOT done if you interrupt the process, so that can leave stuff in VRAM temporarily.
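
To make the block-swap and offload behaviour concrete, here is a rough sketch of the idea in plain PyTorch: keep only some transformer blocks resident on the GPU, move the swapped ones in and out per forward pass, and push the whole model off the GPU once sampling finishes. This is an illustration of the concept, not the wrapper's implementation, and the function names are made up.

```python
# Rough sketch of the block-swap idea: swapped blocks live in system RAM and
# only visit the GPU for their own forward pass, trading speed for VRAM.
# Illustrative only; not the wrapper's actual code.
import torch
import torch.nn as nn

def forward_with_block_swap(blocks: nn.ModuleList, x: torch.Tensor, blocks_to_swap: int) -> torch.Tensor:
    for i, block in enumerate(blocks):
        if i < blocks_to_swap:
            block.to("cuda")   # bring the offloaded block in
            x = block(x)
            block.to("cpu")    # push it back out to free VRAM for the next one
        else:
            x = block(x)       # resident blocks run normally
    return x

def force_offload(model: nn.Module) -> None:
    # After sampling, move everything off the GPU and release cached memory;
    # if the run is interrupted before this point, the VRAM stays allocated.
    model.to("cpu")
    torch.cuda.empty_cache()
```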

@Fredd4e

Fredd4e commented Dec 20, 2024

> As the title says, just wondering what we should be looking at; even at 720 I'm getting OOMs (sometimes it works if I restart Comfy), so maybe something isn't being released after generation.
>
> edit 1:
>
> I don't see how people are generating videos of decent size/length; I'm only able to get to 624x832 with 45 frames.
>
> edit 2:
>
> Best I can generate so far with a 3090 and Sage with block swap. Is this the best to be expected?
>
> * Swapping 20 double blocks and 0 single blocks
> * Sampling 97 frames in 25 latents at 544x960 with 30 inference steps

Hey,

I have a GTX 1080 Ti and I can do these settings; it takes a long time though because of the old GPU, but if you use the same settings, how high can you crank the resolution/frames? (The 1080 Ti only has 11GB of VRAM.)

[screenshot of the settings used]

Btw, these settings take me 160s per iteration; I'm curious how long it takes on a 3090 as I may buy one soon.

@shaun-ba
Author

> Hey,
>
> I have a GTX 1080 Ti and I can do these settings; it takes a long time though because of the old GPU, but if you use the same settings, how high can you crank the resolution/frames? (The 1080 Ti only has 11GB of VRAM.)

Currently I've only been testing for a few hours, but the below takes 3.2 min:

Sampling 129 frames in 33 latents at 512x384 with 20 inference steps

@shaun-ba
Author

@Fredd4e If you are on Linux, installing Sage is pretty simple and apparently gives good time savings.
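
For context, SageAttention is a drop-in replacement for scaled-dot-product attention that needs Triton (hence the Linux suggestion). A minimal usage sketch is below; treat the exact import and call signature as an assumption and check the project's README.

```python
# Assumed SageAttention usage as a drop-in attention kernel (install with
# `pip install sageattention`; requires a working Triton). The exact
# signature here is an assumption - check the SageAttention README.
import torch
from sageattention import sageattn

# (batch, heads, sequence, head_dim) fp16 tensors on the GPU
q = torch.randn(1, 24, 4096, 128, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Plays the same role as torch.nn.functional.scaled_dot_product_attention(q, k, v)
out = sageattn(q, k, v, is_causal=False)
```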

@shaun-ba
Author

shaun-ba commented Dec 20, 2024

@kijai Block swap keeps me under 24GB, but then I get OOM on VideoDecode, so it kind of defeats the point unless I'm missing something?

I also see that all of these methods are for low VRAM; I can't get anywhere near the sizes suggested by the model creators, and I wouldn't consider 24GB to be low VRAM for an fp8 model.

@kijai
Owner

kijai commented Dec 20, 2024

> @kijai Block swap keeps me under 24GB, but then I get OOM on VideoDecode, so it kind of defeats the point unless I'm missing something?
>
> I also see that all of these methods are for low VRAM; I can't get anywhere near the sizes suggested by the model creators, and I wouldn't consider 24GB to be low VRAM for an fp8 model.

You can reduce the tile size on the decode node; it works fine with 128 spatial (which halves the VRAM use compared to the default 256), but keep the temporal at 64 to avoid stuttering/ghosting in the result. You have to disable auto_size for the adjustments to take effect.

The max resolution is very heavy; they did say it takes something like 60GB initially, after all, so we are very much in "low VRAM" territory with 24GB.
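
The reason a smaller spatial tile roughly halves decode memory is that the decoder only ever holds one tile's activations at a time, so peak VRAM tracks the tile area rather than the full frame. A simplified spatial-tiling loop (ignoring the border blending and temporal handling a real tiled decoder does) might look like the sketch below; the helper and its arguments are illustrative, not the node's actual parameters.

```python
# Simplified spatial tiling for VAE decode: latents are decoded one tile at a
# time, so peak activation memory scales with the tile area, not the frame size.
# A real tiled decoder also overlaps and blends tile borders and handles the
# temporal axis; both are omitted here for clarity. Illustrative only.
import torch

def decode_spatially_tiled(vae, latents: torch.Tensor, tile: int = 128, scale: int = 8) -> torch.Tensor:
    # latents: (B, C, H, W) in latent space; `scale` is the VAE's spatial upscale factor.
    B, C, H, W = latents.shape
    out = torch.zeros(B, 3, H * scale, W * scale)   # assembled on the CPU
    for y in range(0, H, tile):
        for x in range(0, W, tile):
            tile_latents = latents[:, :, y:y + tile, x:x + tile]
            with torch.no_grad():
                decoded = vae.decode(tile_latents)  # only this tile's activations are on the GPU
            out[:, :, y * scale:(y + tile) * scale,
                      x * scale:(x + tile) * scale] = decoded.cpu()
    return out
```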

@shaun-ba
Author

> @kijai Block swap keeps me under 24GB, but then I get OOM on VideoDecode, so it kind of defeats the point unless I'm missing something?
> I also see that all of these methods are for low VRAM; I can't get anywhere near the sizes suggested by the model creators, and I wouldn't consider 24GB to be low VRAM for an fp8 model.
>
> You can reduce the tile size on the decode node; it works fine with 128 spatial (which halves the VRAM use compared to the default 256), but keep the temporal at 64 to avoid stuttering/ghosting in the result. You have to disable auto_size for the adjustments to take effect.
>
> The max resolution is very heavy; they did say it takes something like 60GB initially, after all, so we are very much in "low VRAM" territory with 24GB.

Makes sense, can't wait for a dual 5090 setup!

For me, for now, it seems I max out at 1024x1024 with 109 frames = 50 minutes! Not really worth it, but I'm sure improvements will come soon.

@Fredd4e

Fredd4e commented Dec 20, 2024

> @Fredd4e If you are on Linux, installing Sage is pretty simple and apparently gives good time savings.

I wish; currently I am on Windows 10. I did give SageAttention a try - if I understand correctly I need Triton to run it, and Triton does not seem to support my 1080 Ti.

However, if you do still think it should work, I'd love to dig deeper.

@eas125

eas125 commented Dec 22, 2024

> I only have experience with a 4090: 129 frames at 960x544 uses about 22GB with torch.compile; without torch.compile it will OOM, in both Comfy native and this wrapper. Compile seems to have a huge effect on VRAM use, and is about 30% faster, but from what I hear compiling isn't working at fp8 on a 3090 and requires a 40xx card.
>
> With the wrapper, sage/flash have proper memory use; the sdpa implementation is highly inefficient and much better in Comfy native.
>
> You can additionally enable swapping for up to 40 single blocks too.
>
> As for releasing the VRAM, it's always done when force_offload is enabled in the node, but it is NOT done if you interrupt the process, so that can leave stuff in VRAM temporarily.

Sorry to hijack a bit, but I'm trying to run torch.compile and get this error. Any ideas?
[screenshot of the torch.compile error]

@kijai
Owner

kijai commented Dec 22, 2024

> I only have experience with a 4090: 129 frames at 960x544 uses about 22GB with torch.compile; without torch.compile it will OOM, in both Comfy native and this wrapper. Compile seems to have a huge effect on VRAM use, and is about 30% faster, but from what I hear compiling isn't working at fp8 on a 3090 and requires a 40xx card.
> With the wrapper, sage/flash have proper memory use; the sdpa implementation is highly inefficient and much better in Comfy native.
> You can additionally enable swapping for up to 40 single blocks too.
> As for releasing the VRAM, it's always done when force_offload is enabled in the node, but it is NOT done if you interrupt the process, so that can leave stuff in VRAM temporarily.
>
> Sorry to hijack a bit, but I'm trying to run torch.compile and get this error. Any ideas? [screenshot]

It looks like a bug in torch on Windows; it's probably going to be fixed in 2.6.0. For now you can manually edit the code as this PR indicates: https://github.com/pytorch/pytorch/pull/138992/files

That file would be in your venv or python_embeded folder, for example:

\python_embeded\Lib\site-packages\torch\_inductor
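
If you're unsure which Python install ComfyUI is using, a quick way to find the exact folder is to ask that interpreter where torch keeps its inductor package:

```python
# Run this with the same Python that ComfyUI uses (venv or python_embeded)
# to print the torch/_inductor folder that the linked PR edits.
import os
import torch._inductor

print(os.path.dirname(torch._inductor.__file__))
```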

@eas125

eas125 commented Dec 22, 2024

> I only have experience with a 4090: 129 frames at 960x544 uses about 22GB with torch.compile; without torch.compile it will OOM, in both Comfy native and this wrapper. Compile seems to have a huge effect on VRAM use, and is about 30% faster, but from what I hear compiling isn't working at fp8 on a 3090 and requires a 40xx card.
> With the wrapper, sage/flash have proper memory use; the sdpa implementation is highly inefficient and much better in Comfy native.
> You can additionally enable swapping for up to 40 single blocks too.
> As for releasing the VRAM, it's always done when force_offload is enabled in the node, but it is NOT done if you interrupt the process, so that can leave stuff in VRAM temporarily.
>
> Sorry to hijack a bit, but I'm trying to run torch.compile and get this error. Any ideas? [screenshot]
>
> It looks like a bug in torch on Windows; it's probably going to be fixed in 2.6.0. For now you can manually edit the code as this PR indicates: https://github.com/pytorch/pytorch/pull/138992/files
>
> That file would be in your venv or python_embeded folder, for example:
>
> \python_embeded\Lib\site-packages\torch\_inductor

That did it. Thank you.
