Tried this on NVIDIA Labs SANA Pipe and made absolutely 0 difference #4
Comments
OK, I tried it here and it broke the output.
I tried the VAE and it makes no difference to the output or VRAM.
DC-AE uses depth-wise convolutions, so there may be some bugs there. How large is the resolution you used for SDXL and SD3? If the resolution is small, the VRAM usage will be similar.
The resolution is 4096x4096. I tried 1024 splits and it gave an error; with 512 splits it still goes OOM :) By the way, at 2048x2048 the model fits into 24 GB of VRAM, and using patch_conv still didn't give any VRAM reduction.
It seems you've resolved the issue with VAE tiling in Diffusers, which directly tiles the decoder input. While that approach effectively reduces memory usage, it can come at the cost of some image quality. In contrast, PatchConv may require more GPU memory, but it guarantees mathematically equivalent results and addresses the PyTorch Conv2d memory issue. For 2K images, the convolution inputs are not large enough to trigger patchified inference (see this line). However, when increasing the resolution to 4K for SD3.5, PatchConv successfully reduced memory usage to 53 GB and the results look correct. Regarding SANA's VAE, I noticed it already employs the input-tiling method from Diffusers.
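For reference, the Diffusers-side input tiling mentioned above is a one-line switch on the VAE, whereas patch_conv rewrites the convolution layers themselves. A minimal sketch of the tiling route, assuming a recent Diffusers release where the pipeline's VAE exposes `enable_tiling()`; the checkpoint id is only a placeholder:

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Placeholder checkpoint id; substitute the model you are actually testing.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

# Diffusers-style input tiling: the decoder sees tiles of the latent, which
# lowers peak VRAM but can introduce slight seams / quality loss.
pipe.vae.enable_tiling()

image = pipe("a photo of a cat", height=4096, width=4096).images[0]
```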
I haven't studied the code with the limitation and the technology in general in detail, but the questions are: Is it theoretically possible to reduce the VRAM requirements with small parameter values (with small sizes of generated images)? Or does it only work in combination with large parameters? In my opinion, in some sense it is similar to traditional "Tile Upscaling", when the image is split into smaller images and then regenerated in parts (thereby avoiding). Are there any problems with patch boundaries in your implementation and with patches in general? Will patches capture the context of the whole image during generation? Or is there no such problem due to the peculiarities of creating and using patches in your implementation? |
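On the boundary question: because the convolution input itself is split, with enough overlap to cover the kernel's receptive field, the patched result can match the full convolution exactly, unlike tile upscaling, which re-generates each tile. A small self-contained check of that idea (a hand-rolled patched conv with a one-pixel halo, not the library's actual code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A 3x3 conv with padding=1, the common case inside VAE decoders.
conv = nn.Conv2d(16, 16, kernel_size=3, padding=1)
x = torch.randn(1, 16, 256, 256)

full = conv(x)

# Patchified inference along height with a 1-pixel halo, so every patch
# sees the neighbouring rows its 3x3 kernel needs at the seams.
splits, halo = 4, 1
h = x.shape[2] // splits
outs = []
for i in range(splits):
    top = max(i * h - halo, 0)
    bottom = min((i + 1) * h + halo, x.shape[2])
    out = conv(x[:, :, top:bottom, :])
    crop = i * h - top                      # rows contributed only by the halo
    outs.append(out[:, :, crop:crop + h, :])
patched = torch.cat(outs, dim=2)

# Matches the full convolution up to floating-point noise: no seam artifacts.
print(torch.allclose(full, patched, atol=1e-6))
```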
The model and pipeline are here:
https://github.com/NVlabs/Sana
How I tried it is below. The app works, but there is zero difference in VRAM or output.
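(The snippet itself did not survive in this copy of the thread.) For context, a sketch of how patch_conv is typically wrapped around a Diffusers pipeline; the `convert_model(module, splits=...)` entry point, the checkpoint id, and the choice of converting the VAE are assumptions, not the reporter's exact code:

```python
import torch
from diffusers import SanaPipeline
from patch_conv import convert_model  # assumed entry point of the patch_conv package

# Placeholder checkpoint id; use the SANA weights you are actually testing.
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Wrap the conv layers of the VAE decoder, where the largest activations live.
# Note: if the inputs never exceed patch_conv's size threshold, the converted
# layers behave like plain convs and VRAM stays the same -- consistent with
# the "0 difference" observed here at 1K/2K resolutions.
pipe.vae = convert_model(pipe.vae, splits=4)

image = pipe("a photo of a cat", height=2048, width=2048).images[0]
```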