AMD generating takes 25 minutes #1958

Closed
mpirescarvalho opened this issue Jan 17, 2024 · 12 comments
Labels
bug (AMD) (Something isn't working, AMD specific) · duplicate (This issue or pull request already exists)

Comments

mpirescarvalho commented Jan 17, 2024

Read Troubleshoot

[x] I admit that I have read the Troubleshoot before making this issue.

Describe the problem
It's working, but it's taking SUPER long to generate the images.

CPU: AMD Ryzen 7 5700X
RAM: 16 GB
SWAP: 44GB on M.2 SSD
GPU: AMD Radeon RX 6700 XT 12 GB VRAM

Full Console Log
C:\www\stable-diffusion\Fooocus>.\python_embeded\python.exe -s Fooocus\entry_with_update.py --directml
Already up-to-date
Update succeeded.
[System ARGV] ['Fooocus\entry_with_update.py', '--directml']
Python 3.10.9 (tags/v3.10.9:1dd9be6, Dec 6 2022, 20:01:21) [MSC v.1934 64 bit (AMD64)]
Fooocus version: 2.1.862
Running on local URL: http://127.0.0.1:7865

To create a public link, set share=True in launch().
Using directml with device:
Total VRAM 1024 MB, total RAM 16310 MB
Set vram state to: NORMAL_VRAM
Always offload VRAM
Device: privateuseone
VAE dtype: torch.float32
Using sub quadratic optimization for cross attention, if you have memory or speed issues try using: --attention-split
Refiner unloaded.
model_type EPS
UNet ADM Dimension 2816
Using split attention in VAE
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
Using split attention in VAE
extra {'cond_stage_model.clip_l.logit_scale', 'cond_stage_model.clip_g.transformer.text_model.embeddings.position_ids', 'cond_stage_model.clip_l.text_projection'}
Base model loaded: C:\www\stable-diffusion\Fooocus\Fooocus\models\checkpoints\juggernautXL_version6Rundiffusion.safetensors
Request to load LoRAs [['sd_xl_offset_example-lora_1.0.safetensors', 0.1], ['None', 1.0], ['None', 1.0], ['None', 1.0], ['None', 1.0]] for model [C:\www\stable-diffusion\Fooocus\Fooocus\models\checkpoints\juggernautXL_version6Rundiffusion.safetensors].
Loaded LoRA [C:\www\stable-diffusion\Fooocus\Fooocus\models\loras\sd_xl_offset_example-lora_1.0.safetensors] for UNet [C:\www\stable-diffusion\Fooocus\Fooocus\models\checkpoints\juggernautXL_version6Rundiffusion.safetensors] with 788 keys at weight 0.1.
Fooocus V2 Expansion: Vocab with 642 words.
Fooocus Expansion engine loaded for cpu, use_fp16 = False.
Requested to load SDXLClipModel
Requested to load GPT2LMHeadModel
Loading 2 new models
App started successful. Use the app with http://127.0.0.1:7865/ or 127.0.0.1:7865
[Parameters] Adaptive CFG = 7
[Parameters] Sharpness = 2
[Parameters] ADM Scale = 1.5 : 0.8 : 0.3
[Parameters] CFG = 4.0
[Parameters] Seed = 3435339128246104584
[Parameters] Sampler = dpmpp_2m_sde_gpu - karras
[Parameters] Steps = 30 - 15
[Fooocus] Initializing ...
[Fooocus] Loading models ...
Refiner unloaded.
[Fooocus] Processing prompts ...
[Fooocus] Preparing Fooocus text #1 ...
[Prompt Expansion] cat in spacesuit, light shining, intricate, elegant, sharp focus, professional color, highly detailed, sublime, innocent, dramatic, cinematic, new classic, beautiful, dynamic, attractive, cute, epic, stunning, brilliant, creative, positive, artistic, awesome, confident, colorful, shiny, iconic, cool, best, pure, quiet, lovely, great, relaxed
[Fooocus] Preparing Fooocus text #2 ...
[Prompt Expansion] cat in spacesuit, light flowing colors, extremely detailed, beautiful, intricate, elegant, sharp focus, highly detail, dramatic cinematic perfect, open color, inspired, rich deep vivid vibrant scenic full atmosphere, professional composition, stunning, magical, amazing, creative, wonderful, epic, hopeful, awesome, brilliant, surreal, symmetry, ambient, best, pure, fine, very
[Fooocus] Encoding positive #1 ...
[Fooocus] Encoding positive #2 ...
[Fooocus] Encoding negative #1 ...
[Fooocus] Encoding negative #2 ...
[Parameters] Denoising Strength = 1.0
[Parameters] Initial Latent shape: Image Space (896, 1152)
Preparation time: 12.77 seconds
[Sampler] refiner_swap_method = joint
[Sampler] sigma_min = 0.0291671771556139, sigma_max = 14.614643096923828
Requested to load SDXL
Loading 1 new model
loading in lowvram mode 64.0
[Fooocus Model Management] Moving model(s) has taken 70.08 seconds
0%| | 0/30 [00:00<?, ?it/s]C:\www\stable-diffusion\Fooocus\Fooocus\modules\anisotropic.py:132: UserWarning: The operator 'aten::std_mean.correction' is not currently supported on the DML backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at D:\a_work\1\s\pytorch-directml-plugin\torch_directml\csrc\dml\dml_cpu_fallback.cpp:17.)
s, m = torch.std_mean(g, dim=(1, 2, 3), keepdim=True)
100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [11:59<00:00, 23.98s/it]
Requested to load AutoencoderKL
Loading 1 new model
loading in lowvram mode 64.0
[Fooocus Model Management] Moving model(s) has taken 1.60 seconds
Image generated with private log at: C:\www\stable-diffusion\Fooocus\Fooocus\outputs\2024-01-17\log.html
Generating and saving time: 795.84 seconds
[Sampler] refiner_swap_method = joint
[Sampler] sigma_min = 0.0291671771556139, sigma_max = 14.614643096923828
Requested to load SDXL
Loading 1 new model
loading in lowvram mode 64.0
[Fooocus Model Management] Moving model(s) has taken 58.36 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [24:17<00:00, 48.57s/it]
Requested to load AutoencoderKL
Loading 1 new model
loading in lowvram mode 64.0
[Fooocus Model Management] Moving model(s) has taken 1.25 seconds
Image generated with private log at: C:\www\stable-diffusion\Fooocus\Fooocus\outputs\2024-01-17\log.html
Generating and saving time: 1519.52 seconds

f0n51 commented Jan 17, 2024

Your Fooocus is generating images with CPU only. That's the reason it takes so long.
The operator 'aten::std_mean.correction' is not currently supported on the DML backend and will fall back to run on the CPU

Have you read the readme concerning AMD GPUs?
https://github.com/lllyasviel/Fooocus?tab=readme-ov-file#windowsamd-gpus

mpirescarvalho commented Jan 17, 2024

Yes, I followed the instructions. This is my run.bat:

.\python_embeded\python.exe -m pip uninstall torch torchvision torchaudio torchtext functorch xformers -y
.\python_embeded\python.exe -m pip install torch-directml
.\python_embeded\python.exe -s Fooocus\entry_with_update.py --directml
pause

Note: GPU memory is being used during the generation process.

I've seen others running on AMD GPUs on Windows, so I'm not sure what is happening.

f0n51 commented Jan 17, 2024

I'm having the same problem on Windows with my AMD GPU and still couldn't find out what's wrong. There's an open issue in the DirectML project here:
https://github.com/microsoft/DirectML/issues/536

I switched over to Google Colab, as I couldn't get good results with my AMD GPU on either Windows or Linux.

@mpirescarvalho

I'll keep an eye on that issue, thanks

f0n51 commented Jan 17, 2024

Maybe someone from the dev team will join in and have a solution for that. I personally gave up on harassing my AMD GPU :-D

pscheit commented Jan 17, 2024

@mpirescarvalho it looks like it is using a GPU, but one with only 1024 MB of RAM? (It says it uses low VRAM mode.)
So maybe this is your onboard graphics?

@mpirescarvalho

Negative, my processor doesn't have onboard graphics.

@patientx

VRAM being reported as "1024" is normal; it's the same in ComfyUI. That's just how DirectML reports it.
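
For reference, you can double-check which adapter torch-directml is actually picking up with a one-liner like this (a rough sketch using the same embedded Python as in the run.bat above; device_count() and device_name() are standard torch-directml helpers, but verify against your installed version):

.\python_embeded\python.exe -c "import torch_directml; print(torch_directml.device_count()); print(torch_directml.device_name(0))"

If the RX 6700 XT is what gets listed there, the "Total VRAM 1024 MB" line in the log is just the backend's placeholder value, not a sign that an integrated GPU is being used.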

*** First of all, I have to say thanks to the devs for finally building an app that I can use on Windows to generate with SDXL models without crashing instantly or, at best, on the second try. I tried sdwebui, sdnext and comfyui; only with sdnext was I able to generate, but there the app just gives out-of-memory errors instantly or, at best, on the second try. With Fooocus, if I change models a lot the same out-of-memory errors eventually pop up, but if I use one or two models consistently I just get slow generation, no crashes at all...

I am using an RX 6600 8 GB and did various things to speed up generation.

  • First, I enabled low VRAM mode from the command line (see the sketch after this list). Since you have 12 GB you should be better off in this regard, but it could still help.

  • Second, my Windows swap file was on my SSD, which is my C drive. Because of our VRAM problems the app just moves the models to system RAM, which eventually fills up and spills over to the swap file, so I moved the swap file to my NVMe drive, which I normally use for other sdwebui stuff and games. Just from this, moving the models FELL FROM 80-100 seconds to 30-40 seconds.

  • Third, use turbo SDXL models with 8-10 steps, or use any SDXL model with an LCM LoRA. With SD 1.5, the results I am able to get after upscaling 2-3 times with various techniques, while trying not to trigger out-of-memory errors, are far beneath what I get here instantly. OK, the time I spent there with SD 1.5 and all the upscaling is around the same, but here it is far less worrisome to do.
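
As a sketch of the first point above, the launch line would look roughly like this with a low-VRAM flag appended (the exact flag name is an assumption on my part and may differ between Fooocus versions; check the launcher's --help output if it isn't accepted):

rem hypothetical variant of the OP's run.bat launch line with a low-VRAM flag appended
.\python_embeded\python.exe -s Fooocus\entry_with_update.py --directml --always-low-vram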

ALSO, that error, "The operator 'aten::std_mean.correction' is not currently supported on the DML backend and will fall back to run on the CPU", pops up from time to time in ComfyUI too, and as far as I know it didn't actually affect the speed there at all, or maybe only a bit. I am not sure how much it affects SDXL generation though; as far as I know that is already slow.

*** Finally, I also have to note that I am running both my GPU and CPU underpowered and underclocked: the 6600 is power limited to 48 W and my 3600 is power limited to 40 W. With these limits in mind:

Before I moved my swap to NVMe, an 8-step LCM-sampled "Extreme Speed" run using the default Juggernaut model:
3-image generation, preparation time 18-20 seconds, model offloading to system memory 80 to 100 seconds (smaller VAEs and such take 2-3 seconds), 8-step generation around 10 s/it, total time 485 seconds.

After moving the swap and making it 2 times my system memory (16 GB RAM, 32 GB swap file on NVMe):
3-image generation, preparation time 17-18 seconds, model offloading to system memory around 30 seconds each time, more or less (smaller VAEs and such take 2-3 seconds), 8-step generation around 7 s/it, total time 280 seconds.

So, swap on NVMe at two times system RAM is very effective, and LCM (or using turbo models) is effective. I am currently using normal models with an LCM LoRA and LCM samplers at around 10-12 steps: CFG 2 for turbo and 1.5 for LCM, so that I can use negatives.

mashb1t (Collaborator) commented Jan 17, 2024

@f0n51 thank you for providing the reference to DirectML and @patientx for your insights.

microsoft/DirectML#536 (comment) already references #1321, which I closed 2 weeks ago in #1321 (comment), as this is not an issue with Fooocus.

We can still keep this issue open, but I'd suggest closing it, as there's nothing we can actively do.
@mpirescarvalho this is your call.

mashb1t added the duplicate and bug (AMD) labels Jan 17, 2024
patientx commented Jan 17, 2024

One thing to do: once the first generation starts and that error pops up, just skip or stop it; the next ones won't have it. I just tested a bit more and the first run always has that error, with a step time around 40 s here too, but if I cancel it and start again, the step time starts around 30 s and drops to roughly 20 s for me (remember, 48 W power-limited 6600), so with 12 GB VRAM and a full-power 6700 XT everything would probably be much faster.

@mpirescarvalho

@mashb1t agreed

mpirescarvalho closed this as not planned Jan 18, 2024
mpirescarvalho commented Jan 25, 2024

Update:

After adding 16 GB more RAM to my setup, generation time went down to 2 minutes per image.

[image attachment]
