
Conversation

@wbruna commented Jul 2, 2025

This simply applies the existing config to the image generation parameters.

Tested on SD1.5, SDXL and Flux, both on ROCm and Vulkan.
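
For context, the change boils down to forwarding the already-parsed flash-attention setting into the image-generation setup. A minimal sketch, with hypothetical names (the structs and fields below are illustrative placeholders, not the actual koboldcpp/stable-diffusion.cpp identifiers):

```cpp
// Minimal sketch, not the actual patch: koboldcpp already parses a global
// flash-attention setting on the LLM side; the idea here is simply to carry
// it over into the image-generation (stable-diffusion.cpp) parameters.
// All names below are illustrative placeholders.
struct kcpp_config {
    bool flash_attention = false;       // existing user-facing setting
};

struct sd_load_config {
    bool diffusion_flash_attn = false;  // flag consumed by the diffusion model
};

static sd_load_config make_sd_load_config(const kcpp_config& cfg) {
    sd_load_config out;
    out.diffusion_flash_attn = cfg.flash_attention;  // the whole change, conceptually
    return out;
}
```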

@LostRuins (Owner) commented:

Does it work well? Any speedups/memory savings? I haven't tried SD flash attn extensively.
I've been a little busy recently, so before I merge this, can we do a quick sanity check? I see you've tried SD1.5, SDXL and Flux; can we also:

  • Test multiple resolutions (basically 512x512, 1024x1024, and a non-square one)
  • Test with a LoRA (any LoRA is fine; I know SD1.5 and SDXL LoRAs usually work)
  • Test with/without VAE tiling
  • Test with TAESD

And just note any potential incompatibilities. Then we should be good to merge.

@wbruna (Author) commented Jul 3, 2025

A quick test with SDXL + DMD2 LoRA rendering a 1024x1536 image (not including the VAE phase):

| FA  | backend         | idle VRAM | inference VRAM | inference time |
|-----|-----------------|-----------|----------------|----------------|
| off | Vulkan (radv)   | 6.5G      | 8.2G           | 35.33s         |
| on  | Vulkan (radv)   | 6.5G      | 7.2G           | 28.62s         |
| off | Vulkan (amdvlk) | 6.5G      | 8.2G           | 31.57s         |
| on  | Vulkan (amdvlk) | 6.5G      | 7.2G           | 23.90s         |
| off | ROCm            | 7.0G      | 8.6G           | 31.83s         |
| on  | ROCm            | 7.0G      | 7.7G           | 16.73s         |

The GPU is a 7600 XT; it's one of the few that gets better LLM token generation speed with flash attention on, so this may be something of a best case.

A quirk: I'm getting changed images in some cases when turning on flash attention. Not hugely so: the composition stays the same and the quality seems fine; it's mostly like the variations I get when switching between backends.

@wbruna (Author) commented Jul 4, 2025

I did a few more tests on SDXL + Vulkan: in general, higher resolutions benefit more from flash attention. At 1024x1024 I'm getting around 500M VRAM / 18.0s with flash attention on, versus 900M VRAM / 21.5s with it off; at 512x512 and 512x768, essentially the same ~200M VRAM either way, and slightly faster inference (~4-8%).

SD1.5 doesn't seem to benefit from it, even when rendering an 832x832 image: same VRAM usage and speed.

So... pretty much confirmed that it's working as described in @Green-Sky's --diffusion-fa PR (which I just found 🙂).

@Green-Sky commented Jul 4, 2025

Oh cool. Last time I checked, only CUDA got faster. Must have been the upstream ggml updates.

Also, IIRC, SD1.x does not make use of it (yet). I might be remembering this wrong, though.

@Green-Sky commented:

> A quirk: I'm getting changed images in some cases when turning on flash attention. Not hugely so: the composition stays the same and the quality seems fine; it's mostly like the variations I get when switching between backends.

FA requires F16, so something is potentially cast down (or up?)
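
A minimal standalone illustration of the rounding such a cast introduces (not project code; it only assumes a compiler with the _Float16 extension, e.g. recent GCC or Clang):

```cpp
// Illustration of why an F32 -> F16 cast nudges results: F16 keeps only
// 11 significand bits, so values computed in F32 get rounded (relative error
// up to about 2^-11 ~= 4.9e-4) before a half-precision attention kernel sees
// them. Accumulated over a full attention pass, that is enough to shift the
// sampled image slightly, much like switching backends does.
#include <cstdio>

int main() {
    const float x = 0.1234567f;       // an activation value held in F32
    const _Float16 h = (_Float16)x;   // what an F16-only kernel receives
    const float back = (float)h;      // round-tripped back to F32
    std::printf("f32:  %.9f\nf16:  %.9f\ndiff: %.3e\n", x, back, x - back);
    return 0;
}
```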

@Green-Sky commented Jul 4, 2025

>   • Test multiple resolutions (basically 512x512, 1024x1024, and a non-square one)
>   • Test with/without VAE tiling
>   • Test with TAESD

It is currently applied ONLY to the diffusion model. Yes, some large dimensions don't work, but that's a ggml FA limitation (somewhere past 1024x1024; I don't remember exactly where).

@wbruna (Author) commented Jul 4, 2025

> FA requires F16, so something is potentially cast down (or up?)

Interesting... I did measure the same memory gains on quantized models, so I guess they're indeed being converted to F16 on demand.

BTW, you mentioned that you saw no benefit for the VAE stage. Any chance of that being different now, with the newer backend implementations?

@Green-Sky commented Jul 4, 2025

> BTW, you mentioned that you saw no benefit for the VAE stage. Any chance of that being different now, with the newer backend implementations?

I would have to check what the issue was, but there were a lot of cases that needed padding. Whether or not it makes sense depends very much on the shape of the tensors.

ref: leejet/stable-diffusion.cpp#386

@LostRuins (Owner) left a comment:

lgtm then

@LostRuins merged commit d74c16e into LostRuins:concedo_experimental on Jul 5, 2025
@wbruna deleted the koboldcpp_sd_flash_attn branch on August 22, 2025 at 16:22