enable flash attention for image generation #1633
Conversation
Does it work well? Any speedups or memory savings? I haven't tried SD flash attention extensively.
Also, please note any potential incompatibilities; then we should be good to merge.
A quick test with SDXL + a DMD2 LoRA, rendering a 1024x1536 image (not including the VAE phase).
The GPU is a 7600 XT; it's one of the few that gets better LLM token-generation speed with flash attention on, so this may be close to a best case. One quirk: in some cases the image changes when flash attention is turned on. Not by much: the composition stays the same and the quality seems fine; it's mostly like the variation I get when switching between backends.
I did a few more tests on SDXL + Vulkan: in general, higher resolutions benefit more from flash attention. At 1024x1024 I'm getting around 500 MB VRAM / 18.0 s with flash attention, versus 900 MB / 21.5 s without; at 512x512 and 512x768, essentially the same ~200 MB VRAM and slightly faster inference (~4-8%). SD1.5 doesn't seem to benefit, even when rendering an 832x832 image: same VRAM usage and speed. So... pretty much confirmed it's working as @Green-Sky described in the --diffusion-fa PR (which I just found 🙂).
Oh cool. Last time I checked, only CUDA got faster; must have been upstream ggml updates. Also, IIRC SD1.x does not make use of it (yet). Might be remembering this wrong though.
FA requires F16, so something is potentially cast down (or up?).
It is currently ONLY for the diffusion model. Yes, some large dimensions don't work, but that's a ggml FA thing (somewhere past 1024x1024; I don't remember exactly).
Interesting... I did measure the same memory gains on quantized models, so I guess they're indeed being converted to F16 on demand. BTW, you mentioned that you saw no benefit for the VAE stage. Any chance of that being different now, with the newer backend implementations? |
I would have to check what the issue was, but there were a lot of cases that needed padding. Whether or not something makes sense depends very much on the shape of the tensors.
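For a concrete picture of the F16 point above, here is a minimal sketch of how a diffusion attention block can be routed through ggml's fused flash-attention op, with K/V cast to F16 on demand. `ggml_cast`, `ggml_flash_attn_ext`, and `ggml_soft_max_ext` are real ggml calls, but the exact `ggml_flash_attn_ext` signature shown follows a recent ggml revision, and the helper itself is an illustrative assumption rather than stable-diffusion.cpp's actual graph code.

```cpp
// Sketch only: not the actual stable-diffusion.cpp attention code.
#include <math.h>
#include "ggml.h"

// q/k/v are assumed to be laid out as [head_dim, n_tokens, n_heads, batch].
static struct ggml_tensor * attn_block(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,
        struct ggml_tensor  * k,
        struct ggml_tensor  * v,
        bool                  use_flash_attn) {
    const float scale = 1.0f / sqrtf((float) q->ne[0]);

    if (use_flash_attn) {
        // The fused kernel wants F16 K/V, so quantized or F32 tensors are
        // cast on demand; the full [n_tokens, n_tokens] score matrix is
        // never materialized, which is where the VRAM savings come from.
        k = ggml_cast(ctx, k, GGML_TYPE_F16);
        v = ggml_cast(ctx, v, GGML_TYPE_F16);
        return ggml_flash_attn_ext(ctx, q, k, v, /*mask*/ NULL, scale,
                                   /*max_bias*/ 0.0f, /*logit_softcap*/ 0.0f);
    }

    // Fallback: naive attention, materializes the full score matrix.
    // (Output layout differs from the fused path; callers permute/reshape.)
    struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
    kq = ggml_soft_max_ext(ctx, kq, /*mask*/ NULL, scale, /*max_bias*/ 0.0f);
    struct ggml_tensor * v_t = ggml_cont(ctx, ggml_transpose(ctx, v));
    return ggml_mul_mat(ctx, v_t, kq);
}
```

The fused op also has padding/alignment expectations on some of its inputs, which is the kind of shape constraint referred to above.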
LostRuins left a comment
lgtm then
This simply applies the existing config to the image generation parameters.
Tested on SD1.5, SDXL and Flux, both on ROCm and Vulkan.
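For reference, a minimal sketch of what "applies the existing config to the image generation parameters" amounts to, under assumed names (`loader_config`, `sd_gen_params`, `use_flashattention`, and `diffusion_flash_attn` are all illustrative, not the project's actual identifiers):

```cpp
// Sketch only: forwarding an already-configured flash-attention flag to the
// image-generation setup. All names here are hypothetical.
#include <cstdio>

struct loader_config {
    bool use_flashattention = false;  // existing setting already exposed to the user
};

struct sd_gen_params {
    bool diffusion_flash_attn = false; // analogous to stable-diffusion.cpp's --diffusion-fa
    int  width  = 1024;
    int  height = 1024;
};

static sd_gen_params make_sd_params(const loader_config & cfg) {
    sd_gen_params p;
    // The change amounts to plumbing the existing flag through to the
    // diffusion model's parameters instead of adding a new option.
    p.diffusion_flash_attn = cfg.use_flashattention;
    return p;
}

int main() {
    loader_config cfg;
    cfg.use_flashattention = true;
    const sd_gen_params p = make_sd_params(cfg);
    std::printf("diffusion flash attention: %s\n",
                p.diffusion_flash_attn ? "on" : "off");
    return 0;
}
```

The point is that no new setting is introduced; the flag the user already toggles is forwarded to the diffusion model, mirroring stable-diffusion.cpp's --diffusion-fa option.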