
Conversation

@contentis
Contributor

Enable cuDNN attention and set it as the highest-priority backend. The cuDNN SDPA backend performs on par with, and sometimes faster than, the flash-attention backend. More importantly, the flash-attention backend is disabled on Windows, so SDPA falls back to the much slower mem-efficient backend there.
On Windows I've seen a ~2x speed-up for the SDPA kernel on multiple models (FLUX, SDXL, SD3.5 Medium & Large, Qwen).
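
For context, here is a minimal sketch (assumed priority list and shapes, not ComfyUI's actual ops.py code) of how the PyTorch 2.6+ backend-priority API can be used:

```python
# Minimal sketch of the PyTorch 2.6+ backend-priority API.
# SDPA_BACKEND_PRIORITY below is an assumed ordering for illustration,
# not necessarily ComfyUI's exact configuration.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

SDPA_BACKEND_PRIORITY = [
    SDPBackend.CUDNN_ATTENTION,     # preferred
    SDPBackend.FLASH_ATTENTION,
    SDPBackend.EFFICIENT_ATTENTION,
    SDPBackend.MATH,                # always-available fallback
]

q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# With set_priority=True the list is an ordered preference: PyTorch picks the
# first listed backend that supports the given inputs.
with sdpa_kernel(SDPA_BACKEND_PRIORITY, set_priority=True):
    out = F.scaled_dot_product_attention(q, k, v)
```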

@comfyanonymous comfyanonymous merged commit 3da5a07 into comfyanonymous:master Aug 13, 2025
6 checks passed
@jurgenprins

ComfyUI does not start anymore; this is what I get now on startup:

pytorch version: 2.5.1+cu124
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3060 : cudaMallocAsync
Traceback (most recent call last):
  File "E:\ComfyUI\main.py", line 147, in <module>
    import execution
  File "E:\ComfyUI\execution.py", line 16, in <module>
    import nodes
  File "E:\ComfyUI\nodes.py", line 24, in <module>
    import comfy.diffusers_load
  File "E:\ComfyUI\comfy\diffusers_load.py", line 3, in <module>
    import comfy.sd
  File "E:\ComfyUI\comfy\sd.py", line 9, in <module>
    from .ldm.models.autoencoder import AutoencoderKL, AutoencodingEngine
  File "E:\ComfyUI\comfy\ldm\models\autoencoder.py", line 11, in <module>
    import comfy.ops
  File "E:\ComfyUI\comfy\ops.py", line 82, in <module>
    class disable_weight_init:
  File "E:\ComfyUI\comfy\ops.py", line 262, in disable_weight_init
    @sdpa_kernel(backends=SDPA_BACKEND_PRIORITY, set_priority=True)
  File "C:\Program Files\Python310\lib\contextlib.py", line 281, in helper
    return _GeneratorContextManager(func, args, kwds)
  File "C:\Program Files\Python310\lib\contextlib.py", line 103, in __init__
    self.gen = func(*args, **kwds)
TypeError: sdpa_kernel() got an unexpected keyword argument 'set_priority'

@contentis
Contributor Author

@jurgenprins this API was introduced with PyTorch 2.6, can you please try upgrading Torch?
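
For anyone hitting this before upgrading, here is a rough sketch (my own illustration, not the exact fix that landed) of guarding the call so older Torch builds print a warning instead of crashing:

```python
# Hypothetical guard, not ComfyUI's actual code: fall back to a no-op context
# when the running PyTorch (< 2.6) doesn't accept set_priority.
import contextlib
from torch.nn.attention import SDPBackend, sdpa_kernel

def sdpa_priority_ctx(backends):
    try:
        # On PyTorch < 2.6 this raises:
        # TypeError: sdpa_kernel() got an unexpected keyword argument 'set_priority'
        return sdpa_kernel(backends, set_priority=True)
    except TypeError:
        print("Warning: cannot set SDPA backend priority, please upgrade PyTorch to >= 2.6")
        return contextlib.nullcontext()

with sdpa_priority_ctx([SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH]):
    pass  # run attention here
```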

@contentis contentis deleted the sdpa_kernel_selection branch August 14, 2025 08:42
zhangp365 pushed a commit to zhangp365/ComfyUI that referenced this pull request Aug 14, 2025
@jurgenprins

@jurgenprins this API was introduced with PyTorch 2.6, can you please try upgrading Torch?

I noted it was due to Torch 2.5.1, thanks!

I am happy for now to accept that it's now in a try/catch with a 'cannot set' message, instead of crashing at startup.

I am not sure exactly what the benefit of immediately upgrading to make this work would be; perhaps something for the release notes.

In time I will try the upgrade, thank you!

@Askelhardd

This won't force replace SageAttention 2++/3, right ?

@contentis
Contributor Author

This won't force replace SageAttention 2++/3, right ?

It shouldn't - if you encounter any issues, please let me know and I'll look into it. Feel free to tag me in a related issue.

Vander-Bilt pushed a commit to Vander-Bilt/ComfyUI that referenced this pull request Aug 26, 2025
@Arcitec

Arcitec commented Aug 31, 2025

Thanks for doing this. Sounds like it replaces the default attention calculations for CUDA users. But if people already use SageAttention, there's no speed change, right?

If so, it's a great improvement for stock usage but not for people who already use a faster attention method.

From what I can see:

  • Windows: Used slow attention by default due to no FlashAttention on that platform. This commit gives them up to 2x speedup via cuDNN SDPA (see the quick check sketch after this list).
  • Linux: Replaces FlashAttention with cuDNN SDPA, which is "on-par or slightly faster". So not much difference?
  • SageAttention: People who manually use SageAttention already have even better performance. No performance gains for those users. For example, "SageAttention 2++" is 3.9x faster than FlashAttention.
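
Regarding the Windows point above, here is a quick hypothetical check of the SDPA backend toggles the local PyTorch build reports. Note these are the global enable flags; whether a backend is actually used also depends on how the wheel was built and on the input dtype/shapes.

```python
# Hypothetical check (recent PyTorch): print the global SDPA backend toggles.
# These flags don't guarantee a backend is compiled in or will be selected.
import torch

print("cudnn enabled:        ", torch.backends.cuda.cudnn_sdp_enabled())
print("flash enabled:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math enabled:         ", torch.backends.cuda.math_sdp_enabled())
```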

Have any other CUDA-specific changes been merged recently (after Comfy 3.50), by the way?

@contentis
Contributor Author

If people are using a 3rd-party Flash Attention implementation, there shouldn't be large improvements.
I've found SageAttention to be a mixed experience in terms of performance, as the Q/DQ takes very long if unfused, ending up delivering similar performance to FP16/BF16. Even when moving to SA3, gains were "surprisingly" small out of the box.

There is another pending PR for convolution performance (SDXL/SDx.x) that is CUDA-specific: #9301

There'll be more to come in the future, but it'll take some time.

@Arcitec

Arcitec commented Sep 1, 2025

@contentis Thank you so much for the great explanation, I really appreciate it.

I agree that SageAttention has some painful parts that can still be optimized, and I've been curious about SA3's speedups via their new FP4 algorithm.

But hearing that SDPA is not that far behind SageAttention (v2/v3) in speed is interesting. I hadn't seen people compare those before, but I found an old discussion from October 2024 where the person said that SageAttention v1 (the only one out at that time, before the public release) was 14% faster than SDPA. Newer SageAttention versions should be even faster than SDPA.

I'll try doing some comparisons on Linux when my 5090 arrives. If SDPA offers better quality at a small slowdown, it could be worth it!

Thanks a lot for the SDXL/SD1.5 pull request too. That model is still pretty good, so the speedups you've gained there are fantastic. <3

@contentis
Contributor Author

Here is an example using the FLUX SDPA config:

cuDNN Attention:
[Screenshot 2025-09-01 at 09:21:33]

SageAttention 2++:
[Screenshot 2025-09-01 at 09:21:56]

As you can see, for the kernel itself we see a speed-up of ~592.513/361.617 = 1.63x. But given that the IO needs to be quantized/dequantized (Q/DQ), there are 5 additional kernels running. They are very fast but add up, making the end-to-end time for SA ~457 us, so the speed-up comes down to ~592.513/457 = 1.3x.
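
If anyone wants to reproduce this kind of end-to-end comparison without a profiler, here is a rough CUDA-event timing sketch (assumed FLUX-like shapes; it times the whole call, so any extra Q/DQ kernels would be included):

```python
# Rough end-to-end timing sketch with assumed FLUX-like shapes. CUDA events
# wrap the whole attention call, so any surrounding Q/DQ kernels are counted.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(1, 24, 4096, 128, device="cuda", dtype=torch.bfloat16)

def time_ms(fn, iters=50):
    for _ in range(5):          # warm-up
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

with sdpa_kernel([SDPBackend.CUDNN_ATTENTION]):
    print("cuDNN SDPA: %.3f ms" % time_ms(lambda: F.scaled_dot_product_attention(q, k, v)))
# A SageAttention run would be timed the same way around its own attention call.
```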

@Arcitec

Arcitec commented Sep 1, 2025

Ahh yes, I see, the time of all the extra kernels adds up and it's not saving that much time overall.

That's also a real testament to how good SDPA is already, compared to FlashAttention.

For the small time difference, I would happily switch from SA to cuDNN SDPA if it's better quality. But that seems like a hard bar to beat, since it's constantly claimed that SA2 has practically no quality loss compared to FlashAttention.

So the way I see it, SDPA is an amazing out-of-the-box choice that really helps people, and they won't have to spend time compiling SageAttention 2/3 (Triton and C compiler). But if someone wants a little more speed (useful for video models), going through the extra work of installing SageAttention 2++ or 3 is worth it. It just won't be as much of a boost anymore, since cuDNN SDPA now exists in Comfy by default. 👍 ❤️ :)

toxicwind pushed a commit to toxicwind/ComfyUI that referenced this pull request Oct 12, 2025
adlerfaulkner pushed a commit to LucaLabsInc/ComfyUI that referenced this pull request Oct 16, 2025