SDPA backend priority #9299
Conversation
ComfyUI does not start anymore; this is what I get now on startup (pytorch version: 2.5.1+cu124).
@jurgenprins this API was introduced in PyTorch 2.6; can you please try upgrading Torch?

I noted it was due to Torch 2.5.1, thanks! I am happy for now to accept that it's wrapped in a try/catch with a 'cannot set' message instead of crashing at startup. I am not sure exactly what the benefit of upgrading immediately to make this work would be; perhaps something for the release notes. In time I will try the upgrade, thank you!
This won't force-replace SageAttention 2++/3, right?

It shouldn't - if you encounter any issues, please let me know and I'll look into it. Feel free to tag me in a related issue.
Thanks for doing this. Sounds like it replaces the default attention calculations for CUDA users. But if people already use SageAttention, there's no speed change, right? If so, it's a great improvement for stock usage but not for people who already use a faster attention method. From what I can see:
Have any other CUDA-specific changes been merged recently (after Comfy 3.50), by the way?
If people are using a third-party Flash Attention implementation, there shouldn't be large improvements. There is another pending PR for convolution performance (SDXL/SDx.x) that is CUDA-specific: #9301. There'll be more to come in the future, but it'll take some time.
@contentis Thank you so much for the great explanation, I really appreciate it. I agree that SageAttention has some painful parts that can still be optimized, and I've been curious about SA3's speedups via their new FP4 algorithm. But hearing that SDPA is not that far behind SageAttention (v2/v3) in speed is interesting. I haven't seen people compare those before, but now I found an old discussion from October 2024 where the person said that SageAttention v1 (the only one out at that time; before the public release) was 14% faster than SDPA. Newer SageAttention versions should be even faster than SDPA. I'll try doing some comparisons on Linux when my 5090 arrives. If SDPA offers better quality at a small slowdown, it could be worth it! Thanks a lot for the SDXL/SD1.5 pull request too. That model is still pretty good, so the speedups you've gained there are fantastic. <3
Here is an example using the FLUX SDPA config: as you can see, for the kernel itself we see a speed-up of ~592.513/361.617 = 1.63x. But given that the IO needs to be quantized/dequantized (Q/DQ), there are 5 additional kernels running. They are very fast but add up, making the end-to-end time for SA ~457 µs, so the speed-up comes down to ~592.513/457 = 1.3x.
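(As a rough illustration of how such per-kernel numbers can be collected, here is a sketch using torch.profiler; the FLUX-like tensor shapes are assumptions, and this is not the exact benchmark behind the figures above.)

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from torch.profiler import ProfilerActivity, profile

# FLUX-like shapes (illustrative assumption): 24 heads, head_dim 128,
# a few thousand image+text tokens.
q = torch.randn(1, 24, 4608, 128, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

for backend in (SDPBackend.CUDNN_ATTENTION, SDPBackend.FLASH_ATTENTION):
    with sdpa_kernel(backend):
        for _ in range(5):  # warm-up so one-time setup is excluded
            torch.nn.functional.scaled_dot_product_attention(q, k, v)
        torch.cuda.synchronize()
        with profile(activities=[ProfilerActivity.CUDA]) as prof:
            torch.nn.functional.scaled_dot_product_attention(q, k, v)
            torch.cuda.synchronize()
        print(backend)
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```

Profiling a third-party kernel such as SageAttention the same way would also surface the extra quantize/dequantize launches described above, which is where the end-to-end gap narrows.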
Ahh yes, I see: the time of all the extra kernels adds up and it's not saving that much time overall. That's also a real testament to how good SDPA is already, compared to FlashAttention. For the small time difference, I would happily switch from SA to cuDNN SDPA if it's better quality. But that seems like a hard bar to beat, since it's constantly claimed that SA2 has practically no quality loss compared to FlashAttention. So the way I see it, SDPA is an amazing out-of-the-box choice that really helps people, and they won't have to spend time compiling SageAttention 2/3 (Triton and C compiler). But if someone wants a little more speed (useful for video models), going through the extra work of installing SageAttention 2++ or 3 is worth it. It just won't be as much of a boost anymore, since cuDNN SDPA now exists in Comfy by default. 👍 ❤️ :)


Enable cuDNN attention and set it as the highest-priority backend. The cuDNN SDPA backend performs on par with, and sometimes faster than, the flash-attention backend. More importantly, the flash-attention backend is disabled on Windows, which currently falls back to the much slower mem-efficient backend.
On Windows I've seen a ~2x speed-up for the SDPA kernel on multiple models (FLUX, SDXL, SD3.5 Medium & Large, Qwen).
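A minimal sketch of what this amounts to, assuming the `set_priority` flag that `torch.nn.attention.sdpa_kernel` gained in PyTorch 2.6 (mentioned earlier in this thread); the priority list, function name, and fallback warning below are illustrative rather than the PR's literal diff:

```python
import logging
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

# Consider cuDNN attention first, then flash, then the slower fallbacks.
# On Windows, where the flash backend is unavailable, cuDNN then wins
# instead of the much slower mem-efficient backend.
SDPA_BACKEND_PRIORITY = [
    SDPBackend.CUDNN_ATTENTION,
    SDPBackend.FLASH_ATTENTION,
    SDPBackend.EFFICIENT_ATTENTION,
    SDPBackend.MATH,
]

torch.backends.cuda.enable_cudnn_sdp(True)  # make sure the cuDNN backend is enabled


def attention(q, k, v):
    try:
        # set_priority=True (PyTorch 2.6+) treats the list order as a priority
        # order rather than a plain allow-list.
        with sdpa_kernel(SDPA_BACKEND_PRIORITY, set_priority=True):
            return torch.nn.functional.scaled_dot_product_attention(q, k, v)
    except TypeError:
        # Older PyTorch (e.g. 2.5.1) does not accept set_priority: warn with a
        # "cannot set" style message and fall back to the default selection.
        logging.warning("Cannot set SDPA backend priority, using defaults.")
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)
```

Listing the mem-efficient and math backends last keeps them available as fallbacks for hardware or shapes the faster kernels don't support.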