Enable Convolution AutoTuning #9301
Conversation

The overhead in my case during the first inference on a Blackwell RTX 6000 is ~15 seconds for a 1280x1680 SDXL workflow. Will have to decide whether this should be enabled by default or put behind --fast.

Was this on Windows or Linux? I retested on an RTX 6000 (SDXL @ 1280x1680, Ubuntu 24.04) and saw the following SDXL timings:
One thing I noticed is that the allocation backend has a large influence on this - I was using the native backend for all my testing. When using cuda-malloc the first run takes ~12006.10 ms.

I added an option to the

can you rebase? there's something weird with your branch.

Sounds like a very good improvement. If it adds up to ~5-10 seconds on the first inference, that's not worth worrying about and it should be on by default, since it saves so much time in general. People will make up that time after 2 images. But it seems like it adds less than a second to the first run if a proper allocation backend is used. I guess the reason for the longer delays is that cuda-malloc has some overhead that really adds up with repeated tests?
PS: The pull request shows a ton of changes. Needs a rebase. :')

It was already rebased; GitHub had issues updating the PR. I reset the target branch, so it should now show up correctly.

Thank you, I can finally see the diff now. :) Oh, so cuDNN benchmarking is a built-in PyTorch feature. That's awesome. I'll try to make time to test this on my 3090 on Linux (the 5090 arrives this week so I can try both). I'm working on a lot of projects and my Comfy is currently outdated. I'll try to make time for it!
Edit: I didn't have time to test it before merge, oops. I was super busy. :')
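
For readers following along, the built-in PyTorch feature mentioned here is presumably the cuDNN benchmark switch; a minimal sketch, assuming that is the mechanism the PR enables:

```python
import torch

# When this flag is set, cuDNN benchmarks the available convolution algorithms
# for each new input shape it sees and caches the fastest one for later calls.
torch.backends.cudnn.benchmark = True
```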

The overhead is a lot higher on ROCm, at least on my 7900 XTX, taking over a minute to tune a basic SDXL kernel. So this should probably be turned off on AMD unless explicitly requested.
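
A minimal sketch of what such a platform gate could look like, assuming ROCm is detected via torch.version.hip; the helper name is hypothetical and this is not the PR's actual code:

```python
import torch

def enable_conv_autotuning(force: bool = False):
    # Hypothetical helper, not the PR's implementation.
    # torch.version.hip is a version string on ROCm builds and None on CUDA builds.
    is_rocm = torch.version.hip is not None
    # Skip autotuning on ROCm unless explicitly requested, since the reported
    # first-run tuning cost there (over a minute for SDXL) outweighs the benefit.
    if torch.cuda.is_available() and (force or not is_rocm):
        torch.backends.cudnn.benchmark = True
```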

…e-update
* commit '4f5812b93712e0f52ae8fe80a89e8b5e7d0fa309': (77 commits)
  Update template to 0.1.73 (comfyanonymous#9686)
  ImageScaleToMaxDimension node. (comfyanonymous#9689)
  Accept prompt_id in interrupt handler (comfyanonymous#9607)
  uso -> uxo/uno as requested. (comfyanonymous#9688)
  USO style reference. (comfyanonymous#9677)
  Enable Convolution AutoTuning (comfyanonymous#9301)
  Implement the USO subject identity lora. (comfyanonymous#9674)
  Probably not necessary anymore. (comfyanonymous#9646)
  SEEDS: update noise decomposition and refactor (comfyanonymous#9633)
  convert Primitive nodes to V3 schema (comfyanonymous#9372)
  convert nodes_stability.py to V3 schema (comfyanonymous#9497)
  convert Video nodes to V3 schema (comfyanonymous#9489)
  convert Stable Cascade nodes to V3 schema (comfyanonymous#9373)
  ComfyUI version 0.3.56
  Lower ram usage on windows. (comfyanonymous#9628)
  ComfyUI v0.3.55
  Update template to 0.1.70 (comfyanonymous#9620)
  Trim audio to video when saving video. (comfyanonymous#9617)
  Support the 5B fun inpaint model. (comfyanonymous#9614)
  Support wan2.2 5B fun control model. (comfyanonymous#9611)
  ...

With autotune enabled, the task starts very slowly: on an RTX 5090, preparation takes ~10 s and sampling ~3 s.

This will test the top-10 algorithms returned by the cuDNN heuristic and select the fastest. On a 5090 I'm seeing:
This does add a small overhead during the first inference - in my case this was ~200 ms.
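
For reference, the behaviour described above maps onto two standard PyTorch knobs; a minimal sketch, assuming these are the settings the PR toggles:

```python
import torch

# Enable cuDNN convolution autotuning: benchmark candidate algorithms on the
# first call for each new input shape and cache the fastest one.
torch.backends.cudnn.benchmark = True

# Limit the search to the ten algorithms ranked best by the cuDNN heuristic
# (a value of 0 would try every available algorithm).
torch.backends.cudnn.benchmark_limit = 10
```

Only the first run with a given input shape pays the benchmarking cost (the ~200 ms mentioned above); subsequent runs with the same shape reuse the cached choice.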