Conversation

@contentis
Contributor

This tests the top 10 algorithms returned by the cuDNN heuristic and selects the fastest one. On a 5090 I'm seeing:

  • 1.41x on SDXL
  • 1.32x on SD15
  • 1.21x on VAE Decoder (SDXL)

This does add a small overhead during the first inference - in my case this was ~200ms.
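For reference, a minimal sketch of the PyTorch knobs this builds on, assuming the standard torch.backends.cudnn interface rather than this PR's exact code:

```python
import torch

# Enable cuDNN benchmark mode: on the first convolution with a given input
# shape, cuDNN profiles candidate algorithms and caches the fastest one.
torch.backends.cudnn.benchmark = True

# Limit the search to the top-N algorithms suggested by the cuDNN heuristic.
# PyTorch's default for this limit is 10; 0 would mean "try every algorithm".
torch.backends.cudnn.benchmark_limit = 10
```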

@comfyanonymous
Owner

comfyanonymous commented Aug 12, 2025

The overhead in my case during first inference on a Blackwell 6000 is ~15 seconds for a 1280x1680 SDXL workflow. Will have to decide if this should be enabled by default or put behind --fast.

@contentis
Contributor Author

contentis commented Aug 13, 2025

Was this on Windows or Linux? I retested on an RTX 6000 (SDXL @ 1280x1680, Ubuntu 24.04) and saw the following:

SDXL

  • With AutoTuning
    • First Run: 6690.49ms
    • Second Run: 3168.08ms
  • Without AutoTuning
    • First Run: 5524.88ms
    • Second Run: 4161.28ms

One thing I noticed is that the allocator backend has a large influence on this - I was using the native allocator for all my testing. When using cuda-malloc, the first run takes ~12006.10ms.
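For context, cuda-malloc switches PyTorch to the asynchronous CUDA allocator. A minimal sketch of the equivalent toggle, assuming PyTorch's PYTORCH_CUDA_ALLOC_CONF mechanism (it has to be set before CUDA is initialized):

```python
import os

# Select the cudaMallocAsync allocator backend; PyTorch reads this setting
# when CUDA initializes, so it must be set before the first CUDA call.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

import torch  # the allocator choice is picked up at CUDA init time
```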

@contentis
Contributor Author

I added an autotune option to the --fast argument (--fast autotune) to make this feature opt-in for now.
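A rough sketch of what opt-in gating like this can look like; the argument handling below is illustrative, not the PR's actual code (ComfyUI's real CLI parsing lives in comfy/cli_args.py):

```python
import argparse

import torch

# Hypothetical stand-in for ComfyUI's argument parser.
parser = argparse.ArgumentParser()
parser.add_argument("--fast", nargs="*", default=[])
args = parser.parse_args()

if "autotune" in args.fast:
    # Opt-in: only profile convolution algorithms when explicitly requested.
    torch.backends.cudnn.benchmark = True
```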

@comfyanonymous
Owner

Can you rebase? There's something weird with your branch.

@contentis contentis mentioned this pull request Sep 1, 2025
@Arcitec

Arcitec commented Sep 1, 2025

Sounds like a very good improvement. If it only adds ~5-10 seconds to the first inference, that's not worth worrying about and it should be on by default, since it saves so much time overall. People will make up that time after 2 images.

But it seems like it adds only about a second to the first run if a proper allocation backend is used. I guess the reason for the longer delays is that cuda-malloc has some overhead that really adds up with repeated tests?

PS: The pull request shows a ton of changes. Needs a rebase. :')

@contentis
Contributor Author

It was already rebased; GitHub had issues updating the PR. I reset the target branch, so it should now show up correctly.

@Arcitec

Arcitec commented Sep 1, 2025

Thank you, I can finally see the diff now. :) Oh, so cuDNN benchmarking is a built-in PyTorch feature. That's awesome. I'll try to make time to test this on my 3090 on Linux (the 5090 arrives this week, so I can try both), though I'm working on a lot of projects and my Comfy install is currently outdated.
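For anyone curious, here's a sketch of how the built-in feature behaves, assuming standard PyTorch semantics: the first forward pass at each new input shape pays the tuning cost, and later passes at that shape reuse the cached algorithm, which is why only the first inference per resolution slows down.

```python
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True

conv = nn.Conv2d(4, 4, kernel_size=3, padding=1).cuda()
x = torch.randn(1, 4, 1024, 1024, device="cuda")

conv(x)  # slow: cuDNN profiles algorithms for this input shape
conv(x)  # fast: the cached algorithm is reused

y = torch.randn(1, 4, 512, 512, device="cuda")
conv(y)  # slow again: a new shape triggers a fresh search
```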

Edit: I didn't have time to test it before merge, oops. I was super busy. :')

@comfyanonymous comfyanonymous merged commit e2d1e5d into comfyanonymous:master Sep 2, 2025
7 checks passed
@FeepingCreature
Contributor

The overhead is a lot higher on ROCm, at least on my 7900 XTX, taking over a minute to tune a basic SDXL kernel. So this should probably be turned off on AMD unless explicitly requested.
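A hedged sketch of vendor gating along those lines (illustrative, not this PR's code). As far as I know, on ROCm builds the torch.backends.cudnn flags are routed to MIOpen, whose benchmark search is much more expensive:

```python
import torch

def enable_conv_autotune_if_sensible() -> None:
    # torch.version.hip is None on CUDA builds and a version string on ROCm.
    if torch.cuda.is_available() and torch.version.hip is None:
        torch.backends.cudnn.benchmark = True
```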

Thor-ATX pushed a commit to asteriafilmco/ComfyUI that referenced this pull request Sep 3, 2025
@Windecay

If I use autotune, task startup is very slow. On an RTX 5090: ~10s to prepare, 3s to sample.

toxicwind pushed a commit to toxicwind/ComfyUI that referenced this pull request Oct 12, 2025
adlerfaulkner pushed a commit to LucaLabsInc/ComfyUI that referenced this pull request Oct 16, 2025