
Conversation

@rattus128
Contributor

Comfy core recently introduced a feature where weights may be pinned when loading, particularly for the offloading case.

Intercept this, and immediately detach each weight before the pinning. This avoids a crash that at least some users are experiencing.

Use a small dict on the modules to keep track of what's already been done, and when the catch-all detacher loop comes through, use this dict (with already-handled modules removed) as the basis for iteration.
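
For readers unfamiliar with the pattern, here is a minimal hedged sketch of the idea, not the PR's actual code; the hook points and names such as `track_weights`, `on_pin`, and `_pending_detach` are hypothetical. The shape of it: detach each weight at pin time, record the work in a small per-module dict, and have the later catch-all loop iterate only over whatever is still pending.

```python
# Illustrative sketch only; the hook points and names are hypothetical.
import torch

def track_weights(module: torch.nn.Module):
    """Record every weight on the module that may eventually need detaching."""
    module._pending_detach = dict(module.named_parameters(recurse=False))

def on_pin(module: torch.nn.Module, name: str) -> torch.Tensor:
    """Intercept a pin request: detach the weight immediately, remove it from
    the bookkeeping dict, then pin the detached tensor instead."""
    weight = module._pending_detach.pop(name)
    detached = weight.detach()
    return detached.pin_memory() if torch.cuda.is_available() else detached

def catch_all_detach(module: torch.nn.Module):
    """The catch-all loop iterates only over what is still pending, so weights
    already handled at pin time are not processed twice."""
    for name in list(getattr(module, "_pending_detach", {})):
        module._pending_detach.pop(name).detach()
```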

@thrnz

thrnz commented Nov 2, 2025

Thanks for this. It seems to have fixed the CUDA OOM I was getting when using pinned_memory with GGUF models.

@jprsyt5

jprsyt5 commented Nov 3, 2025

I can confirm on my setup that it also solved the recent CUDA crash when using GGUF with the --fast argument.

Edit: Anyway, I have a question; maybe it's still related? Using --fast basically enables all performance optimizations like fp16_accum, including pinned_memory, right? Does using pinned_memory increase VRAM usage, or should it only increase RAM usage?

I tried running the same GGUF workflow (WAN I2V with Torch Compile + PyTorch 2.9) using exactly the same settings but on different ComfyUI builds.

First, I tested on 9da397ea, and VRAM usage was around 73% on the first run.

However, on the latest comfy (which already includes pinned_memory), the VRAM usage increased to 98% on the first run, causing a noticeable slowdown.

@rattus128
Contributor Author

> Does using pinned_memory increase VRAM usage, or should it only increase RAM usage? [...] On the latest comfy (which already includes pinned_memory), the VRAM usage increased to 98% on the first run.

This shouldn't have any effect on VRAM. If you have a look at all the optimizations that --fast enables, try manually listing all of them except pinned_memory, then re-run with pinned_memory added to isolate the single variable.

@jprsyt5

jprsyt5 commented Nov 3, 2025

> This shouldn't have any effect on VRAM. If you have a look at all the optimizations that --fast enables, try manually listing all of them except pinned_memory, then re-run with pinned_memory added to isolate the single variable.

Never mind, I'm dumb.

I forgot I had made a local edit: I had commented out @torch.compiler.disable in comfy/ops.py to enable Torch Compile on my old Comfy build (as instructed for the Sage Attention nightly build), but I forgot to comment it out again on the latest comfy. That's why the VRAM usage was higher.

Now, in the latest comfy, performance is indeed faster with pinned_memory (VRAM usage stays similar). On the first run, I usually got ~90s/it, and now it's around 65s/it. 🤯

Everything looks good now! Thanks for this quick PR, hopefully it will be merged soon.

@city96
Owner

city96 commented Nov 3, 2025

Thanks for the PR. I did a quick test and don't see any regression on old versions either. There are a few small nitpick comments that could be made, like using an empty dict instead of None, or defining a default for it next to mmap_released so it always exists, but I doubt anything will hit those edge cases, so I'll just merge it and push an updated version to the comfy registry as well.
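
For context, a hedged sketch of the style that nitpick suggests; apart from `mmap_released`, which is mentioned above, the class and attribute names are made up for illustration. The idea is to give the bookkeeping dict an always-present empty-dict default next to the existing flag instead of lazily creating it from None.

```python
# Hypothetical illustration of the review nitpick; only `mmap_released` is a
# name taken from the comment above, everything else is invented for the sketch.
class LoaderState:
    def __init__(self):
        self.mmap_released = False  # existing default mentioned in the review
        self.detached = {}          # empty-dict default, so it always exists

    def mark_detached(self, name, tensor):
        # No `if self.detached is None: self.detached = {}` guard needed.
        self.detached[name] = tensor
```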

city96 merged commit 100c06c into city96:main on Nov 3, 2025