add default named_modules_to_munmap variable #357
Conversation
Worked for me.
This is mostly what I was concerned about in my comment on the original PR as well:
Could you try something like this instead? We should use
I am also having this problem. The above patch causes the following error for me: When I revert the patch, the error goes away.
Are you saying your workflow fully works without the patch and this causes a regression, or are you saying you have the same error as the report, and now with this you get a different error?
Default this to handle the case of cloning a GGUFModelPatcher which then expects it to exist on the clone. I could never reproduce this condition; however, 4 users have the error. This error path is very sensitive to VRAM conditions and workflow, and I never got the details, so it's a bit of a guess. But from a code point of view, it's the right thing anyway.
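For context, here is a minimal sketch of the kind of change the PR title describes. `PatcherSketch` is a hypothetical stand-in for `GGUFModelPatcher`, and the real default type of `named_modules_to_munmap` may differ; only the attribute name comes from the PR.

```python
# Sketch only: PatcherSketch stands in for GGUFModelPatcher, and the real
# default type of named_modules_to_munmap may not be a set.
class PatcherSketch:
    def __init__(self):
        # Give the attribute a default up front. Before the fix it was only
        # created on certain load paths, so a clone taken before that point
        # could hit an AttributeError on the original.
        self.named_modules_to_munmap = set()

    def clone(self):
        n = PatcherSketch()
        # The clone expects the attribute to exist on the original.
        n.named_modules_to_munmap = self.named_modules_to_munmap
        return n
```

The point is simply that the attribute exists before any clone happens, so the clone path never has to special-case a missing attribute.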
I get the following error now (without the above patch): With the above patch, when the workflow gets to the sampler, I get this: Undoing the patch gets me back up and running, but then I get the GGUFModelPatcher error again. A possible workaround may be to run ComfyUI in --lowvram mode to avoid the GGUFModelPatcher error.
Force-pushed bf13cf4 to 4f5a547
Fully updated to do this. Thanks.
OK, there is another report of this at large here: Comfy-Org/ComfyUI#10662 (comment). I think you are getting further with this change here but are blocked by this second bug behind it. Can you update your comfy core to the latest git, as a fix was merged earlier today? It worked for some people, but the OP in that link still has a repro. If you can still reproduce this second bug, can I get your workflow, log and HW specs, as I am short on reliable reproducers?
Already on the latest. I haven't yet encountered the OP's error with my Wan 2.2 workflow since I set ComfyUI to --lowvram mode, but I still get the CUDA error if I patch nodes.py. I'm on an AMD system with 64GB RAM and a 4090, running Debian.
There are workflows where the same underlying model will be loaded twice by two different ModelPatchers. With the current code as-is, the first patcher will load fine, intercept the pins, and remove the already-done pins from the free-modules list. It will then load again with the second ModelPatcher; however, this one will not get the callbacks for the already-pinned modules and will rip through and .to().to() them all. This destroys the pinning setup while comfy core still keeps the modules registered as pinned. That causes a crash on EINVAL when core goes to unpin the not-pinned tensors. Comfy core now has a guard against unpinning the not-pinned; however, what I think is happening is that the RAM backing the pin is freed in the .to().to(), and malloc then re-allocates it to new tensors while CUDA keeps its records of the previous pinning. EINVAL can happen when you try to unpin a tensor in the middle of such a block, which is definitely possible with this use-after-free pattern.
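To illustrate the guard described above, here is a minimal bookkeeping sketch. It is not comfy core's actual pin-tracking code: the function names and the plain set of pointers are illustrative, and the real CUDA host register/unregister calls are left as comments.

```python
# Sketch of the "don't unpin what you have no record of pinning" guard.
# The actual CUDA host register/unregister calls are omitted on purpose;
# this only shows the bookkeeping that avoids the EINVAL path.
import torch

_pinned_ptrs = set()

def register_pin(t: torch.Tensor) -> None:
    ptr = t.data_ptr()
    # Real code would host-register the tensor's memory here; this sketch
    # only records which pointers we believe are pinned.
    _pinned_ptrs.add(ptr)

def unregister_pin(t: torch.Tensor) -> None:
    ptr = t.data_ptr()
    if ptr not in _pinned_ptrs:
        # Either we never pinned this block, or a .to().to() round trip freed
        # it and malloc handed the address to a new tensor. Unregistering it
        # would fail (EINVAL), so warn and skip instead of crashing.
        print(f"warning: {ptr:#x} has no pin record, skipping unpin")
        return
    _pinned_ptrs.discard(ptr)
    # Real code would host-unregister the memory here.
```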
I have updated this pull with a second change that fixes my reproducer of that CUDA EINVAL from @phaserblast. If you want to give it a test to see if it helps your use case, let me know either way.
Hi everyone, this PR has fixed the error for my test workflow in kijai/ComfyUI-KJNodes#430, but it still gives the same error if I add a LoraLoader node as below: bugged_3.json, and the log is pretty much the same. I also suggest checking whether TorchCompileModel (native) and TorchCompileModelWanVideoV2 (in KJNodes) work with it, since they are other common nodes in WanVideo workflows, but I don't use them right now.
I can't re-test at the moment, but the current changes look good to me. I think we can merge this and maybe do a follow-up PR for the LoRA/torch compile code for any fixes that are needed there, just to get main into a working state for the time being, if that works for you @rattus128.
Hi, I have a reproducer with the workflow, although it's only on a rerun when I change settings. Can I get a repaste of your full error since the changes? Also include the model load statistics, as this issue is sensitive to these numbers (and it helps me adapt from your 4090, which I don't have). Something like:
Yes, I think so. If we need to go again here I'll do another PR. Just this much will help a significant number of users.
Hi @rattus128, thanks again for your help! I updated the main Comfy repo and this custom node to the latest and did another test. (Besides, I actually have a 5070 Ti rather than a 4090.) Here is a more complete log, which errors on the first run:
Looks good. Tested it on both a 4090 and a GB10, no problems. This is the line that fixed it for me:
@nestflow I never got a first-run reproducer or the EINVAL you pasted here. I'm back in no-reproducer purgatory. I did reproduce an OOM that I have since PR'd, but it's not your error case. I have implemented something of a sledgehammer fix to lock down the pinned memory pieces to the core and avoid custom node packs leaking memory or creating use-after-frees. Can you give this branch of comfy core a go? https://github.com/rattus128/ComfyUI/tree/wip/pin-lockdown# If it still fails, paste the log; if it works, can you expand the console in the ComfyUI GUI and paste me the success log too? (I added warning prints for the conditions we don't want to see but have compensated for.) Thanks for your help in tracking these down. Depending on what happens we may want to cut a fresh issue somewhere.
Hi @rattus128, thanks for your continued follow-up on this issue. Yes, this fix seemed to work, and here is the success log:
I am having an issue with a crash at VAE decode. Can someone guide me on how to fix it? I am not able to work on my workflow.
What is your PyTorch version? There is a theory that PyTorch 2.7 has an issue. We have yet to confirm this one, but we could use the data, and it's something you could try.
I had PyTorch 2.7 only.