Conversation

@rattus128
Contributor

Default this to handle the case of cloning a GGUFModelPatcher, which then expects it to exist on the clone.

I could never reproduce this condition; however, 4 users have hit the error. This error path is very sensitive to VRAM conditions and workflow, and I never got the details, so it's a bit of a guess.

But from a code point of view, it's the right thing to do anyway.
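A minimal sketch of the idea, using the attribute names from this thread (the real clone logic in nodes.py may differ):

import comfy.model_patcher  # available when running inside ComfyUI

# Sketch only: give the clone safe defaults so it never hits AttributeError,
# even when the source patcher was created before these fields existed.
class GGUFModelPatcher(comfy.model_patcher.ModelPatcher):
    def clone(self, *args, **kwargs):
        n = super().clone(*args, **kwargs)
        n.mmap_released = getattr(self, "mmap_released", False)
        n.named_modules_to_munmap = getattr(self, "named_modules_to_munmap", {})
        return n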

@RYG81

RYG81 commented Nov 6, 2025

worked for me

@city96
Owner

city96 commented Nov 6, 2025

This is mostly what I was concerned about in my comment on the original PR as well:

defining a default for it next to mmap_released so it always exists

Could you try something like this instead? We should use mmap_released as the control logic and keep named_modules_to_munmap as a dictionary to avoid other edge cases by separating the two. I think that should work as well:

[screenshot of the suggested patch]

@phaserblast

phaserblast commented Nov 7, 2025

Could you try something like this instead? We should use mmap_released as the control logic and keep named_modules_to_munmap as a dictionary to avoid other edge cases by separating the two. I think that should work as well:

I am also having this problem. The above patch causes the following error for me:

torch.AcceleratorError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

When I revert the patch, the error goes away.

@rattus128
Contributor Author

Could you try something like this instead? We should use mmap_released as the control logic and keep named_modules_to_munmap as a dictionary to avoid other edge cases by separating the two. I think that should work as well:

I am also having this problem. The above patch causes the following error for me:

torch.AcceleratorError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

When I revert the patch, the error goes away.

Are you saying your workflow fully works without the patch and this causes a regression, or are you saying you have the same error as the report, and now with this you get a different error?

Default this to handle the case of cloning a GGUFModelPatcher which then
expects it to exist on the clone.

I could never reproduce this condition; however, 4 users have hit the error.
This error path is very sensitive to VRAM conditions and workflow, and I
never got the details, so it's a bit of a guess.

But from a code point of view, it's the right thing to do anyway.
@phaserblast

phaserblast commented Nov 7, 2025

Are you saying your workflow fully works without the patch and this causes a regression, or are you saying you have the same error as the report, and now with this you get a different error?

I get the following error now (without the above patch):
CLIPTextEncode 'GGUFModelPatcher' object has no attribute 'named_modules_to_munmap'

With the above patch, when the workflow gets to the sampler, I get this:

torch.AcceleratorError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Undoing the patch gets me back up-and-running, but then I get the GGUFModelPatcher error again.

A possible workaround may be to run ComfyUI in --lowvram mode to avoid the GGUFModelPatcher error.
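For reference, that workaround is a launch flag on the stock ComfyUI entrypoint, e.g.:

python main.py --lowvram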

@rattus128
Contributor Author

This is mostly what I was concerned about in my comment on the original PR as well:

defining a default for it next to mmap_released so it always exists

Could you try something like this instead? We should use mmap_released as the control logic and keep named_modules_to_munmap as a dictionary to avoid other edge cases by separating the two. I think that should work as well:
[screenshot of the suggested patch]

Fully updated to do this. Thanks

@rattus128
Contributor Author

Are you saying your workflow fully works without the patch and this causes a regression, or are you saying you have the same error as the report, and now with this you get a different error?

I get the following error now (without the above patch): CLIPTextEncode 'GGUFModelPatcher' object has no attribute 'named_modules_to_munmap'

With the above patch, when the workflow gets to the sampler, I get this:

torch.AcceleratorError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Undoing the patch gets me back up-and-running, but then I get the GGUFModelPatcher error again.

A possible workaround may be to run ComfyUI in --lowvram mode to avoid the GGUFModelPatcher error.

OK, there is another report of this at large here:

Comfy-Org/ComfyUI#10662 (comment)

I think you are getting further with this change here but are blocked by this second bug behind it.

Can you update your comfy core to the latest git, as a fix was merged earlier today? It worked for some people, but the OP in that link still has a repro. If you can still reproduce this second bug, can I get your workflow, log, and HW specs, as I am short on reliable reproducers?

@phaserblast

phaserblast commented Nov 7, 2025

Can you update your comfy core to the latest git

git rev-parse HEAD
cf97b033ee80cf245b4592d42f89e6de67e409a4

Already on the latest. I haven't yet encountered the OP's error with my Wan 2.2 workflow since I set ComfyUI to --lowvram mode, but I still get the CUDA error if I patch nodes.py.

I'm on an AMD system, 64GB RAM + 4090 running Debian.

@rattus128 rattus128 marked this pull request as draft November 7, 2025 10:59
There are workflows where the same underlying model will be loaded
twice by two different ModelPatchers.

With the current code as-is, the first patcher will load fine and
intercept the pins while removing already-done pins from the free
modules list.

It will then load again with the second MP; however, this one will
not get the callbacks for the already-pinned stuff and will then rip
through and .to().to() them all. This destroys the pinning setup
while comfy core still keeps the modules registered as pinned.
This causes a crash on EINVAL when core goes to unpin the not-pinned
tensors.

Comfy core now has a guard against unpinning the not-pinned; however,
what I think is happening is that the RAM of the pin is freed in the
.to().to() and then malloc re-allocates it to new tensors while CUDA
keeps its records of the previous pinning. EINVAL can happen when
you try to unpin a tensor in the middle of the block, which is
definitely possible with this use-after-free pattern.
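A toy illustration of the shared-module hazard described above (hypothetical code, not ComfyUI's): two patcher-like wrappers hold the same nn.Module, so reallocating a parameter's storage through one (analogous to the second load's .to().to()) invalidates whatever the other was tracking about the old storage.

import torch
import torch.nn as nn

model = nn.Linear(2, 2)
patcher_a = {"model": model}  # stand-ins for two ModelPatchers
patcher_b = {"model": model}  # same module object, not a copy

old_ptr = patcher_b["model"].weight.data_ptr()
# Swap in fresh storage via one wrapper; the other wrapper's records about the
# old pointer (e.g. "this range is pinned") are now stale.
patcher_a["model"].weight.data = torch.empty_like(model.weight)
assert patcher_b["model"].weight.data_ptr() != old_ptr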
@rattus128 rattus128 marked this pull request as ready for review November 7, 2025 14:09
@rattus128
Contributor Author

I have updated this pull with a second change that fixes my reproducer of that CUDA EINVAL from

kijai/ComfyUI-KJNodes#430

@phaserblast if you want to give it a test to see if it helps your use case let me know either way.

@nestflow

nestflow commented Nov 7, 2025

Hi everyone here, this PR has fixed the error for my test workflow in kijai/ComfyUI-KJNodes#430, but it still gives the same error if I add a LoraLoader node as below: bugged_3.json

[workflow screenshot: bug3]

and the log is pretty much the same. I also suggest checking whether TorchCompileModel (native) and TorchCompileModelWanVideoV2 (in KJNodes) work with it, since those are other common nodes in WanVideo workflows, but I don't use them right now.

@city96
Owner

city96 commented Nov 7, 2025

I can't re-test at the moment, but the current changes look good to me. I think we can merge this and maybe do a follow-up PR for the LoRA/torch compile code for any fixes that are needed there, just to get main into a working state for the time being if that works for you @rattus128

@rattus128
Contributor Author

Hi everyone here, this PR has fixed the error for my test workflow in kijai/ComfyUI-KJNodes#430, but it still gives the same error if I add a LoraLoader node as below: bugged_3.json
[workflow screenshot: bug3]

and the log is pretty much the same. I also suggest checking whether TorchCompileModel (native) and TorchCompileModelWanVideoV2 (in KJNodes) work with it, since those are other common nodes in WanVideo workflows, but I don't use them right now.

Hi, I have a reproducer with the workflow, although it's only on a rerun when I change settings.

Can I get a repaste of your full error since the changes? Also include the model load statistics, as this issue is sensitive to these numbers (and it helps me adapt from your 4090, which I don't have).

Something like:

got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Using scaled fp8: fp8 matrix mult: False, scale input: False
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load WanTEModel
loaded completely; 30235.11 MB usable, 6419.48 MB loaded, full load: True
Requested to load WanVAE
loaded completely; 13917.84 MB usable, 242.03 MB loaded, full load: True
gguf qtypes: F16 (694), Q8_0 (400), F32 (1)
model weight dtype torch.float16, manual cast: None
model_type FLOW
Requested to load WAN21
loaded partially; 347.74 MB usable, 347.73 MB loaded, 14477.73 MB offloaded, lowvram patches: 0
100%|██████████| 1/1 [00:26<00:00, 26.43s/it]
Requested to load WAN21
0 models unloaded.
loaded partially; 347.73 MB usable, 347.73 MB loaded, 14477.73 MB offloaded, lowvram patches: 0
100%|██████████| 1/1 [00:26<00:00, 26.95s/it]
Requested to load WanVAE
loaded completely; 18185.77 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 94.00 seconds
got prompt
Requested to load WAN21
loaded completely; 16902.20 MB usable, 14825.47 MB loaded, full load: True
100%|██████████| 1/1 [00:03<00:00,  3.41s/it]
Requested to load WAN21
loaded completely; 17061.20 MB usable, 14825.47 MB loaded, full load: True
100%|██████████| 1/1 [00:06<00:00,  6.95s/it]
Requested to load WanVAE
loaded completely; 3864.66 MB usable, 242.03 MB loaded, full load: True
got prompt
got prompt
Prompt executed in 35.49 seconds
Prompt executed in 0.00 seconds
Prompt executed in 0.00 seconds
got prompt
got prompt
Requested to load WAN21
0 models unloaded.
loaded completely; 14827.84 MB usable, 14825.47 MB loaded, full load: True
  0%|          | 0/1 [00:00<?, ?it/s]
!!! Exception during processing !!! Allocation on device 
Traceback (most recent call last):
  File "/home/rattus/ComfyUI/execution.py", line 510, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/execution.py", line 324, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/execution.py", line 298, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "/home/rattus/ComfyUI/execution.py", line 286, in process_inputs
    result = f(**inputs)
             ^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy_extras/nodes_custom_sampler.py", line 658, in sample
    samples = comfy.sample.sample_custom(model, noise, cfg, sampler, sigmas, positive, negative, latent_image, noise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise_seed)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/sample.py", line 65, in sample_custom
    samples = comfy.samplers.sample(model, noise, positive, negative, cfg, model.load_device, sampler, sigmas, model_options=model.model_options, latent_image=latent_image, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 1053, in sample
    return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 1035, in sample
    output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 997, in outer_sample
    output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 980, in inner_sample
    samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 752, in sample
    samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/k_diffusion/sampling.py", line 199, in sample_euler
    denoised = model(x, sigma_hat * s_in, **extra_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 401, in __call__
    out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 953, in __call__
    return self.outer_predict_noise(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 960, in outer_predict_noise
    ).execute(x, timestep, model_options, seed)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 963, in predict_noise
    return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 381, in sampling_function
    out = calc_cond_batch(model, conds, x, timestep, model_options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 206, in calc_cond_batch
    return _calc_cond_batch_outer(model, conds, x_in, timestep, model_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 214, in _calc_cond_batch_outer
    return executor.execute(model, conds, x_in, timestep, model_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 326, in _calc_cond_batch
    output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/model_base.py", line 161, in apply_model
    return comfy.patcher_extension.WrapperExecutor.new_class_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/model_base.py", line 203, in _apply_model
    model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/ldm/wan/model.py", line 627, in forward
    return comfy.patcher_extension.WrapperExecutor.new_class_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/ldm/wan/model.py", line 647, in _forward
    return self.forward_orig(x, timestep, context, clip_fea=clip_fea, freqs=freqs, transformer_options=transformer_options, **kwargs)[:, :, :t, :h, :w]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/ldm/wan/model.py", line 580, in forward_orig
    x = block(x, e=e0, freqs=freqs, context=context, context_img_len=context_img_len, transformer_options=transformer_options)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/ldm/wan/model.py", line 245, in forward
    y = self.ffn(torch.addcmul(repeat_e(e[3], x), self.norm2(x), 1 + repeat_e(e[4], x)))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/container.py", line 250, in forward
    input = module(input)
            ^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/activation.py", line 816, in forward
    return F.gelu(input, approximate=self.approximate)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: Allocation on device 

Got an OOM, unloading all loaded models.
Prompt executed in 22.44 seconds
Requested to load WAN21
loaded partially; 3653.83 MB usable, 3650.10 MB loaded, 11175.37 MB offloaded, lowvram patches: 0
!!! Exception during processing !!! CUDA error: part or all of the requested memory range is already mapped
Search for `cudaErrorHostMemoryAlreadyRegistered' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/home/rattus/ComfyUI/execution.py", line 510, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/execution.py", line 324, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/execution.py", line 298, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "/home/rattus/ComfyUI/execution.py", line 286, in process_inputs
    result = f(**inputs)
             ^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy_extras/nodes_custom_sampler.py", line 658, in sample
    samples = comfy.sample.sample_custom(model, noise, cfg, sampler, sigmas, positive, negative, latent_image, noise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise_seed)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/sample.py", line 65, in sample_custom
    samples = comfy.samplers.sample(model, noise, positive, negative, cfg, model.load_device, sampler, sigmas, model_options=model.model_options, latent_image=latent_image, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 1053, in sample
    return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 1035, in sample
    output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/samplers.py", line 990, in outer_sample
    noise = noise.to(device)
            ^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: part or all of the requested memory range is already mapped
Search for `cudaErrorHostMemoryAlreadyRegistered' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@rattus128
Contributor Author

I can't re-test at the moment, but the current changes look good to me. I think we can merge this and maybe do a follow-up PR for the LoRA/torch compile code for any fixes that are needed there, just to get main into a working state for the time being if that works for you @rattus128

Yes, I think so. If we need to go again here I'll do another PR. Just this much will help a significant number of users.

@city96 city96 merged commit 02dac86 into city96:main Nov 7, 2025
@nestflow

nestflow commented Nov 8, 2025

Hi @rattus128, thanks again for your help! I updated the main Comfy repo and this custom node to the latest and did another test. (Besides, I actually have a 5070 Ti rather than a 4090.) Here is a more complete log, which errors on the first run:

got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load WanTEModel
loaded completely; 13401.80 MB usable, 10835.48 MB loaded, full load: True
Requested to load WanVAE
loaded completely; 327.00 MB usable, 242.03 MB loaded, full load: True
gguf qtypes: F16 (694), Q8_0 (400), F32 (1)
model weight dtype torch.float16, manual cast: None
model_type FLOW
Requested to load WAN21
loaded partially; 218.89 MB usable, 215.06 MB loaded, 14610.41 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [02:30<00:00, 150.49s/it]
Requested to load WAN21
!!! Exception during processing !!! CUDA error: invalid argument
Search for 'cudaErrorInvalidValue' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "D:\Projects\ComfyUI\execution.py", line 510, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Projects\ComfyUI\execution.py", line 324, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Projects\ComfyUI\execution.py", line 298, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "D:\Projects\ComfyUI\execution.py", line 286, in process_inputs
    result = f(**inputs)
  File "D:\Projects\ComfyUI\comfy_extras\nodes_custom_sampler.py", line 658, in sample
    samples = comfy.sample.sample_custom(model, noise, cfg, sampler, sigmas, positive, negative, latent_image, noise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise_seed)
  File "D:\Projects\ComfyUI\comfy\sample.py", line 65, in sample_custom
    samples = comfy.samplers.sample(model, noise, positive, negative, cfg, model.load_device, sampler, sigmas, model_options=model.model_options, latent_image=latent_image, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
  File "D:\Projects\ComfyUI\comfy\samplers.py", line 1053, in sample
    return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Projects\ComfyUI\comfy\samplers.py", line 1035, in sample
    output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
  File "D:\Projects\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\Projects\ComfyUI\comfy\samplers.py", line 984, in outer_sample
    self.inner_model, self.conds, self.loaded_models = comfy.sampler_helpers.prepare_sampling(self.model_patcher, noise.shape, self.conds, self.model_options)
                                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Projects\ComfyUI\comfy\sampler_helpers.py", line 130, in prepare_sampling
    return executor.execute(model, noise_shape, conds, model_options=model_options)
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Projects\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\Projects\ComfyUI\comfy\sampler_helpers.py", line 138, in _prepare_sampling
    comfy.model_management.load_models_gpu([model] + models, memory_required=memory_required + inference_memory, minimum_memory_required=minimum_memory_required + inference_memory)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Projects\ComfyUI\comfy\model_management.py", line 697, in load_models_gpu
    loaded_model.model_load(lowvram_model_memory, force_patch_weights=force_patch_weights)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Projects\ComfyUI\comfy\model_management.py", line 506, in model_load
    self.model_use_more_vram(use_more_vram, force_patch_weights=force_patch_weights)
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Projects\ComfyUI\comfy\model_management.py", line 535, in model_use_more_vram
    return self.model.partially_load(self.device, extra_memory, force_patch_weights=force_patch_weights)
           ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Projects\ComfyUI\comfy\model_patcher.py", line 919, in partially_load
    self.unpatch_model(self.offload_device, unpatch_weights=unpatch_weights)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Projects\ComfyUI\custom_nodes\ComfyUI-GGUF\nodes.py", line 77, in unpatch_model
    return super().unpatch_model(device_to=device_to, unpatch_weights=unpatch_weights)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Projects\ComfyUI\comfy\model_patcher.py", line 832, in unpatch_model
    self.model.to(device_to)
    ~~~~~~~~~~~~~^^^^^^^^^^^
  File "D:\Projects\ComfyUI\.venv\Lib\site-packages\torch\nn\modules\module.py", line 1371, in to
    return self._apply(convert)
           ~~~~~~~~~~~^^^^^^^^^
  File "D:\Projects\ComfyUI\.venv\Lib\site-packages\torch\nn\modules\module.py", line 930, in _apply
    module._apply(fn)
    ~~~~~~~~~~~~~^^^^
  File "D:\Projects\ComfyUI\.venv\Lib\site-packages\torch\nn\modules\module.py", line 930, in _apply
    module._apply(fn)
    ~~~~~~~~~~~~~^^^^
  File "D:\Projects\ComfyUI\.venv\Lib\site-packages\torch\nn\modules\module.py", line 930, in _apply
    module._apply(fn)
    ~~~~~~~~~~~~~^^^^
  File "D:\Projects\ComfyUI\.venv\Lib\site-packages\torch\nn\modules\module.py", line 957, in _apply
    param_applied = fn(param)
  File "D:\Projects\ComfyUI\.venv\Lib\site-packages\torch\nn\modules\module.py", line 1357, in convert
    return t.to(
           ~~~~^
        device,
        ^^^^^^^
        dtype if t.is_floating_point() or t.is_complex() else None,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        non_blocking,
        ^^^^^^^^^^^^^
    )
    ^
  File "D:\Projects\ComfyUI\custom_nodes\ComfyUI-GGUF\ops.py", line 58, in to
    new = super().to(*args, **kwargs)
  File "D:\Projects\ComfyUI\.venv\Lib\site-packages\torch\_tensor.py", line 1654, in __torch_function__
    ret = func(*args, **kwargs)
torch.AcceleratorError: CUDA error: invalid argument
Search for `cudaErrorInvalidValue' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Prompt executed in 194.29 seconds

@phaserblast

@phaserblast if you want to give it a test to see if it helps your use case let me know either way.

Looks good. Tested it on both a 4090 and a GB10, no problems. This is the line that fixed it for me:

n.mmap_released = getattr(self, "mmap_released", False)

@rattus128
Contributor Author

Hi @rattus128, thanks again for your help! I updated the main Comfy repo and this custom node to the latest and did another test. (Besides, I actually have a 5070 Ti rather than a 4090.)

@nestflow I never got a first-run reproducer or the EINVAL you pasted here. I'm back in no-reproducer purgatory.

I did reproduce an OOM that I have since PR'd, but it's not your error case.

I have implemented something of a sledgehammer fix to lock down the pinned memory pieces to the core and avoid custom node packs leaking memory or creating use-after-frees.

Can you give this branch of comfy core a go?

https://github.com/rattus128/ComfyUI/tree/wip/pin-lockdown

If it still fails, paste the log; if it works, can you expand the console in the ComfyUI GUI and paste me the success log too? (I added warning prints for the conditions we don't want to see but have compensated for.)

Thanks for your help in tracking these down.

Depending on what happens we may want to cut a fresh issue somewhere.
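For anyone else wanting to test that branch, a standard git checkout sequence would be (remote name is arbitrary):

git remote add rattus128 https://github.com/rattus128/ComfyUI.git
git fetch rattus128
git checkout -b pin-lockdown rattus128/wip/pin-lockdown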

@nestflow

nestflow commented Nov 8, 2025

Can you give this branch of comfy core a go?

https://github.com/rattus128/ComfyUI/tree/wip/pin-lockdown

If it still fails, paste the log; if it works, can you expand the console in the ComfyUI GUI and paste me the success log too? (I added warning prints for the conditions we don't want to see but have compensated for.)

Hi @rattus128, thanks for your continued follow-up on this issue. Yes, this fix seemed to work, and here is the success log:

got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load WanTEModel
loaded completely; 13401.80 MB usable, 10835.48 MB loaded, full load: True
Requested to load WanVAE
Unloading WanTEModel
loaded completely; 327.00 MB usable, 242.03 MB loaded, full load: True
gguf qtypes: F16 (694), Q8_0 (400), F32 (1)
model weight dtype torch.float16, manual cast: None
model_type FLOW
Requested to load WAN21
Unloading WanTEModel
Unloading WanVAE
2 idle models unloaded.
loaded partially; 218.89 MB usable, 215.06 MB loaded, 14610.41 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [02:32<00:00, 152.74s/it]
Requested to load WAN21
Unloading WAN21
1 active models unloaded for increased offloading.
loaded partially; 212.89 MB usable, 212.89 MB loaded, 14612.58 MB offloaded, lowvram patches: 0
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [02:37<00:00, 157.89s/it]
Requested to load WanVAE
loaded completely; 1766.30 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 362.12 seconds

@RYG81

RYG81 commented Nov 9, 2025

I am having an issue with crashing at VAE decode. Can someone guide me on how to fix it? I am not able to work on my workflow.

@rattus128
Contributor Author

I am having an issue with crashing at VAE decode. Can someone guide me on how to fix it? I am not able to work on my workflow.

What is your PyTorch version? There is a theory that PyTorch 2.7 has an issue. We have yet to confirm this one, but we could use the data, and it's something you could try.
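A quick way to check, using the same Python environment that runs ComfyUI:

python -c "import torch; print(torch.__version__)"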

@RYG81

RYG81 commented Nov 11, 2025

I had PyTorch 2.7 only.
I just installed 2.8 and it still crashes the same way.
